Sort only Certain Column Names based on Month and Year - pandas

I have more than 100 columns, organized as below:
import pandas as pd
data = [[11, 1, 6, 8, 45, 67, '30-06-2021', 43578, 3.4, '30-04-2022', 6.7, 5000, 6744, 8.9, 8978,
         '31-03-2022', '31-01-2022', '28-02-2022', 5.6]]
dat = pd.DataFrame(data, columns=['a', 'b', 't', 'u', 'g', 'd', 'Start', 'Apr-22Total', 'Mar-22Rate',
                                  'Apr-22', 'Feb-22Rate', 'Feb-22Total', 'Jan-22Total', 'Apr-22Rate',
                                  'Mar-22Total', 'Mar-22', 'Jan-22', 'Feb-22', 'Jan-22Rate'])
a b t u g d Start Apr-22Total Mar-22Rate Apr-22 Feb-22Rate Feb-22Total Jan-22Total Apr-22Rate Mar-22Total Mar-22 Jan-22 Feb-22 Jan-22Rate
0 11 1 6 8 45 67 30-06-2021 43578 3.4 30-04-2022 6.7 5000 6744 8.9 8978 31-03-2022 31-01-2022 28-02-2022 5.6
How can I reorder only the column names that contain a month and year, so that they follow chronological month-and-year order?
My expectation is as follows:

Here's a solution:
# Grab the columns whose names contain a Mon-YY token (raw string avoids the \d escape warning)
cols = dat.filter(regex=r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d').columns
# Split each name into month / remainder, turn the month into its number, and sort by both
parts = pd.Series(cols).str.split('-', expand=True).set_axis(['month', 'rest'], axis=1)
parts['month'] = pd.to_datetime(parts['month'], format='%b').dt.month
sorted_cols = cols[parts.sort_values(['month', 'rest']).index]
dat = dat[dat.drop(cols, axis=1).columns.tolist() + sorted_cols.tolist()]
Output:
>>> dat
a b t u g d Start Jan-22 Jan-22Rate Jan-22Total Feb-22 Feb-22Rate Feb-22Total Mar-22 Mar-22Rate Mar-22Total Apr-22 Apr-22Rate Apr-22Total
0 11 1 6 8 45 67 30-06-2021 31-01-2022 5.6 6744 28-02-2022 6.7 5000 31-03-2022 3.4 8978 30-04-2022 8.9 43578
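For reference, a simpler equivalent sketch (assuming dat and the pandas import from above, and that every matching column starts with the same Mon-YY prefix): parse the prefix with to_datetime and use it as a sort key, breaking ties by the full column name so Jan-22 < Jan-22Rate < Jan-22Total. Unlike the regex one-liner above, this also orders correctly across different years.
import re

pat = re.compile(r'^([A-Z][a-z]{2}-\d{2})')
month_cols = [c for c in dat.columns if pat.match(c)]
other_cols = [c for c in dat.columns if c not in month_cols]
# sort key: (parsed Mon-YY as a timestamp, full column name)
month_cols.sort(key=lambda c: (pd.to_datetime(pat.match(c).group(1), format='%b-%y'), c))
dat = dat[other_cols + month_cols]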


How to combine rows in pandas to be unique, aggregate the values in those rows, and keep the wanted data?

I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they've been on different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can groupby and use agg to get the mean. For the non-numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k, v in df.dtypes[1:].items()})
Output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. The content of the dictionary comprehension:
{'Pos': 'first',
 'Age': 'mean',
 'Tm': 'first',
 'G': 'mean',
 'GS': 'mean',
 'MP': 'mean',
 'FG': 'mean'}
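A slightly more robust way to build the same mapping is pandas' is_numeric_dtype helper, which also covers nullable integer and other numeric extension dtypes (a minimal sketch, assuming the same df):
from pandas.api.types import is_numeric_dtype

# mean for numeric columns, first value otherwise; skip the grouping key itself
agg_map = {col: 'mean' if is_numeric_dtype(dtype) else 'first'
           for col, dtype in df.dtypes.items() if col != 'Player'}
df.groupby('Player').agg(agg_map)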
A minimal example of the same groupby/mean idea on toy data:
import pandas as pd

x = [['a', 12, 5], ['a', 12, 7], ['b', 15, 10], ['b', 15, 12], ['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
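If you prefer name and age back as regular columns instead of an index, groupby's as_index parameter (or a trailing reset_index()) gives that; a one-line sketch on the same df:
df2 = df.groupby(['name', 'age'], as_index=False).mean()  # keys stay as ordinary columns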
Option 1
If one considers the dataframe df that the OP shares in the question, the following will do the work:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom lambda function
pandas.api.types.is_string_dtype to check whether a column is of string type
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np

# Note: Age/G/GS/MP/FG are drawn randomly, so the values below will differ on every run
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young',
                               'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
                    'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
                    'Age': np.random.randint(0, 100, 9),
                    'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
                    'G': np.random.randint(0, 100, 9),
                    'GS': np.random.randint(0, 100, 9),
                    'MP': np.random.uniform(0, 100, 9),
                    'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
If one tests the operation on df2, one will get the following
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, and assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and take the mean, as follows:
df_new3 = df.groupby('Player').mean(numeric_only=True)  # numeric_only=True is required from pandas 2.0 onwards
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking? Basically, how do I know what numbers fall in the range of rank 1 or 2 or 3, etc.?
I hope the following Python code, with two short examples, can help you. For the second example I used the isin method.
import numpy as np
import pandas as pd

df = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
               'Claudia', 'Peter', 'David', 'Tom'],
      'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(df, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10,
                            labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
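As for the bounds the question asks about: qcut can also return the computed bin edges via its retbins argument; a minimal sketch on the same df:
# retbins=True additionally returns the quantile edges used for the bins
df['decile_rank'], bins = pd.qcut(df['Score'], 10, labels=False, retbins=True)
print(bins)  # 11 edges; decile i spans bins[i]..bins[i+1]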

Sample and group by, and pivot time series data

I'm struggling to handle a complex (imho) operation on time series data.
I have a time series data set and would like to break it into non-overlapping, pivoted, grouped chunks. It is organized by customer, year, and value. For the purposes of this toy example, I am trying to break it out into a simple forecast of the next 3 months.
df = pd.DataFrame({'Customer': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'a', 6: 'a', 7: 'b', 8: 'b', 9: 'b'},
                   'Year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020, 5: 2021, 6: 2021, 7: 2020, 8: 2020, 9: 2020},
                   'Month': {0: 8, 1: 9, 2: 10, 3: 11, 4: 12, 5: 1, 6: 2, 7: 1, 8: 2, 9: 3},
                   'Value': {0: 5.2, 1: 2.2, 2: 1.7, 3: 9.0, 4: 5.5, 5: 2.5, 6: 1.9, 7: 4.5, 8: 2.9, 9: 3.1}})
My goal is to create a dataframe where each row contains non-overlapping data in pivoted 3-month increments, so each row has the 3 upcoming "Value" data points from that point in time. I'd also like the result to keep the most recent data, so if there is an odd amount of data, the oldest leftover rows are dropped (see customer a in the example).
| Customer | Year | Month | Month1 | Month2 | Month3 |
| b | 2020 | 1 | 4.5 | 2.9 | 3.1 |
| a | 2020 | 9 | 2.2 | 1.7 | 9.0 |
| a | 2020 | 12 | 5.5 | 2.5 | 1.9 |
Much appreciated.
One way is to first sort_values on your df so the latest month goes first, then assign group numbers and drop rows not in complete groups of 3:
df = df.sort_values(["Year", "Month", "Customer"], ascending=False)
df["group"] = (df.groupby("Customer").cumcount()%3).eq(0).cumsum()
df = df[(df.groupby(["Customer", "group"])["Year"].transform("size").eq(3))]
df["num"] = (df.groupby("Customer").cumcount()%3+1).replace({1:3, 3:1})
print (df)
Customer Year Month Value group num
6 a 2021 2 1.9 2 3
5 a 2021 1 2.5 2 2
4 a 2020 12 5.5 2 1
3 a 2020 11 9.0 3 3
2 a 2020 10 1.7 3 2
1 a 2020 9 2.2 3 1
9 b 2020 3 3.1 5 3
8 b 2020 2 2.9 5 2
7 b 2020 1 4.5 5 1
Now you can pivot your data:
print(df.assign(Month=df["Month"].where(df["num"].eq(1)).bfill(),
                Year=df["Year"].where(df["num"].eq(1)).bfill(),
                num="Month" + df["num"].astype(str))
        .pivot(index=["Customer", "Month", "Year"], columns="num", values="Value")  # keyword args required in pandas >= 2.0
        .reset_index())
num Customer Month Year Month1 Month2 Month3
0 a 9.0 2020.0 2.2 1.7 9.0
1 a 12.0 2020.0 5.5 2.5 1.9
2 b 1.0 2020.0 4.5 2.9 3.1
There might be a better way to do this, but this will give you the output you want:
First we add a Customer_chunk column that assigns an ID to rows belonging to the same chunk, and we remove the extra rows.
df["Customer_chunk"] = (df[::-1].groupby("Customer").cumcount()) // 3
df = df.groupby(["Customer", "Customer_chunk"]).filter(lambda group: len(group) == 3)
Then we group by Customer and Customer_chunk to generate each column of the desired output.
df_grouped = df.groupby(["Customer", "Customer_chunk"])
colYear = df_grouped["Year"].first()
colMonth = df_grouped["Month"].first()
colMonth1 = df_grouped["Value"].first().rename("Month1")
colMonth2 = df_grouped["Value"].agg(lambda s: s.iloc[1]).rename("Month2")  # middle row of each 3-row chunk (nth(2) would grab the last row)
colMonth3 = df_grouped["Value"].last().rename("Month3")
And finally we create the output by merging all the columns.
df_output = pd.concat([colYear, colMonth, colMonth1, colMonth2, colMonth3], axis=1).reset_index().drop("Customer_chunk", axis=1)
Some steps feel a bit clunky; there's probably room for improvement in this code, but it shouldn't impact performance too much.
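For completeness, a more compact sketch of the same idea (assuming the df from the question; rows are counted backwards from each customer's most recent month, so the oldest leftover rows fall into an incomplete chunk and drop out):
tmp = df.sort_values(["Customer", "Year", "Month"]).copy()
tmp["rev"] = tmp[::-1].groupby("Customer").cumcount()  # 0 = latest row per customer (aligned back by index)
tmp["chunk"] = tmp["rev"] // 3
tmp["col"] = "Month" + (3 - tmp["rev"] % 3).astype(str)  # Month1/2/3 in chronological order within each chunk
tmp = tmp[tmp.groupby(["Customer", "chunk"])["col"].transform("size").eq(3)]  # keep complete chunks only
# Year/Month of each chunk's first month, joined with the three pivoted values
out = (tmp[tmp["col"].eq("Month1")].set_index(["Customer", "chunk"])[["Year", "Month"]]
          .join(tmp.pivot(index=["Customer", "chunk"], columns="col", values="Value"))
          .reset_index()
          .drop(columns="chunk"))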

Pandas transform rows with specific character

I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table with a Number column (sample data below)
And I want to create an output column based on it
Some info:
All the outputs will be based on numbers that end with a ':'
I have 100M+ rows in this table, so performance needs to be considered.
Let me know if you have some good ideas. Thanks!
Here is some copy-and-pasteable sample data:
df = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                              4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to identify the values, returning np.nan otherwise. Then use ffill() to fill down over the NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                              4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'), df['Number'].str.split(':').str[0], np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - Even easier, and potentially faster: you can do some regex with str.extract() and then again ffill():
df['Output'] = df['Number'].str.extract(r'^(\d+):').ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
I think this is what you are looking for:
import pandas as pd

c = ['Number']
d = ['1:00', 100, 1001, 1321, 3254, '15:00', 20, 60, 80, 90, '4:00', 26, 45, 90, 89]
df = pd.DataFrame(data=d, columns=c)
temp = df['Number'].str.split(":", n=1, expand=True)  # non-string entries come out as NaN
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
It looks like your DataFrame has string values; above, I treated them as a mix of numbers and strings. Here's the solution if df['Number'] is all strings.
df1 = pd.DataFrame({'Number': {0: '1000', 1: '1000021', 2: '15:00', 3: '23424234',
                               4: '23423', 5: '3', 6: '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp = df1['Number'].str.split(":", n=1, expand=True)
temp.loc[temp[1].notna(), 'New_val'] = temp[0]  # rows containing ':' have a non-null remainder in temp[1]
df1['New_val'] = temp['New_val'].ffill()
print(df1)
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

How to merge two fields of one CSV file with one field of another CSV file?

I would like to merge two CSV files as follows:
First CSV File :
df = pd.DataFrame()
df["ticket_number"] = ['AAA', 'AAA', 'AAA', 'ABC', 'ABA','ADC','ABA','BBB']
df["train_board_station"] = ['Tokyo', 'LA', 'Paris', 'New_York', 'Delhi','Phoenix', 'London','LA']
df["train_off_station"] = ['Phoenix', 'London', 'Sydney', 'Berlin', 'Shanghai','LA', 'Paris', 'New_York']
Second CSV file:
rec = pd.DataFrame()
rec["code"] = ['Tokyo','London','Paris','New_York','Shanghai','LA','Sydney','Berlin','Phoenix','Delhi']
rec["count_A"] = ['1.2','7.8','4','8','7.8','3','8','5','2','10']
rec["count_B"] = ['12','78','4','8','78','36','88','51','25','10']
I use the following code:
for x in ["board", "off"]:
    df["station"] = df["train_" + x + "_station"]
    df["code"] = df["train_" + x + "_station"]
    # concat with join_axes glues rec onto df by row position, not by matching 'code'
    # (note: join_axes was removed in pandas 1.0)
    df = pd.concat([df, rec], axis=1, join_axes=[df.index])
    df[x + "_count_A"] = df["count_A"]
    df[x + "_count_B"] = df["count_B"]
    df = df.drop(["station", "code", "count_A", "count_B"], axis=1)
I get the following incorrect output:
ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,1.2,12
AAA,LA,London,7.8,78,7.8,78
AAA,Paris,Sydney,4,4,4,4
ABC,New_York,Berlin,8,8,8,8
ABA,Delhi,Shanghai,7.8,78,7.8,78
ADC,Phoenix,LA,3,36,3,36
ABA,London,Paris,8,88,8,88
BBB,LA,New_York,5,51,5,51
I notice that instead of count_A and count_B being matched to the train_board_station and train_off_station of the same row, the counts are simply taken from rec in row order, so each row repeats the same pair of values for both the board and off columns.
The expected output is:
ticket_number,train_board_station,train_off_station,board_count_A,board_count_B,off_count_A,off_count_B
AAA,Tokyo,Phoenix,1.2,12,2,25
AAA,LA,London,3,36,7.8,78
AAA,Paris,Sydney,4,4,8,88
ABC,New_York,Berlin,8,8,5,51
ABA,Delhi,Shanghai,10,10,7.8,78
ADC,Phoenix,LA,2,26,3,36
ABA,London,Paris,7.7,78,4,4
BBB,LA,New_York,36,36,8,8
There is a problem with duplicates; use join, which performs a left join by default:
for x in ["board", "off"]:
df["code"] = df["station"] = df["train_" + x + "_station"]
df = df.join(rec.set_index('code'), on='code')
df[x + "_count_A"] = df["count_A"]
df[x + "_count_B"] = df["count_B"]
df = df.drop(["station", "code","count_A","count_B"], axis=1)
print (df)
ticket_number train_board_station train_off_station board_count_A \
0 AAA Tokyo Phoenix 1.2
1 AAA LA London 3
2 AAA Paris Sydney 4
3 ABC New_York Berlin 8
4 ABA Delhi Shanghai 10
5 ADC Phoenix LA 2
6 ABA London Paris 7.8
7 BBB LA New_York 3
board_count_B off_count_A off_count_B
0 12 2 25
1 36 7.8 78
2 4 8 88
3 8 5 51
4 10 7.8 78
5 25 3 36
6 78 4 4
7 36 8 8
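Equivalently, a sketch using Series.map avoids the temporary station/code columns entirely (assuming df and rec as defined in the question):
counts = rec.set_index("code")  # station code -> count_A / count_B lookup table
for x in ["board", "off"]:
    stations = df["train_" + x + "_station"]
    df[x + "_count_A"] = stations.map(counts["count_A"])
    df[x + "_count_B"] = stations.map(counts["count_B"])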