How to put groupby result into the same row - pandas

I have the following dataframe:
import pandas as pd
df = pd.DataFrame({
    'id': ["c1","c1","c1","c2","c2","c3","c3","c3","c3","c4","c4","c5","c6","c6","c6","c7","c7"],
    'store': ["first","second","second","first","second","first","third","fourth",
              "fifth","second","fifth","first","first","second","third","fourth","fifth"],
    'purchase': [10,10,10,20,20,30,30,30,30,40,40,50,60,60,60,70,70],
})
After you do the groupby:
df_group= df.groupby(['id','store']).agg({'purchase': ["sum"]})
[Result of df_group: the purchase sum per (id, store) pair]
I want each card (id) to have all its purchases at the different stores appear in the same row, for example:
  id 1_store 1_sum 2_store 2_sum 3_store 3_sum 4_store 4_sum ...
0 c1   first    10  second    20
1 c2   first    20  second    20
2 c3   fifth    30   first    30  fourth    30   third    30
I don't want to use unstack on store; the reason is that there are so many stores it would create too many columns, most of them empty.
How can I achieve the above result?
Thanks

You need to create a cumcount variable to get the column labels; then this becomes a .pivot_table problem. You get quite a MultiIndex on the columns, which we can collapse afterwards:
df_group['idx'] = df_group.groupby(level=0).cumcount() + 1

df_res = (df_group.reset_index()
                  .pivot_table(index='id',
                               columns='idx',
                               values=['store', 'purchase'],
                               aggfunc='first')
                  .sort_index(level=2, axis=1))
Output:
    purchase  store purchase   store purchase   store purchase  store
         sum             sum              sum             sum
idx        1      1        2       2        3       3        4      4
id
c1      10.0  first     20.0  second      NaN     NaN      NaN    NaN
c2      20.0  first     20.0  second      NaN     NaN      NaN    NaN
c3      30.0  fifth     30.0   first     30.0  fourth     30.0  third
c4      40.0  fifth     40.0  second      NaN     NaN      NaN    NaN
c5      50.0  first      NaN     NaN      NaN     NaN      NaN    NaN
c6      60.0  first     60.0  second     60.0   third      NaN    NaN
c7      70.0  fifth     70.0  fourth      NaN     NaN      NaN    NaN
If you need to collapse the columns (probably a good idea, since the MultiIndex is no longer lexsorted):
df_res.columns = ['_'.join(map(str, [y for y in x[::-1] if y != ''])) for x in df_res.columns]
This reverses each column tuple, drops the empty level, and joins the rest with underscores, e.g. ('purchase', 'sum', 1) becomes '1_sum_purchase'.
    1_sum_purchase 1_store  2_sum_purchase 2_store  3_sum_purchase 3_store  4_sum_purchase 4_store
id
c1            10.0   first            20.0  second             NaN     NaN             NaN     NaN
c2            20.0   first            20.0  second             NaN     NaN             NaN     NaN
c3            30.0   fifth            30.0   first            30.0  fourth            30.0   third
c4            40.0   fifth            40.0  second             NaN     NaN             NaN     NaN
c5            50.0   first             NaN     NaN             NaN     NaN             NaN     NaN
c6            60.0   first            60.0  second            60.0   third             NaN     NaN
c7            70.0   fifth            70.0  fourth             NaN     NaN             NaN     NaN
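For reference, a more compact end-to-end variant that skips the intermediate nested-agg MultiIndex entirely (a sketch, assuming the df from the question and a reasonably recent pandas that accepts a list for the values argument of DataFrame.pivot):

out = df.groupby(['id', 'store'], as_index=False)['purchase'].sum()
out['idx'] = out.groupby('id').cumcount() + 1
out = out.pivot(index='id', columns='idx', values=['store', 'purchase'])
# flatten ('store', 1) -> '1_store', ('purchase', 1) -> '1_purchase'
out.columns = [f'{i}_{c}' for c, i in out.columns]
out = out.sort_index(axis=1)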


Adding columns with null values in pandas dataframe [duplicate]

When summing two pandas columns, I want to ignore NaN values when only one of the two values is missing. However, when NaN appears in both columns, I want to keep NaN in the output (instead of 0.0).
Initial dataframe:
Surf1 Surf2
0 0
NaN 8
8 15
NaN NaN
16 14
15 7
Desired output:
Surf1 Surf2 Sum
0 0 0
NaN 8 8
8 15 23
NaN NaN NaN
16 14 30
15 7 22
Tried code:
-> the code below ignores NaN values, but when taking the sum of two NaN values it gives 0.0 in the output; I want to keep NaN in that particular case, so that genuinely empty values stay separate from values that are actually 0 after summing.
import pandas as pd
import numpy as np
data = pd.DataFrame({"Surf1": [10,np.nan,8,np.nan,16,15], "Surf2": [22,8,15,np.nan,14,7]})
print(data)
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1)
print(data)
From the documentation pandas.DataFrame.sum
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum()  # min_count=0 is the default
0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
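A quick illustration of that default (a minimal sketch; the explicit dtype just avoids the empty-Series dtype warning in recent pandas):

import pandas as pd

pd.Series([], dtype=float).sum()             # -> 0.0 (min_count=0 is the default)
pd.Series([], dtype=float).sum(min_count=1)  # -> nan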
Change your code to
data.loc[:,'Sum'] = data.loc[:,['Surf1','Surf2']].sum(axis=1, min_count=1)
output
Surf1 Surf2
0 10.0 22.0
1 NaN 8.0
2 8.0 15.0
3 NaN NaN
4 16.0 14.0
5 15.0 7.0
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You could mask the result, blanking out the rows where every input is NaN:
df.sum(1).mask(df.isna().all(1))
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
You can do:
df['Sum'] = df.dropna(how='all').sum(1)
Output:
Surf1 Surf2 Sum
0 10.0 22.0 32.0
1 NaN 8.0 8.0
2 8.0 15.0 23.0
3 NaN NaN NaN
4 16.0 14.0 30.0
5 15.0 7.0 22.0
You can use min_count: this sums each row when there is at least one non-null value; if all values are null, it returns null.
df['SUM'] = df.sum(min_count=1, axis=1)
# df.sum(min_count=1, axis=1) displays:
Out[199]:
0 0.0
1 8.0
2 23.0
3 NaN
4 30.0
5 22.0
dtype: float64
I think all the solutions listed above only work for the cases when it is the FIRST column's value that is missing. If the first column's value is present but the second column's value is missing, try using:
df['sum'] = df['Surf1']
df.loc[(df['Surf2'].notnull()), 'sum'] = df['Surf1'].fillna(0) + df['Surf2']

Creating new columns in Pandas dataframe reading csv file

I'm reading a simple csv file and creating a pandas dataframe. The csv file can have 1 row, 2 rows, or 10 rows.
If the csv file has 1 row, then I want to create a few columns; if it has <=2 rows, then create a couple of new columns; and if it has 10 rows, then I want to create 10 new columns.
After reading the csv, my sample dataframe looks like below.
df=pd.read_csv('/home/abc/myfile.csv',sep=',')
print(df)
id rate amount address lb ub msa
1 2.50 100 abcde 30 90 101
10 20 102
103
104
105
106
107
108
109
110
Case 1) If the dataframe has only 1 record, then I want to create new columns 'new_id', 'new_rate' and 'new_address' and assign them the values from the 'id', 'rate' and 'address' columns of the dataframe.
Expected Output:
id rate amount address lb ub msa new_id new_rate new_address
1 2.50 100 abcde 30 90 101 1 2.50 abcde
Case 2) If the dataframe has <=2 records, then for the 1st record I want to create 'lb_1' and 'ub_1' with values 30 and 90, and for the 2nd record 'lb_2' and 'ub_2' with values 10 and 20 from the dataframe.
Expected Output:
if there is only 1 row:
id rate amount address lb ub msa lb_1 ub_1
1 2.50 100 abcde 30 90 101 30 90
if there are 2 rows:
id rate amount address lb ub msa lb_1 ub_1 lb_2 ub_2
1 2.50 100 abcde 30 90 101 30 90 10 20
10 20 102
Case 3) If the dataframe has 10 records, then I want to create 10 new columns, i.e. msa_1, msa_2, ..., msa_10, and assign the respective values msa_1=101, msa_2=102, ..., msa_10=110 for each row of the dataframe.
Expected Output:
id rate amount address lb ub msa msa_1 msa_2 msa_3 msa_4 msa_5 msa_6 msa_7 msa_8 msa_9 msa_10
1 2.50 100 abcde 30 90 101 101 102 103 104 105 106 107 108 109 110
10 20 102
103
104
105
106
107
108
109
110
I'm trying to write the code as below, but for the 2nd and 3rd cases I'm not sure how to do it; also, if there is a better way to handle all 3 cases, that would be great.
I'd appreciate it if anyone can show me the best way to get it done. Thanks in advance.
Case 1:
if df.shape[0] == 1:
    df.loc[(df.shape[0] == 1), "new_id"] = df["id"]
    df.loc[(df.shape[0] == 1), "new_rate"] = df["rate"]
    df.loc[(df.shape[0] == 1), "new_address"] = df["address"]
Case 2:
if df.shape[0] <= 2:
    for i in 1 to len(df.index)        # pseudocode
        df.loc[df['lb_i']] = df['lb']
        df.loc[df['ub_i']] = df['ub']
Case 3:
if df.shape[0] <= 10:
    for i in 1 to len(df.index)        # pseudocode
        df.loc[df['msa_i']] = df['msa']
For case 2 and case 3, you can do something like this:
Case 2 -
# case 2
df = pd.read_csv('test.txt')
lb_dict = {f'lb_{i}': value for i, value in enumerate(df['lb'].to_list(), start=1)}
lb_df = pd.DataFrame.from_dict(lb_dict, orient='index').transpose()
ub_dict = {f'ub_{i}': value for i, value in enumerate(df['ub'].to_list(), start=1)}
ub_df = pd.DataFrame.from_dict(ub_dict, orient='index').transpose()
final_df = pd.concat([df, lb_df, ub_df], axis=1)
print(final_df)
output -
   id  rate  amount address  lb  ub  msa  lb_1  lb_2  ub_1  ub_2
0  1.0   2.5   100.0   abcde  30  90  101  30.0  10.0  90.0  20.0
1  NaN   NaN     NaN     NaN  10  20  102   NaN   NaN   NaN   NaN
For case 3 -
# case 3
df = pd.read_csv('test.txt')
msa_dict = {f'msa_{i}': value for i, value in enumerate(df['msa'].to_list(), start=1)}
msa_df = pd.DataFrame.from_dict(msa_dict, orient='index').transpose()
pd.concat([df, msa_df], axis=1)
Output -
   id  rate  amount address    lb    ub  msa  msa_1  msa_2  msa_3  msa_4  msa_5  msa_6  msa_7  msa_8  msa_9  msa_10
0  1.0   2.5   100.0   abcde  30.0  90.0  101  101.0  102.0  103.0  104.0  105.0  106.0  107.0  108.0  109.0   110.0
1  NaN   NaN     NaN     NaN  10.0  20.0  102    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
2  NaN   NaN     NaN     NaN   NaN   NaN  103    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
3  NaN   NaN     NaN     NaN   NaN   NaN  104    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
4  NaN   NaN     NaN     NaN   NaN   NaN  105    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
5  NaN   NaN     NaN     NaN   NaN   NaN  106    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
6  NaN   NaN     NaN     NaN   NaN   NaN  107    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
7  NaN   NaN     NaN     NaN   NaN   NaN  108    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
8  NaN   NaN     NaN     NaN   NaN   NaN  109    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
9  NaN   NaN     NaN     NaN   NaN   NaN  110    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN     NaN
Solution -
I just created a dictionary from the required column and then concatenated it with the original dataframe column-wise.
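For what it's worth, the same idea can be wrapped in a small helper so that all three numbered-column cases share one code path (a sketch; spread_column is a hypothetical name, not from the answer above):

import pandas as pd

def spread_column(df, col):
    # Spread df[col] into numbered columns col_1, col_2, ... placed on row 0;
    # concat aligns on the index, so the remaining rows get NaN there.
    wide = {f'{col}_{i}': v for i, v in enumerate(df[col].tolist(), start=1)}
    return pd.concat([df, pd.DataFrame(wide, index=[0])], axis=1)

# case 2: spread lb and ub; case 3: spread msa
final_df_case2 = spread_column(spread_column(df, 'lb'), 'ub')
final_df_case3 = spread_column(df, 'msa')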

Forward-fill dataframe based on mask. Fill with last valid value

I have a dataframe like the following:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,21
5,10,22
6,11,23
7,12,24
8,13,NaN
9,NaN,NaN
And a boolean mask dataframe like the following:
index,col1,col2
1,False,False
2,False,False
3,False,False
4,False,True
5,False,False
6,False,False
7,True,True
8,True,False
9,False,False
I would like to convert them to this final dataframe:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,20
5,10,22
6,11,23
7,11,23
8,11,NaN
9,NaN,NaN
That is: forward-filling the values matching True in the mask with the last value in the column that has False in the mask.
How can I get this?
Let's try:
df.mask(mask).ffill().where(df.notna())
Output:
col1 col2
index
1 NaN NaN
2 NaN NaN
3 NaN 20.0
4 NaN 20.0
5 10.0 22.0
6 11.0 23.0
7 11.0 23.0
8 11.0 NaN
9 NaN NaN
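For completeness, a self-contained reproduction of the question's frames (a sketch; the data is transcribed from the question above):

import numpy as np
import pandas as pd

idx = range(1, 10)
df = pd.DataFrame({
    'col1': [np.nan]*4 + [10, 11, 12, 13, np.nan],
    'col2': [np.nan, np.nan, 20, 21, 22, 23, 24, np.nan, np.nan],
}, index=idx)
mask = pd.DataFrame({
    'col1': [False]*6 + [True, True, False],
    'col2': [False, False, False, True, False, False, True, False, False],
}, index=idx)

# mask(mask)       -> blank out the cells flagged True
# ffill()          -> carry the last surviving value down each column
# where(df.notna()) -> restore NaN wherever the original was NaN
out = df.mask(mask).ffill().where(df.notna())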

KeyError is generated by pivot_table for a column which is not used in the pivot table

I am creating a pivot table from my dataframe, which contains a mix of columns such as text, numbers, date, and time.
I am able to read the file into a dataframe and run a few groupby operations. Based on this, I am trying to create a pivot table which groups the data by week and counts certain occurrences of data based on some criteria. However, pivot_table keeps raising a KeyError for a column which is not used in the pivot table.
Here is my Dataframe:
H1 H2 H3 H4 H5 H6 H7 H8 H10
RA2 RB2, H2 2020-07-25 11:30 60 1774 RG2 RH2 RJ2
RA3 RB3, H2 2020-07-25 11:30 60 1791 RG3 RH3 RJ3
RA4 RB4, H2 2020-07-25 11:30 35 1806 RG4 RH4 RJ4
RA5 RB1, H3 2020-07-25 12:30 35 1771 RG5 RH5 RJ5
RA6 RB2, H3 2020-07-25 12:45 60 1813 RG6 RH6 RJ6
RA7 RB3, H3 2020-07-25 13:00 60 1789 RG7 RH7 RJ7
RA8 RB4, H3 2020-07-25 13:00 60 1790 RG8 RH8 RJ8
RA9 RB1, H4 2020-07-25 13:00 60 1808 RG9 RH9 RJ9
RA10 RB2, H4 2020-07-25 14:00 60 1822 RG10 RH10 RJ10
Here is my code where it's failing:
pivot = pd.pivot_table(df, index=['H1', pd.Grouper(key='H3', freq='W-MON')],
                       columns='H10', margins=True,
                       aggfunc={'H10': np.count_nonzero}).reset_index()
The error I am getting is as follows:
Function: createPivot Raised: 'H2'
I have been stuck on this issue for a week now and unable to get around it. I have also posted another question related to this issue on SO but did not get any answer.
So I would really appreciate some expert opinion. Thanks for your help and consideration.
pivot_table will try to use all remaining columns of your dataframe as values unless you set them explicitly. So, seeing a dictionary in the aggfunc argument, it tries to look up an aggregating function for each remaining column, not just for H10; hence the KeyError for 'H2'.
However, in your example, even if you specify H10 as values explicitly, you'll run into the problem of using the same column for both the columns and values arguments, which gives the error: Grouper for 'H10' not 1-dimensional.
You might be better off with pd.crosstab:
import numpy as np

pd.crosstab(
    index=[df['H1'], df['H3']],
    values=df['H10'], columns=df['H10'],
    margins=True, aggfunc=np.count_nonzero)
H10 RJ10 RJ2 RJ3 RJ4 RJ5 RJ6 RJ7 RJ8 RJ9 All
H1 H3
RA10 2020-07-25 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 1
RA2 2020-07-25 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN 1
RA3 2020-07-25 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1
RA4 2020-07-25 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 1
RA5 2020-07-25 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1
RA6 2020-07-25 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1
RA7 2020-07-25 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 1
RA8 2020-07-25 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN 1
RA9 2020-07-25 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1
All NaT 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 9
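If the weekly bucketing from the original pd.Grouper attempt is still needed, one option (an assumption on my part, not part of the answer above) is to precompute the week key before cross-tabulating:

import numpy as np
import pandas as pd

# Derive the Monday-anchored week that each H3 date falls in,
# then cross-tabulate by it instead of the raw date.
week = pd.to_datetime(df['H3']).dt.to_period('W-MON').dt.start_time
pd.crosstab(index=[df['H1'], week],
            columns=df['H10'], values=df['H10'],
            margins=True, aggfunc=np.count_nonzero)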

From 15 object variables to final target variable (0 or 1)

Can I go from 15 object variables to one final binary target variable?
Those 15 variables have ~10,000 different codes; my dataset is about 21,000,000 records. What I'm trying to do is first replace the codes I want with 1 and the others with 0; then, if any of the fifteen variables is 1, the target variable will be 1, and if all fifteen variables are 0, the target variable will be 0.
I have tried to work with to_replace, astype, to_numeric, and infer_objects with no good results. For example, my dataset looks like this, head(5):
D P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15
41234 1234 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
42345 7657 4367 874 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 7654 4347 474 NAN NAN NAN 789 NAN NAN NAN NAN NAN NAN NAN NAN
34212 8902 4317 374 NAN 452 NAN 719 NAN NAN NAN NAN NAN NAN NAN NAN
19374 2564 4387 274 NAN 452 NAN 799 NAN NAN NAN NAN NAN NAN NAN NAN
I want to transform all NaN to 0 and the selected codes to 1, so P1-P15 will all become binary; then I will create a final P variable from them.
For example, if P1-P15 contain '3578', '9732', '4734', ... (I'm using about 200 codes), I want that to become 1.
All other values I want to become 0.
The D variable should stay as it is.
The final dataset will be (D, P); then I will add the training variables.
Any ideas? The following code gives me wrong results:
selCodes=['3722','66']
dfnew['P']=(dfnew.loc[:,'PR1':].astype(str).isin(selCodes).any(axis=1).astype(int))
Take a look at a test dataset (left) and the new P (right). With the example code 3722, P should be 1.
IIUC, use DataFrame.isin:
# example select codes
selCodes = ['1234', '9732', '719']
df['P'] = (
    df.loc[:, 'P1':].astype(str)
      .isin(selCodes)
      .any(axis=1)
      .astype(int)
)
df = df[['D', 'P']]
Result:
D P
0 41234 1
1 42345 0
2 34212 0
3 34212 1
4 19374 0
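A self-contained check against the head(5) sample (a sketch that uses only a few of the P columns for brevity; note that astype(str) turns NaN into the string 'nan', which never matches a code, so missing values effectively count as 0):

import pandas as pd

df = pd.DataFrame({
    'D':  [41234, 42345, 34212, 34212, 19374],
    'P1': [1234, 7657, 7654, 8902, 2564],
    'P2': [4367, 4367, 4347, 4317, 4387],
    'P7': [789, 789, 789, 719, 799],
})
selCodes = ['1234', '9732', '719']

df['P'] = df.loc[:, 'P1':].astype(str).isin(selCodes).any(axis=1).astype(int)
print(df[['D', 'P']])   # rows 0 and 3 get P == 1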