Filling NaN values in pandas after grouping - pandas

This question is slightly different from the usual filling of NaN values.
Suppose I have a dataframe in which I group by some category. I want to fill the NaN values of a column using the mean of that group, but computed from a different column.
Let me take an example:
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure
Occupation salary expenditure diff
0 driver 100 20.0 80.0
1 driver 150 40.0 110.0
2 mechanic 70 10.0 60.0
3 teacher 300 100.0 200.0
4 mechanic 90 NaN NaN
5 teacher 250 80.0 170.0
6 unemployed 10 0.0 10.0
7 driver 90 NaN NaN
8 mechanic 110 40.0 70.0
9 teacher 350 120.0 230.0
So, in the above case, I would like to fill the NaN values in expenditure as:
salary - mean(difference) for each group.
How do I do that using pandas?

You can create the new series with the desired values via groupby.transform and use it to update the target column. Assuming you want to group by Occupation:
a['mean_diff'] = a.groupby('Occupation')['diff'].transform('mean')
# assign back rather than calling mask(..., inplace=True) on the
# attribute-accessed column, which is unreliable under copy-on-write
a['expenditure'] = a['expenditure'].mask(
    a['expenditure'].isna(),
    a['salary'] - a['mean_diff'],
)
Output
Occupation salary expenditure diff mean_diff
0 driver 100 20.0 80.0 95.0
1 driver 150 40.0 110.0 95.0
2 mechanic 70 10.0 60.0 65.0
3 teacher 300 100.0 200.0 200.0
4 mechanic 90 25.0 NaN 65.0
5 teacher 250 80.0 170.0 200.0
6 unemployed 10 0.0 10.0 10.0
7 driver 90 -5.0 NaN 95.0
8 mechanic 110 40.0 70.0 65.0
9 teacher 350 120.0 230.0 200.0
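Equivalently, the mask and the helper mean_diff column can be collapsed into a single fillna call. A minimal sketch of the same idea:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure

# fill each NaN with salary minus the group's mean diff, in one expression
a['expenditure'] = a['expenditure'].fillna(
    a['salary'] - a.groupby('Occupation')['diff'].transform('mean'))
```

This produces the same filled values (25.0 for the mechanic row, -5.0 for the driver row) without keeping mean_diff around.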

Related

group/merge/pivot data by varied weight ranges in Pandas

Is there a way in Pandas to fill in the values according to weight ranges when pivoting the dataframe? I see some answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd

df = pd.DataFrame({'tier': [1, 1, 1, 1, 1, 1, 1, 1, 1],
                   'services': ["A", "A", "A", "A", "A", "A", "A", "A", "A"],
                   'weight_start': [1, 61, 161, 201, 1, 1, 61, 161, 201],
                   'weight_end': [60, 160, 200, 500, 500, 60, 160, 200, 500],
                   'location': [1, 1, 1, 1, 2, 3, 3, 3, 3],
                   'discount': [70, 30, 10, 0, 0, 60, 20, 5, 0]})
pivot_df = df.pivot(index=['tier', 'services', 'weight_start', 'weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output: (screenshot in original post)
Desired Output: (screenshot in original post)
Since location 2 is 0 percent covering the ranges 1 to 500, I want it to populate 0 based on the ranges prescribed for tier 1 service A instead of having its own row.
Edit: Mozway's answer works when there is one service. When I added a second service, the dataframe ungrouped.
Here's the new dataset with service B.
import pandas as pd

df = pd.DataFrame({'tier': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   'services': ["A", "A", "A", "A", "A", "A", "A", "A", "A",
                                "B", "B", "B", "B", "B", "B", "B", "B"],
                   'weight_start': [1, 61, 161, 201, 1, 1, 61, 161, 201, 1, 1, 81, 101, 1, 61, 161, 201],
                   'weight_end': [60, 160, 200, 500, 500, 60, 160, 200, 500, 500, 80, 100, 200, 60, 160, 200, 500],
                   'location': [1, 1, 1, 1, 2, 3, 3, 3, 3, 1, 2, 2, 2, 3, 3, 3, 3],
                   'discount': [70, 30, 10, 0, 0, 60, 20, 5, 0, 50, 70, 50, 10, 65, 55, 45, 5]})
pivot_df = df.pivot(index=['tier', 'services', 'weight_start', 'weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work
data = (df.set_index(['tier', 'services', 'weight_start', 'weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1)
        )
IIUC, you can (temporarily) exclude the columns with 0/nan and check if all remaining values are only NaNs per row. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(1)
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(1)
    return d[mask].fillna(0)

out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
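Note that groupby.apply prepends services as an extra index level (visible in the output above). If you want to keep the original four-level index, passing group_keys=False is one option. A sketch of the full pipeline on the second dataset; group_keys behaviour with apply can vary slightly across pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'tier': [1] * 17,
                   'services': ["A"] * 9 + ["B"] * 8,
                   'weight_start': [1, 61, 161, 201, 1, 1, 61, 161, 201, 1, 1, 81, 101, 1, 61, 161, 201],
                   'weight_end': [60, 160, 200, 500, 500, 60, 160, 200, 500, 500, 80, 100, 200, 60, 160, 200, 500],
                   'location': [1, 1, 1, 1, 2, 3, 3, 3, 3, 1, 2, 2, 2, 3, 3, 3, 3],
                   'discount': [70, 30, 10, 0, 0, 60, 20, 5, 0, 50, 70, 50, 10, 65, 55, 45, 5]})
pivot_df = df.pivot(index=['tier', 'services', 'weight_start', 'weight_end'],
                    columns='location', values='discount')

def drop(d):
    # ignore all-zero/all-NaN columns, then drop rows that are NaN everywhere else
    mask = ~d.loc[:, d.any()].isna().all(1)
    return d[mask].fillna(0)

# group_keys=False keeps the original index instead of adding a 'services' level
out = pivot_df.groupby('services', group_keys=False).apply(drop)
```

The result has the same rows as above but indexed only by (tier, services, weight_start, weight_end).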

Filling NaN values in pandas using Train Data Statistics

I will explain my problem statement:
Suppose I have train data and test data, with NaN values in the same columns of both. My strategy for the NaN imputation is this:
Group by some column and fill the NaNs with the mean of that group. Example:
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 NaN
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 NaN
8 mechanic 110 40.0
9 teacher 350 120.0
For the train data I can do it like this:
x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x: x.fillna(x.mean()))
But how do I do something like this for the test data, where the mean should be the training group's mean?
I am trying to do this with a for loop, but it's taking forever.
Create a Series of the per-group means:
mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver 30.0
mechanic 25.0
teacher 100.0
unemployed 0.0
Name: expenditure, dtype: float64
Then replace the missing values with Series.map and Series.fillna:
x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 25.0
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 30.0
8 mechanic 110 40.0
9 teacher 350 120.0
Then apply the same mapping to the test data:
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
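Putting the two steps together, a minimal end-to-end sketch (the x_test frame here is made up for illustration, since the question doesn't show one):

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})

# group means learned on the train data only
mean = x_train.groupby('Occupation')['expenditure'].mean()

# hypothetical test frame with NaNs in the same column
x_test = pd.DataFrame({
    'Occupation': ['mechanic', 'driver', 'teacher'],
    'salary': [95, 120, 280],
    'expenditure': [np.nan, 35, np.nan]})

# test NaNs are filled from the *training* group means
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
```

This is vectorized, so it avoids the slow per-row for loop entirely.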
EDIT:
Solution for multiple columns - instead of map, use DataFrame.join:
x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
    'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
    'col': list('aabbddeehh')})

# numeric_only so the string column 'col' is skipped (required on recent pandas)
mean = x_train.groupby('Occupation').mean(numeric_only=True)
print (mean)
salary expenditure expenditure1
Occupation
driver 113.333333 30.0 30.0
mechanic 90.000000 25.0 25.0
teacher 300.000000 100.0 100.0
unemployed 10.000000 0.0 0.0
x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
Occupation salary expenditure expenditure1 col
0 driver 100 20.0 20.0 a
1 driver 150 40.0 40.0 a
2 mechanic 70 10.0 10.0 b
3 teacher 300 100.0 100.0 b
4 mechanic 90 25.0 25.0 d
5 teacher 250 80.0 80.0 d
6 unemployed 10 0.0 0.0 e
7 driver 90 30.0 30.0 e
8 mechanic 110 40.0 40.0 h
9 teacher 350 120.0 120.0 h
x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))

Joining two pandas dataframes with multi-indexed columns

I want to join two pandas dataframes, one of which has multi-indexed columns.
This is how I make the first dataframe.
data_large = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 60, 50], "buy": [20, 30, 40]})
data_mini = pd.DataFrame({"name": ["b", "c", "d"], "sell": [60, 20, 10], "buy": [30, 50, 40]})
data_topix = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 80, 0], "buy": [70, 30, 40]})

df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
                   keys=['Large', 'Mini', 'Topix'], axis=1)\
          .rename_axis(mapper=['name'], axis=0)\
          .rename_axis(mapper=['product', 'buy_sell'], axis=1)
df_out
And this is the second dataframe.
group = pd.DataFrame({"name":["a", "b", "c", "d"], "group":[1, 1, 2, 2]})
group
How can I join the second to the first, on the column name, keeping the multi-indexed columns?
This did not work and it flattened the multi-index.
df_final = df_out.merge(group, on=['name'], how='left')
Any help would be appreciated!
If you need a MultiIndex after the merge, convert the group column to a MultiIndex DataFrame first. Here name is converted to the index so the merge happens by index; otherwise both frames' columns would have to be MultiIndexes:
group = group.set_index('name')
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = df_out.merge(group, on=['name'], how='left')
Or:
df_final = df_out.merge(group, left_index=True, right_index=True, how='left')
print (df_final)
product Large Mini Topix group
buy_sell sell buy sell buy sell buy new
name
a 10.0 20.0 NaN NaN 10.0 70.0 1
b 60.0 30.0 60.0 30.0 80.0 30.0 1
c 50.0 40.0 20.0 50.0 0.0 40.0 2
d NaN NaN 10.0 40.0 NaN NaN 2
Another possible way, though it raises a warning, is to convert the columns to a MultiIndex after the merge:
df_final = df_out.merge(group, on=['name'], how='left')
UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
warnings.warn(msg, UserWarning)
L = [x if isinstance(x, tuple) else (x, 'new') for x in df_final.columns.tolist()]
df_final.columns = pd.MultiIndex.from_tuples(L)
print (df_final)
name Large Mini Topix group
new sell buy sell buy sell buy new
0 a 10.0 20.0 NaN NaN 10.0 70.0 1
1 b 60.0 30.0 60.0 30.0 80.0 30.0 1
2 c 50.0 40.0 20.0 50.0 0.0 40.0 2
3 d NaN NaN 10.0 40.0 NaN NaN 2
EDIT: If need group in MultiIndex:
group = group.set_index(['name'])
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = (df_out.merge(group, on=['name'], how='left')
                  .set_index([('group', 'new')], append=True)
                  .rename_axis(['name', 'group']))
print (df_final)
product Large Mini Topix
buy_sell sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
Or:
df_final = df_out.merge(group, on=['name'], how='left').set_index(['name','group'])
df_final.columns = pd.MultiIndex.from_tuples(df_final.columns)
print (df_final)
Large Mini Topix
sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
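For this particular shape, plain index-aligned assignment also works: writing a Series into the multi-indexed frame under a tuple key aligns on name and avoids merge (and its level warning) entirely. A sketch of that alternative:

```python
import pandas as pd

data_large = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 60, 50], "buy": [20, 30, 40]})
data_mini = pd.DataFrame({"name": ["b", "c", "d"], "sell": [60, 20, 10], "buy": [30, 50, 40]})
data_topix = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 80, 0], "buy": [70, 30, 40]})

df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
                   keys=['Large', 'Mini', 'Topix'], axis=1)\
          .rename_axis(mapper=['name'], axis=0)\
          .rename_axis(mapper=['product', 'buy_sell'], axis=1)

group = pd.DataFrame({"name": ["a", "b", "c", "d"], "group": [1, 1, 2, 2]})

# assigning under a tuple key creates the ('group', 'new') column,
# with values aligned on the shared 'name' index
df_final = df_out.copy()
df_final[('group', 'new')] = group.set_index('name')['group']
```

From here the same set_index append step as in the EDIT gives the (name, group) row index.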

fill_value in the pandas shift doesn't work with groupby

I need to shift a column in a pandas dataframe, for every name, and fill the resulting NAs with a predefined value. Below is a code snippet run with Python 2.7:
import pandas as pd

d = {'Name': ['Petro', 'Petro', 'Petro', 'Petro', 'Petro', 'Mykola', 'Mykola', 'Mykola',
              'Mykola', 'Mykola', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta'],
     'Month': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
     'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]}
data = pd.DataFrame(d)
data['ValueLag'] = data.groupby('Name').Value.shift(-1, fill_value=20)
print data
After running code above I get the following output
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 NaN
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 NaN
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 NaN
Looks like fill_value did not work here, while I need the NaN to be filled with some number, let's say 4.
Or, to tell the whole story, I need the last value to be extended, like this:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 26.4
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 27.0
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 23.0
Is there a way to fill with last value forward or first value backward if shifting positive number of periods?
It seems fill_value is not applied as expected here. Forward-filling after the shift produces the extended output, because the NaN at the end of each group is filled from that same group's previous lag value. Try the following:
data['ValueLag'] = data.groupby('Name').Value.shift(-1).ffill()
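Another option for the "extend the last value" output: after shift(-1), the NaN always lands on each group's last row, so filling it from that row's own Value stays within the group by construction. A sketch (written for Python 3):

```python
import pandas as pd

d = {'Name': ['Petro'] * 5 + ['Mykola'] * 5 + ['Mykyta'] * 5,
     'Month': [1, 2, 3, 4, 5] * 3,
     'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]}
data = pd.DataFrame(d)

# each group's trailing NaN is replaced by that row's Value,
# i.e. the last value is carried into ValueLag
data['ValueLag'] = data.groupby('Name')['Value'].shift(-1).fillna(data['Value'])
```

This reproduces the desired table exactly (26.4, 27.0 and 23.0 on the last rows of the three groups).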

merge multiple columns into one column from the same dataframe pandas

My input dataset is as follows. I want to rename multiple columns to the same variable names T1, T2, T3, T4 and combine the columns sharing a name into one column.
df
ID Q3.4 Q3.6 Q3.8 Q3.18 Q4.4 Q4.6 Q4.8 Q4.12
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
rename vars
T1 = ['Q3.4', 'Q4.4']
T2 = ['Q3.6', 'Q4.6']
T3 = ['Q3.8', 'Q4.8']
T4 = ['Q3.18', 'Q4.12']
Step 1: I have renamed the variables like this (let me know if there is faster code, please):
df.rename(columns={'Q3.4': 'T1', 'Q4.4': 'T1'}, inplace=True)
df.rename(columns={'Q3.6': 'T2', 'Q4.6': 'T2'}, inplace=True)
df.rename(columns={'Q3.8': 'T3', 'Q4.8': 'T3'}, inplace=True)
df.rename(columns={'Q3.18': 'T4', 'Q4.12': 'T4'}, inplace=True)
ID T1 T2 T3 T4 T1 T2 T3 T4
1 NaN NaN NaN NaN 20 60 80 20
2 10 20 20 40 NaN NaN NaN NaN
3 30 40 40 40 NaN NaN NaN NaN
4 NaN NaN NaN NaN 50 50 50 50
How can I merge the columns into the following expected df?
ID T1 T2 T3 T4
1 20 60 80 20
2 10 20 20 40
3 30 40 40 40
4 50 50 50 50
Thanks!
Start with your original df and groupby with axis=1, using a dict that maps each column to its target name:
d = {'Q3.4': 'T1', 'Q4.4': 'T1',
     'Q3.6': 'T2', 'Q4.6': 'T2',
     'Q3.8': 'T3', 'Q4.8': 'T3',
     'Q3.18': 'T4', 'Q4.12': 'T4'}
df.set_index('ID').groupby(d, axis=1).first()
Out[80]:
T1 T2 T3 T4
ID
1 20.0 60.0 80.0 20.0
2 10.0 20.0 20.0 40.0
3 30.0 40.0 40.0 40.0
4 50.0 50.0 50.0 50.0
How about this:
df.sum(level=0, axis=1)
Out[313]:
ID T1 T2 T3 T4
0 1.0 20.0 60.0 80.0 20.0
1 2.0 10.0 20.0 20.0 40.0
2 3.0 30.0 40.0 40.0 40.0
3 4.0 50.0 50.0 50.0 50.0
Try:
# set index if not already
df = df.set_index('ID')
# stack unstack:
df = df.stack().unstack().reset_index()
output:
ID T1 T2 T3 T4
0 1 20.0 60.0 80.0 20.0
1 2 10.0 20.0 20.0 40.0
2 3 30.0 40.0 40.0 40.0
3 4 50.0 50.0 50.0 50.0
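Note that groupby(..., axis=1) and sum(level=..., axis=1) are deprecated or removed in recent pandas. A version-safe sketch of the first answer's idea transposes the frame and groups the duplicate labels on the row axis instead (d is the same rename dict as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Q3.4': [np.nan, 10, 30, np.nan], 'Q3.6': [np.nan, 20, 40, np.nan],
                   'Q3.8': [np.nan, 20, 40, np.nan], 'Q3.18': [np.nan, 40, 40, np.nan],
                   'Q4.4': [20, np.nan, np.nan, 50], 'Q4.6': [60, np.nan, np.nan, 50],
                   'Q4.8': [80, np.nan, np.nan, 50], 'Q4.12': [20, np.nan, np.nan, 50]})
d = {'Q3.4': 'T1', 'Q4.4': 'T1', 'Q3.6': 'T2', 'Q4.6': 'T2',
     'Q3.8': 'T3', 'Q4.8': 'T3', 'Q3.18': 'T4', 'Q4.12': 'T4'}

# transpose, group the duplicated labels on the row axis,
# keep the first non-null per ID, then transpose back
out = df.set_index('ID').rename(columns=d).T.groupby(level=0).first().T
```

This yields the same T1-T4 table as the accepted answer without the axis=1 groupby.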