Filling NaN values in pandas using Train Data Statistics

I will explain my problem statement:
Suppose I have train data and test data.
I have NaN values in the same columns in both train and test. My strategy for NaN imputation is this:
Group by some column and fill the NaNs with the mean of that group. Example:
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 NaN
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 NaN
8 mechanic 110 40.0
9 teacher 350 120.0
For the train data I can do it like this:
x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x: x.fillna(x.mean()))
But how do I do something like this for the test data, where the mean should be the training group's mean?
I tried doing it with a for loop, but it's taking forever.

Create a Series of the per-group means:
mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver 30.0
mechanic 25.0
teacher 100.0
unemployed 0.0
Name: expenditure, dtype: float64
Then replace the missing values using Series.map with Series.fillna:
x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 25.0
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 30.0
8 mechanic 110 40.0
9 teacher 350 120.0
Then use the same approach for the test data:
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
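One gap this leaves open: the test data may contain an occupation that never appears in train, and map then yields NaN again. A minimal sketch of a guard for that case (the tiny frames and the global-mean fallback are my own illustration, not part of the answer above):

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher'],
    'expenditure': [20, 40, np.nan, 100]})
x_test = pd.DataFrame({
    'Occupation': ['driver', 'plumber'],  # 'plumber' never appears in train
    'expenditure': [np.nan, np.nan]})

# per-group means learned on the train data only
mean = x_train.groupby('Occupation')['expenditure'].mean()

# unseen occupations map to NaN, so chain a second fillna
# with the overall train mean as a fallback
fallback = x_train['expenditure'].mean()
x_test['expenditure'] = (x_test['expenditure']
                         .fillna(x_test['Occupation'].map(mean))
                         .fillna(fallback))
```

The fallback choice (global train mean) is just one option; a constant or a median would slot into the same chain.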
EDIT:
Solution for multiple columns: instead of map, use DataFrame.join:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
mean = x_train.groupby('Occupation').mean(numeric_only=True)  # numeric_only skips the string column 'col' (required in newer pandas)
print (mean)
salary expenditure expenditure1
Occupation
driver 113.333333 30.0 30.0
mechanic 90.000000 25.0 25.0
teacher 300.000000 100.0 100.0
unemployed 10.000000 0.0 0.0
x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
Occupation salary expenditure expenditure1 col
0 driver 100 20.0 20.0 a
1 driver 150 40.0 40.0 a
2 mechanic 70 10.0 10.0 b
3 teacher 300 100.0 100.0 b
4 mechanic 90 25.0 25.0 d
5 teacher 250 80.0 80.0 d
6 unemployed 10 0.0 0.0 e
7 driver 90 30.0 30.0 e
8 mechanic 110 40.0 40.0 h
9 teacher 350 120.0 120.0 h
x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))

Related

Python - count and Difference data frames

I have two data frames about occupation in industry in 2005 and 2006. I would like to create a df with the change between these years, whether it grew or decreased. Here is a sample:
import pandas as pd
d = {'OCC2005': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4321,4321, 3333], 'IND2005': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5], 'Result': [7, 8, 12, 1, 11,15,20,1,5,12,8,4,3]}
df = pd.DataFrame(data=d)
print(df)
d2 = {'OCC2006': [1234, 1234, 1234 ,1234, 2357,2357,2357,2357, 4321,4321,4361,4321, 3333,4444], 'IND2006': [4, 5, 6, 7, 5,6,7,4, 6,7,5,4,5,8], 'Result': [17, 18, 12, 1, 1,5,20,1,5,2,18,4,0,15]}
df2 = pd.DataFrame(data=d2)
print(df2)
Final_Result = df2['Result'] - df['Result']
print(Final_Result)
I would like to create a df with columns occ, ind, and final_result.
Rename the columns of df to match the column names of df2, then subtract on a shared (OCC, IND) index:
MAP = dict(zip(df.columns, df2.columns))
out = (df2.set_index(['OCC2006', 'IND2006'])
.sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']))
.reset_index())
print(out)
# Output
OCC2006 IND2006 Result
0 1234 4 10.0
1 1234 5 10.0
2 1234 6 0.0
3 1234 7 0.0
4 2357 4 0.0
5 2357 5 -10.0
6 2357 6 -10.0
7 2357 7 0.0
8 3333 5 -3.0
9 4321 4 0.0
10 4321 5 NaN
11 4321 6 0.0
12 4321 7 -10.0
13 4361 5 NaN
14 4444 8 NaN
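The NaN rows above are (OCC, IND) pairs that exist in only one of the two years. If you would rather count the missing year as 0, DataFrame.sub accepts a fill_value; a small sketch with made-up pairs (my own example, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'OCC2005': [1234, 9999], 'IND2005': [4, 5], 'Result': [7, 3]})
df2 = pd.DataFrame({'OCC2006': [1234, 8888], 'IND2006': [4, 6], 'Result': [17, 5]})

MAP = dict(zip(df.columns, df2.columns))
out = (df2.set_index(['OCC2006', 'IND2006'])
          .sub(df.rename(columns=MAP).set_index(['OCC2006', 'IND2006']),
               fill_value=0)  # a pair missing from one year counts as 0, not NaN
          .reset_index())
```

Note that fill_value changes the meaning of those rows (a vanished pair shows up as a negative change rather than NaN), so use it only if that reading is what you want.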

Filling NaN values in pandas after grouping

This question is slightly different from usual filling of NaN values.
Suppose I have a dataframe in which I group by some category. Now I want to fill the NaN values of a column using the mean of that group, but taken from a different column.
Let me take an example:
import numpy as np
import pandas as pd

a = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure
Occupation salary expenditure diff
0 driver 100 20.0 80.0
1 driver 150 40.0 110.0
2 mechanic 70 10.0 60.0
3 teacher 300 100.0 200.0
4 mechanic 90 NaN NaN
5 teacher 250 80.0 170.0
6 unemployed 10 0.0 10.0
7 driver 90 NaN NaN
8 mechanic 110 40.0 70.0
9 teacher 350 120.0 230.0
So, in the above case, I would like to fill the NaN values in expenditure as:
salary - mean(difference) for each group.
How do I do that using pandas?
You can create a series with the desired values via groupby.transform, and use it to update the target column.
Assuming you want to group by Occupation:
a['mean_diff'] = a.groupby('Occupation')['diff'].transform('mean')
# assign back rather than using inplace=True; chained inplace updates
# are unreliable under pandas copy-on-write
a['expenditure'] = a['expenditure'].mask(
    a['expenditure'].isna(),
    a['salary'] - a['mean_diff'],
)
Output
Occupation salary expenditure diff mean_diff
0 driver 100 20.0 80.0 95.0
1 driver 150 40.0 110.0 95.0
2 mechanic 70 10.0 60.0 65.0
3 teacher 300 100.0 200.0 200.0
4 mechanic 90 25.0 NaN 65.0
5 teacher 250 80.0 170.0 200.0
6 unemployed 10 0.0 10.0 10.0
7 driver 90 -5.0 NaN 95.0
8 mechanic 110 40.0 70.0 65.0
9 teacher 350 120.0 230.0 200.0
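For reference, the same fill can be done without keeping helper columns, by building the per-group mean difference from a temporary series; a sketch of the same idea on the question's data:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})

# per-group mean of (salary - expenditure); NaN rows are skipped by 'mean'
mean_diff = (a['salary'] - a['expenditure']).groupby(a['Occupation']).transform('mean')

# fill each missing expenditure with salary - group's mean difference
a['expenditure'] = a['expenditure'].fillna(a['salary'] - mean_diff)
```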

Joining two pandas dataframes with multi-indexed columns

I want to join two pandas dataframes, one of which has multi-indexed columns.
This is how I make the first dataframe.
data_large = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 60, 50], "buy":[20, 30, 40]})
data_mini = pd.DataFrame({"name":["b", "c", "d"], "sell":[60, 20, 10], "buy":[30, 50, 40]})
data_topix = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 80, 0], "buy":[70, 30, 40]})
df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
keys=['Large', 'Mini', 'Topix'], axis=1)\
.rename_axis(mapper=['name'], axis=0).rename_axis(mapper=['product','buy_sell'], axis=1)
df_out
And this is the second dataframe.
group = pd.DataFrame({"name":["a", "b", "c", "d"], "group":[1, 1, 2, 2]})
group
How can I join the second to the first, on the column name, keeping the multi-indexed columns?
This did not work and it flattened the multi-index.
df_final = df_out.merge(group, on=['name'], how='left')
Any help would be appreciated!
If you need a MultiIndex after the merge, you must first convert the group DataFrame's columns to a MultiIndex. Here the name column is converted to the index so the merge happens by index; otherwise both frames' columns would have to be converted to MultiIndexes:
group = group.set_index('name')
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = df_out.merge(group, on=['name'], how='left')
Or:
df_final = df_out.merge(group, left_index=True, right_index=True, how='left')
print (df_final)
product Large Mini Topix group
buy_sell sell buy sell buy sell buy new
name
a 10.0 20.0 NaN NaN 10.0 70.0 1
b 60.0 30.0 60.0 30.0 80.0 30.0 1
c 50.0 40.0 20.0 50.0 0.0 40.0 2
d NaN NaN 10.0 40.0 NaN NaN 2
Another possible way, though it raises a warning, is to convert the columns to a MultiIndex after the merge:
df_final = df_out.merge(group, on=['name'], how='left')
UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
warnings.warn(msg, UserWarning)
L = [x if isinstance(x, tuple) else (x, 'new') for x in df_final.columns.tolist()]
df_final.columns = pd.MultiIndex.from_tuples(L)
print (df_final)
name Large Mini Topix group
new sell buy sell buy sell buy new
0 a 10.0 20.0 NaN NaN 10.0 70.0 1
1 b 60.0 30.0 60.0 30.0 80.0 30.0 1
2 c 50.0 40.0 20.0 50.0 0.0 40.0 2
3 d NaN NaN 10.0 40.0 NaN NaN 2
EDIT: If you need group in the row MultiIndex:
group = group.set_index(['name'])
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = (df_out.merge(group, on=['name'], how='left')
.set_index([('group','new')], append=True)
.rename_axis(['name','group']))
print (df_final)
product Large Mini Topix
buy_sell sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
Or:
df_final = df_out.merge(group, on=['name'], how='left').set_index(['name','group'])
df_final.columns = pd.MultiIndex.from_tuples(df_final.columns)
print (df_final)
Large Mini Topix
sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
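The core move in the answer above, lifting the flat group columns one level with MultiIndex.from_product so both sides have the same column depth, can be checked on a minimal pair of frames (the names here are made up for illustration):

```python
import pandas as pd

# left frame already has two-level columns
left = pd.DataFrame([[1, 2]],
                    index=pd.Index(['a'], name='name'),
                    columns=pd.MultiIndex.from_tuples([('L', 'sell'), ('L', 'buy')]))
# right frame has flat columns
right = pd.DataFrame({'group': [1]}, index=pd.Index(['a'], name='name'))

# lift right's flat columns to two levels so both sides match in depth
right.columns = pd.MultiIndex.from_product([right.columns, ['new']])

# now an index-on-index merge keeps the MultiIndex intact, with no warning
merged = left.merge(right, left_index=True, right_index=True, how='left')
```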

fill_value in the pandas shift doesn't work with groupby

I need to shift a column in a pandas dataframe for every name, and fill the resulting NAs with a predefined value. Below is a code snippet run with Python 2.7.
import pandas as pd
d = {'Name': ['Petro', 'Petro', 'Petro', 'Petro', 'Petro', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta'],
'Month': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]
}
data = pd.DataFrame(d)
data['ValueLag'] = data.groupby('Name').Value.shift(-1, fill_value = 20)
print data
After running code above I get the following output
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 NaN
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 NaN
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 NaN
Looks like fill_value did not work here, while I need the NaNs to be filled with some number, let's say 4.
Or, to tell the whole story, I need the last value to be extended, like this:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 26.4
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 27.0
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 23.0
Is there a way to fill forward with the last value (or backward with the first, when shifting a positive number of periods)?
Judging by your output, fill_value is simply ignored by the groupby shift in your pandas version. Since what you actually want is the last value extended forward, shift and then forward-fill:
data['ValueLag'] = data.groupby('Name').Value.shift(-1).ffill()
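One caveat with a plain .ffill() after the groupby shift: it runs over the whole column, so in general a NaN could be filled from the previous group. Doing the shift and fill inside transform keeps everything within each name (a sketch on a trimmed version of the data):

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Petro', 'Petro', 'Mykola', 'Mykola'],
    'Value': [25.0, 2.5, 35.0, 24.0]})

# shift within each group, then forward-fill within that same group,
# so the last row of one name can never borrow from the next name
data['ValueLag'] = (data.groupby('Name')['Value']
                        .transform(lambda s: s.shift(-1).ffill()))
```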

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append(), but am looking for a way that does NOT require constructing the new rows as a Series before appending them to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6)))  # more easily extensible than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1: note that the range can be altered to whatever you desire.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2: here we add a new column to a data frame that had 8 rows to begin with. As we extend the new column c to length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance on aggregating dataframes matters):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0
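If the goal is simply to append several rows at once without .append(), building the new rows as their own small frame and concatenating is another option (my own sketch, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index=[1, 2])

# build the new rows as their own frame and concatenate;
# columns missing on either side come out as NaN
new = pd.DataFrame({'a': [40, 50]}, index=[4, 5])
df = pd.concat([df, new])
```

This avoids any reindex step, at the cost of creating a temporary frame for the new rows.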