fill_value in the pandas shift doesn't work with groupby - pandas

I need to shift a column in a pandas dataframe, separately for every name, and fill the resulting NAs with a predefined value. Below is a code snippet run with Python 2.7:
import pandas as pd
d = {'Name': ['Petro', 'Petro', 'Petro', 'Petro', 'Petro', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykola', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta', 'Mykyta'],
'Month': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]
}
data = pd.DataFrame(d)
data['ValueLag'] = data.groupby('Name').Value.shift(-1, fill_value = 20)
print data
After running the code above I get the following output:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 NaN
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 NaN
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 NaN
It looks like fill_value did not work here, while I need the NaN to be filled with some number, let's say 4.
Or, to tell the whole story, I need that last value to be extended, like this:
Month Name Value ValueLag
0 1 Petro 25.0 2.5
1 2 Petro 2.5 24.6
2 3 Petro 24.6 28.0
3 4 Petro 28.0 26.4
4 5 Petro 26.4 26.4
5 1 Mykola 35.0 24.0
6 2 Mykola 24.0 35.0
7 3 Mykola 35.0 22.0
8 4 Mykola 22.0 27.0
9 5 Mykola 27.0 27.0
10 1 Mykyta 30.0 30.0
11 2 Mykyta 30.0 34.0
12 3 Mykyta 34.0 30.0
13 4 Mykyta 30.0 23.0
14 5 Mykyta 23.0 23.0
Is there a way to fill with the last value going forward, or with the first value going backward when shifting a positive number of periods?

It seems that the fill value is applied per group rather than as a single value. Try the following:
data['ValueLag'] = data.groupby('Name').Value.shift(-1).ffill()
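If the goal is specifically the desired output above (the last value carried forward within each row's own group), one alternative sketch is to fill the shifted NaNs from the Value column itself:

```python
import pandas as pd

d = {'Name': ['Petro'] * 5 + ['Mykola'] * 5 + ['Mykyta'] * 5,
     'Month': [1, 2, 3, 4, 5] * 3,
     'Value': [25, 2.5, 24.6, 28, 26.4, 35, 24, 35, 22, 27, 30, 30, 34, 30, 23]}
data = pd.DataFrame(d)

# shift(-1) leaves a NaN on the last row of each group; filling that
# NaN with the row's own Value carries the last value forward, which
# matches the desired output exactly and never crosses group boundaries
data['ValueLag'] = data.groupby('Name').Value.shift(-1).fillna(data.Value)
```

Unlike a plain `.ffill()`, this cannot accidentally pull a value from the previous group if a group ever starts with a NaN.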

Related

Filling NaN values in pandas after grouping

This question is slightly different from the usual filling of NaN values.
Suppose I have a dataframe in which I group by some category. Now I want to fill the NaN values of a column using the mean value of that group, but taken from a different column.
Let me take an example:
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure
Occupation salary expenditure diff
0 driver 100 20.0 80.0
1 driver 150 40.0 110.0
2 mechanic 70 10.0 60.0
3 teacher 300 100.0 200.0
4 mechanic 90 NaN NaN
5 teacher 250 80.0 170.0
6 unemployed 10 0.0 10.0
7 driver 90 NaN NaN
8 mechanic 110 40.0 70.0
9 teacher 350 120.0 230.0
So, in the above case, I would like to fill the NaN values in expenditure as:
salary - mean(difference) for each group.
How do I do that using pandas?
You can create a new series with the desired values using groupby.transform, and use it to update the target column.
Assuming you want to group by Occupation:
a['mean_diff'] = a.groupby('Occupation')['diff'].transform('mean')
a.expenditure.mask(
    a.expenditure.isna(),
    a.salary - a.mean_diff,
    inplace=True
)
Output
Occupation salary expenditure diff mean_diff
0 driver 100 20.0 80.0 95.0
1 driver 150 40.0 110.0 95.0
2 mechanic 70 10.0 60.0 65.0
3 teacher 300 100.0 200.0 200.0
4 mechanic 90 25.0 NaN 65.0
5 teacher 250 80.0 170.0 200.0
6 unemployed 10 0.0 10.0 10.0
7 driver 90 -5.0 NaN 95.0
8 mechanic 110 40.0 70.0 65.0
9 teacher 350 120.0 230.0 200.0
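If you'd rather not keep the helper column or mutate in place, a one-line variant (same logic, just fillna instead of mask) is:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
a['diff'] = a.salary - a.expenditure

# fill each missing expenditure with salary minus the group's mean diff;
# fillna only touches the rows that are actually NaN
a['expenditure'] = a['expenditure'].fillna(
    a['salary'] - a.groupby('Occupation')['diff'].transform('mean'))
```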

How to add 1 to previous data if NaN in pandas

I was wondering if it is possible to fill missing values in a pandas DataFrame / Series by adding 1 (or n) to the previous value.
For example:
1
10
nan
15
25
nan
nan
nan
30
Would return :
1
10
11
15
25
26
27
28
30
Thank you,
Use .ffill plus the result of a groupby.cumcount to determine n:
df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
0 1.0
1 10.0
2 11.0
3 15.0
4 25.0
5 26.0
6 27.0
7 28.0
8 30.0
dtype: float64
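For reference, here is a self-contained version of the same idea, assuming (hypothetically) the data sits in column 0 of a DataFrame:

```python
import numpy as np
import pandas as pd

# hypothetical reconstruction of the asker's data
df = pd.DataFrame([1, 10, np.nan, 15, 25, np.nan, np.nan, np.nan, 30])

# the cumsum over notnull() assigns a group id that increments at every
# real value, so cumcount() counts how many NaNs have passed since the
# last real value -- exactly the n to add on top of the forward-filled value
result = df[0].ffill() + df.groupby(df[0].notnull().cumsum()).cumcount()
```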

Plot histogram on non-distributed data

I have a pandas series like
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
11 58.0
12 20.0
13 39.0
14 14.0
15 55.0
16 2.0
17 NaN
While trying to plot a histogram like
plt.hist(train_df['Age'])
I get the following error:
ValueError: max must be larger than min in range parameter
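No answer is recorded above; a likely fix (my sketch, not from the original thread) is to drop the NaNs before plotting, since the histogram's automatic range is computed from the array's min and max, and that comparison fails when NaNs are present:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the asker's train_df
train_df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, np.nan, 54.0, np.nan]})

# drop the NaNs first; np.histogram (which plt.hist uses under the
# hood) then computes a valid (min, max) range
ages = train_df['Age'].dropna()
counts, edges = np.histogram(ages, bins=5)
# plt.hist(ages) now works the same way, without the ValueError
```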

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append(), but am looking for a way that does NOT require constructing the new row as a Series before appending it to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6)))  # more easily extended than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Example data
>>> data = pd.DataFrame({
'a': [10, 6, -3, -2, 4, 12, 3, 3],
'b': [6, -3, 6, 12, 8, 11, -5, -5],
'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1. Note that the range can be altered to whatever you desire.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2. Starting again from the original data, here we add a new column to a data frame that had 8 rows to begin with. As we extend our new column c to length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance on aggregating dataframes matters):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0
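In more recent pandas versions (a sketch under that assumption), the usual way to add several rows at once is concat, which fills missing columns with NaN automatically:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index=[1, 2])
df.loc[3, 'a'] = 30

# build the new rows as their own frame and concatenate; the 'b'
# column is absent from the new frame, so it is filled with NaN
new = pd.DataFrame({'a': [40, 50]}, index=[4, 5])
df = pd.concat([df, new])
```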

Compute a sequential rolling mean in pandas as array function?

I am trying to calculate a rolling mean on a dataframe with NaNs in pandas, but pandas seems to reset the window when it meets a NaN. Here's some code as an example...
import numpy as np
from pandas import *
foo = DataFrame(np.arange(0.0,13.0))
foo['1'] = np.arange(13.0,26.0)
foo.ix[4:6,0] = np.nan
foo.ix[4:7,1] = np.nan
bar = rolling_mean(foo, 4)
This gives a rolling mean that resets the window after each run of NaNs, instead of just skipping over the NaNs:
bar =
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 8.5 NaN
11 9.5 22.5
12 10.5 23.5
I have found an ugly iterate/dropna() workaround that gives the right answer:
def sparse_rolling_mean(df_data, window):
    f_data = DataFrame(np.nan, index=df_data.index, columns=df_data.columns)
    for i in f_data.columns:
        f_data.ix[:, i] = rolling_mean(df_data.ix[:, i].dropna(), window)
    return f_data
bar = sparse_rolling_mean(foo,4)
bar
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
Does anyone know if it is possible to do this as an array function?
Many thanks in advance.
You may do:
>>> def sparse_rolling_mean(ts, window):
... return rolling_mean(ts.dropna(), window).reindex_like(ts)
...
>>> foo.apply(sparse_rolling_mean, args=(4,))
0 1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.50 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 3.25 NaN
8 5.00 16.5
9 6.75 18.5
10 8.50 20.5
11 9.50 22.5
12 10.50 23.5
[13 rows x 2 columns]
You can control what gets NaN'd out with the min_periods arg:
In [12]: rolling_mean(foo, 4,min_periods=1)
Out[12]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 2.0 15.0
5 2.5 15.5
6 3.0 16.0
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
You can do this if you want results everywhere except where the original was NaN:
In [27]: rolling_mean(foo, 4,min_periods=1)[foo.notnull()]
Out[27]:
0 1
0 0.0 13.0
1 0.5 13.5
2 1.0 14.0
3 1.5 14.5
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 7.0 NaN
8 7.5 21.0
9 8.0 21.5
10 8.5 22.0
11 9.5 22.5
12 10.5 23.5
[13 rows x 2 columns]
Your expected results are a bit odd, as the first 3 rows should have values.
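The snippets above use the long-removed rolling_mean and .ix APIs; the same dropna/reindex_like trick translates to modern pandas roughly like this (iloc slices chosen to reproduce the original .ix label slices):

```python
import numpy as np
import pandas as pd

# rebuild the question's example data
foo = pd.DataFrame(np.arange(0.0, 13.0))
foo['1'] = np.arange(13.0, 26.0)
foo.iloc[4:7, 0] = np.nan   # rows 4-6, as foo.ix[4:6, 0] did
foo.iloc[4:8, 1] = np.nan   # rows 4-7, as foo.ix[4:7, 1] did

# roll over the NaN-dropped values, then align back onto the
# original index so dropped positions reappear as NaN
bar = foo.apply(lambda s: s.dropna().rolling(4).mean().reindex_like(s))
```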