mode returns Exception: Must produce aggregated value - pandas

for this dataframe
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line returns "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?

The problem is that mode sometimes returns 2 or more values; check the result with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
And pandas agg needs a scalar output, so it raises this error. So if you select the first value, it works nicely:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
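If all the modes are needed, not only the first one, a possible sketch is to wrap them in a list; since a plain Python list is a single object rather than a Series, agg should accept it as an aggregated value:
import pandas as pd

# reconstruction of the sample data from the question
df = pd.DataFrame({'values': [3.0, 0.0, 3.0, 2.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0],
                   'ii': [4, 1, 8, 5, 1, 5, 4, 8, 5, 1]})

# a list per group keeps every mode without triggering the aggregation check
all_modes = df.groupby('ii')['values'].agg(lambda x: x.mode().tolist())
print (all_modes)
ii
1    [0.0, 1.0, 2.0]
4         [2.0, 3.0]
5    [0.0, 2.0, 3.0]
8         [1.0, 3.0]
Name: values, dtype: object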

How to get the index of the last condition and assign it to other columns

The condition is column 'A' > 0.5.
I want to take the index of the last row where the condition held and assign it to column 'cond_index':
A cond_index
0 0.001566 NaN
1 0.174676 NaN
2 0.553506 2
3 0.583377 3
4 0.418854 3
5 0.836482 5
6 0.927756 6
7 0.800908 7
8 0.277646 7
9 0.388323 7
Use Index.to_series with Series.where to set missing values where the condition (compare for greater than 0.5) does not hold, and then forward fill the missing values:
df['new'] = df.index.to_series().where(df['A'].gt(0.5)).ffill()
print (df)
A cond_index new
0 0.001566 NaN NaN
1 0.174676 NaN NaN
2 0.553506 2.0 2.0
3 0.583377 3.0 3.0
4 0.418854 3.0 3.0
5 0.836482 5.0 5.0
6 0.927756 6.0 6.0
7 0.800908 7.0 7.0
8 0.277646 7.0 7.0
9 0.388323 7.0 7.0
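If the filled indices should stay integers rather than floats, one option, assuming the nullable Int64 dtype is acceptable, is to cast after the fill:
df['cond_index'] = df.index.to_series().where(df['A'].gt(0.5)).ffill().astype('Int64')
print (df['cond_index'])
0    <NA>
1    <NA>
2       2
3       3
4       3
5       5
6       6
7       7
8       7
9       7
Name: cond_index, dtype: Int64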

Pandas: how to group rows by a dictionary of {row: group}

I have a dataframe with n rows:
1 2 3
3 4 1
5 3 2
9 8 2
7 2 6
0 0 0
4 4 4
8 4 1
...
and a dictionary, so that the row index is a key and the value is the group:
d = {0 : 0 , 1: 0, 2 : 0, 3 : 1, 4 : 1, 5: 2, 6: 2}
I want to group by the keys and then apply mean on the groups.
So I will get:
3 3 2 #This is the mean of rows 0,1,2 from the original df, as d[0]=d[1]=d[2]=0
8 5 4
2 2 2
8 4 1
What is the best way to do so?
Simply use the dictionary in the groupby; it will replace each index value with the dictionary value matching on the key:
df.groupby(d).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
If you also want to keep the rows whose index is not in the dictionary, use dropna=False in groupby. They will be listed in the NaN group:
df.groupby(d, dropna=False).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
NaN 8.0 4.0 1.0
And for a range index instead of the dictionary keys:
df.groupby(d, dropna=False, as_index=False).mean()
output:
a b c
0 3.0 3.0 2.0
1 8.0 5.0 4.0
2 2.0 2.0 2.0
3 8.0 4.0 1.0
Used input:
a b c
0 1 2 3
1 3 4 1
2 5 3 2
3 9 8 2
4 7 2 6
5 0 0 0
6 4 4 4
7 8 4 1
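For reference, passing the dictionary to groupby is a shortcut for mapping the index through it; a minimal sketch of the equivalent spelling:
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 5, 9, 7, 0, 4, 8],
                   'b': [2, 4, 3, 8, 2, 0, 4, 4],
                   'c': [3, 1, 2, 2, 6, 0, 4, 1]})
d = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

# each index label is looked up in d; labels missing from d (here row 7)
# become NaN and are dropped unless dropna=False is passed
print (df.groupby(df.index.map(d)).mean())   # same result as df.groupby(d).mean()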

How to perform a rolling window on a pandas DataFrame, whereby each row contains NaN values that should not be replaced?

I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling summation is applied:
df.rolling(window=7, min_periods=1, axis='columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
However, I have a dataframe in which all columns are of the same object dtype:
df = df.astype('object')
Applying the same rolling sum to it gives:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output, however, stops and starts again after a NaN value appears. It would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figure there must be a way for the NaN values not to be considered, but also not to be filled in with values obtained from the rolling window.
Anything would help!
A workaround is:
Find where the NaN values are located:
nan = df.isnull()
Apply the rolling window:
df = df.rolling(window=7, min_periods=1, axis='columns').sum()
Only show the values where the mask is False:
df[~nan]
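The whole workaround as a runnable sketch; everything is cast to float first so the dtypes are uniform, and the rolling is done over the transpose because recent pandas versions deprecate the axis argument of rolling:
import pandas as pd
import numpy as np

df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
                   [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
                   [0, 2, 2, np.nan, 2, np.nan, 1, 1]]).astype(float)

nan = df.isnull()                                        # remember where the holes are
rolled = df.T.rolling(window=7, min_periods=1).sum().T   # row-wise rolling sum, NaNs skipped
print (rolled[~nan])                                     # put the NaNs back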

How to calculate how many times a value changes in a column

How can I calculate, in the easiest way, how many value changes I have in a specific DataFrame column? For example, I have the following DF:
a b
0 1
1 1
2 1
3 2
4 1
5 2
6 2
7 3
8 3
9 3
In this DataFrame the values in column b have changed 4 times (in rows 4, 5, 6 and 8).
My very simple solution is:
a = 0
for i in range(df.shape[0] - 1):
    if df['b'].iloc[i] != df['b'].iloc[i+1]:
        a += 1
I think you need boolean indexing with the index:
idx = df.index[df['b'].diff().shift().fillna(0).ne(0)]
print (idx)
Int64Index([4, 5, 6, 8], dtype='int64')
For a more general solution, it is possible to index by arange:
a = np.arange(len(df))[df['b'].diff().shift().bfill().ne(0)].tolist()
print (a)
[4, 5, 6, 8]
Explanation:
First get the difference with Series.diff:
print (df['b'].diff())
0 NaN
1 0.0
2 0.0
3 1.0
4 -1.0
5 1.0
6 0.0
7 1.0
8 0.0
9 0.0
Name: b, dtype: float64
Then shift by one value:
print (df['b'].diff().shift())
0 NaN
1 NaN
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
Replace the leading NaNs with fillna:
print (df['b'].diff().shift().fillna(0))
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 -1.0
6 1.0
7 0.0
8 1.0
9 0.0
Name: b, dtype: float64
And compare for not equal to 0:
print (df['b'].diff().shift().fillna(0).ne(0))
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
9 False
Name: b, dtype: bool
If a is a column and not the index:
idx = df['a'].loc[df['b'].diff().shift().fillna(0) != 0]
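If only the number of changes is needed, not their positions, a short sketch along the same lines:
import pandas as pd

df = pd.DataFrame({'a': range(10),
                   'b': [1, 1, 1, 2, 1, 2, 2, 3, 3, 3]})

# compare each value with its predecessor and skip the first row,
# which has no predecessor to compare against
print (df['b'].ne(df['b'].shift()).iloc[1:].sum())   # 4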

(pandas) Why does .bfill().ffill() act differently than ffill().bfill() on groups?

I think I'm missing something basic conceptually, but I'm not able to find the answer in the docs.
>>> df=pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[5,np.nan, 6, np.nan, np.nan, np.nan]})
>>> df
a b
0 1 5.0
1 1 NaN
2 2 6.0
3 2 NaN
4 3 NaN
5 3 NaN
Using ffill() and then bfill():
>>> df.groupby('a')['b'].ffill().bfill()
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Using bfill() and then ffill():
>>> df.groupby('a')['b'].bfill().ffill()
0 5.0
1 5.0
2 6.0
3 6.0
4 6.0
5 6.0
Doesn't the second way break the groupings? Will the first way always make sure that the values are filled in only with other values in that group?
I think you need:
print (df.groupby('a')['b'].apply(lambda x: x.ffill().bfill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].apply(lambda x: x.bfill().ffill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
because in your sample only the first ffill or bfill is DataFrameGroupBy.ffill or DataFrameGroupBy.bfill; the second operates on the output Series. So it breaks the groups, because a Series has no groups.
print (df.groupby('a')['b'].ffill())
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].bfill())
0 5.0
1 NaN
2 6.0
3 NaN
4 NaN
5 NaN
Name: b, dtype: float64
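As an alternative to the lambda with apply, a sketch that regroups the intermediate Series before the second fill, so both fills stay inside the groups:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2, 3, 3],
                   'b': [5, np.nan, 6, np.nan, np.nan, np.nan]})

# the first grouped fill returns a plain Series, so group it again
# by the same key before the second fill
filled = df.groupby('a')['b'].ffill().groupby(df['a']).bfill()
print (filled)
0    5.0
1    5.0
2    6.0
3    6.0
4    NaN
5    NaN
Name: b, dtype: float64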