Python Pandas groupby-apply strange behavior

Can anyone help me understand why there is different behavior between the two calls to apply below? Thank you.
In [34]: df
Out[34]:
   A  B  C
0  1  0  0
1  1  7  4
2  2  9  8
3  2  2  4
4  2  2  1
5  3  3  3
6  3  3  2
7  3  5  7
In [35]: g = df.groupby('A')
In [36]: g.apply(max)
Out[36]:
   A  B  C
A
1  1  7  4
2  2  9  8
3  3  5  7
In [37]: g.apply(lambda x: max(x))
Out[37]:
A
1    C
2    C
3    C
dtype: object

Short answer - you probably just want
df.groupby('A').max()
Longer answer - max is a generic Python function that finds the max of any iterable. Because iterating a DataFrame iterates over its column names, calling the builtin max just finds the "largest" column name, which is what happens in your second case.
In the first case, pandas has intercept logic that turns things like g.apply(max) into the optimized g.max().
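To see the iteration behavior directly, a minimal sketch (using the same df as above):

# iterating a DataFrame yields its column labels, so the builtin max
# compares column names, not values
list(df)   # ['A', 'B', 'C']
max(df)    # 'C' - the lexicographically largest column name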

Related

Using groupby() and cut() in pandas

I have a dataframe and, for each group, I want to label the values: if a value is less than the group mean, the label is 1, and if it is more than the group mean, the label is 2.
The input data frame is:
  groups  num1
0      a     2
1      a     5
2      a   NaN
3      b    10
4      b     4
5      b     0
6      b     7
7      c     2
8      c     4
9      c     1
Here the mean values for groups a, b, c are 3.5, 5.25 and 2.33 respectively, and the output data frame is:
  groups  out
0      a    1
1      a    2
2      a  NaN
3      b    2
4      b    1
5      b    1
6      b    2
7      c    1
8      c    2
9      c    1
I want to use pandas.cut, and maybe pandas.groupby and pandas.apply as well.
Also, how can I skip null values here?
Thanks in advance
cut is not really pertinent here. Use groupby.transform('mean') and numpy.where:
df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                     .transform('mean')),
                     1, 2)
Output (as new column "out" for clarity):
  groups  num1  out
0      a     2    1
1      a     5    2
2      a     7    2
3      b    10    2
4      b     4    1
5      b     0    1
6      b     7    2
7      c     2    1
8      c     4    2
9      c     1    1
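The question also asks how to skip null values. With np.where, comparing NaN is simply False, so NaN rows would be labeled 2; a minimal sketch to keep them as NaN instead (assuming the same df and imports as above):

df['out'] = np.where(df['num1'].lt(df.groupby('groups')['num1']
                                     .transform('mean')),
                     1, 2)
# overwrite the labels where the input was null
df.loc[df['num1'].isna(), 'out'] = np.nan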
I really want cut
OK, but it's not as clean or as performant (the lambda runs pd.cut once per group at the Python level):
(df.groupby('groups')['num1']
   .transform(lambda g: pd.cut(g, [-np.inf, g.mean(), np.inf], labels=[1, 2]))
)

Assign column values from another dataframe with repeating key values

Please help me with this in pandas, I can't find a good solution.
I've tried map, assign, merge, join, set_index.
Maybe I'm just too tired :)
df:
   m_num  A  B
0      1  0  9
1      1  1  8
2      2  2  7
3      2  3  6
4      3  4  5
5      3  5  4
df1:
   m_num   C
0      2  99
1      2  88
df_final:
   m_num  A  B    C
0      1  0  9  NaN
1      1  1  8  NaN
2      2  2  7   99
3      2  3  6   88
4      3  4  5  NaN
5      3  5  4  NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
print(df2)
Prints:
   m_num  A  B     C
0      1  0  9   NaN
1      1  1  8   NaN
2      2  2  7  99.0
3      2  3  6  88.0
4      3  4  5   NaN
5      3  5  4   NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky in that 'm_num' is the only common key and it has repeating values.
So first I created a dataframe of the rows of df whose 'm_num' appears in df1, so that I can use the index as another key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
   m_num  A  B
0      2  2  7
1      2  3  6
As you can see, we now have the indices 0 and 1 in addition to m_num as keys, which we can use to match with df1.
df2 = pd.merge(df2, df1, on=[df1.index, 'm_num']).drop('key_0', axis=1)
This prints:
   m_num  A  B   C
0      2  2  7  99
1      2  3  6  88
Then tie the above resultant dataframe to the original df and do a left join to get the output.
df2 = pd.merge(df, df2, on=['m_num', 'A', 'B'], how='left')
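A possibly simpler alternative (a sketch, not part of the original answer): number the repeated m_num values on both sides with groupby.cumcount and merge on that synthetic key.

# 'occ' counts the occurrences of each m_num: 0, 1, ... within each key
df['occ'] = df.groupby('m_num').cumcount()
df1['occ'] = df1.groupby('m_num').cumcount()
df_final = (df.merge(df1, on=['m_num', 'occ'], how='left')
              .drop(columns='occ'))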

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it:
The following DataFrame:
   A   B  C
0  7  12  2
1  5   4  4
2  4   8  2
3  9   2  3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0) but it doesn't do what I want and gives me the maximum value of the calculated column for each row (= 10 in the example).
Finally, I would like to obtain the DF below:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0     0
1     5
2     0
3    10
dtype: int64
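If you'd rather avoid eval's string expression, the same result can be written with plain column arithmetic (an equivalent sketch):

df['D'] = (df['A'] - df['B'] + df['C']).clip(lower=0)  # floor the result at 0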
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
To do this in one line, you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
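As a side note, the element-wise np.maximum (not np.max, which reduces to a single value) also does this directly, and may be what the question intended:

import numpy as np
df['D'] = np.maximum(df['A'] - df['B'] + df['C'], 0)  # element-wise max against 0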

Conditional count of cumulative sum Dataframe - Loop through columns

I'm trying to compute a cumulative sum with a reset within a dataframe, based on the sign of each value. The idea is to do the same exercise for each column separately.
For example, let's assume I have the following dataframe:
df = pd.DataFrame({'A': [1,1,1,-1,-1,1,1,1,1,-1,-1,-1],'B':[1,1,-1,-1,-1,1,1,1,-1,-1,-1,1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
For each column, I want to compute the cumulative sum until I find a change in sign, in which case the sum should be reset to 1. For the example above, I am expecting the following result:
df1 = pd.DataFrame({'A_cumcount': [1,2,3,1,2,1,2,3,4,1,2,3],
                    'B_cumcount': [1,2,1,2,3,1,2,3,1,2,3,4]},
                   index=[0,1,2,3,4,5,6,7,8,9,10,11])
A similar issue has been discussed here: Pandas: conditional rolling count
I have tried the following code:
nb_col = len(df.columns)  # number of columns in the dataframe
for i in range(0, int(nb_col)):  # loop through the columns of the dataframe
    name = df.columns[i]  # read the column name
    name = name + '_cumcount'
    # add a column for the calculation
    df = df.reindex(columns=np.append(df.columns.values, [name]))
    df[df.columns[nb_col + i]] = df.groupby((df[df.columns[i]] != df[df.columns[i]].shift(1)).cumsum()).cumcount() + 1
My question is: is there a way to avoid this for loop, so I can avoid appending a new column each time and make the computation faster? Thank you.
Answers received (all working fine):
From @nixon
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum()).cumcount()+1).add_suffix('_cumcount')
From @jezrael
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1).add_suffix('_cumcount'))
From @Scott Boston:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
I think in pandas a loop is needed, e.g. via apply:
df1 = (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
         .add_suffix('_cumcount'))
print (df1)
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
You can try this:
df.apply(lambda x: x.groupby(x.diff().bfill().ne(0).cumsum()).cumcount() + 1)
Output:
    A  B
0   1  1
1   2  2
2   3  1
3   1  2
4   2  3
5   1  1
6   2  2
7   3  3
8   4  1
9   1  2
10  2  3
11  3  1
You can start by grouping on the points where the value changes, obtained with x.diff().ne(0).cumsum(), and then using cumcount over the groups:
df.apply(lambda x: x.groupby(x.diff().ne(0).cumsum())
                    .cumcount() + 1).add_suffix('_cumcount')
    A_cumcount  B_cumcount
0            1           1
1            2           2
2            3           1
3            1           2
4            2           3
5            1           1
6            2           2
7            3           3
8            4           1
9            1           2
10           2           3
11           3           1
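To see why the diff/cumsum trick used by all three answers works, a minimal sketch on a single column:

import pandas as pd

s = pd.Series([1, 1, -1, -1, 1])
gid = s.diff().ne(0).cumsum()        # 1, 1, 2, 2, 3 - a new id at every sign change
out = s.groupby(gid).cumcount() + 1  # 1, 2, 1, 2, 1 - the count restarts per id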

pandas column operation on certain row in succession

I have a pandas dataframe like this:
   second block
0       1     a
1       2     b
2       3     c
3       4     a
4       5     c
This is sequential data, and I would like to get a new column which is the time difference between the current block and the next time it repeats.
   second block  freq
0       1     a     3   // (4-1)
1       2     b     0   // (not repeating)
2       3     c     2   // (5-3)
3       4     a     0   // (not repeating)
4       5     c     0   // (not repeating)
I have tried to get the unique list of blocks, then a for loop that does as below.
for i in unique_block:
    df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
I do not know how to get 0 for row indices 1, 3 and 4, and since the dataframe is very big, this is not efficient. It is not working.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block').diff(-1)*-1).fillna(0)
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
Using
df.groupby('block').second.apply(lambda x: x.diff().shift(-1)).fillna(0)
Out[242]:
0    3.0
1    0.0
2    2.0
3    0.0
4    0.0
Name: second, dtype: float64
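For reference, a hedged reconstruction of the example frame, so the answers above can be run as-is (column names as shown in the question):

import pandas as pd

df = pd.DataFrame({'second': [1, 2, 3, 4, 5],
                   'block': ['a', 'b', 'c', 'a', 'c']})
df['freq'] = (df.groupby('block')['second'].diff(-1) * -1).fillna(0)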