Compute lagged means per name and round in pandas

I need to compute lagged means per group in my DataFrame. This is what my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round, that is, the mean of value over all earlier rounds of the same name. For example, for name a in round 3 I need value_mean = 1.5 (because (2+1)/2). Of course, there will be NaN values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea how I can do this, please? I found this, but it does not seem relevant to my problem: Calculate the mean value using two columns in pandas

You can do that as follows
# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper= df.groupby('name')
ser_sum= grouper['value'].cumsum()
ser_count= grouper['value'].cumcount()+1
ser_mean= ser_sum.div(ser_count)
ser_same_name= df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN);
# where() fills the non-matching rows with NaN by default
df['value_mean']= ser_mean.shift(1).where(ser_same_name)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum']= ser_sum
#df['count']= ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
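As an alternative to the manual cumsum/cumcount approach above, here is a minimal sketch that computes the same lagged mean with an expanding window inside groupby.transform. The sample data is copied from the question; this is one possible formulation, not the answerer's code, and it assumes each name has at most one row per round.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name":  ["a", "b", "c", "d", "a", "c", "c", "d", "b", "a", "b", "d"],
    "value": [5, 4, 3, 1, 2, 1, 1, 4, 3, 1, 5, 2],
    "round": [3, 3, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1],
})

# Rows must be in round order within each name before taking running means
df = df.sort_values(["name", "round"]).reset_index(drop=True)

# Expanding mean per name, shifted by one row so that each round
# only sees the rounds before it; the first round of each name is NaN
df["value_mean"] = df.groupby("name")["value"].transform(
    lambda s: s.expanding().mean().shift(1)
)
print(df)
```

Because the shift happens inside each group, no value leaks from one name into the next, so the extra same-name mask is not needed here.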

Related

Pandas: new column where value is based on a specific value within subgroup

I have a dataframe where I want to create a new column ("NewValue") that takes, for every row, the Value of the row in the same Group whose SubGroup is A.
Group SubGroup Value NewValue
0 1 A 1 1
1 1 B 2 1
2 2 A 3 3
3 2 C 4 3
4 3 B 5 NaN
5 3 C 6 NaN
Can this be achieved using a groupby / transform function?
Use Series.map with filtered DataFrame in boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
Alternative with left join in DataFrame.merge with rename column:
df1 = df.loc[df.SubGroup.eq('A'),['Group','Value']].rename(columns={'Value':'NewValue'})
df = df.merge(df1, how='left')
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
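Since the question asks specifically about groupby / transform, here is a minimal sketch of that route. It assumes each Group has at most one row with SubGroup A; the intermediate name a_only is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Group":    [1, 1, 2, 2, 3, 3],
    "SubGroup": ["A", "B", "A", "C", "B", "C"],
    "Value":    [1, 2, 3, 4, 5, 6],
})

# Keep Value only on the SubGroup A rows, NaN everywhere else
a_only = df["Value"].where(df["SubGroup"].eq("A"))

# 'first' returns the first non-null value per group and broadcasts it
# back to every row; groups without an A row stay NaN
df["NewValue"] = a_only.groupby(df["Group"]).transform("first")
print(df)
```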

How to concatenate strings within a rolling window for each group in pandas?

I have a data set like below:
cluster order label
0 1 1 a
1 1 2 b
2 1 3 c
3 1 4 c
4 1 5 b
5 2 1 b
6 2 2 b
7 2 3 c
8 2 4 a
9 2 5 a
10 2 6 b
11 2 7 c
12 2 8 c
I want to add a column that concatenates the previous 3 values of the column label within each cluster (a rolling window of 3). It seems that pandas rolling only works on numeric data. Is there a way to concatenate strings?
cluster order label roll3
0 1 1 a NaN
1 1 2 b NaN
2 1 3 c NaN
3 1 4 c abc
4 1 5 b bcc
5 2 1 b NaN
6 2 2 b NaN
7 2 3 c NaN
8 2 4 a bbc
9 2 5 a bca
10 2 6 b caa
11 2 7 c aab
12 2 8 c abc
Use groupby.apply to shift and concat the labels:
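# group_keys=False keeps the original row index, so the result
# aligns with df on assignment (needed on pandas 2.x and later)
df['roll3'] = (df.groupby('cluster', group_keys=False)['label']
                 .apply(lambda x: x.shift(3) + x.shift(2) + x.shift(1)))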
# cluster order label roll3
# 0 1 1 a NaN
# 1 1 2 b NaN
# 2 1 3 c NaN
# 3 1 4 c abc
# 4 1 5 b bcc
# 5 2 1 b NaN
# 6 2 2 b NaN
# 7 2 3 c NaN
# 8 2 4 a bbc
# 9 2 5 a bca
# 10 2 6 b caa
# 11 2 7 c aab
# 12 2 8 c abc
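The shift-and-add idea above can be generalised to any window size. The following is a rough sketch, where previous_labels is a hypothetical helper name and the window size is a parameter; it uses transform so the result keeps the original row index.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "cluster": [1] * 5 + [2] * 8,
    "order":   [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8],
    "label":   list("abccb") + list("bbcaabcc"),
})

def previous_labels(labels, window):
    """Concatenate the `window` previous labels, oldest first (hypothetical helper)."""
    out = labels.shift(window)
    for k in range(window - 1, 0, -1):
        # A missing value on either side makes the whole result missing
        out = out + labels.shift(k)
    return out

df["roll3"] = df.groupby("cluster")["label"].transform(
    lambda s: previous_labels(s, 3)
)
print(df)
```

Rows with fewer than `window` predecessors in their cluster stay missing, matching the desired output in the question.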

Average per category per last N round in Pandas and lag it

I have the following problem.
I want to compute the mean of the last 2 observations per name and round, and then lag it by one round. See the following example:
df = pd.DataFrame(data={ 'name':["a","a","a","a","b","b","c" ] , 'value':[6,5,4,3,1,2,1] ,
'round':[1,2,3,4,1,2,1 ]})
Desired output is:
df = pd.DataFrame(data={ 'name':["a","a","a","a","b","b","c" ] , 'value':[6,5,4,3,1,2,1] ,
'round':[1,2,3,4,1,2,1 ], 'mean_last_2':["NaN","NaN",5.5,4.5,"NaN","NaN","NaN"]})
I tried this, but got "AttributeError: 'float' object has no attribute 'shift'":
df['mean_last_2'] = df.groupby("name")['value'].apply(lambda x:
x.tail(2).mean().shift(1))
How can I fix it please?
You could try something like this:
df['mean_last_2'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.rolling(2).mean().shift())
Output:
name value round mean_last_2
0 a 6 1 NaN
1 a 5 2 NaN
2 a 4 3 5.5
3 a 3 4 4.5
4 b 1 1 NaN
5 b 2 2 NaN
6 c 1 1 NaN
You can do something like
df.groupby("name").apply(lambda d: d.assign(mean_last_2 = d['value'].rolling(2).mean().shift()))
to get
name value round mean_last_2
name
a 0 a 6 1 NaN
1 a 5 2 NaN
2 a 4 3 5.5
3 a 3 4 4.5
b 4 b 1 1 NaN
5 b 2 2 NaN
c 6 c 1 1 NaN
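If a flat column on the original frame is wanted rather than a frame with an extra name index level, the same rolling-then-shift logic can be sketched with transform. This is a minimal example on the question's data and assumes the rows are already sorted by round within each name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "name":  ["a", "a", "a", "a", "b", "b", "c"],
    "value": [6, 5, 4, 3, 1, 2, 1],
    "round": [1, 2, 3, 4, 1, 2, 1],
})

# rolling(2).mean() averages the current and previous row within a name;
# shift() then moves that average down one row so it lags by one round
df["mean_last_2"] = df.groupby("name")["value"].transform(
    lambda s: s.rolling(2).mean().shift()
)
print(df)
```

The original attempt failed because x.tail(2).mean() collapses each group to a single float, which has no shift method; rolling keeps one value per row.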

Assign column values from another dataframe with repeating key values

Please help me with pandas, I can't find a good solution.
I have tried map, assign, merge, join and set_index.
Maybe I am just too tired :)
df:
m_num A B
0 1 0 9
1 1 1 8
2 2 2 7
3 2 3 6
4 3 4 5
5 3 5 4
df1:
m_num C
0 2 99
1 2 88
df_final:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99
3 2 3 6 88
4 3 4 5 NaN
5 3 5 4 NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
print(df2)
Prints:
m_num A B C
0 1 0 9 NaN
1 1 1 8 NaN
2 2 2 7 99.0
3 2 3 6 88.0
4 3 4 5 NaN
5 3 5 4 NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky because 'm_num' is the only common key and it has repeating values.
So first I created a dataframe matching df and df1 here so that I can use the index as another key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
m_num A B
0 2 2 7
1 2 3 6
As you can see above, now we have the index 0 and 1 in addition to the m_num as key which we can use to match with df1.
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
This prints:
m_num A B C
0 2 2 7 99
1 2 3 6 88
Then tie the above resultant dataframe to the original df and do a left join to get the output.
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
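Another way to handle the repeating key, sketched here as an alternative to the three-step merge above, is to number the occurrences of each m_num with cumcount and merge on that pair. The helper column name occ is an assumption made for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "m_num": [1, 1, 2, 2, 3, 3],
    "A":     [0, 1, 2, 3, 4, 5],
    "B":     [9, 8, 7, 6, 5, 4],
})
df1 = pd.DataFrame({"m_num": [2, 2], "C": [99, 88]})

# Number the repeats of each key: the first m_num == 2 row gets 0, the second 1
left = df.assign(occ=df.groupby("m_num").cumcount())
right = df1.assign(occ=df1.groupby("m_num").cumcount())

# (m_num, occ) is now unique on both sides, so a plain left merge pairs
# the n-th occurrence in df with the n-th occurrence in df1
df_final = left.merge(right, on=["m_num", "occ"], how="left").drop(columns="occ")
print(df_final)
```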

pandas column operation on certain row in succession

I have a pandas DataFrame like this:
second block
0 1 a
1 2 b
2 3 c
3 4 a
4 5 c
This is sequential data, and I would like to get a new column containing the time difference between the current block and the next time it repeats.
second block freq
0 1 a 3 //(4-1)
1 2 b 0 //(not repeating)
2 3 c 2 //(5-3)
3 4 a 0 //(not repeating)
4 5 c 0 //(not repeating)
I tried getting the unique list of blocks and then running a for loop like the one below:
for i in unique_block:
df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
This is not working: I do not know how to get 0 for row indexes 1, 3 and 4. Since the dataframe is very big, a loop is also not efficient.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block')['second'].diff(-1)*-1).fillna(0)
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
second block freq
0 1 a 3.0
1 2 b 0.0
2 3 c 2.0
3 4 a 0.0
4 5 c 0.0
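Using groupby.apply with diff and shift (group_keys=False keeps the original index on pandas 2.x and later):
df.groupby('block', group_keys=False).second.apply(lambda x : x.diff().shift(-1)).fillna(0)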
Out[242]:
0 3.0
1 0.0
2 2.0
3 0.0
4 0.0
Name: second, dtype: float64
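For comparison, the same result can be sketched without any lambda by using the grouped shift directly. This is a minimal example on the question's data; next_second is an illustrative variable name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "second": [1, 2, 3, 4, 5],
    "block":  ["a", "b", "c", "a", "c"],
})

# For every row, look up the 'second' of the next row with the same block;
# the last occurrence of each block has no successor and gets NaN
next_second = df.groupby("block")["second"].shift(-1)

# Time until the block repeats, with 0 where it never repeats
df["freq"] = (next_second - df["second"]).fillna(0)
print(df)
```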