Average per category per last N round in Pandas and lag it - pandas

I have the following problem.
I want to compute the mean of the last 2 observations per name and round, and then lag it. See the following example:
df = pd.DataFrame(data={ 'name':["a","a","a","a","b","b","c" ] , 'value':[6,5,4,3,1,2,1] ,
'round':[1,2,3,4,1,2,1 ]})
Desired output is:
df = pd.DataFrame(data={ 'name':["a","a","a","a","b","b","c" ] , 'value':[6,5,4,3,1,2,1] ,
'round':[1,2,3,4,1,2,1 ], 'mean_last_2':[float("nan"),float("nan"),5.5,4.5,float("nan"),float("nan"),float("nan")]})
I tried this, but got "AttributeError: 'float' object has no attribute 'shift'":
df['mean_last_2'] = df.groupby("name")['value'].apply(lambda x:
x.tail(2).mean().shift(1))
How can I fix it please?

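x.tail(2).mean() collapses each group to a single float, and a float has no shift method. A rolling window keeps one value per row, so you could try something like this: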
df['mean_last_2'] = df.groupby('name')['value'].apply(lambda x: x.rolling(2).mean().shift())
Output:
  name  value  round  mean_last_2
0    a      6      1          NaN
1    a      5      2          NaN
2    a      4      3          5.5
3    a      3      4          4.5
4    b      1      1          NaN
5    b      2      2          NaN
6    c      1      1          NaN
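
On pandas 2.0 and later, groupby(...).apply may return a result indexed by the group key as well as the original index, in which case the column assignment above can fail. A minimal sketch of the same idea using transform, which keeps the original index, is shown here. It assumes the rows are already ordered by round within each name; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "a", "a", "a", "b", "b", "c"],
    "value": [6, 5, 4, 3, 1, 2, 1],
    "round": [1, 2, 3, 4, 1, 2, 1],
})

# Per name: mean over a window of 2 rows, then lag by one row.
# transform returns a Series aligned with df's original index.
df["mean_last_2"] = df.groupby("name")["value"].transform(
    lambda s: s.rolling(2).mean().shift()
)
print(df)
```

If the frame is not sorted, calling df.sort_values(["name", "round"]) first should give the intended window order.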

You can do something like
df.groupby("name").apply(lambda d: d.assign(mean_last_2 = d['value'].rolling(2).mean().shift()))
to get
       name  value  round  mean_last_2
name
a    0    a      6      1          NaN
     1    a      5      2          NaN
     2    a      4      3          5.5
     3    a      3      4          4.5
b    4    b      1      1          NaN
     5    b      2      2          NaN
c    6    c      1      1          NaN

Related

pandas dataframe auto fill values if have same value on specific column [duplicate]

I have the data below. Recent pandas versions no longer keep the grouping columns in the result of a grouped fillna/ffill/bfill. Is there a way to keep the grouping columns in the output?
import io
import pandas as pd

data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")
print(df)
one two three
0 1 1 10.0
1 1 1 NaN
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
print(df.groupby(['one','two']).ffill())
three
0 10.0
1 10.0
2 10.0
3 NaN
4 20.0
5 20.0
6 NaN
7 NaN
With recent pandas versions, if we want to keep the groupby columns, we need to use apply here:
out = df.groupby(['one','two']).apply(lambda x : x.ffill())
Out[219]:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Is this what you expect?
df['three']= df.groupby(['one','two'])['three'].ffill()
print(df)
# Output:
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 NaN
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
Yes. Set the grouping columns as the index and then group by them, so that they are preserved in the index of the result, as shown here:
df = pd.read_csv(io.StringIO(data), sep=";")
df.set_index(['one','two'], inplace=True)
df.groupby(['one','two']).ffill()
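
For reference, here is a self-contained sketch of the column-assignment approach that leaves the original frame untouched. It relies only on the grouped ffill returning a Series aligned with the original index; the name filled is just illustrative.

```python
import io
import pandas as pd

data = """one;two;three
1;1;10
1;1;nan
1;1;nan
1;2;nan
1;2;20
1;2;nan
1;3;nan
1;3;nan"""
df = pd.read_csv(io.StringIO(data), sep=";")

# Forward-fill 'three' within each (one, two) group and put it back
# next to the untouched grouping columns in a new frame.
filled = df.assign(three=df.groupby(["one", "two"])["three"].ffill())
print(filled)
```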

Pandas: new column where value is based on a specific value within subgroup

I have a dataframe where I want to create a new column ("NewValue") that takes the "Value" from the row in the same "Group" whose SubGroup is A.
   Group SubGroup  Value NewValue
0      1        A      1        1
1      1        B      2        1
2      2        A      3        3
3      2        C      4        3
4      3        B      5      NaN
5      3        C      6      NaN
Can this be achieved using a groupby / transform function?
Use Series.map with a DataFrame filtered by boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
Alternatively, use a left join with DataFrame.merge after renaming the column:
df1 = df.loc[df.SubGroup.eq('A'),['Group','Value']].rename(columns={'Value':'NewValue'})
df = df.merge(df1, how='left')
print (df)
Group SubGroup Value NewValue
0 1 A 1 1.0
1 1 B 2 1.0
2 2 A 3 3.0
3 2 C 4 3.0
4 3 B 5 NaN
5 3 C 6 NaN
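
Since the question asks about groupby / transform, here is one possible sketch of that route. It assumes each group has at most one row with SubGroup 'A'; the helper name value_if_a is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 2, 2, 3, 3],
    "SubGroup": ["A", "B", "A", "C", "B", "C"],
    "Value": [1, 2, 3, 4, 5, 6],
})

# Keep Value only on rows where SubGroup is 'A'; other rows become NaN.
value_if_a = df["Value"].where(df["SubGroup"].eq("A"))

# 'first' picks the first non-NaN entry per group and broadcasts it to
# every row of that group; groups without an 'A' row stay NaN.
df["NewValue"] = value_if_a.groupby(df["Group"]).transform("first")
print(df)
```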

How to replace NaN values using values from other rows that share a common column value

Using column B as the reference, how can I replace the NaN values in column A?
>>> a
  A    B
  1    1
  NaN  3
  1    1
  NaN  1
  NaN  2
  5    3
  1    1
  2    2
I want result like this.
>>> result
  A  B
  1  1
  5  3
  1  1
  1  1
  2  2
  5  3
  1  1
  2  2
I tried merging on column B but couldn't figure it out:
b=a.groupby('B').reset_index()
dfM = pd.merge(a,b,on='B', how ='left')
We need a map from values in column B to the values in A.
mapping = a.dropna().drop_duplicates().set_index("B")["A"]
It looks like this
B
1 1.0
3 5.0
2 2.0
Name: A, dtype: float64
Filling null values becomes irrelevant at this point. We can just map B to get column A
a["B"].map(mapping)
This gives you
0 1.0
1 5.0
2 1.0
3 1.0
4 2.0
5 5.0
6 1.0
7 2.0
Name: B, dtype: float64
Cast to int and use it to overwrite column A in your original dataframe if you need to.
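
Put together, the steps above might look like the following sketch. It assumes every value of B is associated with exactly one non-null value of A; otherwise the mapping index would not be unique and map would raise an error.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({
    "A": [1, np.nan, 1, np.nan, np.nan, 5, 1, 2],
    "B": [1, 3, 1, 1, 2, 3, 1, 2],
})

# One row per (A, B) pair that is fully known, indexed by B.
mapping = a.dropna().drop_duplicates().set_index("B")["A"]
assert mapping.index.is_unique  # assumption: one A value per B value

# Fill only the missing entries of A from the mapping, then cast to int.
a["A"] = a["A"].fillna(a["B"].map(mapping)).astype(int)
print(a)
```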

Any function/attribute in dataframe similar to attribute 'remove' or 'pop'

Is there any attribute/function for a dataframe, similar to the 'remove' attribute in a series, to remove the first occurrence of duplicated index labels in a dataframe?
Dataframe:
     a  b   c   d
100  1  2   3 NaN
200  4  5   6 NaN
100  7  9  10 NaN
Desired output (after the desired command):
     a  b   c   d
200  4  5   6 NaN
100  7  9  10 NaN
Try boolean indexing with Index.duplicated and keep='last':
>>> df[~df.index.duplicated(keep='last')]
a b c d
200 4 5 6 NaN
100 7 9 10 NaN
>>>
Edit:
df.iloc[np.where(df.index.duplicated(keep='last'))]
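
A self-contained sketch of the idea, assuming the frame from the question. Note that keep='last' drops every occurrence of a label except its last one, which matches "remove the first occurrence" only when a label appears at most twice. The names is_earlier_duplicate, kept and removed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7], "b": [2, 5, 9], "c": [3, 6, 10], "d": [np.nan] * 3},
    index=[100, 200, 100],
)

# True for rows whose index label occurs again later in the frame.
is_earlier_duplicate = df.index.duplicated(keep="last")

kept = df[~is_earlier_duplicate]     # frame without the earlier duplicates
removed = df[is_earlier_duplicate]   # the rows that were dropped, like a 'pop'
print(kept)
print(removed)
```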

Compute lagged means per name and round in pandas

I need to compute lagged means per group in my dataframe. This is what my df looks like:
name value round
0 a 5 3
1 b 4 3
2 c 3 2
3 d 1 2
4 a 2 1
5 c 1 1
0 c 1 3
1 d 4 3
2 b 3 2
3 a 1 2
4 b 5 1
5 d 2 1
I would like to compute lagged means for column value per name and round. That is, for name a in round 3 I need to have value_mean = 1.5 (because (1+2)/2). And of course, there will be nan values when round = 1.
I tried this:
df['value_mean'] = df.groupby('name').expanding().mean().groupby('name').shift(1)['value'].values
but it gives nonsense:
name value round value_mean
0 a 5 3 NaN
1 b 4 3 5.0
2 c 3 2 3.5
3 d 1 2 NaN
4 a 2 1 4.0
5 c 1 1 3.5
0 c 1 3 NaN
1 d 4 3 3.0
2 b 3 2 2.0
3 a 1 2 NaN
4 b 5 1 1.0
5 d 2 1 2.5
Any idea how I can do this, please? I found this, but it does not seem relevant to my problem: Calculate the mean value using two columns in pandas
You can do that as follows
# sort the values as they need to be counted
df.sort_values(['name', 'round'], inplace=True)
df.reset_index(drop=True, inplace=True)
# create a grouper to calculate the running count
# and running sum as the basis of the average
grouper= df.groupby('name')
ser_sum= grouper['value'].cumsum()
ser_count= grouper['value'].cumcount()+1
ser_mean= ser_sum.div(ser_count)
ser_same_name= df['name'] == df['name'].shift(1)
# finally you just have to set the first entry
# in each name-group to NaN (this usually would
# set the entries for each name and round=1 to NaN)
# (where() fills non-matching rows with NaN by default;
# the np.NaN alias was removed in NumPy 2.0)
df['value_mean']= ser_mean.shift(1).where(ser_same_name)
# if you want to see the intermediate products,
# you can uncomment the following lines
#df['sum']= ser_sum
#df['count']= ser_count
df
Output:
name value round value_mean
0 a 2 1 NaN
1 a 1 2 2.0
2 a 5 3 1.5
3 b 5 1 NaN
4 b 3 2 5.0
5 b 4 3 4.0
6 c 1 1 NaN
7 c 3 2 1.0
8 c 1 3 2.0
9 d 2 1 NaN
10 d 1 2 2.0
11 d 4 3 1.5
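
A more compact variant of the same computation, sketched with an expanding window inside transform. It assumes one row per name and round and rebuilds the question's data with a fresh index.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["a", "b", "c", "d", "a", "c", "c", "d", "b", "a", "b", "d"],
    "value": [5, 4, 3, 1, 2, 1, 1, 4, 3, 1, 5, 2],
    "round": [3, 3, 2, 2, 1, 1, 3, 3, 2, 2, 1, 1],
})

# Order rounds chronologically within each name before taking running means.
df = df.sort_values(["name", "round"]).reset_index(drop=True)

# Running mean of all rounds so far, lagged by one round within each name,
# so the first round of every name is NaN.
df["value_mean"] = df.groupby("name")["value"].transform(
    lambda s: s.expanding().mean().shift()
)
print(df)
```

Because transform aligns on the index, this avoids the misalignment that arises when values from a group-ordered result are assigned positionally to an unsorted frame.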