Pandas: DataFrame Rolling Average on a Row

I have a row of values in a DataFrame and want to calculate the 3-period rolling average, stored as a new row.
existing_row 1 2 3 4 5 6 7 8 9
create_new_row 2 3 4 5 6 7 8

Use DataFrame.rolling with axis=1 and mean:
print (df)
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
df1 = df.rolling(3, axis=1).mean()
print (df1)
0 1 2 3 4 5 6 7 8
0 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
To join the result to the original, pass both to concat:
df = pd.concat([df, df1], ignore_index=True)
print (df)
0 1 2 3 4 5 6 7 8
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0

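Use rolling(...).mean() and append the result as a new row with pd.concat (DataFrame.append was removed in pandas 2.0):
out = pd.concat([df, df.rolling(3, axis=1).mean()], ignore_index=True)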
print(out)
# Output
A B C D E F G H I
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Setup:
df = pd.DataFrame({'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'D': {0: 4}, 'E': {0: 5},
'F': {0: 6}, 'G': {0: 7}, 'H': {0: 8}, 'I': {0: 9}})
print(df)
# Output
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9
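On recent pandas releases the axis argument of rolling is deprecated (since 2.1, as far as I know), so a version-tolerant variant is to transpose, roll down the rows, and transpose back. The sketch below assumes the single-row frame from the question; center=True is an optional extra, not part of the answers above, that places each average under the middle of its window, as in the question's example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

# Transpose so the row becomes a column, roll over it, then transpose back.
rolled = df.T.rolling(3).mean().T

# Optional: centre the window so each mean sits under the middle value.
centered = df.T.rolling(3, center=True).mean().T

# Stack the original row and the rolling means into one frame.
out = pd.concat([df, rolled], ignore_index=True)
print(out)
print(centered)
```

Dropping the NaN columns with rolled.dropna(axis=1) leaves only the seven averages, if the padding is not wanted.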

Related

Pandas Shift: Looking for better alternative

import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 0, 0], [4, 5, 0], [7, 7, 7], [7, 4, 5], [4, 5, 0], [7, 8, 9], [3, 2, 9], [9, 3, 6], [6, 8, 5]]),
columns=['a', 'b', 'c'],
index = ['1/1/2000', '1/1/2001', '1/1/2002', '1/1/2003', '1/1/2004', '1/1/2005', '1/1/2006', '1/1/2007', '1/1/2008'])
df['a_1'] = df['a'].shift(1)
df['a_3'] = df['a'].shift(3)
df['a_5'] = df['a'].shift(5)
df['a_7'] = df['a'].shift(7)
Above is a dummy example of how I am shifting.
Issues:
1. Each shift period needs its own line. Can it be done in one go?
2. The df above is small, but on a massive dataframe this operation is slow. Other questions I checked mostly attribute this to shift not being Cython-optimized. Is there a faster way (apart from numba, which a few answers mention)?
Build the shifted columns in a list comprehension and concat them to the original in one go:
nums = [1, 3, 5, 7]
pd.concat([df] + [df['a'].shift(i).to_frame(f'a_{i}') for i in nums], axis=1)
Result:
a b c a_1 a_3 a_5 a_7
1/1/2000 1 0 0 NaN NaN NaN NaN
1/1/2001 4 5 0 1.0 NaN NaN NaN
1/1/2002 7 7 7 4.0 NaN NaN NaN
1/1/2003 7 4 5 7.0 1.0 NaN NaN
1/1/2004 4 5 0 7.0 4.0 NaN NaN
1/1/2005 7 8 9 4.0 7.0 1.0 NaN
1/1/2006 3 2 9 7.0 7.0 4.0 NaN
1/1/2007 9 3 6 3.0 4.0 7.0 1.0
1/1/2008 6 8 5 9.0 7.0 7.0 4.0
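The one-liner above can also be wrapped in a small reusable helper. The sketch below is only an illustration: add_lags is a hypothetical name, not a pandas function, and it builds all lagged columns in a dict and attaches them with a single assign call. Whether this is faster on a large frame is something to measure rather than assume.

```python
import numpy as np
import pandas as pd


def add_lags(frame, column, lags):
    """Return a copy of frame with one shifted copy of column per lag (hypothetical helper)."""
    shifted = {f"{column}_{lag}": frame[column].shift(lag) for lag in lags}
    return frame.assign(**shifted)


df = pd.DataFrame(
    np.array([[1, 0, 0], [4, 5, 0], [7, 7, 7], [7, 4, 5], [4, 5, 0],
              [7, 8, 9], [3, 2, 9], [9, 3, 6], [6, 8, 5]]),
    columns=['a', 'b', 'c'],
    index=['1/1/2000', '1/1/2001', '1/1/2002', '1/1/2003', '1/1/2004',
           '1/1/2005', '1/1/2006', '1/1/2007', '1/1/2008'],
)

result = add_lags(df, 'a', [1, 3, 5, 7])
print(result)
```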

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
For example, the difference 1 is the result of subtracting 20 from 21 (first "c" value - last "b" value).
I'm open to numPy transformations as well.
Aggregate with GroupBy.agg using GroupBy.first and GroupBy.last, then subtract the shifted last column from the first column and drop the first row by position:
df = df.reset_index()
df1 = df.groupby('A',as_index=False, sort=False).agg(first=('X', 'first'),
last=('X','last'),
Time=('Time','first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time','Difference']].iloc[1:].reset_index(drop=True)
print (df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
.pivot(index='A', columns='col', values='X')
.ffill(axis=1)
.assign(Time=g['Time'].first(),
Difference=lambda d: d[0]-d[1].shift())
[['Time', 'Difference']].iloc[1:]
.rename_axis(index=None, columns=None)
)
output:
Time Difference
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time Difference
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution, which relies on each new group starting at an even Time value, as in this example:
(df.assign(Y = df['X'].shift())
.iloc[df.index % 2 == 0]
.assign(Difference = lambda z: z['X'] - z['Y'])
.reset_index()
.loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0
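Since the question is also open to simpler vectorized transformations, here is a sketch that avoids groupby altogether. It assumes, as in the example data, that rows of the same group are contiguous: a group starts wherever A differs from the previous row, and the wanted difference is X minus the previous row's X at exactly those rows. The names starts and step are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6],
                   'A': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')

# True on the first row of every run of identical A values.
starts = df['A'].ne(df['A'].shift())

# Difference between each X and the X of the row just before it.
step = df['X'].diff()

# Keep the group-start rows, skipping the very first row (it has no previous group).
out = step[starts].iloc[1:].rename('Difference').reset_index()
print(out)
```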

groupby transform with if condition in pandas

I have a data frame as given below
df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'c', 'c'] , 'val' : [10, np.nan, 9 , 10, 11, 13]})
df
key val
0 a 10.0
1 a NaN
2 a 9.0
3 b 10.0
4 c 11.0
5 c 13.0
I want to perform a groupby and transform so that the new column is each value divided by its group mean, which I can do as below:
df['new'] = df.groupby('key')['val'].transform(lambda g : g/g.mean())
df.new
0 1.052632
1 NaN
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
Now I have a condition: if val is np.nan, then the new column value should be np.inf, giving the result below:
0 1.052632
1 np.inf
2 0.947368
3 1.000000
4 0.916667
5 1.083333
Name: new, dtype: float64
In other words, how can I check whether val is np.nan within groupby and transform?
Thanks in advance
Add Series.replace:
df['new'] = (df.groupby('key')['val'].transform(lambda g : g/g.mean())
.replace(np.nan, np.inf))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333
Or numpy.where:
df['new'] = np.where(df.val.isna(),
np.inf, df.groupby('key')['val'].transform(lambda g : g/g.mean()))
print (df)
key val new
0 a 10.0 1.052632
1 a NaN inf
2 a 9.0 0.947368
3 b 10.0 1.000000
4 c 11.0 0.916667
5 c 13.0 1.083333
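The same result can be sketched without a Python lambda by dividing by a broadcast group mean and then masking the rows where val is missing. This is an alternative formulation, not one of the answers above; group_mean and ratio are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'val': [10, np.nan, 9, 10, 11, 13]})

# Mean of each group, repeated on every row of that group (NaN is skipped).
group_mean = df.groupby('key')['val'].transform('mean')

# Plain element-wise division; rows with a missing val stay NaN here.
ratio = df['val'] / group_mean

# Keep the ratio where val is present, otherwise use np.inf.
df['new'] = ratio.where(df['val'].notna(), np.inf)
print(df)
```

Unlike a blanket replace of NaN, the mask is driven by val itself, so only rows whose input was missing are set to inf.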

Fill missing values in each group with the previous value

I have three columns of data: the first is a category id, and the second and third have some missing values. I want to group by the id in the first column and, within each group, fill the missing values in the third column using the 'ffill' method.
I found a good idea here: "Pandas: filling missing values by weighted average in each group!", but it didn't solve my problem because the output it produced was not what I wanted.
Run the following code to build the example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'name': ['A','A', 'B','B','B','B', 'C','C','C'],'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'sss':[1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN NaN
2 B NaN 3.0
3 B 2.0 NaN
4 B 3.0 NaN
5 B 1.0 NaN
6 C 3.0 2.0
7 C NaN NaN
8 C 3.0 NaN
To fill the missing values with the previous value within each group, I ran the following code, but it outputs strange results:
df["sss"] = df.groupby("name").transform(lambda x: x.fillna(axis = 0,method = 'ffill'))
df
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN NaN
3 B 2.0 2.0
4 B 3.0 3.0
5 B 1.0 1.0
6 C 3.0 3.0
7 C NaN 3.0
8 C 3.0 3.0
The result I want is this:
Out[13]:
name value sss
0 A 1.0 1.0
1 A NaN 1.0
2 B NaN 3.0
3 B 2.0 3.0
4 B 3.0 3.0
5 B 1.0 3.0
6 C 3.0 2.0
7 C NaN 2.0
8 C 3.0 2.0
Can someone point out where I am wrong?
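A likely cause, judging from the output: transform is applied to every non-key column, so it returns both value and sss, and the forward-filled value column is what ends up in sss (newer pandas versions may instead raise an error on that assignment). A minimal sketch of a fix, assuming only sss should be filled, is to select that column before forward-filling:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'sss': [1, np.nan, 3, np.nan, np.nan, np.nan, 2, np.nan, np.nan]})

# Select the single column first, then forward-fill within each group.
# The result keeps the original index, so it assigns back row for row.
df['sss'] = df.groupby('name')['sss'].ffill()
print(df)
```

The value column is left untouched, which matches the desired output in the question.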

Adding new column to pandas dataframe after groupby and rolling on a column

I am trying to add a new column to a pandas dataframe after a groupby and rolling average, but the newly generated column changes order after reset_index().
Original dataframe:
Name Values
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 C 3
6 A 2
7 A 6
8 B 8
9 B 3
10 D 0
After groupby and rolling, it looks something like:
Name
A 0 NaN
1 NaN
2 2.000000
6 2.333333
7 3.666667
B 3 NaN
4 NaN
8 3.666667
9 4.333333
C 5 NaN
D 10 NaN
Name: Values, dtype: float64
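Can someone help me add this result as a new column in the original dataframe? When I try reset_index(), the order changes to the groupby order.
Use transform to compute the rolling mean within each group; the result keeps the original index, so it can be assigned directly (with apply, recent pandas versions prepend the group key to the index unless group_keys=False is passed):
df['rolling_mean'] = df.groupby('Name')['Values'].transform(lambda x: x.rolling(3).mean())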
df
Name Values rolling_mean
0 A 1 NaN
1 A 2 NaN
2 A 3 2.000000
3 B 1 NaN
4 B 2 NaN
5 C 3 NaN
6 A 2 2.333333
7 A 6 3.666667
8 B 8 3.666667
9 B 3 4.333333
10 D 0 NaN
Alternatively, merge the grouped result (df2 below, indexed by group and original row) back on the original index:
df = pd.DataFrame({'Name': {0: 'A',
1: 'A',
2: 'A',
3: 'B',
4: 'B',
5: 'C',
6: 'A',
7: 'A',
8: 'B',
9: 'B',
10: 'D'},
'Values': {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 2, 7: 6, 8: 8, 9: 3, 10: 0}})
df2 = pd.DataFrame({2: {('A', 0): np.nan,
('A', 1): np.nan,
('A', 2): 2.0,
('A', 6): 2.333333,
('A', 7): 3.666667,
('B', 3): np.nan,
('B', 4): np.nan,
('B', 8): 3.666667,
('B', 9): 4.3333330000000005,
('C', 5): np.nan,
('D', 10): np.nan}})
df.merge(df2.reset_index(level=0), left_index=True, right_index=True)
Name Values 0 2
0 A 1 A NaN
1 A 2 A NaN
2 A 3 A 2.000000
3 B 1 B NaN
4 B 2 B NaN
5 C 3 C NaN
6 A 2 A 2.333333
7 A 6 A 3.666667
8 B 8 B 3.666667
9 B 3 B 4.333333
10 D 0 D NaN
or join:
df.join(df2.reset_index(level=0))
Name Values 0 2
0 A 1 A NaN
1 A 2 A NaN
2 A 3 A 2.000000
3 B 1 B NaN
4 B 2 B NaN
5 C 3 C NaN
6 A 2 A 2.333333
7 A 6 A 3.666667
8 B 8 B 3.666667
9 B 3 B 4.333333
10 D 0 D NaN
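The merge and join answers start from an already computed result. A shorter route, sketched here under the assumption that the original frame has a unique index, is to run the grouped rolling mean directly and drop the group level so the values align with the original rows on assignment. The name rolled is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'B', 'C', 'A', 'A', 'B', 'B', 'D'],
                   'Values': [1, 2, 3, 1, 2, 3, 2, 6, 8, 3, 0]})

# Grouped rolling mean: the result is indexed by (Name, original row label).
rolled = df.groupby('Name')['Values'].rolling(3).mean()

# Drop the Name level; assignment then aligns on the original row labels,
# so the groupby ordering of the result does not matter.
df['rolling_mean'] = rolled.reset_index(level=0, drop=True)
print(df)
```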