Adding new column to pandas dataframe after groupby and rolling on a column - pandas

I am trying to add a new column to pandas dataframe after groupby and rolling average but the newly generated column changes order after reset_index()
original dataframe
Name Values
0 A 1
1 A 2
2 A 3
3 B 1
4 B 2
5 C 3
6 A 2
7 A 6
8 B 8
9 B 3
10 D 0
after groupby and rolling it looks something like:
Name
A 0 NaN
1 NaN
2 2.000000
6 2.333333
7 3.666667
B 3 NaN
4 NaN
8 3.666667
9 4.333333
C 5 NaN
D 10 NaN
Name: Values, dtype: float64
Now can someone help me to add this result in new column in the original dataframe? Because when I try to reset_index(), the order changes to the groupby order.

Use apply to apply rolling mean on each group,
df['rolling_mean'] = df.groupby('Name').Values.apply(lambda x: x.rolling(3).mean())
df
Name Values rolling_mean
0 A 1 NaN
1 A 2 NaN
2 A 3 2.000000
3 B 1 NaN
4 B 2 NaN
5 C 3 NaN
6 A 2 2.333333
7 A 6 3.666667
8 B 8 3.666667
9 B 3 4.333333
10 D 0 NaN

Here is an example:
df = pd.DataFrame({'Name': {0: 'A',
1: 'A',
2: 'A',
3: 'B',
4: 'B',
5: 'C',
6: 'A',
7: 'A',
8: 'B',
9: 'B',
10: 'D'},
'Values': {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 2, 7: 6, 8: 8, 9: 3, 10: 0}})
df2 = pd.DataFrame({2: {('A', 0): np.nan,
('A', 1): np.nan,
('A', 2): 2.0,
('A', 6): 2.333333,
('A', 7): 3.666667,
('B', 3): np.nan,
('B', 4): np.nan,
('B', 8): 3.666667,
('B', 9): 4.3333330000000005,
('C', 5): np.nan,
('D', 10): np.nan}})
df.merge(df2.reset_index(level=0), left_index=True, right_index=True)
Name Values 0 2
0 A 1 A NaN
1 A 2 A NaN
2 A 3 A 2.000000
3 B 1 B NaN
4 B 2 B NaN
5 C 3 C NaN
6 A 2 A 2.333333
7 A 6 A 3.666667
8 B 8 B 3.666667
9 B 3 B 4.333333
10 D 0 D NaN
or join:
df.join(df2.reset_index(level=0))
Name Values 0 2
0 A 1 A NaN
1 A 2 A NaN
2 A 3 A 2.000000
3 B 1 B NaN
4 B 2 B NaN
5 C 3 C NaN
6 A 2 A 2.333333
7 A 6 A 3.666667
8 B 8 B 3.666667
9 B 3 B 4.333333
10 D 0 D NaN

Related

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The first difference 1 is a result of subtracting 21 from 20: (first "c" value - last "b" value).
I'm open to numPy transformations as well.
Aggregate by GroupBy.agg with GroupBy.first,
GroupBy.last and then subtract shifted values for last column with omit first row by positions:
df = df.reset_index()
df1 = df.groupby('A',as_index=False, sort=False).agg(first=('X', 'first'),
last=('X','last'),
Time=('Time','first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time','Difference']].iloc[1:].reset_index(drop=True)
print (df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
.pivot('A', 'col', 'X')
.ffill(axis=1)
.assign(Time=g['Time'].first(),
diff=lambda d: d[0]-d[1].shift())
[['Time', 'diff']].iloc[1:]
.rename_axis(index=None, columns=None)
)
output:
Time Difference
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time Difference
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y = df['X'].shift())
.iloc[df.index % 2 == 0]
.assign(Difference = lambda z: z['X'] - z['Y'])
.reset_index()
.loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0

Pandas: DataFrame Rolling Average on a Row

I have a Row of values in a dataframe and want to calculate the rolling average (3 Period) by creating a new row.
existing_row 1 2 3 4 5 6 7 8 9
create_new_row 2 3 4 5 6 7 8
Use DataFrame.rolling with axis=1 and mean:
print (df)
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
df1 = df.rolling(3, axis=1).mean()
print (df1)
0 1 2 3 4 5 6 7 8
0 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
If need join to original pass to concat:
df = pd.concat([df, df1], ignore_index=True)
print (df)
0 1 2 3 4 5 6 7 8
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Use rolling_mean:
out = df.append(df.rolling(3, axis=1).mean(), ignore_index=True)
print(out)
# Output
A B C D E F G H I
0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
1 NaN NaN 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Setup:
df = pd.DataFrame({'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'D': {0: 4}, 'E': {0: 5},
'F': {0: 6}, 'G': {0: 7}, 'H': {0: 8}, 'I': {0: 9}})
print(df)
# Output
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9

Drop a column based on the existence of another column

I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem :
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop NaN values using pd.dropna(). The result I get is this DataFrame :
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated to the "Y" column that just got dropped. I would like to use a condition that basically says :
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined to if, but it doesn't seem to work. Any ideas ?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
Identify the columns having NaN values in any row
Group the identified columns using the numeric identifier and transform using any
Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
m = m.groupby(m.index.str.extract(r'(\d+)_')[0]).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Slightly modified example to be closer to actual DataFrame:
df = pd.DataFrame({
'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
'Y_V2_C': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
set_index on any columns to be "saved"
Extract the numbers from the columns and create a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0 1 2 1 2 # Numbers Extracted From Columns
X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Check where There are groups with all NaN columns with DataFrame.isna all on axis=0 (columns) then any relative to level=0 (the number that was extracted)
col_mask = ~df.isna().all(axis=0).any(level=0)
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
4.filter the DataFrame with the mask using loc then droplevel on the added number level
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).any(level=0)
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
drop nas
df.dropna(axis=1, inplace=True)
compute suffixes and columns with both suffixes
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
filter columns
df[cols]
full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df2.columns]
df[[c for c in df2.columns if suffixes.count(c[2:]) == 2]]

groupby and count on multiple columns of dataframe

I have a df
df = pd.DataFrame([
[1, 1, 'A', 10],
[4, 1 ,'A', 6],
[7, 2 ,'A', 3],
[2, 2 ,'A', 4],
[6, 2 ,'B', 9],
[5, 2 ,'B', 7],
[5, 1 ,'B', 12],
[5, 1 ,'B', 4],
[5, 2 ,'C', 9],
[5, 1 ,'C', 3],
[5, 1 ,'C', 4],
[5, 2 ,'C', 7]
],
index=['A', 'A', 'A','A','A','A','A','A','A','A','A','A'],
columns=['A', 'B', 'C', 'D'])
I can count the number of non zero values for column D grouped by column A using:
df['countTrans'] = df['D'].ne(0).groupby(df['A']).transform('sum')
where the output is:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 7.0
A 5 1 B 12 7.0
A 5 1 B 4 7.0
A 5 2 C 9 7.0
A 5 1 C 3 7.0
A 5 1 C 4 7.0
A 5 2 C 7 7.0
however I would like to also group by not only by column A but also column B.
I have tried variants of:
df['countTrans'] = df['D'].ne(0).groupby(df['A'], df['B']).transform('sum')
df['countTrans'] = df['D'].ne(0).groupby(df['A','B']).transform('sum')
without success
my desired output would look like:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 3.0
A 5 1 B 12 4.0
A 5 1 B 4 4.0
A 5 2 C 9 3.0
A 5 1 C 3 4.0
A 5 1 C 4 4.0
A 5 2 C 7 3.0
Possible solution is pass Series to list:
df['countTrans'] = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
print (df)
A B C D countTrans
A 1 1 A 10 1
A 4 1 A 6 1
A 7 2 A 3 1
A 2 2 A 4 1
A 6 2 B 9 1
A 5 2 B 7 3
A 5 1 B 12 4
A 5 1 B 4 4
A 5 2 C 9 3
A 5 1 C 3 4
A 5 1 C 4 4
A 5 2 C 7 3
Or create helper column by DataFrame.assign (more 'clean' in my opinion):
df['countTrans'] = df.assign(E = df['D'].ne(0)).groupby(['A','B'])['E'].transform('sum')
#similar solution with overwrite D
#df['countTrans'] = df.assign(D = df['D'].ne(0)).groupby(['A','B'])['D'].transform('sum')

Pandas - groupby one column and get mean of all other columns

I have a dataframe, with columns:
cols = ['A', 'B', 'C']
If I groupby one column, say, 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to groupby one column and then get the mean of all other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?
Please try:
df.groupby('A').agg('mean')
sample data
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
You can use df.groupby('col').mean(). For example to calcualte mean for columns 'A', 'B' and 'C':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['A', 'B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
Perhaps the missing column is string rather than numeric?
df = pd.DataFrame({
'A': ['big', 'small','small', 'small'],
'B': [1,0,0,0],
'C': [1,1,1,0],
'D': ['1','0','0','0']
})
df.groupby(['A']).mean()
Output:
A
B
C
big
1.0
1.0
small
0.0
0.6666666666666666
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
A
B
C
D
big
1.0
1.0
1.0
small
0.0
0.6666666666666666
0.0