Pandas - groupby one column and get mean of all other columns

I have a dataframe with the columns:
cols = ['A', 'B', 'C']
If I groupby one column, say, 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to groupby one column and then get the mean of all other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?

Please try:
df.groupby('A').agg('mean')
Sample data:
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
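For reference, a self-contained sketch of the above; the DataFrame construction is assumed from the sample data shown:
import pandas as pd

df = pd.DataFrame({'B': [1, 2, 4, 6, 2, 7, 8, 9],
                   'C': [4, 6, 7, 3, 1, 3, 9, 3],
                   'A': ['K', 'S', 'K', 'K', 'S', 'K', 'K', 'K']})

# mean of every remaining column for each value of 'A'
print(df.groupby('A').agg('mean'))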

You can use df.groupby('col').mean(). For example, to calculate the mean for columns 'A', 'B' and 'C':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['A', 'B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need the mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
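Note (an assumption about newer pandas versions, roughly 2.0 onwards): non-numeric columns are no longer dropped silently, so if the frame also holds string columns you may need numeric_only:
df.groupby('A').mean(numeric_only=True)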

Perhaps the missing column is a string rather than a numeric type?
import pandas as pd

df = pd.DataFrame({
    'A': ['big', 'small', 'small', 'small'],
    'B': [1, 0, 0, 0],
    'C': [1, 1, 1, 0],
    'D': ['1', '0', '0', '0']
})
df.groupby(['A']).mean()
Output:
         B         C
A
big    1.0  1.000000
small  0.0  0.666667
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
         B         C    D
A
big    1.0  1.000000  1.0
small  0.0  0.666667  0.0
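If 'D' might also contain values that are not clean integers, a hedged alternative sketch is pd.to_numeric with errors='coerce', which converts what it can and leaves NaN for the rest:
df['D'] = pd.to_numeric(df['D'], errors='coerce')
df.groupby(['A']).mean()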

Related

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The difference 1, for example, is the result of subtracting 20 from 21 (first "c" value minus last "b" value).
I'm open to NumPy transformations as well.
Aggregate with GroupBy.agg using GroupBy.first and GroupBy.last, then subtract the shifted values of the last column and drop the first row by position:
df = df.reset_index()
df1 = df.groupby('A', as_index=False, sort=False).agg(first=('X', 'first'),
                                                       last=('X', 'last'),
                                                       Time=('Time', 'first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time', 'Difference']].iloc[1:].reset_index(drop=True)
print(df1)
Time Difference
0 2 7
1 4 1
2 6 4
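A compact equivalent sketch of the same idea with plain GroupBy methods (not from the original answer; the DataFrame construction is the one given in the question):
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6],
                   'A': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')

g = df.reset_index().groupby('A', sort=False)
# first X of each group minus the last X of the previous group
out = pd.DataFrame({'Time': g['Time'].first(),
                    'Difference': g['X'].first() - g['X'].last().shift()})
print(out.dropna().reset_index(drop=True))
   Time  Difference
0     2         7.0
1     4         1.0
2     6         4.0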
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
   .pivot(index='A', columns='col', values='X')
   .ffill(axis=1)
   .assign(Time=g['Time'].first(),
           Difference=lambda d: d[0] - d[1].shift())
   [['Time', 'Difference']].iloc[1:]
   .rename_axis(index=None, columns=None)
)
Output:
Time Difference
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time Difference
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y=df['X'].shift())
   .iloc[df.index % 2 == 0]
   .assign(Difference=lambda z: z['X'] - z['Y'])
   .reset_index()
   .loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0

Pandas Merge(): Appending data from merged columns and replacing null values (extension of question https://stackoverflow.com/questions/68471939)

I'd like to merge two tables while replacing the null values in one column of the first table with the non-null values from the same-labelled column of the other table.
The code below is an example of the tables to be merged:
import numpy as np
import pandas as pd

# Table 1 (has rows with missing values)
a=['x','x','x','y','y','y']
b=['z', 'z', 'z' ,'w', 'w' ,'w' ]
c=[1 for x in a]
d=[2 for x in a]
e=[3 for x in a]
f=[4 for x in a]
g=[1,1,1,np.nan, np.nan, np.nan]
table_1=pd.DataFrame({'a':a, 'b':b, 'c':c, 'd':d, 'e':e, 'f':f, 'g':g})
table_1
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 NaN
4 y w 1 2 3 4 NaN
5 y w 1 2 3 4 NaN
# Table 2 (to be merged into table_1; its 'g' values should replace the NaN values in table_1's 'g' column, while keeping the existing non-null values)
a=['y', 'y', 'y']
b=['w', 'w', 'w']
g=[2,2,2]
table_2=pd.DataFrame({'a':a, 'b':b, 'g':g})
table_2
a b g
0 y w 2
1 y w 2
2 y w 2
This is the code I use for merging the two tables, and the output I get:
merged_table=pd.merge(table_1, table_2, on=['a', 'b'], how='left')
merged_table
Current output:
a b c d e f g_x g_y
0 x z 1 2 3 4 1.0 NaN
1 x z 1 2 3 4 1.0 NaN
2 x z 1 2 3 4 1.0 NaN
3 y w 1 2 3 4 NaN 2.0
4 y w 1 2 3 4 NaN 2.0
5 y w 1 2 3 4 NaN 2.0
6 y w 1 2 3 4 NaN 2.0
7 y w 1 2 3 4 NaN 2.0
8 y w 1 2 3 4 NaN 2.0
9 y w 1 2 3 4 NaN 2.0
10 y w 1 2 3 4 NaN 2.0
11 y w 1 2 3 4 NaN 2.0
Desired output:
a b c d e f g
0 x z 1 2 3 4 1.0
1 x z 1 2 3 4 1.0
2 x z 1 2 3 4 1.0
3 y w 1 2 3 4 2.0
4 y w 1 2 3 4 2.0
5 y w 1 2 3 4 2.0
There are a couple of problems to solve:
The 'g' column type in tables 1 and 2: it should be float, so use DataFrame.astype({'column_name': 'type'}) on both tables.
Indexes: you can assign the data by index because the other columns of table_1 already hold the same data ('y w 1 2 3 4'). So filter the NaN rows of table_1's 'g' column, ind = table_1[pd.isnull(table_1['g'])], and build a new Series from table_2['g'] with those indexes: pd.Series(table_2['g'].to_list(), index=ind.index).
Try this solution:
table_1=table_1.astype({'a':'str','b':'str','g':'float'})
table_2=table_2.astype({'a':'str','b':'str','g':'float'})
ind=table_1[pd.isnull(table_1['g'])]
table_1.loc[ind.index,'g']=pd.Series(table_2['g'].to_list(),index=ind.index)
This fills the missing 'g' values with 2.0 and gives the desired output shown above.
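Another possible sketch (not part of the original answer): merge against a de-duplicated table_2 so rows are not multiplied, then coalesce the two 'g' columns; the '_new' suffix here is just an assumed label:
lookup = table_2.drop_duplicates(subset=['a', 'b'])
merged = table_1.merge(lookup, on=['a', 'b'], how='left', suffixes=('', '_new'))
merged['g'] = merged['g'].fillna(merged['g_new'])   # keep existing values, fill NaN from table_2
merged = merged.drop(columns='g_new')
print(merged)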

Sum columns in pandas having string and number

I need to sum column a and column b, which contain strings in the first row:
>>> df
a b
0 c d
1 1 2
2 3 4
>>> df['sum'] = df.sum(1)
>>> df
a b sum
0 c d cd
1 1 2 3
2 3 4 7
I only need to add numeric values and get an output like
>>> df
a b sum
0 c d "dummyString/NaN"
1 1 2 3
2 3 4 7
If you need to add only some columns:
df['sum']=df['a']+df['b']
Solution if the data is mixed (numeric with strings):
I think the simplest is to convert non-numeric values to NaN with to_numeric after summing:
df['sum'] = pd.to_numeric(df[['a','b']].sum(1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
EDIT:
Solution if the numbers are string representations - first convert to numeric and then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
                           .sum(axis=1, min_count=1))
print (df)
a b sum
0 c d NaN
1 1 2 3.0
2 3 4 7.0
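A self-contained sketch of the coerced sum; the frame construction below is assumed from the printout in the question:
import pandas as pd

df = pd.DataFrame({'a': ['c', 1, 3], 'b': ['d', 2, 4]})
df['sum'] = pd.to_numeric(df['a'] + df['b'], errors='coerce')
print(df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0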

Complete an incomplete dataframe in pandas

Good morning.
I have a dataframe that can look either like this:
df1 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
or like this:
df2 =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
The only difference between the two is that sometimes one, or several but not all, of the zones have data for the highest time period (the date column). My desired result is to complete the dataframe up to a given period (3 in the example), as follows in each case:
df1_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 2000 4
7 B 3 6809 20
8 C 3 288 5
df2_result =
zone date p1 p2
0 A 1 154 2
1 B 1 2647 7
2 C 1 0 0
3 A 2 1280 3
4 B 2 6809 20
5 C 2 288 5
6 A 3 1280 3
7 B 3 6809 20
8 C 3 288 5
I've tried different combinations of pivot and fillna with different methods, but I can't achieve the previous result.
I hope my explanation was understood.
Many thanks in advance.
You can use reindex to create entries for all dates in the range, and then forward fill the last known value into the new rows.
import pandas as pd
df1 = pd.DataFrame([['A', 1, 154, 2],
                    ['B', 1, 2647, 7],
                    ['C', 1, 0, 0],
                    ['A', 2, 1280, 3],
                    ['B', 2, 6809, 20],
                    ['C', 2, 288, 5],
                    ['A', 3, 2000, 4]],
                   columns=['zone', 'date', 'p1', 'p2'])
result = df1.groupby("zone").apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
print(result)
To get
zone p1 p2
zone date
A 1 A 154 2
2 A 1280 3
3 A 2000 4
B 1 B 2647 7
2 B 6809 20
3 B 6809 20
C 1 C 0 0
2 C 288 5
3 C 288 5
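If the flat shape from the question is wanted, the duplicated grouping column can be dropped and the index reset; a small follow-up sketch of the same result:
result = (df1.groupby("zone")
             .apply(lambda x: x.set_index("date").reindex(range(1, 4), method='ffill'))
             .drop(columns="zone")
             .reset_index()
             .sort_values(['date', 'zone'], ignore_index=True))
print(result)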
IIUC, you can reconstruct a pd.MultiIndex from your original df and use fillna with the max of each zone subgroup.
First, build the index:
import numpy as np

ind = df1.set_index(['zone', 'date']).index
levels = ind.levels
n = len(levels[0])
codes = [np.tile(np.arange(n), n), np.repeat(np.arange(0, n), n)]
Then, use the pd.MultiIndex constructor to reindex:
df1.set_index(['zone', 'date'])\
   .reindex(pd.MultiIndex(levels=levels, codes=codes, names=['zone', 'date']))\
   .fillna(df1.groupby(['zone']).max())
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 2000.0 4.0
B 3 6809.0 20.0
C 3 288.0 5.0
To fill df2, just replace df1 with df2 in the last block of code, and you get
p1 p2
zone date
A 1 154.0 2.0
B 1 2647.0 7.0
C 1 0.0 0.0
A 2 1280.0 3.0
B 2 6809.0 20.0
C 2 288.0 5.0
A 3 1280.0 3.0
B 3 6809.0 20.0
C 3 288.0 5.0
I suggest not copying and running this code directly, but rather understanding the process and making slight changes where needed, depending on how your original data frame differs from the one you posted.
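Another possible sketch (not from the original answers): build the full zone x date grid up front with MultiIndex.from_product, reindex onto it, and forward fill within each zone; it assumes the dates should run from 1 to 3:
import pandas as pd

full = pd.MultiIndex.from_product([sorted(df1['zone'].unique()), range(1, 4)],
                                  names=['zone', 'date'])
out = (df1.set_index(['zone', 'date'])
          .reindex(full)
          .groupby(level='zone')
          .ffill()
          .reset_index()
          .sort_values(['date', 'zone'], ignore_index=True))
print(out)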

pandas: How to compare value of column with next value

I have a dataframe which looks as follows:
colA colB
0 A 10
1 B 20
2 C 5
3 D 2
4 F 30
....
I would like to compare the colB values to detect two successive decrements. That is, I want to report the index values where there are two successive decrements of colB. For example, I want to report 'B' because the two rows following B each have a decremented colB value. I am not sure how to approach this without writing a loop. (If there is no way to avoid a loop, I'd like to know.)
Thanks
You can use loc for this:
desired = frame.loc[(frame["colB"] >= frame["colB"].shift(-1)) &
                    (frame["colB"].shift(-1) >= frame["colB"].shift(-2))]
print(desired)
The output will be:
colA colB
1 B 20
If you only wish to report the value 'B':
desired = frame["colA"].loc[(frame["colB"] >= frame["colB"].shift(-1)) &
                            (frame["colB"].shift(-1) >= frame["colB"].shift(-2))]
print(desired.values)
The output will be:
['B']
Yes, you can do this without using a loop.
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'], 'colB': [10, 20, 5, 2, 30]})
>>> df['colC'] = df['colB'].diff(-1)
>>> df
colA colB colC
0 A 10 -10.0
1 B 20 15.0
2 C 5 3.0
3 D 2 -28.0
4 F 30 NaN
'colC' is the difference between consecutive rows.
>>> df['colD'] = np.where(df['colC'] > 0, 1, 0)
>>> df
colA colB colC colD
0 A 10 -10.0 0
1 B 20 15.0 1
2 C 5 3.0 1
3 D 2 -28.0 0
4 F 30 NaN 0
In 'colD' we set a flag where the difference is greater than 0.
>>> df['s'] = df['colD'].shift(-1)
>>> df
colA colB colC colD s
0 A 10 -10.0 0 1.0
1 B 20 15.0 1 1.0
2 C 5 3.0 1 0.0
3 D 2 -28.0 0 0.0
4 F 30 NaN 0 NaN
In column 's' we shift the values of 'colD' up by one row.
>>> df['flag'] = np.where((df['colD'] == 1) & (df['colD'] == df['s']), 1, 0)
>>> df
colA colB colC colD s flag
0 A 10 -10.0 0 1.0 0
1 B 20 15.0 1 1.0 1
2 C 5 3.0 1 0.0 0
3 D 2 -28.0 0 0.0 0
4 F 30 NaN 0 NaN 0
Then 'flag' is the required column: rows where flag is 1 (here, 'B') are followed by two successive decrements.
A little bit of logic is needed here:
s = df.colB.diff().gt(0)  # True where colB increased from the previous row
# s.cumsum() labels each run that starts with an increase; a run of 3 or more rows
# means the starting row is followed by at least two rows that did not increase
df.loc[df.groupby(s.cumsum()).colA.transform('count').ge(3) & s, 'colA']
Out[45]:
1 B
Name: colA, dtype: object
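A compact shift-based sketch of the same idea (not from the original answers): flag the rows whose next two values each decrease.
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'], 'colB': [10, 20, 5, 2, 30]})
dec = df['colB'].diff() < 0                                        # True where the value dropped
mask = dec.shift(-1, fill_value=False) & dec.shift(-2, fill_value=False)
print(df.loc[mask, 'colA'])
1    B
Name: colA, dtype: object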