Python Pandas: How to take categorical average of a column? - pandas

for a given dataframe as follows:
1 a 10
2 a 20
3 a 30
4 b 10
5 b 100
where column 1 is index, column 2 is some categorical value and column 3 is a number. I want categorical mean over column 2, which should look something like this:
a 20
b 55
The value for a is calculated as
(10+20+30)/3 = 20
The value for b is calculated as
(10+100)/2 = 55

I think you can use groupby with mean and reset_index:
print df
a b c
0 1 a 10
1 2 a 20
2 3 a 30
3 4 b 10
4 5 b 100
df1 = df.groupby('b')['c'].mean().reset_index()
print df1
b c
0 a 20
1 b 55
print df1.c.max()
55
print df1.c.min()
20

Related

Create a new pandas DataFrame Column with a groupby

I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'TOTAL' column in the original dataframe. If it is the first record of the group 'TOTAL' gets a zero otherwise TOTAL will get the ['Blue'] at index subtracted by ['Red'] at index-1.
I tried to do this in a function below but it does not work.
def funct(group):
count = 0
lst = []
for info in group:
if count == 0:
lst.append(0)
count += 1
else:
num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
lst.append(num)
count += 1
group['Total'] = lst
return group
df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100],
'Total':[0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and shifted Red (Red at previous index):
df['Total'] = (df.groupby('Groups')
.apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
.reset_index(level=0, drop=True))
df
Red Groups Blue Total
0 1 A 10 0.0
1 2 B 20 0.0
2 3 A 30 29.0
3 4 A 40 37.0
4 5 B 50 48.0
5 6 C 60 0.0
6 7 B 70 65.0
7 8 C 80 74.0
8 9 B 90 83.0
9 10 C 100 92.0
Or as #anky has commented, you can avoid apply by shifting Red column first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
Red Groups Blue Total
0 1 A 10 0
1 2 B 20 0
2 3 A 30 29
3 4 A 40 37
4 5 B 50 48
5 6 C 60 0
6 7 B 70 65
7 8 C 80 74
8 9 B 90 83
9 10 C 100 92

Update dataframe column with values from another dataframe by index

I have two DataFrames.
One of them contains: item id, name, quantity and price.
Another: item id, name and quantity.
The problem is to update names and quantity in first DataFrame taking information from the second DataFrame by item id. Also, first DataFrame has not all item id's, so I need to take into account only those rows from the second DataFrame, which are in the first one.
DataFrame 1
In [1]: df1
Out[1]:
id name quantity price
0 10 X 10 15
1 11 Y 30 20
2 12 Z 20 15
3 13 X 15 10
4 14 X 12 15
DataFrame 2
In [2]: df2
Out[2]:
id name quantity
0 10 A 3
1 12 B 3
2 13 C 6
I've tried to use apply to iterate through rows and modify column value by condition like this:
def modify(row):
row['name'] = df2[df2['id'] == row['id']]['name'].get_values()[0]
row['quantity'] = df2[df2['id'] == row['id']]['quantity'].get_values()[0]
df1.apply(modify, axis=1)
But it doesn't have any results. DataFrame 1 is still the same
I am expecting something like this first:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 11 Y 30 20
2 12 B 3 15
3 13 C 6 10
4 14 X 12 15
After that I want to drop the rows, which were not modified to get:
In [1]: df1
Out[1]:
id name quantity price
0 10 A 3 15
1 12 B 3 15
2 13 C 6 10
Using update
df1=df1.set_index('id')
df1.update(df2.set_index('id'))
df1=df1.reset_index()
Out[740]:
id name quantity price
0 10 A 3.0 15
1 11 Y 30.0 20
2 12 B 3.0 15
3 13 C 6.0 10
4 14 X 12.0 15
new_df = df.merge(df2, on='id')
new.drop(['name_x','quantity_x'], inplace=True, axis=1)
new.columns = ['id','price','name','quantity']
Output
id price name quantity
0 10 15 A 3
1 12 15 B 3
2 13 10 C 6

Split a column by element and create new ones with pandas

Goal: I want to split one single column by elements (not the strings cells) and, from that division, create new columns, where the element is the title of the new column and the other values from another columns compose the respective column.
There is a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if 2 column input data:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for counter, then reshape by DataFrame.set_index with Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If 2 columns input data:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18

pandas grouping based on conditions on other columns

import pandas as pd
df = pd.DataFrame(columns=['A','B'])
df['A']=['A','B','A','A','B','B','B']
df['B']=[2,4,3,5,6,7,8]
df
A B
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
df.columns=['id','num']
df
id num
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
I would like to apply groupby on id column but some condition on num column
I want to have 2 columns is_even_count and is_odd_count columns in final data frame where is_even_count only counts even numbers from num column after grouping and is_odd_count only counts odd numbers from num column after grouping.
My output will be
is_even_count is_odd_count
A 1 2
B 3 1
how can i do this in pandas
Use modulo division by 2 and compare by 1 with map:
d = {True:'is_odd_count', False:'is_even_count'}
df = df.groupby(['id', (df['num'] % 2 == 1).map(d)]).size().unstack(fill_value=0)
print (df)
num is_even_count is_odd_count
id
A 1 2
B 3 1
Another solution with crosstab:
df = pd.crosstab(df['id'], (df['num'] % 2 == 1).map(d))
Alternative with numpy.where:
a = np.where(df['num'] % 2 == 1, 'is_odd_count', 'is_even_count')
df = pd.crosstab(df['id'], a)

issue merging two dataframes with pandas by summing element by element [duplicate]

I'm trying to merge two DataFrames summing columns value.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum weight values during merging for similar values in the common column.
merge = pd.merge(df1, df2, how='inner')
So the output will be something like following.
id name weight
1 2 B 25
2 3 C 20
This solution works also if you want to sum more than one column. Assume data frames
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20