I have a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(1, 10, (8, 4)),
    columns=['a', 'b', 'c', 'd'])
df['group'] = ['g1'] * 5 + ['g2'] * 3
# left col is the index
>> a b c d group
0 5 6 3 2 g1
1 5 6 6 6 g1
2 3 9 5 3 g1
3 5 6 8 2 g1
4 2 2 9 6 g1
5 9 5 4 8 g2
6 1 3 5 2 g2
7 3 8 8 6 g2
I want to groupby "group" column and then do a few different operations:
• For column "a" I want to get the min and max value
• For the rest I want to sum them
min_max_col = ['a']
sum_cols = ['b','c','d']
Is there a simple way to do this?
The result should look something like this:
>> min max sum_b sum_c sum_d
g1 2 5 29 31 19
g2 1 9 16 17 16
Use agg
df = df.groupby('group').agg({'a':[ np.min, np.max], 'b': np.sum, 'c': np.sum, 'd': np.sum})
df.columns = ['min', 'max', 'sum_b', 'sum_c', 'sum_d']
df = df.reset_index()
group min max sum_b sum_c sum_d
0 g1 2 5 29 31 19
1 g2 1 9 16 17 16
This differs from the answer above in that the aggregations are passed as strings, so pandas uses its own internal sum, min, and max implementations. In my opinion, we should leverage those as much as possible.
f = dict(
a=['min', 'max'],
b='sum',
c='sum',
d='sum'
)
df.groupby('group').agg(f)
a b c d
min max sum sum sum
group
g1 2 5 29 31 19
g2 1 9 16 17 16
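If flat column names like the ones in the question are the goal, named aggregation may save the manual renaming step. This is a minimal sketch, assuming pandas 0.25 or later; the sample values are hard-coded from the frame printed in the question, and the name result is only illustrative:

```python
import pandas as pd

# Sample values copied from the frame printed in the question.
df = pd.DataFrame({
    "a": [5, 5, 3, 5, 2, 9, 1, 3],
    "b": [6, 6, 9, 6, 2, 5, 3, 8],
    "c": [3, 6, 5, 8, 9, 4, 5, 8],
    "d": [2, 6, 3, 2, 6, 8, 2, 6],
    "group": ["g1"] * 5 + ["g2"] * 3,
})

# Each keyword is an output column name; the tuple is (input column, aggregation).
result = df.groupby("group").agg(
    min=("a", "min"),
    max=("a", "max"),
    sum_b=("b", "sum"),
    sum_c=("c", "sum"),
    sum_d=("d", "sum"),
)
print(result)
```

The output columns come out already flat, so no assignment to df.columns is needed afterwards.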
I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'Total' column in the original dataframe. For the first record of each group, 'Total' gets a zero; otherwise 'Total' is ['Blue'] at the current index minus ['Red'] at the previous index within the same group.
I tried to do this in a function below but it does not work.
def funct(group):
    count = 0
    lst = []
    for info in group:
        if count == 0:
            lst.append(0)
            count += 1
        else:
            num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
            lst.append(num)
            count += 1
    group['Total'] = lst
    return group
df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100],
'Total':[0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and shifted Red (Red at previous index):
df['Total'] = (df.groupby('Groups')
.apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
.reset_index(level=0, drop=True))
df
Red Groups Blue Total
0 1 A 10 0.0
1 2 B 20 0.0
2 3 A 30 29.0
3 4 A 40 37.0
4 5 B 50 48.0
5 6 C 60 0.0
6 7 B 70 65.0
7 8 C 80 74.0
8 9 B 90 83.0
9 10 C 100 92.0
Or, as @anky has commented, you can avoid apply by shifting the Red column first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
Red Groups Blue Total
0 1 A 10 0
1 2 B 20 0
2 3 A 30 29
3 4 A 40 37
4 5 B 50 48
5 6 C 60 0
6 7 B 70 65
7 8 C 80 74
8 9 B 90 83
9 10 C 100 92
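The downcast argument of fillna is, as far as I know, deprecated in recent pandas 2.x releases. Here is a sketch of the same idea without it, assuming the Total values are whole numbers so that an explicit cast to int is safe (prev_red is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'Red': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Groups': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'C', 'B', 'C'],
    'Blue': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Red from the previous row of the same group; NaN on each group's first row.
prev_red = df.groupby('Groups')['Red'].shift()

# First rows become 0; the cast restores an integer dtype after the NaN step.
df['Total'] = (df['Blue'] - prev_red).fillna(0).astype(int)
print(df)
```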
My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
df = pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
With pd.wide_to_long, you need to reorder the parts of each column name around the delimiter; in this case we move a, b, ... to the front and p1, p2 to the back before reshaping:
temp = df.copy()
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index()
)
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is that the column names have to be rearranged first, so that each stubname comes before the separator. The other limitation is that the stubnames have to be specified explicitly.
Another option is via stack, where the columns are split, based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The key piece here is .value. It tells pivot_longer that the part of each column name after the separator should stay as a column header, while the part before it should be collected into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be listed. It can also handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version however, the column reorder is not necessary; we can simply use multiple .value to reshape the dataframe - note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
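For readers who prefer to stay in plain pandas, the substring case can also be sketched with melt and pivot. This is only an illustration, assuming the side token always looks like p followed by digits and is followed by a dot. The names melted and long_df are hypothetical, and the sample frame is a cut-down version of the one in the question:

```python
import pandas as pd

# Cut-down version of the question's second example (two fields instead of seven).
df = pd.DataFrame({
    "foo.p1.a": [4, 0], "foo.p1.b": [1, 4],
    "foo.p2.a": [0, 0], "foo.p2.b": [3, 12],
})

# Go fully long first, keeping the original row label in an "index" column.
melted = df.reset_index().melt(id_vars="index", var_name="column")

# Split each original column name into the side token and the remaining name.
melted["side"] = melted["column"].str.extract(r"(p\d+)", expand=False)
melted["name"] = melted["column"].str.replace(r"p\d+\.", "", regex=True)

# Spread the remaining names back out into columns, one row per (side, original row).
long_df = (
    melted.pivot(index=["side", "index"], columns="name", values="value")
    .reset_index()
    .drop(columns="index")
    .rename_axis(columns=None)
)
print(long_df)
```

Because the regex only looks for the side token, the same code also handles the plain prefix case (p1.a, p2.a, ...).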
Goal: I want to split a single column by its distinct values (not split the string cells) and create one new column per distinct value, where the value becomes the title of the new column and the corresponding values from another column fill it.
Is there a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative, if the input data has 2 columns:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount as a counter, then reshape with DataFrame.set_index and Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If the input data has 2 columns:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
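The same counter idea can also be written with DataFrame.pivot, which some readers may find easier to follow than set_index plus unstack. This is a sketch, assuming the two-column input shown above; the helper column name row and the result name wide are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": list("AAAABBBB"),
    "col2": [1, 2, 6, 99, 7, 8, 19, 18],
})

wide = (
    # Position of each row within its col1 group: 0, 1, 2, 3 for A and for B.
    df.assign(row=df.groupby("col1").cumcount())
    # One output column per distinct col1 value, one output row per position.
    .pivot(index="row", columns="col1", values="col2")
    .rename_axis(index=None, columns=None)
)
print(wide)
```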
I have two DataFrame objects, and I want to multiply each row of the first element-wise by the values of the second:
df_prob_wc.shape # (3505, 13)
df_prob_c.shape # (13, 1)
I thought I could do it with DataFrame.apply()
df_prob_wc.apply(lambda x: x.multiply(df_prob_c), axis=1)
which gives me:
TypeError: ("'int' object is not iterable", 'occurred at index $')
or with
df_prob_wc.apply(lambda x: x * df_prob_c, axis=1)
which gives me:
TypeError: 'int' object is not iterable
But it's not working.
However, I can do this:
df_prob_wc.apply(lambda x: x * np.asarray([1,2,3,4,5,6,7,8,9,10,11,12,13]), axis=1)
What am I doing wrong here?
It seems you need to multiply by a Series, selected from df_prob_c with iloc:
df_prob_wc = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df_prob_wc)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
df_prob_c = pd.DataFrame([[4,5,6,1,2,3]]).T
# align the index with the columns of df_prob_wc
df_prob_c.index = df_prob_wc.columns
print (df_prob_c)
0
A 4
B 5
C 6
D 1
E 2
F 3
print (df_prob_wc.shape)
(3, 6)
print (df_prob_c.shape)
(6, 1)
print (df_prob_c.iloc[:,0])
A 4
B 5
C 6
D 1
E 2
F 3
Name: 0, dtype: int64
print (df_prob_wc.mul(df_prob_c.iloc[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
Another solution is to multiply by a NumPy array; [:,0] is only needed to select the first column:
print (df_prob_wc.mul(df_prob_c.values[:,0], axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
And another solution with DataFrame.squeeze:
print (df_prob_wc.mul(df_prob_c.squeeze(), axis=1))
A B C D E F
0 4 20 42 1 10 21
1 8 25 48 3 6 12
2 12 30 54 5 12 9
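One detail that may be worth knowing: with a Series, mul matches on labels rather than on position, which is why the index of df_prob_c was set to the column names above. Here is a small sketch with made-up frames (wc and weights are illustrative names, not from the question) where the Series is deliberately out of order:

```python
import pandas as pd

wc = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Deliberately shuffled labels: alignment is by name, not by position.
weights = pd.Series({'C': 6, 'A': 4, 'B': 5})

# axis=1 matches the Series index against the DataFrame columns.
out = wc.mul(weights, axis=1)
print(out)
```

A plain NumPy array has no labels, so in that case the order of the values has to match the column order.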
I have two dataframes
A B
0 1 2
1 1 2
2 1 2
and
C D
0 1 4
1 2 5
2 3 6
I need the mean of the cross products (AC, AD, BC, BD). As such I was hoping to be able to compute
AC AD BC BD
0 1 4 2 8
1 2 5 4 10
2 3 6 6 12
but so far I have been unable to do so. I tried multiply etc, but to no avail. I can do it using loops obviously, but is there an elegant way to do it?
Cheers, Mike
Consider the dataframes d1 and d2:
d1 = pd.DataFrame([[1, 2]] * 3, columns=list('AB'))
d2 = pd.DataFrame(np.arange(1, 7).reshape(2, 3).T, columns=list('CD'))
Then the Kronecker product is:
kp = pd.DataFrame(np.kron(d1, d2), columns=pd.MultiIndex.from_product([d1, d2]))
kp
NOTE
This is equivalent to flattening the outer products of each pair of columns, not the cross products.
For Python 3.7, given dataframes data1 and data2:
def kronecker(data1:'Dataframe 1',data2:'Dataframe 2'):
    Combination = pd.DataFrame(); d1 = pd.DataFrame()
    for i in data2.columns:
        d1 = data1.multiply(data2[i] , axis="index")
        d1.columns = [f'{i}{j}' for j in data1.columns]
        Combination = pd.concat([Combination, d1], axis = 1)
    return Combination
To complement the answer of @piRSquared, if you want a partial Kronecker product as described in the question (along a single axis):
import numpy as np
import pandas as pd
pd.DataFrame(np.einsum('nk,nl->nkl', df1, df2).reshape(df1.shape[0], -1),
columns=pd.MultiIndex.from_product([df1, df2]).map(''.join)
)
output:
AC AD BC BD
0 1 4 2 8
1 2 5 4 10
2 3 6 6 12
In contrast, the other answer would give:
AC AD BC BD
0 1 4 2 8
1 2 5 4 10
2 3 6 6 12
3 1 4 2 8
4 2 5 4 10
5 3 6 6 12
6 1 4 2 8
7 2 5 4 10
8 3 6 6 12
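Since the question ultimately asks for the mean of each product column, here is a sketch that builds the row-wise products with plain NumPy broadcasting instead of einsum and then averages them. The frames are rebuilt from the values shown in the question, and the names products and means are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2]] * 3, columns=list("AB"))
df2 = pd.DataFrame({"C": [1, 2, 3], "D": [4, 5, 6]})

# Shapes (n, k, 1) and (n, 1, l) broadcast to (n, k, l): every column of df1
# times every column of df2, computed separately for each row.
a = df1.to_numpy()[:, :, None]
b = df2.to_numpy()[:, None, :]

products = pd.DataFrame(
    (a * b).reshape(len(df1), -1),
    columns=[f"{x}{y}" for x in df1.columns for y in df2.columns],
)

# Column-wise mean of each pairwise product.
means = products.mean()
print(products)
print(means)
```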