Consider two dataframes:
>> import pandas as pd
>> df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"], "quantity": [1,2,1,2,3]})
>> print(df1)
category quantity
0 foo 1
1 foo 2
2 bar 1
3 bar 2
4 bar 3
>> df2 = pd.DataFrame({
"category": ["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
"item": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
})
>> print(df2)
category item
0 foo A
1 foo B
2 foo C
3 foo D
4 bar E
5 bar F
6 bar G
7 bar H
8 bar I
9 bar J
How can I create a new dataframe df3 that adds an item column to df1 by joining on the category column and allocating the items from df2 in order, so that each row receives as many items as its quantity? The result should look like:
>> df3 = pd.DataFrame({
"category": ["foo", "foo", "bar", "bar", "bar"],
"quantity": [1,2,1,2,3],
"item": ["A", "B,C", "E", "F,G", "H,I,J"]
})
category quantity item
0 foo 1 A
1 foo 2 B,C
2 bar 1 E
3 bar 2 F,G
4 bar 3 H,I,J
You can build a helper DataFrame by repeating each row of df1 according to its quantity column with Index.repeat and DataFrame.loc, then convert the index to a column so the original row labels are not lost. Next, create a helper column g in both DataFrames with GroupBy.cumcount, which numbers the rows within each category so that duplicated categories can be matched one to one. Finally, use DataFrame.merge and aggregate the items with join:
df11 = (df1.loc[df1.index.repeat(df1['quantity'])].reset_index()
.assign(g = lambda x: x.groupby('category').cumcount()))
df22 = df2.assign(g = df2.groupby('category').cumcount())
df = (df11.merge(df22, on=['g','category'], how='left')
.groupby(['index','category','quantity'])['item']
.agg(lambda x: ','.join(x.dropna()))
.droplevel(0)
.reset_index())
print (df)
category quantity item
0 foo 1 A
1 foo 2 B,C
2 bar 1 E
3 bar 2 F,G
4 bar 3 H,I,J
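To see what the helper column g does, the intermediate frame can be inspected on its own. The following is a minimal sketch that rebuilds df1 from the question and shows only the repeated rows with their per-category counter; it uses no calls beyond those already named above.

```python
import pandas as pd

df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"],
                    "quantity": [1, 2, 1, 2, 3]})

# repeat each row `quantity` times and keep the original row label in column "index"
df11 = df1.loc[df1.index.repeat(df1["quantity"])].reset_index()
# number the repeated rows within each category: 0, 1, 2, ...
df11["g"] = df11.groupby("category").cumcount()

print(df11[["index", "category", "g"]])
```

Each g value lines up with exactly one row of df2 in the same category, which is why the left merge hands out the items in order.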
You can use an iterator with itertools.islice:
from itertools import islice
# aggregate the items as iterator
s = df2.groupby('category')['item'].agg(iter)
# for each category, allocate as many items as needed and join
df1['item'] = (df1.groupby('category', group_keys=False)['quantity']
.apply(lambda g:
g.map(lambda x: ','.join(list(islice(s[g.name], x)))))
)
Output:
category quantity item
0 foo 1 A
1 foo 2 B,C
2 bar 1 E
3 bar 2 F,G
4 bar 3 H,I,J
Note that if you don't have enough items, this will just use what is available.
Example using df2 truncated after F as input:
category quantity item
0 foo 1 A
1 foo 2 B,C
2 bar 1 E
3 bar 2 F
4 bar 3
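The behaviour with too few items follows from how islice consumes a shared iterator. Here is a small standalone sketch, using a made-up list of four items rather than the dataframes above:

```python
from itertools import islice

# one shared iterator; every islice call continues where the previous one stopped
items = iter(["A", "B", "C", "D"])

first = list(islice(items, 1))   # takes 1 item
second = list(islice(items, 2))  # takes the next 2 items
third = list(islice(items, 3))   # asks for 3, but only 1 is left
fourth = list(islice(items, 2))  # iterator is exhausted

print(first, second, third, fourth)
```

islice never raises when the iterator runs dry; it simply yields fewer items, which is why the last rows end up short or empty.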
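Another option is to slice the items of df2 per category, using the cumulative sum of quantity as slice boundaries:
def function1(dd: pd.DataFrame):
    # running total of quantity gives the end position of each row's slice
    col2 = dd.quantity.cumsum()
    # the previous running total is the start position
    col1 = col2.shift(fill_value=0)
    return dd.assign(col1=col1, col2=col2).apply(lambda ss: ",".join(
        df2.loc[df2.category == ss.category, "item"].iloc[ss.col1:ss.col2].tolist()
    ), axis=1)

df1.assign(item=df1.groupby('category').apply(function1).droplevel(0))
Output: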
category quantity item
0 foo 1 A
1 foo 2 B,C
2 bar 1 E
3 bar 2 F,G
4 bar 3 H,I,J
If I have a pd.DataFrame that looks like:
import numpy as np
import pandas as pd

new_df = []
for i in range(10):
    df_example = pd.DataFrame(np.random.normal(size=[10,1]))
    cols = [round(np.random.uniform(low=0,high=10)),round(np.random.uniform(low=0,high=10)),
            round(np.random.uniform(low=0,high=10)),round(np.random.uniform(low=0,high=10))]
    keys = ['A','B','C','D']
    new_ix = pd.MultiIndex.from_tuples([cols],names=keys)
    df_example.columns = new_ix
    new_df.append(df_example)
new_df = pd.concat(new_df,axis=1)
Which could yield something like:
Now, if I want the columns where C=4 and A=1, I can do:
df.xs(axis=1,level=['A','C'],key=[1,4])
How do I express either of the following?
C in [4,2] and A in [5,2]
C in [4,2] or A in [5,2]
To the best of my knowledge, the key parameter of xs only accepts a label or a tuple of labels, so such queries cannot be expressed with xs directly.
The next best thing is to define helper functions for that purpose, such as the following:
def xs_or(df: pd.DataFrame, params: dict[str, list[int]]) -> pd.DataFrame:
    """Helper function.
    Args:
        df: input dataframe.
        params: columns/values to query.
    Returns:
        Filtered dataframe.
    """
    df = pd.concat(
        [
            df.xs(axis=1, level=[level], key=(key,))
            for level, keys in params.items()
            for key in keys
        ],
        axis=1,
    )
    for level in params.keys():
        try:
            df = df.droplevel([level], axis=1)
        except KeyError:
            pass
    return df

def xs_and(df: pd.DataFrame, params: dict[str, list[int]]) -> pd.DataFrame:
    """Helper function.
    Args:
        df: input dataframe.
        params: columns/values to query.
    Returns:
        Filtered dataframe.
    """
    for level, keys in params.items():
        df = xs_or(df, {level: keys})
    return df
And so, with the following dataframe named df:
A 4 7 3 1 7 9 4 0 3 9
B 6 7 4 6 7 5 8 0 8 0
C 2 10 5 2 9 9 4 3 4 5
D 0 1 7 3 8 3 6 7 9 10
0 -0.199458 1.155345 1.298027 0.575606 0.785291 -1.126484 0.019082 1.765094 0.034631 -0.243635
1 1.173873 0.523277 -0.709546 1.378983 0.266661 1.626118 1.647584 -0.228162 -1.708271 0.111583
2 0.321156 0.049470 -0.611111 -1.238887 1.092369 0.019503 -0.473618 1.804474 -0.850320 -0.217921
3 0.339307 -0.758909 0.072159 1.636119 -0.541920 -0.160791 -1.131100 1.081766 -0.530082 -0.546489
4 -1.523110 -0.662232 -0.434115 1.698073 0.568690 0.836359 -0.833581 0.230585 0.166119 1.085600
5 0.020645 -1.379587 -0.608083 -1.455928 1.855402 1.714663 -0.739409 1.270043 1.650138 -0.718430
6 1.280583 -1.317288 0.899278 -0.032213 -0.347234 2.543415 0.272228 -0.664116 -1.404851 -0.517939
7 -1.201619 0.724669 -0.705984 0.533725 0.820124 0.651339 0.363214 0.727381 -0.282170 0.651201
8 1.829209 0.049628 0.655277 -0.237327 -0.007662 1.849530 0.095479 0.295623 -0.856162 -0.350407
9 -0.690613 1.419008 -0.791556 0.180751 -0.648182 0.240589 -0.247574 -1.947492 -1.010009 1.549234
You can filter like this:
# C in [10, 2] or A in [1, 0]
print(xs_or(df, {"C": [10, 2], "A": [1, 0]}))
# Output
B 7 6 2 3
D 1 0 3 3 7
0 1.155345 -0.199458 0.575606 0.575606 1.765094
1 0.523277 1.173873 1.378983 1.378983 -0.228162
2 0.049470 0.321156 -1.238887 -1.238887 1.804474
3 -0.758909 0.339307 1.636119 1.636119 1.081766
4 -0.662232 -1.523110 1.698073 1.698073 0.230585
5 -1.379587 0.020645 -1.455928 -1.455928 1.270043
6 -1.317288 1.280583 -0.032213 -0.032213 -0.664116
7 0.724669 -1.201619 0.533725 0.533725 0.727381
8 0.049628 1.829209 -0.237327 -0.237327 0.295623
9 1.419008 -0.690613 0.180751 0.180751 -1.947492
# C in [10, 2] and A in [1, 7]
print(xs_and(df, {"C": [10, 2], "A": [1, 7]}))
# Output
B 6 7
D 3 1
0 0.575606 1.155345
1 1.378983 0.523277
2 -1.238887 0.049470
3 1.636119 -0.758909
4 1.698073 -0.662232
5 -1.455928 -1.379587
6 -0.032213 -1.317288
7 0.533725 0.724669
8 -0.237327 0.049628
9 0.180751 1.419008
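As an alternative to the xs-based helpers, the same kind of query can be sketched with boolean masks built from the column levels. This is only an illustration under my own assumptions: the small frame demo and its column tuples are made up for the example (loosely modelled on the frame above), and unlike xs this keeps all four levels in the result.

```python
import numpy as np
import pandas as pd

# hypothetical 2-row frame with the same four column levels as in the question
cols = pd.MultiIndex.from_tuples(
    [(4, 6, 2, 0), (7, 7, 10, 1), (3, 4, 5, 7), (1, 6, 2, 3), (0, 0, 3, 7)],
    names=["A", "B", "C", "D"],
)
demo = pd.DataFrame(np.arange(10).reshape(2, 5), columns=cols)

# one boolean mask per level, built from the level values
in_c = demo.columns.get_level_values("C").isin([10, 2])
in_a = demo.columns.get_level_values("A").isin([1, 0])

either = demo.loc[:, in_c | in_a]  # C in [10, 2] or A in [1, 0]
both = demo.loc[:, in_c & in_a]    # C in [10, 2] and A in [1, 0]

print(either)
print(both)
```

Because each column is selected at most once by a mask, a column that satisfies both conditions is not duplicated, whereas xs_or above can return the same column twice.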
I have a df and I want to calculate the mean of the 3rd quintile for each group. My approach is to write a custom function and apply it to each group, but there are some issues. The code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': pd.Series(np.array(range(20))), 'B': ['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b']})
def func_mean_quintile(df):
    # Make sure data is in DataFrame
    df = pd.DataFrame(df)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:,0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:,0].mean()
    return np.full((len(df)), avg)
df['C'] = df.groupby('B')['A'].apply(func_mean_quintile)
The result is NaN for every row of column C.
I don't know what is wrong.
Also, if you know how to make the custom function perform better, please help.
Thank you
Proposed solution without function
You do not need a function; this should do the calc:
q_lo = 0.4 # start of 3rd quintile
q_hi = 0.6 # end of 3rd quintile
(df.groupby('B')
.apply(lambda g:g.assign(C = g.loc[(g['A'] >= g['A'].quantile(q_lo)) & (g['A'] < g['A'].quantile(q_hi)), 'A' ].mean()))
.reset_index(drop = True)
)
output:
A B C
0 0 a 4.5
1 1 a 4.5
2 2 a 4.5
3 3 a 4.5
4 4 a 4.5
5 5 a 4.5
6 6 a 4.5
7 7 a 4.5
8 8 a 4.5
9 9 a 4.5
10 10 b 14.5
11 11 b 14.5
12 12 b 14.5
13 13 b 14.5
14 14 b 14.5
15 15 b 14.5
16 16 b 14.5
17 17 b 14.5
18 18 b 14.5
19 19 b 14.5
Your original solution
Also works if you replace the line df['C'] = ... with
df['C'] = df.groupby('B')['A'].transform(func_mean_quintile)
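The reason transform helps is that it returns a result aligned to the original row index, whereas the apply call in the question returns one entry per group key, which matches none of the row labels 0 to 19 on assignment. A minimal sketch along those lines follows; the function name third_quintile_mean is my own, and like the question it uses pd.cut, so the bins are equal-width rather than true quantiles.

```python
import pandas as pd

df = pd.DataFrame({"A": range(20), "B": ["a"] * 10 + ["b"] * 10})

def third_quintile_mean(s: pd.Series) -> float:
    # labels=False yields integer bin codes 0..4; code 2 is the third of five bins
    codes = pd.cut(s, 5, labels=False)
    return s[codes == 2].mean()

# transform broadcasts the scalar returned for each group back to that group's rows
df["C"] = df.groupby("B")["A"].transform(third_quintile_mean)
print(df)
```

Returning a scalar also avoids building a throwaway DataFrame and an np.full array inside the function.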
Do it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': pd.Series(np.array(range(20))), 'B':['a','a','a','a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b' ,'b']})
def func_mean_quintile(df):
    # Make sure data is in DataFrame
    df = pd.DataFrame(df)
    df['pct'] = pd.to_numeric(pd.cut(df.iloc[:,0], 5, labels=np.r_[1:6]))
    avg = df[df['pct'] == 3].iloc[:,0].mean()
    return np.full((len(df)), avg)
means = df.groupby('B').apply(func_mean_quintile)
# create column C first, then fill it per group with .loc to avoid chained assignment
df['C'] = np.nan
df.loc[df["B"]=='a', 'C'] = means["a"]
df.loc[df["B"]=='b', 'C'] = means["b"]
This will give you the required output.
I think it's easier if you split it into two steps: first label each data point with the quantile it falls in, then aggregate per quantile.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"a": pd.Series(np.array(range(20))),
"b": ["a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"],
}
)
df["a_quantile"] = pd.cut(df.a, bins=4, labels=["q1", "q2", "q3", "q4"])
df_agg = df.groupby("a_quantile").agg({"a": ["mean"]})
df_agg.head()
With the aggregation results shown below:
Out[9]:
a
mean
a_quantile
q1 2
q2 7
q3 12
q4 17
Hello, I have to downgrade pandas to version '0.24.2'.
As a result, pd.NamedAgg is no longer recognized.
import pandas as pd
import numpy as np
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
Can you please help me change my code to make it compatible with version 0.24.2?
Thank you a lot.
Sample:
agg_df = df.groupby(agg_cols)['Foo'].agg(
[('max_foo', np.max),('min_foo', np.min)]
).reset_index()
df = pd.DataFrame({
'A':list('a')*6,
'B':[4,5,4,5,5,4],
'C':[7]*6,
'Foo':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
agg_cols = ['A', 'B', 'C']
agg_df = df.groupby(agg_cols).agg(
max_foo=pd.NamedAgg(column='Foo', aggfunc=np.max),
min_foo=pd.NamedAgg(column='Foo', aggfunc=np.min)
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Because only one column, Foo, is being processed, select Foo after groupby and pass a list of tuples, each pairing a new column name with its aggregate function:
agg_df = df.groupby(agg_cols)['Foo'].agg(
[('max_foo', np.max),('min_foo', np.min)]
).reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
Another idea is to pass a dictionary of lists of aggregate functions and then flatten the resulting MultiIndex columns:
agg_df = df.groupby(agg_cols).agg({'Foo':['max', 'min']})
agg_df.columns = [f'{b}_{a.lower()}' for a, b in agg_df.columns]
agg_df = agg_df.reset_index()
print (agg_df)
A B C max_foo min_foo
0 a 4 7 5 0
1 a 5 7 7 1
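A further variant that should not depend on pd.NamedAgg is to aggregate with a plain list of function names and rename the resulting columns afterwards. This is a sketch using the sample data above; I have not run it against 0.24.2 specifically, so treat the version compatibility as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("a") * 6,
    "B": [4, 5, 4, 5, 5, 4],
    "C": [7] * 6,
    "Foo": [1, 3, 5, 7, 1, 0],
})
agg_cols = ["A", "B", "C"]

agg_df = (
    df.groupby(agg_cols)["Foo"]
      .agg(["max", "min"])  # result columns are named "max" and "min"
      .rename(columns={"max": "max_foo", "min": "min_foo"})
      .reset_index()
)
print(agg_df)
```

Passing the function names as strings also sidesteps the need to import numpy for this aggregation.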
Here is a dataframe:
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'bar', 'bar'],
'B' : ['1', '2','2', '4', '1']})
Below is how I want it to look,
And here is how I have tried and failed.
groups = df.groupby([A])
groups.apply(lambda g: g[g[B] == g[B].first()]).reset_index(drop=True)
You can do:
df['B'] = df.groupby('A')['B'].transform('first')
or, if the data is already sorted by A as shown:
df['B'] = df['B'].mask(df['A'].duplicated()).ffill()
Output:
A B
0 foo 1
1 foo 1
2 bar 2
3 bar 2
4 bar 2
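The sorted-data caveat for the mask/duplicated variant matters in practice. The sketch below uses a made-up frame in which the groups are interleaved, so the two approaches can be compared side by side:

```python
import pandas as pd

# hypothetical frame where the rows of each group are NOT contiguous
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["1", "2", "3", "4"]})

# groupby/transform looks up the first value of each group regardless of row order
via_transform = df.groupby("A")["B"].transform("first")

# mask + ffill just carries the last non-duplicated value forward
via_mask = df["B"].mask(df["A"].duplicated()).ffill()

print(via_transform.tolist())
print(via_mask.tolist())
```

With interleaved groups, the forward fill assigns the value of the most recent group start, so the second foo row picks up bar's value; transform('first') is the safer choice when the ordering is not guaranteed.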
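Use drop_duplicates + repeat:
s=df.drop_duplicates('A')
# value_counts sorts by frequency, so align the counts with the order of s before repeating
s=s.reindex(s.index.repeat(df.A.value_counts().reindex(s.A).values))
Out[555]:
A B
0 foo 1
0 foo 1
2 bar 2
2 bar 2
2 bar 2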
I have the following pandas dataframe:
>>> df = pd.DataFrame(
... {'A': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz'],
... 'B': ['red', 'blue', 'yellow', 'green', 'grey', 'red']})
>>> df
A B
0 foo red
1 foo blue
2 bar yellow
3 bar green
4 baz grey
5 baz red
I want to filter for all of the rows in which an element in column A has a value of 'red' in column B. If I do a simple filter I get:
>>> df[df['B'] == 'red']
A B
0 foo red
5 baz red
But I want all of the rows for foo and baz since any of those rows have 'red' in column B:
A B
0 foo red
1 foo blue
4 baz grey
5 baz red
You can first find all unique values of A where the condition holds (using loc, since ix has been removed from recent pandas versions):
print (df.loc[df['B'] == 'red', 'A'].unique())
['foo' 'baz']
Then use those values in a second condition with isin for boolean indexing:
print (df.A.isin(df.loc[df['B'] == 'red', 'A'].unique()))
0 True
1 True
2 False
3 False
4 True
5 True
Name: A, dtype: bool
print (df[df.A.isin(df.loc[df['B'] == 'red', 'A'].unique())])
A B
0 foo red
1 foo blue
4 baz grey
5 baz red
In this sample, unique can be omitted because there is only one red value per group. If a group can contain several red values, unique removes the repeated group names before they are passed to isin.
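The same filter can also be sketched in a single step with a grouped transform, which marks every row of a group as soon as any row in that group is red. This is an alternative illustration rather than part of the answer above; the name has_red is my own.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "foo", "bar", "bar", "baz", "baz"],
     "B": ["red", "blue", "yellow", "green", "grey", "red"]})

# True for every row whose group in A contains at least one 'red' in B
has_red = df["B"].eq("red").groupby(df["A"]).transform("any")

print(df[has_red])
```

The mask is aligned to the original index, so it can be used directly for boolean indexing without collecting the matching group names first.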