Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
#add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
count = pd.NamedAgg('amount', 'count')
,min = pd.NamedAgg('amount', 'min')
,max = pd.NamedAgg('amount', 'max')
)
Error: "TypeError: incompatible index of inserted column with frame index"

Use merge,
d = pd.DataFrame({
"card_id": [1, 1, 2, 2, 1, 1, 2, 2],
"day": [1, 1, 1, 1, 2, 2, 2, 2],
"amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
df_out = d.groupby(['card_id', 'day']).agg(
    count=pd.NamedAgg('amount', 'count'),
    min=pd.NamedAgg('amount', 'min'),
    max=pd.NamedAgg('amount', 'max'),
)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
The output of your groupby has a MultiIndex (card_id, day) that doesn't match the index of d, hence the error. However, we can join the columns in d to the index of the groupby output by calling merge with the column names on the left and right_index=True.
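A closely related sketch (assuming the same d as above and a pandas version that supports joining multiple key columns against a MultiIndex, roughly 0.23+): aggregate once and join the result back onto the matching columns, which reads a little more compactly than the explicit merge:
stats = d.groupby(['card_id', 'day'])['amount'].agg(['count', 'min', 'max'])
d = d.join(stats, on=['card_id', 'day'])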

You could use the assign function to get the results in one go:
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
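If the goal is literally to collapse the three transform calls into one statement, a small sketch along the same lines (assuming the d defined in the question) builds them in a dict comprehension and passes them to assign:
group = d.groupby(['card_id', 'day'])['amount']
d = d.assign(**{name: group.transform(name) for name in ('count', 'min', 'max')})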

Related

Cut continuous data by outliers

For example, I have this DataFrame:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'b': [2, 2, 4, 3, 1000, 2000, 1, 500, 3]})
I need to cut by outliers and get these intervals: 1-4, 5-6, 7, 8, 9.
Cutting with pd.cut and pd.qcut does not give these results
You can group the rows into consecutive runs based on an above/below-threshold mask:
m = df['b'].gt(100)
df['group'] = m.ne(m.shift()).cumsum()
output:
a b group
0 1 2 1
1 2 2 1
2 3 4 1
3 4 3 1
4 5 1000 2
5 6 2000 2
6 7 1 3
7 8 500 4
8 9 3 5
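If the actual intervals (1-4, 5-6, 7, 8, 9) are wanted rather than just the group labels, a follow-up sketch (assuming the df and group column from above, with column a holding the positions) takes the min and max of a per group:
intervals = df.groupby('group')['a'].agg(['min', 'max'])
# group 1 -> (1, 4), group 2 -> (5, 6), group 3 -> (7, 7), group 4 -> (8, 8), group 5 -> (9, 9)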

pandas sort_values with condition

I have a dataframe that I'd like to sort on the columns time and b, where the sort on b is conditional on the value of a: if a == 1, sort b from highest to lowest, and if a == -1, sort from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']), but I think that always sorts b from lowest to highest.
df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
desired output
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use the product as the sorting key:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
Alternatively, negate the key so the default ascending sort gives the same result:
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Create an additional column to sort on, then pass ascending:
out = df.assign(key=df.a*df.b).sort_values(['time', 'key'], ascending=[True, False]).drop('key', axis=1)
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
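As a sketch of an alternative that avoids the temporary column (assuming the same df as above), NumPy's lexsort can compute the row order directly; the last key passed is the primary sort key:
import numpy as np
# primary key: time ascending; ties broken by -(a*b) ascending, i.e. a*b descending
order = np.lexsort((-(df['a'] * df['b']).to_numpy(), df['time'].to_numpy()))
df_sorted = df.iloc[order]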

create column based on column values - merge integers

I would like to create a new column "Group". The integer values from column "Step_ID" should be converted into 1 and 2: the first two distinct Step_ID values should become 1, the next two should become 2, the next two 1 again, and so on.
import pandas as pd
data = {'Step_ID': [1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10, 11, 11]}
df1 = pd.DataFrame(data)
You can try:
m = (df1.Step_ID % 2) + df1.Step_ID
df1['new_group'] = (m.ne(m.shift()).cumsum() % 2).replace(0, 2)
Here m maps each pair of consecutive Step_ID values (1 and 2, 3 and 4, ...) to the same number, the cumsum over the change points numbers those pairs consecutively, and % 2 with replace(0, 2) turns that numbering into alternating 1s and 2s.
OUTPUT:
Step_ID new_group
0 1 1
1 1 1
2 2 1
3 2 1
4 3 2
5 4 2
6 5 1
7 6 1
8 6 1
9 7 2
10 8 2
11 8 2
12 9 1
13 10 1
14 11 2
15 11 2
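Since the Step_ID values in this example are consecutive integers starting at 1 (an assumption this shortcut relies on), a purely arithmetic sketch produces the same alternating labels:
# pair (1, 2) -> 1, pair (3, 4) -> 2, pair (5, 6) -> 1, ...
df1['new_group'] = ((df1['Step_ID'] - 1) // 2) % 2 + 1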

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to filter to get the first entry of each day that has sales greater than 0. As an additional step, I would like to change the 'hour' column for those rows to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to do this in a more vectorized, functional style?
# Group by day:
groups = df.groupby(by=['day'])
# Get all indices of first non-zero sales entry per day:
indices = []
for name, group in groups:
group = group[group['sales'] > 0]
indices.append(group.index.to_list()[0])
# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can group the boolean Series df['sales'].gt(0) by df['day'] and take idxmax, which for a boolean Series returns the index of the first True in each group. Then filter out days that don't have any value greater than 0 using any, and assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
Create a Series m that holds, for each row, the first sales value within its (day, sales != 0) group.
Then use this together with df['sales'] > 0 to change those specific rows to 9 with np.where():
import numpy as np
import pandas as pd

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
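Another sketch of the same idea (assuming the df defined just above): keep only the rows with positive sales, take the first such row per day with groupby().head(1), and set 'hour' on those index labels:
first_rows = df[df['sales'] > 0].groupby('day').head(1).index
df.loc[first_rows, 'hour'] = 9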

Why does groupby in Pandas place counts under existing column names?

I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:
import pandas as pd
d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
color size
id
1 1 1
2 3 3
3 1 1
4 2 2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?
count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a regular column, use the as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if you add NaNs to a column, the counts differ:
import numpy as np
d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan, 2, 1, 2, 1, 3, 4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can also specify a single column to count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If you need counts that include NaNs, use size:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
d = {'id': [1, 2, 3, 4, 2, 2, 4],
'color': ["r","r","b","b","g","g","r"],
'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks Bharath for pointing out another solution with value_counts; the differences are explained here:
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1
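Since the original goal was to filter rows by frequency, a short follow-up sketch (assuming the df above and a hypothetical threshold of at least 2 occurrences) maps the counts back onto the rows:
counts = df['id'].value_counts()
df_filtered = df[df['id'].map(counts) >= 2]
# equivalent: df.groupby('id').filter(lambda g: len(g) >= 2)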