Why does groupby in Pandas place counts under existing column names?

I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:
import pandas as pd

d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
color size
id
1 1 1
2 3 3
3 1 1
4 2 2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?

count aggregates all columns of the DataFrame, excluding NaN values. If you need id as a regular column, use the as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if you add NaNs to a column, the counts differ:
import numpy as np

d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can also specify which column to count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If you need the count including NaN values, use size:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
d = {'id': [1, 2, 3, 4, 2, 2, 4],
     'color': ["r","r","b","b","g","g","r"],
     'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks to Bharath for pointing out another solution with value_counts; the differences are explained here:
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1
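Since the original goal was to filter by the frequency column, here is a minimal sketch (not part of the answer above). It assumes the freq frame with the count column from the size() approach; the threshold 2 and the names popular and filtered are purely illustrative.
# Keep only the ids that occur at least twice.
popular = freq[freq['count'] >= 2]

# Or filter the rows of the original df directly, without building freq first,
# by broadcasting each group's size back onto its rows with transform.
filtered = df[df.groupby('id')['id'].transform('size') >= 2]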

Related

Replace a single value in a column of a dataframe with 3 new column values

I would effectively like to replace a single value in a column (based on a criterion) with 3 new column values. For example:
Ch    Time
A     1 min
A     1 min
B     2 min
B     2 min

A1    A2    A3
1     2     3
2     3     4
1     1     2

B1    B2    B3
1     2     1
1     1     1
Given the tables above, if Ch == A, then I would like to see all values from the second table repeated twice (and likewise for Ch == B, etc.):
ColA    ColB    ColC
1       2       3
1       3       4
1       1       2
1       2       3
1       3       4
1       1       2
How can I go about doing this efficiently? Thank you in advance for your help!
I tried replacing the value C, however I am not sure of how to insert 3 new columns.
IIUC, you can concat and merge:
df1 = pd.DataFrame({'Ch': ['A', 'A', 'B', 'B'], 'Time': ['1 min', '1 min', '2 min', '2 min']})
df2 = pd.DataFrame({'A1': [1, 2, 1], 'A2': [2, 3, 1], 'A3': [3, 4, 2]})
df3 = pd.DataFrame({'B1': [1, 1], 'B2': [2, 1], 'B3': [1, 1]})
d = {'A': df2, 'B': df3}
df1.merge(pd.concat({k: v.set_axis(['ColA', 'ColB', 'ColC'], axis=1)
                     for k, v in d.items()}
                    ).droplevel(1),
          left_on='Ch', right_index=True
          )
Output:
Ch Time ColA ColB ColC
0 A 1 min 1 2 3
0 A 1 min 2 3 4
0 A 1 min 1 1 2
1 A 1 min 1 2 3
1 A 1 min 2 3 4
1 A 1 min 1 1 2
2 B 2 min 1 2 1
2 B 2 min 1 1 1
3 B 2 min 1 2 1
3 B 2 min 1 1 1
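For readers following along, here is the same approach restated in named steps, as a sketch (the lookup and out names are mine); a reset_index(drop=True) is added at the end because the merge otherwise repeats df1's index labels, as seen in the output above.
import pandas as pd

df1 = pd.DataFrame({'Ch': ['A', 'A', 'B', 'B'], 'Time': ['1 min', '1 min', '2 min', '2 min']})
df2 = pd.DataFrame({'A1': [1, 2, 1], 'A2': [2, 3, 1], 'A3': [3, 4, 2]})
df3 = pd.DataFrame({'B1': [1, 1], 'B2': [2, 1], 'B3': [1, 1]})
d = {'A': df2, 'B': df3}

# Stack the per-Ch tables under one common set of column names, keyed by Ch.
lookup = pd.concat({k: v.set_axis(['ColA', 'ColB', 'ColC'], axis=1)
                    for k, v in d.items()}).droplevel(1)

# Each df1 row matches every lookup row with the same Ch key.
out = df1.merge(lookup, left_on='Ch', right_index=True).reset_index(drop=True)
print(out)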

Iterate over rows and subtract values in pandas df

I have the following table:
ID    Qty_1    Qty_2
A     1        10
A     2        0
A     3        0
B     3        29
B     2        0
B     1        0
I want to iterate based on the ID, subtract Qty_1 from Qty_2, and update the next row with that result.
The result would be:
ID    Qty_1    Qty_2
A     1        10
A     2        8
A     3        5
B     3        29
B     2        27
B     1        26
Ideally, I would also like to start subtracting already on the first row whenever a new ID appears, and only continue the loop after that:
ID    Qty_1    Qty_2
A     1        9
A     2        7
A     3        4
B     3        26
B     2        24
B     1        23
Either of the solutions is fine! Thank you!
First compute the difference between 'Qty_1' and 'Qty_2' row by row, then group by 'ID' and compute cumulative sum:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
                .groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Qty_1': [1, 2, 3, 3, 2, 1],
        'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)
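The output above matches the second, "ideal" table, where even the first row of each ID is reduced. If the first table is wanted instead (first Qty_2 of each ID left untouched), here is a sketch along the same lines, starting again from the Setup data and assuming each ID's rows are contiguous as in the sample:
# Zero out Qty_1 on the first row of each ID so that row subtracts nothing,
# then subtract the running total of Qty_1 from the first Qty_2 of the group.
first = df.groupby('ID')['Qty_2'].transform('first')
q1 = df['Qty_1'].where(df['ID'].duplicated(), 0)
df['Qty_2'] = first - q1.groupby(df['ID']).cumsum()
print(df)
# Gives 10, 8, 5 for A and 29, 27, 26 for B, i.e. the first requested table.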

pandas sort_values with condition

I have a dataframe that I'd like to sort on the columns time and b, where the sort on b is conditional on the value of a. So if a == 1, sort from highest to lowest, and if a == -1, sort from lowest to highest. I would normally do something like df.sort_values(by=['time', 'b']), but that always sorts b from lowest to highest.
df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})
time a b
0 0 1 4
1 3 -1 5
2 2 1 1
3 2 1 6
4 1 -1 2
desired output
time a b
0 0 1 4
1 1 -1 2
2 2 1 6
3 2 1 1
4 3 -1 5
Multiply a and b and use it as sorting key:
df['sort'] = df['a']*df['b']
df.sort_values(by=['time', 'sort'], ascending=[True, False]).drop('sort', axis=1)
output:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
alternative (negate the key instead of reversing the sort order):
df['sort'] = -df['a']*df['b']
df.sort_values(by=['time', 'sort']).drop('sort', axis=1)
Pass ascending after creating an additional column for sorting:
out = df.assign(key=df.a*df.b).sort_values(['time','key'], ascending=[True,False]).drop('key', axis=1)
Out[59]:
time a b
0 0 1 4
4 1 -1 2
3 2 1 6
2 2 1 1
1 3 -1 5
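Another option, as a sketch (the order and out names are mine): skip the helper column entirely, compute the flipped key inline, and order the rows with numpy.lexsort.
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 3, 2, 2, 1], 'a': [1, -1, 1, 1, -1], 'b': [4, 5, 1, 6, 2]})

# lexsort treats the last key as the primary one, so time goes last here;
# -a*b flips the direction of b whenever a == 1.
order = np.lexsort((-df['a'] * df['b'], df['time']))
out = df.iloc[order]
print(out)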

Pandas - Add many new columns based on many aggregate functions

Pandas 1.0.5
import pandas as pd
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
#add columns
d['count'] = d.groupby(['card_id', 'day'])["amount"].transform('count')
d['min'] = d.groupby(['card_id', 'day'])["amount"].transform('min')
d['max'] = d.groupby(['card_id', 'day'])["amount"].transform('max')
I would like to change the three transform lines to one line. I tried this:
d['count', 'min', 'max'] = d.groupby(['card_id', 'day'])["amount"].transform('count', 'min', 'max')
Error: "TypeError: count() takes 1 positional argument but 3 were given"
I also tried this:
d[('count', 'min', 'max')] = d.groupby(['card_id', 'day']).agg(
    count = pd.NamedAgg('amount', 'count')
    ,min = pd.NamedAgg('amount', 'min')
    ,max = pd.NamedAgg('amount', 'max')
)
Error: "TypeError: incompatible index of inserted column with frame index"
Use merge,
d = pd.DataFrame({
    "card_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "day": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [1, 2, 10, 20, 3, 4, 30, 40]
})
df_out = d.groupby(['card_id', 'day']).agg(
    count = pd.NamedAgg('amount', 'count')
    ,min = pd.NamedAgg('amount', 'min')
    ,max = pd.NamedAgg('amount', 'max')
)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
d.merge(df_out, left_on=['card_id', 'day'], right_index=True)
Output:
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
The output of your groupby has a MultiIndex, and the index of this output doesn't match the index of d, hence the error. However, we can join the columns in d to the index of the groupby output by using merge with the column names and right_index=True.
You could use the assign function to get the results in one go:
grouping = d.groupby(["card_id", "day"])["amount"]
d.assign(
    count=grouping.transform("count"),
    min=grouping.transform("min"),
    max=grouping.transform("max"),
)
card_id day amount count min max
0 1 1 1 2 1 2
1 1 1 2 2 1 2
2 2 1 10 2 10 20
3 2 1 20 2 10 20
4 1 2 3 2 3 4
5 1 2 4 2 3 4
6 2 2 30 2 30 40
7 2 2 40 2 30 40
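A more compact variant of the same transform idea, as a sketch assuming the d frame defined above: build all three columns from one dictionary so the list of aggregations is written only once.
aggs = ['count', 'min', 'max']
g = d.groupby(['card_id', 'day'])['amount']

# Each transform(name) returns a Series aligned with d,
# so assign can take all of them at once via keyword expansion.
d = d.assign(**{name: g.transform(name) for name in aggs})
print(d)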

Get group counts of level 1 after doing a group by on two columns

I am doing a group by on two columns and need the count of the number of values in level 1.
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
A B C
0 one 1 3
1 one 2 3
2 two 0 3
3 three 4 3
4 three 3 4
5 one 4 8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
            C
           sC cC
A     B
one   1     3  1
      2     3  1
      4     8  1
three 3     4  1
      4     3  1
two   0     3  1
I want an output something like this where the last column tC gives me the count corresponding to group one, two and three.
            C
           sC cC tC
A     B
one   1     3  1  3
      2     3  1
      4     8  1
three 3     4  1  2
      4     3  1
two   0     3  1  1
If there is only one column to aggregate, pass a list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get the group counts with GroupBy.transform and size, and keep the value only for the first row of each group with numpy.where:
import numpy as np

s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
         sC  cC   tC
A     B
one   1   3   1  3.0
      2   3   1  NaN
      4   8   1  NaN
three 3   4   1  2.0
      4   3   1  NaN
two   0   3   1  1.0
You can also set the duplicated values to an empty string in the tC column, but any later numeric operation on this column will then fail, because it mixes numbers with strings:
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
         sC  cC tC
A     B
one   1   3   1  3
      2   3   1
      4   8   1
three 3   4   1  2
      4   3   1
two   0   3   1  1
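If you want tC to stay numeric while still blanking out the duplicated rows, here is a sketch assuming the df and s from the answer above: use the nullable Int64 dtype so the blanked rows hold <NA> instead of strings, and later arithmetic on the column keeps working.
# Build the per-group sizes as before, then mask the duplicated rows with <NA>
# instead of '' so tC remains a (nullable) integer column.
sizes = s.groupby(s).transform('size').to_numpy()
df['tC'] = pd.Series(sizes, index=df.index, dtype='Int64').mask(s.duplicated().to_numpy())
print(df['tC'].dtype)   # Int64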