How to implement multiple aggregations using pandas groupby, referencing a specific column - pandas

I have data in a pandas data frame, and need to aggregate it. I need to do different aggregations across different columns similar to the below.
group  min(rank)  min(rank)  min   sum
title  t_no       t_descr    rank  stores
A      1          a          1     1000
B      1          a          1     1000
B      2          b          2     800
C      2          b          2     800
D      1          a          1     1000
D      2          b          2     800
F      4          d          4     500
E      3          c          3     700
to:
title  t_no  t_descr  rank  stores
A      1     a        1     1000
B      1     a        1     1800
C      2     b        2     800
D      1     a        1     1800
E      3     c        3     700
F      4     d        4     500
You'll notice that title B and D have been aggregated, keeping the t_no & t_descr that corresponded to the minimum of the rank for the respective title group, while stores are summed. t_no & t_descr are just arbitrary text. I need the top rank by title, sum the stores, and keep the corresponding t_no & t_descr.
How can I do this within a single pandas groupby? This is dummy data; the real problem that I'm working on has many more aggregations, and I'd prefer not to have to do each aggregation individually, which I know how to do.
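For reference, here is a hypothetical snippet (not part of the original post) to rebuild the example data; the question's code below calls this frame df2, while the answers refer to it as df:
import pandas as pd

df2 = pd.DataFrame({'title':   list('ABBCDDFE'),
                    't_no':    [1, 1, 2, 2, 1, 2, 4, 3],
                    't_descr': list('aabbabdc'),
                    'rank':    [1, 1, 2, 2, 1, 2, 4, 3],
                    'stores':  [1000, 1000, 800, 800, 1000, 800, 500, 700]})
df = df2  # the answers below use the name df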
I started with the below, but realized that I really need the mins & maxes for t_no & t_descr to be based on the rank column of the subgroup, not on the columns themselves.
aggs = {
    'rank': 'min',
    't_no': 'min',     # need t_no for row that is min(rank) by title
    't_descr': 'min'   # need t_descr for row that is min(rank) by title
}
df2.groupby('title').agg(aggs).reset_index()
Perhaps there's a way to do this with a lambda? I'm sure there's a straightforward way to do this. And if groupby isn't the right method I'm obviously open to suggestions.
Thanks!

Two-step process:
aggregate for the sum of stores and the idxmin of rank,
then use the idxmin result to slice the original dataframe and join it with the aggregate.
agged = df.groupby('title').agg(dict(rank='idxmin', stores='sum'))
df.loc[agged['rank'], ['title', 't_no', 't_descr', 'rank']].join(agged.stores, on='title')
  title  t_no t_descr  rank  stores
0     A     1       a     1    1000
1     B     1       a     1    1800
3     C     2       b     2     800
4     D     1       a     1    1800
7     E     3       c     3     700
6     F     4       d     4     500
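Since the real problem has many more aggregations, here is a hedged generalization of the same idea (my addition, not from the original answer): once you have the idxmin positions, you can pull any number of row-level columns in one shot. This assumes df is the example frame from the question:
# row_cols are taken from the min(rank) row; agg_cols are aggregated per group
row_cols = ['title', 't_no', 't_descr', 'rank']
agg_cols = {'stores': 'sum'}

agged = df.groupby('title').agg(dict(rank='idxmin', **agg_cols))
result = (df.loc[agged['rank'], row_cols]
            .join(agged[list(agg_cols)], on='title')
            .reset_index(drop=True))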

This is a slightly different approach from @piRSquared's, but gets you to the same spot:
Code:
# Set min and sum functions according to columns and generate a new dataframe
f = {'rank': min, 'stores': sum}
grouped = df.groupby('title').agg(f).reset_index()
# Then merge with original dataframe (keeping only the merged and new columns)
pd.merge(grouped, df[['title','rank','t_no','t_descr']], on=['title','rank'])
Output:
  title  stores  rank  t_no t_descr
0     A    1000     1     1       a
1     B    1800     1     1       a
2     C     800     2     2       b
3     D    1800     1     1       a
4     E     700     3     3       c
5     F     500     4     4       d
Of course you can organize the columns as you see fit.
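One caveat worth noting (my addition, not from the original answer): if several rows within a title tie on the minimum rank, the merge returns one row per tie. A hypothetical guard, assuming any of the tied t_no/t_descr values is acceptable:
result = (pd.merge(grouped, df[['title', 'rank', 't_no', 't_descr']], on=['title', 'rank'])
            .drop_duplicates(subset='title')   # keep the first match per title
            .reset_index(drop=True))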

Related

Subquery or CTE to identify the mix of area in an extra column

I have the following table, for which I am looking to create a new column, type, which can be either "pure" or "mix" based on two different conditions.
id    unit     area  n_unit  qty
1245  5485245  A     2       1
1245  2488754  B     2       1
2358  548754   A     3       1
2358  84447    A     3       1
2358  548754   A     3       1
4582  84447    C     2       1
4582  548754   D     2       1
9696  84447    B     2       1
9696  548754   K     2       1
I am looking to have a result as below:
id    unit     area  n_unit  qty  type
1245  5485245  A     2       1    mix
1245  2488754  B     2       1    mix
2358  548754   A     3       1    pure
2358  84447    A     3       1    pure
2358  548754   A     3       1    pure
4582  84447    C     2       1    pure
4582  548754   D     2       1    pure
9696  84447    B     2       1    mix
9696  548754   K     2       1    mix
My logic is this:
If all the rows with the same Id are in Area A, C or D, then all rows with that Id are type "pure".
Otherwise, i.e. if an area other than A, C or D exists within the Id, all rows with the same Id are type "mix".
The n_unit value is based on the total units, i.e. the number of rows with the same Id.
Looking forward to your kind help.
This requires one window function and one CASE expression, as follows:
SELECT *,
       MIN(CASE WHEN area IN ('A', 'C', 'D')
                THEN 'pure'
                ELSE 'mix'
           END) OVER (PARTITION BY id) AS type
FROM tab
Since 'mix' sorts before 'pure' alphabetically, if any row in a partition produces 'mix' it becomes the minimum value assigned to the whole partition; otherwise you get 'pure'.
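For readers who want the same partition-wise logic in pandas (a sketch of my own, not part of the SQL answer), a groupby transform plays the role of the window function:
import pandas as pd

df = pd.DataFrame({'id':   [1245, 1245, 2358, 2358, 2358, 4582, 4582, 9696, 9696],
                   'area': ['A', 'B', 'A', 'A', 'A', 'C', 'D', 'B', 'K']})

# 'mix' sorts before 'pure', so a per-id min mirrors MIN(...) OVER (PARTITION BY id)
labels = df['area'].isin(['A', 'C', 'D']).map({True: 'pure', False: 'mix'})
df['type'] = labels.groupby(df['id']).transform('min')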

Remove duplicates from dataframe, based on two columns A,B, keeping [list of values] in another column C

I have a pandas dataframe which contains duplicate values according to two columns (A and B):
A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8
I want to remove the duplicates, keeping the values of column C inside a list of N values (2 values in this example). This would lead to:
A B C
1 2 [1,4]
2 7 1
3 4 [0,8]
I cannot figure out how to do that. Maybe use groupby and drop_duplicates?
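One possible approach, offered as a hedged sketch (this part of the thread shows no answer): group by A and B and aggregate C into a list:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 7, 4, 4],
                   'C': [1, 4, 1, 0, 8]})

# collect all C values for each (A, B) pair into a list
out = df.groupby(['A', 'B'], as_index=False)['C'].agg(list)
Note that this yields a one-element list (e.g. [1]) for unique pairs; unwrapping singletons to scalars, as in the desired output, would need a small post-processing step.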

How to sum values of two columns by an ID column, keeping some columns with repeated values and excluding others?

I need to organize a large df, adding the values of a column by an ID column (the ID is not sequential), keeping the columns of the df that have repeated values by ID and excluding the columns that have different values by ID. Below I inserted a reproducible example and the output I need. I think there is a simple way to do that, but I am not so familiar with R.
df=read.table(textConnection("
ID spp effort generalist specialist
1 a 10 1 0
1 b 10 1 0
1 c 10 0 1
1 d 10 0 1
2 a 16 1 0
2 b 16 1 0
2 e 16 0 1
"), header = TRUE)
The output I need:
ID effort generalist specialist
1 10 2 2
2 16 2 1
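The question is about R, but since the rest of this page is pandas-centric, here is a rough pandas equivalent of the requested aggregation (my sketch, not an answer from the thread):
import pandas as pd

df = pd.DataFrame({'ID':         [1, 1, 1, 1, 2, 2, 2],
                   'spp':        list('abcdabe'),
                   'effort':     [10, 10, 10, 10, 16, 16, 16],
                   'generalist': [1, 1, 0, 0, 1, 1, 0],
                   'specialist': [0, 0, 1, 1, 0, 0, 1]})

# effort repeats within each ID, so 'first' keeps it; the counts are summed; spp is dropped
out = df.groupby('ID', as_index=False).agg({'effort': 'first',
                                            'generalist': 'sum',
                                            'specialist': 'sum'})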

Dynamic transpose of rows to column without pivot (Number of rows are not fixed all the time)

I have a table like:
a 1
a 2
b 1
b 3
b 2
b 4
I want output like this:
1 2 3 4
a a
b b b b
The number of rows in the output may vary.
Pivoting is not working since this is in Exasol, and a CASE expression can't work because the data is dynamic.
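The Exasol part needs dynamic SQL, which I won't attempt here, but to illustrate the target shape in pandas terms (a hypothetical sketch with assumed column names key and val, not an answer to the Exasol question):
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b', 'b', 'b'],
                   'val': [1, 2, 1, 3, 2, 4]})

# spread val into columns; each cell holds the key letter where that (key, val) pair exists
wide = (df.assign(cell=df['key'])
          .pivot(index='key', columns='val', values='cell')
          .fillna(''))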

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And also, is there a more elegant approach to numbering records within each group (like the SQL window function row_number())?
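As an aside on the row_number() part (my addition, not from the question or the answers below), groupby().cumcount() is the usual pandas analogue of that SQL window function:
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# cumcount() numbers rows within each id (0-based), much like ROW_NUMBER()
df['rn'] = df.groupby('id').cumcount()

# keep the first two rows per id, then drop the helper column
top2 = df[df['rn'] <= 1].drop(columns='rn')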
Did you try
df.groupby('id').head(2)
Output generated:
      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group by first and do top-k for each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value passed to nlargest: the number of rows to keep for each group.
reset_index is optional.
This works for duplicated values.
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
    id first_name last_name department  salary
24  12   Shandler      Bing      Audit  110000
25  14      Jason       Tom      Audit  100000
26  16     Celine    Anston      Audit  100000
27  15    Michale   Jackson      Audit   70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
   department  salary
0       Audit  110000
1       Audit  100000
2       Audit   70000
3  Management  250000
4  Management  200000
5  Management  150000
6       Sales  220000
7       Sales  200000
8       Sales  150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])