Aggregating counts of different columns subsets - pandas

I have a dataset with a tree structure and for each path in the tree, I want to compute the corresponding counts at each level. Here is a minimal reproducible example with two levels.
import pandas as pd
data = pd.DataFrame()
data['level_1'] = np.random.choice(['1', '2', '3'], 100)
data['level_2'] = np.random.choice(['A', 'B', 'C'], 100)
I know I can get the counts on the last level by doing
counts = data.groupby(['level_1','level_2']).size().reset_index(name='count_2')
print(counts)
level_1 level_2 count_2
0 1 A 10
1 1 B 12
2 1 C 8
3 2 A 10
4 2 B 10
5 2 C 10
6 3 A 17
7 3 B 12
8 3 C 11
What I would like to have is a dataframe with one row for each possible path in the tree with the counts at each level in that path. For the example above, it would be something like
level_1 level_2 count_1 count_2
0 1 A 30 10
1 1 B 30 12
2 1 C 30 8
3 2 A 30 10
4 2 B 30 10
5 2 C 30 10
6 3 A 40 17
7 3 B 40 12
8 3 C 40 11
This is an example with only two levels, which is easy to solve, but I would like to have a way to get those counts for an arbitrary number of levels.

This will be the transform
counts['count_1']=counts.groupby(['level_1']).count_2.transform('sum')
counts
Out[445]:
level_1 level_2 count_2 count_1
0 1 A 7 30
1 1 B 13 30
2 1 C 10 30
3 2 A 7 30
4 2 B 7 30
5 2 C 16 30
6 3 A 9 40
7 3 B 10 40
8 3 C 21 40

You can make do from your original data:
groups = data.groupby('level_1').level_2
pd.merge(groups.value_counts(),
groups.size(),
left_index=True,
right_index=True)
which gives:
level_2_x level_2_y
level_1 level_2
1 A 14 39
B 14 39
C 11 39
2 C 13 34
A 12 34
B 9 34
3 B 12 27
C 9 27
A 6 27

Related

pandas: get top n including the duplicates of a sorted column

I have some data like
This is a table sorted by score column and also then by cat column
score cat
18 B
18 A
17 A
16 B
16 A
15 B
14 B
13 A
12 A
10 B
9 B
I want to get the top 5 of score including the duplicates and also add the rank
i.e
rank score cat
1 18 B
1 18 A
2 17 A
3 16 B
3 16 A
4 15 B
5 14 B
How can i get this using pandas
Since the data frame is ordered, try factorize
df['rnk'] = df.score.factorize()[0]+1
out = df[df['rnk'] <= 5]
out
score cat rnk
0 18 B 1
1 18 A 1
2 17 A 2
3 16 B 3
4 16 A 3
5 15 B 4
6 14 B 5

Stack multiple columns into single column while maintaining other columns in Pandas?

Given pandas multiple columns as below
cl_a cl_b cl_c cl_d cl_e
0 1 a 5 6 20
1 2 b 4 7 21
2 3 c 3 8 22
3 4 d 2 9 23
4 5 e 1 10 24
I would like to stack the column cl_c cl_d cl_e into a single column with the name ax. But, please note that, the columns cl_a cl_b were maintained.
cl_a cl_b ax from_col
1,a,5,cl_c
2,b,4,cl_c
3,c,3,cl_c
4,d,2,cl_c
5,e,1,cl_c
1,a,6,cl_d
2,b,7,cl_d
3,c,8,cl_d
4,d,9,cl_d
5,e,10,cl_d
1,a,20,cl_e
2,b,21,cl_e
3,c,22,cl_e
4,d,23,cl_e
5,e,24,cl_e
So far, the following code does the job
df = pd.DataFrame ( {'cl_a': [1,2,3,4,5], 'cl_b': ['a','b','c','d','e'],
'cl_c': [5,4,3,2,1],'cl_d': [6,7,8,9,10],
'cl_e': [20,21,22,23,24]})
df_new = pd.DataFrame()
for col_name in ['cl_c','cl_d','cl_e']:
df_new=df_new.append (df [['cl_a', 'cl_b', col_name]].rename(columns={col_name: "ax"}))
However, I am curious whether there is Pandas build-in approach that can do the trick
Edit:
Upon Quong answer, I realise of the need to include another column (i.e., from_col) beside the ax. The from_col indicate the origin of ax previous column name.
Yes, it's called melt:
df.melt(['cl_a','cl_b'], value_name='ax').drop(columns='variable')
Output:
cl_a cl_b ax
0 1 a 5
1 2 b 4
2 3 c 3
3 4 d 2
4 5 e 1
5 1 a 6
6 2 b 7
7 3 c 8
8 4 d 9
9 5 e 10
10 1 a 20
11 2 b 21
12 3 c 22
13 4 d 23
14 5 e 24
Or equivalently set_index().stack():
(df.set_index(['cl_a','cl_b']).stack()
.reset_index(level=-1, drop=True)
.reset_index(name='ax')
)
with a slightly different output:
cl_a cl_b ax
0 1 a 5
1 1 a 6
2 1 a 20
3 2 b 4
4 2 b 7
5 2 b 21
6 3 c 3
7 3 c 8
8 3 c 22
9 4 d 2
10 4 d 9
11 4 d 23
12 5 e 1
13 5 e 10
14 5 e 24

Sum of group but keep the same value for each row in pandas

How to solve same problem in this link Sum of group but keep the same value for each row in r using pandas?
I can generate separate df have the sum for each group and then merge the generated df with the original.
You can use groupby & transform as below to get your output.
df['sumx']=df.groupby(['ID', 'Group'],sort=False)['x'].transform(sum)
df['sumy']=df.groupby(['ID', 'Group'],sort=False)['y'].transform(sum)
df
output
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23

Split a column by element and create new ones with pandas

Goal: I want to split one single column by elements (not the strings cells) and, from that division, create new columns, where the element is the title of the new column and the other values from another columns compose the respective column.
There is a way of doing that with pandas? Thanks in advance.
Example:
[IN]:
A 1
A 2
A 6
A 99
B 7
B 8
B 19
B 18
[OUT]:
A B
1 7
2 8
6 19
99 18
Just an alternative if 2 column input data:
print(df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1=pd.DataFrame(df.groupby('col1')['col2'].apply(list).to_dict())
print(df1)
A B
0 1 7
1 2 8
2 6 19
3 99 18
Use Series.str.split with GroupBy.cumcount for counter, then reshape by DataFrame.set_index with Series.unstack:
print (df)
col
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
df1 = df['col'].str.split(expand=True)
g = df1.groupby(0).cumcount()
df2 = df1.set_index([0, g])[1].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18
If 2 columns input data:
print (df)
col1 col2
0 A 1
1 A 2
2 A 6
3 A 99
4 B 7
5 B 8
6 B 19
7 B 18
g = df.groupby('col1').cumcount()
df2 = df.set_index(['col1', g])['col2'].unstack(0).rename_axis(None, axis=1)
print (df2)
A B
0 1 7
1 2 8
2 6 19
3 99 18

Pandas Group By two columns and based on the value in one of them (categorical) write data into a specific column [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have following dataframe:
df = pd.DataFrame([[1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3,3],['A','B','B','B','C','D','D','E','A','C','C','C','A','B','B','B','B','D','E'], [18,25,47,27,31,55,13,19,73,55,58,14,2,46,33,35,24,60,7]]).T
df.columns = ['Brand_ID','Category','Price']
Brand_ID Category Price
0 1 A 18
1 1 B 25
2 1 B 47
3 1 B 27
4 1 C 31
5 1 D 55
6 1 D 13
7 1 E 19
8 2 A 73
9 2 C 55
10 2 C 58
11 2 C 14
12 3 A 2
13 3 B 46
14 3 B 33
15 3 B 35
16 3 B 24
17 3 D 60
18 3 E 7
What I need to do is to group by Brand_ID and category and count (similar to the first part of this question). However, I need instead to write the output into a different column depending on the category. So my Output should look like follows:
Brand_ID Category_A Category_B Category_C Category_D Category_E
0 1 1 3 1 2 1
1 2 1 0 3 0 0
2 3 1 4 0 1 1
Is there any possibility to do this directly with pandas?
Try:
df.groupby(['Brand_ID','Category'])['Price'].count()\
.unstack(fill_value=0)\
.add_prefix('Category_')\
.reset_index()\
.rename_axis([None], axis=1)
Output
Brand_ID Category_A Category_B Category_C Category_D Category_E
0 1 1 3 1 2 1
1 2 1 0 3 0 0
2 3 1 4 0 1 1
OR
pd.crosstab(df.Brand_ID, df.Category)\
.add_prefix('Category_')\
.reset_index()\
.rename_axis([None], axis=1)
You're describing a pivot_table:
df.pivot_table(index='Brand_ID', columns='Category', aggfunc='size', fill_value=0)
Output:
Category A B C D E
Brand_ID
1 1 3 1 2 1
2 1 0 3 0 0
3 1 4 0 1 1