multilevel columns set as index in pivot_table - pandas

I have a DataFrame (df) with multi-level column headers:
YearQ         YearC YearS Type1             Type2
index          City State Year1 Year2 Year3 Year4 Year5 Year6
0          New York    NY   355   189   115   234   178   422
1       Los Angeles    CA   100   207   298   230   214   166
2           Chicago    IL  1360   300   211   121   355   435
3      Philadelphia    PA   270   156   455   232   532   355
4           Phoenix    AZ   270   234   112   432   344   116
I want to compute the average for each type. The final format should look like the following:
City      State  Type1                 Type2
New York  NY     avg of (355+189+115)  avg of (234+178+422)
.......
Can anybody give me a hint?
Many thanks.
Kath

I think you can use groupby by the first level of the MultiIndex in the columns and aggregate with sum:
print (df.index)
MultiIndex(levels=[[0, 1, 2, 3, 4],
                   ['Chicago', 'Los Angeles', 'New York', 'Philadelphia', 'Phoenix'],
                   ['AZ', 'CA', 'IL', 'NY', 'PA']],
           labels=[[0, 1, 2, 3, 4], [2, 1, 0, 3, 4], [3, 1, 2, 4, 0]])

print (df.columns)
MultiIndex(levels=[['Type1', 'Type2'],
                   ['Year1', 'Year2', 'Year3', 'Year4', 'Year5', 'Year6']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5]],
           names=['YearQ', 'index'])
df = df.groupby(axis=1, level=0).sum()
print (df)
YearQ               Type1  Type2
0 New York     NY     659    834
1 Los Angeles  CA     605    610
2 Chicago      IL    1871    911
3 Philadelphia PA     881   1119
4 Phoenix      AZ     616    892
But it may be necessary to call set_index first:
print (df.index)
Int64Index([0, 1, 2, 3, 4], dtype='int64')

print (df.columns)
MultiIndex(levels=[['Type1', 'Type2', 'YearC', 'YearS'],
                   ['City', 'State', 'Year1', 'Year2', 'Year3', 'Year4', 'Year5', 'Year6']],
           labels=[[2, 3, 0, 0, 0, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7]],
           names=['YearQ', 'index'])
df = df.set_index([('YearC','City'), ('YearS','State')])
df = df.groupby(axis=1, level=0).sum()
print (df)
YearQ                        Type1  Type2
(YearC, City) (YearS, State)             
New York      NY               659    834
Los Angeles   CA               605    610
Chicago       IL              1871    911
Philadelphia  PA               881   1119
Phoenix       AZ               616    892
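Note that the question asks for the average rather than the sum; the same pattern works with mean. A minimal sketch (and since groupby(axis=1) is deprecated in pandas 2.x, an equivalent transpose form is shown as well):
result = df.groupby(axis=1, level=0).mean()
# equivalent without the deprecated axis=1 keyword: transpose, group the
# first column level, aggregate, then transpose back
result = df.T.groupby(level=0).mean().T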

Related

Pandas: How to limit the number of rows for a given index in a pivot table

I have the following Pandas data frame df:
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles', 'Houston', 'Houston', 'Houston', 'Boston', 'Boston', 'Boston', 'Boston'],
    'airport': ['LGA', 'EWR', 'JFK', 'TEB', 'CWD', 'TTN', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD', 'BOS', 'ACK', 'MVY', 'WST'],
    'distance': [38, 32, 8, 78, 120, 180, 8, 19, 90, 78, 120, 9, 97, 72, 150]
})
df
city airport distance
0 New York LGA 38
1 New York EWR 32
2 New York JFK 8
3 New York TEB 78
4 New York CWD 120
5 New York TTN 180
6 Los Angeles LAX 8
7 Los Angeles BUR 19
8 Houston IAH 90
9 Houston HOU 78
10 Houston EFD 120
11 Boston BOS 9
12 Boston ACK 97
13 Boston MVY 72
14 Boston WST 150
I create a sorted pivot table using the following:
pivot_table = pd.pivot_table(df, index = ['city', 'airport'], values = 'distance')
sorted_table = pivot_table.reset_index().sort_values(['city', 'distance'], ascending=[1,0]).set_index(['city', 'airport'])
                     distance
city        airport          
Boston      WST           150
            ACK            97
            MVY            72
            BOS             9
Houston     EFD           120
            IAH            90
            HOU            78
Los Angeles BUR            19
            LAX             8
New York    TTN           180
            CWD           120
            TEB            78
            LGA            38
            EWR            32
            JFK             8
As you can see, some cities have more than 3 associated airports (e.g. Boston and New York).
How can I limit the number of results per city to a maximum of three (3)?
The desired pivot table would look like this:
                     distance
city        airport          
Boston      WST           150
            ACK            97
            MVY            72
Houston     EFD           120
            IAH            90
            HOU            78
Los Angeles BUR            19
            LAX             8
New York    TTN           180
            CWD           120
            TEB            78
Thanks!
df = pd.DataFrame({
    'city': ['New York', 'New York', 'New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles', 'Houston', 'Houston', 'Houston', 'Boston', 'Boston', 'Boston', 'Boston'],
    'airport': ['LGA', 'EWR', 'JFK', 'TEB', 'CWD', 'TTN', 'LAX', 'BUR', 'IAH', 'HOU', 'EFD', 'BOS', 'ACK', 'MVY', 'WST'],
    'distance': [38, 32, 8, 78, 120, 180, 8, 19, 90, 78, 120, 9, 97, 72, 150]
})
pivot_table = pd.pivot_table(df, index=['city', 'airport'], values='distance')
sorted_table = pivot_table.reset_index().sort_values(['city', 'distance'], ascending=[1, 0]).set_index(['city', 'airport'])
# keep only the first three rows of each city group
limited_table = sorted_table.groupby('city').head(3)
print(limited_table)
                     distance
city        airport          
Boston      WST           150
            ACK            97
            MVY            72
Houston     EFD           120
            IAH            90
            HOU            78
Los Angeles BUR            19
            LAX             8
New York    TTN           180
            CWD           120
            TEB            78
You can also do it like this, without creating the MultiIndex:
df.sort_values(['city', 'distance'], ascending=[True, False])\
  .groupby(['city']).head(3)
Output:
city airport distance
14 Boston WST 150
12 Boston ACK 97
13 Boston MVY 72
10 Houston EFD 120
8 Houston IAH 90
9 Houston HOU 78
7 Los Angeles BUR 19
6 Los Angeles LAX 8
5 New York TTN 180
4 New York CWD 120
3 New York TEB 78
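If you prefer not to pre-sort, a possible alternative (a sketch) is nlargest per group; group_keys=False keeps the original flat index:
limited = df.groupby('city', group_keys=False).apply(lambda g: g.nlargest(3, 'distance'))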

How to groupby, iterate over rows and count in pandas?

I have a dataframe with columns city, date, source and count of each source.
I need to group by city, then iterate over the rows of the date column with the following condition: if the difference between consecutive dates is < 30 days, find the start_date (min) and end_date (max) for that group of dates and add them as separate columns. Also, sum the count for each source type and add those as separate columns too.
city date source count
0 Alexandria 2022-10-13 black 117
1 Alexandria 2022-10-14 black 85
2 Alexandria 2022-10-15 black 63
3 Alexandria 2022-10-16 black 190
4 Alexandria 2022-10-17 black 389
5 Alexandria 2022-10-18 black 284
6 Alexandria 2022-10-19 black 179
7 Amsterdam 2018-08-05 red 1
8 Amsterdam 2018-08-28 red 111
9 Amsterdam 2019-08-17 red 1669
10 Amsterdam 2019-08-18 red 1584
11 Amsterdam 2019-08-19 red 940
12 Amsterdam 2019-08-21 red 1498
13 Amsterdam 2019-08-22 red 2281
14 Amsterdam 2019-08-23 red 2038
15 Amsterdam 2019-08-24 red 1516
16 Amsterdam 2019-08-25 red 1952
17 Amsterdam 2019-08-26 red 1434
18 Amsterdam 2019-08-27 red 881
19 Amsterdam 2019-08-29 red 1482
20 Amsterdam 2019-08-30 red 978
21 Amsterdam 2019-08-31 red 1423
22 Amsterdam 2019-09-01 red 1120
23 Amsterdam 2019-09-02 red 1117
24 Amsterdam 2019-09-06 red 1
To reproduce my dataframe:
import pandas as pd

# initialize list of lists
data1 = [['Alexandria', '2022-10-13', 'black', 117],
         ['Alexandria', '2022-10-14', 'black', 85],
         ['Alexandria', '2022-10-15', 'black', 63],
         ['Alexandria', '2022-10-16', 'black', 190],
         ['Alexandria', '2022-10-17', 'black', 389],
         ['Alexandria', '2022-10-18', 'black', 284],
         ['Alexandria', '2022-10-19', 'black', 179],
         ['Amsterdam', '2018-08-05', 'red', 1],
         ['Amsterdam', '2018-08-28', 'red', 111],
         ['Amsterdam', '2019-08-17', 'red', 1669],
         ['Amsterdam', '2019-08-18', 'red', 1584],
         ['Amsterdam', '2019-08-19', 'red', 940],
         ['Amsterdam', '2019-08-21', 'red', 1498],
         ['Amsterdam', '2019-08-22', 'red', 2281],
         ['Amsterdam', '2019-08-23', 'red', 2038],
         ['Amsterdam', '2019-08-24', 'red', 1516],
         ['Amsterdam', '2019-08-25', 'red', 1952],
         ['Amsterdam', '2019-08-26', 'red', 1434],
         ['Amsterdam', '2019-08-27', 'red', 881],
         ['Amsterdam', '2019-08-29', 'red', 1482],
         ['Amsterdam', '2019-08-30', 'red', 978],
         ['Amsterdam', '2019-08-31', 'red', 1423],
         ['Amsterdam', '2019-09-01', 'red', 1120],
         ['Amsterdam', '2019-09-02', 'red', 1117],
         ['Amsterdam', '2019-09-06', 'red', 1],
        ]

# Create the pandas DataFrame
df_b = pd.DataFrame(data1, columns=['city', 'date', 'source', 'count'])
The output needed (it looks like a pivot table):
So, for Alexandria I have only 1 round where the dates are less than 30 days apart: from 2022-10-13 till 2022-10-19 (row 6). These 2 dates are the min and max.
For Amsterdam we have 2 rounds of dates that are less than 30 days apart: the 1st round is rows 7-8, the 2nd round is rows 9-24.
Between rows 8 and 9 for Amsterdam there are more than 30 days, which starts a new round.
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307 0
1 Amsterdam 2018-08-05 2018-08-28 0 112
2 Amsterdam 2019-08-17 2019-09-06 0 21914
To reproduce the output:
# initialize list of lists
data = [['Alexandria', '2022-10-13', '2022-10-19', 1307, 0],
        ['Amsterdam', '2018-08-05', '2018-08-28', 0, 112],
        ['Amsterdam', '2019-08-17', '2019-09-06', 0, 21914]]

# Create the pandas DataFrame
df_a = pd.DataFrame(data, columns=['city', 'start_date', 'end_date', 'black', 'red'])
I know this does not fully answer the question, but it is hopefully helpful in getting you part of the way there, since you did not show what you have tried (if anything). The 30-days part presumably involves some custom way of grouping and aggregating the data, which I have not attempted here. As such, this answer shows how to:
Create start_date and end_date columns for minimum and maximum dates grouped by city, respectively
Sum the count for each source value into new columns for each identified source value (black and red)
Merge the two resultant dataframes into a single one
This answer assumes the original data is in a dataframe named df.
Minimum and maximum dates
I know this is not exactly what is wanted because of the 30-days logic, but it shows a straightforward grouping by city with the minimum and maximum dates for each.
date_agg_df = df.groupby("city").agg({"date": ["min", "max"]}).reset_index()
date_agg_df.columns = ["city", "start_date", "end_date"]
Data in date_agg_df:
city start_date end_date
0 Alexandria 2022-10-13 2022-10-19
1 Amsterdam 2018-08-05 2019-09-06
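As an aside, named aggregation (available since pandas 0.25) builds the same frame without renaming columns by hand:
date_agg_df = df.groupby("city").agg(start_date=("date", "min"),
                                     end_date=("date", "max")).reset_index()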
Sum the count of each source value
black_red_df = df.fillna(0).pivot_table(index="city", columns="source", values="count", aggfunc="sum").fillna(0).reset_index()
black_red_df.columns = ["city", "black", "red"]
Data in black_red_df:
city black red
0 Alexandria 1307.0 0.0
1 Amsterdam 0.0 22026.0
Merge the dataframes
mrg_df = pd.merge(date_agg_df, black_red_df, on="city")
Data in mrg_df:
city start_date end_date black red
0 Alexandria 2022-10-13 2022-10-19 1307.0 0.0
1 Amsterdam 2018-08-05 2019-09-06 0.0 22026.0
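For the 30-days logic itself, here is a minimal sketch (assuming df_b as built in the question; the round label and column names are illustrative). The idea is to start a new round whenever the gap to the previous date within a city exceeds 30 days, then aggregate per (city, round):
df_b['date'] = pd.to_datetime(df_b['date'])
df_b = df_b.sort_values(['city', 'date'])
# True where the gap to the previous row in the same city exceeds 30 days;
# the cumulative sum of those breaks labels each round
gap = df_b.groupby('city')['date'].diff() > pd.Timedelta(days=30)
df_b['round'] = gap.groupby(df_b['city']).cumsum()
out = (df_b.pivot_table(index=['city', 'round'], columns='source',
                        values='count', aggfunc='sum', fill_value=0)
           .join(df_b.groupby(['city', 'round'])['date']
                     .agg(start_date='min', end_date='max'))
           .reset_index())
This yields one row per round with the black/red sums and the min/max dates, matching the shape of df_a above.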

I want to sort values in a pivot at multiple levels using pandas

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'B': ['Pune', 'Mumbai', 'Pune', 'Mumbai', 'Pune'],
                   'C': [27, 23, 21, 25, 24]})
A B C
0 John Pune 27
1 Boby Mumbai 23
2 Mina Pune 21
3 Peter Mumbai 25
4 Nicky Pune 24
table = pd.pivot_table(df, values='C', index=['B', 'A'], aggfunc=np.sum)
table
table
              C
B      A       
Mumbai Boby  23
       Peter 25
Pune   John  27
       Mina  21
       Nicky 24
I want to sort this pivot by the groupwise sum of 'B' in descending order. Here Pune has 27+21+24 = 72, which is greater than Mumbai (23+25 = 48),
so Pune should come first, then Mumbai.
Within the Pune group the rows should also be in descending order, i.e. John, Nicky and then Mina;
in the Mumbai group, Peter and then Boby.
There could be many values in B.
required output:
              C
B      A       
Pune   John  27
       Nicky 24
       Mina  21
Mumbai Peter 25
       Boby  23
Create a helper level with the aggregated values and use it together with the first level for sorting:
# simplified solution - aggregate sum over all numeric columns
table = df.groupby(['B', 'A']).sum()
table = (table.set_index(table.groupby(level=0)['C'].transform('sum'), append=True)
              .sort_index(level=[-1, 0], ascending=[False, True])
              .droplevel(-1))
print (table)
              C
B      A       
Pune   John  27
       Mina  21
       Nicky 24
Mumbai Boby  23
       Peter 25
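Note the rows within each B group above are still ordered by the index; if you also need them ordered by C descending (John, Nicky, Mina, as in the required output), one possible variant is a temporary helper column instead of a helper index level (a sketch; the _grp name is arbitrary):
table = df.groupby(['B', 'A']).sum()
# total C per B group, broadcast to every row of that group
helper = table.groupby(level=0)['C'].transform('sum')
table = (table.assign(_grp=helper)
              .sort_values(['_grp', 'C'], ascending=False)
              .drop(columns='_grp'))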

Pandas groupby aggregation for non-numeric data

my sample df looks like this:
sid score cat_type
101 70 na
102 56 PNP
101 65 BAW
103 88 SAO
103 50 na
102 42 VVG
105 79 SAE
....
df_groupby = df.groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', '?')
)
With the groupby, I want to get the first row value of cat_type that is not na in each group and assign it to cat_type_first_row.
my final df should look like this:
sid score_max cat_type_first_row
101 70 BAW
102 56 PNP
103 88 SAO
105 79 SAE
....
Could you please assist me in solving this problem?
Try replacing the string 'na' with NaN and aggregating with first (GroupBy.first skips NaN by default):
import numpy as np

df_groupby = df.replace('na', np.nan).groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', 'first')
)
df_groupby
     score_max cat_type_first_row
sid                              
101         70                BAW
102         56                PNP
103         88                SAO
105         79                SAE
If you do not want to drop any row, you could use:
(df.merge((df.dropna(subset=['cat_type'])
             .groupby('sid')['cat_type']
             .first()
             .rename('cat_type_first_row')
           ), on='sid')
)
output:
sid score cat_type cat_type_first_row
0 101 70 NaN BAW
1 101 65 BAW BAW
2 102 56 PNP PNP
3 102 42 VVG PNP
4 103 88 SAO SAO
5 103 50 NaN SAO
6 105 79 SAE SAE
You can define a function that takes the grouped pandas Series as input. I've tested this code and got your desired output (I added rows for the case where all cat_type values are np.nan for a group):
import numpy as np
import pandas as pd

df = {
    'sid': [101, 102, 101, 103, 103, 102, 105, 106, 106],
    'score': [70, 56, 65, 88, 50, 42, 79, 0, 0],
    'cat_type': [np.nan, 'PNP', 'BAW', 'SAO', np.nan, 'VVG', 'SAE', np.nan, np.nan]
}
df = pd.DataFrame(df)
display(df)

def get_cat_type_first_row(series):
    # drop NaNs; if nothing is left, the whole group was NaN
    series_nona = series.dropna()
    if len(series_nona) == 0:
        return np.nan
    else:
        return series_nona.iloc[0]

df_groupby = df.groupby(['sid']).agg(
    score_max=('score', 'max'),
    cat_type_first_row=('cat_type', get_cat_type_first_row)
)
df_groupby
Output:
     score_max cat_type_first_row
sid                              
101         70                BAW
102         56                PNP
103         88                SAO
105         79                SAE
106          0                NaN

Groupby each group and then divide the two ratios (current yr/latest yr for each column)

I'd like to create a new column by dividing the current year by its latest year in Col_1 and Col_2 respectively, for each group. Then divide the two ratios.
Methodology: Calculate (EachYrCol_1/Yr2000Col_1)/(EachYrCol_2/Yr2000Col_2) for each group
See example below:
Year  Group  Col_1  Col_2  New Column
1995  A        100     11  (100/600)/(11/66)
1996  A        200     22  (200/600)/(22/66)
1997  A        300     33  (300/600)/(33/66)
1998  A        400     44  .............
1999  A        500     55  .............
2000  A        600     66  .............
1995  B        700     77  (700/1200)/(77/399)
1996  B        800     88  (800/1200)/(88/399)
1997  B        900     99  (900/1200)/(99/399)
1998  B       1000    199  .............
1999  B       1100    299  .............
2000  B       1200    399  .............
Sample dataset:
import pandas as pd

df = pd.DataFrame({'Year': [1995, 1996, 1997, 1998, 1999, 2000, 1995, 1996, 1997, 1998, 1999, 2000],
                   'Group': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'Col_1': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200],
                   'Col_2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 199, 299, 399]})
Use GroupBy.transform with 'last' to build a helper DataFrame, so each column can be divided elementwise:
df1 = df.groupby('Group').transform('last')
df['New'] = df['Col_1'].div(df1['Col_1']).div(df['Col_2'].div(df1['Col_2']))
print (df)
Year Group Col_1 Col_2 New
0 1995 A 100 11 1.000000
1 1996 A 200 22 1.000000
2 1997 A 300 33 1.000000
3 1998 A 400 44 1.000000
4 1999 A 500 55 1.000000
5 2000 A 600 66 1.000000
6 1995 B 700 77 3.022727
7 1996 B 800 88 3.022727
8 1997 B 900 99 3.022727
9 1998 B 1000 199 1.670854
10 1999 B 1100 299 1.223244
11 2000 B 1200 399 1.000000
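Since New = (Col_1/last_1)/(Col_2/last_2) is algebraically the same as (Col_1/Col_2)/(last_1/last_2), a compact equivalent is possible (a sketch; note that 'last' picks the last row per group, which is year 2000 here only because the data is sorted by Year):
last = df.groupby('Group')[['Col_1', 'Col_2']].transform('last')
df['New'] = (df['Col_1'] / df['Col_2']) / (last['Col_1'] / last['Col_2'])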