How to get 4 percentage slices after groupby? - pandas

I have a dataframe like this:
city age
paris 8
paris 12
paris 45
paris 65
LA 65
LA 78
LA 42
I would like to group by city and get the percentage of people in each of 4 age intervals:
% of people where age < 16 (slice1)
% of people where 16 <= age < 30 (slice2)
% of people where 30 <= age < 40 (slice3)
% of people where age >= 40 (slice4)
Expected output like this:
city slice1 slice2 slice3 slice4
paris 2% 6% 70% 22%
LA 1% 40% 9% 50%
How can I do this with Pandas ?

Use pandas.cut to define age groups and pandas.crosstab with normalize='index' to compute the proportion per city:
import pandas as pd
age_groups = [0, 16, 30, 40]
age_labels = [f'slice{i+1}' for i in range(len(age_groups))]
# right=False makes each bin closed on the left: [0, 16), [16, 30), [30, 40), [40, inf)
ages = pd.cut(df['age'], bins=age_groups+[float('inf')],
              labels=age_labels, right=False)
df_out = (pd.crosstab(df['city'], ages, normalize='index')
            .reindex(age_labels, axis=1, fill_value=0)
         )
output:
>>> df_out
age slice1 slice2 slice3 slice4
city
LA 0.0 0 0 1.0
paris 0.5 0 0 0.5
as percent:
>>> df_out*100
age slice1 slice2 slice3 slice4
city
LA 0.0 0 0 100.0
paris 50.0 0 0 50.0
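To get strings such as 50% like the expected output in the question, the proportions can be rounded and formatted. This is only a minimal sketch, assuming the sample data from the question; the names edges, pct and formatted are illustrative and not part of the answer above:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "city": ["paris", "paris", "paris", "paris", "LA", "LA", "LA"],
    "age": [8, 12, 45, 65, 65, 78, 42],
})

edges = [0, 16, 30, 40, float("inf")]              # bin edges, left-closed
labels = ["slice1", "slice2", "slice3", "slice4"]
ages = pd.cut(df["age"], bins=edges, labels=labels, right=False)

# Row-normalised counts per city, scaled to percent
pct = (
    pd.crosstab(df["city"], ages, normalize="index")
    .reindex(labels, axis=1, fill_value=0)
    .mul(100)
)

# Round to whole percents and append the % sign (values become strings)
formatted = pct.round(0).astype(int).astype(str) + "%"
print(formatted)
```

Note that the formatted frame holds strings, so keep pct around if you still need to do arithmetic on the numbers.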

Related

How to convert wide dataframe to long based on similar column

I have a pandas dataframe like this
and I want to convert it to the dataframe below.
I am not sure how to use the pd.wide_to_long function here.
Below is the dataset for creating the dataframe:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
Convert the Date column to the index. For all other columns, remove any leading or trailing spaces with str.strip, replace the remaining whitespace with :, and split on one or more : to get a MultiIndex. You can then reshape with DataFrame.stack, name the index levels with DataFrame.rename_axis, and turn them into columns with DataFrame.reset_index:
df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace(r'\s+', ':', regex=True).str.split('[:]+', expand=True)
df1 = df1.stack([0,1]).rename_axis(['Date','Symbol','Gender']).reset_index()
print (df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
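For comparison, the same reshape can be sketched with melt, str.extract and pivot, which avoids building a MultiIndex on the columns. This is an alternative under stated assumptions, not the method above: it assumes every header has the form symbol:gender job (with optional surrounding spaces), and the names raw, long, parts and tidy are made up for the example.

```python
import io
import pandas as pd

# Dataset copied from the question (header spacing left untouched)
raw = """Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
"""
df = pd.read_csv(io.StringIO(raw))

# One row per (Date, original header)
long = df.melt(id_vars="Date", var_name="header")

# Pull symbol, gender and job out of each header, ignoring stray spaces
parts = long["header"].str.extract(r"^\s*(\w+):\s*(\w+)\s+(\w+)\s*$")
parts.columns = ["Symbol", "Gender", "Job"]

# Spread the jobs back out into columns
tidy = pd.concat([long[["Date", "value"]], parts], axis=1)
out = (
    tidy.pivot(index=["Date", "Symbol", "Gender"], columns="Job", values="value")
    .reset_index()
    .rename_axis(columns=None)
)
print(out)
```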
pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
index = 'Date',
names_to = ('symbol', 'gender', '.value'),
names_pattern = r"(.+):\s*(.+)\s+(.+)",
sort_by_appearance = True)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has capture groups: any group paired with .value stays as a column header, and the rest become column values.
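To see what each capture group picks up, the pattern can be tried on a few headers with Python's re module. This is only an illustration of the regular expression itself, not of how pyjanitor applies it internally; the headers are stripped first here, which is an assumption, since a trailing space would otherwise end up in the last group.

```python
import re

# Same pattern as passed to names_pattern above
pattern = re.compile(r"(.+):\s*(.+)\s+(.+)")

headers = [" IN:Male teacher ", " IN: Male Atronaut ", "GB:female engineer"]

captured = []
for header in headers:
    # groups() returns (symbol, gender, job) for a matching header
    groups = pattern.match(header.strip()).groups()
    captured.append(groups)
    print(header.strip(), "->", groups)
```

The first two groups map to symbol and gender, and the third, paired with .value, supplies the new column names (teacher, engineer, Atronaut).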

ValueError: grouper for xxx not 1-dimensional with pandas pivot_table()

I am working on an Olympics dataset and want to create another dataframe that has the total number of athletes and the total number of medals won, by type, for each country.
Using following pivot_table gives me an error "ValueError: Grouper for 'ID' not 1-dimensional"
pd.pivot_table(olymp, index='NOC', columns=['ID','Medal'], values=['ID','Medal'], aggfunc={'ID':pd.Series.nunique,'Medal':'count'}).sort_values(by='Medal')
Result should have one row for each country with columns for totalAthletes, gold, silver, bronze. Not sure how to go about it using pivot_table. I can do this using merge of crosstab but would like to use just one pivottable statement.
Here is what original df looks like.
Update
I would like to get the medal breakdown as well, e.g. gold, silver, bronze. I also need a unique count of athlete IDs, so I use nunique, since one athlete may participate in multiple events. The same goes for medals, ignoring NA values.
IIUC:
out = df.pivot_table('ID', 'NOC', 'Medal', aggfunc='count', fill_value=0)
out['ID'] = df[df['Medal'].notna()].groupby('NOC')['ID'].nunique()
Output:
>>> out
Medal Bronze Gold Silver ID
NOC
AFG 2 0 0 1
AHO 0 0 1 1
ALG 8 5 4 14
ANZ 5 20 4 25
ARG 91 91 92 231
.. ... ... ... ...
VIE 0 1 3 3
WIF 5 0 0 4
YUG 93 130 167 317
ZAM 1 0 1 2
ZIM 1 17 4 16
[149 rows x 4 columns]
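A small self-contained sketch of the same two-step idea, on made-up data (the six rows below are hypothetical, not taken from the real dataset). Unlike the snippet above, it counts every athlete per country, medalist or not, under a column named totalAthletes; that column name and the choice to include non-medalists are assumptions based on the wording of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Olympics data: one row per athlete and event
olymp = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4, 4],
    "NOC": ["USA", "USA", "USA", "FRA", "FRA", "FRA"],
    "Medal": ["Gold", "Silver", np.nan, "Bronze", np.nan, "Gold"],
})

# Medal counts per country; rows with a missing Medal are ignored
medals = olymp.pivot_table(values="ID", index="NOC", columns="Medal",
                           aggfunc="count", fill_value=0)

# Distinct athletes per country, aligned on the NOC index
medals["totalAthletes"] = olymp.groupby("NOC")["ID"].nunique()
print(medals)
```

To count only medal-winning athletes, as the snippet above does, filter with olymp[olymp["Medal"].notna()] before the groupby.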
Old answer
You can't have the same column for columns and values:
out = olymp.pivot_table(index='NOC', values=['ID','Medal'],
aggfunc={'ID':pd.Series.nunique, 'Medal':'count'}) \
.sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
Another way to get the result above:
out = olymp.groupby('NOC').agg({'ID': pd.Series.nunique, 'Medal': 'count'}) \
.sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]

Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:
gender  city     age     time  value
male    newyork  10_20y  2010  10.5
female  newyork  10_20y  2010  11
I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).
The time dimension should not be aggregated. Example of means added for different combinations:
gender     city       age     time  value
total      total      total   2010  (mean)
male       total      total   2010  (mean)
male       newyork    total   2010  (mean)
female     total      total   2010  (mean)
female     newyork    total   2010  (mean)
... total  ... total  10_20y  2010  (mean)
Calling groupby on several columns groups by every combination of values in those columns, so df.groupby(["city", "age"]).mean() gives the means for the ("total", city, age) combinations. The difficulty is that you want this for every subset, of every size, of ["gender", "city", "age"]. That is not straightforward, and there may be a more Pythonic way, but here is my proposal:
import itertools
import numpy as np
import pandas as pd
## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender": np.random.choice(genders, n),
                   "city": np.random.choice(cities, n),
                   "age": np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})
## collect one aggregated frame per combination and concatenate at the end
## (DataFrame.append was removed in pandas 2.0)
frames = []
## list the combinations of every size, from 0 to len(cols)
for size in range(0, len(cols) + 1):
    for combo in itertools.combinations(cols, size):
        if size != 0:
            ## non-empty combination: group by it and average "value";
            ## reset_index turns the group keys back into columns
            agg_df = df.groupby(list(combo))["value"].mean().reset_index()
        else:
            ## empty combination: the mean of the whole dataframe, a single number
            agg_df = pd.DataFrame({"value": [df["value"].mean()]})
        ## at this point agg_df has size+1 columns; for ["gender"], for instance,
        ## only "gender" and "value" exist, and "city" and "age" were aggregated away
        for col in cols:
            if col not in combo:
                agg_df[col] = "total"  ## adding total as you asked for
        frames.append(agg_df[cols + ["value"]])
new = pd.concat(frames)
For instance, I get:
new
gender city age value
0 total total total 9.750000
0 female total total 8.083333
1 male total total 12.250000
0 total chicago total 8.000000
1 total losangeles total 11.428571
2 total newyork total 11.666667
0 total total 10_20y 8.100000
1 total total 20_30y 11.400000
0 female chicago total 7.333333
1 female losangeles total 9.250000
2 female newyork total 8.000000
3 male chicago total 9.000000
4 male losangeles total 14.333333
5 male newyork total 19.000000
0 female total 10_20y 7.333333
1 female total 20_30y 8.833333
2 male total 10_20y 9.250000
3 male total 20_30y 15.250000
0 total chicago 10_20y 7.285714
1 total chicago 20_30y 9.666667
2 total losangeles 10_20y 10.000000
3 total losangeles 20_30y 12.500000
4 total newyork 20_30y 11.666667
0 female chicago 10_20y 7.250000
1 female chicago 20_30y 7.500000
2 female losangeles 10_20y 7.500000
3 female losangeles 20_30y 11.000000
4 female newyork 20_30y 8.000000
5 male chicago 10_20y 7.333333
6 male chicago 20_30y 14.000000
7 male losangeles 10_20y 15.000000
8 male losangeles 20_30y 14.000000
9 male newyork 20_30y 19.000000
This is a starting point that you can adapt to your needs.
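The question also asks that the time dimension not be aggregated, which the loop above does not handle. One possible adaptation, offered as a sketch rather than a definitive solution, is to always keep time among the grouping keys. The four-row dataframe below is invented for illustration, and the names dims and totals are arbitrary. Only proper subsets of the dimensions are used, since grouping by all three simply reproduces the per-row data.

```python
import itertools
import pandas as pd

# Hypothetical data with the columns from the question
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "city": ["newyork", "newyork", "chicago", "chicago"],
    "age": ["10_20y", "10_20y", "20_30y", "10_20y"],
    "time": [2010, 2010, 2010, 2011],
    "value": [10.5, 11.0, 8.0, 6.0],
})

dims = ["gender", "city", "age"]
frames = []
for size in range(len(dims)):                      # subset sizes 0, 1 and 2
    for keep in itertools.combinations(dims, size):
        keys = list(keep) + ["time"]               # time is never aggregated
        agg = df.groupby(keys, as_index=False)["value"].mean()
        for col in dims:
            if col not in keep:
                agg[col] = "total"                 # label the collapsed dimensions
        frames.append(agg[dims + ["time", "value"]])

totals = pd.concat(frames, ignore_index=True)
print(totals)
```

The result has the same columns as the original table, so pd.concat([df, totals]) appends the totals to it.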

Overall sum by groupby pandas

I have a dataframe as shown below, which is area usage of whole city say Bangalore.
Sector Plot Usage Status Area
A 1 Villa Constructed 40
A 2 Residential Constructed 50
A 3 Substation Not_Constructed 120
A 4 Villa Not_Constructed 60
A 5 Residential Not_Constructed 30
A 6 Substation Constructed 100
B 1 Villa Constructed 80
B 2 Residential Constructed 60
B 3 Substation Not_Constructed 40
B 4 Villa Not_Constructed 80
B 5 Residential Not_Constructed 100
B 6 Substation Constructed 40
Bangalore consists of two sectors, A and B.
From the above I would like to calculate the total area of Bangalore and its distribution by usage and status.
Expected Output:
City Total_Area %_Villa %_Resid %_Substation %_Constructed %_Not_Constructed
Bangalore(A+B) 800 32.5 30 37.5 46.25 53.75
I think you need to set the Sector column to a single scalar value before applying the solution (assuming there are only sectors A and B):
df['Sector'] = 'Bangalore(A+B)'
#aggregate sum per 2 columns Sector and Usage
df1 = df.groupby(['Sector', 'Usage'])['Area'].sum()
#percentage by division of total per Sector
df1 = df1.div(df1.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#aggregate sum per 2 columns Sector and Status
df2 = df.groupby(['Sector', 'Status'])['Area'].sum()
df2 = df2.div(df2.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#total Area per Sector
s = df.groupby('Sector')['Area'].sum().rename('Total_area')
#join all together
dfA = pd.concat([s, df1, df2], axis=1).reset_index()
print (dfA)
Sector Total_area %_Residential %_Substation %_Villa \
0 Bangalore(A+B) 800 30.0 37.5 32.5
%_Constructed %_Not_Constructed
0 46.25 53.75
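When the whole table belongs to a single city, the grouping by Sector can be skipped altogether. The following is a compact sketch under that assumption, with the data from the question typed in by hand; the names total, usage_pct, status_pct and summary are illustrative only.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "Sector": list("AAAAAABBBBBB"),
    "Plot": [1, 2, 3, 4, 5, 6] * 2,
    "Usage": ["Villa", "Residential", "Substation"] * 4,
    "Status": ["Constructed", "Constructed", "Not_Constructed",
               "Not_Constructed", "Not_Constructed", "Constructed"] * 2,
    "Area": [40, 50, 120, 60, 30, 100, 80, 60, 40, 80, 100, 40],
})

total = df["Area"].sum()

# Share of the total area per usage and per status, in percent
usage_pct = df.groupby("Usage")["Area"].sum().div(total).mul(100).add_prefix("%_")
status_pct = df.groupby("Status")["Area"].sum().div(total).mul(100).add_prefix("%_")

# Assemble everything into a one-row summary
summary = (
    pd.concat([pd.Series({"Total_Area": total}), usage_pct, status_pct])
    .to_frame("Bangalore(A+B)")
    .T
)
print(summary)
```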
A simple pivot table can help!
1. One-line pandas solution: 80% of the work done
pv = df.pivot_table(values='Area', aggfunc='sum', index=['Status'], columns=['Usage'], margins=True, margins_name='Total', fill_value=0).unstack()
2. Now formatting as %: 90% of the work done
# the usage values in the data are 'Villa', 'Residential' and 'Substation'
total = float(pv['Total']['Total'])
ans = (pd.DataFrame([[pv['Villa']['Total'] / total, pv['Residential']['Total'] / total, pv['Substation']['Total'] / total, pv['Total']['Constructed'] / total, pv['Total']['Not_Constructed'] / total]]) * 100).round(2)
3. Adding the Total column: 99% of the work done
ans['Total'] = pv['Total']['Total']
4. Renaming the columns and arranging them in your expected order: and done!
ans.columns = ['%_Villa','%_Resid','%_Substation','%_Constructed','%_Not_Constructed','Total']
ans = ans[['Total','%_Villa','%_Resid','%_Substation','%_Constructed','%_Not_Constructed']]

How to plot this graph?

total_income = df.groupby('genres')['gross'].sum()
average_income = df.groupby('genres')['gross'].mean()
total_income.plot.bar(label="Total Income", color = 'r')
average_income.plot.bar(label="Average Income")
plt.xlabel("Genres")
plt.ylabel("Dollars (Gross)")
plt.yscale("log")
Here's my code that plots the sum and average of gross by movie genre. The problem is that the plot comes out completely black. I believe this is because the genre strings are long, since each one contains multiple genres.
How can I fix this so that it shows the graph and its genres? I need assistance.
You can use str.split to get lists of genres, then str.len to get the length of each list.
Finally, create a new DataFrame with the constructor, using numpy.repeat and numpy.concatenate:
df = pd.DataFrame({'genres':['Comedy|Crime|Drama|Thriller','Comedy|Crime|Drama','Comedy|Crime','Drama|Thriller','Drama','Comedy|Crime'],
'gross':[10,20,30,40,50,60]})
print (df)
genres gross
0 Comedy|Crime|Drama|Thriller 10
1 Comedy|Crime|Drama 20
2 Comedy|Crime 30
3 Drama|Thriller 40
4 Drama 50
5 Comedy|Crime 60
splitted = df['genres'].str.split('|')
l = splitted.str.len()
df = pd.DataFrame({'gross': np.repeat(df['gross'], l), 'genres':np.concatenate(splitted)})
print (df)
genres gross
0 Comedy 10
0 Crime 10
0 Drama 10
0 Thriller 10
1 Comedy 20
1 Crime 20
1 Drama 20
2 Comedy 30
2 Crime 30
3 Drama 40
3 Thriller 40
4 Drama 50
5 Comedy 60
5 Crime 60
d = {'mean':'Average','sum':'Total'}
df1 = df.groupby('genres')['gross'].agg(['sum','mean']).rename(columns=d)
print (df1)
Total Average
genres
Comedy 120 30
Crime 120 30
Drama 120 30
Thriller 50 25
df1.plot.bar()
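On pandas 0.25 or later, the repeat and concatenate step can also be written with DataFrame.explode. This is an alternative sketch using the same sample data as above, not a change to the approach; the name exploded is arbitrary.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'genres': ['Comedy|Crime|Drama|Thriller', 'Comedy|Crime|Drama',
               'Comedy|Crime', 'Drama|Thriller', 'Drama', 'Comedy|Crime'],
    'gross': [10, 20, 30, 40, 50, 60],
})

# Split each genre string into a list, then give every genre its own row
exploded = df.assign(genres=df['genres'].str.split('|')).explode('genres')

# Total and average gross per single genre
df1 = (
    exploded.groupby('genres')['gross']
    .agg(['sum', 'mean'])
    .rename(columns={'sum': 'Total', 'mean': 'Average'})
)
print(df1)
```

Calling df1.plot.bar() on this result then draws one pair of bars per genre, as before.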