using pandas dataframe groupby agg function - pandas

There is a dataframe, say
df
Country Continent PopulationEst
0 Germany Europe 8.036970e+07
1 Canada North America 3.523986e+07
...
I want to create a dataframe that displays the size (the number of countries in each continent), and the sum, mean, and standard deviation of the estimated population for each continent.
I did the following:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
But the result df2 has multiple level columns like below:
df2.columns
MultiIndex(levels=[['PopulationEst'], ['size', 'sum', 'mean', 'std']],
labels=[[0, 0, 0, 0], [0, 1, 2, 3]])
How can I remove the PopulationEst from the columns, so just have ['size', 'sum', 'mean', 'std'] columns for the dataframe?

I think you need to add ['PopulationEst'] - agg then uses only this column for aggregation:
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
Sample:
df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
    'PopulationEst': [8, 4, 35, 50],
    'Continent': ['Europe', 'Europe', 'North America', 'North America']},
    columns=['Country', 'PopulationEst', 'Continent'])
print(df)
Country PopulationEst Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
print (df2)
PopulationEst
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
Another solution is with MultiIndex.droplevel:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
df2.columns = df2.columns.droplevel(0)
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602

I think this could do what you need (note that the dict keys must match your actual column names - PopulationEst rather than PopEst - and in recent pandas the grouping column itself cannot appear in the dict, so request 'size' on the aggregated column):
grouping = {'PopulationEst': ['size', 'sum', 'mean', 'std']}
df.groupby('Continent').agg(grouping)
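Since pandas 0.25, named aggregation produces the flat ['size', 'sum', 'mean', 'std'] columns directly, with no MultiIndex to drop afterwards (a sketch using the sample df from the first answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
    'PopulationEst': [8, 4, 35, 50],
    'Continent': ['Europe', 'Europe', 'North America', 'North America']})

# each keyword becomes one flat output column: name=(source_column, func)
df2 = df.groupby('Continent').agg(
    size=('PopulationEst', 'size'),
    sum=('PopulationEst', 'sum'),
    mean=('PopulationEst', 'mean'),
    std=('PopulationEst', 'std'),
)
print(df2)
```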

Related

extract conditional max based on a reference data frame

my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
a toy working df:
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR'],
['Afghanistan','AFN'],['Afghanistan','ALL'],['Afghanistan','EUR']],
columns=['country','currency'])
My working df may pair country and currency differently, for example country == 'France' with currency == 'AFN'. I would like to create a column with the max score based on either match, i.e., this country/currency combo would imply a score of 4.
Desired output:
Out[102]:
country currency score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
6 Afghanistan AFN 4
7 Afghanistan ALL 4
8 Afghanistan EUR 4
Here is what I have so far, but it's multiline and extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)
Here's a way to do it:
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
Explanation:
for each column in tbl other than score (in your case, country and currency), create a single-column frame with that column as index
use pd.concat() to create a new dataframe with multiple columns, each a Series object created using join() between the working df and one of the Series objects from the previous step
use max() on each row to get the desired em_score.
Full test code with sample df:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR']],
columns=['country','currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
country currency score
0 Afghanistan AFN 4
1 Albania ALL 2
2 France EUR 1
df
country currency
0 France AFN
1 France ALL
2 France EUR
3 Albania AFN
4 Albania ALL
5 Albania EUR
output
country currency em_score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
So, for the case where you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1],
['France', 'AFN', 0]],
columns=['country', 'currency', 'score'])
This code will find, for each row, the max of the best score for its country and the best score for its currency (it assumes numpy is imported as np):
np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
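The same idea can be written as a self-contained sketch against the working df: map each row's country and currency to their best scores in tbl and take the row-wise max (the em_score column name is taken from the question):

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
                    ['Albania', 'ALL', 2],
                    ['France', 'EUR', 1]],
                   columns=['country', 'currency', 'score'])
df = pd.DataFrame(
    [['France', 'AFN'], ['France', 'ALL'], ['France', 'EUR'],
     ['Albania', 'AFN'], ['Albania', 'ALL'], ['Albania', 'EUR']],
    columns=['country', 'currency'])

# best score seen for each country and for each currency
country_max = tbl.groupby('country')['score'].max()
currency_max = tbl.groupby('currency')['score'].max()

# row-wise max of the two lookups
df['em_score'] = np.maximum(df['country'].map(country_max),
                            df['currency'].map(currency_max))
print(df)
```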

How to group rows together based on conditions from a list? Pandas

I want to be able to group rows into one if they have matching values in certain columns, however I only want them to be grouped if the value is in a list. For example,
team_sports = ['football', 'basketball']
view of df
country sport age
USA football 21
USA football 28
USA golf 20
USA golf 44
China football 30
China basketball 22
China basketball 41
wanted outcome
country sport age
USA football 21,28
USA golf 20
USA golf 44
China football 30
China basketball 22,41
The attempt I made was,
team_sports = ['football', 'basketball']
for i in df['Sport']:
    if i in team_sports:
        group_df = df.groupby(['Country', 'Sport'])['Age'].apply(list).reset_index()
This is taking forever to run, the database I'm using has 100,000 rows.
Really appreciate any help, thanks
The more straightforward approach is to separate the DataFrame into the rows where the sport column isin the list of team_sports and the rows where it is not, groupby aggregate the first part, then concat back together:
team_sports = ['football', 'basketball']
m = df['sport'].isin(team_sports)
cols = ['country', 'sport']
group_df = pd.concat([
# Group those that do match condition
df[m].groupby(cols, as_index=False)['age'].agg(list),
# Leave those that don't match condition as is
df[~m]
], ignore_index=True).sort_values(cols)
*sort_values is optional to regroup country and sport together
group_df:
country sport age
0 China basketball [22, 41]
1 China football [30]
2 USA football [21, 28]
3 USA golf 20
4 USA golf 44
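If the desired outcome is the comma-separated strings shown in the question (e.g. 21,28) rather than Python lists, the same split/concat idea works with a string aggregator (a sketch; the join lambda is an assumption about the wanted format):

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
    'sport': ['football', 'football', 'golf', 'golf', 'football',
              'basketball', 'basketball'],
    'age': [21, 28, 20, 44, 30, 22, 41]})
team_sports = ['football', 'basketball']

m = df['sport'].isin(team_sports)
cols = ['country', 'sport']
group_df = pd.concat([
    # join the ages of each team-sport group into one string
    df[m].groupby(cols, as_index=False)['age']
         .agg(lambda s: ','.join(map(str, s))),
    # individual sports keep one row per entry (cast age for consistency)
    df[~m].astype({'age': str})
], ignore_index=True).sort_values(cols)
print(group_df)
```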
The less straightforward approach would be to create a new grouping level based on whether or not a value is in the list of team sports using isin + cumsum:
team_sports = ['football', 'basketball']
group_df = (
df.groupby(
['country', 'sport',
(~df['sport'].sort_values().isin(team_sports)).cumsum().sort_index()],
as_index=False,
sort=False
)['age'].agg(list)
)
group_df:
country sport age
0 USA football [21, 28]
1 USA golf [20]
2 USA golf [44]
3 China football [30]
4 China basketball [22, 41]
How the groups are created:
team_sports = ['football', 'basketball']
print(pd.DataFrame({
'country': df['country'],
'sport': df['sport'],
'not_in_team_sports': (~df['sport'].sort_values()
.isin(team_sports)).cumsum().sort_index()
}))
country sport not_in_team_sports
0 USA football 0
1 USA football 0
2 USA golf 1 # golf 1
3 USA golf 2 # golf 2 (not in the same group)
4 China football 0
5 China basketball 0
6 China basketball 0
*sort_values is necessary here so that sport groups are not interrupted by sports that are not in the list.
df = pd.DataFrame({
'country': ['USA', 'USA', 'USA'],
'sport': ['football', 'golf', 'football'],
'age': [21, 28, 20]
})
team_sports = ['football', 'basketball']
print(pd.DataFrame({
'country': df['country'],
'sport': df['sport'],
'not_sorted': (~df['sport'].isin(team_sports)).cumsum(),
'sorted': (~df['sport'].sort_values()
.isin(team_sports)).cumsum().sort_index()
}))
country sport not_sorted sorted
0 USA football 0 0
1 USA golf 1 1
2 USA football 1 0 # football 1 (separate group if not sorted)
Sorting ensures that the football rows go together, so this does not happen.
Setup:
import pandas as pd
df = pd.DataFrame({
'country': ['USA', 'USA', 'USA', 'USA', 'China', 'China', 'China'],
'sport': ['football', 'football', 'golf', 'golf', 'football', 'basketball',
'basketball'],
'age': [21, 28, 20, 44, 30, 22, 41]
})

Comparing strings in two different dataframe and adding a column [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two dataframes as follows:
df1 =
Index Name Age
0 Bob1 20
1 Bob2 21
2 Bob3 22
The second dataframe is as follows -
df2 =
Index Country Name
0 US Bob1
1 UK Bob123
2 US Bob234
3 Canada Bob2
4 Canada Bob987
5 US Bob3
6 UK Mary1
7 UK Mary2
8 UK Mary3
9 Canada Mary65
I would like to compare the names from df1 to the countries in df2 and create a new dataframe as follows:
Index Country Name Age
0 US Bob1 20
1 Canada Bob2 21
2 US Bob3 22
Thank you.
Using merge() should solve the problem.
df3 = pd.merge(df1, df2, on='Name')
Full example:
import pandas as pd
df1 = pd.DataFrame({ "Name":["Bob1", "Bob2", "Bob3"], "Age":[20,21,22]})
df2 = pd.DataFrame({ "Country":["US", "UK", "US", "Canada", "Canada", "US", "UK", "UK", "UK", "Canada"],
"Name":["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3", "Mary1", "Mary2", "Mary3", "Mary65"]})
df3 = pd.merge(df1, df2, on='Name')
df3
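merge defaults to an inner join, so only the names present in both frames survive, which is exactly the desired output here. A sketch showing this, plus a left join (an alternative, not asked for in the question) that would instead keep every row of df2 with NaN ages for unmatched names:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob1", "Bob2", "Bob3"], "Age": [20, 21, 22]})
df2 = pd.DataFrame({
    "Country": ["US", "UK", "US", "Canada", "Canada", "US",
                "UK", "UK", "UK", "Canada"],
    "Name": ["Bob1", "Bob123", "Bob234", "Bob2", "Bob987", "Bob3",
             "Mary1", "Mary2", "Mary3", "Mary65"]})

# inner join (the default): only names appearing in both frames
df3 = pd.merge(df1, df2, on='Name')
print(df3)

# left join: keep every row of df2; Age is NaN where the name has no match
df4 = pd.merge(df2, df1, on='Name', how='left')
```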

Row filtering with respect to intersection of 2 columns

I have a data frame named data_2010 with 3 columns CountryName, IndicatorName and Value.
For eg.
data_2010
CountryName IndicatorName Value
4839015 Arab World Access to electricity (% of population) 8.434222e+01
4839016 Arab World Access to electricity, rural (% of rural popul... 7.196990e+01
4839017 Arab World Access to electricity, urban (% of urban popul... 9.382846e+01
4839018 Arab World Access to non-solid fuel (% of population) 8.600367e+01
4839019 Arab World Access to non-solid fuel, rural (% of rural po... 7.455260e+01
... ... ... ...
5026216 Zimbabwe Urban population (% of total) 3.319600e+01
5026217 Zimbabwe Urban population growth (annual %) 1.279630e+00
5026218 Zimbabwe Use of IMF credit (DOD, current US$) 5.287290e+08
5026219 Zimbabwe Vitamin A supplementation coverage rate (% of ... 4.930002e+01
5026220 Zimbabwe Women's share of population ages 15+ living wi... 5.898546e+01
The problem is that there are 247 unique countries and 1299 unique IndicatorNames, and not every country has data for all the indicators. I want the set of countries and indicator names such that every country has data for the same indicator names, and vice versa.
(Edit)
df:
df = pd.DataFrame({'CountryName': ['USA', 'USA','USA','UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
Expected output for df:
CountryName IndicatorName value
USA elec 1
USA fuel 3
UAE elec 4
UAE fuel 5
Zimbabwe elec 8
Zimbabwe fuel 9
Solution not working for this case:
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission','population'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
Output got:
CountryName IndicatorName value
0 Saudi fuel 6
1 Saudi population 7
2 UAE elec 4
3 UAE fuel 5
4 USA elec 1
5 USA fuel 3
6 Zimbabwe elec 8
7 Zimbabwe fuel 9
Output expected:
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
Though Saudi has 2 indicators, they're not common to the rest.
For example, if Saudi had 3 indicators like ['elec', 'fuel', 'credit'], then Saudi would be added to the final df with elec and fuel.
You can groupby IndicatorName, get the number of unique countries that have the indicator name, then filter your df to keep only the rows that have that indicator for > 1 country.
Nit: your CountryName column is missing a comma between 'USA' 'UAE', fixed below.
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df_indicators = df.groupby('IndicatorName', as_index=False)['CountryName'].nunique()
df_indicators = df_indicators.rename(columns={'CountryName': 'CountryCount'})
df_indicators = df_indicators[df_indicators['CountryCount'] > 1]
# merge on only the indicator column, how='inner' - which is the default so no need to specify
# to keep only those indicators that have a value for > 1 country
df2use = df.merge(df_indicators[['IndicatorName']], on=['IndicatorName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
5 Saudi fuel 6
1 UAE elec 4
4 UAE fuel 5
0 USA elec 1
3 USA fuel 3
2 Zimbabwe elec 8
6 Zimbabwe fuel 9
Looks like you also want to exclude Saudi because, although it has fuel, it has only 1 common IndicatorName. If so, you can use a similar process for countries rather than indicators, starting with only the countries and indicators that survived the first round of filtering, so after the code above use:
df_countries = df2use.groupby('CountryName', as_index=False)['IndicatorName'].nunique()
df_countries = df_countries.rename(columns={'IndicatorName': 'IndicatorCount'})
df_countries = df_countries[df_countries['IndicatorCount'] > 1]
df2use = df2use.merge(df_countries[['CountryName']], on=['CountryName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
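Note that dropping countries can in turn leave some indicators shared by only one country, so in general the two filters may need to be repeated until nothing changes. A sketch of that fixed-point loop (the loop itself is an extension beyond the two-pass answer above), using transform to filter in place:

```python
import pandas as pd

df = pd.DataFrame({
    'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi',
                    'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
    'IndicatorName': ['elec', 'area', 'fuel', 'elec', 'fuel', 'fuel',
                      'population', 'elec', 'fuel', 'co2 emission',
                      'population'],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})

out = df
while True:
    n = len(out)
    # keep indicators shared by more than one country
    out = out[out.groupby('IndicatorName')['CountryName']
                 .transform('nunique') > 1]
    # keep countries with more than one surviving indicator
    out = out[out.groupby('CountryName')['IndicatorName']
                 .transform('nunique') > 1]
    if len(out) == n:  # stop once a full pass removes nothing
        break
out = out.sort_values(['CountryName', 'IndicatorName'])
print(out)
```

On the extended sample with Spain, the first pass drops Spain, the second pass then drops Saudi (its population indicator is no longer shared), and the loop stops with only UAE, USA, and Zimbabwe remaining.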

How to add data in separate columns in Pandas DataFrame?

question:
goldmedal = pd.DataFrame({'Country': ['india', 'japan', 'korea'],
'Medals': [5, 3, 4]}
)
silvermedal = pd.DataFrame({'Country': ['india', 'china', 'korea'],
'Medals': [9, 0, 6]}
)
bronzemedal = pd.DataFrame({'Country': ['japan', 'india', 'vietnam'],
'Medals': [4, 2, 2]}
)
I need to find the cumulative medals earned by the mentioned countries.
I tried this
add function: goldmedal.add(silvermedal, fill_value=0), output:
Country Medals
0 indiaindia 14
1 japanchina 3
2 koreakorea 10
merge function: pd.merge(goldmedal, silvermedal, how='inner', on='Country'), output:
Country Medalsx Medalsy
0 india 5 9
1 korea 4 6
How do I get the following output?
Country Medals
0 india 16
1 china 0
2 korea 10
3 vietnam 2
4 japan 7
pd.concat([goldmedal, silvermedal, bronzemedal]).groupby('Country').sum().reset_index()
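As a full runnable sketch (groupby sorts countries alphabetically, so the row order differs from the listing in the question, but the totals match):

```python
import pandas as pd

goldmedal = pd.DataFrame({'Country': ['india', 'japan', 'korea'],
                          'Medals': [5, 3, 4]})
silvermedal = pd.DataFrame({'Country': ['india', 'china', 'korea'],
                            'Medals': [9, 0, 6]})
bronzemedal = pd.DataFrame({'Country': ['japan', 'india', 'vietnam'],
                            'Medals': [4, 2, 2]})

# stack the three frames vertically, then sum medals per country
total = (pd.concat([goldmedal, silvermedal, bronzemedal])
           .groupby('Country', as_index=False)['Medals'].sum())
print(total)
```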