Related
my reference data frame is of the following type:
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
a toy working df:
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR'],
['Afghanistan','AFN'],['Afghanistan','ALL'],['Afghanistan','EUR']],
columns=['country','currency'])
As my working df may have country and currency differently, for example country =='France' and 'currency'=='AFN', I would like to create a column with max score based on either, i.e., this country/currency combo would imply a score of 4.
Desired output:
Out[102]:
country currency score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
6 Afghanistan AFN 4
7 Afghanistan ALL 4
8 Afghanistan EUR 4
Here is what I have so far, but it's multiline and extremely clunky:
df = pd.merge(df, tbl[['country', 'score']],
how='left', on='country')
df['em_score'] = df['score']
df = df.drop('score', axis=1)
df = pd.merge(df, tbl[['currency', 'score']],
how='left', on='currency')
df['em_score'] = df[['em_score', 'score']].max(axis=1)
df = df.drop('score', axis=1)
Here's a way to do it:
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
Explanation:
for each column in tbl other than score (in your case, country and currency), create a Series with that column as index
use pd.concat() to create a new dataframe with multiple columns, each a Series object created using join() between the working df and one of the Series objects from the previous step
use max() on each row to get the desired em_score.
Full test code with sample df:
import pandas as pd
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1]],
columns=['country', 'currency', 'score'])
df = pd.DataFrame(
[['France','AFN'],['France','ALL'],['France','EUR'],
['Albania','AFN'],['Albania','ALL'],['Albania','EUR']],
columns=['country','currency'])
print('','tbl',tbl,sep='\n')
print('','df',df,sep='\n')
byCol = {col:tbl[[col,'score']].set_index(col) for col in tbl.columns if col != 'score'}
df['em_score'] = pd.concat([
df.join(byCol[col], on=col).score.rename('score_' + col) for col in byCol
], axis=1).max(axis=1)
print('','output',df,sep='\n')
Output:
tbl
country currency score
0 Afghanistan AFN 4
1 Albania ALL 2
2 France EUR 1
df
country currency
0 France AFN
1 France ALL
2 France EUR
3 Albania AFN
4 Albania ALL
5 Albania EUR
output
country currency em_score
0 France AFN 4
1 France ALL 2
2 France EUR 1
3 Albania AFN 4
4 Albania ALL 2
5 Albania EUR 2
So, for case if you have
tbl = pd.DataFrame([['Afghanistan', 'AFN', 4],
['Albania', 'ALL', 2],
['France', 'EUR', 1],
['France', 'AFN', 0]],
columns=['country', 'currency', 'score'])
This code will find the max of either the max score for the country or the currency of each row:
np.maximum(np.array(tbl.groupby(['country']).max().loc[tbl['country'], 'score']),
np.array(tbl.groupby(['currency']).max().loc[tbl['currency'], 'score']))
I have previously asked how to iterate through a prescribed grouping of items and received the solution.
import pandas as pd
data = [['apple', 1], ['orange', 2], ['pear', 3], ['peach', 4],['plum', 5], ['grape', 6]]
#index_groups = [0],[1,2],[3,4,5]
df = pd.DataFrame(data, columns=['Name', 'Number'])
for i in range(len(df)):
print(df['Number'][i])
Name Age
0 apple 1
1 orange 2
2 pear 3
3 peach 4
4 plum 5
5 grape 6
where :
for group in index_groups:
print(df.loc[group])
gave me just what I needed. Following up on this I would like to now sum the numbers per group but also copy down the first 'Name' in each group to the other names in the group, and then concatenate so one line per 'Name'.
In the above example the output I'm seeking would be
Name Age
0 apple 1
1 orange 5
2 peach 15
I can append the sums to a list easy enough
group_sum = []
group_sum.append(sum(df['Number'].loc[group]))
But I can't get the 'Names' in order to merge with the sums.
You could try:
df_final = pd.DataFrame()
for group in index_groups:
_df = df.loc[group]
_df["Name"] = df.loc[group].Name.iloc[0]
df_final = pd.concat([df_final, _df])
df_final.groupby("Name").agg(Age=("Number", "sum")).reset_index()
Output:
Name Age
0 apple 1
1 orange 5
2 peach 15
I have a data frame named data_2010 with 3 columns CountryName, IndicatorName and Value.
For eg.
data_2010
CountryName IndicatorName Value
4839015 Arab World Access to electricity (% of population) 8.434222e+01
4839016 Arab World Access to electricity, rural (% of rural popul... 7.196990e+01
4839017 Arab World Access to electricity, urban (% of urban popul... 9.382846e+01
4839018 Arab World Access to non-solid fuel (% of population) 8.600367e+01
4839019 Arab World Access to non-solid fuel, rural (% of rural po... 7.455260e+01
... ... ... ...
5026216 Zimbabwe Urban population (% of total) 3.319600e+01
5026217 Zimbabwe Urban population growth (annual %) 1.279630e+00
5026218 Zimbabwe Use of IMF credit (DOD, current US$) 5.287290e+08
5026219 Zimbabwe Vitamin A supplementation coverage rate (% of ... 4.930002e+01
5026220 Zimbabwe Womens share of population ages 15+ living wi... 5.898546e+01
The problem is there are 247 Unique countries and 1299 Unique IndicatorNames and every country doesn't have the data for the all the indicators. I want the set of countries and Indicator names such that every country has the data of the same indicator names and vice versa
(Edit)
df:
df = pd.DataFrame({'CountryName': ['USA', 'USA','USA','UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
Expected output for df:
CountryName IndicatorName value
USA elec 1
USA fuel 3
UAE elec 4
UAE fuel 5
Zimbabve elec 8
Zimbabve fuel 9
Solution not working for this case:
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission','population'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
Output got:
CountryName IndicatorName value
0 Saudi fuel 6
1 Saudi population 7
2 UAE elec 4
3 UAE fuel 5
4 USA elec 1
5 USA fuel 3
6 Zimbabwe elec 8
7 Zimbabwe fuel 9
Output expected:
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
Though Saudi has 2 indicators but they're not common to the rest.
For eg if Saudi had 3 indicators like ['elec', 'fuel', credit] then Saudi would be added to the final df with elec and fuel.
You can groupby IndicatorName, get the number of unique countries that have the indicator name, then filter your df to keep only the rows that have that indicator for > 1 country.
Nit: your CountryName column is missing a comma between 'USA' 'UAE', fixed below.
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df_indicators = df.groupby('IndicatorName', as_index=False)['CountryName'].nunique()
df_indicators = df_indicators.rename(columns={'CountryName': 'CountryCount'})
df_indicators = df_indicators[df_indicators['CountryCount'] > 1]
# merge on only the indicator column, how='inner' - which is the default so no need to specify
# to keep only those indicators that have a value for > 1 country
df2use = df.merge(df_indicators[['IndicatorName']], on=['IndicatorName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
5 Saudi fuel 6
1 UAE elec 4
4 UAE fuel 5
0 USA elec 1
3 USA fuel 3
2 Zimbabwe elec 8
6 Zimbabwe fuel 9
Looks like you also want to exclude Saudi because it although it has fuel, it has only 1 common IndicatorName. If so, you can use a similar process for countries rather than indicators, starting with only the countries and indicators that survived the first round of filtering, so after the code above use:
df_countries = df2use.groupby('CountryName', as_index=False)['IndicatorName'].nunique()
df_countries = df_countries.rename(columns={'IndicatorName': 'IndicatorCount'})
df_countries = df_countries[df_countries['IndicatorCount'] > 1]
df2use = df2use.merge(df_countries[['CountryName']], on=['CountryName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
I am trying to replace the placeholder '.' string with NaN in the total revenue column. This is the code used to create the df.
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
df
I have tried using:
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=True)
df
However, this replaces the entire column with Nan instead of just the placeholder '.' value.
You only need to fix the regex=False. Because when you set it to True you are assuming the passed-in is a regular expression, setting it to False will treat the pattern as a literal string (which is what I believe you want):
import pandas as pd
raw_data = {'Rank': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Company': ['Microsoft', 'Oracle', "IBM", 'SAP', 'Symantec', 'EMC', 'VMware', 'HP', 'Salesforce.com', 'Intuit'],
'Company_HQ': ['USA', 'USA', 'USA', 'Germany', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA'],
'Software_revenue': ['$62,014', '$29,881', '$29,286', '$18,777', '$6,138', '$5,844', '$5,520', '$5,082', '$4,820', '$4,324'],
'Total_revenue': ['93,456', '38,828', '92,793', '23,289', '6,615', ".", '6,035', '110,577', '5,274', '4,573'],
'Percent_revenue_total': ['66.36%', '76.96%', '31.56%', '80.63%', '92.79%', '23.91%', '91.47%', '4.60%', '91.40%', '94.55%']}
df = pd.DataFrame(raw_data, columns = ['Rank', 'Company', 'Company_HQ', 'Software_revenue', 'Total_revenue', 'Percent_revenue_total'])
import numpy as np
df['Total_revenue'] = df['Total_revenue'].replace('.', np.nan, regex=False)
print(df)
Output:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
. is special character in regex reprensent any character. You need escape it to make regex consider it as regular char
df['Total_revenue'].replace('\.', np.nan, regex=True)
Out[52]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
In your case, you should use mask
df['Total_revenue'].mask(df['Total_revenue'].eq('.'))
Out[58]:
0 93,456
1 38,828
2 92,793
3 23,289
4 6,615
5 NaN
6 6,035
7 110,577
8 5,274
9 4,573
Name: Total_revenue, dtype: object
I went one step further here and changed the column type to numeric, so you can also use if for calculations.
df.Total_revenue = pd.to_numeric(df.Total_revenue.str.replace(',',''),errors='coerce').astype('float')
df.Total_revenue
0 93456.0
1 38828.0
2 92793.0
3 23289.0
4 6615.0
5 NaN
6 6035.0
7 110577.0
8 5274.0
9 4573.0
Name: Total_revenue, dtype: float64
In my opinion "replace" is not required as user wanted to change "." Whole to nan. Inistead this will also work. It finds rows with "." And assign nan to it
df.loc[df['Total_revenue']==".", 'Total_revenue'] = np.nan
you can try below to apply your requirement to DataFrame
df.replace('.', np.nan)
or of you want to make if for specific column then use df['Total_revenue'] instead of df
where below is the output:
Rank Company Company_HQ Software_revenue Total_revenue Percent_revenue_total
0 1 Microsoft USA $62,014 93,456 66.36%
1 2 Oracle USA $29,881 38,828 76.96%
2 3 IBM USA $29,286 92,793 31.56%
3 4 SAP Germany $18,777 23,289 80.63%
4 5 Symantec USA $6,138 6,615 92.79%
5 6 EMC USA $5,844 NaN 23.91%
6 7 VMware USA $5,520 6,035 91.47%
7 8 HP USA $5,082 110,577 4.60%
8 9 Salesforce.com USA $4,820 5,274 91.40%
9 10 Intuit USA $4,324 4,573 94.55%
There is a dataframe, say
df
Country Continent PopulationEst
0 Germany Europe 8.036970e+07
1 Canada North America 35.239865+07
...
I want to create a dateframe that displays the size (the number of countries in each continent), and the sum, mean, and std deviation for the estimated population of each country.
I did the following:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
But the result df2 has multiple level columns like below:
df2.columns
MultiIndex(levels=[['PopulationEst'], ['size', 'sum', 'mean', 'std']],
labels=[[0, 0, 0, 0], [0, 1, 2, 3]])
How can I remove the PopulationEst from the columns, so just have ['size', 'sum', 'mean', 'std'] columns for the dataframe?
I think you need add ['PopulationEst'] - agg uses this column for aggregation:
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
Sample:
df = pd.DataFrame({
'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
'PopulationEst': [8, 4, 35, 50],
'Continent': ['Europe', 'Europe', 'North America', 'North America']},
columns=['Country','PopulationEst','Continent'])
print (df)
Country PopulationEst Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
df2 = df.groupby('Continent')['PopulationEst'].agg(['size', 'sum','mean','std'])
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
print (df2)
PopulationEst
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
Another solution is with MultiIndex.droplevel:
df2 = df.groupby('Continent').agg(['size', 'sum','mean','std'])
df2.columns = df2.columns.droplevel(0)
print (df2)
size sum mean std
Continent
Europe 2 12 6.0 2.828427
North America 2 85 42.5 10.606602
I think this could do what you need:
grouping = {'Continent': ['size'], 'PopEst':['sum', 'mean', 'std']}
df.groupby('Continent').agg(grouping)