Remove none values from dataframe - pandas

I have a dataframe like:
   Country Name  Income Group
1  Norway        High income
2  Switzerland   Middle income
3  Qatar         Low income
4  Luxembourg    Low income
5  Macao         High income
6  India         Middle income
I need something like:
   High income  Middle income  Low income
1  Norway       Switzerland    Qatar
2  Macao        India          Luxembourg
I have used pivot:
df = df.pivot(values='Country Name', index=None, columns='Income Group')
and I get something like:
   High income  Middle income  Low income
1  Norway       NaN            NaN
2  NaN          Switzerland    NaN
...
Can someone suggest a better solution than pivot here, so that I don't have to deal with the NaN values?

The trick is to introduce a new column index whose values are groupby/cumcount values. cumcount returns a cumulative count -- thus numbering the items in each group:
df['index'] = df.groupby('Income Group').cumcount()
#    Country Name   Income Group  index
# 1  Norway         High income       0
# 2  Switzerland    Middle income     0
# 3  Qatar          Low income        0
# 4  Luxembourg     Low income        1
# 5  Macao          High income       1
# 6  India          Middle income     1
Once you have the index column, the desired result can be obtained by pivoting:
import pandas as pd

df = pd.DataFrame({
    'Country Name': ['Norway', 'Switzerland', 'Qatar', 'Luxembourg', 'Macao', 'India'],
    'Income Group': ['High income', 'Middle income', 'Low income', 'Low income',
                     'High income', 'Middle income']})
df['index'] = df.groupby('Income Group').cumcount() + 1  # +1 for a 1-based index
result = df.pivot(index='index', columns='Income Group', values='Country Name')
result.index.name = result.columns.name = None
print(result)
yields
  High income  Low income Middle income
1      Norway       Qatar   Switzerland
2       Macao  Luxembourg         India
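As a sketch of an alternative that sidesteps pivot entirely, you could collect each income group into its own Series and rebuild the frame from a dict (note this would still produce NaN padding if the group sizes were unequal; the columns come out in alphabetical group order):

```python
import pandas as pd

df = pd.DataFrame({
    'Country Name': ['Norway', 'Switzerland', 'Qatar', 'Luxembourg', 'Macao', 'India'],
    'Income Group': ['High income', 'Middle income', 'Low income', 'Low income',
                     'High income', 'Middle income']})

# one Series per income group; renumber each group from 0 so they align
groups = df.groupby('Income Group')['Country Name']
result = pd.DataFrame({name: vals.reset_index(drop=True) for name, vals in groups})
result.index += 1  # 1-based index to match the desired output
print(result)
```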


Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:
gender  city     age     time  value
male    newyork  10_20y  2010  10.5
female  newyork  10_20y  2010  11
I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).
The time dimension should not be aggregated. Example of means added for different combinations:
gender  city     age     time  value
total   total    total   2010  (mean)
male    total    total   2010  (mean)
male    newyork  total   2010  (mean)
female  total    total   2010  (mean)
female  newyork  total   2010  (mean)
...
total   total    10_20y  2010  (mean)
Using groupby on multiple columns gives the mean for every observed combination of those columns' values. So a simple df.groupby(["city", "age"]).mean() already covers the "total", "city", "age" combination (gender aggregated out). The problem is that you want combinations of every size from the list ["gender", "city", "age"]. It's not straightforward, and I think a more pythonic way can be found, but here is my proposal:
## itertools contains the function combinations
import itertools

import numpy as np
import pandas as pd

## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender": np.random.choice(genders, n),
                   "city": np.random.choice(cities, n),
                   "age": np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})
## collect one aggregated frame per combination and concatenate at the end
## (DataFrame.append is deprecated in recent pandas versions)
pieces = []
## list all size combinations possible
for n in range(0, len(cols) + 1):
    for i in itertools.combinations(cols, n):
        if n != 0:
            ## the combination is not empty, so we can directly groupby this
            ## sublist and take the mean of the value column; reset the index
            ## since for multiple columns the result will be multiindexed
            agg_df = df.groupby(list(i))["value"].mean().reset_index()
        else:
            ## if n = 0, we just want the mean of the whole dataframe
            agg_df = pd.DataFrame(columns=cols)
            agg_df.loc[0, "value"] = df["value"].mean()
        ## from here agg_df has n + 1 columns; for instance for ("gender",)
        ## we have only "gender" and "value" columns --
        ## "city" and "age" are missing since we aggregated over them
        for j in cols:
            if j not in i:
                agg_df.loc[:, j] = "total"  ## adding total as you asked for
        pieces.append(agg_df)
new = pd.concat(pieces)[cols + ["value"]]
For instance, I find:
new
gender city age value
0 total total total 9.750000
0 female total total 8.083333
1 male total total 12.250000
0 total chicago total 8.000000
1 total losangeles total 11.428571
2 total newyork total 11.666667
0 total total 10_20y 8.100000
1 total total 20_30y 11.400000
0 female chicago total 7.333333
1 female losangeles total 9.250000
2 female newyork total 8.000000
3 male chicago total 9.000000
4 male losangeles total 14.333333
5 male newyork total 19.000000
0 female total 10_20y 7.333333
1 female total 20_30y 8.833333
2 male total 10_20y 9.250000
3 male total 20_30y 15.250000
0 total chicago 10_20y 7.285714
1 total chicago 20_30y 9.666667
2 total losangeles 10_20y 10.000000
3 total losangeles 20_30y 12.500000
4 total newyork 20_30y 11.666667
0 female chicago 10_20y 7.250000
1 female chicago 20_30y 7.500000
2 female losangeles 10_20y 7.500000
3 female losangeles 20_30y 11.000000
4 female newyork 20_30y 8.000000
5 male chicago 10_20y 7.333333
6 male chicago 20_30y 14.000000
7 male losangeles 10_20y 15.000000
8 male losangeles 20_30y 14.000000
9 male newyork 20_30y 19.000000
This is a base to work from; I think you can build on it.
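As a sketch of the more pythonic variant hinted at above, the same loop can be collapsed into a comprehension, with assign filling in the "total" labels (same column names as in the answer; the seed and data here are illustrative):

```python
import itertools

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["gender", "city", "age"]
df = pd.DataFrame({"gender": rng.choice(["male", "female"], 20),
                   "city": rng.choice(["newyork", "losangeles", "chicago"], 20),
                   "age": rng.choice(["10_20y", "20_30y"], 20),
                   "value": rng.integers(1, 20, 20)})

# one aggregated frame per non-empty subset of the grouping columns;
# dimensions missing from the subset are filled with the label "total"
pieces = [df.groupby(list(combo))["value"].mean().reset_index()
              .assign(**{c: "total" for c in cols if c not in combo})
          for n in range(1, len(cols) + 1)
          for combo in itertools.combinations(cols, n)]
# the empty combination is the grand mean over the whole frame
grand = pd.DataFrame([{**{c: "total" for c in cols}, "value": df["value"].mean()}])
new = pd.concat([grand] + pieces, ignore_index=True)[cols + ["value"]]
```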

Row filtering with respect to intersection of 2 columns

I have a data frame named data_2010 with 3 columns CountryName, IndicatorName and Value.
For eg.
data_2010
CountryName IndicatorName Value
4839015 Arab World Access to electricity (% of population) 8.434222e+01
4839016 Arab World Access to electricity, rural (% of rural popul... 7.196990e+01
4839017 Arab World Access to electricity, urban (% of urban popul... 9.382846e+01
4839018 Arab World Access to non-solid fuel (% of population) 8.600367e+01
4839019 Arab World Access to non-solid fuel, rural (% of rural po... 7.455260e+01
... ... ... ...
5026216 Zimbabwe Urban population (% of total) 3.319600e+01
5026217 Zimbabwe Urban population growth (annual %) 1.279630e+00
5026218 Zimbabwe Use of IMF credit (DOD, current US$) 5.287290e+08
5026219 Zimbabwe Vitamin A supplementation coverage rate (% of ... 4.930002e+01
5026220 Zimbabwe Womens share of population ages 15+ living wi... 5.898546e+01
The problem is that there are 247 unique countries and 1299 unique IndicatorNames, and not every country has data for all the indicators. I want the subset of countries and indicator names such that every country in it has data for the same set of indicator names, and vice versa.
(Edit)
df:
df = pd.DataFrame({'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi',
                                   'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
                   'IndicatorName': ['elec', 'area', 'fuel', 'elec', 'fuel', 'fuel',
                                     'population', 'elec', 'fuel', 'co2 emission'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
Expected output for df:
CountryName IndicatorName value
USA elec 1
USA fuel 3
UAE elec 4
UAE fuel 5
Zimbabwe elec 8
Zimbabwe fuel 9
Solution not working for this case:
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission','population'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
Output got:
CountryName IndicatorName value
0 Saudi fuel 6
1 Saudi population 7
2 UAE elec 4
3 UAE fuel 5
4 USA elec 1
5 USA fuel 3
6 Zimbabwe elec 8
7 Zimbabwe fuel 9
Output expected:
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
Though Saudi has 2 indicators, they're not common to the rest.
For example, if Saudi had 3 indicators like ['elec', 'fuel', 'credit'], then Saudi would be added to the final df with elec and fuel.
You can groupby IndicatorName, get the number of unique countries that have the indicator name, then filter your df to keep only the rows that have that indicator for > 1 country.
Nit: your CountryName column is missing a comma between 'USA' 'UAE', fixed below.
df = pd.DataFrame(
{'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi', 'Zimbabwe', 'Zimbabwe', 'Zimbabwe'],
'IndicatorName': ['elec', 'area', 'fuel', 'elec','fuel','fuel', 'population', 'elec', 'fuel', 'co2 emission'],
'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df_indicators = df.groupby('IndicatorName', as_index=False)['CountryName'].nunique()
df_indicators = df_indicators.rename(columns={'CountryName': 'CountryCount'})
df_indicators = df_indicators[df_indicators['CountryCount'] > 1]
# merge on only the indicator column, how='inner' - which is the default so no need to specify
# to keep only those indicators that have a value for > 1 country
df2use = df.merge(df_indicators[['IndicatorName']], on=['IndicatorName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
5 Saudi fuel 6
1 UAE elec 4
4 UAE fuel 5
0 USA elec 1
3 USA fuel 3
2 Zimbabwe elec 8
6 Zimbabwe fuel 9
Looks like you also want to exclude Saudi because, although it has fuel, it has only 1 IndicatorName in common. If so, you can use a similar process for countries rather than indicators, starting with only the countries and indicators that survived the first round of filtering, so after the code above use:
df_countries = df2use.groupby('CountryName', as_index=False)['IndicatorName'].nunique()
df_countries = df_countries.rename(columns={'IndicatorName': 'IndicatorCount'})
df_countries = df_countries[df_countries['IndicatorCount'] > 1]
df2use = df2use.merge(df_countries[['CountryName']], on=['CountryName'])
df2use = df2use.sort_values(by=['CountryName', 'IndicatorName'])
to get
CountryName IndicatorName value
0 UAE elec 4
1 UAE fuel 5
2 USA elec 1
3 USA fuel 3
4 Zimbabwe elec 8
5 Zimbabwe fuel 9
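One caveat worth sketching: a single pass of each filter is not guaranteed to converge in general, since dropping a country can push an indicator back below the threshold, and vice versa. A minimal sketch (using transform in place of the merges, on the same df) that simply repeats both filters until the frame stops shrinking:

```python
import pandas as pd

df = pd.DataFrame(
    {'CountryName': ['USA', 'USA', 'USA', 'UAE', 'UAE', 'Saudi', 'Saudi',
                     'Zimbabwe', 'Zimbabwe', 'Zimbabwe', 'Spain'],
     'IndicatorName': ['elec', 'area', 'fuel', 'elec', 'fuel', 'fuel',
                       'population', 'elec', 'fuel', 'co2 emission', 'population'],
     'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})

out = df
while True:
    before = len(out)
    # keep indicators shared by more than one country
    out = out[out.groupby('IndicatorName')['CountryName'].transform('nunique') > 1]
    # keep countries with more than one of the surviving indicators
    out = out[out.groupby('CountryName')['IndicatorName'].transform('nunique') > 1]
    if len(out) == before:  # fixed point reached
        break
out = out.sort_values(['CountryName', 'IndicatorName']).reset_index(drop=True)
```

On this data the loop takes three passes: the first drops Spain, the second drops Saudi (its population indicator is no longer shared), and the third confirms nothing else changes.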

Calculate average of non numeric columns in pandas

I have a df "data" as below
Name Quality city
Tom High A
nick Medium B
krish Low A
Jack High A
Kevin High B
Phil Medium B
I want to group it by city, create new columns based on the column "Quality", and calculate the averages as below:
city High Medium Low High_Avg Medium_Avg Low_Avg
A    2    0      1   66.66    0          33.33
B    1    1      0   50       50         0
I tried with the below script and I know it is completely wrong.
data_average = data_df.groupby(['city'], as_index = False).count()
Get a count of the frequencies, divide the outcome by the row sums, and finally concatenate the dataframes into one:
result = pd.crosstab(df.city, df.Quality)
averages = result.div(result.sum(axis=1), axis=0).mul(100).round(2).add_suffix("_Avg")
# combine the dataframes
pd.concat((result, averages), axis=1)
Quality High Low Medium High_Avg Low_Avg Medium_Avg
city
A 2 1 0 66.67 33.33 0.00
B 1 0 2 33.33 0.00 66.67
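As a variant, crosstab can produce the row percentages directly through its normalize='index' parameter, avoiding the manual division; a minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'Jack', 'Kevin', 'Phil'],
                   'Quality': ['High', 'Medium', 'Low', 'High', 'High', 'Medium'],
                   'city': ['A', 'B', 'A', 'A', 'B', 'B']})

counts = pd.crosstab(df.city, df.Quality)
# normalize='index' divides each row by its row sum before we scale to percent
averages = (pd.crosstab(df.city, df.Quality, normalize='index')
              .mul(100).round(2).add_suffix('_Avg'))
result = pd.concat((counts, averages), axis=1)
```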

Overall sum by groupby pandas

I have a dataframe as shown below, with the area usage of a whole city, say Bangalore.
Sector Plot Usage Status Area
A 1 Villa Constructed 40
A 2 Residential Constructed 50
A 3 Substation Not_Constructed 120
A 4 Villa Not_Constructed 60
A 5 Residential Not_Constructed 30
A 6 Substation Constructed 100
B 1 Villa Constructed 80
B 2 Residential Constructed 60
B 3 Substation Not_Constructed 40
B 4 Villa Not_Constructed 80
B 5 Residential Not_Constructed 100
B 6 Substation Constructed 40
Bangalore consist of two sectors A and B.
From the above I would like to calculate total area of Bangalore and its distribution of usage.
Expected Output:
City Total_Area %_Villa %_Resid %_Substation %_Constructed %_Not_Constructed
Bangalore(A+B) 800 32.5 30 37.5 46.25 53.75
I think you need to set a scalar value in the Sector column before applying the solution (if there are only sectors A and B):
df['Sector'] = 'Bangalore(A+B)'
#aggregate sum per 2 columns Sector and Usage
df1 = df.groupby(['Sector', 'Usage'])['Area'].sum()
#percentage by division of total per Sector
#(Series.sum(level=0) was removed in pandas 2.0, so group by the level instead)
df1 = df1.div(df1.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#aggregate sum per 2 columns Sector and Status
df2 = df.groupby(['Sector', 'Status'])['Area'].sum()
df2 = df2.div(df2.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#total Area per Sector
s = df.groupby('Sector')['Area'].sum().rename('Total_area')
#join all together
dfA = pd.concat([s, df1, df2], axis=1).reset_index()
print (dfA)
Sector Total_area %_Residential %_Substation %_Villa \
0 Bangalore(A+B) 800 30.0 37.5 32.5
%_Constructed %_Not_Constructed
0 46.25 53.75
Simple Pivot Table can help!
1. One-line pandas solution: 80% work done
pv = df.pivot_table(values='Area', aggfunc='sum', index=['Status'], columns=['Usage'],
                    margins=True, margins_name='Total', fill_value=0).unstack()
2. Now formatting for %: 90% work done
(note the Usage label is 'Residential', not 'Resid', and the percentages must be taken before rounding)
ans = pd.DataFrame([[pv['Villa']['Total'] / pv['Total']['Total'],
                     pv['Residential']['Total'] / pv['Total']['Total'],
                     pv['Substation']['Total'] / pv['Total']['Total'],
                     pv['Total']['Constructed'] / pv['Total']['Total'],
                     pv['Total']['Not_Constructed'] / pv['Total']['Total']]]).mul(100).round(2)
3. Adding the Total column: 99% work done
ans['Total'] = pv['Total']['Total']
4. Renaming columns and arranging in your expected order: and done!
ans.columns = ['%_Villa', '%_Resid', '%_Substation', '%_Constructed', '%_Not_Constructed', 'Total']
ans = ans[['Total', '%_Villa', '%_Resid', '%_Substation', '%_Constructed', '%_Not_Constructed']]
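A third, minimal sketch that builds the expected one-row summary with plain arithmetic (the data and column names are taken from the question; summary is a name chosen here):

```python
import pandas as pd

df = pd.DataFrame({
    'Sector': list('AAAAAABBBBBB'),
    'Usage': ['Villa', 'Residential', 'Substation'] * 4,
    'Status': ['Constructed', 'Constructed', 'Not_Constructed',
               'Not_Constructed', 'Not_Constructed', 'Constructed'] * 2,
    'Area': [40, 50, 120, 60, 30, 100, 80, 60, 40, 80, 100, 40]})

total = df['Area'].sum()
# percentage share of the city total per usage type and per status
usage = df.groupby('Usage')['Area'].sum().div(total).mul(100).add_prefix('%_')
status = df.groupby('Status')['Area'].sum().div(total).mul(100).add_prefix('%_')
summary = pd.concat([pd.Series({'Total_Area': total}), usage, status])
summary.name = 'Bangalore(A+B)'
```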

Pandas identify # of items which generate 80 of sales

I have a dataframe with, for each country, a list of products and the relevant sales.
I need to identify, for each country, how many of the top-selling items it takes for their cumulative sales to reach 80% of the total sales of all items in that country.
E.g.
Cnt    Product  units
Italy  apple      500
Italy  beer      1500
Italy  bread     2000
Italy  orange    3000
Italy  butter    3000
Expected result:
Italy 3
(Total units are 10,000, and the sales of the top 3 products - butter, orange, bread - are 8,000, which is 80% of the total)
Try defining a function and applying it on groupby:
def get_sale(x, pct=0.8):
    thresh = pct * x.sum()
    # sort values descendingly for top sales
    x = x.sort_values(ascending=False).reset_index(drop=True)
    # indices of those whose cumsum passes the threshold
    sale_pass_thresh = x.index[x.cumsum().ge(thresh)]
    return sale_pass_thresh[0] + 1
df.groupby('Cnt').units.apply(get_sale)
Output:
Cnt
Italy 3
Name: units, dtype: int64
We need to play with the logic a little bit here:
df=df.sort_values('units',ascending=False)
g=df.groupby('Cnt').units
s=(g.cumsum()>=g.transform('sum')*0.8).groupby(df.Cnt).sum()
df.groupby('Cnt').size()-s+1
Out[710]:
Cnt
Italy 3.0
dtype: float64
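The same counting idea can be written without the subtraction bookkeeping: after sorting descendingly, count per country how many rows stay strictly below 80% of the total, then add one for the row that crosses the threshold. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Cnt': ['Italy'] * 5,
                   'Product': ['apple', 'beer', 'bread', 'orange', 'butter'],
                   'units': [500, 1500, 2000, 3000, 3000]})

s = df.sort_values('units', ascending=False)
g = s.groupby('Cnt')['units']
# rows whose running total is still strictly below 80% of the country's total
below = g.cumsum().lt(g.transform('sum') * 0.8)
result = below.groupby(s['Cnt']).sum() + 1
```

For Italy the sorted cumulative sums are 3000, 6000, 8000, 9500, 10000 against a threshold of 8000, so two rows stay below it and the count is 3, matching the other answers while keeping an integer dtype.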