Pandas: identify # of items which generate 80% of sales

I have a dataframe with, for each country, a list of products and the corresponding sales.
I need to identify, for each country, how many top-selling items are needed for their cumulative sales to reach 80% of the total sales of all items in that country.
E.g.
Cnt    Product  units
Italy  apple    500
Italy  beer     1500
Italy  bread    2000
Italy  orange   3000
Italy  butter   3000
Expected result:
Italy 3
(Total units are 10,000 and the combined sales of the top 3 products, butter, orange and bread, are 8,000, which is 80% of the total.)
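For reference, the sample frame can be built like this (a minimal sketch; the column names Cnt, Product and units are taken from the example above):

import pandas as pd

df = pd.DataFrame({'Cnt': ['Italy'] * 5,
                   'Product': ['apple', 'beer', 'bread', 'orange', 'butter'],
                   'units': [500, 1500, 2000, 3000, 3000]})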

Try defining a function and applying it on groupby:
def get_sale(x, pct=0.8):
    thresh = pct * x.sum()
    # sort values descending so the top sellers come first
    x = x.sort_values(ascending=False).reset_index(drop=True)
    # positions where the cumulative sum reaches the threshold
    sale_pass_thresh = x.index[x.cumsum().ge(thresh)]
    # the first such position (plus 1) is the number of items needed
    return sale_pass_thresh[0] + 1
df.groupby('Cnt').units.apply(get_sale)
Output:
Cnt
Italy 3
Name: units, dtype: int64
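Since get_sale exposes pct as a parameter, other cutoffs need no code change, e.g. df.groupby('Cnt').units.apply(get_sale, pct=0.9).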

A little logic is needed here: after sorting descending, count the rows per group whose cumulative sum has already reached 80% of the group total; the group size minus that count plus one is the number of top items:
df = df.sort_values('units', ascending=False)
g = df.groupby('Cnt').units
s = (g.cumsum() >= g.transform('sum') * 0.8).groupby(df.Cnt).sum()
df.groupby('Cnt').size() - s + 1
Output:
Cnt
Italy 3.0
dtype: float64

Is there an easy way to make the number of labels equal in a pandas dataframe?

When we use a dataset with pandas.DataFrame(), sometimes the label categories are not in the same ratio, e.g. bike:car = 7:3.
price label
  200  bike
  100  bike
  700  bike
  300  bike
 5500   car
  400  bike
 5200   car
  310  bike
 2000   car
   20  bike
In this case, car and bike are not in the same ratio, so I want each category to appear in equal numbers.
car appears only 3 times, so 4 bike rows are deleted, like this:
price label
  200  bike
  300  bike
 5500   car
 5200   car
 2000   car
   20  bike
The order is not important; I just want the categories in the same ratio.
I counted the car and bike labels, checked which label had fewer rows (here, car), and read the rows one by one to move them to another dataframe. It takes a lot of time, which is inconvenient.
Is there an easy way to equalize the number of labels in a pandas dataframe, or should I just count each label and build another dataframe?
Thank you.
IIUC, take the minimum of the value_counts and use GroupBy.head:
out = df.groupby("label").head(min(df["label"].value_counts())) #or GroupBy.sample
Alternatively, and in a #mozway, use a grouper:
g = df.groupby("label")
out = g.head(g["price"].size().min())
Output:
print(out)
price label
0 200 bike
1 100 bike
2 700 bike
4 5500 car
6 5200 car
8 2000 car
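If random rows are preferable to the first ones, GroupBy.sample (available since pandas 1.1) works the same way; a minimal sketch using the frame from the question:

import pandas as pd

df = pd.DataFrame({
    "price": [200, 100, 700, 300, 5500, 400, 5200, 310, 2000, 20],
    "label": ["bike", "bike", "bike", "bike", "car",
              "bike", "car", "bike", "car", "bike"],
})

# draw the same number of random rows from every label group
n = df["label"].value_counts().min()
out = df.groupby("label").sample(n=n, random_state=0)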

Calculating mean and count of rows for a bucket

I have a df as follows
MPG Maker Price
8 Toyota 20000
12 Toyota 18000
15 Toyota 19000
5 Honda 19000
4 Honda 20000
I am looking to bin by MPG and then calculate average price and number of elements in the bin.
The DF I am looking to create is
MPG Maker Avg_Price Num_Sold
0-8 Toyota 19000 3
9-15 Honda 19500 2
I followed the directions in "Bucketing in python and calculating mean for a bucket" and was able to get the average price, but I am unable to get Num_Sold to work.
I used
bins = [1,8,15]
df_bins = df.MAKER.groupby(pd.cut(df['MPG'],bins))
df_bins = df_bins.agg([np.mean, len]).reset_index(drop=True)
Any ideas on what I might be doing wrong?
Thanks!
Use named aggregation, here also grouping by the Maker column:
bins = [1, 8, 15]
df_bins = (df.groupby(['Maker', pd.cut(df['MPG'], bins)])
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
Or without the Maker column:
bins = [1, 8, 15]
df_bins = (df.groupby(pd.cut(df['MPG'], bins))
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
print(df_bins)
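Putting it together with the sample data from the question (a self-contained sketch of the second variant; the printed values follow from the five rows shown):

import pandas as pd

df = pd.DataFrame({'MPG': [8, 12, 15, 5, 4],
                   'Maker': ['Toyota', 'Toyota', 'Toyota', 'Honda', 'Honda'],
                   'Price': [20000, 18000, 19000, 19000, 20000]})

bins = [1, 8, 15]
df_bins = (df.groupby(pd.cut(df['MPG'], bins))
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
print(df_bins)
#        MPG     Avg_Price  Num_Sold
# 0   (1, 8]  19666.666667         3
# 1  (8, 15]  18500.000000         2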

Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:
gender  city     age     time  value
male    newyork  10_20y  2010  10.5
female  newyork  10_20y  2010  11
I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).
The time dimension should not be aggregated. Example of means added for different combinations:
gender  city     age     time  value
total   total    total   2010  (mean)
male    total    total   2010  (mean)
male    newyork  total   2010  (mean)
female  total    total   2010  (mean)
female  newyork  total   2010  (mean)
...
total   total    10_20y  2010  (mean)
Using groupby on multiple columns groups by all combinations of those columns' values, so a simple df.groupby(["city", "age"]).mean() already gives the means for the combination where gender is "total". The problem is that you want all combinations of every size from the list ["gender", "city", "age"]. It's not straightforward, and a more Pythonic way can probably be found, but here is my proposal, built on itertools.combinations.
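To see which groupings it enumerates, here is a quick illustration (output shown as comments):

import itertools

cols = ["gender", "city", "age"]
for n in range(len(cols) + 1):
    print(list(itertools.combinations(cols, n)))
# [()]
# [('gender',), ('city',), ('age',)]
# [('gender', 'city'), ('gender', 'age'), ('city', 'age')]
# [('gender', 'city', 'age')]

The proposal itself: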
import itertools
import numpy as np
import pandas as pd

## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender": np.random.choice(genders, n),
                   "city": np.random.choice(cities, n),
                   "age": np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})

## the dataframe we will append to during the process
new = pd.DataFrame(columns=cols)

## list all possible combination sizes
for n in range(0, len(cols) + 1):
    for i in itertools.combinations(cols, n):
        if n != 0:
            ## the combination is not empty, so we can groupby this sublist
            ## directly and take the mean (numeric_only keeps the remaining
            ## string columns out of it); reset_index since grouping by
            ## multiple columns yields a MultiIndex
            agg_df = df.groupby(list(i)).mean(numeric_only=True).reset_index()
        else:
            ## if n == 0, we just take the mean of the whole dataframe
            ## (a bit ugly since this mean is a scalar, not a dataframe)
            agg_df = pd.DataFrame(columns=cols)
            agg_df.loc[0, "value"] = df.loc[:, "value"].mean()
        ## from here agg_df has n+1 columns; for instance for ["gender"]
        ## only "gender" and "value" remain, since "city" and "age" were
        ## aggregated away, so fill them with "total" as you asked for
        for j in cols:
            if j not in i:
                agg_df.loc[:, j] = "total"
        ## DataFrame.append was removed in pandas 2.0, so concatenate instead
        new = pd.concat([new, agg_df])
For instance, I find:
new
gender city age value
0 total total total 9.750000
0 female total total 8.083333
1 male total total 12.250000
0 total chicago total 8.000000
1 total losangeles total 11.428571
2 total newyork total 11.666667
0 total total 10_20y 8.100000
1 total total 20_30y 11.400000
0 female chicago total 7.333333
1 female losangeles total 9.250000
2 female newyork total 8.000000
3 male chicago total 9.000000
4 male losangeles total 14.333333
5 male newyork total 19.000000
0 female total 10_20y 7.333333
1 female total 20_30y 8.833333
2 male total 10_20y 9.250000
3 male total 20_30y 15.250000
0 total chicago 10_20y 7.285714
1 total chicago 20_30y 9.666667
2 total losangeles 10_20y 10.000000
3 total losangeles 20_30y 12.500000
4 total newyork 20_30y 11.666667
0 female chicago 10_20y 7.250000
1 female chicago 20_30y 7.500000
2 female losangeles 10_20y 7.500000
3 female losangeles 20_30y 11.000000
4 female newyork 20_30y 8.000000
5 male chicago 10_20y 7.333333
6 male chicago 20_30y 14.000000
7 male losangeles 10_20y 15.000000
8 male losangeles 20_30y 14.000000
9 male newyork 20_30y 19.000000
This is a starting point; I think you can work from it.
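As a side note, collecting the pieces in a list and concatenating once at the end is the idiomatic replacement for repeated appends; a compact sketch of the same idea, reusing cols and df from above:

import itertools
import pandas as pd

frames = []
for r in range(len(cols) + 1):
    for combo in itertools.combinations(cols, r):
        if combo:
            agg = df.groupby(list(combo))["value"].mean().reset_index()
        else:
            agg = pd.DataFrame({"value": [df["value"].mean()]})
        # fill the aggregated-away dimensions with "total"
        for c in cols:
            if c not in combo:
                agg[c] = "total"
        frames.append(agg)
new = pd.concat(frames, ignore_index=True)[cols + ["value"]]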

Overall sum by groupby pandas

I have a dataframe as shown below, giving the area usage of a whole city, say Bangalore.
Sector Plot Usage Status Area
A 1 Villa Constructed 40
A 2 Residential Constructed 50
A 3 Substation Not_Constructed 120
A 4 Villa Not_Constructed 60
A 5 Residential Not_Constructed 30
A 6 Substation Constructed 100
B 1 Villa Constructed 80
B 2 Residential Constructed 60
B 3 Substation Not_Constructed 40
B 4 Villa Not_Constructed 80
B 5 Residential Not_Constructed 100
B 6 Substation Constructed 40
Bangalore consists of two sectors, A and B.
From the above I would like to calculate the total area of Bangalore and its distribution of usage.
Expected Output:
City Total_Area %_Villa %_Resid %_Substation %_Constructed %_Not_Constructed
Bangalore(A+B) 800 32.5 30 37.5 46.25 53.75
I think you need to set a scalar value in the Sector column before applying the solution (if there are only sectors A and B):
df['Sector'] = 'Bangalore(A+B)'
#aggregate sum per the 2 columns Sector and Usage
df1 = df.groupby(['Sector', 'Usage'])['Area'].sum()
#percentage by dividing by the total per Sector
#(Series.sum(level=0) was removed in pandas 2.0, so group by the level instead)
df1 = df1.div(df1.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#aggregate sum per the 2 columns Sector and Status
df2 = df.groupby(['Sector', 'Status'])['Area'].sum()
df2 = df2.div(df2.groupby(level=0).sum(), level=0).unstack(fill_value=0).mul(100).add_prefix('%_')
#total Area per Sector
s = df.groupby('Sector')['Area'].sum().rename('Total_area')
#join all together
dfA = pd.concat([s, df1, df2], axis=1).reset_index()
print(dfA)
Sector Total_area %_Residential %_Substation %_Villa \
0 Bangalore(A+B) 800 30.0 37.5 32.5
%_Constructed %_Not_Constructed
0 46.25 53.75
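For what it's worth, pd.crosstab with normalize='index' can fold the percentage step into the aggregation; a sketch of the same result under that approach:

import pandas as pd

df['Sector'] = 'Bangalore(A+B)'
pct_usage = (pd.crosstab(df['Sector'], df['Usage'], values=df['Area'],
                         aggfunc='sum', normalize='index')
               .mul(100).add_prefix('%_'))
pct_status = (pd.crosstab(df['Sector'], df['Status'], values=df['Area'],
                          aggfunc='sum', normalize='index')
                .mul(100).add_prefix('%_'))
total = df.groupby('Sector')['Area'].sum().rename('Total_Area')
out = pd.concat([total, pct_usage, pct_status], axis=1).reset_index()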
A simple pivot_table can help!
1. One-line pandas solution: 80% of the work done
pv = df.pivot_table(values='Area', aggfunc=np.sum, index=['Status'],
                    columns=['Usage'], margins=True, margins_name='Total',
                    fill_value=0).unstack()
2. Now formatting for %: 90% done
ans = pd.DataFrame([[pv['Villa']['Total'] / pv['Total']['Total'],
                     pv['Residential']['Total'] / pv['Total']['Total'],
                     pv['Substation']['Total'] / pv['Total']['Total'],
                     pv['Total']['Constructed'] / pv['Total']['Total'],
                     pv['Total']['Not_Constructed'] / pv['Total']['Total']]]).mul(100).round(2)
3. Adding the Total column: 99% done
ans['Total'] = pv['Total']['Total']
4. Renaming columns and arranging in your expected order: done!
ans.columns = ['%_Villa', '%_Resid', '%_Substation', '%_Constructed', '%_Not_Constructed', 'Total']
ans = ans[['Total', '%_Villa', '%_Resid', '%_Substation', '%_Constructed', '%_Not_Constructed']]

Remove None values from dataframe

I have a dataframe like:
Country Name Income Group
1 Norway High income
2 Switzerland Middle income
3 Qatar Low income
4 Luxembourg Low income
5 Macao High income
6 India Middle income
I need something like:
High income Middle income Low income
1 Norway Switzerland Qatar
2 Macao India Luxembourg
I have used pivot:
df = df.pivot(values='Country Name', index=None, columns='Income Group')
and I get something like:
High income Middle income Low income
1 Norway none none
2 none Switzerland none
...
Can someone suggest a better solution than pivot here, so that I don't have to deal with the None values?
The trick is to introduce a new column, index, whose values are groupby/cumcount values. cumcount returns a cumulative count, thus numbering the items within each group:
df['index'] = df.groupby('Income Group').cumcount()
# Country Name Income Group index
# 1 Norway High income 0
# 2 Switzerland Middle income 0
# 3 Qatar Low income 0
# 4 Luxembourg Low income 1
# 5 Macao High income 1
# 6 India Middle income 1
Once you have the index column, the desired result can be obtained by pivoting:
import pandas as pd
df = pd.DataFrame({'Country Name': ['Norway', 'Switzerland', 'Qatar', 'Luxembourg', 'Macao', 'India'], 'Income Group': ['High income', 'Middle income', 'Low income', 'Low income', 'High income', 'Middle income']})
df['index'] = df.groupby('Income Group').cumcount() + 1
result = df.pivot(index='index', columns='Income Group', values='Country Name')
result.index.name = result.columns.name = None
print(result)
yields
High income Low income Middle income
1 Norway Qatar Switzerland
2 Macao Luxembourg India
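The cumcount trick also works without the helper column, by passing the cumcount straight to set_index (a minimal equivalent sketch, up to the index and columns names):

result = (df.set_index([df.groupby('Income Group').cumcount() + 1,
                        'Income Group'])['Country Name']
            .unstack())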