Is there an easier way to equalize the number of labels in a pandas DataFrame? - pandas

When loading a dataset into a pandas DataFrame, the label categories are sometimes not in the same ratio.
Example) bike : car = 7 : 3
price  label
200    bike
100    bike
700    bike
300    bike
5500   car
400    bike
5200   car
310    bike
2000   car
20     bike
In this case, car and bike are not in the same ratio,
so I want to bring the categories to equal counts.
car appears only 3 times, so 4 bike rows should be deleted, like this:
price  label
200    bike
300    bike
5500   car
5200   car
2000   car
20     bike
Order is not important; I just want the categories in equal counts.
I counted the car labels and the bike labels, checked which label is rarer (here, car), and read the rows one by one to move them to another DataFrame. That takes a lot of time, so it is inconvenient.
Is there an easier way to equalize the number of labels in a pandas DataFrame, or must I count each label and build another DataFrame?
Thank you.

IIUC, take the minimum of the value_counts and use GroupBy.head:
out = df.groupby("label").head(min(df["label"].value_counts()))  # or GroupBy.sample
Alternatively, and in a #mozway, use a grouper:
g = df.groupby("label")
out = g.head(g["price"].size().min())
Output:
print(out)
price label
0 200 bike
1 100 bike
2 700 bike
4 5500 car
6 5200 car
8 2000 car
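The `# or GroupBy.sample` comment above hints at a random variant. A minimal sketch (column names taken from the question; the `random_state` is an arbitrary choice of mine for reproducibility) that undersamples every class down to the size of the rarest one:

```python
import pandas as pd

# Sample data from the question: 7 bike rows, 3 car rows
df = pd.DataFrame({
    "price": [200, 100, 700, 300, 5500, 400, 5200, 310, 2000, 20],
    "label": ["bike", "bike", "bike", "bike", "car",
              "bike", "car", "bike", "car", "bike"],
})

# Size of the smallest class (3, since there are only 3 cars)
n = df["label"].value_counts().min()

# Randomly draw n rows from each class
balanced = df.groupby("label").sample(n=n, random_state=0)

print(balanced["label"].value_counts())
```

Unlike `head`, this does not always keep the first rows of the majority class, which is usually what you want before training a model.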

Related

How do you copy data from one dataframe to another

I am having a difficult time getting the correct data from a reference CSV file into the one I am working on.
I have a CSV file with over 6 million rows and 19 columns. For each row there is a brand and a model of a car, among other information.
I want to add to this file the fuel consumption per 100 km traveled and the type of fuel that is used.
I have another CSV file that has the fuel consumption of every model of car.
What I ultimately want to do is add the matching values of the G, H, I and J columns from the second file to the first one.
Because of the size of the file, I was wondering if there is another way to do it than with a "for" or a "while" loop?
EDIT :
For example...
The first df would look something like this:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have the same brand and model many times, for different IDs. The order is completely random.
Thank you for providing updates; I was able to put something together that should help you.
# Drop these two placeholder columns from the first table, since the
# real values will come from df1 (your 2nd table) after the join
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# Join the first and second tables to each other on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
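Put together on the sample tables from the edit, the drop-then-merge idea might look like this sketch (the `df`/`df1` names follow the answer; dropping `df1`'s ID before the join, to avoid `ID_x`/`ID_y` suffix columns, is my addition):

```python
import pandas as pd

# First table: the fuel columns are placeholders, so they are simply omitted here
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Toyota", "Honda", "GMC", "Toyota"],
    "Model": ["Rav4", "Civic", "Sierra", "Rav4"],
    "Other_columns": ["a", "b", "c", "d"],
})

# Second table: one row per model with the real figures
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Toyota", "Toyota", "GMC", "Honda"],
    "Model": ["Corrola", "Rav4", "Sierra", "Civic"],
    "Fuel_consu_1": [100, 80, 91, 112],
    "Fuel_consu_2": [120, 84, 105, 125],
})

# Match every row of df to its model's figures; how="left" keeps
# all rows of the first table in their original order
df_merge = pd.merge(df, df1.drop(columns="ID"),
                    on=["Brand", "Model"], how="left")
print(df_merge)
```

The merge is a single vectorized join, so it scales to the 6-million-row file far better than any Python-level loop.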

Calculating mean and count of rows for a bucket

I have a df as follows
MPG Maker Price
8 Toyota 20000
12 Toyota 18000
15 Toyota 19000
5 Honda 19000
4 Honda 20000
I am looking to bin by MPG and then calculate average price and number of elements in the bin.
The DF I am looking to create is
MPG Maker Avg_Price Num_Sold
0-8 Toyota 19000 3
9-15 Honda 19500 2
I followed the directions in Bucketing in python and calculating mean for a bucket and was able to get the average price, but I am unable to get the Num_Sold to work.
I used
bins = [1,8,15]
df_bins = df.MAKER.groupby(pd.cut(df['MPG'],bins))
df_bins = df_bins.agg([np.mean, len]).reset_index(drop=True)
Any ideas on what I might be doing wrong?
Thanks!
Use named aggregation, also grouping on the column Maker:
bins = [1, 8, 15]
df_bins = (df.groupby(['Maker', pd.cut(df['MPG'], bins)])
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
Or without the Maker column:
bins = [1, 8, 15]
df_bins = (df.groupby(pd.cut(df['MPG'], bins))
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
print(df_bins)
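Run against the sample frame from the question, the second variant gives one row per MPG bin (`observed=True` is my addition, to avoid the categorical-grouping deprecation warning on recent pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "MPG": [8, 12, 15, 5, 4],
    "Maker": ["Toyota", "Toyota", "Toyota", "Honda", "Honda"],
    "Price": [20000, 18000, 19000, 19000, 20000],
})

bins = [1, 8, 15]
# pd.cut labels each row with its (1, 8] or (8, 15] interval;
# grouping on that gives a per-bin mean price and row count
df_bins = (df.groupby(pd.cut(df["MPG"], bins), observed=True)
             .agg(Avg_Price=("Price", "mean"),
                  Num_Sold=("Price", "size"))
             .reset_index())
print(df_bins)
```

The `size` aggregation is what delivers the missing `Num_Sold`: it counts rows per bin regardless of NaN values, unlike `count`.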

pandas group by and sum with values being displayed

I need to group by two columns and sum the third one. My data looks like this:
site industry spent
Auto Cars 1000
Auto Fashion 200
Auto Housing 100
Auto Housing 300
Magazine Cars 100
Magazine Fashion 200
Magazine Housing 300
Magazine Housing 500
My code:
df.groupby(by=['site', 'industry'])['spent'].sum()
The output is:
spent
site industry
Auto Cars 1000
Fashion 200
Housing 400
Magazine Cars 100
Fashion 200
Housing 800
When I convert it to CSV, I only get one column, spent. My desired output has the same format as the original data, only with spent summed, and I need to see all the values in their own columns.
Try this, using as_index=False:
df = df.groupby(by=['site', 'industry'], as_index=False).sum()
print(df)
site industry spent
0 Auto Cars 1000
1 Auto Fashion 200
2 Auto Housing 400
3 Magazine Cars 100
4 Magazine Fashion 200
5 Magazine Housing 800
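A self-contained sketch on the question's data (restricting the sum to the spent column is my choice, to keep any non-numeric columns out of the aggregation):

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["Auto", "Auto", "Auto", "Auto",
             "Magazine", "Magazine", "Magazine", "Magazine"],
    "industry": ["Cars", "Fashion", "Housing", "Housing",
                 "Cars", "Fashion", "Housing", "Housing"],
    "spent": [1000, 200, 100, 300, 100, 200, 300, 500],
})

# as_index=False keeps site and industry as ordinary columns
# instead of a MultiIndex, so to_csv writes all three columns
out = df.groupby(["site", "industry"], as_index=False)["spent"].sum()
print(out)
```

An equivalent fix for a frame that is already grouped is `out.reset_index()`, which moves the MultiIndex levels back into columns before writing the CSV.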

Pandas identify # of items which generate 80% of sales

I have a dataframe with, for each country, a list of products and their sales.
For each country, I need to identify how many top-selling items it takes for their cumulative sales to represent 80% of the total sales of all items in that country.
E.g.
Cnt    Product  units
Italy  apple    500
Italy  beer     1500
Italy  bread    2000
Italy  orange   3000
Italy  butter   3000
Expected result:
Italy 3
(Total units are 10,000, and the top 3 products, butter, orange and bread, account for 8,000 units, which is 80% of the total.)
Try defining a function and applying it on groupby:
def get_sale(x, pct=0.8):
    thresh = pct * x.sum()
    # sort values descending so the top sellers come first
    x = x.sort_values(ascending=False).reset_index(drop=True)
    # indices where the cumulative sum reaches the threshold
    sale_pass_thresh = x.index[x.cumsum().ge(thresh)]
    return sale_pass_thresh[0] + 1
df.groupby('Cnt').units.apply(get_sale)
Output:
Cnt
Italy 3
Name: units, dtype: int64
We need to play with a little logic here:
df=df.sort_values('units',ascending=False)
g=df.groupby('Cnt').units
s=(g.cumsum()>=g.transform('sum')*0.8).groupby(df.Cnt).sum()
df.groupby('Cnt').size()-s+1
Out[710]:
Cnt
Italy 3.0
dtype: float64
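Either answer can be verified on a small frame built from the question's table. A self-contained sketch of the sort-and-cumsum idea (the function name `top_sellers` is mine):

```python
import pandas as pd

df = pd.DataFrame({
    "Cnt": ["Italy", "Italy", "Italy", "Italy", "Italy"],
    "Product": ["apple", "beer", "bread", "orange", "butter"],
    "units": [500, 1500, 2000, 3000, 3000],
})

def top_sellers(units, pct=0.8):
    # threshold: pct of this country's total units
    thresh = pct * units.sum()
    # walk through the units from largest to smallest
    units = units.sort_values(ascending=False).reset_index(drop=True)
    # first position where the running total reaches the threshold
    return units.index[units.cumsum().ge(thresh)][0] + 1

out = df.groupby("Cnt")["units"].apply(top_sellers)
print(out)
```

For Italy the descending cumulative sums are 3000, 6000, 8000, 9500, 10000; the threshold of 8000 is first reached at position 2, so the function returns 3.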

Divide two row values based on label and create a new column to populate the calculated value

New to Python and looking for some help.
I would like to divide the values of two different rows (in the same column) and insert the result as a new column:
City 2017-18 Item
0 Boston 100 Primary
1 Boston 200 Secondary
2 Boston 300 Tertiary
3 Boston 400 Nat'l average
4 Chicago 500 Primary
5 Chicago 600 Secondary
6 Chicago 700 Tertiary
7 Chicago 800 Nat'l average
In the above DataFrame, I am trying to divide each City's Primary, Secondary and Tertiary values by the Nat'l average for that City. The result should go into a new column in the same DataFrame, and after the calculation the rows labelled Nat'l average need to be deleted.
Appreciate your help...
City 2017-18 Item New_column
0 Boston 100 Primary 100/400
1 Boston 200 Secondary 200/400
2 Boston 300 Tertiary 300/400
3 Chicago 500 Primary 500/800
4 Chicago 600 Secondary 600/800
5 Chicago 700 Tertiary 700/800
If the Nat'l average value is always last per group, divide the column by a Series created with GroupBy.transform and 'last':
df['new'] = df['2017-18'].div(df.groupby('City')['2017-18'].transform('last'))
If not, first filter the rows holding the averages and divide by the Series produced with Series.map:
s = df[df['Item'] == "Nat'l average"].set_index('City')['2017-18']
df['new'] = df['2017-18'].div(df['City'].map(s))
Finally, filter out the average rows with boolean indexing:
df = df[df['Item'] != "Nat'l average"]
print (df)
City 2017-18 Item new
0 Boston 100 Primary 0.250
1 Boston 200 Secondary 0.500
2 Boston 300 Tertiary 0.750
4 Chicago 500 Primary 0.625
5 Chicago 600 Secondary 0.750
6 Chicago 700 Tertiary 0.875
Detail:
print (df['City'].map(s))
0 400
1 400
2 400
3 400
4 800
5 800
6 800
7 800
Name: City, dtype: int64
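Put together end to end on the question's data, the map-based variant (the safer one when the average row is not guaranteed to be last in each group) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Boston", "Boston", "Boston", "Boston",
             "Chicago", "Chicago", "Chicago", "Chicago"],
    "2017-18": [100, 200, 300, 400, 500, 600, 700, 800],
    "Item": ["Primary", "Secondary", "Tertiary", "Nat'l average",
             "Primary", "Secondary", "Tertiary", "Nat'l average"],
})

# City -> its Nat'l average value
s = df.loc[df["Item"].eq("Nat'l average")].set_index("City")["2017-18"]
# Divide every row by its city's average, then drop the average rows
df["new"] = df["2017-18"].div(df["City"].map(s))
df = df[df["Item"].ne("Nat'l average")]
print(df)
```

Building the lookup Series once and mapping it keeps the whole operation vectorized, with no iteration over rows.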