I have a df as follows:
MPG  Maker   Price
8    Toyota  20000
12   Toyota  18000
15   Toyota  19000
5    Honda   19000
4    Honda   20000
I am looking to bin by MPG and then calculate the average price and the number of elements in each bin.
The DF I am looking to create is:
MPG   Maker   Avg_Price  Num_Sold
0-8   Toyota  19000      3
9-15  Honda   19500      2
I followed the directions in "Bucketing in python and calculating mean for a bucket" and was able to get the average price, but I am unable to get Num_Sold to work.
I used
bins = [1,8,15]
df_bins = df.MAKER.groupby(pd.cut(df['MPG'],bins))
df_bins = df_bins.agg([np.mean, len]).reset_index(drop=True)
Any ideas on what I might be doing wrong?
Thanks!
Use named aggregation, also grouping by the Maker column:
bins = [1, 8, 15]
df_bins = (df.groupby(['Maker', pd.cut(df['MPG'], bins)])
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
Or without the Maker column:
bins = [1, 8, 15]
df_bins = (df.groupby(pd.cut(df['MPG'], bins))
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
print(df_bins)
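For reference, here is a minimal run of the second variant against the sample data from the question (a sketch; note that with these MPG values the (1, 8] bin actually picks up both Honda rows plus one Toyota row, so the numbers differ a little from the expected table above):
import pandas as pd

df = pd.DataFrame({'MPG': [8, 12, 15, 5, 4],
                   'Maker': ['Toyota', 'Toyota', 'Toyota', 'Honda', 'Honda'],
                   'Price': [20000, 18000, 19000, 19000, 20000]})

bins = [1, 8, 15]
df_bins = (df.groupby(pd.cut(df['MPG'], bins))
             .agg(Avg_Price=('Price', 'mean'),
                  Num_Sold=('Price', 'size'))
             .reset_index())
print(df_bins)
#        MPG     Avg_Price  Num_Sold
# 0   (1, 8]  19666.666667         3
# 1  (8, 15]  18500.000000         2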
When we load a dataset into a pandas DataFrame, sometimes the label categories are not in the same ratio, for example bike : car = 7:3.
price  label
200    bike
100    bike
700    bike
300    bike
5500   car
400    bike
5200   car
310    bike
2000   car
20     bike
In this case, car and bike are not in the same ratio, so I want each category to appear the same number of times. car shows up only 3 times, so 4 bike rows are deleted, like this...
price  label
200    bike
300    bike
5500   car
5200   car
2000   car
20     bike
The order is not important; I just want the categories to end up with equal counts.
I counted the car and bike labels, checked which label is rarer (here it is car), and copied rows one by one into another DataFrame. That takes a lot of time, which is inconvenient.
Is there an easier way to equalize the number of rows per label in a pandas DataFrame than counting each label and building another DataFrame by hand?
Thank you.
IIUC, take the minimum of the value_counts and use GroupBy.head:
out = df.groupby("label").head(min(df["label"].value_counts())) #or GroupBy.sample
Alternatively, and in a @mozway style, reuse the grouper:
g = df.groupby("label")
out = g.head(g["price"].size().min())
Output:
print(out)
   price label
0    200  bike
1    100  bike
2    700  bike
4   5500   car
6   5200   car
8   2000   car
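If you would rather keep a random subset per label instead of the first rows, the GroupBy.sample route mentioned in the comment could look like this (a sketch; needs pandas >= 1.1, and random_state is only there for reproducibility):
n = df["label"].value_counts().min()
out = df.groupby("label").sample(n=n, random_state=0)  # random rows instead of the head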
I am having a difficult time getting the correct data from a reference csv file to the one I am working on.
I have a csv file that has over 6 million rows and 19 columns. It looks something like this:
[screenshot of the first csv file]
For each row there is a brand and a model of a car amongst other information.
I want to add to this file the fuel consumption per 100km traveled and the type of fuel that is used.
I have another csv file that has the fuel consumption of every model of car and looks something like this:
[screenshot of the second csv file]
What I want to ultimately do is add the matching values of columns G, H, I, and J from the second file to the first one.
Because of the size of the file, I was wondering if there is another way to do it than with a "for" or a "while" loop?
EDIT:
For example, the first df would look something like this:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may contain the same brand and model many times, under different IDs. The order is completely random.
Thank you for providing updates. I was able to put something together that should help you.
# Drop these two columns, because you won't need them once you join df1
# (your second table) onto this DataFrame
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)

# This joins your first and second DataFrames on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
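One caveat worth adding (an assumption, since the full files are not shown): pd.merge defaults to an inner join, so any Brand/Model pair in the big file that is missing from the reference table would be silently dropped. A left join keeps all 6 million rows, and dropping df1's own ID column first avoids ID_x/ID_y suffix columns:
# Keep every row of the big file; unmatched models get NaN fuel values
df_merge = pd.merge(df, df1.drop(columns=['ID']), on=['Brand', 'Model'], how='left')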
Let's say I have a pandas dataframe:
brand   category  size
nike    sneaker   9
adidas  boots     11
nike    boots     9
There could be more than 100 brands, and some brands could have more categories than others. How do I get a table that groups them by brand? That is, the first column (the index) should be the brand, the second should be the categories belonging to that brand, and, if possible, the mean size for each brand as well, using pandas.
brand   category  size
nike    sneaker   10.5
        boots
adidas  boots     11
Maybe there is a small error in the size column of the example (the mean for nike is 9, not 10.5), but a solution might be:
df.groupby(['brand'], as_index=False).agg({'category': list, 'size': 'mean'})
Output:
    brand          category  size
0  adidas           [boots]    11
1    nike  [sneaker, boots]     9
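If you prefer the two-level layout sketched in the question, with brand and category both in the index rather than the categories collected into a list, grouping on both keys is an option (a sketch):
out = df.groupby(['brand', 'category'])['size'].mean()
print(out)
# brand   category
# adidas  boots       11.0
# nike    boots        9.0
#         sneaker      9.0
# Name: size, dtype: float64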
I have a pandas data frame with three columns containing a mixture of alphanumeric values. I want to:
1. efficiently remove the characters/strings from the alphanumeric values in the Price, Miles, and Weight columns, and
2. convert the resulting values to a float.
See below for an example...
import pandas as pd

cars_info = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
             'Price': ['22000$', '25000$', '27000$', '35000$'],
             'Miles': ['1200 miles', '10045 miles', '22103 miles', '1110 miles'],
             'Weight': ['2500 lbs', '2335 lbs', '2110 lbs', '2655 lbs']}
df = pd.DataFrame(cars_info, columns=['Brand', 'Price', 'Miles', 'Weight'])
df.dtypes  # returns `object` dtype for the Price, Miles, and Weight columns
Desired result
Brand, Price($), Miles(in miles), Weight(lbs)
Honda Civic,22000,1200, 2500
Toyota Corolla, 25000, 10045, 2335
Ford Focus, 27000, 22103, 2110
Audi A4, 35000, 1110, 2655
My attempt:
for col in df:
    df[col] = df[col].str.replace(r'\D', '').astype(float)
There are many ways to solve this problem. You could .str.replace the labels you don't care about, or .str.split if you know the number is always the first thing before a space, for instance.
In this case it looks like you can extract whatever looks like a number ([\d\.]+), and then use pd.to_numeric to cast that to a numeric type.
for col in ['Price', 'Miles', 'Weight']:
    df[col] = pd.to_numeric(df[col].str.extract(r'([\d\.]+)', expand=False))
print(df)
#             Brand  Price  Miles  Weight
# 0     Honda Civic  22000   1200    2500
# 1  Toyota Corolla  25000  10045    2335
# 2      Ford Focus  27000  22103    2110
# 3         Audi A4  35000   1110    2655

df.dtypes
# Brand     object
# Price      int64
# Miles      int64
# Weight     int64
# dtype: object
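If you also want the column headers from the desired result, a rename afterwards would do it (the new names are copied from the question):
df = df.rename(columns={'Price': 'Price($)', 'Miles': 'Miles(in miles)', 'Weight': 'Weight(lbs)'})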
Step 1: split off the non-numeric Brand column so the int conversion only touches the relevant columns:
brands = df['Brand']
df = df.drop(columns=['Brand'])
Step 2: keep the digits only and convert to int:
df = df.apply(lambda x: x.str.replace(r'\D', '', regex=True)).astype(int)
Price Miles Weight
0 22000 1200 2500
1 25000 10045 2335
2 27000 22103 2110
3 35000 1110 2655
Step 3: combine them again (with concat):
pd.concat([brands, df], axis=1)
Brand Price Miles Weight
0 Honda Civic 22000 1200 2500
1 Toyota Corolla 25000 10045 2335
2 Ford Focus 27000 22103 2110
3 Audi A4 35000 1110 2655
df.dtypes
Price int32
Miles int32
Weight int32
dtype: object
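Note that astype(int) uses the platform's default integer size, which is why the dtypes above show int32 (typical on Windows); if you need a fixed width, you can ask for it explicitly (a minor variation on step 2):
# Same digit-stripping step, but with an explicit 64-bit integer dtype
df = df.apply(lambda x: x.str.replace(r'\D', '', regex=True)).astype('int64')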
I have a dataframe with, for each country, a list of products and the relevant sales.
For each country, I need to identify how many of the top-selling items it takes for their cumulative sales to reach 80% of the total sales of all items in that country.
E.g.
Cnt    Product  units
Italy  apple    500
Italy  beer     1500
Italy  bread    2000
Italy  orange   3000
Italy  butter   3000
Expected results
Italy 3
(Total units are 10,000, and the sales of the top 3 products, butter, orange, and bread, are 8,000, which is 80% of the total.)
Try defining a function and applying it on the groupby:
def get_sale(x, pct=0.8):
    thresh = pct * x.sum()
    # sort values descending so the top sales come first
    x = x.sort_values(ascending=False).reset_index(drop=True)
    # indices of rows whose cumulative sum reaches the threshold
    sale_pass_thresh = x.index[x.cumsum().ge(thresh)]
    return sale_pass_thresh[0] + 1

df.groupby('Cnt').units.apply(get_sale)
Output:
Cnt
Italy 3
Name: units, dtype: int64
We need to play with a little bit of logic here:
df = df.sort_values('units', ascending=False)
g = df.groupby('Cnt').units
s = (g.cumsum() >= g.transform('sum') * 0.8).groupby(df.Cnt).sum()
df.groupby('Cnt').size() - s + 1
Out[710]:
Cnt
Italy 3.0
dtype: float64
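To unpack that logic on the sample data (a sketch): g.cumsum() runs down the descending units, the comparison flags every row at or past the 80% mark, and the first flagged row has rank size - s + 1:
import pandas as pd

df = pd.DataFrame({'Cnt': ['Italy'] * 5,
                   'Product': ['apple', 'beer', 'bread', 'orange', 'butter'],
                   'units': [500, 1500, 2000, 3000, 3000]})

df = df.sort_values('units', ascending=False)
g = df.groupby('Cnt').units

print(g.cumsum().tolist())  # [3000, 6000, 8000, 9500, 10000]
# 80% of the total (10000) is 8000; three cumulative sums are >= 8000,
# so the crossing row is item 5 - 3 + 1 = 3
s = (g.cumsum() >= g.transform('sum') * 0.8).groupby(df.Cnt).sum()
print(df.groupby('Cnt').size() - s + 1)  # Italy -> 3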