I am having a difficult time getting the correct data from a reference CSV file into the one I am working on.
I have a CSV file that has over 6 million rows and 19 columns; it looks something like the first example table in the EDIT below.
For each row there is a brand and a model of car, among other information.
I want to add to this file the fuel consumption per 100 km traveled and the type of fuel that is used.
I have another CSV file with the fuel consumption of every model of car; it looks something like the second example table in the EDIT below.
What I ultimately want to do is add the matching values of columns G, H, I and J from the second file to the first one.
Because of the size of the file, I was wondering if there is a way to do it other than with a for or a while loop?
EDIT:
For example, the first df would look something like this:

ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:

ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:

ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have the same brand and model many times under different IDs. The order is completely random.
Thank you for providing updates. I was able to put something together that should be able to help you:
import pandas as pd

# Drop these two placeholder columns from df (your 1st table); you won't need
# them once you join df to df1 (which is your 2nd table provided)
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)

# Join the first and second DataFrames to each other on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
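Note that pd.merge defaults to an inner join, so any Brand/Model pair in the big file with no match in the lookup file would be dropped. A minimal variant (a sketch, assuming the same df and df1 names as above) keeps every row of the large file instead:

# how='left' keeps all 6M+ rows; unmatched models get NaN in the added columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'], how='left')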
I am trying to update my master data table with the information from my custom table. In pseudo-SQL:

where mt.type is null update mt.type when mt.item = ct.item

I can't find a solution online for updating one column of a DataFrame based on matching a different column between the main DataFrame and another one.
I think I need something like this, but with the condition that mt['item'] matches ct['item']:

mt['type'] = mt['type'].fillna(ct['type'])

I have also tried using lambda with x and a mapper, but I can't figure it out.
Tables below:

custom table as ct

Type      Item
Cupboard  Pasta
Fresh     Apple
Frozen    Peas
master table as mt

Type      Item    Weather  Shopping Week
Cupboard  Beans   Sunny    1
NULL      Pasta   Rainy    NULL
NULL      Apples  Null     2
NULL      Peas    Cloudy   5
...       ...     ...      ...
desired output

Type      Item    Weather  Shopping Week
Cupboard  Beans   Sunny    1
Cupboard  Pasta   Rainy    NULL
Fresh     Apples  Null     2
Frozen    Peas    Cloudy   5
...       ...     ...      ...
Thanks!
Here is one way to do it using fillna with a little help from set_index:
out = mt.assign(Type=mt.set_index("Item")["Type"]
                  .fillna(ct.set_index("Item")["Type"])
                  .reset_index(drop=True))
This will create a new DataFrame. If you need to overwrite the column "Type" in mt, use this:

mt["Type"] = (mt.set_index("Item")["Type"]
                .fillna(ct.set_index("Item")["Type"])
                .reset_index(drop=True))
Output:
print(out) # or print(mt)
Type Item Shopping_Week
0 Fresh Orange 1.0
1 Fresh Apple 2.0
2 Fresh Banana 3.0
3 Cupboard NaN NaN
4 Cupboard Beans 4.0
5 Frozen Peas 7.0
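A more common idiom for this kind of conditional lookup is Series.map (a sketch, assuming the same mt and ct names as above):

# Build an Item -> Type lookup from the custom table, then use it to fill
# only the rows where the master table's Type is missing (NULLs read as NaN)
lookup = ct.set_index("Item")["Type"]
mt["Type"] = mt["Type"].fillna(mt["Item"].map(lookup))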
I have data on some patients, and it is in a configuration that can't be used for data analysis.
We have a number of patients, and each patient has multiple visits to our clinic. Our data therefore has one row per visit, and since every patient has multiple visits, a single patient spans multiple rows. I would like a way to get just one row per patient, with a separate set of variables for each visit.
For example, the original data has multiple visits per patient, one row each (posted as a screenshot), and I want to change it to the one-row-per-patient format shown in a second screenshot.
You have to pivot your dataframe:
out = (df.pivot_table(index=['Family Name', 'ID', 'Sex', 'age'],
                      columns=df.groupby('ID').cumcount().add(1).astype(str))
         .reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
Family Name ID Sex age AA1 AA2 AB1 AB2 AC1 AC2 AE1 AE2
0 Mr James 2 male 64 9.0 NaN 10.0 NaN 11.0 NaN 12.0 NaN
1 Mr Lee 1 male 62 1.0 5.0 2.0 6.0 3.0 7.0 4.0 8.0
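Since the original table only exists as a screenshot, here is a minimal self-contained sketch of the same technique; the visit column names (AA, AB) and the values are assumptions chosen to mirror the output above:

import pandas as pd

# One row per visit; patient 1 has two visits, patient 2 has one
df = pd.DataFrame({
    'Family Name': ['Mr Lee', 'Mr Lee', 'Mr James'],
    'ID': [1, 1, 2],
    'Sex': ['male', 'male', 'male'],
    'age': [62, 62, 64],
    'AA': [1, 5, 9],
    'AB': [2, 6, 10],
})

# Number each patient's visits 1, 2, ... and spread the visit columns wide
out = (df.pivot_table(index=['Family Name', 'ID', 'Sex', 'age'],
                      columns=df.groupby('ID').cumcount().add(1).astype(str))
         .reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)  # columns: Family Name, ID, Sex, age, AA1, AA2, AB1, AB2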
The following is a sample from my data frame:
import pandas as pd
import numpy as np
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'SPORTS', 'nan',
     'SEDAN', 'SEDAN', 'SEDAN', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.nan)
df.head()
Out[24]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
There are some null values in the 'body' column
df.body.isnull().sum()
Out[25]: 3
So I am trying to fill the null values in the body column by using the mode of the body type for a particular make_model. For instance, 2 observations of SKODASUPERB have body 'SUV' and 1 observation has a null body. The mode of body for SKODASUPERB would therefore be 'SUV', and I want 'SUV' to be filled in for the third observation too. For this I am using the following code:
make_model_list = df.make_model.unique().tolist()
for x in make_model_list:
    df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
        .fillna(df.loc[df['make_model'] == x, 'body'].mode())
Unfortunately, the loop breaks, as some observations don't have a mode value:
df.body.isnull().sum()
Out[30]: 3
How can I force the loop to keep running even if there is no mode body value for a particular make_model? I know that I can use the continue statement, but I am not sure how to write it.
Assuming that each make_model maps to a single distinct body value:
# Build a per-make_model lookup of the modal body from the non-null rows
donor = df.dropna().groupby(by=['make_model']).agg(pd.Series.mode).reset_index()
# Left-join the lookup back; the original column becomes body_x, the donor's body_y
df = df.merge(donor, how='left', on=['make_model'])
df['body_x'].fillna(df.body_y, inplace=True)
df.drop(columns=['body_y'], inplace=True)
df.columns = ['make_model', 'body']
df
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB SUV
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE SPORTS
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS SEDAN
11 TOYOTAAVENSIS SEDAN
12 FERRARI360 SPORT
13 FERRARILAFERRARI SPORT
Finally, I have worked out a solution. It was just a matter of adding try and except. This solution works perfectly for the purposes of my project and has filled 95% of the missing values. I have slightly changed the data to show that this method is effective:
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'nan', 'nan',
     'SEDAN', 'SEDAN', 'nan', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.nan)
df
Out[6]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE NaN
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS NaN
11 TOYOTAAVENSIS SEDAN
df.body.isnull().sum()
Out[7]: 5
My Solution
for x in make_model_list:
    try:
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
            df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
            .fillna(df.loc[df['make_model'] == x, 'body'].value_counts().index[0])
    except IndexError:
        # value_counts() is empty when a make_model has no non-null body at all
        pass
df.body.isnull().sum()
Out[9]: 2 #null values have dropped from 5 to 2.
Those 2 null values couldn't be filled because there was no frequent or mode value for them at all.
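For reference, the same per-group fill can be written without an explicit loop; this is a sketch using groupby().transform() on the df above, and it also copes with groups that have no mode:

# Fill each group's NaNs with that group's most frequent body, if one exists
df['body'] = df.groupby('make_model')['body'].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s)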
I have a df (car_data) where there are 2 columns of interest: model and is_4wd.
The is_4wd column is either 0 or 1 and has about 25,000 missing values. However, I know that some models are 4wd because they already have a 1, and the same models also have NaN.
How can I replace the NaN values with 1 for the models I know are 4wd?
I have created a for loop, but I had to change all NaN values to 0 first and create a variable of unique car models, and the loop takes a long time to complete.
car_data['is_4wd'] = car_data['is_4wd'].fillna(0)
car_4wd = car_data.query('is_4wd==1')
caru = car_4wd['model'].unique()
for index, row in car_data.iterrows():
    if row['is_4wd'] == 0:
        if row['model'] in caru:
            car_data.loc[car_data.model == row['model'], 'is_4wd'] = 1
Is there a better way to do it? I tried several replace() methods, but to no avail.
The df head looks like this (you can see that ford f-150, for example, has both 1 and NaN in is_4wd); the expected outcome is to replace the NaN with 1 for the models that already have a value entered:
price model_year model condition cylinders fuel odometer \
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0
1 25500 NaN ford f-150 good 6.0 gas 88705.0
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0
3 1500 2003.0 ford f-150 fair 8.0 gas NaN
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0
transmission type paint_color is_4wd date_posted days_listed
0 automatic SUV NaN 1.0 2018-06-23 19
1 automatic pickup white 1.0 2018-10-19 50
2 automatic sedan red NaN 2019-02-07 79
3 automatic pickup NaN NaN 2019-03-22 9
4 automatic sedan black NaN 2019-04-02 28
Group your data by the model column and fill the is_4wd column with the max value of the group; within a group the max is 1 if any row of that model is marked 4wd, and the trailing fillna(0) covers models that never have a value:
df['is_4wd'] = df.groupby('model')['is_4wd'] \
.transform(lambda x: x.fillna(x.max())).fillna(0).astype(int)
print(df[['model', 'is_4wd']])
# Output:
model is_4wd
0 bmw x5 1
1 ford f-150 1
2 hyundai sonata 0
3 ford f-150 1
4 chrysler 200 0
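An equivalent loop-free version of the questioner's original approach (a sketch, assuming the same car_data frame) collects the models known to be 4wd and fills from membership in that set:

# Models that have at least one explicit is_4wd == 1
known_4wd = car_data.loc[car_data['is_4wd'] == 1, 'model'].unique()
# Fill NaN with 1 for those models and 0 for everything else
car_data['is_4wd'] = car_data['is_4wd'].fillna(
    car_data['model'].isin(known_4wd).astype(int))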
I have 2 DataFrames.
DF1:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
DF2:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
My new DataFrame should look like this:

DF_new:
userId  Toy Story  Jumanji  Grumpier Old Men  Waiting to Exhale  Father of the Bride Part II
1       4.0
2
3
4
The values will be the ratings of the individual users for each movie.
The movie titles are now the columns.
The userIds are now the rows.
I think it will work by joining via the movieId, but I'm not sure how to do this exactly so that I still have the movie names attached to the movieId.
Does anybody have an idea?
The problem consists of essentially 2 parts:
How to reshape df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIds to their names. This can easily be done by direct substitution on df_new.columns.
In addition, if movies that received no ratings are to be listed as well, just insert the missing movieIds directly before the name substitution mentioned previously.
Code
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate the correctness, 2 distinct user ids were used
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4],
    }
)
# 1. Produce the main table
# (pivot_table averages duplicate (userId, movieId) ratings by default,
# which is where the 4.5 below comes from)
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new)  # already pretty close
Out[17]:
movieId 1 2 3 5
userId
1 4.0 4.5 NaN NaN
2 NaN NaN 5.0 4.0
# 2. map movie ID's to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing year
# (assume consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) add movies that received no ratings as empty columns,
# then order the columns by movieId
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
df_new = df_new[df_names.index.values]
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
toy story Jumanji Grumpier 33 Waiting 44 Father 55
userId
1 4.0 4.5 NaN NaN NaN
2 NaN NaN 5.0 NaN 4.0
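An alternative order of operations (a sketch using the same df1/df2 as above) merges the titles in first and pivots on them directly; unlike the version above, it will not list movies that received no ratings:

# Attach titles to each rating row, then pivot on the title column
merged = df2.merge(df1, on="movieId")
merged["title"] = merged["title"].str[:-7]  # strip the " (1995)" suffix
df_new2 = merged.pivot_table(index="userId", columns="title", values="rating")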