How to fill a column based on another column truth value? - pandas

I have a df (car_data) with 2 columns: model and is_4wd.
The is_4wd column is either 0 or 1 and has about 25,000 missing values. However, I know that some models are 4wd because they already have a 1, and the same models also have NaN.
How can I replace the NaN values for the models I know are already 4wd?
I created a for loop, but I had to change all NaN values to 0 and build an array of unique 4wd models, and the loop takes a long time to complete.
car_data['is_4wd'] = car_data['is_4wd'].fillna(0)
car_4wd = car_data.query('is_4wd == 1')
caru = car_4wd['model'].unique()
for index, row in car_data.iterrows():
    if row['is_4wd'] == 0:
        if row['model'] in caru:
            car_data.loc[car_data.model == row['model'], 'is_4wd'] = 1
Is there a better way to do it? I tried several replace() methods, to no avail.
The df head looks like this (you can see that ford f-150, for example, has both 1 and NaN in is_4wd); the expected outcome is to replace the NaN values with 1 for the models that already have a 1 entered:
price model_year model condition cylinders fuel odometer \
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0
1 25500 NaN ford f-150 good 6.0 gas 88705.0
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0
3 1500 2003.0 ford f-150 fair 8.0 gas NaN
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0
transmission type paint_color is_4wd date_posted days_listed
0 automatic SUV NaN 1.0 2018-06-23 19
1 automatic pickup white 1.0 2018-10-19 50
2 automatic sedan red NaN 2019-02-07 79
3 automatic pickup NaN NaN 2019-03-22 9
4 automatic sedan black NaN 2019-04-02 28

Group your data by the model column and fill the is_4wd column with the max value of each group:
df['is_4wd'] = df.groupby('model')['is_4wd'] \
                 .transform(lambda x: x.fillna(x.max())).fillna(0).astype(int)
print(df[['model', 'is_4wd']])
# Output:
model is_4wd
0 bmw x5 1
1 ford f-150 1
2 hyundai sonata 0
3 ford f-150 1
4 chrysler 200 0
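An equivalent vectorized alternative collects the models that already have an explicit 1 and flags every row of those models. This is a sketch on a small stand-in for car_data built from the sample head, and it assumes (as in the sample) that the existing values are only 1 or NaN; if explicit 0s must be preserved for models that also have a 1, use the groupby version above instead.

```python
import numpy as np
import pandas as pd

# Stand-in for car_data, reduced to the two relevant columns
car_data = pd.DataFrame({
    'model': ['bmw x5', 'ford f-150', 'hyundai sonata', 'ford f-150', 'chrysler 200'],
    'is_4wd': [1.0, 1.0, np.nan, np.nan, np.nan],
})

# Models that have at least one explicit 1 in is_4wd
known_4wd = car_data.loc[car_data['is_4wd'] == 1, 'model'].unique()

# Every row of a known-4wd model becomes 1; everything else becomes 0
car_data['is_4wd'] = car_data['model'].isin(known_4wd).astype(int)
```

This avoids both iterrows() and a per-group lambda, which matters at 25,000+ rows.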

A loop to fill null values in a column using the mode value is breaking; still no working solution

The following is a sample from my data frame:
import pandas as pd
import numpy as np

d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'SPORTS',
     'nan', 'SEDAN', 'SEDAN', 'SEDAN', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.NaN)
df.head()
Out[24]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
There are some null values in the 'body' column:
df.body.isnull().sum()
Out[25]: 3
So I am trying to fill the null values in the body column using the mode of the body type for a particular make_model. For instance, 2 observations of SKODASUPERB have body 'SUV' and 1 observation has a null body, so the mode of body for SKODASUPERB would be 'SUV', and I want 'SUV' to be filled in for the third observation too. For this I am using the following code:
make_model_list = df.make_model.unique().tolist()
for x in make_model_list:
    df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
        .fillna(df.loc[df['make_model'] == x, 'body'].mode())
Unfortunately, the loop is breaking, as some observations don't have a mode value:
df.body.isnull().sum()
Out[30]: 3
How can I force the loop to run even if there is no mode body value for a particular make_model? I know that I can use the continue statement, but I am not sure how to write it.
Assuming that make_model and body are distinct values:
donor = df.dropna().groupby(by=['make_model']).agg(pd.Series.mode).reset_index()
df = df.merge(donor, how='left', on=['make_model'])
df['body_x'].fillna(df.body_y, inplace=True)
df.drop(columns=['body_y'], inplace=True)
df.columns = ['make_model', 'body']
df
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB SUV
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE SPORTS
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS SEDAN
11 TOYOTAAVENSIS SEDAN
12 FERRARI360 SPORT
13 FERRARILAFERRARI SPORT
Finally, I have worked out a solution. It was just a matter of adding a try/except block. This solution works perfectly for the purpose of my project and has filled 95% of the missing values. I have slightly changed the data to show that this method is effective:
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'nan',
     'nan', 'SEDAN', 'SEDAN', 'nan', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.NaN)
df
df
Out[6]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE NaN
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS NaN
11 TOYOTAAVENSIS SEDAN
df.body.isnull().sum()
Out[7]: 5
My solution:
for x in make_model_list:
    try:
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
            df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
            .fillna(df.loc[df['make_model'] == x, 'body'].value_counts().index[0])
    except:
        pass
df.body.isnull().sum()
Out[9]: 2 #null values have dropped from 5 to 2.
Those 2 null values couldn't be filled because there was no frequent or mode value for them at all.
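A loop-free alternative, sketched on the same modified sample data, handles the empty-mode case explicitly instead of relying on try/except: transform each make_model group and fill only when a mode exists.

```python
import numpy as np
import pandas as pd

# Same modified sample data as above
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'nan',
     'nan', 'SEDAN', 'SEDAN', 'nan', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.NaN)

def fill_with_mode(s):
    # mode() returns an empty Series when a group is all-NaN;
    # leave such groups untouched instead of raising
    m = s.mode()
    return s.fillna(m.iloc[0]) if not m.empty else s

df['body'] = df.groupby('make_model')['body'].transform(fill_with_mode)
```

As with the try/except loop, the two remaining nulls are the models with no non-null body at all (MERCEDES-BENZE CLASS and TOYOTAHIACE).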

How do you copy data from one dataframe to another?

I am having a difficult time getting the correct data from a reference csv file into the one I am working on.
I have a csv file that has over 6 million rows and 19 columns. For each row there is a brand and a model of a car, among other information.
I want to add to this file the fuel consumption per 100 km traveled and the type of fuel that is used.
I have another csv file that has the fuel consumption of every model of car.
What I want to ultimately do is add the matching values of the G, H, I and J columns from the second file to the first one.
Because of the size of the file, I was wondering if there is another way to do it other than with a "for" or a "while" loop?
EDIT:
For example, the first df would look something like this:

ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:

ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:

ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have the same brand and model many times under different IDs. The order is completely random.
Thank you for providing updates. I was able to put something together that should help:
# Drop these two empty columns from df (your first table); you won't need them once df1 (your second table) is joined to it
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# This joins your first and second tables to each other on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
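A self-contained sketch with the sample frames from the question's edit (column names taken from there): passing how='left' keeps every row of the big file even when a model is missing from the lookup, which is usually what you want with 6 million rows, and the lookup happens in vectorized code with no Python-level loop.

```python
import pandas as pd

# First file: the big table, consumption columns not yet filled
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Brand': ['Toyota', 'Honda', 'GMC', 'Toyota'],
    'Model': ['Rav4', 'Civic', 'Sierra', 'Rav4'],
    'Other_columns': ['a', 'b', 'c', 'd'],
})

# Second file: one row per model with the consumption figures
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Brand': ['Toyota', 'Toyota', 'GMC', 'Honda'],
    'Model': ['Corrola', 'Rav4', 'Sierra', 'Civic'],
    'Fuel_consu_1': [100, 80, 91, 112],
    'Fuel_consu_2': [120, 84, 105, 125],
})

# Drop the lookup table's own ID so it does not collide with the big file's ID,
# then left-merge on the shared key columns
df_merge = df.merge(df1.drop(columns='ID'), on=['Brand', 'Model'], how='left')
```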

Concatenate labels to an existing dataframe

I want to use a list of names, "headers", to create a new column in my dataframe. In the initial table, the name of each division is positioned above the results for each team in that division. I want to add that header to each row entry for each division to make the data more identifiable, like this. I have the headers stored in the "headers" object in my code. How can I repeat each division header once per row that appears in the division and append it to the dataset?
Edit: here is another snippet of what I want the get from the end product.
df3 = df.iloc[0:6]
df3.insert(0, 'Divisions', ['na','L5 Junior', 'L5 Junior', 'na',
'L5 Senior - Medium', 'L5 Senior - Medium'])
df3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

# Import HTML
scr = 'https://tv.varsity.com/results/7361971-2022-spirit-unlimited-battle-at-the-boardwalk-atlantic-city-grand-ntls/31220'
scr1 = requests.get(scr)
soup = BeautifulSoup(scr1.text, "html.parser")

# List of names to append
sp3 = soup.find(class_="full-content").find_all("h2")
headers = [elt.text for elt in sp3]
table_MN = pd.read_html(scr)

# Extract text and header from division info
div = pd.DataFrame(headers)
div.columns = ["division"]
df = pd.concat(table_MN, ignore_index=True)
df.columns = df.iloc[0]
df
It is still not clear what output you are looking for. However, may I suggest the following, which selects the common headers from the tables in table_MN and then concatenates the results. If it is going in the right direction, please let me know, and indicate what else you want to extract from the resulting table:
tmn_1 = [tbl.T.set_index(0).T for tbl in table_MN]
pd.concat(tmn_1, axis=0, ignore_index = True)
output:
Rank Program Name Team Name Raw Score Deductions Performance Score Event Score
-- ------ --------------------------- ----------------- ----------- ------------ ------------------- -------------
0 1 Rockstar Cheer New Jersey City Girls 47.8667 0 95.7333 95.6833
1 2 Cheer Factor Xtraordinary 46.6667 0.15 93.1833 92.8541
2 1 Rockstar Cheer New Jersey City Girls 47.7667 0 95.5333 23.8833
3 2 Cheer Factor Xtraordinary 46.0333 0.2 91.8667 22.9667
4 1 Star Athletics Roar 47.5333 0.9 94.1667 93.9959
5 1 Prime Time All Stars Lady Onyx 43.9 1.35 86.45 86.6958
6 1 Prime Time All Stars Lady Onyx 44.1667 0.9 87.4333 21.8583
7 1 Just Cheer All Stars Jag 5 46.4333 0.15 92.7167 92.2875
8 1 Just Cheer All Stars Jag 5 45.8 0.6 91 22.75
9 1 Quest Athletics Black Ops 47.4333 0.45 94.4167 93.725
10 1 Quest Athletics Black Ops 46.5 1.35 91.65 22.9125
11 1 The Stingray Allstars X-Rays 45.3 0.95 89.65 88.4375
12 1 Vortex Allstars Lady Rays 45.7 0.5 90.9 91.1083
13 1 Vortex Allstars Lady Rays 45.8667 0 91.7333 22.9333
14 1 Upper Merion All Stars Citrus 46.4333 0 92.8667 92.7
15 2 Cheer Factor JUNIOR X 45.9 1.1 90.7 90.6542
16 3 NJ Premier All Stars Prodigy 44.6333 0.05 89.2167 89.8292
17 1 Upper Merion All Stars Citrus 46.1 0 92.2 23.05
18 2 NJ Premier All Stars Prodigy 45.8333 0 91.6667 22.9167
19 3 Cheer Factor JUNIOR X 45.7333 0.95 90.5167 22.6292
20 1 Virginia Royalty Athletics Dynasty 46.5 0 93 92.9
21 1 Virginia Royalty Athletics Dynasty 46.3 0 92.6 23.15
22 1 South Jersey Storm Lady Reign 47.2333 0 94.4667 93.4875
...
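To attach the division names themselves, one approach (a sketch, assuming each entry of headers lines up one-to-one with a table in table_MN, which the page layout suggests but is not verified here) is to insert the header once per row of its table before concatenating. The names below are hypothetical stand-ins for the scraped data:

```python
import pandas as pd

# Hypothetical stand-ins for the scraped division headers and result tables
headers = ['L5 Junior', 'L5 Senior - Medium']
table_MN = [
    pd.DataFrame({'Team Name': ['City Girls', 'Xtraordinary']}),
    pd.DataFrame({'Team Name': ['Roar']}),
]

# Repeat each division name once per row of its table, then concatenate
for name, tbl in zip(headers, table_MN):
    tbl.insert(0, 'Divisions', [name] * len(tbl))
df = pd.concat(table_MN, ignore_index=True)
```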

How to count the distance in cells (e.g. in indices) between two repeating values in one column in Pandas dataframe?

I have the following dataset. It lists the words that were presented to a participant in the psycholinguistic experiment (I set the order of the presentation of each word as an index):
data = {'Stimulus': ['sword','apple','tap','stick', 'elephant', 'boots', 'berry', 'apple', 'pear', 'apple', 'stick'],'Order': [1,2,3,4,5,6,7,8,9,10,11]}
df = pd.DataFrame(data, columns = ['Stimulus', 'Order'])
df.set_index('Order', inplace=True)
Stimulus
Order
1 sword
2 apple
3 tap
4 stick
5 elephant
6 boots
7 berry
8 apple
9 pear
10 apple
11 stick
Some values in this dataset are repeated (e.g. apple), some are not. The problem is that I need to calculate the distance in cells based on the order column between each occurrence of repeated values and store it in a separate column, like this:
Stimulus Distance
Order
1 sword NA
2 apple NA
3 tap NA
4 stick NA
5 elephant NA
6 boots NA
7 berry NA
8 apple 6
9 pear NA
10 apple 2
11 stick 7
It shouldn't be hard to implement, but I've gotten stuck. Initially, I made a dictionary of duplicates where I store items as keys and their indices as values:
{'apple': [2,8,10],'stick': [4, 11]}
And then I failed to find a solution to put those values into a column. If there is a simpler way to do it in a loop without using dictionaries, please let me know. I will appreciate any advice.
Use df.groupby on Stimulus, then transform the Order column using pd.Series.diff:
df = df.reset_index()
df['Distance'] = df.groupby('Stimulus').transform(pd.Series.diff)
df = df.set_index('Order')
# print(df)
Stimulus Distance
Order
1 sword NaN
2 apple NaN
3 tap NaN
4 stick NaN
5 elephant NaN
6 boots NaN
7 berry NaN
8 apple 6.0
9 pear NaN
10 apple 2.0
11 stick 7.0
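The same idea can be written slightly more directly by diffing the Order values inside each Stimulus group (a sketch on the question's data; .values is used because the grouped result is aligned to the temporary 0-based index rather than to Order):

```python
import pandas as pd

data = {'Stimulus': ['sword', 'apple', 'tap', 'stick', 'elephant', 'boots',
                     'berry', 'apple', 'pear', 'apple', 'stick'],
        'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data).set_index('Order')

# diff() within each group subtracts the previous occurrence's Order;
# first occurrences have no predecessor and stay NaN
df['Distance'] = df.reset_index().groupby('Stimulus')['Order'].diff().values
```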

python pandas melt on combined columns

I have a dataframe like this. I have regular fields up to "state"; then I have trailers (3 columns, tr1* representing trailer 1), and I want to convert those trailers to rows. I tried the melt function, but I am able to use only 1 trailer column. Kindly look at the example below so you can understand.
Name  number  city    state  tr1num  tr1acct  tr1ct  tr2num  tr2acct  tr2ct   tr3num  tr3acct  tr3ct
DJ    10      Edison  nj     1001    20345    Dew    1002    20346    Newca   1003    20347    pen
ND    20      Newark  DE     2001    1985     flor   2002    1986     rodge
I am expecting output like this:
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name', 'number', 'city', 'state'])
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()
pd.wide_to_long(df, ['trnum', 'trct', 'tracct'],
                ['Name', 'number', 'city', 'state'], 'Code', sep='_', suffix=r'\d+')\
  .reset_index()\
  .drop('Code', axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name": ["DJ", "ND"], "number": [10, 20], "city": ["Edison", "Newark"], "state": ["nj", "DE"],
                   "trnum_1": [1001, 2001], "tracct_1": [20345, 1985], "trct_1": ["Dew", "flor"],
                   "trnum_2": [1002, 2002], "trct_2": ["Newca", "rodge"],
                   "trnum_3": [1003, None], "tracct_3": [20347, None], "trct_3": ["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_')\
  .reset_index().drop('dropme', axis=1)\
  .sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
subsetting to create 2 df's:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
.pivot_longer(
index=slice("Name", "state"),
names_to=(".value", ".value"),
names_pattern=r"(.+)\d(.+)",
sort_by_appearance=True)
.dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that currently the multiple .value option is available in dev.
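For those who prefer to stay in plain pandas, the same reshape can be sketched by turning the trailer columns into a (trailer, field) MultiIndex and stacking the trailer level. This is shown on the question's data; older pandas versions drop the all-NaN trailer rows during stack() on their own, so the explicit dropna is a safety net for newer versions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['DJ', 'ND'], 'number': [10, 20],
    'city': ['Edison', 'Newark'], 'state': ['nj', 'DE'],
    'tr1num': [1001, 2001], 'tr1acct': [20345, 1985], 'tr1ct': ['Dew', 'flor'],
    'tr2num': [1002, 2002], 'tr2acct': [20346, 1986], 'tr2ct': ['Newca', 'rodge'],
    'tr3num': [1003, None], 'tr3acct': [20347, None], 'tr3ct': ['pen', None],
})

wide = df.set_index(['Name', 'number', 'city', 'state'])
# Split 'tr1num' into trailer '1' and field 'num', giving a two-level column index
wide.columns = pd.MultiIndex.from_frame(
    wide.columns.str.extract(r'tr(\d+)(\D+)'), names=['trailer', 'field'])
# Stack the trailer level into rows and drop trailers that are entirely missing
out = (wide.stack(level='trailer')
           .dropna(how='all')
           .reset_index())
```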