python pandas dataframe indexing

I have a data frame:
df = pd.DataFrame(carData)
df['models']
# df.ffill()  # this is where I need to fill the rows I want to add with the previous value
The data looks something like
models
1 honda
2 ford
3 chevy
I want to add an index but keep it numerical up to a certain number, and forward fill the models column to the last value. For example, the dataset above has 3 entries; I want to extend it to a total of 5 entries, so it should look something like
models
1 honda
2 ford
3 chevy
4 chevy
5 chevy

Using df.reindex() and df.ffill()
N = 5
df.reindex(range(N)).ffill()
models
0 honda
1 ford
2 chevy
3 chevy
4 chevy
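For reference, a minimal runnable sketch of this answer. Note the output above assumes the frame has pandas' default 0-based RangeIndex; with the question's 1-based labels you would reindex over range(1, N + 1) instead:
import pandas as pd

df = pd.DataFrame({'models': ['honda', 'ford', 'chevy']})  # default 0-based RangeIndex
N = 5
print(df.reindex(range(N)).ffill())
# to keep the question's 1-based labels instead:
df.index = range(1, len(df) + 1)
print(df.reindex(range(1, N + 1)).ffill())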

Use reindex with method='ffill' or add ffill:
import numpy as np

N = 5
df = df.reindex(np.arange(1, N + 1), method='ffill')
# alternative:
# df = df.reindex(np.arange(1, N + 1)).ffill()
print(df)
models
1 honda
2 ford
3 chevy
4 chevy
5 chevy
If you want the default RangeIndex (starting at 0) instead, reset it first:
df = df.reset_index(drop=True)
N = 5
df = df.reindex(np.arange(N), method='ffill')
# alternative:
# df = df.reindex(np.arange(N)).ffill()
print(df)
models
0 honda
1 ford
2 chevy
3 chevy
4 chevy
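One difference between the two variants is worth knowing: reindex(..., method='ffill') only fills the rows introduced by the reindex, while a chained .ffill() also fills any NaNs that already existed in the frame. A small sketch, reusing the np import above:
df2 = pd.DataFrame({'models': ['honda', None, 'chevy']})
df2.reindex(np.arange(5), method='ffill')  # the original NaN at row 1 is kept; rows 3-4 become 'chevy'
df2.reindex(np.arange(5)).ffill()          # the original NaN at row 1 is also filled with 'honda'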

Related

Pandas - Move data in one column to the same row in a different column

I have a df which looks like the below. There are 2 quantity columns and I want to move the quantities in the "QTY 2" column to the "QTY" column.
Note: there are no instances where there are values in the same row for both columns (for each row, either QTY is populated or QTY 2 is populated, never both).
DF
Index  Product   QTY  QTY 2
0      Shoes     5
1      Jumpers        10
2      T Shirts       15
3      Shorts    13
Desired Output
Index  Product   QTY
0      Shoes     5
1      Jumpers   10
2      T Shirts  15
3      Shorts    13
Thanks
Try this:
import numpy as np
df['QTY'] = np.where(df['QTY'].isnull(), df['QTY 2'], df['QTY'])
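For reference, a quick check of this on the sample frame (a sketch, assuming the blanks are NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['Shoes', 'Jumpers', 'T Shirts', 'Shorts'],
                   'QTY': [5, np.nan, np.nan, 13],
                   'QTY 2': [np.nan, 10, 15, np.nan]})
df['QTY'] = np.where(df['QTY'].isnull(), df['QTY 2'], df['QTY'])
df = df.drop(columns='QTY 2')
print(df)  # QTY comes out as float (5.0, 10.0, ...) unless you downcast afterwards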
df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
filling the gaps of QTY with QTY 2:
In [254]: df
Out[254]:
Index Product QTY QTY 2
0 0 Shoes 5.0 NaN
1 1 Jumpers NaN 10.0
2 2 T Shirts NaN 15.0
3 3 Shorts 13.0 NaN
In [255]: df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
In [256]: df
Out[256]:
Index Product QTY QTY 2
0 0 Shoes 5 NaN
1 1 Jumpers 10 10.0
2 2 T Shirts 15 15.0
3 3 Shorts 13 NaN
downcast="infer" makes it "these look like integer after NaNs gone, so make the type integer".
you can drop QTY 2 after this with df = df.drop(columns="QTY 2"). If you want one-line is as usual possible:
df = (df.assign(QTY=df["QTY"].fillna(df["QTY 2"], downcast="infer"))
.drop(columns="QTY 2"))
You can do this (I am assuming your empty values are empty strings):
df = df.assign(QTY=df[['QTY', 'QTY 2']]
               .replace('', 0)
               .sum(axis=1)).drop('QTY 2', axis=1)
print(df)
Product QTY
0 Shoes 5
1 Jumpers 10
2 T Shirts 15
3 Shorts 13
If the empty values are actually NaNs, then:
df['QTY'] = df['QTY'].fillna(df['QTY 2'])  # or
df['QTY'] = df[['QTY', 'QTY 2']].sum(axis=1)
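Another equivalent option for the NaN case is Series.combine_first, which takes the value from QTY and falls back to QTY 2 where QTY is missing:
df['QTY'] = df['QTY'].combine_first(df['QTY 2'])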

How do you copy data from a dataframe to another

I am having a difficult time getting the correct data from a reference csv file into the one I am working on.
I have a csv file that has over 6 million rows and 19 columns.
For each row there is a brand and a model of a car, amongst other information.
I want to add to this file the fuel consumption per 100 km traveled and the type of fuel that is used.
I have another csv file that has the fuel consumption of every model of car.
What I want to ultimately do is add the matching values of the G, H, I and J columns from the second file to the first one.
Because of the size of the file, I was wondering if there is another way to do it other than with a "for" or a "while" loop?
EDIT :
For example...
The first df would look something like this:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this:
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be:
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may have many times the same brand and model for different ID's. The order is completely random.
Thank you for providing updates. I was able to put something together that should help you:
# drop these two columns because you won't need them once you join df to df1 (your 2nd table)
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# this joins your first and second DataFrames to each other on the Brand and Model columns
df_merge = pd.merge(df, df1, on=['Brand', 'Model'])
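One caveat on the join type: pd.merge defaults to an inner join, which silently drops rows of df whose Brand/Model pair has no match in df1. For a 6-million-row table you probably want a left join instead. A sketch continuing from the snippet above, assuming the column names from the example:
# keep only the lookup columns from the reference table to avoid duplicate ID columns
lookup = df1[['Brand', 'Model', 'Fuel_consu_1', 'Fuel_consu_2']]
# a left join keeps every row of the large frame; unmatched Brand/Model pairs get NaN
df_merge = pd.merge(df, lookup, on=['Brand', 'Model'], how='left')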

How to fill a column based on another column truth value?

I have a df (car_data) with 2 columns: model and is_4wd.
The is_4wd column is either 0 or 1 and has about 25,000 missing values. However, I know that some models are 4wd because they already have a 1, and the same models also have NaN rows.
How can I replace the NaN values for the models that I know already have a 1?
I have created a for loop, but I had to change all NaN values to 0 and create a variable of unique car models, and the loop takes a long time to complete.
car_data['is_4wd']=car_data['is_4wd'].fillna(0)
car_4wd=car_data.query('is_4wd==1')
caru=car_4wd['model'].unique()
for index, row in car_data.iterrows():
    if row['is_4wd'] == 0:
        if row['model'] in caru:
            car_data.loc[car_data.model == row['model'], 'is_4wd'] = 1
Is there a better way to do it? I tried several replace() methods but to no avail.
The df head looks like this (you can see ford f-150, for example, has both 1 and NaN in is_4wd). The expected outcome is to replace all the NaNs with 1 for the models I know already have values entered.
price model_year model condition cylinders fuel odometer \
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0
1 25500 NaN ford f-150 good 6.0 gas 88705.0
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0
3 1500 2003.0 ford f-150 fair 8.0 gas NaN
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0
transmission type paint_color is_4wd date_posted days_listed
0 automatic SUV NaN 1.0 2018-06-23 19
1 automatic pickup white 1.0 2018-10-19 50
2 automatic sedan red NaN 2019-02-07 79
3 automatic pickup NaN NaN 2019-03-22 9
4 automatic sedan black NaN 2019-04-02 28
Group your data by model column and fill is_4wd column by the max value of the group:
df['is_4wd'] = df.groupby('model')['is_4wd'] \
    .transform(lambda x: x.fillna(x.max())).fillna(0).astype(int)
print(df[['model', 'is_4wd']])
# Output:
model is_4wd
0 bmw x5 1
1 ford f-150 1
2 hyundai sonata 0
3 ford f-150 1
4 chrysler 200 0
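An equivalent two-step version, in case the lambda is hard to read: compute each model's max once, then map it back onto the column. A sketch, assuming (as in the sample) that is_4wd only ever holds 1.0 or NaN:
known = car_data.groupby('model')['is_4wd'].max()  # 1.0 if any row of that model is marked 4wd
car_data['is_4wd'] = car_data['model'].map(known).fillna(0).astype(int)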

Efficient way to remove strings/characters from three columns and convert to a float in pandas

I have a pandas data frame with three columns having a mixture of alphanumeric values in them. I want to:
efficiently remove the characters/strings in the alphanumeric values in columns Price, Miles, and Weight.
Convert the resulting values to a float
See below for an example...
import pandas as pd
cars_info = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
             'Price': ['22000$', '25000$', '27000$', '35000$'],
             'Miles': ['1200 miles', '10045 miles', '22103 miles', '1110 miles'],
             'Weight': ['2500 lbs', '2335 lbs', '2110 lbs', '2655 lbs']}
df = pd.DataFrame(cars_info, columns = ['Brand', 'Price','Miles','Weight'])
df.dtypes # returns `object` data type for columns Price, Miles and Weight
Desired result
Brand, Price($), Miles(in miles), Weight(lbs)
Honda Civic,22000,1200, 2500
Toyota Corolla, 25000, 10045, 2335
Ford Focus, 27000, 22103, 2110
Audi A4, 35000, 1110, 2655
My attempt
for col in df:
    df[col] = df[col].str.replace(r'\D', '').astype(float)
There are many ways to solve this problem. You could .str.replace the labels you don't care about, or .str.split if you know the number is always the first thing before a space, for instance.
In this case it looks like you can extract whatever looks like a number ([\d\.]+), and then use pd.to_numeric to cast that to a numeric type.
for col in ['Price', 'Miles', 'Weight']:
    df[col] = pd.to_numeric(df[col].str.extract(r'([\d\.]+)', expand=False))
print(df)
# Brand Price Miles Weight
#0 Honda Civic 22000 1200 2500
#1 Toyota Corolla 25000 10045 2335
#2 Ford Focus 27000 22103 2110
#3 Audi A4 35000 1110 2655
df.dtypes
#Brand object
#Price int64
#Miles int64
#Weight int64
#dtype: object
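The same idea without the explicit loop, applying the extraction to all three columns at once (a cosmetic variant of the code above, not a different technique):
cols = ['Price', 'Miles', 'Weight']
df[cols] = df[cols].apply(lambda s: pd.to_numeric(s.str.extract(r'([\d.]+)', expand=False)))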
step 1: split off the non-numeric column so the conversion only touches the relevant columns:
brands = df['Brand']
df = df.drop(columns=['Brand'])
step 2: keep the digits only and convert to int:
df = df.apply(lambda x: x.str.replace(r'\D', '', regex=True)).astype(int)  # regex=True needed on pandas >= 1.4
Price Miles Weight
0 22000 1200 2500
1 25000 10045 2335
2 27000 22103 2110
3 35000 1110 2655
step 3: combine them again (with concat):
pd.concat([brands, df], axis=1)
Brand Price Miles Weight
0 Honda Civic 22000 1200 2500
1 Toyota Corolla 25000 10045 2335
2 Ford Focus 27000 22103 2110
3 Audi A4 35000 1110 2655
df.dtypes
Price int32
Miles int32
Weight int32
dtype: object

How to calculate conditional probability of values in pyspark dataframe?

I want to calculate the conditional probabilities of the ratings ('A', 'B', 'C') in the rating column, given the value of the type column, in pyspark without collecting.
input:
company model rating type
0 ford mustang A coupe
1 chevy camaro B coupe
2 ford fiesta C sedan
3 ford focus A sedan
4 ford taurus B sedan
5 toyota camry B sedan
output:
rating type conditional_probability
0 A coupe 0.50
1 B coupe 0.33
2 C sedan 1.00
3 A sedan 0.50
4 B sedan 0.66
You can use groupby to get counts of items for each rating and for each combination of rating and type, and calculate the conditional probability from these values.
from pyspark.sql import functions as F
ratings_cols = ["company", "model", "rating", "type"]
ratings_values = [
    ("ford", "mustang", "A", "coupe"),
    ("chevy", "camaro", "B", "coupe"),
    ("ford", "fiesta", "C", "sedan"),
    ("ford", "focus", "A", "sedan"),
    ("ford", "taurus", "B", "sedan"),
    ("toyota", "camry", "B", "sedan"),
]
ratings_df = spark.createDataFrame(data=ratings_values, schema=ratings_cols)
ratings_df.show()
# +-------+-------+------+-----+
# |company| model|rating| type|
# +-------+-------+------+-----+
# | ford|mustang| A|coupe|
# | chevy| camaro| B|coupe|
# | ford| fiesta| C|sedan|
# | ford| focus| A|sedan|
# | ford| taurus| B|sedan|
# | toyota| camry| B|sedan|
# +-------+-------+------+-----+
probability_df = (ratings_df.groupby(["rating", "type"])
                  .agg(F.count(F.lit(1)).alias("rating_type_count"))
                  .join(ratings_df.groupby("rating").agg(F.count(F.lit(1)).alias("rating_count")), on="rating")
                  .withColumn("conditional_probability", F.round(F.col("rating_type_count") / F.col("rating_count"), 2))
                  .select(["rating", "type", "conditional_probability"])
                  .sort(["type", "rating"]))
probability_df.show()
# +------+-----+-----------------------+
# |rating| type|conditional_probability|
# +------+-----+-----------------------+
# | A|coupe| 0.5|
# | B|coupe| 0.33|
# | A|sedan| 0.5|
# | B|sedan| 0.67|
# | C|sedan| 1.0|
# +------+-----+-----------------------+
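If you would rather avoid the self-join, the same counts can be computed with window functions; a sketch using the same ratings_df, dividing the count of each rating/type pair by the count of the rating:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_pair = Window.partitionBy("rating", "type")
w_rating = Window.partitionBy("rating")

probability_df = (ratings_df
    .withColumn("conditional_probability",
                F.round(F.count("*").over(w_pair) / F.count("*").over(w_rating), 2))
    .select("rating", "type", "conditional_probability")
    .distinct()
    .sort("type", "rating"))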