How to handle categorical variables in MissForest() during missing value imputation? - missing-data

I am working on a regression problem using Bengaluru house price prediction dataset.
I was trying to impute missing values in bath and balcony using MissForest().
Since the documentation says that MissForest() can handle categorical variables through the 'cat_vars' parameter, I tried to include the 'area_type' and 'location' features in the imputer's fit_transform method by passing their column indices, as shown below:
df_temp.info()
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area_type 10296 non-null object
1 location 10296 non-null object
2 bath 10245 non-null float64
3 balcony 9810 non-null float64
4 rooms 10296 non-null int64
5 tot_sqft_1 10296 non-null float64
imputer = MissForest()
imputer.fit_transform(df_temp, cat_vars=[0,1])
But I am getting the below error:
'Cannot convert str to float: 'Super built up Area''
Could you please let me know why this could be? Do we need to encode the categorical variables using one hot encoding?
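The error suggests that the imputer converts its input to a float array before fitting, so string columns have to be numerically encoded first; cat_vars then only marks which of the already-numeric columns are categorical. Below is a minimal sketch of one way to do that with pandas category codes, assuming label encoding is acceptable here (one-hot encoding is the other common option). The sample rows are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names mirror the question
df_temp = pd.DataFrame({
    "area_type": ["Super built up Area", "Plot Area", "Super built up Area", "Built-up Area"],
    "location": ["Whitefield", "Hebbal", "Whitefield", "Hebbal"],
    "bath": [2.0, np.nan, 3.0, 2.0],
    "balcony": [1.0, 2.0, np.nan, 1.0],
})

cat_cols = ["area_type", "location"]
encoded = df_temp.copy()
categories = {}  # keep the labels so the codes can be mapped back later

for col in cat_cols:
    as_cat = encoded[col].astype("category")
    categories[col] = list(as_cat.cat.categories)
    encoded[col] = as_cat.cat.codes  # integer code per category

# Positional indices of the categorical columns, as cat_vars expects
cat_idx = [encoded.columns.get_loc(c) for c in cat_cols]

# Every column is now numeric, so the frame converts cleanly to floats
X = encoded.to_numpy(dtype=float)

# Assumption: with missingpy installed, the encoded array could then be
# passed as MissForest().fit_transform(X, cat_vars=cat_idx)
```

The categories dictionary makes it possible to translate imputed codes back to the original labels afterwards.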

Related

Plotting line graph from pandas DataFrame - does not work if I do not include .mean(), .sum() or even .median(). Very confused

I have a DataFrame that has a list of dates, cities, countries, and average temperatures in Celsius.
Here is the .info() of this DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16500 entries, 0 to 16499
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 16500 non-null object
1 city 16500 non-null object
2 country 16500 non-null object
3 avg_temp_c 16407 non-null float64
dtypes: float64(1), object(3)
I only want to plot the avg_temp_c for the cities of Toronto and Rome, both lines on one graph. This was actually a practice problem that has the solution, so here is the code for that:
toronto = temperatures[temperatures["city"] == "Toronto"]
rome = temperatures[temperatures["city"] == "Rome"]
toronto.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="blue")
rome.groupby("date")["avg_temp_c"].mean().plot(kind="line", color="green")
My question is: why do I need to include .mean() in lines 3 and 4? I thought the numbers were already in avg_temp_c. I also experimented by replacing .mean() with .sum() and .median(), and both give the same values. However, removing .mean() altogether from both lines just gives a blank plot. I tried to figure out why, but I am still confused. Why doesn't it work without .mean() when the values are already listed in avg_temp_c?
Printing toronto["avg_temp_c"] and rome["avg_temp_c"] gives me the values, but plotting without .mean(), .sum(), or .median() does not work. I am trying to figure out why this is the case, and how all three of those methods can give me the same values as printing the avg_temp_c column.
Hope my question was clear. Thank you!
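groupby("date") returns a lazy GroupBy object rather than data, so it needs an aggregation before there is anything to plot. If each date appears only once per city, every group holds a single value, and the mean, sum and median of a single value are that value. A minimal sketch with made-up temperatures (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up data: one row per date and city
temperatures = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-03-01"] * 2,
    "city": ["Toronto"] * 3 + ["Rome"] * 3,
    "avg_temp_c": [-5.0, -4.0, 1.0, 8.0, 9.0, 12.0],
})

toronto = temperatures[temperatures["city"] == "Toronto"]

# A GroupBy object holds groups, not values, until it is aggregated
grouped = toronto.groupby("date")["avg_temp_c"]

by_mean = grouped.mean()
by_sum = grouped.sum()
by_median = grouped.median()

# With one row per date, indexing by date gives the same Series
direct = toronto.set_index("date")["avg_temp_c"]

print(by_mean)
```

Under the same one-row-per-date assumption, toronto.set_index("date")["avg_temp_c"].plot(kind="line") would draw the same line without any aggregation.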

How to make Pandas Series with np.arrays into numerical value?

I am using the classical Titanic dataset. I used OneHotEncoder to encode surnames of people.
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic['Encoded_Surname'] = list(encoded_surname.astype(np.float64))
Here is what my data frame looks like:
This is what I get when I look for the .info():
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Encoded_Surname 891 non-null object
dtypes: float64(1), int64(5), object(1)
Since the Encoded_Surname label is an object and not numeric like the rest, I cannot fit the data into the classifier model.
How do I turn the np.array I got from OneHotEncoder into numeric data?
IIUC, create a new dataframe for encoded_surname data and join it to your original dataset:
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic = titanic.join(pd.DataFrame(encoded_surname, dtype=int).add_prefix('Encoded_Surname'))
I would suggest you use pd.get_dummies instead of OneHotEncoder. If you really want to use the OneHotEncoder:
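# On scikit-learn >= 1.0, use transformer.get_feature_names_out() instead
ohe_df = pd.DataFrame(encoded_surname, columns=transformer.get_feature_names())
# Concatenate with the original data and drop the raw Surname column
titanic = pd.concat([titanic, ohe_df], axis=1).drop(['Surname'], axis=1)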
If you can use pd.get_dummies:
titanic = pd.get_dummies(titanic, prefix=['Surname'], columns=['Surname'], drop_first=True)
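Here is a small, self-contained sketch of the pd.get_dummies route, using a made-up three-row frame rather than the real Titanic data. The dtype=int argument is an addition that is not in the answer above; it keeps the indicator columns as integers, since the default dtype of the dummies differs between pandas versions.

```python
import pandas as pd

# Made-up rows standing in for the Titanic frame
titanic = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.92],
    "Surname": ["Braund", "Cumings", "Heikkinen"],
})

# Replace Surname with one integer indicator column per distinct value
encoded = pd.get_dummies(titanic, columns=["Surname"], prefix="Surname", dtype=int)

print(encoded.dtypes)
```

Every resulting column is numeric, so the frame can be passed to a classifier without an object column in the way.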

pd.concat turning categorical variables into object

I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes, with variables of one type (in this instance categorical) which get changed to objects when I concatenate them. The df is massive and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files and set the dtypes as I go.
# Read the CSVs, clean them and append them to a list
outputs = []  # Create the list
counter = 1  # Start a counter, as I am importing around 4000 files
for i in modis_csv_files:  # Iterate over the files, importing and cleaning
    print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files)))  # Report progress
    output = pd.read_csv(i)  # Read the csv
    output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category'))  # Set the dtype for all the object variables that can be categories
    output['acq_date'] = output['acq_date'].astype('datetime64[ns]')  # Set the date variable
    outputs.append(output)  # Append to the list
    counter += 1  # Increment the counter
# Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
It looks like everything was converted. Perhaps one of the 4000 dfs contained something that prevented the conversion to categorical, which caused the whole variable to revert to object when concatenated. So I check each df in the list to see whether either satellite or daynight is not category:
error_output = []  # Create an empty list
for i in range(len(outputs)):  # Iterate over the list, checking whether each dtype is categorical
    if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
        error_output.append(outputs[i])  # If not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW, I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (it seemed excessive for now, however).
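One likely explanation, assuming the files differ in which satellite and daynight values they contain: pd.concat only preserves a categorical column when every frame has the same categories for it, and otherwise falls back to a non-categorical dtype. instrument would survive if every file holds the same single value. A minimal sketch with made-up two-row frames; the values 'Terra', 'Aqua' and 'MODIS' are assumptions about the data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Two hypothetical per-file frames whose satellite categories differ
first = pd.DataFrame(
    {"satellite": ["Terra", "Aqua"], "instrument": ["MODIS", "MODIS"]}
).astype("category")
second = pd.DataFrame(
    {"satellite": ["Terra", "Terra"], "instrument": ["MODIS", "MODIS"]}
).astype("category")

# satellite has categories {Aqua, Terra} in one frame and {Terra} in the other
mismatched = pd.concat([first, second], ignore_index=True)
print(mismatched.dtypes)

# Give every frame the same explicit categories before concatenating
satellite_dtype = CategoricalDtype(categories=["Aqua", "Terra"])
frames = [df.astype({"satellite": satellite_dtype}) for df in (first, second)]
aligned = pd.concat(frames, ignore_index=True)
print(aligned.dtypes)
```

In the loop from the question, this would mean replacing astype('category') with a CategoricalDtype that lists every expected value up front. pandas.api.types.union_categoricals is another option when the full set of values is not known in advance.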

When plotting a pandas dataframe, the y-axis values are not displayed correctly

I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe, with just a few values and plot the line graph for that dataframe and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my GitHub page for a working example <git@github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB

Pandas - Converting dataframe object to numbers

I have a dataframe which looks like this.
x_train.info()
Int64Index: 8330 entries, 16 to 8345
Data columns (total 4 columns):
userId 8330 non-null object
base_id 8330 non-null object
rating 8330 non-null object
dtypes: object(3)
I am trying to convert this to sparse matrix using the following command
train_sparse_matrix = sparse.csc_matrix((x_train['rating'].values, (x_train['userId'].values, x_train['base_id'].values)),)
But I get the following error
<ipython-input-112-520f5e1aee89> in <module>
4
5 train_sparse_matrix = sparse.csc_matrix((x_train['result.courseViewCount'].values, (x_train['userId'].values,
----> 6 x_train['result.base_id'].values)),)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
So I tried converting this dataframe using .astype('int32') and the to_numeric() function, but x_train.info() still shows the columns as object.
Can you please help!
Data would be something like this:
userId base_id rating
5392.0 ABC001 6.0
5392.0 ETZ222 2.0
5392.0 XYZ095 1.0
Is it because base_id contains letters?
Can you try converting it to numpy.int64?
.astype(numpy.int64)
If you share an extract of the real data, it will be easier to answer.
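Here is a minimal sketch of one way to get integer indices, assuming the values are stored as strings in object columns as in the extract above. base_id contains letters, so it cannot be cast to an integer directly; pd.factorize assigns each distinct id an integer code instead.

```python
import pandas as pd

# The three sample rows from the question, stored as strings (assumption)
x_train = pd.DataFrame({
    "userId": ["5392.0", "5392.0", "5392.0"],
    "base_id": ["ABC001", "ETZ222", "XYZ095"],
    "rating": ["6.0", "2.0", "1.0"],
})

# Numeric strings: parse to float first, then cast to integer row indices
rows = pd.to_numeric(x_train["userId"]).astype("int64").to_numpy()

# Alphanumeric ids: map each distinct value to an integer column index
cols, base_id_labels = pd.factorize(x_train["base_id"])

# Ratings only need to be numeric, not integer
ratings = pd.to_numeric(x_train["rating"]).to_numpy()

print(rows, cols, ratings)
```

The resulting arrays have the form that sparse.csc_matrix((ratings, (rows, cols))) expects, and base_id_labels maps each column index back to its original id. Factorizing userId as well would avoid allocating a row for every integer up to the largest user id.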