pd.concat turning categorical variables into object - pandas

I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes with columns of one dtype (in this instance categorical) that get changed to object when I concatenate them. The df is massive, and this makes it even larger - too large to deal with.
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning, and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list, but the dtypes of some columns get changed. Here is some sample code:
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files, setting the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = [] #Create the list
counter = 1 #Start a counter as I am importing around 4000 files
for i in modis_csv_files: #Iterate over the files, importing and cleaning
    print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files))) #Print a progress message
    output = pd.read_csv(i) #Read the csv
    output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category')) #Set the dtype for all the object variables that can be categories
    output['acq_date'] = output['acq_date'].astype('datetime64[ns]') #Set the date variable
    outputs.append(output) #Append to the list
    counter += 1 #Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything was converted as intended. Perhaps one of the ~4000 dfs contained something that could not be changed to categorical, which caused the whole column to shift back to object when concatenated. So I check each df in the list to see whether either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list checking if dtype['variable'].name is categorical
    if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
        error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
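For what it's worth, here is a tiny self-contained example that reproduces the same symptom. My guess (and it is only a guess) is that it has something to do with the per-file category sets not being identical, e.g. a file where daynight only ever contains 'D':
import pandas as pd

a = pd.DataFrame({'daynight': pd.Categorical(['D', 'N'])})
b = pd.DataFrame({'daynight': pd.Categorical(['D'])})  # different set of categories

print(pd.concat([a, b]).dtypes)  # daynight comes out as object, not category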
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (it seemed excessive for now, however).

Related

How to handle categorical variables in MissForest() during missing value imputation?

I am working on a regression problem using the Bengaluru house price prediction dataset.
I was trying to impute missing values in bath and balcony using MissForest().
Since the documentation says that MissForest() can handle categorical variables via the 'cat_vars' parameter, I tried to use the 'area_type' and 'location' features in the imputer's fit_transform method by passing their indices, as shown below:
df_temp.info()
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area_type 10296 non-null object
1 location 10296 non-null object
2 bath 10245 non-null float64
3 balcony 9810 non-null float64
4 rooms 10296 non-null int64
5 tot_sqft_1 10296 non-null float64
imputer = MissForest()
imputer.fit_transform(df_temp, cat_vars=[0,1])
But I am getting the below error:
'Cannot convert str to float: 'Super built up Area''
Could you please let me know why this could be? Do we need to encode the categorical variables using one hot encoding?
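One route I am considering (just a sketch, assuming MissForest here is missingpy's implementation and that it only accepts numeric input) is to integer-encode the two string columns first, e.g. with sklearn's OrdinalEncoder, and then pass cat_vars - but I am not sure whether this is the intended usage:
from sklearn.preprocessing import OrdinalEncoder
from missingpy import MissForest

cat_cols = ['area_type', 'location']
encoder = OrdinalEncoder()
df_temp[cat_cols] = encoder.fit_transform(df_temp[cat_cols])  # strings -> integer codes

imputer = MissForest()
imputed = imputer.fit_transform(df_temp, cat_vars=[0, 1])  # indices of the encoded columns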

How to make Pandas Series with np.arrays into numerical value?

I am using the classic Titanic dataset. I used OneHotEncoder to encode people's surnames.
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic['Encoded_Surname'] = list(encoded_surname.astype(np.float64))
This is what I get when I look at the .info() of my data frame:
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Encoded_Surname 891 non-null object
dtypes: float64(1), int64(5), object(1)
Since the Encoded_Surname label is an object and not numeric like the rest, I cannot fit the data into the classifier model.
How do I turn the np.array I got from OneHotEncoder into numeric data?
IIUC, create a new dataframe for encoded_surname data and join it to your original dataset:
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic = titanic.join(pd.DataFrame(encoded_surname, dtype=int).add_prefix('Encoded_Surname'))
I would suggest you use pd.get_dummies instead of OneHotEncoder. If you really want to use the OneHotEncoder:
ohe_df = pd.DataFrame(encoded_surname, columns=transformer.get_feature_names())
#concat with original data
titanic = pd.concat([titanic, ohe_df], axis=1).drop(['Surname'], axis=1)
If you can use pd.get_dummies:
titanic = pd.get_dummies(titanic, prefix=['Surname'], columns=['Surname'], drop_first=True)

When plotting a pandas dataframe, the y-axis values are not displayed correctly

I have a dataframe and I've listed its info below. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of spanning the minimum and maximum values of the M and F columns.
To verify that it's not my environment, I created a simple dataframe with just a few values, plotted a line graph for it, and it worked as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my GitHub page for a working example: git@github.com:stevencorrea-chicago/stackoverflow_question.git
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
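For completeness, the plotting step is essentially the following (the exact, runnable version is in the repo linked above):
import matplotlib.pyplot as plt

new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.plot()  # y-axis shows 0 to 2.0 instead of the birth totals
plt.show()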

Optimizing sending a dataframe to hdf?

What is the difference between an h5 and an hdf file? Should I use one over the other? I tried timing the following two snippets with timeit, and it took about 3 minutes and 29 seconds per loop with a 240 MB file. I eventually got an error with the second snippet, but the file size on disk was above 300 MB.
hdf = pd.HDFStore('combined.h5')
hdf.put('table', df, format='table', complib='blosc', complevel=5, data_columns=True)
hdf.close()
df.to_hdf('combined.hdf', 'table', format='table', mode='w', complib='blosc', complevel=5)
Also, I had an error message which said:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types
This is due to string columns which are objects because of blank values. If I do .astype(str), all of the blanks are replaced with 'nan' (the string, which even appears in the output files). Should I worry about the message and fill in the blanks, replacing them with np.nan again later, or just ignore it?
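To make that concrete, this is the kind of round trip I have in mind (the empty-string sentinel is just my own placeholder, not anything PyTables requires):
import numpy as np

obj_cols = df.select_dtypes(include=['object']).columns

df[obj_cols] = df[obj_cols].fillna('')  # make the object columns pure strings before writing
df.to_hdf('combined.h5', key='table', format='table', mode='w', complib='blosc', complevel=5)

df2 = pd.read_hdf('combined.h5', 'table')
df2[obj_cols] = df2[obj_cols].replace('', np.nan)  # restore the blanks afterwards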
Here is the df.info() to show that there are some columns with nulls. I can't remove these rows but I could temporarily fill them with something if required.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1387276 entries, 0 to 657406
Data columns (total 12 columns):
date 1387276 non-null datetime64[ns]
start_time 1387276 non-null datetime64[ns]
end_time 313190 non-null datetime64[ns]
cola 1387276 non-null object
colb 1387276 non-null object
colc 1387276 non-null object
cold 476816 non-null object
cole 1228781 non-null object
colx 1185679 non-null object
coly 313190 non-null object
colz 1387276 non-null int64
colzz 1387276 non-null int64
dtypes: datetime64[ns](3), int64(2), object(7)
memory usage: 137.6+ MB

Replacing Category Data (pandas)

I have some large files with several category columns. Category is kind of a generous word too because these are basically descriptions/partial sentences.
Here are the unique values per category:
Category 1 = 15
Category 2 = 94
Category 3 = 294
Category 4 = 401
Location 1 = 30
Location 2 = 60
Then there are even users with recurring data (first name, last name, IDs etc).
I was thinking of the following solutions to make the file size smaller:
1) Create a file which matches each category with a unique integer
2) Create a map (is there a way to do this from reading another file? Like I would create a .csv and load it as another dataframe and then match it? Or do I literally have to type it out initially?)
OR
3) Basically do a join (VLOOKUP) and then del the old column with the long object names
pd.merge(df1, categories, on = 'Category1', how = 'left')
del df1['Category1']
What do people normally do in this case? These files are pretty huge. 60 columns and most of the data are long, repeating categories and timestamps. Literally no numerical data at all. It's fine for me, but sharing the files is almost impossible due to shared drive space allocations for more than a few months.
To benefit from Categorical dtype when saving to csv you might want to follow this process:
Extract your Category definitions into separate dataframes / files
Convert your Categorical data to int codes
Save converted DataFrame to csv along with definitions dataframes
When you need to use them again:
Restore dataframes from csv files
Map dataframe with int codes to category definitions
Convert mapped columns to Categorical
To illustrate the process:
Make a sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null object
locations 100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
Note the size: 2.3+ MB - this would be roughly the size of your csv file.
Now convert these data to Categorical:
df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the drop in memory usage down to 976.6 KB
But if you would save it to csv now:
df.to_csv('test1.csv')
...you would see this inside the file:
index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2
Which means 'Categorical' has been converted to strings for saving in csv.
So let's get rid of the labels in Categorical data after we save the definitions:
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)
locations
index
0 Location1
1 Location2
Now convert Categorical to int dtype:
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int8
locations 100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None
Save converted data to csv and note that the file now has only numbers without labels.
The file size will also reflect this change.
df.to_csv('test2.csv')
index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
Save the definitions as well:
categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')
When you need to restore the files, load them from csv files:
df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int64
locations 100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
print(categories_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories 4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None
locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())
locations
index
0 Location1
1 Location2
print(locations_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations 2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None
Now use map to replace the int-coded data with the category descriptions and convert them to Categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())
categories locations
index
0 Category1 Location1
1 Category2 Location2
2 Category3 Location1
3 Category4 Location2
4 Category1 Location1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note that the memory usage is back to what it was when we first converted the data to Categorical.
It should not be hard to automate this process if you need to repeat it many times.
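If it helps, here is one rough way that automation could look (my own helper functions, written against the same 'index'-named index as the example above; adjust to taste):
def save_with_codes(df, prefix):
    """Save int codes to <prefix>.csv plus one definition csv per Categorical column."""
    df = df.copy()
    for col in df.select_dtypes(include=['category']).columns:
        pd.Series(df[col].cat.categories).to_csv('{}_{}_categories.csv'.format(prefix, col),
                                                 header=False)
        df[col] = df[col].cat.codes  # keep only the integer codes in the main file
    df.to_csv('{}.csv'.format(prefix))

def load_with_codes(prefix, cat_cols):
    """Rebuild the Categorical columns from the code and definition files."""
    df = pd.read_csv('{}.csv'.format(prefix), index_col='index')
    for col in cat_cols:
        cats = pd.read_csv('{}_{}_categories.csv'.format(prefix, col), header=None, index_col=0)
        df[col] = pd.Categorical.from_codes(df[col], categories=cats[1].tolist())
    return df

save_with_codes(df, 'test3')
df3 = load_with_codes('test3', ['categories', 'locations'])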
Pandas has a Categorical data type that does just that. It basically maps the categories to integers behind the scenes.
Internally, the data structure consists of a categories array and an
integer array of codes which point to the real value in the categories
array.
Documentation is here.
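A quick illustration of that internal split (not from the documentation, just a toy example):
s = pd.Series(['BC', 'Alberta', 'BC', 'Alberta'], dtype='category')
print(s.cat.categories.tolist())  # ['Alberta', 'BC']  <- the categories array
print(s.cat.codes.tolist())       # [1, 0, 1, 0]       <- the integer codes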
Here's a way to save a dataframe with Categorical columns in a single .csv:
Example:
------      -------
Fatcol      Thincol: unique strings once, then numbers
------      -------
"Alberta"   "Alberta"
"BC"        "BC"
"BC"        2    -- string 2
"Alberta"   1    -- string 1
"BC"        2
...
The "Thincol" on the right can be saved as is in a .csv file,
and expanded to the "Fatcol" on the left after reading it in;
this can halve the size of big .csvs with repeated strings.
Functions
---------
fatcol( col: Thincol ) -> Fatcol, list[ unique str ]
thincol( col: Fatcol ) -> Thincol, dict( unique str -> int ), list[ unique str ]
Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists:
Fatcol: list of strings
Thincol: list of strings or ints or NaNs
If a `col` is a `pandas.Series`, its `.values` are used.
This cut a 700 MB .csv to 248 MB -- but write_csv runs at ~1 MB/sec on my iMac.
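Since only the signatures are given above, here is a rough sketch of what thincol/fatcol might look like (my own guess at an implementation; NaN handling is left out for brevity):
import pandas as pd

def thincol(col):
    """Fatcol -> Thincol: keep the first occurrence of each string, replace repeats with 1-based codes."""
    values = col.values if isinstance(col, pd.Series) else list(col)
    code_of = {}   # unique str -> int
    uniques = []   # unique strings, in order of first appearance
    thin = []
    for v in values:
        if v in code_of:
            thin.append(code_of[v])      # repeat -> number
        else:
            code_of[v] = len(uniques) + 1
            uniques.append(v)
            thin.append(v)               # first occurrence -> string
    return thin, code_of, uniques

def fatcol(col):
    """Thincol -> Fatcol: expand the numeric codes back to the full strings."""
    uniques = []
    fat = []
    for v in col:
        if isinstance(v, str):
            if v not in uniques:
                uniques.append(v)
            fat.append(v)
        else:
            fat.append(uniques[int(v) - 1])
    return fat, uniques

thin, codes, uniques = thincol(['Alberta', 'BC', 'BC', 'Alberta', 'BC'])
# thin == ['Alberta', 'BC', 2, 1, 2]
fat, _ = fatcol(thin)
# fat == ['Alberta', 'BC', 'BC', 'Alberta', 'BC']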