Pandas - Converting dataframe object to numbers

I have a dataframe which looks like this.
x_train.info()
Int64Index: 8330 entries, 16 to 8345
Data columns (total 3 columns):
userId 8330 non-null object
base_id 8330 non-null object
rating 8330 non-null object
dtypes: object(3)
I am trying to convert this to a sparse matrix using the following command:
train_sparse_matrix = sparse.csc_matrix((x_train['rating'].values, (x_train['userId'].values, x_train['base_id'].values)),)
But I get the following error
<ipython-input-112-520f5e1aee89> in <module>
4
5 train_sparse_matrix = sparse.csc_matrix((x_train['result.courseViewCount'].values, (x_train['userId'].values,
----> 6 x_train['result.base_id'].values)),)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
So I tried converting this dataframe using .astype('int32') and the to_numeric() function, but x_train.info() still shows the columns as object.
Can you please help!
Data would be something like this:
userId base_id rating
5392.0 ABC001 6.0
5392.0 ETZ222 2.0
5392.0 XYZ095 1.0
Is it because base_id contains letters?

Can you try converting it to numpy.int64?
.astype(numpy.int64)
If you share an extract of the real data, it will be easier to answer.
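For what it's worth, since base_id contains letters it can never be cast to an integer directly, and csc_matrix requires integer row/column indices. A minimal sketch of one common workaround, using pd.factorize to map each ID to a 0-based integer index before building the matrix (column names are from the question; the data is hypothetical):
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for x_train
x_train = pd.DataFrame({'userId': ['5392.0', '5392.0', '5392.0'],
                        'base_id': ['ABC001', 'ETZ222', 'XYZ095'],
                        'rating': ['6.0', '2.0', '1.0']})

# Map each distinct user/item to an integer index; factorize returns (codes, uniques)
user_idx, user_labels = pd.factorize(x_train['userId'])
item_idx, item_labels = pd.factorize(x_train['base_id'])
ratings = pd.to_numeric(x_train['rating']).values

# The row/column arrays passed to csc_matrix must be integer arrays
train_sparse_matrix = sparse.csc_matrix((ratings, (user_idx, item_idx)))
user_labels and item_labels keep the mapping back to the original IDs.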

Related

How to handle categorical variables in MissForest() during missing value imputation?

I am working on a regression problem using the Bengaluru house price prediction dataset.
I was trying to impute missing values in bath and balcony using MissForest().
Since the documentation says that MissForest() can handle categorical variables via the 'cat_vars' parameter, I tried to use the 'area_type' and 'locality' features in the imputer's fit_transform method by passing their indices, as shown below:
df_temp.info()
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 area_type 10296 non-null object
1 location 10296 non-null object
2 bath 10245 non-null float64
3 balcony 9810 non-null float64
4 rooms 10296 non-null int64
5 tot_sqft_1 10296 non-null float64
imputer = MissForest()
imputer.fit_transform(df_temp, cat_vars=[0,1])
But I am getting the below error:
ValueError: could not convert string to float: 'Super built up Area'
Could you please let me know why this happens? Do we need to encode the categorical variables using one-hot encoding first?
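If I remember correctly, missingpy's MissForest expects the whole input matrix to already be numeric; cat_vars only tells it which numeric columns to treat as categorical, so the raw strings in area_type and location trigger the float conversion error. A sketch of ordinal-encoding those columns first (the tiny dataframe and the choice of OrdinalEncoder are my assumptions):
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from missingpy import MissForest

# Hypothetical stand-in for df_temp
df_temp = pd.DataFrame({'area_type': ['Super built up Area', 'Plot Area', 'Built-up Area', 'Plot Area'],
                        'location': ['Whitefield', 'Sarjapur', 'Whitefield', 'Hebbal'],
                        'bath': [2.0, np.nan, 3.0, 2.0],
                        'balcony': [1.0, 1.0, np.nan, 2.0]})

# Encode the string columns so every column is numeric
encoder = OrdinalEncoder()
df_temp[['area_type', 'location']] = encoder.fit_transform(df_temp[['area_type', 'location']])

# cat_vars now points at numeric columns that should be treated as categorical
imputer = MissForest()
imputed = imputer.fit_transform(df_temp, cat_vars=[0, 1])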

pd.concat turning categorical variables into object

I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes, with variables of one type (in this instance categorical) which get changed to objects when I concatenate them. The df is massive and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files, setting the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = [] #Create the list
counter = 1 #Start a counter as I am importing around 4000 files
for i in modis_csv_files: #Iterate over the files importing and cleaning
print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files))) #Produce a print statement describing progress
output = pd.read_csv(i) #Read the csv
output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category')) #Set the dtype for all the object variables that can be categories
output['acq_date'] = output['acq_date'].astype('datetime64[ns]') #Set the date variable
outputs.append(output) #Append to the list
counter += 1 #Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything changed. Perhaps one of the 4000 dfs contained something that meant it could not be changed to categorical, which caused the whole variable to shift back to object when concatenated. So I tried checking each df in the list to see whether either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list checking if dtype['variable'].name is categorical
if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (it seemed excessive for now, however).
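For reference, pd.concat only keeps a column categorical when every input frame has an identical set of categories; any mismatch silently upcasts the result to object. That would explain why instrument (presumably the same value in every file) survives while satellite and daynight do not. A minimal reproduction, plus the usual workaround of unioning the categories before concatenating (data hypothetical):
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.DataFrame({'daynight': pd.Categorical(['D'])})       # categories: ['D']
b = pd.DataFrame({'daynight': pd.Categorical(['D', 'N'])})  # categories: ['D', 'N']
print(pd.concat([a, b]).dtypes)  # daynight comes back as object

# Workaround: give every frame the same category set before concatenating
union = union_categoricals([a['daynight'], b['daynight']])
for df in (a, b):
    df['daynight'] = df['daynight'].cat.set_categories(union.categories)
print(pd.concat([a, b]).dtypes)  # daynight stays category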

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try: df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns] columns, which leads to that rather cryptic error message. E.g.
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill
(bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping).
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
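An alternative that sidesteps the index-alignment quirks of assigning an apply result back, using groupby(...).transform instead (transform is my substitution; this assumes the lon column exists, as in the original call):
# transform returns a result aligned to the original index, column by column
df['data'] = df.groupby(['lat', 'lon'])['data'].transform(lambda s: s.interpolate(method='linear'))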
This error happens because one of the columns you are interpolating is of object data type. Interpolating works only for numerical data types such as integer or float.
If you need interpolation for an object or categorical column, first convert it to a numerical data type by encoding it. The following piece of code should resolve the problem:
from sklearn.preprocessing import LabelEncoder
notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type; it should be int. If it is categorical, then change its type to int using the following code:
df['notes'] = df['notes'].astype('int32')
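A pandas-only equivalent of the encoding step, if pulling in sklearn just for this feels heavy (pd.factorize is my substitution; it maps None/NaN to -1):
# factorize returns integer codes plus the array of unique labels
df['notes'], notes_labels = pd.factorize(df['notes'])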

merging two pandas data frames with modin.pandas gives ValueError

In an attempt to make my pandas code faster I installed modin and tried to use it. A merge of two data frames that had previously worked gave me the following error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.frame.DataFrame'>
Here is the info of both data frames:
printing event_df.info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980101 entries, 0 to 1980100
Data columns (total 5 columns):
other_id object
id object
category object
description object
date datetime64[ns]
dtypes: datetime64[ns](1), object(4)
memory usage: 75.5+ MB
printing other_df info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752438 entries, 0 to 752437
Data columns (total 4 columns):
id 752438 non-null object
other_id 752438 non-null object
Value 752438 non-null object
Unit 752438 non-null object
dtypes: object(4)
memory usage: 23.0+ MB
Here are some rows from event_df:
other_id id category description date
08E5A97350FC8B00092F 1 some_string some_string 2019-04-09
17B71019E148415D 4 some_string some_string 2019-11-08
17B71019E148415D360 7 some_string some_string 2019-11-08
and here are 3 rows from other_df:
id other_id Value Unit
a01 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a02 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
a03 BE4F15A3AE8A508ACB45F0FC8CDC173D1628D283 3 some_string
I tried installing the version cited in this question Join two modin.pandas.DataFrame(s), but it didn't help.
Here's the line of code throwing the error:
joint_dataframe2 = pd.merge(event_df,other_df, on = ["id","other_id"])
It seems there is some problem with modin's merge functionality. Is there any workaround such as using pandas for the merge and using modin for a groupby.transform()? I tried overwriting the pandas import after the merge with import modin.pandas, but got an error saying pandas was referenced before assignment. Has anyone come across this problem and if so, is there a solution?
Your error reads like you were merging an instance of modin.pandas.dataframe.DataFrame with an instance of pandas.core.frame.DataFrame, which is not allowed.
If that's indeed the case, you could turn the pandas DataFrame into a modin DataFrame first; then you should be able to merge them, I believe.
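A sketch of that conversion, with tiny hypothetical stand-ins for the two frames from the question:
import pandas
import modin.pandas as mpd

# Hypothetical data; in the question event_df is the modin frame, other_df the pandas one
event_df = mpd.DataFrame({'id': ['a01'], 'other_id': ['x'], 'category': ['c']})
other_df = pandas.DataFrame({'id': ['a01'], 'other_id': ['x'], 'Value': [3]})

# Wrap the plain pandas frame in a modin DataFrame, then merge modin-to-modin
other_df_modin = mpd.DataFrame(other_df)
joint_dataframe2 = mpd.merge(event_df, other_df_modin, on=['id', 'other_id'])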

Pandas df.head() Error

I'm having a basic problem with df.head(). When the function is executed, it usually displays a nicely formatted HTML table of the top 5 rows, but now it only outputs a summary of the dataframe, like below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
survived 5 non-null values
pclass 5 non-null values
name 5 non-null values
sex 5 non-null values
age 5 non-null values
sibsp 5 non-null values
parch 5 non-null values
fare 5 non-null values
embarked 5 non-null values
dtypes: float64(2), int64(4), object(3)
After looking at this thread I tried
pd.util.terminal.get_terminal_size()
and received the expected output (80, 25). Manually setting the print options with
pd.set_printoptions(max_columns=10)
yields the same summary output as above.
This was confirmed after diving into the documentation here and using
get_option("display.max_rows")
get_option("display.max_columns")
which returned the correct defaults of 60 rows and 10 columns.
I've never had a problem with df.head() before, but now it's an issue in all of my IPython Notebooks.
I'm running pandas 0.11.0 and IPython 0.13.2 in Google Chrome.
In pandas 0.11.0, I think the minimum of display.height and max_rows (and of display.width and max_columns) is used, so you need to manually change those too.
I don't like this; I posted this GitHub issue about it previously.
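If that's what is happening, raising the paired options should force the full HTML repr (option names as they existed around pandas 0.11; display.height was removed in later versions):
import pandas as pd

# The effective cap is the minimum of each pair, so raise both members
pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 50)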
Try using the following to display the top 10 rows:
from IPython.display import HTML
HTML(users.head(10).to_html())
I think the pandas 0.11.0 head function is totally unintuitive; it should simply have remained as head() and you get your HTML.