I created a correlation matrix using pandas.DataFrame.corr(method='spearman') and got this result. The RAIN column shows no correlation values with any of the other columns.
My question is - why are the correlations between RAIN and the other columns blank?
My dataset contains the following columns with their respective datatypes -
PM2.5 float64
PM10 float64
SO2 float64
NO2 float64
CO float64
O3 float64
TEMP float64
PRES float64
DEWP float64
RAIN float64
WSPM float64
dtype: object
A first debugging step would be to check with df.isna().all() whether the RAIN column is all NaN; a column with no valid values (or no variance) has nothing to rank, so all its pairwise correlations come out as NaN.
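To illustrate: an all-NaN (or constant) column has no overlapping observations and no variance to rank, so every Spearman correlation involving it is NaN. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PM25': [10.0, 12.0, 9.5, 11.0],
    'RAIN': [np.nan, np.nan, np.nan, np.nan],  # all-NaN column
})

print(df.isna().all())            # RAIN -> True
corr = df.corr(method='spearman')
print(corr.loc['PM25', 'RAIN'])   # NaN: no valid pairs to correlate
```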
I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes whose variables I have set to one dtype (in this instance categorical), but they get changed to object when I concatenate them. The df is massive and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files, setting the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = [] #Create the list
counter = 1 #Start a counter as I am importing around 4000 files
for i in modis_csv_files: #Iterate over the files, importing and cleaning
    print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files))) #Print progress
    output = pd.read_csv(i) #Read the csv
    output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category')) #Set the dtype for the object variables that can be categories
    output['acq_date'] = output['acq_date'].astype('datetime64[ns]') #Set the date variable
    outputs.append(output) #Append to the list
    counter += 1 #Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check whether there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything was changed correctly. Perhaps one of the ~4000 dfs contained something that meant it could not be changed to categorical, causing the whole variable to shift back to object when concatenated. Try checking each df in the list to see whether either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list, checking if dtypes['variable'].name is categorical
    if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
        error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (seemed excessive for now however).
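This matches pd.concat's documented behavior: the result keeps the category dtype only when every input frame has exactly the same set of categories; otherwise it silently falls back to object. That would explain why instrument (presumably the same single value in every file) survives while satellite and daynight (whose values differ from file to file) do not. A minimal sketch, with made-up category values; one workaround is declaring a shared CategoricalDtype before concatenating:

```python
import pandas as pd

a = pd.DataFrame({'satellite': pd.Categorical(['Terra'])})
b = pd.DataFrame({'satellite': pd.Categorical(['Aqua'])})

# Different category sets -> the concatenated column falls back to object
print(pd.concat([a, b])['satellite'].dtype)  # object

# Declaring one shared dtype up front keeps the column categorical
sat_dtype = pd.CategoricalDtype(['Terra', 'Aqua'])
a2 = a.astype({'satellite': sat_dtype})
b2 = b.astype({'satellite': sat_dtype})
print(pd.concat([a2, b2])['satellite'].dtype)  # category
```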
I've looked everywhere and tried .loc, .apply and lambdas, but I still cannot figure this out.
I have the UCI congressional vote dataset in a pandas dataframe and some votes are missing for votes 1 to 16 for each Democrat or Republican Congressperson.
So I inserted 16 columns after each vote column called abs.
I want each abs column to be 1 if the corresponding vote column is NaN.
None of those above methods I read on this site worked for me.
So I have this snippet below that also does not work, but it shows my current attempt using basic iterative Python syntax.
for i in range(16):
    for j in range(len(cvotes['v1'])):
        if cvotes['v{}'.format(i+1)][j] == np.nan:
            cvotes['abs{}'.format(i+1)][j] = 1
        else:
            cvotes['abs{}'.format(i+1)][j] = 0
Any suggestions?
The above currently gives me 1 for abs when the vote value is NaN or 1.
edit:
I saw the given answer so tried this with just one column
cols = ['v1']
for col in cols:
    cvotes = cvotes.join(cvotes[col].add_prefix('abs').isna().astype(int))
but it's giving me an error:
ValueError: columns overlap but no suffix specified: Index(['v1'], dtype='object')
My dtypes are:
party object
v1 float64
v2 float64
v3 float64
v4 float64
v5 float64
v6 float64
v7 float64
v8 float64
v9 float64
v10 float64
v11 float64
v12 float64
v13 float64
v14 float64
v15 float64
v16 float64
abs1 int64
abs2 int64
abs3 int64
abs4 int64
abs5 int64
abs6 int64
abs7 int64
abs8 int64
abs9 int64
abs10 int64
abs11 int64
abs12 int64
abs13 int64
abs14 int64
abs15 int64
abs16 int64
dtype: object
Let us just do join with add_prefix
col = ['v{}'.format(i) for i in range(1, 17)] #the 16 vote columns v1..v16
s = pd.DataFrame(df[col].values.tolist(), index=df.index)
s.columns = s.columns + 1
df = df.join(s.add_prefix('abs').isna().astype(int))
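The same abs columns can also be built without the values.tolist() round trip, by calling isna() on the vote columns directly and renaming; a sketch with two toy vote columns standing in for v1..v16:

```python
import numpy as np
import pandas as pd

# Toy frame with two vote columns standing in for v1..v16
cvotes = pd.DataFrame({'v1': [1.0, np.nan], 'v2': [np.nan, 0.0]})

cols = ['v{}'.format(i) for i in range(1, 3)]
abs_block = cvotes[cols].isna().astype(int)                 # 1 where the vote is NaN
abs_block.columns = ['abs{}'.format(i) for i in range(1, 3)]
cvotes = cvotes.join(abs_block)
print(cvotes)
```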
How do I create a subset of the data containing a random sample of 200 observations? (The DataFrame was created from a CSV file.)
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
How do I determine the correlations between housing values (median_house_value) and the other variables, and display them in descending order?
df.corr() gives me all the correlations. How do I make it show only the median house value?
For the sample,
df = df.sample(200)
For the correlation, just do
df.corr()['median_house_value'].sort_values(ascending=False)
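If reproducibility matters, sample also accepts a random_state seed; a sketch with a toy frame standing in for the housing data:

```python
import pandas as pd

# Toy frame standing in for the housing data
df = pd.DataFrame({'median_house_value': [float(i) for i in range(1000)],
                   'median_income': [float(i * 2) for i in range(1000)]})

sample = df.sample(200, random_state=42)  # random_state makes the draw reproducible
corrs = sample.corr()['median_house_value'].sort_values(ascending=False)
print(corrs)
```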
I am trying to merge two dataframes (D1 & R1) on two columns (Date & Symbol) but I'm receiving this error "You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat".
I've been using pd.merge and I've tried different dtypes. I don't want to concatenate these because I just want to add D1 to the right side of R1.
D2 = pd.merge(D1, R1, on=['Date', 'Symbol'])
D1.dtypes
Date object
Symbol object
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
pct_change_1D float64
Symbol_above object
NE bool
R1.dtypes
gvkey int64
datadate int64
fyearq int64
fqtr int64
indfmt object
consol object
popsrc object
datafmt object
tic object
curcdq object
datacqtr object
datafqtr object
rdq int64
costat object
ipodate float64
Report_Today int64
Symbol object
Date int64
Ideally, the columns not in the index of R1 (gvkey - Report_Today) will be on the right side of the columns in D1.
Any help is appreciated. Thanks.
From your description of the DataFrames we can see that:
in D1 the Date column has dtype object, while
in R1 the Date column has dtype int64.
Make the types of these columns the same and the merge will work.
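For instance, if R1 stores Date as a YYYYMMDD integer (an assumption; the actual encoding isn't shown), both columns can be converted to datetime before merging. Toy one-row frames for illustration:

```python
import pandas as pd

D1 = pd.DataFrame({'Date': ['2019-01-31'], 'Symbol': ['AAPL'], 'Close': [166.4]})
R1 = pd.DataFrame({'Date': [20190131], 'Symbol': ['AAPL'], 'gvkey': [1690]})

# Hypothetical: R1's Date is a YYYYMMDD integer, D1's a date string
D1['Date'] = pd.to_datetime(D1['Date'])
R1['Date'] = pd.to_datetime(R1['Date'].astype(str), format='%Y%m%d')

# Now both key columns are datetime64[ns] and the merge succeeds
D2 = pd.merge(D1, R1, on=['Date', 'Symbol'], how='left')
print(D2)
```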
I have a dataframe with four columns, with dtypes set up like this (hat tip to ryanjdillon!)
dtypes = np.dtype([
('size', int),
('sum', float),
('mean', float),
('std', float),
])
data = np.empty(0, dtype=dtypes)
df = pd.DataFrame(data)
At this stage, df.dtypes looks like this:
size int64
sum float64
mean float64
std float64
dtype: object
Great so far. But the first time I assign an int value to the 'size' column, e.g.
df.loc['foo', 'size'] = 1
it flips the dtype of the column to float64, and the value is cast to 1.0 in this case.
size float64
sum float64
mean float64
std float64
dtype: object
Wazzup here?
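This appears to be pandas' upcasting on enlargement: setting a single cell via .loc adds a whole new row, and the intermediate missing values force the int column to float64. One workaround (a sketch, not the only option) is to build each new row as its own correctly-typed one-row frame and concat:

```python
import numpy as np
import pandas as pd

dtypes = np.dtype([('size', int), ('sum', float), ('mean', float), ('std', float)])
df = pd.DataFrame(np.empty(0, dtype=dtypes))

# Build the new row as a fully-typed frame instead of enlarging cell by cell,
# so there is never a partially-missing row to upcast around
row = pd.DataFrame({'size': [1], 'sum': [0.0], 'mean': [0.0], 'std': [0.0]},
                   index=['foo'])
df = pd.concat([df, row])
print(df.dtypes)  # size stays int64
```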