How do I create a subset of the data that contains a random sample of 200 observations (the DataFrame is created from a CSV file)?
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
How do I determine the correlations between housing value (median_house_value) and the other variables, and display them in descending order?
df.corr() gives me all the correlations. How do I make it show only the ones for median_house_value?
For the sample,
df = df.sample(200)
For the correlation, just do
df.corr()['median_house_value'].sort_values(ascending=False)
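Putting both pieces together, here is a minimal end-to-end sketch (the filename housing.csv is an assumption; adjust it to your file):
import pandas as pd

# Load the CSV into a DataFrame (the filename here is hypothetical)
df = pd.read_csv('housing.csv')

# Take a random sample of 200 rows; random_state makes the sample reproducible
sample = df.sample(n=200, random_state=42)

# Correlate every numeric column with median_house_value, sorted descending.
# numeric_only=True (available since pandas 1.5) skips non-numeric columns
# such as ocean_proximity; older versions drop them silently.
print(sample.corr(numeric_only=True)['median_house_value'].sort_values(ascending=False))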
I would like to display all the information of my data frame, which contains more than 100 columns, with .info() from pandas, but it won't:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85529 entries, 0 to 85528
Columns: 110 entries, ID to TARGET
dtypes: float64(40), int64(19), object(51)
memory usage: 71.8+ MB
I would like it to display like this:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
But the problem seems to be the high number of columns in my data frame. I would like it to show all columns, including their non-null counts.
You can pass the optional arguments verbose=True and show_counts=True (null_counts=True, deprecated since pandas 1.2.0) to the .info() method to output information for all of the columns:
pandas >=1.2.0: data_train.info(verbose=True, show_counts=True)
pandas <1.2.0: data_train.info(verbose=True, null_counts=True)
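For example, a quick sketch on a synthetic wide frame (the 120 columns here are invented just to cross the verbosity threshold):
import numpy as np
import pandas as pd

# A made-up frame with more columns than pandas prints by default
wide = pd.DataFrame(np.random.rand(100, 120),
                    columns=[f'col_{i}' for i in range(120)])

wide.info()                                # summarized: the column list is truncated
wide.info(verbose=True, show_counts=True)  # full per-column dtypes and non-null counts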
I am seeing some strange behavior when trying to use pd.concat. I have a list of DataFrames with columns of one dtype (in this instance categorical) that get changed to object when I concatenate them. The resulting df is massive, and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files, setting the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = [] #Create the list
counter = 1 #Start a counter as I am importing around 4000 files
for i in modis_csv_files: #Iterate over the files importing and cleaning
print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files))) #Produce a print statement describing progress
output = pd.read_csv(i) #Read the csv
output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category')) #Set the dtype for all the object variables that can be categories
output['acq_date'] = output['acq_date'].astype('datetime64[ns]') #Set the date variable
outputs.append(output) #Append to the list
counter += 1 #Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything was converted as intended. Perhaps one of the 4000 dfs contained something that could not be changed to categorical, which caused the whole variable to shift back to object when concatenated. Try checking each df in the list to see if either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list checking if dtype['variable'].name is categorical
if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are object. Notably, this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (it seemed excessive for now, however).
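For what it's worth, pandas' documented behavior here is that concatenating categorical columns preserves the category dtype only when every input has an identical set of categories; if the category sets differ (as they easily can when each scraped CSV covers a different country or year), the result silently falls back to object. If that is the cause, known workarounds include pandas.api.types.union_categoricals or recasting every frame to a shared CategoricalDtype before concatenating. A minimal sketch of the behavior (the satellite values are just examples):
import pandas as pd

a = pd.DataFrame({'satellite': pd.Categorical(['Terra'])})
b = pd.DataFrame({'satellite': pd.Categorical(['Aqua'])})

# Different category sets -> the concatenated column falls back to object
print(pd.concat([a, b])['satellite'].dtype)    # object

# Identical category sets -> category survives (as instrument did above)
shared = pd.CategoricalDtype(['Terra', 'Aqua'])
a2 = a.astype({'satellite': shared})
b2 = b.astype({'satellite': shared})
print(pd.concat([a2, b2])['satellite'].dtype)  # category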
I created a correlation matrix using pandas.DataFrame.corr(method='spearman') and got this result: the RAIN column does not contain any correlations between it and the other columns.
My question is - why are the correlations between RAIN and the other columns blank?
My dataset contains the following columns with their respective datatypes -
PM2.5 float64
PM10 float64
SO2 float64
NO2 float64
CO float64
O3 float64
TEMP float64
PRES float64
DEWP float64
RAIN float64
WSPM float64
dtype: object
A first debugging step would be to check with df.isna().all() whether the RAIN column is all NaN.
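A minimal sketch of that check (the toy frame is invented). Note that an all-NaN column is not literally removed: it stays in the matrix with NaN in every cell, which renders as blanks in some displays. A constant (zero-variance) column produces the same symptom, since Spearman correlation is undefined for it:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TEMP': [10.0, 12.0, 15.0, 9.0],
    'PRES': [1012.0, 1009.0, 1001.0, 1015.0],
    'RAIN': [np.nan, np.nan, np.nan, np.nan],   # all missing
})

print(df.isna().all())                 # RAIN -> True
print(df.corr(method='spearman'))      # the RAIN row/column is all NaN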
I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of spanning the minimum and maximum values of the M and F columns.
To verify that it's not my environment, I created a simple dataframe with just a few values and plotted a line graph for it, which works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git#github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
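For anyone trying to reproduce this, here is a minimal self-contained version of the pivot-and-plot step (the birth figures are invented; year is cast to the nullable Int64 dtype to match the .info() output above, since that is the one unusual dtype in the frame):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'name':   ['Mary', 'John', 'Mary', 'John'],
    'sex':    ['F', 'M', 'F', 'M'],
    'births': [7065, 9655, 6919, 9102],   # made-up counts
    'year':   [1880, 1880, 1881, 1881],
})
df['year'] = df['year'].astype('Int64')   # nullable integer, as in the question

new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc='sum')
new_df.plot()   # the question reports the y-axis showing 0 to 2.0 here
plt.show()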
I am using two different data sets (linked below), both of which record geographic location. I am trying to drop all observations where State is a territory or other non-State (excluding DC). The 'State' variable/column is a non-null object in both dataframes. I include .info() on them to show data types and the number of observations.
Earlier in my code I use .isin() for this purpose for a different dataframe and it works:
zipcodes = pd.read_csv('zipcode.csv')
zipcodes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42522 entries, 0 to 42521
Data columns (total 12 columns):
Zipcode 42522 non-null int64
ZipCodeType 42522 non-null object
City 42522 non-null object
State 42522 non-null object
LocationType 42522 non-null object
Lat 41874 non-null float64
Long 41874 non-null float64
Location 42521 non-null object
Decommisioned 42522 non-null bool
TaxReturnsFiled 28879 non-null float64
EstimatedPopulation 28879 non-null float64
TotalWages 28844 non-null float64
dtypes: bool(1), float64(5), int64(1), object(5)
memory usage: 2.8+ MB
drops =['PR','AP','AA','VI','GU','FM','MP','MH','PW','AS','AE']
zipcodes = zipcodes[~zipcodes['State'].isin(drops)]
zipcodes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 41656 entries, 12 to 42521
Data columns (total 12 columns):
Zipcode 41656 non-null int64
ZipCodeType 41656 non-null object
City 41656 non-null object
State 41656 non-null object
LocationType 41656 non-null object
Lat 41656 non-null float64
Long 41656 non-null float64
Location 41656 non-null object
Decommisioned 41656 non-null bool
TaxReturnsFiled 28879 non-null float64
EstimatedPopulation 28879 non-null float64
TotalWages 28844 non-null float64
dtypes: bool(1), float64(5), int64(1), object(5)
memory usage: 3.1+ MB
The observations for which State is in the list drops have now been dropped.
However, when I try to do the same with another data set, the same .isin() code does not drop the observations for which State is in the list drops. (I have to create the State variable by splitting an included variable and adding it as a column, which I suspect may be causing my issue, but the resulting variable/column is still a non-null object.)
cty_nbrs = pd.read_excel('county_adjacency.xlsx')
cty_nbrs.fillna(method='ffill', inplace=True)
cty_nbrs = cty_nbrs.join(pd.DataFrame({'State': cty_nbrs.County.str.split(",").str.get(1)}))
cty_nbrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22200 entries, 0 to 22199
Data columns (total 5 columns):
County 22200 non-null object
ctyfips 22200 non-null int32
Neighbours 22200 non-null object
nbrfips 22200 non-null int64
State 22200 non-null object
dtypes: int32(1), int64(1), object(3)
memory usage: 520.4+ KB
cty_nbrs= cty_nbrs[~cty_nbrs['State'].isin(drops)]
cty_nbrs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22200 entries, 0 to 22199
Data columns (total 5 columns):
County 22200 non-null object
ctyfips 22200 non-null int32
Neighbours 22200 non-null object
nbrfips 22200 non-null int64
State 22200 non-null object
dtypes: int32(1), int64(1), object(3)
memory usage: 520.4+ KB
The offending observations have not been dropped here. In case this is not evident:
cty_nbrs['State'].value_counts().tail()
AS 7
MP 6
DC 6
VI 5
GU 1
Name: State, dtype: int64
Those values of State are all elements of the list drops, which were dropped earlier from zipcodes using the same code. What am I missing here?
zipcode.csv
county_adjacency.csv
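One thing worth checking (an assumption, not something visible in the output above): County.str.split(",").str.get(1) on values like 'Autauga County, AL' keeps the space after the comma, so State holds ' AL' rather than 'AL' and never matches the entries in drops. A quick diagnostic and fix, reusing the cty_nbrs frame from above:
# repr() exposes any leading whitespace that value_counts() hides
print(cty_nbrs['State'].head().map(repr))

# Strip the whitespace before comparing against the drop list
cty_nbrs['State'] = cty_nbrs['State'].str.strip()
cty_nbrs = cty_nbrs[~cty_nbrs['State'].isin(drops)]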