This is my code, which has a txt file loaded into the DataFrame new:
import pandas as pd
desired_width = 320
pd.set_option('display.width', desired_width)
from datetime import datetime
print(new.head(5))
new.info()
and this is the result:
Date Time Open
0 2013/1/4 07:00:00.0 7847.5
1 2013/1/4 07:00:00.1 7847.5
2 2013/1/4 07:00:00.2 7847.5
3 2013/1/4 07:00:00.3 7847.5
4 2013/1/4 07:00:00.4 7847.5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 3 columns):
Date 17 non-null object
Time 17 non-null object
Open 17 non-null float64
dtypes: float64(1), object(2)
memory usage: 488.0+ bytes
I am failing to make Date+Time the index, as both Date and Time are objects. I also need to keep the time with its milliseconds.
My attempt with:
pd.to_datetime(new.Date + ' ' + new.Time)
caused:
AttributeError: 'DataFrame' object has no attribute 'Time'
Please advise how to create the combined Date+Time index so that it has a proper dtype (not object), as the float64 column does.
Thanks
df.columns = df.columns.str.strip() strips the surrounding spaces from the header names, after which pd.to_datetime() works with no problem.
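A minimal sketch of the whole fix, assuming the AttributeError comes from header names that carry stray spaces (for example ' Time' instead of 'Time'). The inline sample data, the explicit format string and the index name "Timestamp" are illustrative assumptions, not taken from the original file:

```python
import io

import pandas as pd

# Hypothetical sample mimicking a file whose header has spaces after the commas
raw = io.StringIO(
    "Date, Time, Open\n"
    "2013/1/4,07:00:00.0,7847.5\n"
    "2013/1/4,07:00:00.1,7847.5\n"
    "2013/1/4,07:00:00.2,7847.5\n"
)
new = pd.read_csv(raw)

# Column names are 'Date', ' Time', ' Open' at this point, so strip them
new.columns = new.columns.str.strip()

# Combine both text columns and parse them; %f keeps the fractional seconds
stamp = pd.to_datetime(new["Date"] + " " + new["Time"], format="%Y/%m/%d %H:%M:%S.%f")

# Use the parsed timestamps as the index and drop the original text columns
new = new.set_index(stamp).drop(columns=["Date", "Time"])
new.index.name = "Timestamp"

print(new)
```

The result has a DatetimeIndex rather than two object columns, so time-based selection and resampling become available.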
I am seeing some strange behavior when trying to use pd.concat. I have a list of dataframes, with variables of one type (in this instance categorical) which get changed to objects when I concatenate them. The df is massive and this makes it even larger - too large to deal with.
Here is some sample code:
As context, I have scraped a website for a bunch of CSV files. I am reading, cleaning and setting the dtypes of all of them before appending them to a list. I then concatenate all the dfs in that list (but the dtypes of some variables get changed).
#Import modules
import glob
import pandas as pd
#Code to identify and download all the csvs
###
#code not included - seemed excessive
###
#Identify all the downloaded csvs
modis_csv_files = glob.glob('/path/to/files/**/*.csv', recursive = True)
#Examine the dtypes of one of these files
pd.read_csv(modis_csv_files[0]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null object
6 acq_time 6 non-null int64
7 satellite 6 non-null object
8 instrument 6 non-null object
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null object
14 type 6 non-null int64
dtypes: float64(8), int64(3), object(4)
memory usage: 848.0+ bytes
We can see a number of object dtypes in there that will make the final df larger. So now I try to read all the files and set the dtypes as I go.
#Read the CSVs, clean them and append them to a list
outputs = [] #Create the list
counter = 1 #Start a counter as I am importing around 4000 files
for i in modis_csv_files: #Iterate over the files importing and cleaning
    print('Reading csv no. {} of {}'.format(counter, len(modis_csv_files))) #Produce a print statement describing progress
    output = pd.read_csv(i) #Read the csv
    output[['daynight', 'instrument', 'satellite']] = output[['daynight', 'instrument', 'satellite']].apply(lambda x: x.astype('category')) #Set the dtype for all the object variables that can be categories
    output['acq_date'] = output['acq_date'].astype('datetime64[ns]') #Set the date variable
    outputs.append(output) #Append to the list
    counter += 1 #Increment the counter
#Concatenate all the files
final_modis = pd.concat(outputs)
#Look at the dtypes
final_modis.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 85604183 entries, 0 to 24350
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 latitude float64
1 longitude float64
2 brightness float64
3 scan float64
4 track float64
5 acq_date datetime64[ns]
6 acq_time int64
7 satellite object
8 instrument category
9 confidence int64
10 version float64
11 bright_t31 float64
12 frp float64
13 daynight object
14 type int64
dtypes: category(1), datetime64[ns](1), float64(8), int64(3), object(2)
memory usage: 9.6+ GB
Notice that satellite and daynight still show as object (though notably instrument stays as category). So I check if there is a problem with my cleaning code.
outputs[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 6 non-null float64
1 longitude 6 non-null float64
2 brightness 6 non-null float64
3 scan 6 non-null float64
4 track 6 non-null float64
5 acq_date 6 non-null datetime64[ns]
6 acq_time 6 non-null int64
7 satellite 6 non-null category
8 instrument 6 non-null category
9 confidence 6 non-null int64
10 version 6 non-null float64
11 bright_t31 6 non-null float64
12 frp 6 non-null float64
13 daynight 6 non-null category
14 type 6 non-null int64
dtypes: category(3), datetime64[ns](1), float64(8), int64(3)
memory usage: 986.0 bytes
Looks like everything changed. Perhaps one of the 4000 dfs contained something that meant they could not be changed to categorical, which caused the whole variable to shift back to object when concatenated. I tried checking each df in the list to see if either satellite or daynight is not category:
error_output = [] #create an empty list
for i in range(len(outputs)): #iterate over the list checking if dtype['variable'].name is categorical
    if outputs[i].dtypes['satellite'].name != 'category' or outputs[i].dtypes['daynight'].name != 'category':
        error_output.append(outputs[i]) #if not, append
#Check what is in the list
len(error_output)
0
So there are no dataframes in the list for which either of these variables is not categorical, but when I concatenate them the resulting variables are objects. Notably this outcome does not apply to all categorical variables, as instrument doesn't get changed back. What is going on?
Note: I can't change the dtype after pd.concat, because I run out of memory (I know there are some other solutions to this, but I am still intrigued by the behavior of pd.concat).
FWIW I am scraping data from the MODIS satellite: https://firms.modaps.eosdis.nasa.gov/download/ (yearly summary by country). I can share all the scraping code as well if that would be helpful (seemed excessive for now however).
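A likely explanation, not confirmed against the actual files: pd.concat keeps a categorical column only when every frame has exactly the same categories for it, and otherwise falls back to a non-categorical dtype (object in the output above). instrument presumably holds the same single value in every file, while satellite and daynight may have different sets of values from file to file. The sketch below reproduces this with two tiny made-up frames; the values 'Terra', 'Aqua' and 'MODIS' are assumptions chosen for illustration. It then gives every frame the same categories before concatenating, using union_categoricals from pandas.api.types:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two hypothetical per-file frames; the second file only ever saw one satellite
a = pd.DataFrame({
    "satellite": pd.Categorical(["Terra", "Aqua"]),
    "instrument": pd.Categorical(["MODIS", "MODIS"]),
})
b = pd.DataFrame({
    "satellite": pd.Categorical(["Terra", "Terra"]),
    "instrument": pd.Categorical(["MODIS", "MODIS"]),
})

# Category sets differ for 'satellite', so that column loses its category dtype;
# 'instrument' has identical categories in both frames and stays categorical
mixed = pd.concat([a, b], ignore_index=True)
print(mixed.dtypes)

# Give every frame the same categories for each column before concatenating
frames = [a, b]
for col in ["satellite", "instrument"]:
    all_categories = union_categoricals([f[col] for f in frames]).categories
    for f in frames:
        f[col] = f[col].cat.set_categories(all_categories)

fixed = pd.concat(frames, ignore_index=True)
print(fixed.dtypes)
```

Because the categories are aligned on the small per-file frames, the large concatenated frame never has to hold these columns as objects.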
I have a dataframe (see link for image) and I've listed the info on the data frame. I use the pivot_table function to sum the total number of births for each year. The issue is that when I try to plot the dataframe, the y-axis values range from 0 to 2.0 instead of the minimum and maximum values from the M and F columns.
To verify that it's not my environment, I created a simple dataframe with just a few values and plotted the line graph for that dataframe, and it works as expected. Does anyone know why this is happening? Attempting to set the values using ylim or yticks is not working. Ultimately, I will have to try other graphing utilities like matplotlib, but I'm curious as to why it's not working for such a simple dataframe and dataset.
Visit my github page for a working example <git@github.com:stevencorrea-chicago/stackoverflow_question.git>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1690784 non-null object
1 sex 1690784 non-null object
2 births 1690784 non-null int64
3 year 1690784 non-null Int64
dtypes: Int64(1), int64(1), object(2)
memory usage: 53.2+ MB
new_df = df.pivot_table(values='births', index='year', columns='sex', aggfunc=sum)
new_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 131 entries, 1880 to 2010
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 F 131 non-null int64
1 M 131 non-null int64
dtypes: int64(2)
memory usage: 3.1+ KB
I'm trying to generate 4 plots from a DataFrame using Seaborn
Date A B C D
2019-04-05 330.665 161.975 168.69 0
2019-04-06 322.782 150.243 172.539 0
2019-04-07 322.782 150.243 172.539 0
2019-04-08 295.918 127.801 168.117 0
2019-04-09 282.674 126.894 155.78 0
2019-04-10 293.818 133.413 160.405 0
I have cast the dates using pd.to_datetime and the numbers using pd.to_numeric. Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 460 to 465
Data columns (total 5 columns):
Date 6 non-null datetime64[ns]
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 6 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 288.0 bytes
I can do a wide column plot by just calling .plot() on df.
However:
1. The legend of the plot is covering the plot itself.
2. I would instead like to have 4 separate plots in 1 diagram and have tried using lmplot to achieve this.
3. I would like to add labels to the plot like so: (image: Plot with image)
I first melted the data:
df=pd.melt(df,id_vars='Date', var_name='Var', value_name='Unit')
And then tried lmplot
sns.lmplot(x = df['Date'], y='Unit', col='Var', data=df)
However, I get the traceback:
TypeError: Invalid comparison between dtype=datetime64[ns] and str
I have also tried df.set_index('Date') and replotting with x=df.index, and that gave me the same error.
The data can be plotted using Google Sheets but I am trying to automate a workflow where the chart can be generated and sent via Slack to selected recipients.
I hope I have expressed myself clearly enough as I am rather new to Python and Seaborn and hope to get some help from the experts here.
Regarding the legend you can just use .legend(loc="upper left", bbox_to_anchor=(1,1)) as in this example
%matplotlib inline
import pandas as pd
import numpy as np
data = np.random.rand(10,4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])
df.plot()\
.legend(loc="upper left", bbox_to_anchor=(1,1));
For the second point, if I understand correctly, you can start from
df.plot(subplots=True, layout=(2,2));
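As a rough sketch of how that could look on data shaped like the question's, the example below puts Date on the x-axis, draws A to D in a 2x2 grid, adds a title and y-label per panel, and renders the figure to an in-memory PNG (which could then be uploaded elsewhere). The three sample rows, the "Unit" label, the figure size and the Agg backend are assumptions for illustration:

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display

import pandas as pd

# A few rows in the shape of the question's data
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-04-05", "2019-04-06", "2019-04-07"]),
    "A": [330.665, 322.782, 322.782],
    "B": [161.975, 150.243, 150.243],
    "C": [168.69, 172.539, 172.539],
    "D": [0.0, 0.0, 0.0],
})

# One panel per column; with Date as the index it becomes the shared x-axis
axes = df.set_index("Date").plot(
    subplots=True, layout=(2, 2), figsize=(10, 6), sharex=True, legend=False
)

# Label each panel instead of relying on a legend that may cover the lines
for ax, name in zip(axes.ravel(), ["A", "B", "C", "D"]):
    ax.set_title(name)
    ax.set_ylabel("Unit")

# Render to an in-memory PNG rather than a window
buf = io.BytesIO()
axes[0, 0].get_figure().savefig(buf, format="png", bbox_inches="tight")
```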
I can't seem to select rows using a datetime index with pandas. The info on my dataframe shows that the index is a DatetimeIndex:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 223 entries, 2013-10-29 to 2017-05-29
Data columns (total 6 columns):
Unnamed: 0 223 non-null float64
company 223 non-null object
date 223 non-null object
date_conv 223 non-null object
text 223 non-null object
title 223 non-null object
dtypes: float64(1), object(5)
memory usage: 17.2+ KB
But when I do this it raises a KeyError:
df['2017-02-04']
Should the index be named "index" to make this work? Although my df is using a DatetimeIndex, the name of the index is not 'index', it's 'date_conv'.
In your example '2017-02-04' is a string.
You have to refer to the row by datetime:
import datetime
df.loc[datetime.datetime.strptime('2017-2-4', '%Y-%m-%d'),:]
You CAN use a string to address a row in a DataFrame, but you need to do it using the loc property:
df_row = df.loc['2017-02-04']
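A small sketch contrasting the three forms, on a made-up three-row frame (the dates and the "title" values are placeholders). In recent pandas versions df['2017-02-04'] looks the string up among the column labels, which is why it raises KeyError, while .loc selects by index label. The name of the index plays no role here:

```python
import datetime

import pandas as pd

# Hypothetical frame with a DatetimeIndex named like the one in the question
idx = pd.to_datetime(["2017-02-03", "2017-02-04", "2017-02-05"])
df = pd.DataFrame({"title": ["a", "b", "c"]}, index=idx)
df.index.name = "date_conv"

# Plain brackets with a string look for a column called '2017-02-04'
try:
    df["2017-02-04"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# .loc looks the label up in the index; a date string is accepted
row_by_string = df.loc["2017-02-04"]

# A datetime object works with .loc as well
row_by_datetime = df.loc[datetime.datetime(2017, 2, 4)]

print(raised_key_error)
print(row_by_string)
print(row_by_datetime)
```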
I have a Dataframe with dates in the even columns. The date format is yyyy.mm.dd hh:mm:ss and I want to convert it to yyyy-mm-dd.
I tried filtering the even columns and using dt.strftime like this:
even_cols = range(0, df.shape[1], 2)
df.iloc[:, even_cols] = df.iloc[:, even_cols].dt.strftime('%Y-%m-%d')
but I get this error:
"AttributeError: 'DataFrame' object has no attribute 'dt'"
Try this:
df=pd.DataFrame({'A':pd.date_range('2018-01-01', periods=10),'B':pd.date_range('2018-02-01', periods=10),
'C':pd.date_range('2018-03-01', periods=10),'D':pd.date_range('2018-04-01', periods=10)})
even_cols = [1,3]
df.iloc[:, even_cols] = df.iloc[:, even_cols].apply(lambda x: x.dt.strftime('%Y-%m-%d'))
Output df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
A 10 non-null datetime64[ns]
B 10 non-null object
C 10 non-null datetime64[ns]
D 10 non-null object
dtypes: datetime64[ns](2), object(2)
memory usage: 400.0+ bytes
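The .dt accessor exists only on a Series, which is why it has to be applied column by column. As a variation closer to the question, here is a sketch that assumes the even columns hold text in the yyyy.mm.dd hh:mm:ss format: each one is parsed first and then formatted as yyyy-mm-dd. The column names and values are made up, and the columns are assigned by name rather than through iloc so that they are replaced outright with the formatted strings:

```python
import pandas as pd

# Hypothetical frame: date strings in the even (0, 2, ...) columns
df = pd.DataFrame({
    "start": ["2018.01.01 10:30:00", "2018.01.02 11:00:00"],
    "x": [1, 2],
    "end": ["2018.02.01 08:15:00", "2018.02.02 09:45:00"],
    "y": [3, 4],
})

# Names of the columns at positions 0, 2, 4, ...
even_names = list(df.columns[::2])

# Parse each column with the known input format, then format as yyyy-mm-dd
df[even_names] = df[even_names].apply(
    lambda col: pd.to_datetime(col, format="%Y.%m.%d %H:%M:%S").dt.strftime("%Y-%m-%d")
)

print(df)
```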