How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series / date functionality documentation, which has some information on out-of-bounds spans, but from that information it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanosecond resolution, giving a maximum range of roughly 584 years), or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates earlier than the default pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
'date': ['1671-11-25', '1669-11-22', '1666-05-15','1673-01-18','1675-05-07','1677-02-08','1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})

You can create daily periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print(df)
     id        date
0   836  1671-11-25
1   655  1669-11-22
2   508  1666-05-15
3   793  1673-01-18
4   970  1675-05-07
5  1075  1677-02-08
6  1119  1678-02-08
7   969  1675-02-15
8  1166  1678-11-28
9   893  1673-12-23
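Once converted, the Period values behave much like dates. A small sketch continuing from the frame above (the 'year' column is just for illustration) shows that they sort chronologically and expose calendar components:
# Period objects are comparable, so sorting works, and they expose .year/.month/.day
df = df.sort_values('date')
df['year'] = df['date'].apply(lambda p: p.year)
print(df.head())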

Related

Seasonal Decomposition plots won't show despite pandas recognizing DateTime Index

I loaded the data as follows:
og_data=pd.read_excel(r'C:\Users\hp\Downloads\dataframe.xlsx',index_col='DateTime')
And it looks like this:
                     A
DateTime
2019-02-04 10:37:54  0
2019-02-04 10:47:54  1
2019-02-04 10:57:54  2
2019-02-04 11:07:54  3
2019-02-04 11:17:54  4
The problem is that I'm trying to set the data to a frequency, but there are NaN values that I have to drop, and even if I don't drop them, the frequency seems to be irregular. I've got pandas to recognize the DateTime column as the index:
og_data.index
DatetimeIndex([...], dtype='datetime64[ns]', name='DateTime', length=15536, freq=None)
but when I try doing this:
og_data.index.freq = '10T'
That should mean 10min, right?
But I get the following error instead:
ValueError: Inferred frequency None from passed values does not conform to passed frequency 10T
Even if I set the frequency to days:
og_data.index.freq = 'D'
I get a similar error.
The goal is to get seasonal decomposition plots, because I want to forecast the data. But I get the following error when I try to do so:
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(og_data['A'], model='add')
result.plot();
ValueError: You must specify a period or x must be a pandas object with a DatetimeIndex with a freq not set to None
Which makes sense: I can't set the datetime index to a specified frequency. I need it to be every 10 minutes. Please advise.
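One possible approach, sketched under the assumption that the timestamps are roughly 10 minutes apart and that small gaps can safely be interpolated: resample the series onto a strict 10-minute grid first, so the resulting index carries a fixed frequency and seasonal_decompose accepts it.
from statsmodels.tsa.seasonal import seasonal_decompose

# resample onto a regular 10-minute grid; mean() aggregates any duplicates,
# interpolate() fills the NaNs created at missing timestamps
og_regular = og_data['A'].resample('10T').mean().interpolate()

# the resampled index now has freq='10T', so this no longer raises
result = seasonal_decompose(og_regular, model='add')
result.plot();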

"None of [Float64Index([56.0, ..\n dtype='float64', length=1057499)] are in the [columns]" Pandas dataframe

Please excuse any obvious mistakes as I am new to Pandas and coding in general.
I am filtering the original dataframe and creating a copy with chosen columns. This is what my data frame looks like:
(dataframe filter routine):
df_new=df.filter(['date','location','value','lat_final','lon_final'], axis=1)
df_new = df_new.set_index('date')
print (df_new.head())
The new dataframe:
                           location  value  lat_final  lon_final
date
2015-06-30 09:40:00+05:30      XYZI   56.0    28.6508    77.3152
2015-06-30 11:00:00+05:30      MNOP   36.0    28.6683    77.1167
2015-06-30 17:10:00+05:30      QRST   71.0    28.6508    77.3152
2015-06-30 11:00:00+05:30      UVWX   98.0    28.6508    77.3152
2015-06-30 09:40:00+05:30      XXYZ   26.0    28.6683    77.1167
While trying to perform some operations on columns in this new dataframe, I am getting the error from the title. These are the operations I am performing:
(This step goes fine)
f = df_new[df_new['value'] >= 0]
f.drop(f[f['value'] > 1500].index, inplace=True)
f.drop(f[f['value'] < 2].index, inplace=True)
(The error crops up here):
#Filteration steps:
#Step1: grouping into 12h or n hour intervals:
diurnal = f[f['value']].resample('12h')
Where am I going wrong?
Any help will be much appreciated.
This: f[f['value']] will give you an error, because it tries to index the columns of f with the float values from the 'value' column; that is exactly where the "None of [Float64Index(...)] are in the [columns]" message comes from. If you want to resample the value column, you should select it properly, and also tell resample how you want to aggregate the values (sum, mean?). Something like this:
f['value'].resample('12h').sum()
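For instance, here is a small self-contained sketch with made-up data in the same shape as the question's frame (a 12-hour mean is assumed; swap in whatever aggregation is wanted):
import pandas as pd

# made-up timestamps and values mirroring the question's frame
idx = pd.to_datetime(['2015-06-30 09:40', '2015-06-30 11:00', '2015-06-30 17:10'])
f = pd.DataFrame({'value': [56.0, 36.0, 71.0]}, index=idx)

# select the column first, then resample and aggregate
print(f['value'].resample('12h').mean())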

How to show multiple timeseries plots using seaborn

I'm trying to generate 4 plots from a DataFrame using Seaborn
Date              A        B        C  D
2019-04-05  330.665  161.975   168.69  0
2019-04-06  322.782  150.243  172.539  0
2019-04-07  322.782  150.243  172.539  0
2019-04-08  295.918  127.801  168.117  0
2019-04-09  282.674  126.894   155.78  0
2019-04-10  293.818  133.413  160.405  0
I have cast the dates using pd.to_datetime and the numbers using pd.to_numeric. Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 460 to 465
Data columns (total 5 columns):
Date 6 non-null datetime64[ns]
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 6 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 288.0 bytes
I can do a wide-column plot by just calling .plot() on df. However:
- The legend of the plot covers the plot itself.
- I would instead like to have 4 separate plots in one diagram, and have tried using lmplot to achieve this.
I would also like to add labels to the plot like so: (example plot image omitted)
I first melted the data:
df=pd.melt(df,id_vars='Date', var_name='Var', value_name='Unit')
And then tried lmplot
sns.lmplot(x = df['Date'], y='Unit', col='Var', data=df)
However, I get the traceback:
TypeError: Invalid comparison between dtype=datetime64[ns] and str
I have also tried df.set_index('Date') and replotting using x=df.index, which gave me the same error.
The data can be plotted using Google Sheets but I am trying to automate a workflow where the chart can be generated and sent via Slack to selected recipients.
I hope I have expressed myself clearly enough as I am rather new to Python and Seaborn and hope to get some help from the experts here.
Regarding the legend, you can just use .legend(loc="upper left", bbox_to_anchor=(1,1)), as in this example:
%matplotlib inline
import pandas as pd
import numpy as np
data = np.random.rand(10,4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])
df.plot()\
.legend(loc="upper left", bbox_to_anchor=(1,1));
As for the second point, IIUC you can start from:
df.plot(subplots=True, layout=(2,2));
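If seaborn specifically is wanted, here is a hedged sketch using the melted frame from the question. Note that lmplot fits a regression and expects column names (strings), not Series, for x, which is what triggers the TypeError above; a line plot per variable is likely closer to the goal:
import seaborn as sns

# df is assumed to be the melted frame with columns Date, Var, Unit;
# relplot draws one line plot per variable, wrapped into a 2x2 grid
g = sns.relplot(data=df, x='Date', y='Unit', col='Var',
                col_wrap=2, kind='line')
g.set_xticklabels(rotation=45)  # keep the date labels readable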

Groupby month parameter in Multi-level Index in pandas

I have a large DataFrame which is structured like this: it has multiple stocks in level 0, and Date is level 1. The monthly data starts at 12/31/2004 and continues to 12/31/2017 (not shown).
               DAILY_RETURN
   Date
A  12/31/2004          NaN
   1/31/2005         -8.26
   2/28/2005          8.55
   3/31/2005          -7.5
   4/29/2005         -6.53
   5/31/2005         15.71
   6/30/2005         -4.12
   7/29/2005         13.99
   8/31/2005         22.56
   9/30/2005          1.83
   10/31/2005        -2.26
   11/30/2005         11.4
   12/30/2005        -6.65
   1/31/2006          1.86
   2/28/2006          6.16
   3/31/2006          4.31
What I want to do is group by the month and then count the number of POSITIVE returns in DAILY_RETURN by month (i.e. 01, then 02, 03, etc. from the Date part of the index). This code will give me the count, but only by index level 0:
df3.groupby(level=0)['DAILY_RETURN'].agg(['count'])
There are other questions out there, this one being the closest, but I cannot get the code to work. Can someone help out? Ultimately, what I want to do is group by stock and then by month, and FILTER to all stocks that have at least 70% positive returns by month. I can't seem to figure out how to get the positive returns from the dataframe either:
How to group pandas DataFrame entries by date in a non-unique column
Here it is for smaller data, using datetime:
import pandas as pd
from datetime import datetime

df = pd.DataFrame()
df['Date'] = ['12/31/2004', '1/31/2005', '12/31/2005', '2/28/2006', '2/28/2007']
df['DAILY_RETURN'] = [-8, 9, 5, 10, 14]

# keep only the positive returns
df = df[df.DAILY_RETURN > 0]

# parse each date string and keep just the month number
df['Date_obj'] = df['Date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y').month)

# count positive returns per calendar month
df.groupby('Date_obj').count()[['DAILY_RETURN']]
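For the MultiIndex frame from the question, a hedged sketch of the stock-and-month grouping plus the 70% filter (assuming, as the question implies, that level 0 of df3 holds the stock ticker and level 1 is a DatetimeIndex of dates):
# fraction of positive returns per (stock, calendar month)
positive = df3['DAILY_RETURN'] > 0
share = positive.groupby([df3.index.get_level_values(0),
                          df3.index.get_level_values(1).month]).mean()

# keep only the (stock, month) pairs with at least 70% positive returns
print(share[share >= 0.7])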

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
# No numeric types to aggregate

print(df.groupby(['B']).agg(np.sum))
#         A
# B
# 81
# 111  0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
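Once the offending values are located, one way to repair the column (a sketch; it assumes turning the bad entries into NaN is acceptable) is pd.to_numeric with errors='coerce':
# convert the object column to float; non-numeric entries become NaN
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df['A'].dtype)  # float64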