Slice a Pandas dataframe with DatetimeIndex based on time interval - pandas

I'm trying to accomplish the following...
I have a Pandas dataframe with a number of entries, indexed by a DatetimeIndex, which looks a bit like this:
bro_df.info()
<class 'bat.log_to_dataframe.LogToDataFrame'>
DatetimeIndex: 3596641 entries, 2017-12-14 13:52:01.633070 to 2018-01-03 09:59:53.108566
Data columns (total 20 columns):
conn_state object
duration timedelta64[ns]
history object
id.orig_h object
id.orig_p int64
id.resp_h object
id.resp_p int64
local_orig bool
local_resp bool
missed_bytes int64
orig_bytes int64
orig_ip_bytes int64
orig_pkts int64
proto object
resp_bytes int64
resp_ip_bytes int64
resp_pkts int64
service object
tunnel_parents object
uid object
dtypes: bool(2), int64(9), object(8), timedelta64[ns](1)
memory usage: 528.2+ MB
What I'm interested in is getting a slice of this data that takes the last entry, '2018-01-03 09:59:53.108566' in this case, and then subtracts an hour from that. This should give me the last hour's worth of entries.
What I've tried to do so far is the following:
last_entry = bro_df.index[-1:]
first_entry = last_entry - pd.Timedelta('1 hour')
Which gives me what to me looks like fairly correct values, as per:
print(first_entry)
print(last_entry)
DatetimeIndex(['2018-01-03 08:59:53.108566'], dtype='datetime64[ns]', name='ts', freq=None)
DatetimeIndex(['2018-01-03 09:59:53.108566'], dtype='datetime64[ns]', name='ts', freq=None)
This is also sadly where I get stuck. I've tried various things with bro_df.loc, bro_df.iloc and so on, but all I get are various errors about datatypes, values not being in the index, etc. This leads me to think that I might need to convert the first_entry and last_entry variables to another type?
Or I might as usual be barking up entirely the wrong tree.
Any assistance or guidance would be most appreciated.
Cheers, Mike

It seems you need to create scalars by indexing with [0] and then select with loc:
df = bro_df.loc[first_entry[0]: last_entry[0]]
Or select by slicing the dataframe directly with []:
df = bro_df[first_entry[0]: last_entry[0]]
Sample:
rng = pd.date_range('2017-04-03', periods=10, freq='2H 24T')
bro_df = pd.DataFrame({'a': range(10)}, index=rng)
print (bro_df)
a
2017-04-03 00:00:00 0
2017-04-03 02:24:00 1
2017-04-03 04:48:00 2
2017-04-03 07:12:00 3
2017-04-03 09:36:00 4
2017-04-03 12:00:00 5
2017-04-03 14:24:00 6
2017-04-03 16:48:00 7
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9
last_entry = bro_df.index[-1:]
first_entry = last_entry - pd.Timedelta('3 hour')
print (last_entry)
DatetimeIndex(['2017-04-03 21:36:00'], dtype='datetime64[ns]', freq='144T')
print (first_entry)
DatetimeIndex(['2017-04-03 18:36:00'], dtype='datetime64[ns]', freq=None)
print (last_entry[0])
2017-04-03 21:36:00
print (first_entry[0])
2017-04-03 18:36:00
df = bro_df.loc[first_entry[0]: last_entry[0]]
print (df)
a
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9
df1 = bro_df[first_entry[0]: last_entry[0]]
print (df1)
a
2017-04-03 19:12:00 8
2017-04-03 21:36:00 9

Related

Python: Mixed date format in data frame column

I have a dataframe with mixed date formats across and within columns. When trying to convert them from object to datetime type, I get an error due to column date1 having a mixed format. I can't see how to fix it in this case. Also, how could I remove the seconds from both columns (date1 and date2)?
Here's the code I attempted:
df = pd.DataFrame(np.array([[10, "2021-06-13 12:08:52.311 UTC", "2021-03-29 12:44:33.468"],
[36, "2019-12-07 12:18:02 UTC", "2011-10-15 10:14:32.118"]
]),
columns=['col1', 'date1', 'date2'])
df
>>
col1 date1 date2
0 10 2021-06-13 12:08:52.311 UTC 2021-03-29 12:44:33.468
1 36 2019-12-07 12:18:02 UTC 2011-10-15 10:14:32.118
# Converting from object to datetime
df["date1"]= pd.to_datetime(df["date1"], format="%Y-%m-%d %H:%M:%S.%f UTC")
df["date2"]= pd.to_datetime(df["date2"], format="%Y-%m-%d %H:%M:%S.%f")
>>
ValueError: time data '2019-12-07 12:18:02 UTC' does not match format '%Y-%m-%d %H:%M:%S.%f UTC' (match)
For conversion to datetime, I found infer_datetime_format to be helpful.
I could not get it to work on the complete dataframe, but it is able to convert one column at a time.
In [19]: pd.to_datetime(df["date1"], infer_datetime_format=True)
Out[19]:
0 2021-06-13 12:08:52.311000+00:00
1 2019-12-07 12:18:02+00:00
Name: date1, dtype: datetime64[ns, UTC]
In [20]: pd.to_datetime(df["date2"], infer_datetime_format=True)
Out[20]:
0 2021-03-29 12:44:33.468
1 2011-10-15 10:14:32.118
Name: date2, dtype: datetime64[ns]
If at least all formats start with "%Y-%m-%d %H:%M", then you can slice all strings up to that point and use them:
In [32]: df['date1'].str.slice(stop=16)
Out[32]:
0 2021-06-13 12:08
1 2019-12-07 12:18
Name: date1, dtype: object
To get rid of the seconds in your datetime values, instead of simply dropping them you can use round; you can also check floor and ceil, whichever suits your use case better.
In [28]: pd.to_datetime(df["date1"], infer_datetime_format=True).dt.round('T')
Out[28]:
0 2021-06-13 12:09:00+00:00
1 2019-12-07 12:18:00+00:00
Name: date1, dtype: datetime64[ns, UTC]
In [29]: pd.to_datetime(df["date2"], infer_datetime_format=True).dt.round('T')
Out[29]:
0 2021-03-29 12:45:00
1 2011-10-15 10:15:00
Name: date2, dtype: datetime64[ns]
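The string-slicing idea above can be combined with an explicit format so that both columns parse the same way. This is only a sketch under the assumption that every value starts with "%Y-%m-%d %H:%M"; note that it truncates the seconds rather than rounding them, and that it discards the trailing "UTC" marker, so the result is timezone-naive.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [10, 36],
    "date1": ["2021-06-13 12:08:52.311 UTC", "2019-12-07 12:18:02 UTC"],
    "date2": ["2021-03-29 12:44:33.468", "2011-10-15 10:14:32.118"],
})

for col in ["date1", "date2"]:
    # Keep only the first 16 characters ("YYYY-MM-DD HH:MM"), then parse
    # with one fixed format so mixed suffixes no longer matter.
    df[col] = pd.to_datetime(df[col].str.slice(stop=16), format="%Y-%m-%d %H:%M")

print(df)
```

If the timezone matters, the parsed column can be localized afterwards with .dt.tz_localize("UTC").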

Date object and time integer to datetime

All, I have a dataframe with a date column and an hour column. I am trying to combine those into a single timestamp. I tried many solutions available using datetime.datetime.combine and just implicitly extracting month day and year and creating a datetime stamp with it but all lead to some error.
idOnController date eventTime Energy hour
0 5014 2018-05-31 2018-05-31 01:00:00 26.619 0
2 5014 2018-06-02 2018-06-02 02:00:00 29.251 0
3 5014 2018-06-03 2018-06-03 03:00:00 30.635 0
The datatypes are as follows
idOnController int64
date object
eventTime datetime64[ns]
Energy float64
hour int64
dtype: object
I am looking to combine date and hour into a timestamp that looks like eventTime and then replace eventTime with that value.
You can do:
df['new_date'] = pd.to_datetime(df['date']) + df['hour'] * pd.to_timedelta('1H')
Output of df.dtypes:
idOnController int64
date object
eventTime datetime64[ns]
Energy float64
hour int64
new_date datetime64[ns]
dtype: object
If you want to have the string timestamps you can do
df['new_date'] = df['new_date'].dt.strftime('%Y-%m-%d %H:%M:%S')
Another way of doing this would be (a bit more verbose though!):
df['date'] = pd.to_datetime(df['date'])
df['year'] = df.date.dt.year
df['month'] = df.date.dt.month
df['day'] = df.date.dt.day
df['date'] = pd.to_datetime(df[['year','month','day','hour']])
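A variant of the first approach converts the whole hour column to timedeltas with unit='h' instead of multiplying by a one-hour timedelta. This is a minimal sketch; the sample values below are made up for illustration and are not the asker's data.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2018-05-31", "2018-06-02", "2018-06-03"],  # strings (object dtype)
    "hour": [1, 2, 3],                                    # illustrative hours
})

# Parse the dates, then add the hour column as a timedelta measured in hours.
df["eventTime"] = pd.to_datetime(df["date"]) + pd.to_timedelta(df["hour"], unit="h")
print(df)
```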

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages based on the time difference between 'entry_date' and 'dob'. To do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
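Applied to the column names from the question, the .dt.days accessor gives a vectorized age calculation. This is a sketch with made-up sample dates; dividing by 365.25 is the approximation suggested in the question, not an exact calendar age.

```python
import pandas as pd

munged_data = pd.DataFrame({
    "entry_date": pd.to_datetime(["2015-03-15", "2015-03-15"]),
    "dob": pd.to_datetime(["1980-03-15", "2000-01-01"]),
})

# Subtracting two datetime columns yields timedeltas; .dt.days extracts
# the whole-day component as integers.
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days

# Approximate age in years, as sketched in the question.
munged_data["age"] = (days / 365.25).round().astype(int)
print(munged_data)
```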
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; this is coming in 0.12).
Not sure if you still need it, but in Pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.iloc[0]-df.iloc[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.iloc[0]-df.iloc[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

Finding max date of the month in a list of pandas timeseries dates

I have a timeseries without every date (ie. trading dates). Series can be reproduced here.
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
I would like the last day of the month in my list of dates ie: '2010-01-29' and '2010-02-16'
I have looked at Get the last date of each month in a list of dates in Python
and more specifically...
import pandas as pd
import numpy as np
df = pd.read_csv('/path/to/file/') # Load a dataframe with your file
df.index = df['my_date_field'] # set the dataframe index with your date
dfg = df.groupby(pd.TimeGrouper(freq='M')) # group by month / alternatively use MS for Month Start / referencing the previously created object
# Finally, find the max date in each month
dfg.agg({'my_date_field': np.max})
# To specifically coerce the results of the groupby to a list:
dfg.agg({'my_date_field': np.max})['my_date_field'].tolist()
... but can't quite figure out how to adapt this to my application. Thanks in advance.
You can try the following to get your desired output:
import numpy as np
import pandas as pd
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16']))
This:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True)
Or this:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))]
Both should yield something like the following:
#2010-01-29 43
#2010-02-16 48
To convert it into a list:
dates.groupby(dates.index.month).apply(pd.Series.tail,1).reset_index(level=0, drop=True).tolist()
Or:
dates[dates.groupby(dates.index.month).apply(lambda s: np.max(s.index))].tolist()
Both yield something like:
#[43, 48]
If you're dealing with a dataset that spans beyond one year, then you will need to group by both year and month. The following should help:
import numpy as np
import pandas as pd
z = ['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2010-01-11', '2010-01-12', '2010-01-13',
'2010-01-14', '2010-01-15', '2010-01-19', '2010-01-20',
'2010-01-21', '2010-01-22', '2010-01-25', '2010-01-26',
'2010-01-27', '2010-01-28', '2010-01-29', '2010-02-01',
'2010-02-02', '2010-02-03', '2010-02-04', '2010-02-05',
'2010-02-08', '2010-02-09', '2010-02-10', '2010-02-11',
'2010-02-12', '2010-02-16', '2011-01-04', '2011-01-05',
'2011-01-06', '2011-01-07', '2011-01-08', '2011-01-11',
'2011-01-12', '2011-01-13', '2011-01-14', '2011-01-15',
'2011-01-19', '2011-01-20', '2011-01-21', '2011-01-22',
'2011-01-25', '2011-01-26', '2011-01-27', '2011-01-28',
'2011-01-29', '2011-02-01', '2011-02-02', '2011-02-03',
'2011-02-04', '2011-02-05', '2011-02-08', '2011-02-09',
'2011-02-10', '2011-02-11', '2011-02-12', '2011-02-16']
dates1 = pd.Series(np.random.randint(100,size=60),index=pd.to_datetime(z))
This:
dates1.groupby([dates1.index.year, dates1.index.month]).apply(pd.Series.tail,1).reset_index(level=[0,1], drop=True)
Or:
dates1[dates1.groupby([dates1.index.year, dates1.index.month]).apply(lambda s: np.max(s.index))]
Both yield something like:
# 2010-01-29 66
# 2010-02-16 80
# 2011-01-29 13
# 2011-02-16 10
I hope this proves useful.
You can group by month and take the last value of the index in each group:
print (dates.groupby(dates.index.month).apply(lambda x: x.index[-1]))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
Another solution:
print (dates.groupby(dates.index.month).apply(lambda x: x.index.max()))
1 2010-01-29
2 2010-02-16
dtype: datetime64[ns]
For a list, first convert to strings with strftime:
print (dates.groupby(dates.index.month)
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-29', '2010-02-16']
If you need the value at each month's last date, use iloc:
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]))
1 55
2 48
dtype: int64
print (dates.groupby(dates.index.month).apply(lambda x: x.iloc[-1]).tolist())
[55, 48]
EDIT:
To group by year and month, convert the index with to_period using a monthly frequency:
dates=pd.Series(np.random.randint(100,size=30),index=pd.to_datetime(
['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-07',
'2010-01-08', '2011-01-11', '2011-01-12', '2011-01-13',
'2012-01-14', '2012-01-15', '2012-01-19', '2012-01-20',
'2013-01-21', '2013-01-22', '2013-01-25', '2013-01-26',
'2013-01-27', '2013-01-28', '2013-01-29', '2013-02-01',
'2014-02-02', '2014-02-03', '2014-02-04', '2014-02-05',
'2015-02-08', '2015-02-09', '2015-02-10', '2015-02-11',
'2016-02-12', '2016-02-16']))
#print (dates)
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1]))
2010-01 2010-01-08
2011-01 2011-01-13
2012-01 2012-01-20
2013-01 2013-01-29
2013-02 2013-02-01
2014-02 2014-02-05
2015-02 2015-02-11
2016-02 2016-02-16
Freq: M, dtype: datetime64[ns]
print (dates.groupby(dates.index.to_period('m'))
.apply(lambda x: x.index[-1]).dt.strftime('%Y-%m-%d').tolist())
['2010-01-08', '2011-01-13', '2012-01-20', '2013-01-29',
'2013-02-01', '2014-02-05', '2015-02-11', '2016-02-16']
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]))
2010-01 68
2011-01 96
2012-01 53
2013-01 4
2013-02 16
2014-02 18
2015-02 41
2016-02 90
Freq: M, dtype: int64
print (dates.groupby(dates.index.to_period('m')).apply(lambda x: x.iloc[-1]).tolist())
[68, 96, 53, 4, 16, 18, 41, 90]
EDIT1: If you need to convert the period to an end-of-month datetime:
df = dates.groupby(dates.index.to_period('m')).apply(lambda x: x.index[-1])
df.index = df.index.to_timestamp('m')
print (df)
2010-01-31 2010-01-08
2011-01-31 2011-01-13
2012-01-31 2012-01-20
2013-01-31 2013-01-29
2013-02-28 2013-02-01
2014-02-28 2014-02-05
2015-02-28 2015-02-11
2016-02-29 2016-02-16
dtype: datetime64[ns]
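The year-and-month grouping can also be sketched by grouping the index itself and taking the maximum per monthly period. This is a minimal sketch on a shortened, made-up set of trading dates; the names `last_dates` and `last_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2010-01-04", "2010-01-28", "2010-01-29",
    "2010-02-01", "2010-02-12", "2010-02-16",
    "2011-01-05", "2011-01-28",
])
dates = pd.Series(np.arange(len(idx)), index=idx)

# Group the timestamps by monthly period (year and month together)
# and keep the latest one present in each group.
last_dates = dates.index.to_series().groupby(dates.index.to_period("M")).max()

# Look up the values stored at those dates.
last_rows = dates.loc[last_dates.to_list()]

print(last_dates.dt.strftime("%Y-%m-%d").tolist())
print(last_rows)
```

Because the grouping key is a period rather than the bare month number, January 2010 and January 2011 end up in separate groups.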

pandas HDFStore select rows by datetime index

I'm sure this is probably very simple but I can't figure out how to slice a pandas HDFStore table by its datetime index to get a specific range of rows.
I have a table that looks like this:
mdstore = pd.HDFStore('store.h5')
histTable = '/ES_USD20120615_MIDPOINT30s'
print(mdstore[histTable])
open high low close volume WAP \
date
2011-12-04 23:00:00 1266.000 1266.000 1266.000 1266.000 -1 -1
2011-12-04 23:00:30 1266.000 1272.375 1240.625 1240.875 -1 -1
2011-12-04 23:01:00 1240.875 1242.250 1240.500 1242.125 -1 -1
...
[488000 rows x 7 columns]
For example I'd like to get the range from 2012-01-11 23:00:00 to 2012-01-12 22:30:00. If it were in a df I would just use datetimes to slice on the index, but I can't figure out how to do that directly from the store table so I don't have to load the whole thing into memory.
I tried mdstore.select(histTable, where='index>20120111') and that worked inasmuch as I got everything on the 11th and 12th, but I couldn't see how to add a time.
Example is here
needs pandas >= 0.13.0
In [2]: df = pd.DataFrame(np.random.randn(5),index=pd.date_range('20130101 09:00:00',periods=5,freq='s'))
In [3]: df
Out[3]:
0
2013-01-01 09:00:00 -0.110577
2013-01-01 09:00:01 -0.420989
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
2013-01-01 09:00:04 -0.830469
[5 rows x 1 columns]
In [4]: df.to_hdf('test.h5','data',mode='w',format='table')
Specify it as a quoted string
In [8]: pd.read_hdf('test.h5','data',where='index>"20130101 09:00:01" & index<"20130101 09:00:04"')
Out[8]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
You can also specify it directly as a Timestamp
In [10]: pd.read_hdf('test.h5','data',where='index>Timestamp("20130101 09:00:01") & index<Timestamp("20130101 09:00:04")')
Out[10]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
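For the range asked about in the question, the quoted-string form can be assembled as below. This is a sketch only: `datetime_range_where` is a hypothetical helper, not part of pandas, and the final select call is left commented out because it needs the asker's store and PyTables installed.

```python
import pandas as pd


def datetime_range_where(start, end):
    """Build a `where` string selecting rows with start <= index <= end.

    Hypothetical helper for illustration; it only formats the expression.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    return f'index >= "{start}" & index <= "{end}"'


where = datetime_range_where("2012-01-11 23:00:00", "2012-01-12 22:30:00")
print(where)

# With the store from the question (assumes the node was written in table format):
# subset = mdstore.select(histTable, where=where)
```

As in the example above, queries with where only work on data stored with format='table'.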