matplotlib, pandas, how to generate a histogram of timedeltas?

I have a pandas.DataFrame df that contains the following series:
Time
2182447 0 days 05:44:00
2182447 0 days 05:49:00
3129563 0 days 22:09:00
13341029 0 days 16:49:00
13341029 0 days 16:58:00
25622668 0 days 08:24:00
25622668 0 days 08:28:00
30077018 24 days 15:01:00
30077018 24 days 15:09:00
20131954 0 days 06:18:00
I would like to plot a histogram of the timedeltas. However:
hist(df)
df.Time.hist()
# both functions give the same error
>>> TypeError: Cannot cast ufunc less input from dtype('float64') to dtype('<m8[ns]') with casting rule 'same_kind'

The following works:
hist(df.Time.astype('timedelta64[h]'))
You can use different units in the astype argument. Here I use 'h' for hours.
More detailed description can be found here.
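As a variation on the conversion above, here is a minimal sketch that turns the timedeltas into float hours with dt.total_seconds() before binning. It may be useful because newer pandas releases can reject astype('timedelta64[h]'). The sample values are taken from the question; the names time and hours and the choice of 4 bins are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# A few of the timedeltas from the question (illustrative subset).
time = pd.Series(pd.to_timedelta([
    "0 days 05:44:00",
    "0 days 05:49:00",
    "0 days 22:09:00",
    "24 days 15:01:00",
]))

# Convert to plain floats (hours); histogram functions accept these.
hours = time.dt.total_seconds() / 3600

# Bin the values; the bin count is an arbitrary choice here.
counts, edges = np.histogram(hours, bins=4)
print(counts)

# To draw the plot instead of just counting, hours.hist(bins=4) should work
# once matplotlib is installed.
```

Dividing by 60 or 86400 instead of 3600 gives minutes or days in the same way.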

Why is datetime64 converted to timedelta64 when converting into a YYYY-MM string

I want to convert time columns (dtype: datetime64[ns]) in a pandas.DataFrame into strings representing the year and month only.
It works as expected if all values in the column are valid.
0 2019-4
1 2017-12
dtype: object
But with missing values (pandas.NaT) in the column the result confuses me.
0 -1 days +23:59:59.999979806
1 -1 days +23:59:59.999798288
2 NaT
dtype: timedelta64[ns]
Or with .unique() it is array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]').
It seems that the result somehow becomes a timedelta64. Why does this happen?
The complete example code:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
# series without missing values
series = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05')])
def year_month_string(cell):
    """Convert a datetime64 into string representation with
    year and month only.
    """
    if pd.isna(cell):
        return pd.NaT
    return '{}-{}'.format(cell.year, cell.month)
print(series.apply(year_month_string))
# 0 2019-4
# 1 2017-12
# dtype: object
# Series with a missing value
series_nat = pd.Series([
    np.datetime64('2019-04-08'),
    np.datetime64('2017-12-05'),
    pd.NaT])
result = series_nat.apply(year_month_string)
print(result)
# 0 -1 days +23:59:59.999979806
# 1 -1 days +23:59:59.999798288
# 2 NaT
# dtype: timedelta64[ns]
print(result.unique())
# array([ -20194, -201712, 'NaT'], dtype='timedelta64[ns]')
Don't use a custom function; use strftime with %-m (the minus strips the leading zero; this flag is platform-dependent and is not supported on Windows):
series_nat.dt.strftime('%Y-%-m')
output:
0 2019-4
1 2017-12
2 NaN
dtype: object
%m would keep the leading zeros:
series_nat.dt.strftime('%Y-%m')
output:
0 2019-04
1 2017-12
2 NaN
dtype: object
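If %-m is not available on your platform, one possible workaround is to format with the portable %m and strip the zero afterwards. This is only a sketch; the variable names are made up for illustration, and it assumes four-digit years so that "-0" can only occur in front of the month.

```python
import pandas as pd

# Same data as in the question, including a missing value.
series_nat = pd.Series(pd.to_datetime(["2019-04-08", "2017-12-05", None]))

# Portable format with a zero-padded month; NaT becomes NaN.
padded = series_nat.dt.strftime("%Y-%m")

# Drop the leading zero of the month with a literal (non-regex) replacement.
year_month = padded.str.replace("-0", "-", regex=False)
print(year_month)
```

The missing value stays missing in the result instead of changing the dtype of the whole column.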

How to resample intra-day intervals and use .idxmax()?

I am using data from yfinance, which returns a pandas DataFrame.
Volume
Datetime
2021-09-13 09:30:00-04:00 951104
2021-09-13 09:35:00-04:00 408357
2021-09-13 09:40:00-04:00 498055
2021-09-13 09:45:00-04:00 466363
2021-09-13 09:50:00-04:00 315385
2021-12-06 15:35:00-05:00 200748
2021-12-06 15:40:00-05:00 336136
2021-12-06 15:45:00-05:00 473106
2021-12-06 15:50:00-05:00 705082
2021-12-06 15:55:00-05:00 1249763
The DataFrame holds 5-minute intraday intervals. I want to resample to daily data and get the idxmax of the maximum volume for each day.
df.resample("B")["Volume"].idxmax()
Returns an error:
ValueError: attempt to get argmax of an empty sequence
I used B (business days) as the resampling period, so there shouldn't be any empty sequences.
I should say .max() works fine.
Also using .agg as was suggested in another question returns an error:
df["Volume"].resample("B").agg(lambda x : np.nan if x.count() == 0 else x.idxmax())
error:
IndexError: index 77 is out of bounds for axis 0 with size 0
You can use groupby as an alternative to resample:
>>> df.groupby(df.index.normalize())['Volume'].agg(Datetime='idxmax', Volume='max')
Datetime Volume
Datetime
2021-09-13 2021-09-13 09:30:00 951104
2021-12-06 2021-12-06 15:55:00 1249763
For me, testing inside the if-else whether all values in the group are NaN works:
df = df.resample("B")["Volume"].agg(lambda x: np.nan if x.isna().all() else x.idxmax())
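A likely reason for the empty sequence (an assumption, since the full data is not shown) is that "B" creates a bin for every weekday in the range, including market holidays and any other weekday without rows. Grouping by the normalized index only creates groups for dates that actually occur. A small sketch with made-up volumes and a deliberately missing weekday:

```python
import pandas as pd

# Toy intraday data: 2021-09-14 (a Tuesday) has no rows at all.
idx = pd.to_datetime([
    "2021-09-13 09:30", "2021-09-13 09:35",
    "2021-09-15 09:30", "2021-09-15 09:35",
])
df = pd.DataFrame({"Volume": [951104, 408357, 100, 200]}, index=idx)

# Group by calendar date; only dates present in the index become groups,
# so idxmax never sees an empty group.
by_day = df.groupby(df.index.normalize())["Volume"].idxmax()
print(by_day)
```

Here by_day has two entries, one per trading day present in the data, and each value is the timestamp of that day's maximum volume.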

Pandas Series: Decrement DateTime by 100 Years

I have a pandas series as follows...
0 2039-03-16
1 2056-01-21
2 2051-11-18
3 2064-03-05
4 2048-06-05
Name: BIRTH, dtype: datetime64
It was created from string data as follows
s = data['BIRTH']
s = pd.to_datetime(s)
s
I want to convert all dates after year 2040 to 1940
I can do this for a single record as follows
s.iloc[0].replace(year=s.iloc[0].year - 100)
but I really want to just run it over the whole series. I can't work it out. Help!??
PS - I know there are ways to do this outside of pandas using Python's datetime module, but I'd like to learn how to do this within pandas, please.
Using DateOffset is the obvious choice here:
df['date'] - pd.offsets.DateOffset(years=100)
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Assign it back:
df['date'] -= pd.offsets.DateOffset(years=100)
df
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
We have the offsets module to deal with non-fixed frequencies; it comes in handy in situations like these.
To fix your code, you would have to apply datetime.replace row-wise using apply (not recommended):
df['date'].apply(lambda x: x.replace(year=x.year-100))
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Name: date, dtype: datetime64[ns]
Or using a list comprehension,
df.assign(date=[x.replace(year=x.year-100) for x in df['date']])
date
0 1939-03-16
1 1956-01-21
2 1951-11-18
3 1964-03-05
4 1948-06-05
Neither of these handles NaT entries very well.
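For comparison, here is a small sketch of the DateOffset subtraction on a series that contains a missing value. The data is made up from the question's first two rows plus a NaT, and the name shifted is just for illustration.

```python
import pandas as pd

# Two dates from the question plus a missing entry.
s = pd.Series(pd.to_datetime(["2039-03-16", "2056-01-21", None]))

# Subtract 100 calendar years from every element; NaT stays NaT.
shifted = s - pd.DateOffset(years=100)
print(shifted)
```

Depending on the pandas version this may emit a PerformanceWarning, but the result keeps the datetime64[ns] dtype and the missing entry passes through untouched.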

'NaTType' object has no attribute 'days'

I have a column in my dataset that represents a date in ms, and sometimes its value is nan (actually my column is of type str and sometimes its value is 'nan'). I want to compute the epoch in days of this column. The problem is that when taking the difference of two dates:
(pd.to_datetime('now') - pd.to_datetime(np.nan)).days
if one is nan, it is converted to NaT, and the difference is of type NaTType, which has no attribute days.
In my case I would like to have nan as a result.
Other approaches I have tried: np.datetime64 cannot be used, since it cannot take nan as an argument. My data cannot be converted to int, since int has no nan.
It will just work even if you filter first:
In [201]:
df = pd.DataFrame({'date':[dt.datetime.now(), pd.NaT, dt.datetime(2015,1,1)]})
df
Out[201]:
date
0 2015-08-28 12:12:12.851729
1 NaT
2 2015-01-01 00:00:00.000000
In [203]:
df.loc[df['date'].notnull(), 'days'] = (pd.to_datetime('now') - df['date']).dt.days
df
Out[203]:
date days
0 2015-08-28 12:12:12.851729 -1
1 NaT NaN
2 2015-01-01 00:00:00.000000 239
For me upgrading to pandas 0.20.3 from pandas 0.19.2 helped resolve this error.
pip install --upgrade pandas
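For the string column of millisecond values described in the question, a column-wise sketch could look like the following. The sample values, the fixed reference timestamp and all variable names are assumptions for illustration; in real code you would probably use pd.Timestamp.now() as the reference.

```python
import pandas as pd

# Millisecond epochs stored as strings, with a literal 'nan' entry.
raw = pd.Series(["1420070400000", "nan"])

# Strings -> floats; anything unparseable becomes NaN.
ms = pd.to_numeric(raw, errors="coerce")

# Floats -> datetimes; NaN becomes NaT.
dates = pd.to_datetime(ms, unit="ms")

# Fixed reference point so the result is reproducible.
ref = pd.Timestamp("2015-08-28 12:00:00")

# The .dt accessor works element-wise and yields NaN where the date is NaT.
age_days = (ref - dates).dt.days
print(age_days)
```

Working on the whole column through .dt avoids calling .days on a scalar NaT, which is where the error in the title comes from.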

detecting jumps on pandas index dates

I managed to load historical data series for a large set of financial instruments, indexed by date.
I am plotting volume and price information without any issue.
What I want to achieve now is to determine whether there are any big jumps in the dates, to see if I am missing large chunks of data.
The idea I had in mind was to plot the difference between two consecutive dates in the index; if the number is greater than 3 or 4 (which is bigger than a weekend plus a bank holiday on a Friday or Monday), then there is an issue.
The problem is that I can't figure out how to simply compute df[next day] - df[day], where df is indexed by day.
You can use the shift Series method (note the DatetimeIndex method shifts by freq):
In [11]: rng = pd.DatetimeIndex(['20120101', '20120102', '20120106']) # DatetimeIndex like df.index
In [12]: s = pd.Series(rng) # df.index instead of rng
In [13]: s - s.shift()
Out[13]:
0 NaT
1 1 days, 00:00:00
2 4 days, 00:00:00
dtype: timedelta64[ns]
In [14]: s - s.shift() > pd.offsets.Day(3).nanos
Out[14]:
0 False
1 False
2 True
dtype: bool
Depending on what you want, perhaps you could either do any, or find the problematic values...
In [15]: (s - s.shift() > pd.offsets.Day(3).nanos).any()
Out[15]: True
In [16]: s[s - s.shift() > pd.offsets.Day(3).nanos]
Out[16]:
2 2012-01-06 00:00:00
dtype: datetime64[ns]
Or perhaps find the maximum jump (and where it is):
In [17]: (s - s.shift()).max() # it's weird this returns a Series...
Out[17]:
0 4 days, 00:00:00
dtype: timedelta64[ns]
In [18]: (s - s.shift()).idxmax()
Out[18]: 2
If you really wanted to plot this, simply plotting the difference would work:
(s - s.shift()).plot()
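On recent pandas versions, comparing a timedelta series with an integer number of nanoseconds may raise a TypeError, so here is the same idea sketched with diff() and an explicit pd.Timedelta threshold. The 3-day threshold comes from the question; the names gaps and big are illustrative.

```python
import pandas as pd

# Same example index as above.
idx = pd.DatetimeIndex(["2012-01-01", "2012-01-02", "2012-01-06"])

# Difference between each date and the previous one; the first entry is NaT.
gaps = idx.to_series().diff()

# Keep only the gaps longer than 3 days.
big = gaps[gaps > pd.Timedelta(days=3)]
print(big)

# Largest gap and the date at which it ends.
print(gaps.max(), gaps.idxmax())
```

Because the series is indexed by the dates themselves, big directly shows the date that follows each suspicious gap.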