How to plot bar gaps in a pandas DataFrame with timedelta and timestamp

Given a timestamped df with timedelta showing time covered such as:
df = pd.DataFrame(pd.to_timedelta(['00:45:00','01:00:00','00:30:00']).rename('span'),
index=pd.to_datetime(['2019-09-19 18:00','2019-09-19 19:00','2019-09-19 21:00']).rename('ts'))
# span
# ts
# 2019-09-19 18:00:00 00:45:00
# 2019-09-19 19:00:00 01:00:00
# 2019-09-19 21:00:00 00:30:00
How can I plot a bar graph showing drop outs every 15 minutes? What I want is a bar graph that will show 0 or 1 on the Y axis with a 1 for each 15 minute segment in the time periods covered above, and a 0 for all the 15 minute segments not covered.
Per this answer I tried:
df['span'].astype('timedelta64[m]').plot.bar()
However this plots each timespan vertically, and does not show that the whole hour of 2019-09-19 20:00 is missing.
I tried
df['span'].astype('timedelta64[m]').plot()
It plots the following which is not very useful.
I also tried this answer to no avail.
Update
Based on lostCode's answer I was able to further modify the DataFrame as follows:
def isvalid(period):
    # return 1 if the period starts inside any covered span, else 0
    for ndx, row in df.iterrows():
        if (period.start_time >= ndx) and (period.start_time < row.end):
            return 1
    return 0
df['end'] = df.index + df.span
ds = pd.period_range(df.index.min(), df.end.max(), freq='15min')
df_valid = pd.DataFrame(ds.map(isvalid).rename('valid'), index=ds.rename('period'))
Is there a better, more efficient way to do it?

You can use DataFrame.resample to build a new DataFrame with one row per hour, then use isin to check which of those hours actually appear in the original index:
import numpy as np
check=df.resample('H')['span'].sum().reset_index()
d=df.reset_index('ts').sort_values('ts')
check['valid']=np.where(check['ts'].isin(d['ts']),1,0)
check.set_index('ts')['valid'].plot(kind='bar',figsize=(10,10))
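The hourly resample above marks whole hours. For the 15-minute resolution asked about in the question, one possible vectorized sketch (not part of the original answer) compares every 15-minute bin start against all span boundaries at once using NumPy broadcasting. The names end, bins and valid are illustrative, and a bin is assumed to count as covered when its start falls in [start, end) of some span.

```python
import pandas as pd

df = pd.DataFrame(
    {"span": pd.to_timedelta(["00:45:00", "01:00:00", "00:30:00"])},
    index=pd.to_datetime(
        ["2019-09-19 18:00", "2019-09-19 19:00", "2019-09-19 21:00"]
    ).rename("ts"),
)

# End of each covered span (a Series indexed by ts).
end = df.index + df["span"]

# Start of every 15-minute bin between the first start and the last end.
bins = pd.date_range(df.index.min(), end.max(), freq="15min")

# Broadcast: rows are bins, columns are spans.
# A bin is covered if its start lies in [start, end) of any span.
b = bins.to_numpy()[:, None]
covered = ((b >= df.index.to_numpy()) & (b < end.to_numpy())).any(axis=1)

valid = pd.Series(covered.astype(int), index=bins, name="valid")
```

Calling valid.plot.bar() on the result should then draw a 0/1 bar per segment. The broadcast builds a bins-by-spans boolean array, so this trades memory for speed and suits a moderate number of spans.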

Related

Find maximum between -300 seconds to -30 seconds ago?

To find the maximum over the last 300 seconds:
import pandas as pd
# 16 to 17 minutes of time-series data.
df = pd.DataFrame(range(10000))
df.index = pd.date_range(1, 1000000000000, 10000)
# maximum over last 300 seconds. (outputs 9999)
df[0].rolling('300s').max().tail(1)
How can I exclude the most recent 30s from the rolling calculation? I want the max between -300s and -30s.
So, instead of 9999 being outputted by the above, I want something like 9700 (thereabouts) to be displayed.
You can compute the rolling max for the last 271 seconds (271 instead of 270 if you need that 300th second included), then shift the results by 30 seconds, and merge them with the original dataframe. Since in your example the index is at the sub-second level, you will need to utilize merge_asof to find the desired matches (you can use the direction parameter of that function to select non-exact matches).
import pandas as pd
# 16 minutes and 40 seconds of time-series data.
df = pd.DataFrame(range(10_000))
df.index = pd.date_range(1, 1_000_000_000_000, 10_000)
roll_max = df[0].rolling('271s').max().shift(30, freq='s').rename('roll_max')
res = pd.merge_asof(df, roll_max, left_index=True, right_index=True)
print(res.tail(1))
# 0 roll_max
# 1970-01-01 00:16:40 9999 9699.0
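As a sanity check on the last row only, the same window can be evaluated directly with a label slice. This is a sketch for verification, not a replacement for the rolling approach: it assumes the window is inclusive at both ends (which is how .loc slices a DatetimeIndex), and the names now and window are illustrative.

```python
import pandas as pd

# Same synthetic data as above: 10,000 points over 1,000 seconds.
df = pd.DataFrame(range(10_000))
df.index = pd.date_range(1, 1_000_000_000_000, 10_000)

# Take the most recent timestamp as "now".
now = df.index[-1]

# Rows between 300 s and 30 s before "now"; label slices include both ends.
window = df.loc[now - pd.Timedelta("300s"): now - pd.Timedelta("30s"), 0]

print(window.max())  # 9699
```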

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
data={'category':['A','A','B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
data={'category':['A','A','B','B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DF for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
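On a pandas version that still shows the wide result, one possible workaround (an assumption, not something proposed in the original thread) is to group by the category and a pd.Grouper on the index instead of chaining resample. This returns a Series with a MultiIndex regardless of how the groups line up. Unlike resample, it only lists the bins that actually occur in each group.

```python
import pandas as pd

df2 = pd.DataFrame(
    index=pd.DatetimeIndex(
        ["2022-1-1 1:00", "2022-1-2 1:00", "2022-1-2 1:00", "2022-1-1 1:00"]
    ),
    data={"category": ["A", "A", "B", "B"]},
)

# Group by category and by daily bins of the DatetimeIndex in one step.
res = df2.groupby(["category", pd.Grouper(freq="1D")]).size()
print(res)
```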

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the Pandas Time series/date functionality documentation, which has some information on out of bounds spans, but from this information, it still wasn't clear to me what, if anything I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow on this, but they either just point out the problem (i.e. nanoseconds, max range 570-something years), or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd
df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
'date': ['1671-11-25', '1669-11-22', '1666-05-15','1673-01-18','1675-05-07','1677-02-08','1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create day periods by lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in a comment (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
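Once the column holds day periods, it can be sorted and queried much like a datetime column. The sketch below uses the first three rows of the sample data and builds the column through pd.PeriodIndex so that the dtype is explicitly period[D]. That construction step is an illustrative choice, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["836", "655", "508"],
    "date": ["1671-11-25", "1669-11-22", "1666-05-15"],
})

# Build day periods one by one, then wrap them in a PeriodIndex so the
# column gets the period[D] dtype and the .dt accessor works on it.
df["date"] = pd.PeriodIndex([pd.Period(s, freq="D") for s in df["date"]])

# Periods sort chronologically and expose components such as the year.
df = df.sort_values("date")
years = df["date"].dt.year
print(df)
print(years.tolist())  # [1666, 1669, 1671]
```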

Pandas df histo, format my x ticker and include empty

I got this pandas df:
index TIME
12:07 2019-06-03 12:07:28
10:04 2019-06-04 10:04:25
11:14 2019-06-09 11:14:25
...
I use this command to plot a histogram of the number of occurrences in each 15-minute period:
df['TIME'].groupby([df["TIME"].dt.hour, df["TIME"].dt.minute]).count().plot(kind="bar")
My plot looks like this:
How can I get x ticks like 10:15 instead of (10, 15), and how can I add the missing x ticks such as 9:15, 9:30... to get a complete timeline?
You can resample your TIME column to 15-minute intervals and count the number of rows in each. Then plot a regular bar chart.
import numpy as np
import pandas as pd
import matplotlib.ticker
df = pd.DataFrame({'TIME': pd.to_datetime('2019-01-01') + pd.to_timedelta(np.random.rand(100) * 3, unit='h')})
df = df[df.TIME.dt.minute > 15] # make gap
# size() counts the rows in each 15-minute bin, including empty bins
ax = df.resample('15min', on='TIME').size().plot.bar(rot=0)
# keep only the HH:MM part of each tick label
ticklabels = [x.get_text()[-8:-3] for x in ax.get_xticklabels()]
ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(ticklabels))
(for details about formatting datetime ticklabels of pandas bar plots see this SO question)
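A possible alternative to slicing the tick label text is to format the index itself before plotting, so the bar plot picks up HH:MM labels directly. This is a sketch under assumptions: the data is synthetic, and rng, times and counts are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic timestamps between 09:00 and 12:00 (seeded for repeatability).
rng = np.random.default_rng(0)
times = pd.to_datetime("2019-01-01 09:00") + pd.to_timedelta(
    rng.random(100) * 3, unit="h"
)
df = pd.DataFrame({"TIME": times})

# Count rows per 15-minute bin; bins with no rows get a count of 0.
counts = df.resample("15min", on="TIME").size()

# Replace the datetime index with plain HH:MM strings for the x axis.
counts.index = counts.index.strftime("%H:%M")
```

counts.plot.bar(rot=0) then labels each bar with its bin start time, with no formatter needed.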

how to group pandas timestamps plot several plots in one figure and stack them together in matplotlib?

I have a data frame with perfectly organised timestamps, like below:
It's a web log, and the timestamps go through the whole year. I want to cut them into days, show the visits within each hour, and plot all the days in the same figure, stacked together, just like the picture shown below:
I am doing well at cutting them into days and plotting the visits of a single day, but I am having trouble plotting them all and stacking them together. The primary tools I am using are Pandas and Matplotlib.
Any advice or suggestions? Much appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density through the whole period using the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!
I think you need pivot:
# L is the list of timestamps from https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
df = pd.DataFrame({'date':L})
print (df.head())
#                  date
# 0 2014-08-01 00:05:46
# 1 2014-08-01 00:14:47
# 2 2014-08-01 00:16:05
# 3 2014-08-01 00:20:46
# 4 2014-08-01 00:23:22
# convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'])
# resample by hour, count the rows in each bin, and create a DataFrame
df = df.resample('h', on='date').size().to_frame('count')
# extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
# pivot and plot
# maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
# df.pivot(index='days', columns='hours', values='count').plot(rot=90)
# edit: last line changed to the one below:
df.pivot(index='hours', columns='days', values='count').plot(rot=90)
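A table of the same shape (hours as rows, days as columns) can also be reached with groupby and unstack, without the intermediate columns. The sketch below runs on synthetic timestamps, since the gist data is not reproduced here. The names rng, offsets and table are illustrative. Unlike the resample route, hours with no visits on any day are simply absent from the rows.

```python
import numpy as np
import pandas as pd

# 500 synthetic visit timestamps spread over three days (seeded).
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 3 * 24 * 3600, size=500), unit="s")
df = pd.DataFrame({"date": pd.Timestamp("2014-08-01") + offsets})

# Group by calendar day and hour of day, then move the days into columns.
days = df["date"].dt.date.rename("days")
hours = df["date"].dt.hour.rename("hours")
table = df.groupby([days, hours]).size().unstack("days", fill_value=0)
```

table.plot() then draws one line per day over the hours of the day.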