pandas: how to group by time intervals of varying length? - pandas

I know that it is possible to group your data by time intervals of the same length by using the function resample. But how can I group by time intervals of custom length (i.e. irregular time intervals)?
Here is an example:
Say we have a dataframe with time values, like this:
rng = pd.date_range(start='2015-02-11', periods=7, freq='M')
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) })
And we have the following time intervals:
2015-02-12 ----- 2015-05-10
2015-05-10 ----- 2015-08-20
2015-08-20 ----- 2016-01-01
It is clear that rows with index 0,1,2 belong to the first time interval, rows with index 3,4,5 belong to the second time interval and row 6 belongs to the last time interval.
My question is: how do I group these rows according to those specific time intervals, in order to perform aggregate functions (e.g. mean) on them?
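One possible approach, sketched here under assumptions rather than taken from an accepted answer: bin the dates with pd.cut using the interval edges, then group by the resulting bins. The fixed values in Val (instead of random numbers) and the column name Interval are illustrative choices, and freq is given as a MonthEnd offset object so that the month-end dates of the example are reproduced without relying on a string alias.

```python
import pandas as pd

# Same month-end dates as in the question; fixed values make the result checkable.
rng = pd.date_range(start="2015-02-11", periods=7, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"Date": rng, "Val": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})

# Edges of the three custom intervals (each bin is open on the left, closed on the right).
edges = pd.to_datetime(["2015-02-12", "2015-05-10", "2015-08-20", "2016-01-01"])

# Assign every row to its interval, then aggregate per interval.
df["Interval"] = pd.cut(df["Date"], bins=edges)
result = df.groupby("Interval", observed=False)["Val"].agg(["mean", "count"])
print(result)
```

Rows whose date falls outside all of the edges get NaN as their interval and drop out of the groupby, so the edges should cover the whole date range of interest.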

Related

Pandas resample by integration over time with non equidistant data

I have a DataFrame with a DatetimeIndex with non-equidistant timestamps. I want to get the mean for each hour, but resample().mean() does not take the time distance between the timestamps into account.
How can I resample a DataFrame with a Datetimeindex to integrate the values in a column?
Given the following data:
time   data
00:15  5
00:55  1
00:56  1
00:57  1
resample().mean() would give 2, but the value 1 was only set for 3 out of 60 minutes.
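A minimal sketch of one way to weight by time, assuming each value holds until the next timestamp: forward-fill the series onto a regular 1-minute grid, then take the hourly mean of that grid. The calendar date 2021-01-01 and the variable names are placeholders, not part of the question.

```python
import pandas as pd

# Sample data from the question; the date itself is an arbitrary placeholder.
idx = pd.to_datetime([
    "2021-01-01 00:15", "2021-01-01 00:55",
    "2021-01-01 00:56", "2021-01-01 00:57",
])
s = pd.Series([5.0, 1.0, 1.0, 1.0], index=idx)

# Unweighted hourly mean: every sample counts equally.
plain_mean = s.resample("60min").mean()

# Put the series on a 1-minute grid, holding each value until the next sample,
# so a value that lasted 40 minutes contributes 40 grid points.
per_minute = s.resample("1min").ffill()
weighted_mean = per_minute.resample("60min").mean()

print(plain_mean.iloc[0], weighted_mean.iloc[0])
```

Here the grid runs from 00:15 to 00:57, so the weighted mean is (40 * 5 + 3 * 1) / 43, about 4.72. Minutes before the first sample and after the last one are not covered; whether they should be filled, and with what, depends on the data.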

Generate dataframe with timeseries index starting today and fixed interval

I'm trying to generate a pandas DataFrame with a time series index at a fixed interval. As input parameters I need to provide the start and end dates. The challenge is that the generated index starts either at the month start with freq='3MS' or at the month end with freq='3M'. The interval cannot be defined as a number of days, because the whole year needs to have exactly 4 periods and the index must begin on the specified start date.
The expected output should be in this case:
2020-10-05
2021-01-05
2021-04-05
2021-10-05
Any ideas appreciated.
interpolated = pd.DataFrame(index=pd.date_range('2020-10-05', '2045-10-05', freq='3M'), columns=['dummy'])
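One way this might be done, as a sketch rather than a confirmed answer: pass a pd.DateOffset(months=3) object as freq. Unlike the '3M' and '3MS' aliases it is not anchored to the month end or month start, so the index keeps the day of month of the start date. The names idx and interpolated follow the question; everything else is an assumption.

```python
import pandas as pd

# A calendar-aware step of three months that keeps the day of month of the start date.
step = pd.DateOffset(months=3)
idx = pd.date_range("2020-10-05", "2045-10-05", freq=step)

interpolated = pd.DataFrame(index=idx, columns=["dummy"])
print(interpolated.index[:4])
```

Start dates late in the month (the 29th to 31st) are clipped to the last day of shorter months, which may or may not be the desired behaviour.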

Timeseries resample error - none of Dateindex in column pandas

Please excuse obvious errors - still in the learning process.
I am trying to do a simple time series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour and including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for daily and monthly means.
This would be relatively simple if the following error did not crop up: "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
#Conversion of 'value' column to numeric/float values.
df.Date = pd.to_datetime(df.Date,errors='coerce')
year=df.Date.dt.year
df['Year'] = df['Date'].map(lambda x: x.year )
df.value = pd.to_numeric(df.value,errors='coerce' )
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df=df.set_index('Date') (Date is a string). Also I would move the conversions over into the constructor if possible after you get it working.
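To illustrate the suggestion, here is a small sketch of the whole path from raw strings to filtered hourly means. The first two rows come from the question; the third row (with an unparseable value) is made up to show the filter, and the explicit format string is an assumption about how the dates are written.

```python
import pandas as pd

# Two rows from the question plus one hypothetical row with a bad value.
df = pd.DataFrame({
    "Date": ["1/1/2016 0:00", "1/1/2016 0:15", "1/1/2016 1:00"],
    "value": ["405.22", "418.56", "bad"],
})

# Convert both columns; anything that cannot be parsed becomes NaT / NaN.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %H:%M", errors="coerce")
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Pass the column NAME as a string, then resample on the resulting DatetimeIndex.
df = df.set_index("Date")
hourly = df["value"].resample("60min").agg(["mean", "count"])

# Keep only hours with at least one valid observation.
hourly_mean = hourly.loc[hourly["count"] >= 1, "mean"]
print(hourly_mean)
```

Once Date is the index, df['Date'] no longer exists as a column, so the resample call works on the index directly instead of selecting the column again.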

Pandas Count Number of On/Off Events and Duration

I have a DataFrame with two columns, one containing the time of an event and the other containing whether the event is an On or an Off. I would like to count the number of times an On occurs followed by an Off as well as the total duration On occurs.
For example see this DataFrame:
Time Event
01:00 On
01:15 Off
01:16 Off
02:00 On
02:15 Off
23:30 On
Would have 2 On/Off events with a total duration of 0:30.
I'm not sure how to approach this problem.
Create a mask, which gives you the number of events. Then subtract to get the time difference.
df['Time'] = pd.to_timedelta(df.Time+':00')
m = df.Event.eq('On') & df.Event.shift(-1).eq('Off')
m.sum()
#2
(df.shift(-1).loc[m, 'Time'] - df.loc[m, 'Time']).sum()
#Timedelta('0 days 00:30:00')
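For reference, the same idea as a self-contained script built from the sample table in the question; the variable names n_events and total_on are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Time": ["01:00", "01:15", "01:16", "02:00", "02:15", "23:30"],
    "Event": ["On", "Off", "Off", "On", "Off", "On"],
})
df["Time"] = pd.to_timedelta(df["Time"] + ":00")

# True where an On row is immediately followed by an Off row.
m = df["Event"].eq("On") & df["Event"].shift(-1).eq("Off")
n_events = int(m.sum())

# Duration of each event: time of the following Off minus time of the On.
total_on = (df["Time"].shift(-1)[m] - df["Time"][m]).sum()
print(n_events, total_on)
```

The trailing On at 23:30 has no following Off, so it is counted neither as an event nor towards the duration.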

Efficiently label each row in a large dataframe

I have a file containing dates from June 2015 + 365 days. I am using this as a lookup table because there are custom business dates (only certain holidays are observed and there are some no-work dates because of internal reasons). Using the custom business date offsets was just so slow with 3.5 million records.
initial_date | day_1 | day_2 | day_3 | ... | day_365
2015-06-01 2015-06-02 2015-06-03 2015-06-04
2015-06-02 2015-06-03 2015-06-04 2015-06-05
The idea is to 'tag' each row in the data based on the number of (custom) business dates since specific entries. Is there a better way to do this?
For example, if a new entry happens on 2016-06-28 then this is labeled as 'initial_date'. day_1 is tomorrow, day_2 is the next day, day_3 is Friday, and day_4 is next Monday.
My naive way of doing this is to create a loop which basically does this:
df.day_1_label = np.where(df.date == df.day_1, 'Day 1', '')
df.day_2_label = np.where(df.date == df.day_2, 'Day 2', '')
df.day_label = (df.day_1_label + df.day_2_label + ...).replace('', regex=True, inplace=True)
This would result in one label per row which I could then aggregate or plot. Eventually this would be used for forecasting. initial_date + the subset of customers from previous dates = total customers
Also, depending on what events happen, a subset of the customers would have a certain event occur in the future. I want to know on what business date from the initial_date this happens on, on average. So if we have 100 customers today, a certain percent will have an event next July 15th or whatever it might be.
Edit- pastebin with some sample data:
http://pastebin.com/ZuE1Q2KJ
So the output I am looking for is the day_label. This basically checks if the date == each date from day_0 - day_n. Is there a better way to fill that in? I am just trying to match the date of each row with a value in one of the day_ columns.
I think this can be done more efficiently if your data looks the way I think it does.
Say you have a calendar of dates (June 2015 + 365 days) and a data frame as for example:
cal = ['2015-06-01', '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-08',
       '2015-06-09']
df = pd.DataFrame({'date': ['2015-06-04', '2015-06-09', '2015-06-09'],
                   'initial_date': ['2015-06-02', '2015-06-02', '2015-06-01']})
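Keeping dates in numpy.datetime64 types is more efficient, so I convert (recent pandas versions reject the unit-less np.datetime64, so the unit is given explicitly):
df['date'] = df['date'].astype('datetime64[ns]')
df['initial_date'] = df['initial_date'].astype('datetime64[ns]')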
Now, let's turn cal into a fast lookup table:
# converting types:
cal = np.asarray(cal, dtype=np.datetime64)
# lookup series
s_cal = pd.Series(range(len(cal)), index=cal)
# a convenience lookup function
def lookup(dates):
    return s_cal[dates].reset_index(drop=True)
EDIT:
The above lookup returns a series with a RangeIndex, which may be different from the index of df (this causes problems when assigning the result to a column). So it's probably better to rewrite this lookup so that it returns a plain numpy array:
def lookup2(dates):
    return s_cal[dates].values
It's possible to set the correct index in lookup (taken from the input dates). But such a solution would be less flexible (would require that the input be a Series) or unnecessarily complicated.
The series:
s_cal
Out[220]:
2015-06-01 0
2015-06-02 1
2015-06-03 2
2015-06-04 3
2015-06-08 4
2015-06-09 5
dtype: int64
and how it works:
lookup(df.date)
Out[221]:
0 3
1 5
2 5
dtype: int64
lookup2(df.date)
Out[36]: array([3, 5, 5])
The rest is straightforward. This adds a column of integers (day differences) to the data frame:
df['day_label'] = lookup2(df.date) - lookup2(df.initial_date)
and if you want to convert it to labels:
df['day_label'] = 'day ' + df['day_label'].astype(str)
df
Out[225]:
date initial_date day_label
0 2015-06-04 2015-06-02 day 2
1 2015-06-09 2015-06-02 day 4
2 2015-06-09 2015-06-01 day 5
Hope this is what you wanted.
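As a variation on the lookup above (not the method used in this answer), the calendar positions can also be obtained with DatetimeIndex.get_indexer, which returns a plain integer array and marks dates missing from the calendar with -1 instead of raising. The column name day_offset is an illustrative choice; the data is the same small example.

```python
import pandas as pd

# Calendar of valid business dates (same example as above).
cal = pd.DatetimeIndex(["2015-06-01", "2015-06-02", "2015-06-03",
                        "2015-06-04", "2015-06-08", "2015-06-09"])

df = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-04", "2015-06-09", "2015-06-09"]),
    "initial_date": pd.to_datetime(["2015-06-02", "2015-06-02", "2015-06-01"]),
})

# Position of each date in the calendar; -1 means the date is not a calendar day.
pos_date = cal.get_indexer(df["date"])
pos_initial = cal.get_indexer(df["initial_date"])

# Number of calendar (business) days between initial_date and date.
df["day_offset"] = pos_date - pos_initial
df["day_label"] = "day " + df["day_offset"].astype(str)
print(df)
```

Because -1 is an ordinary integer, rows with a date outside the calendar should be checked for (for example with (pos_date == -1).any()) before the offsets are trusted.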