I am aggregating some data by date.
for dt, group in df.groupby(df.timestamp.dt.date):
    # do stuff
Now, I would like to do the same, but without using midnight as the day boundary. I would still like to use groupby, but with bins running e.g. from 6AM to 6AM. Is there any better solution than a dummy column?
Unfortunately, resample, as discussed in
Resample daily pandas timeseries with start at time other than midnight
Resample hourly TimeSeries with certain starting hour
does not work, as I do not want to apply any resampling/aggregation function.
You can, for example, subtract the offset before grouping:
for dt, group in df.groupby(df.timestamp.sub(pd.to_timedelta('6H')).dt.date):
    # do stuff
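As a self-contained sketch of the offset-subtraction idea (the data here is made up, one row every 4 hours; any timestamps between 06:00 and 06:00 the next day end up in the same group):

```python
import pandas as pd

# Hypothetical example data: one row every 4 hours over three days.
df = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', freq='4h', periods=18)})

# Shift each timestamp back 6 hours, then take the calendar date:
# rows from 06:00 up to (but not including) 06:00 the next day share a bin.
bins = df.timestamp.sub(pd.to_timedelta('6h')).dt.date

for dt, group in df.groupby(bins):
    print(dt, group.timestamp.min(), group.timestamp.max())
```

Note that the bin label is the date the shifted timestamp falls on, so a row at 2021-01-01 04:00 lands in the 2020-12-31 bin.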
There's a base argument for resample and pd.Grouper that is meant to handle this situation (in newer pandas versions, base is deprecated in favor of the offset and origin arguments). There are many ways to spell it; pick whichever you find clearest:
'1D' frequency with base=0.25
'24h' frequency with base=6
'1440min' frequency with base=360
Code
df = pd.DataFrame({'timestamp': pd.date_range('2010-01-01', freq='10min', periods=200)})
df.resample(on='timestamp', rule='1D', base=0.25).timestamp.agg(['min', 'max'])
#df.resample(on='timestamp', rule='24h', base=6).timestamp.agg(['min', 'max'])
#df.resample(on='timestamp', rule=f'{60*24}min', base=60*6).timestamp.agg(['min', 'max'])
min max
timestamp
2009-12-31 06:00:00 2010-01-01 00:00:00 2010-01-01 05:50:00 #[Dec31 6AM - Jan1 6AM)
2010-01-01 06:00:00 2010-01-01 06:00:00 2010-01-02 05:50:00 #[Jan1 6AM - Jan2 6AM)
2010-01-02 06:00:00 2010-01-02 06:00:00 2010-01-02 09:10:00 #[Jan2 6AM - Jan3 6AM)
For completeness, resample is a convenience method and is equivalent to a groupby with pd.Grouper. If for some reason you absolutely cannot use resample, you could do:
for dt, gp in df.groupby(pd.Grouper(key='timestamp', freq='24h', base=6)):
    ...
which is equivalent to
for dt, gp in df.resample(on='timestamp', rule='24h', base=6):
    ...
I have data for the period from December 2013 to November 2018. I converted it into a data frame as shown here.
Date 0.1 0.2 0.3 0.4 0.5 0.6
2013-12-01 301.04 297.4 296.63 295.76 295.25 295.25
2013-12-04 297.96 297.15 296.25 295.25 294.43 293.45
2013-12-05 298.4 297.61 296.65 295.81 294.75 293.89
2013-12-08 298.82 297.95 297.15 296.25 295.45 294.41
2013-12-09 298.65 297.65 296.95 296.02 295.13 294.05
2013-12-12 299.05 297.33 296.65 295.81 294.85 293.85
2013-12-16 301.05 300.28 299.38 298.45 297.65 296.51
....
2014-01-10 301.65 297.45 296.46 295.52 294.65 293.56
2014-01-11 301.99 298.95 298.39 297.15 296.05 295.11
2014-01-12 299.86 298.65 297.73 296.82 296.35 295.37
2014-01-13 299.25 298.15 297.3 296.43 295.26 294.31
I want to take monthly mean and seasonal mean of this data.
For monthly mean I have tried
df.resample('M').mean()
And it worked well.
For seasons, I would like to decompose this data into 4 three-month seasons (December-February; March-May; June-August; and September-November). I tried resample with a 3-month interval, i.e.
df.resample('3M').mean()
However, this did not work well: it averages the starting December month separately and then uses calendar-year intervals (i.e., January to March, and so on).
I would like to know whether there is a way to avoid this by specifying in which month the period of consideration begins.
Moreover, I would also like to know whether these seasons can be defined beforehand so the data can be grouped accordingly to compute the averages more easily.
You can define the origin in resample:
df.resample('M', origin=pd.Timestamp('2013-12-01')).mean()
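For the second part of the question, here is a sketch of two options, assuming a DataFrame with a DatetimeIndex (the data below is illustrative): a quarterly resample anchored to December, and a hand-defined season mapping that pools the same season across years.

```python
import pandas as pd
import numpy as np

# Hypothetical daily data covering one full seasonal year (Dec 2013 - Nov 2014).
idx = pd.date_range('2013-12-01', '2014-11-30', freq='D')
df = pd.DataFrame({'value': np.arange(len(idx), dtype=float)}, index=idx)

# Quarterly means with quarters anchored to December:
# Dec-Feb, Mar-May, Jun-Aug, Sep-Nov.
seasonal = df.resample('QS-DEC').mean()

# Or label each row with a named season and group by it, which pools
# the same season across all years in the data.
season_of_month = {12: 'DJF', 1: 'DJF', 2: 'DJF',
                   3: 'MAM', 4: 'MAM', 5: 'MAM',
                   6: 'JJA', 7: 'JJA', 8: 'JJA',
                   9: 'SON', 10: 'SON', 11: 'SON'}
by_season = df.groupby(df.index.month.map(season_of_month.get)).mean()
```

The `'QS-DEC'` variant keeps each year's seasons separate, while the dictionary variant averages, say, all DJF months together regardless of year.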
The following SQL uses getdate() to select today's date and appends a random timestamp:
SELECT
    DATEADD(SECOND, RAND(CHECKSUM(NEWID())) * 86400, CONVERT(datetime, CONVERT(varchar(8), GETDATE(), 112)))
    AS DUEDATE
FROM [dbo].[Table] as t
producing
2021-02-03 21:11:21.000
2021-02-03 15:51:06.000
2021-02-03 14:08:24.000
2021-02-03 16:10:50.000
2021-02-03 02:56:00.000
This SQL uses getdate() to get today's date and subtracts the value of t.daysago (e.g., 1, 2, 3, 4, 5) to produce a date in the past (e.g., today's date - 5 days):
select
(Dateadd(day, -t.daysago, Getdate()))
FROM [dbo].[Table] as t
which produces these descending dates, but with the same time of day in every row:
2021-02-02 12:38:09.133
2021-02-01 12:38:09.133
2021-01-30 12:38:09.133
2021-01-29 12:38:09.133
2021-01-28 12:38:09.133
I need to vary the time stamps so the data in my demo dashboard looks realistic.
I am trying to combine the two approaches but am having trouble. I want to use getdate() to produce today's date, subtract the value in t.daysago from today's date, and then randomize the timestamp.
If today's date and current time were 2021-02-03 22:11:31.000, I'd like to produce the following (by subtracting the values in t.daysago (1, 2, 3, 4, 5)):
2021-02-02 22:11:31.000
2021-02-01 15:51:06.000
2021-01-30 14:08:24.000
2021-01-29 16:10:50.000
2021-01-28 02:56:00.000
I can't seem to figure out how to combine the approaches to get the desired output. Any suggestions?
You can still use the above logic with some changes; you can choose SECOND, or adjust the number parameter of DATEADD, for a tighter or wider range:
SELECT
DATEADD(MILLISECOND ,CHECKSUM(NEWID()) ,GETDATE()) AS DUEDATE
FROM [dbo].[Table] as t
For example, if you want a controlled range, such as dates within a 5-day range of a specific date, you can declare the range as a variable (and declare a @startdate to use in place of GETDATE() if you want a fixed anchor):
DECLARE @dayrange int = 5
SELECT
    DATEADD(SECOND, RAND(CHECKSUM(NEWID())) * 86400, CONVERT(datetime, CONVERT(varchar(8), DATEADD(DAY, -t.daysago, GETDATE()), 112)))
    AS DUEDATE
FROM [dbo].[Table] as t
You can also use rand() to add some random time, like below:
select
(Dateadd(day, -t.daysago, dateadd(minute,round(rand()*rand()*25,0),Getdate())))
FROM [dbo].[Table] as t
I have 3 columns, 'customer_state', 'call_date' and 'call_time', in my dataframe, and I want to create 3 new columns: 'customer_timezone', 'customer_date' and 'customer_time'.
Possible values for timezone are
Eastern Standard Time (EST)
Central Standard Time (CST)
Mountain Standard Time (MST)
Pacific Standard Time (PST)
Note: call_time is in Mountain Standard Time and in 24-hour format.
My dataframe looks like below :
call_date call_time customer_state
2019-11-01 13:46 MD
What my resulting dataframe should look like:
call_date call_time customer_state customer_timezone customer_date customer_time
2019-11-01 13:46 MD EST 2019-11-01 16:46
Any help is appreciated!
Additional note: to simplify this solution, my data only has 'call_time' between 6am and 4pm MST, so I don't have to worry about the date changing (for instance, if it is 9pm MST, it would be 12am the next day in EST). I do not have to worry about these edge cases; in fact, 'call_date' and 'customer_date' will always be the same in my scenario. I just need to add +3 hours to the time.
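A minimal sketch of one way to do this, assuming a hypothetical state-to-timezone dictionary and the fixed +3 hour MST-to-EST shift the question specifies (the offsets are illustrative and follow the question's convention; a real mapping would need all states and verified offsets):

```python
import pandas as pd

# Hypothetical mappings; extend to cover all states in your data.
state_to_tz = {'MD': 'EST', 'TX': 'CST', 'CO': 'MST', 'CA': 'PST'}
# Hour shift from MST per the question's convention (+3 for EST).
hours_from_mst = {'EST': 3, 'CST': 2, 'MST': 0, 'PST': -1}

df = pd.DataFrame({'call_date': ['2019-11-01'],
                   'call_time': ['13:46'],
                   'customer_state': ['MD']})

df['customer_timezone'] = df['customer_state'].map(state_to_tz)

# Combine date and time, apply the offset, then split back into columns.
ts = pd.to_datetime(df['call_date'] + ' ' + df['call_time'])
shifted = ts + pd.to_timedelta(df['customer_timezone'].map(hours_from_mst), unit='h')
df['customer_date'] = shifted.dt.strftime('%Y-%m-%d')
df['customer_time'] = shifted.dt.strftime('%H:%M')
```

Because the shift is applied to a full datetime rather than the bare time string, this would also handle day rollovers correctly if the 6am-4pm restriction ever stopped holding.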
I am trying to find the mean for each day and use that value for every hour of the year. The problem is that the last day only contains its first hour.
rng = pd.date_range('2011-01-01', '2011-12-31')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
days = ts.resample('D').mean()
day_mean_in_hours = days.asfreq('H', method='ffill')
day_mean_in_hours.tail(5)
2011-12-30 20:00:00 -0.606819
2011-12-30 21:00:00 -0.606819
2011-12-30 22:00:00 -0.606819
2011-12-30 23:00:00 -0.606819
2011-12-31 00:00:00 -2.086733
Freq: H, dtype: float64
Is there a nice way to change the frequency to hour and still get the full last day?
You could reindex the frame using an hourly DatetimeIndex that covers the last full day.
hourly_rng = pd.date_range('2011-01-01', '2012-01-01', freq='1H', closed='left')
day_mean_in_hours = day_mean_in_hours.reindex(hourly_rng, method='ffill')
See Resample a time series with the index of another time series for another example.
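Put together, a self-contained version might look like this (variable names are illustrative; the explicit end timestamp sidesteps the closed/inclusive argument, whose name has changed across pandas versions):

```python
import pandas as pd
import numpy as np

# Daily random data for one year, then its daily means (a no-op here,
# but kept to mirror the question's resample step).
rng = pd.date_range('2011-01-01', '2011-12-31', freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
day_mean = ts.resample('D').mean()

# Reindex onto an hourly index that runs through 23:00 of the last day,
# forward-filling each day's value into all 24 of its hours.
hourly_rng = pd.date_range('2011-01-01', '2011-12-31 23:00', freq='h')
hourly = day_mean.reindex(hourly_rng, method='ffill')
```

The last day now gets all 24 hourly rows instead of just midnight.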
I am working with an hourly time series (Date, Time (hr), P) and trying to calculate the proportion of the daily total 'Amount' for each hour. I know I can use Pandas' resample('D').sum() to calculate the daily sum of P (DailyP), but in the same step I would like to use the daily P to calculate the proportion of daily P in each hour (so, P/DailyP), ending up with an hourly time series (i.e., the same frequency as the original). I am not sure if this can even be called 'resampling' in Pandas terms.
This is probably apparent from my use of terminology, but I am an absolute newbie at Python or programming for that matter. If anyone can suggest a way to do this, I would really appreciate it.
Thanks!
A possible approach is to reindex the daily sums back to the original hourly index (reindex) and fill the values forward, so that every hour gets the value of that day's sum (ffill):
df.resample('D').sum().reindex(df.index).ffill()
You can then divide your original dataframe by this.
An example:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({'P' : np.random.rand(72)}, index=pd.date_range('2013-05-05', periods=72, freq='h'))
>>> df.resample('D').sum().reindex(df.index).ffill()
P
2013-05-05 00:00:00 14.049649
2013-05-05 01:00:00 14.049649
...
2013-05-05 22:00:00 14.049649
2013-05-05 23:00:00 14.049649
2013-05-06 00:00:00 13.483974
2013-05-06 01:00:00 13.483974
...
2013-05-06 23:00:00 13.483974
2013-05-07 00:00:00 12.693711
2013-05-07 01:00:00 12.693711
..
2013-05-07 22:00:00 12.693711
2013-05-07 23:00:00 12.693711
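The final division step the answer describes could then look like this (illustrative data, using the modern resample spelling):

```python
import pandas as pd
import numpy as np

# Hypothetical hourly series, 72 hours as in the example above.
df = pd.DataFrame({'P': np.random.rand(72)},
                  index=pd.date_range('2013-05-05', periods=72, freq='h'))

# Forward-fill each day's total onto its 24 hourly rows, then divide.
daily_total = df.resample('D').sum().reindex(df.index, method='ffill')
proportion = df / daily_total

# Each day's 24 proportions sum to 1.
print(proportion.resample('D').sum())
```

Because `daily_total` shares the original hourly index, the division aligns row by row and the result keeps the original hourly frequency.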