I am trying to compute the mean for each day and use that value for every hour of that day, across a whole year. The problem is that the last day only contains its first hour.
rng = pd.date_range('2011-01-01', '2011-12-31')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
days = ts.resample('D').mean()
day_mean_in_hours = days.asfreq('H', method='ffill')
day_mean_in_hours.tail(5)
2011-12-30 20:00:00 -0.606819
2011-12-30 21:00:00 -0.606819
2011-12-30 22:00:00 -0.606819
2011-12-30 23:00:00 -0.606819
2011-12-31 00:00:00 -2.086733
Freq: H, dtype: float64
Is there a nice way to change the frequency to hour and still get the full last day?
You could reindex the series using an hourly DatetimeIndex that covers the full last day:
hourly_rng = pd.date_range('2011-01-01', '2012-01-01', freq='1H', closed='left')
day_mean_in_hours = day_mean_in_hours.reindex(hourly_rng, method='ffill')
See Resample a time series with the index of another time series for another example.
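Putting the answer together, a minimal runnable sketch: since the daily series already holds one value per day, a single reindex with ffill does the whole job (inclusive='left' is the newer spelling of closed='left'):

```python
import numpy as np
import pandas as pd

# one value per day for all of 2011
rng = pd.date_range('2011-01-01', '2011-12-31')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

# hourly index that also covers every hour of the last day
hourly_rng = pd.date_range('2011-01-01', '2012-01-01', freq='h', inclusive='left')
day_mean_in_hours = ts.reindex(hourly_rng, method='ffill')
```

All 24 hours of 2011-12-31 now carry that day's value, instead of only midnight.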
Related
I have imported a time series that I resampled to monthly time steps; however, I would like to select only the March, April, and May months (months 3, 4, and 5) across all years.
Unfortunately this is not exactly reproducible data, since it comes from a particular text file, but is there a way to isolate just months 3, 4, and 5 of this time series?
# loading textfile
mjo = np.loadtxt('.../omi.1x.txt')
# setting up dates
dates = pd.date_range('1979-01', periods=mjo.shape[0], freq='D')
#resampling one of the columns to monthly data
MJO_amp = pd.Series(mjo[:,6], index=dates)
MJO_amp_month = MJO_amp.resample("M").mean()[:-27] #match to precipitation time series (ends feb 2019)
MJO_amp_month_normed = (MJO_amp_month - MJO_amp_month.mean())/MJO_amp_month.std()
MJO_amp_month_normed
1979-01-31 0.032398
1979-02-28 -0.718921
1979-03-31 0.999467
1979-04-30 -0.790618
1979-05-31 1.113730
...
2018-10-31 0.198834
2018-11-30 0.221942
2018-12-31 1.804934
2019-01-31 1.359485
2019-02-28 1.076308
Freq: M, Length: 482, dtype: float64
print(MJO_amp_month_normed['2018-10'])
2018-10-31 0.198834
Freq: M, dtype: float64
I was thinking something along the lines of this:
def is_amj(month):
    return (month >= 4) & (month <= 6)

seasonal_data = MJO_amp_month_normed.sel(time=is_amj(MJO_amp_month_normed))
but I think my issue is the textfile isn't exactly in pandas format and doesn't have column titles...
You can use the month attribute of pd.DatetimeIndex with isin like this:
df[df.index.month.isin([3,4,5])]
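For example, on a small made-up monthly series (the data here is purely for illustration):

```python
import pandas as pd

# twelve monthly values for 2018
s = pd.Series(range(12), index=pd.date_range('2018-01-01', periods=12, freq='MS'))

# keep only March, April, and May
mam = s[s.index.month.isin([3, 4, 5])]
```

This works on any Series or DataFrame with a DatetimeIndex, so no column titles are needed.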
Please excuse obvious errors - still in the learning process.
I am trying to do a simple time series plot of my data, which has a frequency of 15 minutes. The idea is to plot monthly means, starting by resampling the data every hour, including only those hourly means that have at least 1 observation in the interval. There are subsequent conditions for the daily and monthly means.
This is relatively simpler only if this error does not crop up- "None of [DatetimeIndex(['2016-01-01 05:00:00', '2016-01-01 05:15:00',\n....2016-12-31 16:15:00'],\n dtype='datetime64[ns]', length=103458, freq=None)] are in the [columns]"
This is my code:
#Original dataframe
Date value
0 1/1/2016 0:00 405.22
1 1/1/2016 0:15 418.56
Date object
value object
dtype: object
#Conversion of 'Date' to datetime and 'value' to numeric/float values.
df.Date = pd.to_datetime(df.Date,errors='coerce')
year=df.Date.dt.year
df['Year'] = df['Date'].map(lambda x: x.year )
df.value = pd.to_numeric(df.value,errors='coerce' )
Date datetime64[ns]
value float64
Year int64
dtype: object
Date value Year
0 2016-01-01 00:00:00 405.22 2016
1 2016-01-01 00:15:00 418.56 2016
df=df.set_index(Date)
diurnal1 = df[df['Date']].resample('h').mean().count()>=2
**(line of error)**
diurnal_mean_1 = diurnal1.mean()[diurnal1.count() >= 1]
(the code follows)
Any help in solving the error will be appreciated.
I think you want df=df.set_index('Date') (Date needs to be quoted, as a string). Also, once it is working, I would move the conversions into the constructor (e.g. parse_dates in read_csv) if possible.
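A sketch of the corrected pipeline, with a tiny made-up frame standing in for the real data (column names as in the question; the "at least 1 observation" rule is expressed with count()):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2016 0:00', '1/1/2016 0:15', '1/1/2016 1:30'],
                   'value': ['405.22', '418.56', '400.00']})
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df = df.set_index('Date')  # note the quotes around 'Date'

# hourly mean, kept only where the hour has at least one observation
counts = df['value'].resample('h').count()
diurnal1 = df['value'].resample('h').mean()[counts >= 1]
```

With the index set, resample works directly on the frame; there is no need to index the frame with its own Date column.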
I am aggregating some data by date.
for dt, group in df.groupby(df.timestamp.dt.date):
    # do stuff
Now, I would like to do the same, but without using midnight as time offset.
Still, I would like to use groupby, but e.g. in 6AM-6AM bins.
Is there any better solution than a dummy column?
Unfortunately, resample as discussed in
Resample daily pandas timeseries with start at time other than midnight
Resample hourly TimeSeries with certain starting hour
does not work here, since I want to iterate over the groups rather than apply a resampling/aggregation function.
You can, for example, subtract the offset before grouping:
for dt, group in df.groupby(df.timestamp.sub(pd.to_timedelta('6H')).dt.date):
    # do stuff
There's a base argument for resample or pd.Grouper that is meant to handle this situation. There are several equivalent spellings; pick whichever you find clearest.
'1D' frequency with base=0.25
'24h' frequency with base=6
'1440min' frequency with base=360
Code
df = pd.DataFrame({'timestamp': pd.date_range('2010-01-01', freq='10min', periods=200)})
df.resample(on='timestamp', rule='1D', base=0.25).timestamp.agg(['min', 'max'])
#df.resample(on='timestamp', rule='24h', base=6).timestamp.agg(['min', 'max'])
#df.resample(on='timestamp', rule=f'{60*24}min', base=60*6).timestamp.agg(['min', 'max'])
min max
timestamp
2009-12-31 06:00:00 2010-01-01 00:00:00 2010-01-01 05:50:00 #[Dec31 6AM - Jan1 6AM)
2010-01-01 06:00:00 2010-01-01 06:00:00 2010-01-02 05:50:00 #[Jan1 6AM - Jan2 6AM)
2010-01-02 06:00:00 2010-01-02 06:00:00 2010-01-02 09:10:00 #[Jan2 6AM - Jan3 6AM)
For completeness, resample is a convenience method and is in all ways the same as groupby. If for some reason you absolutely cannot use resample you could do:
for dt, gp in df.groupby(pd.Grouper(key='timestamp', freq='24h', base=6)):
    ...
which is equivalent to
for dt, gp in df.resample(on='timestamp', rule='24h', base=6):
    ...
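Note that base was deprecated in pandas 1.1; in newer versions the same binning is spelled with the offset (or origin) argument. A sketch on the same example data:

```python
import pandas as pd

df = pd.DataFrame({'timestamp': pd.date_range('2010-01-01', freq='10min', periods=200)})

# 24h bins shifted to start at 6AM instead of midnight
out = df.resample(on='timestamp', rule='24h', offset='6h').timestamp.agg(['min', 'max'])
```

The resulting bin edges are identical to those produced with base=6 above.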
I am stuck with this problem. Although I found some similar questions, I could not manage to apply the solutions to my case.
I have a small series containing the start and end dates of an experimental deployment. My goal is to get the starting day (Monday 00h 00min) of the week in which the deployment started, and the same for the last week.
This is my series:
Input
print(df_startend)
Output
Camera_Deployment_Start 2015-09-28 11:00:00
Camera_Deployment_End 2017-12-25 16:40:00
dtype: datetime64[ns]
I thought I could first get the week number and then go back to a datetime object, which would represent the very start of the week. So I did this:
df_startend=df_startend.apply(lambda x: x.isocalendar())
Input
print(df_startend)
Output
Camera_Deployment_Start (2015, 40, 1)
Camera_Deployment_End (2017, 52, 1)
dtype: object
It is worth saying that I can ignore the third element of the tuple (tuple[2]). In this example both happen to be 1 (the first day of the week), but that may not be the case with other data samples.
And from here on I cannot manage.
My ultimate goal is to generate all the start days of all the weeks in between. Probably using something like:
ws=pd.date_range(start=,end=,freq='W')
Your attention is very appreciated, thank you very much!
If it is only a 2-element Series, first subtract the number of days given by dayofweek, then use floor to remove the times, and then build a date_range with the W-MON offset:
print (df_startend)
Camera_Deployment_Start 2015-09-28 11:00:00
Camera_Deployment_End 2015-12-25 16:40:00
dtype: datetime64[ns]
s = (df_startend - pd.to_timedelta(df_startend.dt.dayofweek, unit='d')).dt.floor('d')
ws = pd.date_range(start=s['Camera_Deployment_Start'],
                   end=s['Camera_Deployment_End'],
                   freq='W-MON')
print (ws)
DatetimeIndex(['2015-09-28', '2015-10-05', '2015-10-12', '2015-10-19',
'2015-10-26', '2015-11-02', '2015-11-09', '2015-11-16',
'2015-11-23', '2015-11-30', '2015-12-07', '2015-12-14',
'2015-12-21'],
dtype='datetime64[ns]', freq='W-MON')
Detail:
print (s)
Camera_Deployment_Start 2015-09-28
Camera_Deployment_End 2015-12-21
dtype: datetime64[ns]
Solution with isocalendar:
s = df_startend.apply(lambda x: '-'.join(str(y) for y in x.isocalendar()[:2]))
s = pd.to_datetime(s + '-1', format='%Y-%W-%w') - pd.Timedelta(7, 'd')
print (s)
Camera_Deployment_Start 2015-09-28
Camera_Deployment_End 2015-12-21
dtype: datetime64[ns]
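As an alternative (not in the original answer), weekly Periods can do the same rounding: the default 'W' period runs Monday through Sunday, so start_time gives the Monday 00:00 of each timestamp's week.

```python
import pandas as pd

df_startend = pd.Series({'Camera_Deployment_Start': pd.Timestamp('2015-09-28 11:00:00'),
                         'Camera_Deployment_End': pd.Timestamp('2015-12-25 16:40:00')})

# snap each timestamp to the Monday that starts its week
s = df_startend.dt.to_period('W').dt.start_time

# all week-start Mondays between the two
ws = pd.date_range(start=s['Camera_Deployment_Start'],
                   end=s['Camera_Deployment_End'],
                   freq='W-MON')
```

This avoids the explicit dayofweek arithmetic while producing the same DatetimeIndex.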
I am working with an hourly time series (Date, Time (hr), P) and trying to calculate the proportion of the daily total 'Amount' for each hour. I know I can use Pandas' resample('D', how='sum') to calculate the daily sum of P (DailyP), but in the same step I would like to use that daily sum to calculate the proportion of daily P in each hour (so, P/DailyP), ending up with an hourly time series (i.e., the same frequency as the original). I am not sure this can even be called 'resampling' in Pandas terms.
This is probably apparent from my use of terminology, but I am an absolute newbie at Python or programming for that matter. If anyone can suggest a way to do this, I would really appreciate it.
Thanks!
A possible approach is to reindex the daily sums back to the original hourly index (reindex) and filling the values forward (so that every hour gets the value of the sum of that day, fillna):
df.resample('D', how='sum').reindex(df.index).fillna(method="ffill")
And this you can use to divide your original dataframe with.
An example:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame({'P' : np.random.rand(72)}, index=pd.date_range('2013-05-05', periods=72, freq='h'))
>>> df.resample('D', 'sum').reindex(df.index).fillna(method="pad")
P
2013-05-05 00:00:00 14.049649
2013-05-05 01:00:00 14.049649
...
2013-05-05 22:00:00 14.049649
2013-05-05 23:00:00 14.049649
2013-05-06 00:00:00 13.483974
2013-05-06 01:00:00 13.483974
...
2013-05-06 23:00:00 13.483974
2013-05-07 00:00:00 12.693711
2013-05-07 01:00:00 12.693711
..
2013-05-07 22:00:00 12.693711
2013-05-07 23:00:00 12.693711
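The same division can be done in one step with groupby/transform, which broadcasts each day's sum back onto the hourly index (a sketch on random data, mirroring the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': np.random.rand(72)},
                  index=pd.date_range('2013-05-05', periods=72, freq='h'))

# daily sum aligned back to the hourly index
daily_sum = df['P'].groupby(df.index.date).transform('sum')

# each hour's share of its day's total
prop = df['P'] / daily_sum
```

Each day's 24 proportions sum to 1, and the result keeps the original hourly frequency.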