I have a five minute dataframe:
import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2011', periods=60, freq='5Min')
df = pd.DataFrame(np.random.randn(60, 4), index=rng, columns=['A', 'B', 'C', 'D'])
A B C D
2011-01-01 00:00:00 1.287045 -0.621473 0.482130 1.886648
2011-01-01 00:05:00 0.402645 -1.335942 -0.609894 -0.589782
2011-01-01 00:10:00 -0.311789 0.342995 -0.875089 -0.781499
2011-01-01 00:15:00 1.970683 0.471876 1.042425 -0.128274
2011-01-01 00:20:00 -1.900357 -0.718225 -3.168920 -0.355735
2011-01-01 00:25:00 1.128843 -0.097980 1.130860 -1.045019
2011-01-01 00:30:00 -0.261523 0.379652 -0.385604 -0.910902
I would like to resample only the data on the 15 minute interval, but without aggregating into a statistic (I don't want the mean, median, or stdev). I want to subsample and get the actual data at each 15 minute mark. Is there a built-in method to do this?
My output would be:
A B C D
2011-01-01 00:00:00 1.287045 -0.621473 0.482130 1.886648
2011-01-01 00:15:00 1.970683 0.471876 1.042425 -0.128274
2011-01-01 00:30:00 -0.261523 0.379652 -0.385604 -0.910902
You can resample to 15 min and take the 'first' of each group:
In [40]: df.resample('15min').first()
Out[40]:
A B C D
2011-01-01 00:00:00 -0.415637 -1.345454 1.151189 -0.834548
2011-01-01 00:15:00 0.221777 -0.866306 0.932487 -1.243176
2011-01-01 00:30:00 -0.690039 0.778672 -0.527087 -0.156369
...
Another way to do this is to construct the desired index and reindex to it (this is a bit more work in this case, but for an irregular time series it ensures the data is taken at exactly each 15-minute mark):
In [42]: new_rng = pd.date_range('1/1/2011', periods=20, freq='15min')
In [43]: df.reindex(new_rng)
Out[43]:
A B C D
2011-01-01 00:00:00 -0.415637 -1.345454 1.151189 -0.834548
2011-01-01 00:15:00 0.221777 -0.866306 0.932487 -1.243176
2011-01-01 00:30:00 -0.690039 0.778672 -0.527087 -0.156369
...
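To make the difference on an irregular series concrete, here is a minimal sketch. The four timestamps and values are made up for illustration and are not from the question. With resample, each 15-minute bin returns its first observation even if that observation is not on the boundary, while reindex only returns rows that sit exactly on the requested timestamps and leaves NaN elsewhere.

```python
import pandas as pd

# Hypothetical irregular series: readings at 00:00, 00:07, 00:15 and 00:31.
idx = pd.to_datetime([
    "2011-01-01 00:00", "2011-01-01 00:07",
    "2011-01-01 00:15", "2011-01-01 00:31",
])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# First observation inside each 15-minute bin; the 00:30 bin picks up
# the 00:31 reading because it is the first value in that bin.
by_first = s.resample("15min").first()

# Exact 15-minute marks only; there is no reading at 00:30, so that
# position is NaN.
new_rng = pd.date_range("2011-01-01", periods=3, freq="15min")
by_reindex = s.reindex(new_rng)

print(by_first)
print(by_reindex)
```

Which behaviour is preferable depends on whether a reading slightly off the mark is acceptable for your data.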
The asfreq() method doesn't do any aggregation:
df.asfreq('15min')
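As a quick check, the sketch below rebuilds a frame shaped like the one in the question (with a seeded generator, which is my own choice so the run is repeatable) and compares asfreq against the resample approach and against plain positional slicing. On a regular 5-minute index all three should select the same rows.

```python
import numpy as np
import pandas as pd

# Regular 5-minute frame, 60 rows, as in the question (seed is an assumption).
rng = pd.date_range("2011-01-01", periods=60, freq="5min")
values = np.random.default_rng(0).standard_normal((60, 4))
df = pd.DataFrame(values, index=rng, columns=["A", "B", "C", "D"])

sub_asfreq = df.asfreq("15min")            # pick rows at the new frequency
sub_first = df.resample("15min").first()   # first row of each 15-minute bin
sub_slice = df.iloc[::3]                   # every third row by position

same_as_slice = np.array_equal(sub_asfreq.to_numpy(), sub_slice.to_numpy())
same_as_first = np.array_equal(sub_asfreq.to_numpy(), sub_first.to_numpy())
print(same_as_slice, same_as_first)
```

The positional slice only matches because the index is perfectly regular and starts on a 15-minute mark; asfreq does not rely on that.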
Suppose I have a pandas Series with hourly observations:
pd_series = pd.Series(np.random.rand(26281), index = pd.date_range('2022-01-01', '2024-12-31', freq = 'H'))
pd_series
2022-01-01 00:00:00 0.933746
2022-01-01 01:00:00 0.588907
2022-01-01 02:00:00 0.229040
2022-01-01 03:00:00 0.557752
2022-01-01 04:00:00 0.798649
2024-12-30 20:00:00 0.314143
2024-12-30 21:00:00 0.670485
2024-12-30 22:00:00 0.300531
2024-12-30 23:00:00 0.075403
2024-12-31 00:00:00 0.716685
What I want is to replace every observation by the monthly average. I know that the average can be calculated as
pd_series.resample('MS').mean()
But how do I assign each monthly average back to the observations in that month?
Use Resampler.transform:
print(pd_series.resample('MS').transform('mean'))
2022-01-01 00:00:00 0.495015
2022-01-01 01:00:00 0.495015
2022-01-01 02:00:00 0.495015
2022-01-01 03:00:00 0.495015
2022-01-01 04:00:00 0.495015
2024-12-30 20:00:00 0.508646
2024-12-30 21:00:00 0.508646
2024-12-30 22:00:00 0.508646
2024-12-30 23:00:00 0.508646
2024-12-31 00:00:00 0.508646
Freq: H, Length: 26281, dtype: float64
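A smaller sketch of the same idea, using a made-up daily series whose monthly means are easy to verify by hand. It also shows a groupby on the month period as an equivalent route; that alternative is my addition, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical series: the values 0..89 on 90 consecutive days of 2022.
idx = pd.date_range("2022-01-01", periods=90, freq="D")
s = pd.Series(np.arange(90, dtype=float), index=idx)

# One value per month (3 rows).
monthly = s.resample("MS").mean()

# Same monthly means, broadcast back onto the original 90 timestamps.
broadcast = s.resample("MS").transform("mean")

# Equivalent result via groupby on the month period of each timestamp.
via_groupby = s.groupby(s.index.to_period("M")).transform("mean")

print(monthly)
print(broadcast.head())
```

The transformed series keeps the original index, so it can be assigned straight back as a new column or subtracted from the original to get deviations from the monthly mean.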
I have two columns of data in a pandas DataFrame that looks like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hours, but the DataFrame covers one full year (8784 x 2).
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column, and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. I'm assuming the grouping is just by day (24 hours) to produce consecutive daily averages, but the three approaches above are clearly picking up other data in my 8784 x 2 DataFrame. Thank you kindly for your help.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
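For reference, here is a self-contained version of that check using the 24 values posted in the question. It compares the grouped mean with a plain hand computation and counts how many rows fall on the date. The row count is a diagnostic I am suggesting, on the assumption that a different result on the full data would come from additional rows landing on the same calendar day.

```python
import pandas as pd

# The 24 hourly values for 2016-01-01 from the question.
values = [
    8.112838, 7.977169, 8.420204, 9.515370, 9.222840, 8.872423,
    8.776145, 9.030668, 8.394983, 8.092915, 8.946967, 9.620883,
    9.535951, 8.861761, 9.077692, 9.116074, 8.724343, 8.916940,
    8.920438, 8.926278, 8.817666, 8.704014, 8.496358, 8.434297,
]
df2 = pd.DataFrame({
    "BAFFIN BAY": values,
    "DateTime": [f"2016-01-01 {h:02d}:00:00" for h in range(24)],
})

# Make sure the column really is datetime64 before grouping on it.
df2["DateTime"] = pd.to_datetime(df2["DateTime"])

daily = df2.groupby(pd.Grouper(freq="D", key="DateTime"))["BAFFIN BAY"].mean()
manual = sum(values) / len(values)

# Diagnostic: how many rows contribute to that day (24 expected here).
rows_on_day = int((df2["DateTime"].dt.normalize() == "2016-01-01").sum())

print(daily)
print(manual, rows_on_day)
```

If the same count on the full DataFrame is larger than 24, the extra rows are what pull the daily mean away from the hand-computed value.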
My data looks like this; it is minute-based data covering 2 years.
2017-04-02 00:00:00
2017-04-02 00:01:00
2017-04-02 00:02:00
2017-04-02 00:03:00
2017-04-02 00:04:00
....
2017-04-02 23:59:00
...
2019-02-01 22:54:00
2019-02-01 22:55:00
2019-02-01 22:56:00
2019-02-01 22:57:00
2019-02-01 22:58:00
2019-02-01 22:59:00
2019-02-01 23:00:00
I want to access all the rows between the end of the workday and the beginning of the next, for example between 2018-04-02 18:00:00 and 2018-04-03 05:00:00, for every day in my DataFrame. Please help.
If you use a DatetimeIndex, then you can use .between_time:
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2017-04-02', freq='90min', periods=100)})
df = df.set_index('date')
df.between_time('18:00', '5:00')
#date
#2017-04-02 00:00:00
#2017-04-02 01:30:00
#2017-04-02 03:00:00
#2017-04-02 04:30:00
#2017-04-02 18:00:00
#2017-04-02 19:30:00
#2017-04-02 21:00:00
#2017-04-02 22:30:00
#....
One approach is boolean indexing on the datetime column or index. Assuming your DataFrame is named df and has a DatetimeIndex like the example data you posted, the following keeps rows whose hour is 18 or later, or 5 or earlier (note that this also includes 05:01 through 05:59):
df[(df.index.hour >= 18) | (df.index.hour <= 5)]
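The two answers do not select quite the same rows on minute data, which the sketch below illustrates. It assumes a two-day, one-minute index and a dummy column named value, both made up for the example. between_time wraps around midnight when the start time is later than the end time and, by default, includes both endpoints; a mask built from minutes since midnight reproduces that window exactly, while the hour-only mask is wider.

```python
import pandas as pd

# Hypothetical minute-level frame covering two full days.
idx = pd.date_range("2017-04-02", periods=2 * 24 * 60, freq="min")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Overnight window: 18:00 through 05:00 of the next day.
overnight = df.between_time("18:00", "05:00")

# Same window as an explicit mask on minutes since midnight.
minutes = df.index.hour * 60 + df.index.minute
exact_mask = (minutes >= 18 * 60) | (minutes <= 5 * 60)
overnight_mask = df[exact_mask]

# Hour-only mask from the answer above: also keeps 05:01 through 05:59.
overnight_hour = df[(df.index.hour >= 18) | (df.index.hour <= 5)]

print(len(overnight), len(overnight_mask), len(overnight_hour))
```

If the window should stop strictly before 05:00, adjust the comparison in the mask; the choice of inclusive endpoints here simply mirrors the default of between_time.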
I have the following dataframe:
start = ['31/12/2011 01:00','31/12/2011 01:00','31/12/2011 01:00','01/01/2013 08:00','31/12/2012 20:00']
end = ['02/01/2013 01:00','02/01/2014 01:00','02/01/2014 01:00','01/01/2013 14:00','01/01/2013 04:00']
df = pd.DataFrame({'start':start,'end':end})
df['start'] = pd.to_datetime(df['start'],format='%d/%m/%Y %H:%M')
df['end'] = pd.to_datetime(df['end'],format='%d/%m/%Y %H:%M')
print(df)
end start
0 2013-01-02 01:00:00 2011-12-31 01:00:00
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
3 2013-01-01 14:00:00 2013-01-01 08:00:00
4 2013-01-01 04:00:00 2012-12-31 20:00:00
I am trying to compare df.end and df.start to two given dates, year_start and year_end:
year_start = pd.to_datetime(2013,format='%Y')
year_end = pd.to_datetime(2013+1,format='%Y')
print(year_start)
print(year_end)
2013-01-01 00:00:00
2014-01-01 00:00:00
But I can't get my comparison to work (see the comparison in conditions):
conditions = [(df['start'].any()< year_start) and (df['end'].any()> year_end)]
choices = [8760]
df['test'] = np.select(conditions, choices, default=0)
I also tried to define year_end and year_start as follows but it does not work either:
year_start = np.datetime64(pd.to_datetime(2013,format='%Y'))
year_end = np.datetime64(pd.to_datetime(2013+1,format='%Y'))
Any idea on how I could make it work?
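Compare the columns element-wise and combine the two masks with & rather than and (.any() collapses each column to a single boolean, so the comparison against a date no longer happens row by row). Try this: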
In [797]: df[(df['start']< year_start) & (df['end']> year_end)]
Out[797]:
end start
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
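The same element-wise mask can be passed to np.select to fill the test column the question was after. This is a sketch of that step using the question's data; building year_start and year_end with pd.Timestamp is my own substitution for the pd.to_datetime(2013, format='%Y') calls.

```python
import numpy as np
import pandas as pd

start = ['31/12/2011 01:00', '31/12/2011 01:00', '31/12/2011 01:00',
         '01/01/2013 08:00', '31/12/2012 20:00']
end = ['02/01/2013 01:00', '02/01/2014 01:00', '02/01/2014 01:00',
       '01/01/2013 14:00', '01/01/2013 04:00']
df = pd.DataFrame({'start': start, 'end': end})
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y %H:%M')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y %H:%M')

year_start = pd.Timestamp('2013-01-01')
year_end = pd.Timestamp('2014-01-01')

# Element-wise comparison; parentheses are required because & binds
# more tightly than < and >.
spans_whole_year = (df['start'] < year_start) & (df['end'] > year_end)

# 8760 where the interval covers the whole year, 0 otherwise.
df['test'] = np.select([spans_whole_year], [8760], default=0)
print(df)
```

With a single condition, np.where(spans_whole_year, 8760, 0) would do the same job; np.select is kept here because it extends naturally to several conditions.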
I'm using pandas 0.12.0. I have a DataFrame that looks like:
date ms
0 2013-06-03 00:10:00 75.846318
1 2013-06-03 00:20:00 78.408277
2 2013-06-03 00:30:00 75.807990
3 2013-06-03 00:40:00 70.509438
4 2013-06-03 00:50:00 71.537499
I want to generate a third column, "tod", which contains just the time portion of the date (i.e. call .time() on each value). I'm somewhat of a pandas newbie, so I suspect this is trivial but I'm just not seeing how to do it.
Just apply the Timestamp time method to items in the date column:
In [11]: df['date'].apply(lambda x: x.time())
# equivalently .apply(pd.Timestamp.time)
Out[11]:
0 00:10:00
1 00:20:00
2 00:30:00
3 00:40:00
4 00:50:00
Name: date, dtype: object
In [12]: df['tod'] = df['date'].apply(lambda x: x.time())
This gives a column of datetime.time objects.
Using the method Andy created on Index is faster than apply:
In [93]: df = DataFrame(randn(5,1),columns=['A'])
In [94]: df['date'] = date_range('20130101 9:05',periods=5)
In [95]: df['time'] = Index(df['date']).time
In [96]: df
Out[96]:
A date time
0 0.053570 2013-01-01 09:05:00 09:05:00
1 -0.382155 2013-01-02 09:05:00 09:05:00
2 0.357984 2013-01-03 09:05:00 09:05:00
3 -0.718300 2013-01-04 09:05:00 09:05:00
4 0.531953 2013-01-05 09:05:00 09:05:00
In [97]: df.dtypes
Out[97]:
A float64
date datetime64[ns]
time object
dtype: object
In [98]: df['time'][0]
Out[98]: datetime.time(9, 5)
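On pandas versions newer than the 0.12.0 mentioned in the question, the .dt accessor on a datetime column offers the same thing without apply or an explicit Index. A minimal sketch, rebuilding the question's five rows:

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2013-06-03 00:10:00", "2013-06-03 00:20:00", "2013-06-03 00:30:00",
        "2013-06-03 00:40:00", "2013-06-03 00:50:00",
    ]),
    "ms": [75.846318, 78.408277, 75.807990, 70.509438, 71.537499],
})

# .dt.time returns the time-of-day part of each timestamp
# as datetime.time objects.
df["tod"] = df["date"].dt.time

is_time = isinstance(df["tod"].iloc[0], datetime.time)
print(df)
print(is_time)
```

As with the approaches above, the resulting column holds plain datetime.time objects, so vectorised datetime operations are not available on it.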