Pandas DateTime Calculating Daily Averages - pandas

I have two columns of data in a pandas DataFrame that looks like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This is the first 24 hours, but the DataFrame covers one full year (8784 x 2).
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column, and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. I'm assuming the grouping is by calendar day (24 hours) to produce consecutive daily averages, but the three approaches above are clearly picking up other data in my 8784 x 2 DataFrame. Thank you kindly for your help.

I just ran your data through this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
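If the daily mean still looks wrong on the full DataFrame, one thing worth checking is how many rows fall into each day. The sketch below uses a hypothetical stand-in for df2 (two days of synthetic hourly values, with DateTime stored as strings) and reports the row count next to the mean. A count other than 24 for a day would suggest duplicated or mis-parsed timestamps, though that is only one possible cause and not something the question confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2: 48 hourly readings, DateTime stored as text.
stamps = pd.date_range("2016-01-01", periods=48, freq="h")
df2 = pd.DataFrame({
    "BAFFIN BAY": np.arange(48, dtype=float),
    "DateTime": stamps.strftime("%Y-%m-%d %H:%M:%S"),
})

# Make sure the column is a real datetime64 column before grouping.
df2["DateTime"] = pd.to_datetime(df2["DateTime"])

# Report the mean together with the number of rows behind it.
daily = (
    df2.groupby(pd.Grouper(freq="D", key="DateTime"))["BAFFIN BAY"]
       .agg(["mean", "count"])
)
print(daily)
```

With clean hourly data every day should show a count of 24.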

Related

Pandas replace daily observations by monthly mean

Suppose I have a pandas Series with hourly observations:
pd_series = pd.Series(np.random.rand(26281), index = pd.date_range('2022-01-01', '2024-12-31', freq = 'H'))
pd_series
2022-01-01 00:00:00 0.933746
2022-01-01 01:00:00 0.588907
2022-01-01 02:00:00 0.229040
2022-01-01 03:00:00 0.557752
2022-01-01 04:00:00 0.798649
2024-12-30 20:00:00 0.314143
2024-12-30 21:00:00 0.670485
2024-12-30 22:00:00 0.300531
2024-12-30 23:00:00 0.075403
2024-12-31 00:00:00 0.716685
What I want is to replace every observation by the monthly average. I know that the average can be calculated as
pd_series.resample('MS').mean()
But how do I assign each monthly average back to the observations in that month?
Use Resampler.transform:
print (pd_series.resample('MS').transform('mean'))
2022-01-01 00:00:00 0.495015
2022-01-01 01:00:00 0.495015
2022-01-01 02:00:00 0.495015
2022-01-01 03:00:00 0.495015
2022-01-01 04:00:00 0.495015
2024-12-30 20:00:00 0.508646
2024-12-30 21:00:00 0.508646
2024-12-30 22:00:00 0.508646
2024-12-30 23:00:00 0.508646
2024-12-31 00:00:00 0.508646
Freq: H, Length: 26281, dtype: float64
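To see that transform keeps the original index while broadcasting each month's mean, here is a small sketch on synthetic daily data (the values and variable names are made up for illustration). It also shows a groupby on the month period, which should give the same result.

```python
import numpy as np
import pandas as pd

# Synthetic series: 31 days of January followed by 28 days of February 2022.
s = pd.Series(
    np.arange(59.0),
    index=pd.date_range("2022-01-01", periods=59, freq="D"),
)

# Every observation is replaced by the mean of its calendar month.
monthly_mean = s.resample("MS").transform("mean")

# Alternative spelling: group by the month period of each timestamp.
same_via_groupby = s.groupby(s.index.to_period("M")).transform("mean")

print(monthly_mean.iloc[[0, 30, 31, 58]])
```

January holds the values 0 to 30 and February 31 to 58, so the two monthly means are 15.0 and 44.5.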

Pandas: merge two time series and get the mean values during the period when these two have overlapped time period

I got two pandas dataframes as following:
ts1
Out[50]:
soil_moisture_ids41
date_time
2007-01-07 05:00:00 0.1830
2007-01-07 06:00:00 0.1825
2007-01-07 07:00:00 0.1825
2007-01-07 08:00:00 0.1825
2007-01-07 09:00:00 0.1825
... ...
2017-10-10 20:00:00 0.0650
2017-10-10 21:00:00 0.0650
2017-10-10 22:00:00 0.0650
2017-10-10 23:00:00 0.0650
2017-10-11 00:00:00 0.0650
[94316 rows x 3 columns]
and the other one is
ts2
Out[51]:
soil_moisture_ids42
date_time
2016-07-20 00:00:00 0.147
2016-07-20 01:00:00 0.148
2016-07-20 02:00:00 0.149
2016-07-20 03:00:00 0.150
2016-07-20 04:00:00 0.152
... ...
2019-12-31 19:00:00 0.216
2019-12-31 20:00:00 0.216
2019-12-31 21:00:00 0.215
2019-12-31 22:00:00 0.215
2019-12-31 23:00:00 0.215
[30240 rows x 3 columns]
As you can see, from 2007-01-07 to 2016-07-19 only ts1 has data points, and from 2016-07-20 to 2017-10-11 the two time series overlap. I want to combine these two data frames. During the overlapping period, I want the mean of the ts1 and ts2 values. During the non-overlapping periods (2007-01-07 to 2016-07-19 and 2017-10-12 to 2019-12-31), the value at each timestamp should be the value from ts1 or ts2, whichever is available. How can I do it?
Thanks!
Use concat and then aggregate with mean over the index: a timestamp with a single value keeps that value, and a timestamp with multiple values gets their mean. The resulting DatetimeIndex is also sorted:
s = pd.concat([ts1, ts2]).groupby(level=0).mean()
Alternatively, store the concatenated series first and then apply the mean, i.e. merged_ts = pd.concat([ts1, ts2]) and then mean_ts = merged_ts.groupby(level=0).mean()
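One caveat, assuming ts1 and ts2 really are DataFrames with the differently named columns shown in the question: concat would then keep two separate columns, and the mean would be taken per column rather than across the two sensors. A minimal sketch of one workaround is to pull each column out as a Series under a shared name first (the name soil_moisture and the tiny sample values are made up for illustration):

```python
import pandas as pd

# Tiny stand-ins for ts1 and ts2; they overlap at 00:00 and 01:00.
idx1 = pd.to_datetime(["2016-07-19 23:00", "2016-07-20 00:00", "2016-07-20 01:00"])
idx2 = pd.to_datetime(["2016-07-20 00:00", "2016-07-20 01:00", "2016-07-20 02:00"])
ts1 = pd.DataFrame({"soil_moisture_ids41": [0.10, 0.20, 0.30]}, index=idx1)
ts2 = pd.DataFrame({"soil_moisture_ids42": [0.40, 0.50, 0.60]}, index=idx2)

# Give both series the same (hypothetical) name so they stack into one column.
s1 = ts1["soil_moisture_ids41"].rename("soil_moisture")
s2 = ts2["soil_moisture_ids42"].rename("soil_moisture")

# Stack, then average all values that share a timestamp.
merged = pd.concat([s1, s2]).groupby(level=0).mean()
print(merged)
```

Timestamps present in only one input keep their single value, and the overlapping timestamps get the average of the two readings.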

Select data between night and day hours

My data looks like this; it is minute-level data covering 2 years.
2017-04-02 00:00:00
2017-04-02 00:01:00
2017-04-02 00:02:00
2017-04-02 00:03:00
2017-04-02 00:04:00
....
2017-04-02 23:59:00
...
2019-02-01 22:54:00
2019-02-01 22:55:00
2019-02-01 22:56:00
2019-02-01 22:57:00
2019-02-01 22:58:00
2019-02-01 22:59:00
2019-02-01 23:00:00
I want to access all the rows between the end of one workday and the beginning of the next, for example between 2018-04-02 18:00:00 and 2018-04-03 05:00:00, for every day in my data frame. Please help.
If you use a DatetimeIndex, then you can use .between_time:
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('2017-04-02', freq='90min', periods=100)})
df = df.set_index('date')
df.between_time('18:00', '5:00')
#date
#2017-04-02 00:00:00
#2017-04-02 01:30:00
#2017-04-02 03:00:00
#2017-04-02 04:30:00
#2017-04-02 18:00:00
#2017-04-02 19:30:00
#2017-04-02 21:00:00
#2017-04-02 22:30:00
#....
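Another approach is boolean indexing based on conditions on the datetime column or index. Assuming your DataFrame is named df and has a DatetimeIndex like the example data you've posted, try this (note that hour <= 5 also keeps every row up to 05:59; use hour < 5 if the window should stop before 05:00):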
df[(df.index.hour >= 18) | (df.index.hour <= 5)]
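To compare the two answers, the sketch below builds a hypothetical 30-minute frame (the column name value is invented) and checks that between_time with its default inclusive bounds matches a boolean mask computed from minutes since midnight. The mask form is only an illustration of what the window means, not a replacement for between_time.

```python
import numpy as np
import pandas as pd

# Two days of 30-minute data as a stand-in for the minute-level frame.
idx = pd.date_range("2017-04-02", periods=96, freq="30min")
df = pd.DataFrame({"value": np.arange(96)}, index=idx)

# Start later than end, so the window wraps past midnight.
night = df.between_time("18:00", "05:00")

# Same window as an explicit mask on minutes since midnight.
minutes = df.index.hour * 60 + df.index.minute
mask = (minutes >= 18 * 60) | (minutes <= 5 * 60)
night_by_mask = df[mask]

print(len(night), len(night_by_mask))
```

Each day contributes 11 rows from 00:00 to 05:00 and 12 rows from 18:00 to 23:30, so both selections hold 46 rows here.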

Pandas datetime comparison

I have the following dataframe:
start = ['31/12/2011 01:00','31/12/2011 01:00','31/12/2011 01:00','01/01/2013 08:00','31/12/2012 20:00']
end = ['02/01/2013 01:00','02/01/2014 01:00','02/01/2014 01:00','01/01/2013 14:00','01/01/2013 04:00']
df = pd.DataFrame({'start':start,'end':end})
df['start'] = pd.to_datetime(df['start'],format='%d/%m/%Y %H:%M')
df['end'] = pd.to_datetime(df['end'],format='%d/%m/%Y %H:%M')
print(df)
end start
0 2013-01-02 01:00:00 2011-12-31 01:00:00
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
3 2013-01-01 14:00:00 2013-01-01 08:00:00
4 2013-01-01 04:00:00 2012-12-31 20:00:00
I am trying to compare df.end and df.start to two given dates, year_start and year_end:
year_start = pd.to_datetime(2013,format='%Y')
year_end = pd.to_datetime(2013+1,format='%Y')
print(year_start)
print(year_end)
2013-01-01 00:00:00
2014-01-01 00:00:00
But I can't get my comparison to work (the comparison is in conditions):
conditions = [(df['start'].any()< year_start) and (df['end'].any()> year_end)]
choices = [8760]
df['test'] = np.select(conditions, choices, default=0)
I also tried to define year_end and year_start as follows but it does not work either:
year_start = np.datetime64(pd.to_datetime(2013,format='%Y'))
year_end = np.datetime64(pd.to_datetime(2013+1,format='%Y'))
Any idea on how I could make it work?
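Compare the columns element-wise and combine the two masks with & instead of calling .any() and using and. Try this: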
In [797]: df[(df['start']< year_start) & (df['end']> year_end)]
Out[797]:
end start
1 2014-01-02 01:00:00 2011-12-31 01:00:00
2 2014-01-02 01:00:00 2011-12-31 01:00:00
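The same element-wise mask can be passed to np.select to fill the test column the question was after. This is a sketch that rebuilds the question's frame; pd.Timestamp is used for the year boundaries here as an assumed equivalent of the pd.to_datetime calls in the question.

```python
import numpy as np
import pandas as pd

start = ['31/12/2011 01:00', '31/12/2011 01:00', '31/12/2011 01:00',
         '01/01/2013 08:00', '31/12/2012 20:00']
end = ['02/01/2013 01:00', '02/01/2014 01:00', '02/01/2014 01:00',
       '01/01/2013 14:00', '01/01/2013 04:00']
df = pd.DataFrame({'start': start, 'end': end})
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y %H:%M')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y %H:%M')

year_start = pd.Timestamp(2013, 1, 1)
year_end = pd.Timestamp(2014, 1, 1)

# One boolean per row: does the interval cover the whole of 2013?
spans_year = (df['start'] < year_start) & (df['end'] > year_end)

# 8760 where the condition holds, 0 everywhere else.
df['test'] = np.select([spans_year], [8760], default=0)
print(df)
```

Only rows 1 and 2 start before 2013 and end after it, so only those rows receive 8760.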

Pandas Datetime conversion

I have the following dataframe;
Date = ['01-Jan','01-Jan','01-Jan','01-Jan']
Heure = ['00:00','01:00','02:00','03:00']
value =[1,2,3,4]
df = pd.DataFrame({'value':value,'Date':Date,'Hour':Heure})
print(df)
Date Hour value
0 01-Jan 00:00 1
1 01-Jan 01:00 2
2 01-Jan 02:00 3
3 01-Jan 03:00 4
I am trying to create a datetime index, knowing that the file I am working with is for 2015. I have tried a lot of things but can't get it to work! I tried to convert only the day and the month, but even that does not work:
df.index = pd.to_datetime(df['Date'],format='%d-%m')
I expect the following result:
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
Does anyone know how to do it?
Thanks,
You need to explicitly add 2015 somehow, and include the Hour column as well. I would do something like this:
df.index = pd.to_datetime(df.Date + '-2015 ' + df.Hour, format='%d-%b-%Y %H:%M')
>>> df
Date Hour value
2015-01-01 00:00:00 01-Jan 00:00 1
2015-01-01 01:00:00 01-Jan 01:00 2
2015-01-01 02:00:00 01-Jan 02:00 3
2015-01-01 03:00:00 01-Jan 03:00 4
You can also parse without a year and then replace the default year 1900 by using replace:
s=pd.to_datetime(df['Date']+df['Hour'],format='%d-%b%H:%M').apply(lambda x : x.replace(year=2015))
s
Out[131]:
0 2015-01-01 00:00:00
1 2015-01-01 01:00:00
2 2015-01-01 02:00:00
3 2015-01-01 03:00:00
dtype: datetime64[ns]
df.index=s
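For reference, here is the first answer's approach as a self-contained sketch, with the year held in a variable (the name year is introduced here only for illustration). Building the year into the string before parsing avoids going through the default year 1900, which could matter for a date such as 29-Feb because 1900 is not a leap year. Note that %b matches abbreviated month names according to the current locale, so this assumes English month names.

```python
import pandas as pd

Date = ['01-Jan', '01-Jan', '01-Jan', '01-Jan']
Heure = ['00:00', '01:00', '02:00', '03:00']
value = [1, 2, 3, 4]
df = pd.DataFrame({'value': value, 'Date': Date, 'Hour': Heure})

year = 2015  # known from context, not present in the file itself

# Build strings such as "01-Jan-2015 00:00" and parse them in one pass.
stamps = df['Date'] + '-' + str(year) + ' ' + df['Hour']
df.index = pd.to_datetime(stamps, format='%d-%b-%Y %H:%M')
print(df)
```

The resulting index is a DatetimeIndex, so methods such as resample and between_time become available on df.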