I want to get readings every 15 minutes starting on the hour given a set of readings that are made hourly but at offset minutes from the hour.
My first approach was to resample to 15 minutes, but I did not get the expected results.
If the readings are on the hour, resampling works fine:
import pandas as pd

# hourly readings that fall exactly on the hour
left_key = pd.to_datetime(['2020-12-01 00:00',
                           '2020-12-01 01:00',
                           '2020-12-01 02:00',
                           '2020-12-01 03:00',
                           '2020-12-01 04:00',
                           '2020-12-01 05:00'])
left_data = pd.Series([12, 12, 13, 15, 16, 15], index=left_key, name='master')
resampled = left_data.resample('15min')
resampled.interpolate(method='spline', order=2)  # spline interpolation requires scipy
Yields just what I need:
2020-12-01 00:00:00 12.000000
2020-12-01 00:15:00 11.777455
2020-12-01 00:30:00 12.079464
2020-12-01 00:45:00 12.370313
2020-12-01 01:00:00 12.000000
2020-12-01 01:15:00 12.918527
2020-12-01 01:30:00 13.175893
But if the readings are offset from the hour:
left_key = pd.to_datetime(['2020-12-01 00:06',
                           '2020-12-01 01:06',
                           '2020-12-01 02:06',
                           '2020-12-01 03:06',
                           '2020-12-01 04:06',
                           '2020-12-01 05:06'])
left_data = pd.Series([12, 12, 13, 15, 16, 15], index=left_key, name='master')
resampled = left_data.resample('15min')
resampled.interpolate(method='spline', order=2)
Now I get no data:
2020-12-01 00:00:00 NaN
2020-12-01 00:15:00 NaN
2020-12-01 00:30:00 NaN
2020-12-01 00:45:00 NaN
2020-12-01 01:00:00 NaN
And if I resample hourly, it simply shifts the readings back onto the hour:
resampled = left_data.resample('H')
resampled.interpolate(method='spline', order=2)
2020-12-01 00:00:00 12
2020-12-01 01:00:00 12
2020-12-01 02:00:00 13
2020-12-01 03:00:00 15
2020-12-01 04:00:00 16
2020-12-01 05:00:00 15
Is there a way to get resample to interpolate the readings so I have the correct value on the hour?
(And is there a better title for this question?)
Update
While the solution works, it is not suitable for larger volumes of data: 1000 rows was too much for my machine! Even reducing the size of the initial resample step required large amounts of memory and time to complete.
Here is another solution from this question: Interpolate one time series onto custom time series
import pandas as pd
from datetime import datetime

# df is the original data as a DataFrame with a DatetimeIndex

# create a new index covering the range of datetimes required
starts = df.index.min()
starts = datetime(starts.year, starts.month, starts.day, starts.hour, 15 * (starts.minute // 15))
master = pd.date_range(starts, df.index.max(), freq="15min")
# will need this to identify the original data rows later
df['tag'] = True
# merge with the original data and interpolate the missing rows
idx = df.index.union(master)
df2 = df.reindex(idx).interpolate('index')
# now remove the things we don't want
df2.drop(df2.index[0], inplace=True)  # the first value will be NaN (unless it coincides with real data)
# use the tag column to remove the original data rows, then drop that column
df2 = df2[df2['tag'].isna()]
df2.drop(columns=['tag'], inplace=True)
This is much much faster!
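For reference, here is a minimal sketch of that approach applied to the offset example above. It is my own variant: it selects the new grid by reindexing onto it rather than with the tag column, which should give the same on-the-grid values.

import pandas as pd

# offset example readings from the question
left_key = pd.to_datetime(['2020-12-01 00:06', '2020-12-01 01:06', '2020-12-01 02:06',
                           '2020-12-01 03:06', '2020-12-01 04:06', '2020-12-01 05:06'])
df = pd.DataFrame({'master': [12, 12, 13, 15, 16, 15]}, index=left_key)
# 15-minute grid starting at the quarter-hour at or before the first reading
start = df.index.min().floor('15min')
grid = pd.date_range(start, df.index.max(), freq='15min')
# interpolate over the union of old and new timestamps, then keep only the grid points
df2 = df.reindex(df.index.union(grid)).interpolate('index').reindex(grid)
print(df2)

The first grid point (00:00) falls before the first reading, so it stays NaN, just as the first value is dropped in the snippet above.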
OK, this is not the most beautiful of all solutions, but it has worked for me in the past. It's a trick that consists of first resampling with a negligible time interval before applying the one you actually want. First of all, you need to set the time column (Dates) as the index.
import pandas as pd

left_key = pd.to_datetime(['2020-12-01 00:06',
                           '2020-12-01 01:06',
                           '2020-12-01 02:06',
                           '2020-12-01 03:06',
                           '2020-12-01 04:06',
                           '2020-12-01 05:06'])
left_data = pd.Series([12, 12, 13, 15, 16, 15])
df = pd.DataFrame({'Dates': left_key, 'Values': left_data})
df.set_index('Dates', inplace=True)
df1 = df.resample('1ms').interpolate(method='spline', order=2).resample('15min').first()
which gives
Values
Dates
2020-12-01 00:00:00 12.000000
2020-12-01 00:15:00 11.653527
2020-12-01 00:30:00 11.960000
2020-12-01 00:45:00 12.255313
2020-12-01 01:00:00 12.539464
2020-12-01 01:15:00 12.812455
2020-12-01 01:30:00 13.074286
2020-12-01 01:45:00 13.324955
2020-12-01 02:00:00 13.564464
2020-12-01 02:15:00 13.792813
2020-12-01 02:30:00 14.010000
2020-12-01 02:45:00 14.216027
2020-12-01 03:00:00 14.410893
2020-12-01 03:15:00 14.594598
2020-12-01 03:30:00 14.767143
2020-12-01 03:45:00 14.928527
2020-12-01 04:00:00 15.078750
2020-12-01 04:15:00 15.217812
2020-12-01 04:30:00 15.345714
2020-12-01 04:45:00 15.462455
2020-12-01 05:00:00 15.568036
Then, you concatenate with your original df
frames = [df, df1]
df2 = pd.concat(frames)
df2.sort_values('Dates')
which returns
Values
Dates
2020-12-01 00:00:00 12.000000
2020-12-01 00:06:00 12.000000
2020-12-01 00:15:00 11.653527
2020-12-01 00:30:00 11.960000
2020-12-01 00:45:00 12.255313
2020-12-01 01:00:00 12.539464
2020-12-01 01:06:00 12.000000
2020-12-01 01:15:00 12.812455
2020-12-01 01:30:00 13.074286
2020-12-01 01:45:00 13.324955
2020-12-01 02:00:00 13.564464
2020-12-01 02:06:00 13.000000
2020-12-01 02:15:00 13.792813
2020-12-01 02:30:00 14.010000
2020-12-01 02:45:00 14.216027
2020-12-01 03:00:00 14.410893
2020-12-01 03:06:00 15.000000
2020-12-01 03:15:00 14.594598
2020-12-01 03:30:00 14.767143
2020-12-01 03:45:00 14.928527
2020-12-01 04:00:00 15.078750
2020-12-01 04:06:00 16.000000
2020-12-01 04:15:00 15.217812
2020-12-01 04:30:00 15.345714
2020-12-01 04:45:00 15.462455
2020-12-01 05:00:00 15.568036
2020-12-01 05:06:00 15.000000
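A note on memory, prompted by the update above: the 1 ms upsample creates millions of intermediate rows even for a few hours of data. My assumption is that when the readings are offset by whole minutes, as here, a coarser intermediate grid such as 1 minute still lands exactly on the quarter-hour marks and should give the same values far more cheaply:

# df as built above; a 1-minute intermediate grid instead of 1 ms (assumed trade-off)
df1 = df.resample('1min').interpolate(method='spline', order=2).resample('15min').first()

If the readings were instead offset by odd seconds, the intermediate step would need to be fine enough to hit the quarter-hour boundaries exactly.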
Related
Suppose I have a pandas Series with hourly observations:
import numpy as np
import pandas as pd
pd_series = pd.Series(np.random.rand(26281), index=pd.date_range('2022-01-01', '2024-12-31', freq='H'))
pd_series
2022-01-01 00:00:00 0.933746
2022-01-01 01:00:00 0.588907
2022-01-01 02:00:00 0.229040
2022-01-01 03:00:00 0.557752
2022-01-01 04:00:00 0.798649
...
2024-12-30 20:00:00 0.314143
2024-12-30 21:00:00 0.670485
2024-12-30 22:00:00 0.300531
2024-12-30 23:00:00 0.075403
2024-12-31 00:00:00 0.716685
What I want is to replace every observation by the monthly average. I know that the average can be calculated as
pd_series.resample('MS').mean()
But how do I map the monthly averages back onto the respective observations?
Use Resampler.transform:
print (pd_series.resample('MS').transform('mean'))
2022-01-01 00:00:00 0.495015
2022-01-01 01:00:00 0.495015
2022-01-01 02:00:00 0.495015
2022-01-01 03:00:00 0.495015
2022-01-01 04:00:00 0.495015
...
2024-12-30 20:00:00 0.508646
2024-12-30 21:00:00 0.508646
2024-12-30 22:00:00 0.508646
2024-12-30 23:00:00 0.508646
2024-12-31 00:00:00 0.508646
Freq: H, Length: 26281, dtype: float64
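An equivalent spelling with groupby, in case you are already grouping by other keys as well (this should give the same output):

# same monthly means broadcast back onto every hourly observation
pd_series.groupby(pd.Grouper(freq='MS')).transform('mean')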
I have two pandas DataFrames as follows:
ts1
Out[50]:
soil_moisture_ids41
date_time
2007-01-07 05:00:00 0.1830
2007-01-07 06:00:00 0.1825
2007-01-07 07:00:00 0.1825
2007-01-07 08:00:00 0.1825
2007-01-07 09:00:00 0.1825
... ...
2017-10-10 20:00:00 0.0650
2017-10-10 21:00:00 0.0650
2017-10-10 22:00:00 0.0650
2017-10-10 23:00:00 0.0650
2017-10-11 00:00:00 0.0650
[94316 rows x 3 columns]
and the other one is
ts2
Out[51]:
soil_moisture_ids42
date_time
2016-07-20 00:00:00 0.147
2016-07-20 01:00:00 0.148
2016-07-20 02:00:00 0.149
2016-07-20 03:00:00 0.150
2016-07-20 04:00:00 0.152
... ...
2019-12-31 19:00:00 0.216
2019-12-31 20:00:00 0.216
2019-12-31 21:00:00 0.215
2019-12-31 22:00:00 0.215
2019-12-31 23:00:00 0.215
[30240 rows x 3 columns]
As you can see, from 2007-01-07 to 2016-07-19 only ts1 has data points, and from 2016-07-20 to 2017-10-11 the two time series overlap. Now I want to combine these two DataFrames. During the overlapping period I want the mean of the ts1 and ts2 values; during the non-overlapping periods (2007-01-07 to 2016-07-19 and 2017-10-12 to 2019-12-31) the value at each timestamp should simply be taken from whichever of ts1 or ts2 has it. How can I do this?
Thanks!
Use concat with an aggregate mean: if a timestamp has only one value you get that value back unchanged, and if it has multiple values you get their mean. The resulting DatetimeIndex also ends up sorted:
s = pd.concat([ts1, ts2]).groupby(level=0).mean()
Just store the concatenated series first and then apply the mean, i.e. merged_ts = pd.concat([ts1, ts2]) and then mean_ts = merged_ts.groupby(level=0).mean()
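For anyone who wants to try it quickly, here is a small self-contained sketch of that concat-and-mean idea on toy data (the values and the shared column name are made up for illustration; with different column names, as in the question, rename one frame first so the values line up in a single column):

import pandas as pd

# toy stand-ins for ts1 and ts2 with a single overlapping timestamp at 01:00
idx1 = pd.date_range('2016-07-20 00:00', periods=2, freq='H')
idx2 = pd.date_range('2016-07-20 01:00', periods=2, freq='H')
ts1 = pd.DataFrame({'soil_moisture': [0.10, 0.14]}, index=idx1)
ts2 = pd.DataFrame({'soil_moisture': [0.20, 0.22]}, index=idx2)

# timestamps present in only one frame keep their value; the overlap is averaged
combined = pd.concat([ts1, ts2]).groupby(level=0).mean()
print(combined)
# 00:00 -> 0.10, 01:00 -> (0.14 + 0.20) / 2 = 0.17, 02:00 -> 0.22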
Details of the raw data (Mnth.nc):
netcdf Mnth {
dimensions:
        time = UNLIMITED ; // (480 currently)
        bnds = 2 ;
        longitude = 25 ;
        latitude = 33 ;
variables:
        double time(time) ;
                time:standard_name = "time" ;
                time:long_name = "verification time generated by wgrib2 function verftime()" ;
                time:bounds = "time_bnds" ;
                time:units = "seconds since 1970-01-01 00:00:00.0 0:00" ;
                time:calendar = "standard" ;
                time:axis = "T" ;
        double time_bnds(time, bnds) ;
        double longitude(longitude) ;
                longitude:standard_name = "longitude" ;
                longitude:long_name = "longitude" ;
                longitude:units = "degrees_east" ;
                longitude:axis = "X" ;
        double latitude(latitude) ;
                latitude:standard_name = "latitude" ;
                latitude:long_name = "latitude" ;
                latitude:units = "degrees_north" ;
                latitude:axis = "Y" ;
        float APCP_sfc(time, latitude, longitude) ;
                APCP_sfc:long_name = "Total Precipitation" ;
                APCP_sfc:units = "kg/m^2" ;
                APCP_sfc:_FillValue = 9.999e+20f ;
                APCP_sfc:missing_value = 9.999e+20f ;
                APCP_sfc:cell_methods = "time: sum" ;
                APCP_sfc:short_name = "APCP_surface" ;
                APCP_sfc:level = "surface" ;
}
Detailed information of the raw data (Mnth.nc):
File format : NetCDF4 classic
-1 : Institut Source T Steptype Levels Num Points Num Dtype : Parameter ID
1 : unknown unknown v instant 1 1 825 1 F32 : -1
Grid coordinates :
1 : lonlat : points=825 (25x33)
longitude : 87 to 89.88 by 0.12 degrees_east
latitude : 25.08 to 28.92 by 0.12 degrees_north
Vertical coordinates :
1 : surface : levels=1
Time coordinate : 480 steps
RefTime = 1970-01-01 00:00:00 Units = seconds Calendar = standard Bounds = true
YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss
1980-01-16 12:30:00 1980-02-15 12:30:00 1980-03-16 12:30:00 1980-04-16 00:30:00
1980-05-16 12:30:00 1980-06-16 00:30:00 1980-07-16 12:30:00 1980-08-16 12:30:00
1980-09-16 00:30:00 1980-10-16 12:30:00 1980-11-16 00:30:00 1980-12-16 12:30:00
1981-01-16 12:30:00 1981-02-15 00:30:00 1981-03-16 12:30:00 1981-04-16 00:30:00
1981-05-16 12:30:00 1981-06-16 00:30:00 1981-07-16 12:30:00 1981-08-16 12:30:00
1981-09-16 00:30:00 1981-10-16 12:30:00 1981-11-16 00:30:00 1981-12-16 12:30:00
1982-01-16 12:30:00 1982-02-15 00:30:00 1982-03-16 12:30:00 1982-04-16 00:30:00
1982-05-16 12:30:00 1982-06-16 00:30:00 1982-07-16 12:30:00 1982-08-16 12:30:00
1982-09-16 00:30:00 1982-10-16 12:30:00 1982-11-16 00:30:00 1982-12-16 12:30:00
1983-01-16 12:30:00 1983-02-15 00:30:00 1983-03-16 12:30:00 1983-04-16 00:30:00
1983-05-16 12:30:00 1983-06-16 00:30:00 1983-07-16 12:30:00 1983-08-16 12:30:00
1983-09-16 00:30:00 1983-10-16 12:30:00 1983-11-16 00:30:00 1983-12-16 12:30:00
1984-01-16 12:30:00 1984-02-15 12:30:00 1984-03-16 12:30:00 1984-04-16 00:30:00
1984-05-16 12:30:00 1984-06-16 00:30:00 1984-07-16 12:30:00 1984-08-16 12:30:00
1984-09-16 00:30:00 1984-10-16 12:30:00 1984-11-16 00:30:00 1984-12-16 12:30:00
................................................................................
............................
2016-01-16 12:30:00 2016-02-15 12:30:00 2016-03-16 12:30:00 2016-04-16 00:30:00
2016-05-16 12:30:00 2016-06-16 00:30:00 2016-07-16 12:30:00 2016-08-16 12:30:00
2016-09-16 00:30:00 2016-10-16 12:30:00 2016-11-16 00:30:00 2016-12-16 12:30:00
2017-01-16 12:30:00 2017-02-15 00:30:00 2017-03-16 12:30:00 2017-04-16 00:30:00
2017-05-16 12:30:00 2017-06-16 00:30:00 2017-07-16 12:30:00 2017-08-16 12:30:00
2017-09-16 00:30:00 2017-10-16 12:30:00 2017-11-16 00:30:00 2017-12-16 12:30:00
2018-01-16 12:30:00 2018-02-15 00:30:00 2018-03-16 12:30:00 2018-04-16 00:30:00
2018-05-16 12:30:00 2018-06-16 00:30:00 2018-07-16 12:30:00 2018-08-16 12:30:00
2018-09-16 00:30:00 2018-10-16 12:30:00 2018-11-16 00:30:00 2018-12-16 12:30:00
2019-01-16 12:30:00 2019-02-15 00:30:00 2019-03-16 12:30:00 2019-04-16 00:30:00
2019-05-16 12:30:00 2019-06-16 00:30:00 2019-07-16 12:30:00 2019-08-16 12:30:00
2019-09-16 00:30:00 2019-10-16 12:30:00 2019-11-16 00:30:00 2019-12-16 12:30:00
2020-01-16 12:30:00 2020-02-15 12:30:00 2020-03-16 12:30:00 2020-04-16 00:30:00
2020-05-16 12:30:00 2020-06-16 00:30:00 2020-07-16 12:30:00 2020-08-16 12:30:00
2020-09-16 00:30:00 2020-10-16 12:30:00 2020-11-16 00:30:00 2020-12-16 12:30:00
cdo sinfo: Processed 1 variable over 480 timesteps [0.50s 30MB].
I extracted monthly rainfall values from the Mnth.nc file for a location (lon: 88.44; lat: 27.12) using the following commands:
cdo remapnn,lon=88.44-lat=27.12 Mnth.nc Mnth1.nc
cdo outputtab,year,month,value Mnth1.nc > Mnth.csv
The output is as follows:
Year month Value
1980 1 31.74219
1980 2 54.60938
1980 3 66.94531
1980 4 149.4062
1980 5 580.7227
1980 6 690.1328
1980 7 1146.305
1980 8 535.8164
1980 9 486.4688
1980 10 119.5391
1980 11 82.10547
1980 12 13.95703
Then I extracted the rainfall values from the same data (Mnth.nc) for the same location (lon: 88.44; lat: 27.12) using the multidimensional toolbox provided in ArcGIS. The result is as follows:
year month Value
1980 1 38.8125
1980 2 58.6542969
1980 3 71.7382813
1980 4 148.6367188
1980 5 564.7070313
1980 6 653.0390625
1980 7 1026.832031
1980 8 501.3164063
1980 9 458.5429688
1980 10 113.078125
1980 11 74.0976563
1980 12 24.2265625
Why am I getting different results from two different software packages for the same location and the same variable? Any help will be highly appreciated.
Thanks in advance.
The question is perhaps misleading, in that you are not "extracting" the data in either case. Instead you are interpolating it. The method used by CDO is nearest neighbour. ArcGIS is probably simply using a different method, so you should expect slightly different results.
The results look very similar, so both are almost certainly working as advertised.
I think I ran into the same issue. I used CDO to extract a point and also used ArcGIS for cross-checking, and I found that the values were different.
Just to be sure, I recorded the extent of one particular cell and tried extracting values for several locations within that cell's boundary. CDO gave the same results each time, as expected, because it uses the nearest-neighbour resampling method.
Then I tried the same with ArcGIS. Interestingly, in my case ArcGIS sometimes gave the same results within the same cell boundary and sometimes different ones. I also checked the values with Panoply and realised that CDO gave accurate results, while ArcGIS sometimes gave offset results, i.e. it returned the values of nearby cells; this was confirmed by cross-checking with Panoply. @Robert Wilson mentioned that ArcGIS must be using a different resampling method, but from the results of the 'NetCDF to table view' tool I figured out that it also uses the nearest-neighbour method. This is not an answer to your question, just something I found.
I have the below table structure in SQL Server:
StartDate Start End Sales
==============================================
2020-08-25 00:00:00 00:15:00 291.4200
2020-08-25 00:15:00 00:30:00 401.1700
2020-08-25 00:30:00 00:45:00 308.3300
2020-08-25 00:45:00 01:00:00 518.3200
2020-08-25 01:00:00 01:15:00 247.3700
2020-08-25 01:15:00 01:30:00 115.4700
2020-08-25 01:30:00 01:45:00 342.3800
2020-08-25 01:45:00 02:00:00 233.0900
2020-08-25 02:00:00 02:15:00 303.3400
2020-08-25 02:15:00 02:30:00 11.9000
2020-08-25 02:30:00 02:45:00 115.2400
2020-08-25 02:45:00 03:00:00 199.5200
2020-08-25 06:00:00 06:15:00 0.0000
2020-08-25 06:15:00 06:30:00 45.2400
2020-08-25 06:30:00 06:45:00 30.4800
2020-08-25 06:45:00 07:00:00 0.0000
2020-08-25 07:00:00 07:15:00 0.0000
2020-08-25 07:15:00 07:30:00 69.2800
Is there a way to group the above data into one-hour intervals instead of 15-minute intervals?
It has to be based on the Start and End columns.
Thanks,
Maybe something like the following using datepart?
select startdate, DatePart(hour,start) [Hour], Sum(sales) SalesPerHour
from t
group by startdate, DatePart(hour,start)
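For comparison with the pandas questions above, the same hourly rollup can also be sketched in pandas; the 'slot' column below is a made-up name for the combined date and start time:

import pandas as pd

# a few of the 15-minute rows from the table, with date and start time combined
df = pd.DataFrame({
    'slot': pd.to_datetime(['2020-08-25 00:00', '2020-08-25 00:15',
                            '2020-08-25 00:30', '2020-08-25 00:45',
                            '2020-08-25 01:00']),
    'Sales': [291.42, 401.17, 308.33, 518.32, 247.37],
})
# sum the 15-minute sales into hourly buckets
hourly = df.set_index('slot')['Sales'].resample('H').sum()
print(hourly)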
I have two columns of data in a pandas DataFrame that looks like this, with the "DateTime" column in the format YYYY-MM-DD HH:MM:SS. This shows the first 24 hours, but the DataFrame covers one full year (8784 x 2).
BAFFIN BAY DateTime
8759 8.112838 2016-01-01 00:00:00
8760 7.977169 2016-01-01 01:00:00
8761 8.420204 2016-01-01 02:00:00
8762 9.515370 2016-01-01 03:00:00
8763 9.222840 2016-01-01 04:00:00
8764 8.872423 2016-01-01 05:00:00
8765 8.776145 2016-01-01 06:00:00
8766 9.030668 2016-01-01 07:00:00
8767 8.394983 2016-01-01 08:00:00
8768 8.092915 2016-01-01 09:00:00
8769 8.946967 2016-01-01 10:00:00
8770 9.620883 2016-01-01 11:00:00
8771 9.535951 2016-01-01 12:00:00
8772 8.861761 2016-01-01 13:00:00
8773 9.077692 2016-01-01 14:00:00
8774 9.116074 2016-01-01 15:00:00
8775 8.724343 2016-01-01 16:00:00
8776 8.916940 2016-01-01 17:00:00
8777 8.920438 2016-01-01 18:00:00
8778 8.926278 2016-01-01 19:00:00
8779 8.817666 2016-01-01 20:00:00
8780 8.704014 2016-01-01 21:00:00
8781 8.496358 2016-01-01 22:00:00
8782 8.434297 2016-01-01 23:00:00
I am trying to calculate daily averages of the "BAFFIN BAY" column and I've tried these approaches:
davg_df2 = df2.groupby(pd.Grouper(freq='D', key='DateTime')).mean()
davg_df2 = df2.groupby(pd.Grouper(freq='1D', key='DateTime')).mean()
davg_df2 = df2.groupby(by=df2['DateTime'].dt.date).mean()
All of these approaches yield the same answer, as shown below:
BAFFIN BAY
DateTime
2016-01-01 6.008044
However, if you do the math, the correct average for 2016-01-01 is 8.813134. Thank you kindly for your help. I'm assuming the grouping is just by day (24 hours) to produce consecutive daily averages, but the three approaches above are clearly picking up other data in my 8784 x 2 DataFrame.
I just ran your df with this code and I get 8.813134:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df = df.groupby(by=pd.Grouper(freq='D', key='DateTime')).mean()
print(df)
Output:
BAFFIN BAY
DateTime
2016-01-01 8.813134
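For completeness, an equivalent approach (starting again from the original df, with DateTime already parsed by pd.to_datetime as above) is to set the index and resample; this should give the same daily means:

# same daily means via resample instead of groupby, on the original df
daily = df.set_index('DateTime')['BAFFIN BAY'].resample('D').mean()
print(daily)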