I would like to spread the values of the 15-minute intervals evenly over 5-minute intervals, but I cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessarily explicit: the value 6 at 00:00:00 is spread evenly over the intervals 00:00:00 through 00:10:00, and likewise for the other values.
Slightly different approach:
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row so the last 15-minute block also spans three 5-minute rows
df.loc[df.index.max()+pd.to_timedelta('10min')] = 0
# upsample to a 5-minute frequency, filling the new rows with 0
s = df.asfreq('5T', fill_value=0)
# replace each block (a value followed by its 0s) with the block mean, which spreads the value evenly:
(s.groupby(s['a'].ne(0)
.cumsum())
.transform('mean')
.reset_index()
)
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
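If the input is guaranteed to sit on a strict 15-minute grid, a shorter sketch is to upsample with a forward fill and divide by the upsampling factor. This is only an illustration of the idea under that assumption; the extra 10 minutes are again appended so the last block is fully covered:
import pandas as pd

df = pd.DataFrame({'a': [6, 3, 9]},
                  index=pd.to_datetime(['2018-01-01 00:00:00',
                                        '2018-01-01 00:15:00',
                                        '2018-01-01 00:30:00']))
# full 5-minute index, extended 10 minutes past the last timestamp
idx = pd.date_range(df.index.min(), df.index.max() + pd.Timedelta('10min'), freq='5T')
# copy each 15-minute value onto its three 5-minute rows, then split it evenly
out = df.reindex(idx, method='ffill').div(3)
print(out)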
Related
I have time series data in a PostgreSQL database. Currently I import it into pandas for analysis, and my first step is often resampling the 5-minute data to 1-hour averages: I pivot the data into wide form, resample it to 1 hour, and then melt it back into long form (a sketch of this workflow is shown after the sample data below).
Now I want to do the resampling in the database so that I can import the 1-hour averages right away.
This is what the data looks like in the database. There are two different types and three different names.
datetime value type name
0 2018-01-01 13:35:00+01:00 0.22 HLN NO2
1 2018-01-01 13:35:00+01:00 0.31 HLN CO
2 2018-01-01 13:35:00+01:00 1.15 HLN NO
3 2018-01-01 13:40:00+01:00 1.80 AIS NO2
4 2018-01-01 13:40:00+01:00 2.60 AIS CO
5 2018-01-01 13:40:00+01:00 2.30 AIS NO
6 2018-01-01 13:45:00+01:00 2.25 HLN NO2
7 2018-01-01 13:45:00+01:00 2.14 HLN CO
8 2018-01-01 13:45:00+01:00 2.96 HLN NO
9 2018-01-01 14:35:00+01:00 0.76 HLN NO2
10 2018-01-01 14:35:00+01:00 0.80 HLN CO
11 2018-01-01 14:35:00+01:00 1.19 HLN NO
12 2018-01-01 14:40:00+01:00 1.10 AIS NO2
13 2018-01-01 14:40:00+01:00 2.87 AIS CO
14 2018-01-01 14:40:00+01:00 2.80 AIS NO
15 2018-01-01 14:45:00+01:00 3.06 HLN NO2
16 2018-01-01 13:45:00+01:00 2.86 HLN CO
17 2018-01-01 13:45:00+01:00 2.22 HLN NO
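For reference, the pandas workflow described above might look roughly like the sketch below; it assumes df holds exactly the columns shown in the table, with datetime already parsed as a datetime dtype:
import pandas as pd

# wide form: one column per (type, name) pair
wide = df.pivot_table(index='datetime', columns=['type', 'name'], values='value')
# hourly averages
hourly = wide.resample('1H').mean()
# back to long form
long_df = (hourly.stack(['type', 'name'])
                 .rename('value')
                 .reset_index())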
Now comes the part I have problems with. After resampling and plotting in pandas and Plotly I get the expected result, which is correct: one value for every hour.
After running the following SQL query to resample to one hour:
SELECT date_trunc('hour', datetime) AS hour, type, AVG(value) AS measure, name
FROM data_table
GROUP BY datetime, type, name
ORDER BY datetime
I get this after plotting: the line is not smooth and there are multiple values within each hour; I guess those are all the raw values inside the hour.
My question: how can I correctly resample a time series in SQL?
Edit: Expected result in table form:
datetime value type name
2018-01-01 13:00:00 1.235 HLN NO2
2018-01-01 13:00:00 2.65 HLN CO
2018-01-01 13:00:00 2.96 HLN NO
2018-01-01 13:00:00 2.48 AIS NO2
2018-01-01 13:00:00 2.65 AIS CO
2018-01-01 13:00:00 2.26 AIS NO
2018-01-01 14:00:00 2.78 HLN NO2
2018-01-01 14:00:00 3.65 HLN CO
2018-01-01 14:00:00 1.95 HLN NO
2018-01-01 14:00:00 1.45 AIS NO2
2018-01-01 14:00:00 1.64 AIS CO
2018-01-01 14:00:00 3.23 AIS NO
The original query groups by the raw datetime instead of the truncated hour, so every individual timestamp stays its own group. An alternative is to create the time intervals using generate_series(), or simply truncate the hours per type and name in a subquery / CTE, and in the outer query join both record sets and aggregate the values with avg() grouped by hour, type and name, e.g.
WITH j AS (
SELECT DISTINCT date_trunc('hour', datetime) AS hour, type,name
FROM data_table
)
SELECT j.*, avg(d.value)
FROM data_table d
JOIN j ON date_trunc('hour', d.datetime) = j.hour AND
j.type = d.type AND
d.name = j.name
GROUP BY j.hour, j.name, j.type
ORDER BY j.hour ASC,j.type DESC;
hour | type | name | avg
---------------------+------+------+------------------------
2018-01-01 13:00:00 | HLN | CO | 1.7700000000000000
2018-01-01 13:00:00 | HLN | NO | 2.1100000000000000
2018-01-01 13:00:00 | HLN | NO2 | 1.23500000000000000000
2018-01-01 13:00:00 | AIS | CO | 2.6000000000000000
2018-01-01 13:00:00 | AIS | NO | 2.3000000000000000
2018-01-01 13:00:00 | AIS | NO2 | 1.80000000000000000000
2018-01-01 14:00:00 | HLN | CO | 0.80000000000000000000
2018-01-01 14:00:00 | HLN | NO | 1.19000000000000000000
2018-01-01 14:00:00 | HLN | NO2 | 1.9100000000000000
2018-01-01 14:00:00 | AIS | CO | 2.8700000000000000
2018-01-01 14:00:00 | AIS | NO | 2.8000000000000000
2018-01-01 14:00:00 | AIS | NO2 | 1.10000000000000000000
Demo: db<>fiddle
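Since the goal is to import the hourly averages straight into pandas, the aggregated query can also be read directly into a DataFrame. A sketch, assuming a SQLAlchemy engine named engine and the table/column names used above:
import pandas as pd

query = """
    SELECT date_trunc('hour', datetime) AS datetime,
           type, name, AVG(value) AS value
    FROM data_table
    GROUP BY date_trunc('hour', datetime), type, name
    ORDER BY 1
"""
hourly = pd.read_sql(query, engine, parse_dates=['datetime'])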
I have a dataframe like this:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 03:00:00 A01 50
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
Given a start and end date (start = 2018-01-01 00:00:00 and end = 2018-01-01 05:00:00),
I would like to transform the dataframe as below (every missing entry should be zero or NULL).
Output like:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 02:00:00 A01 00
2018-01-01 03:00:00 A01 50
2018-01-01 04:00:00 A01 00
2018-01-01 05:00:00 A01 00
2018-01-01 00:00:00 A02 00
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
2018-01-01 04:00:00 A02 00
2018-01-01 05:00:00 A02 00
I am directionless, so I have no approach as of now.
Let's try pivoting the table and reindexing:
new_dates = pd.date_range('2018-01-01 00:00:00',
'2018-01-01 05:00:00',
freq='H', name='datestamp')
(df.pivot_table(index='datestamp',columns='Name',
values='Reading', fill_value=0)
.reindex(new_dates, fill_value=0)
.stack().sort_index(level=[1,0])
.reset_index(name='Reading')
)
Output:
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 0
3 2018-01-01 03:00:00 A01 50
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 00:00:00 A02 0
7 2018-01-01 01:00:00 A02 50
8 2018-01-01 02:00:00 A02 40
9 2018-01-01 03:00:00 A02 30
10 2018-01-01 04:00:00 A02 0
11 2018-01-01 05:00:00 A02 0
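An equivalent sketch that skips the pivot is to reindex against the full (Name, datestamp) MultiIndex; it assumes datestamp is already a datetime column and reuses new_dates from above:
idx = pd.MultiIndex.from_product([df['Name'].unique(), new_dates],
                                 names=['Name', 'datestamp'])
out = (df.set_index(['Name', 'datestamp'])
         .reindex(idx, fill_value=0)
         .reset_index()[['datestamp', 'Name', 'Reading']])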
Use pd.date_range(), df.loc[] and a nested for loop:
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A02 50
4 2018-01-01 04:00:00 A02 40
5 2018-01-01 05:00:00 A02 30
import datetime as dt
import pandas as pd

start_date = dt.datetime.strptime('2018-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end_date = dt.datetime.strptime('2018-01-01 05:00:00', '%Y-%m-%d %H:%M:%S')
date_range = pd.date_range(start=start_date, end=end_date, freq='H')

# df['datestamp'] must already be of datetime dtype for the comparison below to match
for date in date_range:
    for sensor in df.Name.unique():
        # append a zero reading for every (timestamp, sensor) pair that is missing
        if len(df.loc[(df['datestamp'] == date) & (df['Name'] == sensor)]) == 0:
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            df = df.append({'datestamp': date, 'Name': sensor, 'Reading': 0}, ignore_index=True)
df = df.sort_values(by=['Name', 'datestamp']).reset_index(drop=True)
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A01 0
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 03:00:00 A02 50
7 2018-01-01 04:00:00 A02 40
8 2018-01-01 05:00:00 A02 30
9 2018-01-01 00:00:00 A02 0
10 2018-01-01 01:00:00 A02 0
11 2018-01-01 02:00:00 A02 0
Pardon the column headings.
You can use the pandas fillna method (link to documentation):
df['column_name'] = df['column_name'].fillna(0)
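In the context of this question, that would be applied after the reindex introduces the NaNs, e.g. a sketch building on the pivot approach above:
filled = (df.pivot_table(index='datestamp', columns='Name', values='Reading')
            .reindex(new_dates)
            .fillna(0))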
I have a dataframe created by:
import numpy as np
import pandas as pd

df = pd.DataFrame({})
df['Date'] = pd.to_datetime(np.arange(0, 12), unit='h', origin='2018-08-01 06:00:00')
df['ship'] = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]  # ship ID number
dt_trip = 4  # maximum duration (in hours) of each trip to be classified as the same trip
Date ship
0 2018-08-01 06:00:00 1
1 2018-08-01 07:00:00 1
2 2018-08-01 08:00:00 2
3 2018-08-01 09:00:00 2
4 2018-08-01 10:00:00 2
5 2018-08-01 11:00:00 3
6 2018-08-01 12:00:00 3
7 2018-08-01 13:00:00 3
8 2018-08-01 14:00:00 3
9 2018-08-01 15:00:00 3
10 2018-08-01 16:00:00 3
11 2018-08-01 17:00:00 3
I am trying to get a new column that shows the trips of each ship. Each trip is defined as an interval of 4 hours measured from the start of the trip. When a new ship number appears on the next row, a new trip should start automatically (irrespective of the previous datetime). From a previous post I got a solution for the trips.
origin = df["Date"][0].hour
df["Trip"] = df.apply(lambda x: ((x["Date"].hour - origin) // dt_trip) + 1, axis=1)
df["Trip"] = df.groupby(['Trip','ship']).ngroup() +1 # trip starts at: 1
This solution starts a new trip when the ship column changes. The only change I want is for the origin to be the datetime at which a new trip starts, instead of the first datetime in the data. So index 4 should have Trip = 2, because the ship is the same and the time difference from the start of that ship's trip (index 2) is less than 4 hours.
Desired solution looks like:
Date ship Trip Trip_desired
0 2018-08-01 06:00:00 1 1 1
1 2018-08-01 07:00:00 1 1 1
2 2018-08-01 08:00:00 2 2 2
3 2018-08-01 09:00:00 2 2 2
4 2018-08-01 10:00:00 2 3 2
5 2018-08-01 11:00:00 3 4 3
6 2018-08-01 12:00:00 3 4 3
7 2018-08-01 13:00:00 3 4 3
8 2018-08-01 14:00:00 3 5 3
9 2018-08-01 15:00:00 3 5 4
10 2018-08-01 16:00:00 3 5 4
11 2018-08-01 17:00:00 3 5 4
I would compute, for each ship, the elapsed time since that ship's first timestamp and bin it into 4-hour blocks:
# time elapsed since each ship's first recorded timestamp
total_time = df['Date'] - df.groupby('ship')['Date'].transform('min')
# bin the elapsed time into blocks of dt_trip hours (0 for the first block, 1 for the next, ...)
trips = total_time.dt.total_seconds().fillna(0)//(dt_trip*3600)
# number each (ship, block) combination consecutively, starting at 1
df['trip'] = df.groupby(['ship', trips]).ngroup()+1
Output:
Date ship trip
0 2018-08-01 06:00:00 1 1
1 2018-08-01 07:00:00 1 1
2 2018-08-01 08:00:00 2 2
3 2018-08-01 09:00:00 2 2
4 2018-08-01 10:00:00 2 2
5 2018-08-01 11:00:00 3 3
6 2018-08-01 12:00:00 3 3
7 2018-08-01 13:00:00 3 3
8 2018-08-01 14:00:00 3 3
9 2018-08-01 15:00:00 3 4
10 2018-08-01 16:00:00 3 4
11 2018-08-01 17:00:00 3 4
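For the sample data this works out as follows (an illustrative check): the elapsed hours since each ship's first row are 0 to 1 for ship 1, 0 to 2 for ship 2, and 0 to 6 for ship 3, so the 4-hour binning gives
print((total_time.dt.total_seconds() // (dt_trip * 3600)).astype(int).tolist())
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
and ngroup then numbers the (ship, block) pairs (1, 0), (2, 0), (3, 0), (3, 1) as trips 1 through 4.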
I have a dataframe nf as follows:
StationID DateTime Channel Count
0 1 2017-10-01 00:00:00 1 1
1 1 2017-10-01 00:00:00 1 201
2 1 2017-10-01 00:00:00 1 8
3 1 2017-10-01 00:00:00 1 2
4 1 2017-10-01 00:00:00 1 0
5 1 2017-10-01 00:00:00 1 0
6 1 2017-10-01 00:00:00 1 0
7 1 2017-10-01 00:00:00 1 0
.......... and so on
I want to group the values by each hour, for each Channel and StationID.
Required output:
Station ID DateTime Channel Count
1 2017-10-01 00:00:00 1 232
1 2017-10-01 00:01:00 1 23
2 2017-10-01 00:00:00 1 244...
...... and so on
I think you need groupby with an aggregate sum; for the datetimes, use floor by hour, which sets minutes and seconds to 0:
print (df)
StationID DateTime Channel Count
0 1 2017-12-01 00:00:00 1 1
1 1 2017-12-01 00:00:00 1 201
2 1 2017-12-01 00:10:00 1 8
3 1 2017-12-01 10:00:00 1 2
4 1 2017-10-01 10:50:00 1 0
5 1 2017-10-01 10:20:00 1 5
6 1 2017-10-01 08:10:00 1 4
7 1 2017-10-01 08:00:00 1 1
df['DateTime'] = pd.to_datetime(df['DateTime'])
df1 = (df.groupby(['StationID', df['DateTime'].dt.floor('H'), 'Channel'])['Count']
.sum()
.reset_index()
)
print (df1)
StationID DateTime Channel Count
0 1 2017-10-01 08:00:00 1 5
1 1 2017-10-01 10:00:00 1 5
2 1 2017-12-01 00:00:00 1 210
3 1 2017-12-01 10:00:00 1 2
print (df['DateTime'].dt.floor('H'))
0 2017-12-01 00:00:00
1 2017-12-01 00:00:00
2 2017-12-01 00:00:00
3 2017-12-01 10:00:00
4 2017-10-01 10:00:00
5 2017-10-01 10:00:00
6 2017-10-01 08:00:00
7 2017-10-01 08:00:00
Name: DateTime, dtype: datetime64[ns]
But if the dates are not important and only the hours matter, use hour:
df2 = (df.groupby(['StationID', df['DateTime'].dt.hour, 'Channel'])['Count']
.sum()
.reset_index()
)
print (df2)
StationID DateTime Channel Count
0 1 0 1 210
1 1 8 1 5
2 1 10 1 7
Or you can use Grouper:
df.groupby([pd.Grouper(key='DateTime', freq='H'), 'Channel', 'StationID'])['Count'].sum()
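An equivalent sketch with resample, assuming DateTime has already been converted to datetime as above:
df3 = (df.set_index('DateTime')
         .groupby(['StationID', 'Channel'])['Count']
         .resample('H')
         .sum()
         .reset_index())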
I have a PostgreSQL table containing a start timestamp and a duration.
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00
2018-01-04 09:00:00 | 2 days 16 hours
What I would like is to have the interval split across days like this:
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 01:00:00
2018-01-03 00:00:00 | 03:00:00
2018-01-04 09:00:00 | 15:00:00
2018-01-05 00:00:00 | 24:00:00
2018-01-06 00:00:00 | 24:00:00
2018-01-07 00:00:00 | 01:00:00
I am playing with generate_series(), width_bucket() and range functions, but I still can't find a plausible solution. Is there any existing or working solution?
I am not sure about all edge cases, but this seems to work:
t=# with c as (select *,min(t) over (), max(t+i) over (), tsrange(date_trunc('day',t),t+i) tr from t)
, mid as (
select distinct t,i,g,tr
, case when g < t then t else g end tt
from c
right outer join (select generate_series(date_trunc('day',min),date_trunc('day',max),'1 day') g from c) e on g <# tr order by 3,1
)
select
tt
, i
, case when tt+'1 day' > upper(tr) and t < g then upper(tr)::time::interval when upper(tr) - lower(tr) < '1 day' then i else g+'1 day' - tt end
from mid
order by tt;
tt | i | case
---------------------+-----------------+----------
2018-01-01 15:00:00 | 06:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00 | 01:00:00
2018-01-03 00:00:00 | 04:00:00 | 03:00:00
2018-01-04 09:00:00 | 2 days 16:00:00 | 15:00:00
2018-01-05 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-06 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-07 00:00:00 | 2 days 16:00:00 | 01:00:00
(7 rows)
Also, please mind that timestamp without time zone can fail you when comparing timestamps...
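For comparison, the same per-day split can be sketched imperatively in pandas; this is only an illustration on the sample data above, not a Postgres solution:
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-01-01 15:00:00',
                                 '2018-01-02 23:00:00',
                                 '2018-01-04 09:00:00']),
    'interval': pd.to_timedelta(['06:00:00', '04:00:00', '2 days 16:00:00']),
})

rows = []
for start, dur in zip(df['timestamp'], df['interval']):
    end = start + dur
    cur = start
    while cur < end:
        # cut each chunk at the next midnight, or at the real end if that comes first
        next_midnight = cur.normalize() + pd.Timedelta(days=1)
        chunk_end = min(end, next_midnight)
        rows.append({'timestamp': cur, 'interval': chunk_end - cur})
        cur = chunk_end
out = pd.DataFrame(rows)
print(out)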