How to correctly resample time series data? - sql

I have time series data in a PostgreSQL database. Currently I import it into pandas for analysis, and my first step is often to resample the 5-minute data to 1-hour averages. To do this I first pivot the data into wide form, then resample it to 1 hour, and finally melt it back into long form.
Now I want to do the resampling on the database so that I can import the 1-hour averages right away.
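For context, the hourly averaging itself does not require the pivot/melt round trip in pandas; here is a minimal sketch of the grouped version (assuming a DataFrame df with the datetime, value, type and name columns shown below, with datetime already parsed as datetime64):
import pandas as pd
# group into hourly buckets per type and name, then average the values
hourly = (df.groupby([pd.Grouper(key='datetime', freq='1H'), 'type', 'name'])['value']
            .mean()
            .reset_index())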
This is what the data looks like in the database. There are two different types and three different names.
datetime value type name
0 2018-01-01 13:35:00+01:00 0.22 HLN NO2
1 2018-01-01 13:35:00+01:00 0.31 HLN CO
2 2018-01-01 13:35:00+01:00 1.15 HLN NO
3 2018-01-01 13:40:00+01:00 1.80 AIS NO2
4 2018-01-01 13:40:00+01:00 2.60 AIS CO
5 2018-01-01 13:40:00+01:00 2.30 AIS NO
6 2018-01-01 13:45:00+01:00 2.25 HLN NO2
7 2018-01-01 13:45:00+01:00 2.14 HLN CO
8 2018-01-01 13:45:00+01:00 2.96 HLN NO
9 2018-01-01 14:35:00+01:00 0.76 HLN NO2
10 2018-01-01 14:35:00+01:00 0.80 HLN CO
11 2018-01-01 14:35:00+01:00 1.19 HLN NO
12 2018-01-01 14:40:00+01:00 1.10 AIS NO2
13 2018-01-01 14:40:00+01:00 2.87 AIS CO
14 2018-01-01 14:40:00+01:00 2.80 AIS NO
15 2018-01-01 14:45:00+01:00 3.06 HLN NO2
16 2018-01-01 13:45:00+01:00 2.86 HLN CO
17 2018-01-01 13:45:00+01:00 2.22 HLN NO
Now comes the part I have problems with. After resampling in pandas and plotting with plotly, I get the expected result: one value for every hour.
After running the following SQL query to resample to one hour:
SELECT date_trunc('hour', datetime) AS hour, type, AVG(value) AS measure, name
FROM data_table
GROUP BY datetime, type, name
ORDER BY datetime
When I plot the result, it is not smooth and there are multiple values within each hour; I suspect all of the original 5-minute values within each hour are still there.
My question: how can I correctly resample a time series in SQL?
Edit: Expected result in table form:
datetime value type name
2018-01-01 13:00:00 1.235 HLN NO2
2018-01-01 13:00:00 2.65 HLN CO
2018-01-01 13:00:00 2.96 HLN NO
2018-01-01 13:00:00 2.48 AIS NO2
2018-01-01 13:00:00 2.65 AIS CO
2018-01-01 13:00:00 2.26 AIS NO
2018-01-01 14:00:00 2.78 HLN NO2
2018-01-01 14:00:00 3.65 HLN CO
2018-01-01 14:00:00 1.95 HLN NO
2018-01-01 14:00:00 1.45 AIS NO2
2018-01-01 14:00:00 1.64 AIS CO
2018-01-01 14:00:00 3.23 AIS NO
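Note that the query above groups by the raw datetime column, so every distinct 5-minute timestamp remains its own group. A minimal sketch of the corrected query, grouping by the truncated hour instead (same table and column names as above), would be:
SELECT date_trunc('hour', datetime) AS hour, type, name, AVG(value) AS measure
FROM data_table
GROUP BY date_trunc('hour', datetime), type, name
ORDER BY hour;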

An alternative is to create the time intervals with generate_series(), or simply to truncate the timestamps to the hour per type and name in a subquery / CTE, and then in the outer query join that back to the records and aggregate the values with avg() grouped by hour, type and name, e.g.
WITH j AS (
  SELECT DISTINCT date_trunc('hour', datetime) AS hour, type, name
  FROM data_table
)
SELECT j.*, avg(d.value)
FROM data_table d
JOIN j ON date_trunc('hour', d.datetime) = j.hour AND
          j.type = d.type AND
          d.name = j.name
GROUP BY j.hour, j.name, j.type
ORDER BY j.hour ASC, j.type DESC;
hour | type | name | avg
---------------------+------+------+------------------------
2018-01-01 13:00:00 | HLN | CO | 1.7700000000000000
2018-01-01 13:00:00 | HLN | NO | 2.1100000000000000
2018-01-01 13:00:00 | HLN | NO2 | 1.23500000000000000000
2018-01-01 13:00:00 | AIS | CO | 2.6000000000000000
2018-01-01 13:00:00 | AIS | NO | 2.3000000000000000
2018-01-01 13:00:00 | AIS | NO2 | 1.80000000000000000000
2018-01-01 14:00:00 | HLN | CO | 0.80000000000000000000
2018-01-01 14:00:00 | HLN | NO | 1.19000000000000000000
2018-01-01 14:00:00 | HLN | NO2 | 1.9100000000000000
2018-01-01 14:00:00 | AIS | CO | 2.8700000000000000
2018-01-01 14:00:00 | AIS | NO | 2.8000000000000000
2018-01-01 14:00:00 | AIS | NO2 | 1.10000000000000000000
Demo: db<>fiddle
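For completeness, a rough sketch of the generate_series() variant mentioned above, assuming fixed start and end bounds for the hourly grid (adjust the literals to the actual data range):
WITH hours AS (
  SELECT generate_series(timestamptz '2018-01-01 13:00:00+01',
                         timestamptz '2018-01-01 14:00:00+01',
                         interval '1 hour') AS hour
)
SELECT h.hour, d.type, d.name, avg(d.value) AS measure
FROM hours h
JOIN data_table d ON date_trunc('hour', d.datetime) = h.hour
GROUP BY h.hour, d.type, d.name
ORDER BY h.hour;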

Related

Transform the dataframe containing time series

I have a dataframe like this:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 03:00:00 A01 50
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
Given a start and end date (start = 2018-01-01 00:00:00 and end = 2018-01-01 05:00:00),
I would like to transform the dataframe as shown below (every missing entry should be zero or NULL).
Output like:
datestamp Name Reading
2018-01-01 00:00:00 A01 40
2018-01-01 01:00:00 A01 50
2018-01-01 02:00:00 A01 00
2018-01-01 03:00:00 A01 50
2018-01-01 04:00:00 A01 00
2018-01-01 05:00:00 A01 00
2018-01-01 00:00:00 A02 00
2018-01-01 01:00:00 A02 50
2018-01-01 02:00:00 A02 40
2018-01-01 03:00:00 A02 30
2018-01-01 04:00:00 A02 00
2018-01-01 05:00:00 A02 00
I am directionless, so I have no approach as of now.
Let's try pivoting the table and reindexing:
import pandas as pd
# assumes df['datestamp'] is already a datetime64 column
new_dates = pd.date_range('2018-01-01 00:00:00',
                          '2018-01-01 05:00:00',
                          freq='H', name='datestamp')
(df.pivot_table(index='datestamp', columns='Name',
                values='Reading', fill_value=0)
   .reindex(new_dates, fill_value=0)      # add the missing hours, filled with 0
   .stack().sort_index(level=[1, 0])      # back to long form, sorted by Name then time
   .reset_index(name='Reading')
)
Output:
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 0
3 2018-01-01 03:00:00 A01 50
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 00:00:00 A02 0
7 2018-01-01 01:00:00 A02 50
8 2018-01-01 02:00:00 A02 40
9 2018-01-01 03:00:00 A02 30
10 2018-01-01 04:00:00 A02 0
11 2018-01-01 05:00:00 A02 0
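Another way to build the same full grid, as a rough sketch (assuming datestamp is already a datetime column and reusing the new_dates range from above):
idx = pd.MultiIndex.from_product([df['Name'].unique(), new_dates],
                                 names=['Name', 'datestamp'])
out = (df.set_index(['Name', 'datestamp'])
         .reindex(idx, fill_value=0)
         .reset_index())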
Use pd.date_range(), df.loc[] and a nested for loop:
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A02 50
4 2018-01-01 04:00:00 A02 40
5 2018-01-01 05:00:00 A02 30
import datetime as dt
import pandas as pd
start_date = dt.datetime.strptime('2018-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end_date = dt.datetime.strptime('2018-01-01 05:00:00', '%Y-%m-%d %H:%M:%S')
date_range = pd.date_range(start=start_date, end=end_date, freq='H')
for date in date_range:
    for sensor in df.Name.unique():
        if len(df.loc[(df['datestamp'] == date) & (df['Name'] == sensor)]) == 0:
            # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
            df = df.append({'datestamp': date, 'Name': sensor, 'Reading': 0}, ignore_index=True)
df = df.sort_values(by=['Name', 'datestamp']).reset_index(drop=True)
df
datestamp Name Reading
0 2018-01-01 00:00:00 A01 40
1 2018-01-01 01:00:00 A01 50
2 2018-01-01 02:00:00 A01 50
3 2018-01-01 03:00:00 A01 0
4 2018-01-01 04:00:00 A01 0
5 2018-01-01 05:00:00 A01 0
6 2018-01-01 03:00:00 A02 50
7 2018-01-01 04:00:00 A02 40
8 2018-01-01 05:00:00 A02 30
9 2018-01-01 00:00:00 A02 0
10 2018-01-01 01:00:00 A02 0
11 2018-01-01 02:00:00 A02 0
Pardon the column headings.
You can use the pandas fillna method (see the documentation):
df['column_name'] = df['column_name'].fillna(0)

SQL insert values from previous date if specific date information is missing

I have got the following table.
date2 Group number
2020-28-05 00:00:00 A 55
2020-28-05 00:00:00 B 1.09
2020-28-05 00:00:00 C 1.8
2020-29-05 00:00:00 A 68
2020-29-05 00:00:00 B 1.9
2020-29-05 00:00:00 C 1.19
2020-01-06 00:00:00 A 10
2020-01-06 00:00:00 B 15
2020-01-06 00:00:00 C 0.88
2020-02-06 00:00:00 A 22
2020-02-06 00:00:00 B 15
2020-02-06 00:00:00 C 13
2020-03-06 00:00:00 A 66
2020-03-06 00:00:00 B 88
2020-03-06 00:00:00 C 99
As you can see, the dates 2020-30-05 and 2020-31-05 are missing from this table, so it is necessary to fill these dates with the 2020-29-05 values, grouped by Group. As a result, the final output should look like this:
date2 Group number
2020-28-05 00:00:00 A 55
2020-28-05 00:00:00 B 1.09
2020-28-05 00:00:00 C 1.8
2020-29-05 00:00:00 A 68
2020-29-05 00:00:00 B 1.9
2020-29-05 00:00:00 C 1.19
2020-30-05 00:00:00 A 68
2020-30-05 00:00:00 B 1.9
2020-30-05 00:00:00 C 1.19
2020-31-05 00:00:00 A 68
2020-31-05 00:00:00 B 1.9
2020-31-05 00:00:00 C 1.19
2020-01-06 00:00:00 A 10
2020-01-06 00:00:00 B 15
2020-01-06 00:00:00 C 0.88
2020-02-06 00:00:00 A 22
2020-02-06 00:00:00 B 15
2020-02-06 00:00:00 C 13
2020-03-06 00:00:00 A 66
2020-03-06 00:00:00 B 88
2020-03-06 00:00:00 C 99
I tried to do it in the following way:
create a temporary table (table B) with only the dates for the period 2020-28-05 till 2020-03-06, and then use a left merge, so that the new dates come out as NULL (in order to then use a CASE WHEN NULL to fill in the last value). However, it does not work: when merging I get a NULL only once per date, but there should be three per date (one per group). This is only part of a larger dataset. Can you help me get the necessary output?
PS I use Vertica
It's Vertica, and Vertica has the TIMESERIES clause, which seems to match exactly what you need:
Out of a time series like yours, with irregular intervals between the rows or with longer gaps in an otherwise regular series, it creates a regular time series, with the interval between each pair of rows that you specify in the AS sub-clause of the TIMESERIES clause itself. TS_FIRST_VALUE() and TS_LAST_VALUE() are functions that rely on that clause and return the right value, deduced from the input rows, at each generated timestamp. That value can be obtained 'const', that is, taken from the original row closest to the generated timestamp, or 'linear', that is, interpolated between the original row just before and the original row just after the generated timestamp. For your needs, you would use the constant value. See here:
WITH
-- your input ....
input(tmstmp,grp,nbr) AS (
SELECT TIMESTAMP '2020-05-28 00:00:00','A',55
UNION ALL SELECT TIMESTAMP '2020-05-28 00:00:00','B',1.09
UNION ALL SELECT TIMESTAMP '2020-05-28 00:00:00','C',1.8
UNION ALL SELECT TIMESTAMP '2020-05-29 00:00:00','A',68
UNION ALL SELECT TIMESTAMP '2020-05-29 00:00:00','B',1.9
UNION ALL SELECT TIMESTAMP '2020-05-29 00:00:00','C',1.19
UNION ALL SELECT TIMESTAMP '2020-06-01 00:00:00','A',10
UNION ALL SELECT TIMESTAMP '2020-06-01 00:00:00','B',15
UNION ALL SELECT TIMESTAMP '2020-06-01 00:00:00','C',0.88
UNION ALL SELECT TIMESTAMP '2020-06-02 00:00:00','A',22
UNION ALL SELECT TIMESTAMP '2020-06-02 00:00:00','B',15
UNION ALL SELECT TIMESTAMP '2020-06-02 00:00:00','C',13
UNION ALL SELECT TIMESTAMP '2020-06-03 00:00:00','A',66
UNION ALL SELECT TIMESTAMP '2020-06-03 00:00:00','B',88
UNION ALL SELECT TIMESTAMP '2020-06-03 00:00:00','C',99
)
-- real query here ...
SELECT
ts AS tmstmp
, grp
, TS_FIRST_VALUE(nbr,'const') AS nbr
FROM input
TIMESERIES ts AS '1 DAY' OVER(PARTITION BY grp ORDER BY tmstmp)
ORDER BY 1,2
;
-- out tmstmp | grp | nbr
-- out ---------------------+-----+-------
-- out 2020-05-28 00:00:00 | A | 55.00
-- out 2020-05-28 00:00:00 | B | 1.09
-- out 2020-05-28 00:00:00 | C | 1.80
-- out 2020-05-29 00:00:00 | A | 68.00
-- out 2020-05-29 00:00:00 | B | 1.90
-- out 2020-05-29 00:00:00 | C | 1.19
-- out 2020-05-30 00:00:00 | A | 68.00
-- out 2020-05-30 00:00:00 | B | 1.90
-- out 2020-05-30 00:00:00 | C | 1.19
-- out 2020-05-31 00:00:00 | A | 68.00
-- out 2020-05-31 00:00:00 | B | 1.90
-- out 2020-05-31 00:00:00 | C | 1.19
-- out 2020-06-01 00:00:00 | A | 10.00
-- out 2020-06-01 00:00:00 | B | 15.00
-- out 2020-06-01 00:00:00 | C | 0.88
-- out 2020-06-02 00:00:00 | A | 22.00
-- out 2020-06-02 00:00:00 | B | 15.00
-- out 2020-06-02 00:00:00 | C | 13.00
-- out 2020-06-03 00:00:00 | A | 66.00
-- out 2020-06-03 00:00:00 | B | 88.00
-- out 2020-06-03 00:00:00 | C | 99.00

Add 10 to 40 minutes randomly to a datetime column in pandas

I have a data frame as shown below
start
2010-01-06 09:00:00
2018-01-07 08:00:00
2012-01-08 11:00:00
2016-01-07 08:00:00
2010-02-06 14:00:00
2018-01-07 16:00:00
To the above df, I would like to add a column called 'finish' by adding a random number of minutes between 10 and 40 to the start column (drawn with replacement).
Expected Output:
start finish
2010-01-06 09:00:00 2010-01-06 09:20:00
2018-01-07 08:00:00 2018-01-07 08:12:00
2012-01-08 11:00:00 2012-01-08 11:38:00
2016-01-07 08:00:00 2016-01-07 08:15:00
2010-02-06 14:00:00 2010-02-06 14:24:00
2018-01-07 16:00:00 2018-01-07 16:36:00
Create timedeltas by to_timedelta and numpy.random.randint for integers between 10 and 40:
import numpy as np
import pandas as pd
# numpy's randint excludes the upper bound, so use 41 to allow 40 minutes as well
arr = np.random.randint(10, 41, size=len(df))
df['finish'] = df['start'] + pd.to_timedelta(arr, unit='Min')
print(df)
start finish
0 2010-01-06 09:00:00 2010-01-06 09:25:00
1 2018-01-07 08:00:00 2018-01-07 08:30:00
2 2012-01-08 11:00:00 2012-01-08 11:29:00
3 2016-01-07 08:00:00 2016-01-07 08:12:00
4 2010-02-06 14:00:00 2010-02-06 14:31:00
5 2018-01-07 16:00:00 2018-01-07 16:39:00
You can achieve it by using pandas.Series.apply() in combination with pandas.to_timedelta() and random.randint().
from random import randint
import pandas as pd
df['finish'] = df.start.apply(lambda dt: dt + pd.to_timedelta(randint(10, 40), unit='m'))

Splitting value dataframe over multiple timeslots

I would like to spread the values of the 15-minute intervals evenly over the 5-minute intervals, but I cannot get it to work. The data is:
Datetime a
2018-01-01 00:00:00 6
2018-01-01 00:15:00 3
2018-01-01 00:30:00 9
Desired output would be:
Datetime a
2018-01-01 00:00:00 2
2018-01-01 00:05:00 2
2018-01-01 00:10:00 2
2018-01-01 00:15:00 1
2018-01-01 00:20:00 1
2018-01-01 00:25:00 1
2018-01-01 00:30:00 3
2018-01-01 00:35:00 3
2018-01-01 00:40:00 3
Perhaps unnecessary to add, but the value '6' at 00:00:00 in the data is spread over the intervals 00:00:00 through 00:10:00.
Slightly different approach:
import pandas as pd
# convert to datetime
df.Datetime = pd.to_datetime(df.Datetime)
# set Datetime as index
df.set_index('Datetime', inplace=True)
# add one extra row so the last 15-minute block also gets three 5-minute slots
df.loc[df.index.max() + pd.to_timedelta('10min')] = 0
# upsample to a 5-minute frequency, filling the new rows with 0
s = df.asfreq('5T', fill_value=0)
# each non-zero value starts a new group; replace every value with its group mean
(s.groupby(s['a'].ne(0).cumsum())
  .transform('mean')
  .reset_index()
)
Output:
Datetime a
0 2018-01-01 00:00:00 2
1 2018-01-01 00:05:00 2
2 2018-01-01 00:10:00 2
3 2018-01-01 00:15:00 1
4 2018-01-01 00:20:00 1
5 2018-01-01 00:25:00 1
6 2018-01-01 00:30:00 3
7 2018-01-01 00:35:00 3
8 2018-01-01 00:40:00 3
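A rough alternative sketch (not part of the answer above, and assuming every 15-minute value always splits over exactly three 5-minute slots): forward-fill onto a 5-minute grid and divide by 3.
import pandas as pd
s = df.assign(Datetime=pd.to_datetime(df['Datetime'])).set_index('Datetime')['a']
s.loc[s.index.max() + pd.Timedelta('15min')] = 0   # helper row so the last block expands too
out = (s.resample('5T').ffill()   # repeat each 15-minute value on its three 5-minute slots
         .div(3)                  # split it evenly
         .iloc[:-1]               # drop the helper row again
         .reset_index())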

Splitting interval overlapping more days in PostgreSQL

I have a PostgreSQL table containing a start timestamp and a duration:
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00
2018-01-04 09:00:00 | 2 days 16 hours
What I would like is to have the interval split per day, like this:
timestamp | interval
------------------------------
2018-01-01 15:00:00 | 06:00:00
2018-01-02 23:00:00 | 01:00:00
2018-01-03 00:00:00 | 03:00:00
2018-01-04 09:00:00 | 15:00:00
2018-01-05 00:00:00 | 24:00:00
2018-01-06 00:00:00 | 24:00:00
2018-01-07 00:00:00 | 01:00:00
I am playing with generate_series(), width_bucket() and range functions, but I still can't find a plausible solution. Is there any existing or working solution?
Not sure about all the edge cases, but this seems to work:
t=# with c as (
      select *, min(t) over (), max(t+i) over (), tsrange(date_trunc('day',t), t+i) tr
      from t
    )
    , mid as (
      select distinct t, i, g, tr
           , case when g < t then t else g end tt
      from c
      right outer join (
        select generate_series(date_trunc('day',min), date_trunc('day',max), '1 day') g from c
      ) e on g <# tr
      order by 3, 1
    )
    select
      tt
    , i
    , case when tt + '1 day' > upper(tr) and t < g then upper(tr)::time::interval
           when upper(tr) - lower(tr) < '1 day' then i
           else g + '1 day' - tt
      end
    from mid
    order by tt;
tt | i | case
---------------------+-----------------+----------
2018-01-01 15:00:00 | 06:00:00 | 06:00:00
2018-01-02 23:00:00 | 04:00:00 | 01:00:00
2018-01-03 00:00:00 | 04:00:00 | 03:00:00
2018-01-04 09:00:00 | 2 days 16:00:00 | 15:00:00
2018-01-05 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-06 00:00:00 | 2 days 16:00:00 | 1 day
2018-01-07 00:00:00 | 2 days 16:00:00 | 01:00:00
(7 rows)
also please mind that timestamp without time zone can fail you when comparing timestamps...
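A possibly simpler alternative, sketched only (reusing the t and i column names from the query above, and not checked against every edge case): cross join each row with one generate_series() day per calendar day it touches, then clamp the slice boundaries with GREATEST() and LEAST().
SELECT GREATEST(t, d) AS ts,
       LEAST(t + i, d + interval '1 day') - GREATEST(t, d) AS part
FROM   t,
       generate_series(date_trunc('day', t),
                       date_trunc('day', t + i),
                       interval '1 day') AS d
WHERE  LEAST(t + i, d + interval '1 day') > GREATEST(t, d)
ORDER  BY ts;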