Time Series Downsampling/Upsampling - sql

I am trying to downsample and upsample time series data in MonetDB.
Time series database systems (TSDB) usually support downsampling and upsampling with an operator such as SAMPLE BY (1h).
My time series data looks like the following:
sql>select * from datapoints limit 5;
+----------------------------+------------+-------------+-----------+------+--------+-------------------+
| time                       | id_station | temperature | discharge | ph   | oxygen | oxygen_saturation |
+============================+============+=============+===========+======+========+===================+
| 2019-03-01 00:00:00.000000 |          0 |     407.052 |     0.954 | 7.79 |  12.14 |             12.14 |
| 2019-03-01 00:00:10.000000 |          0 |     407.052 |     0.954 | 7.79 |  12.13 |             12.13 |
| 2019-03-01 00:00:20.000000 |          0 |     407.051 |     0.954 | 7.79 |  12.13 |             12.13 |
| 2019-03-01 00:00:30.000000 |          0 |     407.051 |     0.953 | 7.79 |  12.12 |             12.12 |
| 2019-03-01 00:00:40.000000 |          0 |     407.051 |     0.952 | 7.78 |  12.12 |             12.12 |
+----------------------------+------------+-------------+-----------+------+--------+-------------------+
I tried the following query, but the results are obtained by aggregating the values from all days into the same 24 hourly buckets, which is not what I am looking for:
sql>SELECT EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "hour";
+------+--------------------------+
| hour | avg_ph                   |
+======+==========================+
|    0 |        8.041121283524923 |
|    1 |        8.041086970785418 |
|    2 |        8.041152801724111 |
|    3 |         8.04107828783526 |
|    4 |        8.041060110153223 |
|    5 |        8.041167286877407 |
|  ... |                      ... |
|   23 |        8.041219444444451 |
+------+--------------------------+
I then tried to aggregate the time series data first by day and then by hour:
SELECT EXTRACT(DATE FROM time) AS "day", EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";
But I am getting the following exception:
syntax error, unexpected sqlDATE in: "select extract(date"
My question: how could I aggregate/downsample the data to a specific period of time (e.g. obtain an aggregated value every 2 days or 12 hours)?
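One approach that should work (a sketch, using the datapoints table above and not tested on every MonetDB version): DATE is not a valid field for EXTRACT, but the timestamp can be cast to DATE for the daily part, and the extracted hour can be integer-divided to build coarser buckets such as 12-hour windows.
-- Sketch: hourly averages per calendar day (cast to DATE instead of EXTRACT(DATE ...)).
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) AS "hour",
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", "hour"
ORDER BY "day", "hour";
-- Sketch: 12-hour buckets, by integer-dividing the hour.
SELECT CAST(time AS DATE) AS "day",
       EXTRACT(HOUR FROM time) / 12 AS "bucket",
       AVG(ph) AS avg_ph
FROM datapoints
GROUP BY "day", "bucket"
ORDER BY "day", "bucket";
For multi-day buckets (e.g. every 2 days), the same integer-division trick can be applied to a day number derived from the timestamp, for example the number of days since a fixed reference date.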

Related

Get average time difference grouped by another column in postgresql

I have a posts table where I'm interested in calculating the average time difference between each author's posts. Here is a minimal example:
+---------------+---------------------+
| post_author   | post_date           |
|---------------+---------------------|
| 0             | 2019-03-05 19:12:24 |
| 1             | 2017-11-06 18:28:43 |
| 1             | 2017-11-06 18:28:43 |
| 1             | 2017-11-06 18:28:43 |
| 1             | 2017-11-06 18:28:43 |
| 1             | 2018-02-19 18:36:36 |
| 1             | 2018-02-19 18:36:36 |
| 1             | 2018-02-19 18:36:36 |
| 1             | 2018-02-19 18:36:36 |
| 1             | 2018-02-19 18:40:09 |
+---------------+---------------------+
So for each author, I essentially want to get the deltas of their time series and then find the average (grouped by author). The end result would look something like:
+---------------+---------------------+
| post_author   | post_date_delta(hrs)|
|---------------+---------------------|
| 0             | 0                   |
| 1             | 327                 |
| 2             | 95                  |
| ...           | ...                 |
+---------------+---------------------+
I can think of how to do it in Python, but I'm struggling to write a (postgres) SQL query to accomplish this. Any help is appreciated!
You can use aggregation and arithmetic:
select post_author,
(max(post_date) - min(post_date)) / nullif(count(*) - 1, 0)
from t
group by post_author;
The average gap is the difference between the maximum and minimum post dates, divided by one less than the number of posts.
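Since the desired output is expressed in hours, a variation on the same idea (a sketch; it assumes post_date is a timestamp, so the subtraction yields an interval, and uses the question's posts table in place of t) converts the interval with extract(epoch from ...) and falls back to 0 for authors with a single post:
-- Sketch: average gap in hours, with 0 for single-post authors.
select post_author,
       coalesce(extract(epoch from max(post_date) - min(post_date))
                / 3600.0
                / nullif(count(*) - 1, 0), 0) as post_date_delta_hrs
from posts
group by post_author;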

Get a row for each day of date range even when no values exist

I have a readings table. It is defined as:
   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 created_at | timestamp without time zone |           | not null |
 device     | character varying(25)       |           | not null |
 type       | character varying(25)       |           | not null |
 value      | numeric                     |           | not null |
It has data such as:
     created_at      |  device   |    type     |    value
---------------------+-----------+-------------+-------------
 2021-03-16 07:46:47 | 465125783 | temperature | 36.5
 2021-03-16 07:51:48 | 465125783 | temperature | 36.40000153
 2021-03-16 07:52:47 | 465125783 | temperature | 36.40000153
 2021-03-16 07:53:47 | 465125783 | temperature | 36.29999924
 2021-03-24 17:53:47 | 123456789 | pressure    | 79
 2021-03-24 17:54:48 | 123456789 | pressure    | 77
 2021-03-28 05:38:48 | 123456789 | flow        | 12
 2021-03-28 05:45:48 | 123456789 | flow        | 14
 2021-03-28 05:49:47 | 123456789 | pressure    | 65
 2021-03-28 05:50:47 | 123456789 | flow        | 32
 2021-03-28 05:51:47 | 123456789 | flow        | 40
Current Query
So far I have the following query:
select created_at::date, device,
       avg(value) filter (where type = 'temperature') as temperature,
       avg(value) filter (where type = 'pressure') as pressure,
       avg(value) filter (where type = 'flow') as flow
from readings
where device = '123456789' and created_at::date > current_date - interval '14 days'
group by created_at::date, device
order by created_at::date desc;
The query works out a daily average value for each type for the past two weeks.
Current Output
When I run the query, I get the following:
 created_at |  device   | temperature |      pressure       |        flow
------------+-----------+-------------+---------------------+---------------------
 2021-03-28 | 123456789 |             | 65.0000000000000000 | 24.5000000000000000
 2021-03-24 | 123456789 |             | 78.0000000000000000 |
Desired Output
What I really want is a row for each date for the past two weeks, so I want to end up with:
 created_at |  device   | temperature |      pressure       |        flow
------------+-----------+-------------+---------------------+---------------------
 2021-04-02 | 123456789 |             |                     |
 2021-04-01 | 123456789 |             |                     |
 2021-03-31 | 123456789 |             |                     |
 2021-03-30 | 123456789 |             |                     |
 2021-03-29 | 123456789 |             |                     |
 2021-03-28 | 123456789 |             | 65.0000000000000000 | 24.5000000000000000
 2021-03-27 | 123456789 |             |                     |
 2021-03-26 | 123456789 |             |                     |
 2021-03-25 | 123456789 |             |                     |
 2021-03-24 | 123456789 |             | 78.0000000000000000 |
 2021-03-23 | 123456789 |             |                     |
 2021-03-22 | 123456789 |             |                     |
 2021-03-21 | 123456789 |             |                     |
 2021-03-20 | 123456789 |             |                     |
How can I achieve that?
I have a db-fiddle.
Use generate_series():
select gs.dte, '123456789' as device,
       avg(value) filter (where type = 'temperature') as temperature,
       avg(value) filter (where type = 'pressure') as pressure,
       avg(value) filter (where type = 'flow') as flow
from generate_series('2021-03-20'::date, '2021-04-02'::date, interval '1 day') gs(dte)
left join readings r
       on r.device = '123456789'
      and r.created_at::date = gs.dte
group by gs.dte
order by gs.dte desc;
Here is a db<>fiddle.
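If the two-week window should always track the current date rather than the hard-coded bounds, the series can be derived from current_date instead (a sketch along the same lines, keeping the answer's hard-coded device id):
-- Sketch: the same report over a rolling 14-day window ending today.
select gs.dte::date as created_at, '123456789' as device,
       avg(value) filter (where type = 'temperature') as temperature,
       avg(value) filter (where type = 'pressure') as pressure,
       avg(value) filter (where type = 'flow') as flow
from generate_series(current_date - interval '13 days', current_date, interval '1 day') gs(dte)
left join readings r
       on r.device = '123456789'
      and r.created_at::date = gs.dte::date
group by gs.dte
order by gs.dte desc;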

Join on minimum date between two dates - Spark SQL

I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to month_df where daily_date is >= month_end_date and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
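A sketch of one way to express this in Spark SQL (assuming the data is available as views named month_df and daily_df, and using a 5-day buffer): range-join the daily rows onto each month end, drop null statuses, and keep the earliest remaining row per (ID, month_end_date) with ROW_NUMBER.
-- Sketch: earliest non-null daily status within 5 days of each month end.
SELECT ID, month_end_date, daily_date, new_status
FROM (
    SELECT m.ID,
           m.month_end_date,
           d.daily_date,
           d.new_status,
           ROW_NUMBER() OVER (
               PARTITION BY m.ID, m.month_end_date
               ORDER BY d.daily_date
           ) AS rn
    FROM month_df m
    JOIN daily_df d
      ON d.ID = m.ID
     AND d.daily_date >= m.month_end_date
     AND d.daily_date < date_add(m.month_end_date, 5)
     AND d.new_status IS NOT NULL
) ranked
WHERE rn = 1;
Switching the JOIN to a LEFT JOIN would keep month-end rows that have no qualifying daily record.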

How to perform aggregation(sum) group by month field using HiveQL?

Below is my data. I am looking to generate the sum of revenue per month using the columns event_time and price.
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| oct_data.event_time | oct_data.event_type | oct_data.product_id | oct_data.category_id | oct_data.category_code | oct_data.brand | oct_data.price | oct_data.user_id | oct_data.user_session |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| 2019-10-01 00:00:00 UTC | cart | 5773203 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:03 UTC | cart | 5773353 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:07 UTC | cart | 5881589 | 2151191071051219817 | | lovely | 13.48 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:07 UTC | cart | 5723490 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:15 UTC | cart | 5881449 | 1487580013522845895 | | lovely | 0.56 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:16 UTC | cart | 5857269 | 1487580005134238553 | | runail | 2.62 | 430174032 | 73dea1e7-664e-43f4-8b30-d32b9d5af04f |
| 2019-10-01 00:00:19 UTC | cart | 5739055 | 1487580008246412266 | | kapous | 4.75 | 377667011 | 81326ac6-daa4-4f0a-b488-fd0956a78733 |
| 2019-10-01 00:00:24 UTC | cart | 5825598 | 1487580009445982239 | | | 0.56 | 467916806 | 2f5b5546-b8cb-9ee7-7ecd-84276f8ef486 |
| 2019-10-01 00:00:25 UTC | cart | 5698989 | 1487580006317032337 | | | 1.27 | 385985999 | d30965e8-1101-44ab-b45d-cc1bb9fae694 |
| 2019-10-01 00:00:26 UTC | view | 5875317 | 2029082628195353599 | | | 1.59 | 474232307 | 445f2b74-5e4c-427e-b7fa-6e0a28b156fe |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
I have used the query below, but the aggregation does not seem to happen as expected. Please suggest the best approach to generate the desired output.
select date_format(event_time,'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time,'MM')
order by Month;
Note: event_time field is in TIMESTAMP format.
First convert the timestamp to date and then apply date_format():
select date_format(cast(event_time as date),'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(cast(event_time as date),'MM')
order by Month;
This will work if all the dates are in the same year.
If not, you should also group by the year.
Your code should work -- unless you are using an old version of Hive. date_format() has accepted a timestamp argument since 1.1.2 -- released in early 2016. That said, I would strongly suggest that you include the year:
select date_format(event_time, 'yyyy-MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time, 'yyyy-MM')
order by Month;
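If the year needs to be kept as its own column, a variant along the same lines (a sketch; year() and month() accept a timestamp in Hive) groups on both parts separately:
-- Sketch: year and month as separate columns.
select year(event_time) as yr,
       month(event_time) as mon,
       sum(price) as Monthly_Revenue
from oct_data_new
group by year(event_time), month(event_time)
order by yr, mon;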

How to add total sum of each rows within group with DAX Power BI

I am trying to add a rank column for every group, repeated on every row within the group in the original table, rather than in an aggregated (summed-up) table.
I found the formula on another site, but it shows an error:
https://intellipaat.com/community/9734/rank-categories-by-sum-power-bi
Table1
+-----------+------------+-------+
| product   | date       | sales |
+-----------+------------+-------+
| coffee    | 11/03/2019 | 15    |
| coffee    | 12/03/2019 | 10    |
| coffee    | 13/03/2019 | 28    |
| coffee    | 14/03/2019 | 1     |
| tea       | 11/03/2019 | 5     |
| tea       | 12/03/2019 | 2     |
| tea       | 13/03/2019 | 6     |
| tea       | 14/03/2019 | 7     |
| Chocolate | 11/03/2019 | 30    |
| Chocolate | 11/03/2019 | 4     |
| Chocolate | 11/03/2019 | 15    |
| Chocolate | 11/03/2019 | 10    |
+-----------+------------+-------+
The Goal
+-----------+------------+-------+-----+------+
| product   | date       | sales | sum | rank |
+-----------+------------+-------+-----+------+
| coffee    | 11/03/2019 | 15    | 54  | 5    |
| coffee    | 12/03/2019 | 10    | 54  | 5    |
| coffee    | 13/03/2019 | 28    | 54  | 5    |
| coffee    | 14/03/2019 | 1     | 54  | 5    |
| tea       | 11/03/2019 | 5     | 20  | 9    |
| tea       | 12/03/2019 | 2     | 20  | 9    |
| tea       | 13/03/2019 | 6     | 20  | 9    |
| tea       | 14/03/2019 | 7     | 20  | 9    |
| Chocolate | 11/03/2019 | 30    | 59  | 1    |
| Chocolate | 11/03/2019 | 4     | 59  | 1    |
| Chocolate | 11/03/2019 | 15    | 59  | 1    |
| Chocolate | 11/03/2019 | 10    | 59  | 1    |
+-----------+------------+-------+-----+------+
The script
sum =
SUMX(
    FILTER(
        Table1;
        Table1[product] = EARLIER(Table1[product])
    );
    Table1[sales]
)
The Error :
EARLIER(Table1[product]) # Parameter is not correct type cannot find name 'product'
What's wrong with the script above?
* I am also not able to test this rank script until the sum column above is fixed:
rank = RANKX( ALL(Table1); Table1[sum]; ;; "Dense" )
The script is designed for a calculated column, not a measure. If you enter it as a measure, EARLIER has no "previous" row context to refer to, and gives you the error.
Create a measure:
Total Sales = SUM(Table1[sales])
This measure will be used to show sales.
Create another measure:
Sales by Product =
SUMX(
    VALUES(Table1[product]);
    CALCULATE([Total Sales]; ALL(Table1[date]))
)
This measure will show sales by product ignoring dates.
Third measure:
Sale Rank =
RANKX(
    ALL(Table1[product]; Table1[date]);
    [Sales by Product]; ; DESC; Dense
)
Create a report with product and dates on a pivot, and drop all 3 measures into it.
Tweak RANKX parameters to change the ranking mode, if necessary.