Fill Sparse Data with SQL (Rockset)

We have created the following query to convert sparse time series data into dense data with fixed time slots. The idea is that a time range (e.g. 1 hour) is divided into distinct time slots (e.g. 60 x 1 min slots). For each slot (1 min in this example) we check whether there are one or more values; if there are, we use a MAX function to get our value. If there are no values in the slot, we carry over the value from the previous slot.
Here is the basic query:
WITH readings AS (
(
-- Get the first value before the time window to set the entry value
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T08:42:06.000000Z'
ORDER BY
ts DESC
LIMIT
1
)
UNION
(
-- Get the values in the time range
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) > TIMESTAMP '2021-10-26T08:42:06.000000Z'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T09:42:06.000000Z'
)
),
slots AS (
-- Create time slots at the correct resolution
SELECT
TIMESTAMP '2021-10-26T08:42:06.000000Z' + MINUTES(u.i - 1) AS last_ts,
TIMESTAMP '2021-10-26T08:42:06.000000Z' + MINUTES(u.i) AS ts
FROM
UNNEST(SEQUENCE(0, 60, 1) AS i) AS u
),
slot_values AS (
-- Get the values for each time slot from the readings retrieved
SELECT
slots.ts,
(
SELECT
r.value
FROM
readings r
WHERE
r.ts <= slots.ts
ORDER BY
r.ts DESC
LIMIT
1
) AS last_val,
(
SELECT
MAX(r.value)
FROM
readings r
WHERE
r.ts <= slots.ts
AND r.ts >= slots.last_ts
) AS slot_agg_val
FROM
slots
)
SELECT
-- Use the slot's MAX value if any readings fall in the slot, otherwise the previous slot's value, otherwise the last known value
CAST(ts AT TIME ZONE 'Europe/Paris' AS string) AS ts,
COALESCE(
slot_agg_val,
LAG(slot_agg_val, 1) OVER(
ORDER BY
ts
),
last_val
) AS value
FROM
slot_values
ORDER BY
ts;
The good news is that the query works. The bad news is that the performance is terrible!
Interestingly, the part of the query that retrieves the data from storage is very performant; in our case this part returns all of its results in ~50ms:
WITH readings AS (
(
-- Get the first value before the time window to set the entry value
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T08:42:06.000000Z'
ORDER BY
ts DESC
LIMIT
1
)
UNION
(
-- Get the values in the time range
SELECT
timestamp AS timestamps,
attributeId AS id,
DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) AS ts,
value AS value
FROM
node_iot_attribute_values
WHERE
attributeId = 'cu937803-ne9de7df-nn7453b2-na2c7e14'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) > TIMESTAMP '2021-10-26T08:42:06.000000Z'
AND DATE_TRUNC('second', TIMESTAMP_SECONDS(timestamp)) < TIMESTAMP '2021-10-26T09:42:06.000000Z'
)
)
Having analysed the different parts of the query, the one that blows up the performance is this:
slot_values AS (
-- Get the values for each time slot from the readings retrieved
SELECT
slots.ts,
(
SELECT
r.value
FROM
readings r
WHERE
r.ts <= slots.ts
ORDER BY
r.ts DESC
LIMIT
1
) AS last_val,
(
SELECT
MAX(r.value)
FROM
readings r
WHERE
r.ts <= slots.ts
AND r.ts >= slots.last_ts
) AS slot_agg_val
FROM
slots
)
For some reason this part takes ~25 seconds to execute! I would really appreciate some help in optimizing this query.

I would use JOIN and aggregation logic to compute this; SQL works well with map-and-reduce style logic.
Try
SELECT
filled_slots.ts,
MAX(value) AS last_val,
slot_agg_val
FROM
(
SELECT
slots.ts,
MAX(previous_r.ts) last_previous_time,
MAX(in_interval_r.value) AS slot_agg_val
FROM
slots
LEFT JOIN readings previous_r ON previous_r.ts <= slots.ts
LEFT JOIN readings in_interval_r ON in_interval_r.ts < slots.ts
AND in_interval_r.ts > slots.last_ts
GROUP BY
slots.ts
) filled_slots
LEFT JOIN readings ON filled_slots.last_previous_time = readings.ts
GROUP BY
filled_slots.ts,
slot_agg_val
The final aggregation is useful to avoid issues caused by duplicated data.
Code is not tested.
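An alternative worth trying (an untested sketch, and it assumes the engine supports IGNORE NULLS on window functions, which is worth verifying for Rockset): join readings to slots once, aggregate per slot, then gap-fill with a single window pass instead of two correlated subqueries per slot. It reuses the slots and readings CTEs from the question; note the seed reading taken from before the window would need the first slot's lower bound widened to be picked up.
WITH slot_values AS (
  SELECT
    slots.ts,
    MAX(r.value) AS slot_agg_val -- NULL when the slot has no readings
  FROM slots
  LEFT JOIN readings r ON r.ts > slots.last_ts AND r.ts <= slots.ts
  GROUP BY slots.ts
)
SELECT
  ts,
  -- carry the most recent non-NULL slot value forward (IGNORE NULLS placement varies by dialect)
  LAST_VALUE(slot_agg_val IGNORE NULLS) OVER (
    ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS value
FROM slot_values
ORDER BY ts;
This replaces the per-slot rescans of readings (where the ~25 seconds go) with one join and one linear window scan.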

Related

SQL SELECT rows where the difference between consecutive columns is less than X

Basically MySQL: Find rows where timestamp difference is less than x, but I want to stop at the first value whose timestamp difference is larger than X.
I got so far:
SELECT *
FROM (
SELECT *,
(LEAD(datetime) OVER (ORDER BY datetime)) - datetime AS difference
FROM history
) AS sq
WHERE difference < '00:01:00'
This seems to correctly return all rows where the difference between a row and the one "behind" it is less than a minute, but that still leaves large jumps in the datetimes, which I don't want. I want to select the most recent "run" of rows, where a "run" is defined as "the timestamps in datetime differ by less than a minute".
e.g., I have rows whose hypothetical timestamps are as follows:
24, 22, 21, 19, 18, 12, 11, 9, 7...
And my limit of differences is 3, i.e. I want the run of the rows whose difference between "timestamps" is less than 3; therefore just:
24, 22, 21, 19, 18
Is this possible in SQL?
You can use lag to get the previous row's timestamp and check whether the current row is within 3 minutes of it. Reset the group if the condition fails. After this grouping is done, you have to find the latest such group; use max to get it. Then get all the rows from that latest group.
Include a partition by clause in the window functions lag, sum and max if this has to be done for each id in the table.
with grps as (
select x.*,sum(col) over(order by dt) grp
from (select t.*
--checking if the current row's timestamp is within 3 minutes of the previous row
,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
from t) x
)
select dt
from (select g.*,max(grp) over() maxgrp --getting the latest group
from grps g
) g
where grp = maxgrp
The above would get you the members of the latest group even if it has only one row. To avoid such results, get the latest group that has more than one row.
with grps as (
select x.*,sum(col) over(order by dt) grp
from (select t.*
,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
from t) x
)
,grpcnts as (select g.*,count(*) over(partition by grp) grpcnt from grps g)
select dt from (select g.*,max(grp) over() maxgrp
from grpcnts g
where grpcnt > 1
) g
where grp = maxgrp
You can do this by using a flag based on the lead() or lag() values. I believe this does what you want:
SELECT h.*
FROM (SELECT h.*,
             SUM((next_datetime >= datetime + interval '1 minute')::int)
                 OVER (ORDER BY datetime DESC) AS grp
      FROM (SELECT h.*,
                   LEAD(h.datetime) OVER (ORDER BY h.datetime) AS next_datetime
            FROM history h
           ) h
     ) h
WHERE grp IS NULL OR grp = 0;
This can be easily solved with recursive CTEs (this will select your rows one-by-one and stops when there is no row in range interval '1 min'):
with recursive h as (
select * from (
select *
from history
order by history.datetime desc
limit 1
) s
union all
select * from (
select history.*
from h
join history on history.datetime >= h.datetime - interval '1 min'
and history.datetime < h.datetime
order by history.datetime desc
limit 1
) s
)
select * from h
This should be efficient if you have an index on history.datetime. Though, if you care about performance, you should test it against the window-function based ones. (I personally get a headache when I see as many subqueries and window functions as this problem needs. The irony in my answer is that PostgreSQL does not support the ORDER BY clause directly inside recursive CTEs, so I had to use two otherwise meaningless subqueries to "hide" them.)
(demo on rextester)
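For reference, the index mentioned above might be created like this (the index name is illustrative):
CREATE INDEX history_datetime_idx ON history (datetime DESC);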

Standard deviation of a set of dates

I have a table of transactions with columns id | client_id | datetime and I have calculated the mean of days between transactions to know how often this transactions are made by each client:
SELECT *, ((date_last_transaction - date_first_transaction)/total_transactions) AS frequency
FROM (
SELECT client_id, COUNT(id) AS total_transactions, MIN(datetime) AS date_first_transaction, MAX(datetime) AS date_last_transaction
FROM transactions
GROUP BY client_id
) AS t;
What methods exist to calculate the standard deviation (in days) of a set of dates with PostgreSQL? Preferably with only one query, if that is possible :-)
I have found this way:
SELECT extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 1 THEN
0
ELSE
SUM(time_since_last_invoice)/(COUNT(*)-1)
END
) * '1 day'::interval)) AS days_between_purchases,
extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 2 THEN
0
ELSE
STDDEV(time_since_last_invoice)
END
) * '1 day'::interval)) AS range_of_days
FROM (
SELECT client_id, datetime, COALESCE(datetime - lag(datetime)
OVER (PARTITION BY client_id ORDER BY client_id, datetime
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
), 0
) AS time_since_last_invoice
FROM my_table
GROUP BY client_id, datetime
ORDER BY client_id, datetime
)
Explanation:
This query groups by client and date, calculates the difference between each pair of consecutive transaction dates (datetime) per client_id, and returns a table with these results. The outer query then processes that table and calculates the average time between transactions; the first value in each group is excluded because it is the first transaction and therefore its interval is 0.
The standard deviation is only calculated when there are 2 or more transaction dates for the same client, to avoid division-by-zero errors.
All differences are returned in PostgreSQL interval format.
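A more compact variant (an untested sketch reusing the transactions table and columns from the question) converts each gap to seconds so the aggregates work on plain numbers:
SELECT client_id
     -- mean and standard deviation of the gaps, expressed in days
     , avg(extract(epoch FROM gap)) / 86400.0 AS mean_days
     , stddev_samp(extract(epoch FROM gap)) / 86400.0 AS stddev_days
FROM (
  SELECT client_id
       , datetime - lag(datetime) OVER (PARTITION BY client_id ORDER BY datetime) AS gap
  FROM transactions
) g
WHERE gap IS NOT NULL -- each client's first transaction has no previous one
GROUP BY client_id;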

Query aggregated data with a given sampling time

Suppose my raw data is:
Timestamp High Low Volume
10:24.22345 100 99 10
10:24.23345 110 97 20
10:24.33455 97 89 40
10:25.33455 60 40 50
10:25.93455 40 20 60
With a sample time of 1 second, the output data should be as follows (with an additional Count column):
Timestamp High Low Volume Count
10:24 110 89 70 3
10:25 60 20 110 2
The sampling unit varies: 1 second, 5 sec, 1 minute, 1 hour, 1 day, ...
How can I query the sampled data quickly from the PostgreSQL database with Rails?
I want to fill all the intervals, but I am getting the error
ERROR: JOIN/USING types bigint and timestamp without time zone cannot be matched
SQL
SELECT
t.high,
t.low
FROM
(
SELECT generate_series(
date_trunc('second', min(ticktime)) ,
date_trunc('second', max(ticktime)) ,
interval '1 sec'
) FROM czces AS g (time)
LEFT JOIN
(
SELECT
date_trunc('second', ticktime) AS time ,
max(last_price) OVER w AS high ,
min(last_price) OVER w AS low
FROM czces
WHERE product_type ='TA' AND contract_month = '2014-08-01 00:00:00'::TIMESTAMP
WINDOW w AS (
PARTITION BY date_trunc('second', ticktime)
ORDER BY ticktime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) t USING (time)
ORDER BY 1
) AS t ;
Simply use date_trunc() before you aggregate. Works for basic time units 1 second, 1 minute, 1 hour, 1 day - but not for 5 sec. Arbitrary intervals are slightly more complex, see the links below!
SELECT date_trunc('second', timestamp) AS timestamp -- or minute ...
, max(high) AS high, min(low) AS low, sum(volume) AS vol, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 1;
If there are no rows for a sample point, you get no row in the result. If you need one row for every sample point:
SELECT g.timestamp, t.high, t.low, t.vol, t.ct
FROM (SELECT generate_series(date_trunc('second', min(timestamp))
                            ,date_trunc('second', max(timestamp))
                            ,interval '1 sec') AS timestamp -- or minute ...
      FROM tbl) g
LEFT JOIN (
      SELECT date_trunc('second', timestamp) AS timestamp -- or minute ...
           , max(high) AS high, min(low) AS low, sum(volume) AS vol, count(*) AS ct
      FROM tbl
      GROUP BY 1
      ) t USING (timestamp)
ORDER BY 1;
The LEFT JOIN is essential.
For arbitrary intervals:
Best way to count records by arbitrary time intervals in Rails+Postgres
Retrieve aggregates for arbitrary time intervals
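For illustration, one common trick for arbitrary intervals (an untested sketch for 5-second buckets, reusing tbl from above) is to round the epoch down to a multiple of the bucket size:
SELECT to_timestamp(floor(extract(epoch FROM timestamp) / 5) * 5) AS bucket -- 5-second buckets
     , max(high) AS high, min(low) AS low, sum(volume) AS vol, count(*) AS ct
FROM tbl
GROUP BY 1
ORDER BY 1;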
Aside: Don't use timestamp as a column name. It's a basic type name and a reserved word in standard SQL. It's also misleading for data that's not actually a timestamp.

SQL query records within a range of boundaries and max/min outside the range

I have the following three simple T-SQL queries. First one is to get records within a range of boundaries (DATETIME type):
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN #startDT AND #endDT
the second one is to get the closest record to #startDT (DATETIME type):
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > #startDT
ORDER BY timestamp DESC
and the last one is to get the closest record after #endDT:
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < #endDT
ORDER BY timestamp ASC
I would like to get all the records of the above three queries as one group of records. I tried to use UNION, but it seems that sub-queries within a UNION do not allow an ORDER BY clause. Is there an efficient way to get my result?
. . * | * * * * * | * . . .
      start       end
The above graph simply shows the records marked * as my required records; |...| marks the boundaries.
By the way, the amount of data in myTable is huge, and my understanding is that UNION is not an efficient way to combine the queries. Is there an efficient way to get the data without UNION?
As you wish, without UNION.
MySQL (TESTED)
SELECT
dv1.timestamp, dv1.value
FROM
myTable AS dv1
WHERE
dv1.timestamp
BETWEEN (
SELECT dv2.timestamp
FROM myTable AS dv2
WHERE dv2.timestamp < '#START_DATE'
ORDER BY dv2.timestamp DESC
LIMIT 1
)
AND ( SELECT dv3.timestamp
FROM myTable AS dv3
WHERE dv3.timestamp > '#END_DATE'
ORDER BY dv3.timestamp ASC
LIMIT 1
)
EDIT Sorry, I didn't notice the question was about T-SQL.
T-SQL (NOT TESTED)
SELECT
dv1.timestamp, dv1.value
FROM
myTable AS dv1
WHERE
dv1.timestamp
BETWEEN (
SELECT TOP 1 dv2.timestamp
FROM myTable AS dv2
WHERE dv2.timestamp > #START_DATE
ORDER BY dv2.timestamp DESC
)
AND ( SELECT TOP 1 dv3.timestamp
FROM myTable AS dv3
WHERE dv3.timestamp < #END_DATE
ORDER BY dv3.timestamp ASC
)
Note: If the result is not right, you could just exchange the sub queries (i.e. the operators and ASC/DESC).
Think out of the box :)
You can use MAX/MIN to get the value you need. ORDER BY plus TOP 1 isn't the best way to get a max value, from what I can see in your queries. Sorting n items is O(n log n); getting the max is only O(n).
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN #startDT AND #endDT
union
select A.Value, A.TimeStamp
From (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > #startDT
ORDER BY timestamp DESC ) A
Union
Select A.Value, A.TimeStamp
From (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < #endDT
ORDER BY timestamp ASC ) A
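For comparison, a version that literally uses MAX/MIN instead of TOP 1 + ORDER BY could look like this (an untested sketch, with #startDT/#endDT as in the question):
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN #startDT AND #endDT
   OR timestamp = (SELECT MAX(timestamp) FROM myTable WHERE timestamp < #startDT) -- closest record before the range
   OR timestamp = (SELECT MIN(timestamp) FROM myTable WHERE timestamp > #endDT)   -- closest record after the range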
The second and third queries in your post don't make much sense because
WHERE timestamp > #startDT
and
WHERE timestamp < #endDT
result in timestamps INSIDE the range, but your description
. . * | * * * * * | * . . .
      start       end
The above graph simply shows the records marked * as my required records; |...| marks the boundaries.
means something different.
So following the descriptions and using the following mapping
myTable = Posts
value = score
timestamp = creationdate
I wrote this query on data.stackexchange.com (modified from exodream's answer but with the comparison operators in the correct reverse direction)
DECLARE #START_DATE datetime
DECLARE #END_DATE datetime
SET #START_DATE = '2010-10-20'
SET #END_DATE = '2010-11-01'
SELECT score,
creationdate
FROM posts
WHERE creationdate BETWEEN (SELECT TOP 1 creationdate
FROM posts
WHERE creationdate < #START_DATE
ORDER BY creationdate DESC)
AND
(SELECT TOP 1 creationdate
FROM posts
WHERE creationdate > #END_DATE
ORDER BY creationdate ASC)
ORDER by creationDate
Which outputs
score creationdate
----- -------------------
4 2010-10-19 23:55:48
3 2010-10-20 2:24:50
6 2010-10-20 2:55:54
...
...
7 2010-10-31 23:14:48
4 2010-10-31 23:18:17
4 2010-10-31 23:18:48
0 2010-11-01 3:59:38
(382 row(s) affected)
Note how the first and last rows are just outside the limits of the range.
You can put those ordered queries into subqueries to get around not being able to UNION them directly. A little annoying, but it'll get you what you want.
SELECT value, timestamp
FROM myTable
WHERE timestamp BETWEEN #startDT AND #endDT
UNION
SELECT value, timestamp
FROM (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp > #startDT
ORDER BY timestamp DESC
) x
UNION
SELECT value, timestamp
FROM (
SELECT TOP 1
value, timestamp
FROM myTable
WHERE timestamp < #endDT
ORDER BY timestamp ASC
) x

Meaning of SQL code in reviewed code

I need to review some code from a test application written in PHP/MySQL.
The author of this code wrote three SQL queries.
I can't tell: did he intend some performance optimization here?
DB::fetch("SELECT COUNT( * ) AS count, `ip`,`datetime`
FROM `logs`
WHERE `datetime` > ('2006-02-03' - INTERVAL 10 DAY)
GROUP BY `ip`
ORDER BY `datetime` DESC");
$hits = DB::fetchAll("SELECT COUNT( * ) AS count, `datetime`
FROM `logs`
WHERE `datetime` > ( '2006-02-03' - INTERVAL 10
DAY ) AND `is_doc` = 1
GROUP BY `datetime`
ORDER BY `datetime` DESC");
$hosts = DB::fetchAll("SELECT COUNT( * ) AS hosts , datetime
FROM (
SELECT `ip` , datetime
FROM `logs`
WHERE `is_doc` = 1
GROUP BY `datetime` , `ip`
ORDER BY `logs`.`datetime` DESC
) AS cnt
WHERE cnt.datetime > ( '2006-02-03' - INTERVAL 10
DAY )
GROUP BY cnt.datetime
ORDER BY datetime DESC ");
The results of the first query are not used in the application.
The 1st query is invalid, as it selects 2 columns + 1 aggregate but only groups by 1 of the 2 selected columns (a possible rewrite is sketched below).
The 2nd query gets a count of the rows in logs (with is_doc = 1) per datetime within the 10 days before 2006-02-03.
The 3rd query gets a count of the distinct ip values from logs per datetime within the same 10-day window, and could be better written as
SELECT COUNT(DISTINCT ip) hosts, datetime
FROM logs
WHERE is_doc = 1
AND datetime > ('2006-02-03' - INTERVAL 10 DAY)
GROUP BY datetime
ORDER BY datetime DESC
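Going back to the 1st query, a GROUP BY-consistent rewrite could look like this (a sketch that assumes the intent was hits per ip, with the latest hit time instead of the ungrouped datetime):
SELECT COUNT(*) AS count, `ip`, MAX(`datetime`) AS last_seen
FROM `logs`
WHERE `datetime` > ('2006-02-03' - INTERVAL 10 DAY)
GROUP BY `ip`
ORDER BY last_seen DESC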
If this was a submission for a job interview, you may wonder why the cutoff date isn't passed as a variable.
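For instance (a sketch; the exact placeholder syntax depends on the driver):
SELECT COUNT(*) AS count, `datetime`
FROM `logs`
WHERE `datetime` > (:cutoff_date - INTERVAL 10 DAY) AND `is_doc` = 1
GROUP BY `datetime`
ORDER BY `datetime` DESC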