SQL: a time-series variant of the "every nth row" problem

I have a table of time-series data, with the columns:
sensor_number (integer primary key)
signal_strength (integer)
signal_time (timestamp)
Each sensor creates 20-30 rows per minute. I need a query that returns, for a sensor, 1 row per minute (or every 2 minutes, 3 minutes, etc.). A pure SQL approach is to use a window function, with a partition on an expression that rounds the timestamp appropriately (date_trunc() works for the 1-minute case; otherwise I have to do some messy casting). The problem is that the expression blocks the ability to use the index. With 5B rows, that's a killer.
The best alternative I can come up with is a user-defined function that uses a cursor to step through the table in index key order (sensor_number, signal_time), outputting a row every time the timestamp crosses a minute boundary. That's still slow, though. Is there a pure SQL approach that'll accomplish this AND utilize the index?

I think if you're returning enough rows, scanning the whole range of rows that match the sensor_number will just be the best plan. The signal_time portion of the index may simply not be helpful at that point, because the database needs to read so many rows anyway.
However, if your time interval is big enough / the number of rows you're returning is small enough, it might be more efficient to hit the index separately for each row you're returning. Something like this (using an interval of 3 minutes and a sensor number of 5 as an example):
WITH range AS (
    SELECT
        max(signal_time) AS max_time,
        min(signal_time) AS min_time
    FROM timeseries
    WHERE sensor_number = 5
)
SELECT sample.*
FROM range
JOIN generate_series(min_time, max_time, interval '3 minutes') timestamp ON true
JOIN LATERAL (
    SELECT *
    FROM timeseries
    WHERE sensor_number = 5
      AND signal_time >= timestamp
      AND signal_time < timestamp + interval '3 minutes'
    LIMIT 1
) sample ON true;
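Note that without an ORDER BY, the row LIMIT 1 returns from each bucket is arbitrary. If you want a deterministic pick (say, the earliest reading in each interval), a variant like the one below stays friendly to the (sensor_number, signal_time) index from the question; this is a sketch using the same example values as above:
WITH range AS (
    SELECT
        max(signal_time) AS max_time,
        min(signal_time) AS min_time
    FROM timeseries
    WHERE sensor_number = 5
)
SELECT sample.*
FROM range
JOIN generate_series(min_time, max_time, interval '3 minutes') timestamp ON true
JOIN LATERAL (
    SELECT *
    FROM timeseries
    WHERE sensor_number = 5
      AND signal_time >= timestamp
      AND signal_time < timestamp + interval '3 minutes'
    ORDER BY signal_time   -- makes "first row of each bucket" deterministic
    LIMIT 1
) sample ON true;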


How can I output values for time intervals with no data in QuestDB

I am using QuestDB to get the number of events we are receiving every 500 milliseconds. Everything works as expected and I can use SAMPLE BY 500T to aggregate into half-second intervals.
However, for the intervals where we don't have any data, we are not getting any rows. I guess this is expected, but it would be good to have some way of getting a row for those intervals just with null or empty values.
Luckily in QuestDB you have the FILL keyword to do exactly that. Take this query running at the public QuestDB demo:
SELECT
timestamp, count()
FROM trades
WHERE timestamp > dateadd('d', -1, now())
SAMPLE BY 500T ALIGN TO CALENDAR;
In this case I am aggregating every 500 milliseconds and getting results only for the intervals where I have data. I am limiting to only the past day. You can run this on the demo site as it is a live dataset and you should see gaps for some intervals.
Now, by using FILL I can add rows for the periods with no values:
SELECT
timestamp, count()
FROM trades
WHERE timestamp > dateadd('d', -1, now())
SAMPLE BY 500T FILL(NULL) ALIGN TO CALENDAR;
Note that you could also fill with LINEAR (linear interpolation of previous and next rows), PREV for the value of the row before, or with a constant value.
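For instance, a sketch of the same query carrying the previous row's value forward instead of emitting NULLs:
SELECT
    timestamp, count()
FROM trades
WHERE timestamp > dateadd('d', -1, now())
SAMPLE BY 500T FILL(PREV) ALIGN TO CALENDAR;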

What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds

My use case is that I have two data sources:
1. Source1 (as speed layer)
2. Hive external table on top of S3 (as batch layer)
I am using Presto to query data from both data sources through a view.
I want to create a view that will union data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
We are keeping 24 hours of data in Source1, and after 24 hours that data will be migrated to S3 via Hive.
Columns for the Source1 tables are: timestamp, logtype, company, category
Users will query data using a timestamp range (last 15/30 minutes, last x hours, last x days, last x months, etc.)
example: "select * from test where timestamp > (now() - interval '15' minute)", "select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"
To satisfy the user query I need to partition the Hive table appropriately, and the user should not be aware of the underlying strategy, i.e. if the user is querying the last x minutes of data, they should not care whether Presto is reading it from Source1 or from Hive.
What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds?
For Hive, the partition columns should be ones that are queried in the filter.
In your case this is timestamp. However, if you partition on the raw timestamp it would create a partition for every second (or millisecond), depending on the data in the column.
A better solution would be to create columns like year, month, day, hour (derived from timestamp) and use these as partition columns.
The same strategy will work for Kudu; however, be advised it could create hot-spotting, since all the newly arriving records will go to the same (most recent) partition, which will limit insert (and maybe query) performance.
To overcome this, use one additional column as a hash partition along with the timestamp-derived columns,
e.g. year, month, day, hour, logtype.
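A minimal sketch of that layout as Hive DDL; the table name, storage format, and S3 location below are assumptions, and only the partition columns follow the advice above:
CREATE EXTERNAL TABLE events (
    `timestamp` TIMESTAMP,
    logtype STRING,
    company STRING,
    category STRING
)
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)
STORED AS PARQUET
LOCATION 's3://your-bucket/events/';
Keep in mind that Presto can only prune partitions when the view (or the user's query) filters on those derived columns rather than only on the raw timestamp.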

Optimize timescale query

I am using the below query with TimescaleDB to get 10-minute candles from a ticks database.
SELECT time_bucket('10minute', time) AS min,
       first(ticks, time) AS open,
       last(ticks, time) AS close,
       max(ticks) AS high,
       min(ticks) AS low,
       last(volume, time) - first(volume, time) AS vol
FROM prices
WHERE asset_code = '".$symbol."'
GROUP BY min
ORDER BY min DESC
LIMIT 100
I want to make sure the query doesn't slow down after some days as the db grows. At any time I want to run this query on ticks from the last two days, not the whole table. So I want to know whether there is a way to limit the time_bucket query to the last 100,000 ticks from the db.
I am also using PDO to query the db.
TimescaleDB uses constraint exclusion to avoid touching chunks it doesn't need when answering a query. We have some work going on right now to extend the query optimization to handle some types of LIMIT queries more intelligently, as in your example, so that even the above will touch only the necessary chunks.
But for now, there's a very easy fix: use a time predicate in the WHERE clause instead of the LIMIT.
In particular, assuming that the ticker symbol typically has data in each 10-minute interval, and you want 100 intervals:
SELECT time_bucket('10 minutes', time) AS min,
       first(ticks, time) AS open,
       ...
FROM prices
WHERE asset_code = '".$symbol."'
  AND time > NOW() - interval '1000 minutes'
GROUP BY min
ORDER BY min DESC
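For completeness, here is the answer's predicate combined with the rest of the aggregates from the original query. The :symbol named placeholder is an assumption (binding it through PDO instead of interpolating the string also avoids SQL injection), and the 1000-minute window simply covers about 100 ten-minute buckets:
SELECT time_bucket('10 minutes', time) AS min,
       first(ticks, time) AS open,
       last(ticks, time) AS close,
       max(ticks) AS high,
       min(ticks) AS low,
       last(volume, time) - first(volume, time) AS vol
FROM prices
WHERE asset_code = :symbol                     -- bind via PDO rather than interpolating
  AND time > NOW() - interval '1000 minutes'   -- roughly 100 ten-minute buckets
GROUP BY min
ORDER BY min DESC
LIMIT 100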

SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that no rows will overlap - in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
    ROUND(time_start / group_period, 0) AS time_period,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all the time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is of concern to me because, in time, I expect the table to grow considerably, reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?
Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.

-- First some configuration:
-- group_period = 3600       -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301     -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301     -- Thu, 24 Dec 2015 03:05:01 GMT

-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)

SET TIME ZONE UTC;
BEGIN TRANSACTION;

-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
    ON COMMIT DROP;

INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) AS coefficient;

-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
--   A. [period_start, period_start + group_period]
--   B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
    period_start,
    COUNT(measurements.*) AS count_measurements,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
    ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;

COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is the primary key and has a unique btree index on it.
I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.
(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer, you probably should go with getting best results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMS are complex and tend to behave differently on small and large data sets. You'll have to get your table filled with some large sample data, experiment with indexes and read the results of EXPLAIN, there's no other way.
There are a few things to suggest, though I know Oracle optimizer much better than Postgres, so some of them might not work.
Things will be faster if all fields you're checking against are included in the index. Since you're performing a left join and periods is the base, there's probably no reason to index it, since it'll be read in full either way. duration should be included in the index, though, if you're going to go with proper interval overlap - that way Postgres won't have to fetch the row to calculate the join condition; the index will suffice. Chances are it will not even fetch the table rows at all, since it needs no other data than what exists in the indexes. I think it'll perform better if duration is included as the second field of the time_start index (at least in Oracle it would), but IIRC Postgres is able to combine indexes, so perhaps a second index would perform better - you'll have to check it with EXPLAIN.
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used in (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try either creating a function-based index (that is, indexing the expression time_start + duration), or altering the structure of the table a bit so that time_start + duration becomes a separate column, and index that column instead.
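A minimal sketch of those two indexing options, assuming time_start is a timestamp and duration is a number of seconds (the index names are made up; adjust the expression if your actual types differ):
-- Option 1: duration as the second column of the time_start index
CREATE INDEX measurements_start_duration_idx ON measurements (time_start, duration);
-- Option 2: an expression index on the computed interval end
CREATE INDEX measurements_interval_end_idx
    ON measurements ((time_start + duration * interval '1 second'));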
If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains the start and end of each period explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps most important: if you have max_time and min_time, USE THEM to limit the measurements before joining! The smaller your sets, the faster it will work.

SQL: select one record for each day nearest to a specific time

I have one table that stores values with a point in time:
CREATE TABLE values
(
    value DECIMAL,
    datetime DATETIME
)
There may be many values on each day, or there may be only one value for a given day. Now I want to get, for each day in a given timespan (e.g. one month), the value which is nearest to a given time of day. I only want one value per day if there are records for that day, or no value if there are no records. My database is PostgreSQL. I'm quite stuck with that. I could just get all values in the timespan and select the nearest value for each day programmatically, but that would mean pulling a huge amount of data from the database, because there can be many values on one day.
(Update)
To formulate it a bit more abstract: I have data of arbitrary precision (could be one minute, could be two hours or two days) and I want to convert it to a fixed precision of one day, with a specific time of day.
(second update)
This is the query from the accepted answer with correct PostgreSQL type conversions, assuming the desired time is 16:00:
SELECT datetime, value FROM values, (
    SELECT DATE(datetime) AS date,
           MIN(ABS(EXTRACT(EPOCH FROM TIME '16:00' - CAST(datetime AS TIME)))) AS timediff
    FROM values
    GROUP BY DATE(datetime)
) AS besttimes
WHERE
    CAST(values.datetime AS TIME) BETWEEN TIME '16:00' - CAST(besttimes.timediff::text || ' seconds' AS INTERVAL)
                                      AND TIME '16:00' + CAST(besttimes.timediff::text || ' seconds' AS INTERVAL)
    AND DATE(values.datetime) = besttimes.date
How about going in this direction?
SELECT values.value, values.datetime
FROM values,
     ( SELECT DATE(datetime) AS date, MIN(ABS(_WANTED_TIME_ - TIME(datetime))) AS timediff
       FROM values
       GROUP BY DATE(datetime)
     ) AS besttimes
WHERE TIME(values.datetime) BETWEEN _WANTED_TIME_ - besttimes.timediff
                                AND _WANTED_TIME_ + besttimes.timediff
  AND DATE(values.datetime) = besttimes.date
I am not sure about the date/time extraction and abs(time) functions, so you will probably have to replace them.
It appears you have two parts to solve:
Are there any results for a day at all?
If there are, then which is the nearest one?
By short-circuiting the process at part 1 if you have no results, you'll save a lot of execution time.
The next thing to note is that you don't have to pull the data from the database and work out the answer client-side; you can use PL/pgSQL functions (or something else) to work it out on the server first.
Once you have a selection of times to check, you can use intervals to compare them. Check the Postgres docs on intervals and date/time functions for precise instructions, but basically you subtract the candidate times from the time you've given, and the one with the smallest interval is the one you want.
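A sketch of that "smallest interval wins" idea in plain PostgreSQL, using DISTINCT ON instead of a server-side function. The one-month span and the 16:00 target time are placeholders, and the table name is quoted because values is a reserved word:
SELECT DISTINCT ON (datetime::date)
       datetime::date AS day,
       datetime,
       value
FROM "values"
WHERE datetime >= DATE '2015-08-01'   -- start of the timespan of interest (placeholder)
  AND datetime <  DATE '2015-09-01'   -- end of the timespan (one month here)
ORDER BY datetime::date,
         ABS(EXTRACT(EPOCH FROM (datetime::time - TIME '16:00')));
Days with no rows in the range simply produce no output row, which matches the "no value if there are no records" requirement.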