Howto limit timescaleDB queries to the used bucket size? - sql

I have a postgres timescaleDB database with time series data.
The data in table flows was sampled roughly every 500ms.
I need to get the data for every 1 second.
I tried to do it with time_bucket() function.
This was my test query:
SELECT time_bucket('1 second', time) AS bucket, value AS val
FROM flows fl
WHERE
fl.time > '2021-08-31 06:14:00+00' AND
fl.time <= '2021-08-31 06:18:00+00' AND
fl.sensor_id = 2
ORDER BY fl.time ASC;
The returned data looks as follows:
|bucket |val |
| ---------------------- | ------------------- |
| 2021-08-31 06:14:00+00 | 9.75071040883207 |
| 2021-08-31 06:14:00+00 | 10.008532745208633 |
| 2021-08-31 06:14:01+00 | 9.953632354528265 |
| 2021-08-31 06:14:01+00 | 9.833033340905137 |
| 2021-08-31 06:14:02+00 | 9.77205680132453 |
| 2021-08-31 06:14:02+00 | 10.197350449765523 |
| ... | ... |
As you can see, there are two rows for each bucket of one second. Values are coming from the samples that were collected every 500ms.
How to make sure there is only one value per bucket?
(In my case: One value every second)
I also tried an aggregation function (avg) on value, but that did not change the result.

For time_bucket functions, in order to get the bucketing to work correctly, you will have to aggregate the value column in some way, and provide a group by statment. For example, something like this should correctly bucket the time,
SELECT time_bucket('1 second', time) AS bucket,
sum(value) AS val
FROM flows fl
WHERE
time_bucket('1 second', time) > '2021-08-31 06:14:00+00' AND
time_bucket('1 second', time) <= '2021-08-31 06:18:00+00' AND
fl.sensor_id = 2
GROUP BY bucket, sensor_id
ORDER BY bucket ASC;
Hopefully this works for you!
disclosure: I am a part of the Timescale team 😊

Related

How to sum the minutes of each activity in Postgresql?

The column "activitie_time_enter" has the times.
The column "activitie_still" indicates the type of activity.
The column "activitie_walking" indicates the other type of activity.
Table example:
activitie_time_enter | activitie_still | activitie_walking
17:30:20 | Still |
17:31:32 | Still |
17:32:24 | | Walking
17:33:37 | | Walking
17:34:20 | Still |
17:35:37 | Still |
17:45:13 | Still |
17:50:23 | Still |
17:51:32 | | Walking
What I need is to sum up the total minutes for each activity separately.
Any suggestions or solution?
First calculate the duration for each activity (the with CTE) and then do conditional sum.
with t as
(
select
*, lead(activitie_time_enter) over (order by activitie_time_enter) - activitie_time_enter as duration
from _table
)
select
sum (duration) filter (where activitie_still = 'Still') as total_still,
sum (duration) filter (where activitie_walking = 'Walking') as total_walking
from t;
/** Result:
total_still|total_walking|
-----------+-------------+
00:19:16| 00:01:56|
*/
BTW do you really need two columns (activitie_still and activitie_walking)? Only one activity column with those values will do. This will allow more activities (Running, Sleeping, Working etc.) w/o having to change the table structure.

Get the difference in time between multiple rows with the same column name

I need to get the time difference between two dates on different rows. This part is okay but I can have instances of the same title. A quick example which will explain things some more.
Lets say we have a table with the following records:
| ID | Title | Date |
| ----- | ------- |--------------------|
| 1 | Down |2021-03-07 12:05:00 |
| 2 | Up |2021-03-07 13:05:00 |
| 3 | Down |2021-03-07 10:30:00 |
| 4 | Up |2021-03-07 11:00:00 |
I basically need to get the time difference between the first "Down" and "Up". So ID 1 & 2 = 1 hour.
Then ID 3 & 4 = 30 mins, and so on for the amount of "Down" and "Up" rows there are.
(These will always be grouped together one after another)
It doesn't matter if the results are seperate or a SUM of all the differences.
I'm trying to get this done without a temp table.
Thank you.
This can be done using analytical functions, the availability of which will be determined based on your sql engine. The idea is to get the next value in the same row as the one you need to calculate the diff/sum
In the case above it would look some thing like below
SELECT
id ,
title,
Date as startdate,
LEAD(Date,1) OVER (
ORDER BY id
) enddate
FROM
table;
Once you have it on the same row, you can carry out your time difference operation.

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention date of my users. So for example, for all users with one or more check ins, I want the percentage of users who did a check in on their 2nd day, on their 3rd day and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::numeric ) / u.num_users as day_0,
sum( (ci.date = ci.min_date + interval '1 day')::numeric ) / u.num_users as day_1,
sum( (ci.date = ci.min_date + interval '2 day')::numeric ) / u.num_users as day_2
from (select u.*, count(*) over () as num_users
from users u
) u left join
(select ci.user_id, ci.date::date as date,
min(min(date::date)) over (partition by user_id order by date) as min_date
from checkins ci
group by user_id, ci.date::date
) ci;
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per date.

Performance on querying only the most recent entries

I made an app that saves when a worker arrives and departures from the premises.
Over a 24 hours multiple checks are made, so the database can quickly fill hundreds to thousands of records depending on the activity.
| user_id | device_id | station_id | arrived_at | departed_at |
|-----------|-----------|------------|---------------------|---------------------|
| 67 | 46 | 4 | 2020-01-03 11:32:45 | 2020-01-03 11:59:49 |
| 254 | 256 | 8 | 2020-01-02 16:29:12 | 2020-01-02 16:44:65 |
| 97 | 87 | 7 | 2020-01-01 09:55:01 | 2020-01-01 11:59:18 |
...
This becomes a problem since the daily report software, which later reports who was absent or who made extra hours, filters by arrival date.
The query becomes a full table sweep:
(I just used SQLite for this example, but you get the idea)
EXPLAIN QUERY PLAN
SELECT * FROM activities
WHERE user_id = 67
AND arrived_at > '2020-01-01 00:00:00'
AND departed_at < '2020-01-01 23:59:59'
ORDER BY arrived_at DESC
LIMIT 10
What I want to make is make the query snappier for records created (arrived) only the most recent day, since queries for older days are rarely executed. Otherwise, I'll have to deal with timeouts.
I would use the following index, so that departed_at that don't match can be eliminated before probing the table:
CREATE INDEX ON activities (arrived_at, departed_at);
On Postgres, you may use DISTINCT ON:
SELECT DISTINCT ON (user_id) *
FROM activities
ORDER BY user_id, arrived_at::date DESC;
This assumes that you only want to report the latest record, as determined by the arrival date, for each user. If instead you just want to show all records with the latest arrival date across the entire table, then use:
SELECT *
FROM activities
WHERE arrived_at::date = (SELECT MAX(arrived_at::date) FROM activities);

JOIN or analytic function to match different sensors on nearby timestamps within a large dataset?

I have a large dataset consisting of four sensors in a single stream, but for simplicity's sake let's reduce that to two sensors that transmit at approximate (but not exact) same times like this:
+---------+-------------+-------+
| Sensor | Time | Value |
+---------+-------------+-------+
| SensorA | 10:00:01.14 | 10 |
| SensorB | 10:00:01.06 | 8 |
| SensorA | 10:00:02.15 | 11 |
| SensorB | 10:00:02.07 | 9 |
| SensorA | 10:00:03.14 | 13 |
| SensorA | 10:00:04.09 | 12 |
| SensorB | 10:00:04.13 | 6 |
+---------+-------------+-------+
I am trying to find the difference between SensorA and SensorB when their readings are within a half-second of each other. Like this:
+-------------+-------+
| Trunc_Time | Diff |
+-------------+-------+
| 10:00:01 | 2 |
| 10:00:02 | 2 |
| 10:00:04 | 6 |
+-------------+-------+
I know I could write queries to put each sensor in its own table (say SensorA_table and SensorB_table), and then join those tables like this:
SELECT
TIMESTAMP_TRUNC(a.Time, SECOND) as truncated_sec,
a.Value - b.Value as sensor_diff
FROM SensorA_table AS a JOIN SensorB_Table AS b
ON b.Time BETWEEN TIMESTAMP_SUB(a.Time, INTERVAL 500 MILLISECOND) AND TIMESTAMP_ADD(a.Time, INTERVAL 500 MILLISECOND)
But that seems very expensive to make every row of the SensorA_table compare against every row of the SensorB_table, given that the sensor tables are each about 10 TB. Or does partitioning automatically take care of this and only look at one block of SensorB's table per row of SensorA's table?
Either way, I am wondering if there is a better way to do this than a full JOIN. Since the matching values are all coming from within a few rows of each other in the original table, it seems like an analytic function might be able to look at a smaller amount of data at a time, but because we can't guarantee alternating rows of A & B, there's no clear LAG or LEAD offset that would always return the correct row.
Is it a matter of writing an analytic functions to return a few LAG and LEAD rows for each row, then evaluate each of those rows with a CASE statement to see if it is the correct row, then calculating the value? Or is there a way of doing a join against an analytic function's window?
Thanks for any guidance here.
One method uses lag(). Something like this:
select st.time, st.value - st.prev_value
from (select st.*,
lag(sensor) over (order by time, sensor) as prev_sensor,
lag(time) over (order by time, sensor) as prev_time,
lag(value) over (order by time, sensor) as prev_value
from sensor_table st
) st
where ( st.sensor = 'A' <> prev_sensor = 'B' ) and
prev_time > timestamp_add(time, interval 1 second)