Calculating averages with bigint time intervals using TimescaleDB

I have a schema with the following fields:
 Column    | Type
-----------+---------
 name      | string
 value1    | numeric
 timestamp | bigint
The rows contain entries with a name, a numeric value, and a bigint value storing the Unix timestamp in nanoseconds. I am using TimescaleDB and would like to use time_bucket_gapfill to retrieve the data. Since the timestamps are stored as bigint, this is quite cumbersome.
I would like to get aggregated data for these intervals: 5 min, hour, day, week, month, quarter, year. I have managed to make it work using a plain time_bucket, but now I would like to fill the gaps as well. I am currently using the following query:
SELECT COALESCE(COUNT(*), 0),
       COALESCE(SUM(value1), 0),
       time_bucket_gapfill('5 min',
                           date_trunc('quarter', to_timestamp(timestamp/1000000000)),
                           to_timestamp(1599100000),
                           to_timestamp(1599300000)) AS bucket
FROM playground
WHERE name = 'test'
  AND timestamp >= 1599100000000000000
  AND timestamp <= 1599300000000000000
GROUP BY bucket
ORDER BY bucket ASC;
This returns the values correctly but does not fill the empty spaces. If I modify my query to
time_bucket_gapfill('5 min',
                    date_trunc('quarter', to_timestamp(timestamp/1000000000)),
                    to_timestamp(1599100000),
                    to_timestamp(1599200000))
I get the first entry correctly and then empty rows every 5 minutes. How could I make it work? Thanks!
Here is a DB fiddle, but it doesn't work as it doesn't support TimescaleDB. The query above returns the following:
 coalesce | coalesce |         bucket
----------+----------+------------------------
        3 |      300 | 2020-07-01 00:00:00+00
        0 |        0 | 2020-09-03 02:25:00+00
        0 |        0 | 2020-09-03 02:30:00+00

You should use datatypes in your time_bucket_gapfill call that match the datatypes in your table. The following query should get you what you are looking for:
SELECT COALESCE(count(*), 0),
       COALESCE(SUM(value1), 0),
       time_bucket_gapfill(300E9::BIGINT, timestamp) AS bucket
FROM t
WHERE name = 'example'
  AND timestamp >= 1599100000000000000
  AND timestamp < 1599200000000000000
GROUP BY bucket;
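Since the bucket width here is just a bigint number of nanoseconds, the other intervals from the question can be expressed the same way. A sketch building on the answer above (my own extrapolation, not part of the original): fixed-width intervals translate directly at 1 second = 1E9 ns, while month, quarter and year can only be approximated with a plain bigint width, because they are calendar-dependent.
-- Bucket widths in nanoseconds:
--   5 min   : 300E9
--   hour    : 3600E9
--   day     : 86400E9
--   week    : 604800E9
--   month   : 2592000E9   (30-day approximation)
--   quarter : 7776000E9   (90-day approximation)
--   year    : 31536000E9  (365-day approximation)
SELECT COALESCE(count(*), 0),
       COALESCE(SUM(value1), 0),
       time_bucket_gapfill(3600E9::BIGINT, timestamp) AS hourly_bucket
FROM t
WHERE name = 'example'
  AND timestamp >= 1599100000000000000
  AND timestamp < 1599200000000000000
GROUP BY hourly_bucket;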

I have managed to solve it by building on Sven's answer. It first uses his approach to fill the gaps and then applies date_trunc to eliminate the extra rows.
WITH gapfill AS (
    SELECT COALESCE(count(*), 0) AS count,
           COALESCE(SUM(value1), 0) AS sum,
           time_bucket_gapfill(300E9::BIGINT, timestamp) AS bucket
    FROM playground
    WHERE name = 'test'
      AND timestamp >= 1599100000000000000
      AND timestamp < 1599300000000000000
    GROUP BY bucket
)
SELECT SUM(count),
       SUM(sum),
       date_trunc('quarter', to_timestamp(bucket/1000000000)) AS truncated
FROM gapfill
GROUP BY truncated
ORDER BY truncated ASC;
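The same pattern should cover the other calendar-aware intervals the question lists; only the date_trunc unit in the outer query changes. For example (my extrapolation of the approach, not spelled out in the original), monthly totals would swap in:
date_trunc('month', to_timestamp(bucket/1000000000)) AS truncated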


Streamlined solution for division on similar subqueries

I have a table in PostgreSQL 13 that looks like this (modified for the purpose of this question):
SELECT * FROM visits.visitors_log;
 visitor_id |          day           | source
------------+------------------------+---------
          9 | 2019-12-30 12:10:10-05 | Twitter
          7 | 2019-12-14 22:10:26-04 | Netflix
          5 | 2019-12-13 15:21:04-05 | Netflix
          9 | 2019-12-22 23:34:47-05 | Twitter
          7 | 2019-12-22 00:10:26-04 | Netflix
          9 | 2019-12-22 13:20:42-04 | Twitter
After converting the times to another timezone, I want to calculate the percentage of visits on 2019-12-22 that came from a specific source.
There are 4 steps involved:
Convert the time zones
Count how many total visits happened on that day
Count how many visits on that day came from the source Netflix
Divide those two numbers to get the percentage.
I wrote this code which works but seems repetitive and not very clean:
SELECT (SELECT COUNT(*)
        FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST'
              FROM visits.visitors_log
              WHERE day::date = '2019-12-22') AS a
        WHERE day::date = '2019-12-22' AND source = 'Netflix') * 100.0
       /
       (SELECT COUNT(*)
        FROM (SELECT visitor_id, source, day AT TIME ZONE 'PST'
              FROM visits.visitors_log
              WHERE day::date = '2019-12-22') AS b
        WHERE day::date = '2019-12-22')
       AS visitors_percentage;
Can anyone suggest a neater way of answering this question?
Use an aggregate FILTER clause:
SELECT count(*) FILTER (WHERE source = 'Netflix') * 100.0
/ count(*) AS visitors_percentage
FROM visits.visitors_log
WHERE day >= timestamp '2019-12-22' AT TIME ZONE 'PST'
AND day < timestamp '2019-12-23' AT TIME ZONE 'PST';
See:
Aggregate columns with additional (distinct) filters
I rephrased the WHERE condition so it is "sargable" and can use an index on (day). A predicate with an expression on the column cannot use a plain index. So I moved the computation of the inclusive lower and exclusive upper bound (day boundaries for the given time zone) to the right side of the expressions in the WHERE clause.
That makes a huge difference for performance with big tables.
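For completeness, a minimal sketch of such an index (the index name is made up):
-- A plain b-tree index on the column itself, which the sargable
-- predicate above can use:
CREATE INDEX visitors_log_day_idx ON visits.visitors_log (day);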
If you use that query a lot, consider creating a function for it:
CREATE OR REPLACE FUNCTION my_func(_source text, _day date, _tz text)
  RETURNS numeric
  LANGUAGE sql STABLE PARALLEL SAFE AS
$func$
SELECT round(count(*) FILTER (WHERE source = _source) * 100.0 / count(*), 2) AS visitors_percentage
FROM   visits.visitors_log
WHERE  day >= _day::timestamp AT TIME ZONE _tz
AND    day <  (_day + 1)::timestamp AT TIME ZONE _tz;
$func$;
Call:
SELECT my_func('Netflix', '2019-12-22', 'PST');
I threw in round(), which is a totally optional addition.
db<>fiddle here
Aside: "day" is a rather misleading name for a timestamp with time zone column.
Hmmm . . . You can use window functions to calculate the total:
SELECT source, COUNT(*) / SUM(COUNT(*)) OVER () AS visitors_percentage
FROM visits.visitors_log
WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
GROUP BY source;
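Note that this returns one row per source with its share of all visits as a fraction; multiply by 100.0 for an actual percentage. And if only the Netflix row is wanted, the filter has to be applied after the window function has been evaluated, or the denominator changes. A sketch of that variation (my addition, not part of the answer):
SELECT *
FROM (
    SELECT source,
           COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS visitors_percentage
    FROM visits.visitors_log
    WHERE (day AT TIME ZONE 'PST')::date = '2019-12-22'
    GROUP BY source
) t
WHERE source = 'Netflix';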

How to find the number of hours between two timestamps in Db2

I would like to subtract two timestamps to get the number of hours between them. I used the days_between function, but it returned the error "Invalid operation: function days_between has no timezone setup". Below is a sample table with the timestamps that I want to subtract.
 job_number | timestamp 1         | timestamp 2
------------+---------------------+---------------------
  123456789 | 2020-03-16 16:59:26 | 2020-03-17 10:58:25
  134232125 | 2020-03-18 08:57:05 | 2020-03-19 01:47:26
The HOURS_BETWEEN function is the cleanest way to find the number of full hours between two timestamps in Db2:
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061478.html
The HOURS_BETWEEN function returns the number of full hours between the specified arguments.
For example
VALUES
( HOURS_BETWEEN('2020-03-17-10.58.25', '2019-03-16-16.59.26')
, HOURS_BETWEEN('2020-03-19-01.47.26', '2019-03-18-08.57.05')
)
returns
   1  |   2
------+------
 8801 | 8800
Note that the value is negative if the first argument is less than the second.
Also note that this function does not exist in versions of Db2 (for LUW) lower than 11.1.
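On those older versions, an equivalent can be computed from DAYS and MIDNIGHT_SECONDS; a sketch (the same arithmetic appears in the answer below):
-- Full hours between two timestamps without HOURS_BETWEEN:
SELECT ((DAYS(t2) - DAYS(t1)) * 86400
        + MIDNIGHT_SECONDS(t2) - MIDNIGHT_SECONDS(t1)) / 3600 AS hours
FROM (VALUES (TIMESTAMP('2020-03-16-16.59.26'),
              TIMESTAMP('2020-03-17-10.58.25'))) AS x(t1, t2);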
You can use the TIMESTAMPDIFF() function with the numeric-expression argument equal to 8, which represents a difference in hours:
SELECT job_number, TIMESTAMPDIFF(8, CHAR(timestamp2 - timestamp1)) AS ts_diff
FROM t;
Demo
The TIMESTAMPDIFF function has quite a specific implementation: it estimates the difference from the timestamp duration using 365-day years and 30-day months, so the result can deviate from the real elapsed time. See Table 3, TIMESTAMPDIFF computations.
I've deliberately set the timestamp 1 values to an earlier year (2019) to show the discrepancy.
WITH TAB (job_number, timestamp1, timestamp2) AS
(
VALUES
(123456789, TIMESTAMP('2019-03-16-16.59.26'), TIMESTAMP('2020-03-17-10.58.25'))
, (134232125, TIMESTAMP('2019-03-18-08.57.05'), TIMESTAMP('2020-03-19-01.47.26'))
)
SELECT job_number
, TIMESTAMPDIFF(8, CHAR(TIMESTAMP2 - TIMESTAMP1)) HOURS_TSDIFF
, ((DAYS(TIMESTAMP2) - DAYS(TIMESTAMP1)) * 86400 + MIDNIGHT_SECONDS(TIMESTAMP2) - MIDNIGHT_SECONDS(TIMESTAMP1)) / 3600 HOURS_REAL
FROM TAB;
The result is:
| JOB_NUMBER | HOURS_TSDIFF | HOURS_REAL |
|------------|--------------|------------|
| 123456789  | 8777         | 8801       |
| 134232125  | 8776         | 8800       |

Using crosstab, dynamically loading column names of resulting pivot table in one query?

The gem we have installed (Blazer) on our site limits us to one query.
We are trying to write a query to show how many hours each employee has for the past 10 days. The first column would have employee names and the rest would have hours with the column header being each date. I'm having trouble figuring out how to make the column headers dynamic based on the day. The following is an example of what we have working without dynamic column headers and only using 3 days.
SELECT
pivot_table.*
FROM
crosstab(
E'SELECT
"User",
"Date",
"Hours"
FROM
(SELECT
"q"."qdb_users"."name" AS "User",
to_char("qdb_works"."date", \'YYYY-MM-DD\') AS "Date",
sum("qdb_works"."hours") AS "Hours"
FROM
"q"."qdb_works"
LEFT OUTER JOIN
"q"."qdb_users" ON
"q"."qdb_users"."id" = "q"."qdb_works"."qdb_user_id"
WHERE
"qdb_works"."date" > current_date - 20
GROUP BY
"User",
"Date"
ORDER BY
"Date" DESC,
"User" DESC) "x"
ORDER BY 1, 2')
AS
pivot_table (
"User" VARCHAR,
"2017-10-06" FLOAT,
"2017-10-05" FLOAT,
"2017-10-04" FLOAT
);
This results in
| User | 2017-10-05 | 2017-10-04 | 2017-10-03 |
|------|------------|------------|------------|
| John | 1.5 | 3.25 | 2.25 |
| Jill | 6.25 | 6.25 | 6 |
| Bill | 2.75 | 3 | 4 |
This is correct, but tomorrow, the column headers will be off unless we update the query every day. I know we could pivot this table with date on the left and names on the top, but that will still need updating with each new employee – and we get new ones often.
We have tried using functions and queries in the "AS" section with no luck. For example:
AS
pivot_table (
"User" VARCHAR,
current_date - 0 FLOAT,
current_date - 1 FLOAT,
current_date - 2 FLOAT
);
Is there any way to pull this off with one query?
You could select a row for each user, and then per column sum the hours for one day:
with user_work as
(
  select u.name as "user"
       , to_char(w.date, 'YYYY-MM-DD') as dt_str
       , w.hours
  from qdb_works w
  join qdb_users u
    on u.id = w.qdb_user_id
  where w.date >= current_date - interval '2 days'
)
select "user"
     , sum(case when dt_str = to_char(current_date,
           'YYYY-MM-DD') then hours end) as Today
     , sum(case when dt_str = to_char(current_date - interval '1 day',
           'YYYY-MM-DD') then hours end) as Yesterday
     , sum(case when dt_str = to_char(current_date - interval '2 days',
           'YYYY-MM-DD') then hours end) as DayBeforeYesterday
from user_work
group by "user";
It's often easier to return a list and pivot it client side. That also allows you to generate column names with a date.
Is there any way to pull this off with one query?
No, because a fixed SQL query cannot have any variability in its output columns. The SQL engine determines the number, types and names of every column of a query before executing it, without reading any data except in the catalog (for the structure of tables and other objects), execution being just the last of 5 stages.
A single-query dynamic pivot, if such a thing existed, couldn't be prepared, since a prepared query always has the same result structure, whereas by definition a dynamic pivot doesn't, as the rows that pivot into columns can change between executions. That would again be at odds with the Prepare-Bind-Execute model.
You may find some limited workarounds and additional explanations in other questions, for example: Execute a dynamic crosstab query. But since you mentioned specifically that "The gem we have installed (Blazer) on our site limits us to one query", I'm afraid you're out of luck. Whatever the workaround, it always needs at least one step with a query to figure out the columns and generate a dynamic query from them, and a second step executing the query generated in the previous step.
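To illustrate why it takes two steps, here is a minimal sketch of the usual workaround (all names are hypothetical and the inner crosstab source query is elided): the first statement builds the column list from the data, and the generated query must then be executed as a separate, second step.
DO $$
DECLARE
    col_list text;
    dyn_sql  text;
BEGIN
    -- Step 1: build the crosstab column definition list (last 10 days).
    SELECT string_agg(format('%I FLOAT', d::date), ', ' ORDER BY d DESC)
    INTO   col_list
    FROM   generate_series(current_date - 9, current_date, interval '1 day') AS d;
    dyn_sql := format(
        'SELECT * FROM crosstab($q$ ... $q$) AS pivot_table ("User" varchar, %s)',
        col_list);
    -- Step 2: the generated statement still has to be run separately
    -- (e.g. EXECUTE into a refcursor); that second round-trip is what
    -- a single-query tool cannot perform.
    RAISE NOTICE '%', dyn_sql;
END
$$;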

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am dealing with data preprocessing on a table containing a time series column.
Toy example, table A:
 timestamp | value
-----------+-------
 12:30:24  | 1
 12:32:21  | 3
 12:33:21  | 4
The timestamp column is ordered and always increases.
Is it possible to define a function or something else that returns true when the table has two adjacent time points whose interval is larger than a certain length, and false otherwise?
I am using PostgreSQL, thank you.
SQL Fiddle
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or returns true as soon as any row in the set is true; note, though, that as an aggregate it still reads all the rows, so it does not abort the scan early (use EXISTS for that).
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value, but it works the same for a timestamp.
Something like this:
select count(*) > 0
from (
  select timestamp,
         lag(timestamp) over (order by timestamp) as prev_ts
  from table_a
) t
where timestamp - prev_ts > interval '1 minute';
It calculates the difference between each timestamp and its "previous" timestamp, with the order defined by the timestamp column itself. The outer query then checks whether any adjacent pair is more than 1 minute apart.
lag() is called a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html
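If the table is large and you only need a yes/no answer, wrapping the windowed subquery in EXISTS lets the scan stop at the first qualifying gap; a sketch under the same table_a assumptions (my addition):
-- Returns true as soon as one adjacent pair more than 1 minute apart
-- is found; EXISTS can stop reading rows at the first match.
SELECT EXISTS (
    SELECT 1
    FROM (
        SELECT timestamp - lag(timestamp) OVER (ORDER BY timestamp) AS gap
        FROM table_a
    ) s
    WHERE gap > interval '1 minute'
) AS has_large_gap;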

Getting results between two dates in PostgreSQL

I have the following table:
+-----------+-----------+------------+----------+
| id | user_id | start_date | end_date |
| (integer) | (integer) | (date) | (date) |
+-----------+-----------+------------+----------+
Fields start_date and end_date are holding date values like YYYY-MM-DD.
An entry from this table can look like this: (1, 120, 2012-04-09, 2012-04-13).
I have to write a query that can fetch all the results matching a certain period.
The problem is that if I want to fetch results from 2012-01-01 to 2012-04-12, I get 0 results even though there is an entry with start_date = "2012-04-09" and end_date = "2012-04-13".
SELECT *
FROM mytable
WHERE (start_date, end_date) OVERLAPS ('2012-01-01'::DATE, '2012-04-12'::DATE);
Datetime functions is the relevant section in the docs.
Assuming you want all "overlapping" time periods, i.e. all that have at least one day in common.
Try to envision time periods on a straight time line and move them around before your eyes and you will see the necessary conditions.
SELECT *
FROM tbl
WHERE start_date <= '2012-04-12'::date
AND end_date >= '2012-01-01'::date;
This is sometimes faster for me than OVERLAPS - which is the other good way to do it (as @Marco already provided).
Note the subtle difference. The manual:
OVERLAPS automatically takes the earlier value of the pair as the
start. Each time period is considered to represent the half-open
interval start <= time < end, unless start and end are equal in which
case it represents that single time instant. This means for instance
that two time periods with only an endpoint in common do not overlap.
Bold emphasis mine.
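To see that difference concretely (a self-contained example, not from the original answer): two periods that share only an endpoint do not overlap under OVERLAPS, while both would satisfy the explicit <= / >= conditions above.
SELECT (date '2012-01-01', date '2012-02-01')
       OVERLAPS
       (date '2012-02-01', date '2012-03-01');  -- returns false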
Performance
For big tables the right index can help performance (a lot).
CREATE INDEX tbl_date_inverse_idx ON tbl(start_date, end_date DESC);
Possibly with another (leading) index column if you have additional selective conditions.
Note the inverse order of the two columns. See:
Optimizing queries on a range of timestamps (two columns)
Just had the same question and answered this way, in case it helps:
select *
from table
where start_date between '2012-01-01' and '2012-04-13'
   or end_date between '2012-01-01' and '2012-04-13';
Note that this misses periods that span the whole range (start_date before 2012-01-01 and end_date after 2012-04-13).
To have a query working in any locale settings, consider formatting the date yourself:
SELECT *
FROM testbed
WHERE start_date >= to_date('2012-01-01','YYYY-MM-DD')
AND end_date <= to_date('2012-04-13','YYYY-MM-DD');
Looking at the dates for which it doesn't work -- those where the day is less than or equal to 12 -- I'm wondering whether it's parsing the dates as being in YYYY-DD-MM format?
You have to use the date-part fetching method:
SELECT *
FROM testbed
WHERE start_date::date >= to_date('2012-09-08', 'YYYY-MM-DD')
  AND end_date::date <= to_date('2012-10-09', 'YYYY-MM-DD');
No offense, but to check the performance of the SQL I executed some of the above-mentioned solutions in pgsql.
Let me share statistics for the top 3 solution approaches that I came across:
1) took 1.58 ms on average
2) took 2.87 ms on average
3) took 3.95 ms on average
Now try this:
SELECT * FROM table WHERE DATE_TRUNC('day', date) >= Start Date AND DATE_TRUNC('day', date) <= End Date
This solution took 1.61 ms on average.
And the best solution is the first one, suggested by marco-mariani:
SELECT *
FROM ecs_table
WHERE (start_date, end_date) OVERLAPS ('2012-01-01'::DATE, '2012-04-12'::DATE + interval '1 day');
Let's try the range data type.
--sample data.
begin;
create temp table tbl(id serial, user_id integer, start_date date, end_date date);
insert into tbl(user_id, start_date, end_date) values(1, '2012-04-09', '2012-04-13');
insert into tbl(user_id, start_date, end_date) values(1, '2012-01-09', '2012-04-12');
insert into tbl(user_id, start_date, end_date) values(1, '2012-02-09', '2012-04-10');
insert into tbl(user_id, start_date, end_date) values(1, '2012-04-09', '2012-04-10');
commit;
add a new daterange column.
begin;
alter table tbl add column tbl_period daterange ;
update tbl set tbl_period = daterange(start_date,end_date);
commit;
--now test time.
select * from tbl
where tbl_period && daterange('2012-04-10' ::date, '2012-04-12'::date);
returns:
id | user_id | start_date | end_date | tbl_period
----+---------+------------+------------+-------------------------
1 | 1 | 2012-04-09 | 2012-04-13 | [2012-04-09,2012-04-13)
2 | 1 | 2012-01-09 | 2012-04-12 | [2012-01-09,2012-04-12)
further reference: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
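If such range queries are frequent, the && lookup can also be made indexable; a minimal sketch (the index name is made up):
-- Hypothetical GiST index supporting the && (overlap) operator:
create index tbl_period_gist_idx on tbl using gist (tbl_period);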