Redshift GROUP BY time interval - sql

Currently, I have the following raw data in Redshift.
timestamp ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...
So, I would like to generate aggregated data at 30-minute intervals. My desired output is
timestamp ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...
I referred to https://stackoverflow.com/a/12046382/3238864 , which is valid for PostgreSQL.
I tried to mimic the posted code, using
with thirty_min_intervals as (
select
(select min(timestamp)::date from events) + ( n || ' minutes')::interval start_time,
(select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;
However, I'm getting the error
[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
May I know what a good way is to perform this kind of aggregation, over N-minute intervals, in Redshift?

Joe's answer is a great neat solution.
I feel one should always consider how the data is distributed and sorted when you are working in Redshift. It can have a dramatic impact on performance.
Building on Joe's great answer:
I will materialise the sample events. In practice the events will be in a table.
drop table if exists public.temporary_events;
create table public.temporary_events AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
Now run explain:
explain
WITH time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 Just to simplify the job a little
The output is:
XN HashAggregate (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
-> XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
Inner Dist Key: e."ts"
Hash Cond: ("outer"."?column2?" = "inner"."ts")
-> XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Limit (cost=0.00..1.95 rows=130 width=0)
-> XN Window (cost=0.00..1.95 rows=130 width=0)
-> XN Network (cost=0.00..1.30 rows=130 width=0)
Send to slice 0
-> XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
-> XN Hash (cost=0.04..0.04 rows=4 width=9)
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Kablamo!
As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large or your SQL logic becomes complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic to your code.
Three areas we can look at:
The Join. Make the join between both sets of data work on the same datatype. Here we join a timestamp to an interval.
Data distribution. Materialise and distribute both tables by timestamp.
Data sorting. If events is sorted by this timestamp and the time dimension is sorted by both timestamps then you may be able to complete the entire query using a merge join without any data moving and without sending the data to the leader node for aggregation.
Observe:
drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;
drop table if exists public.temporary_events;
create table public.temporary_events
distkey(ts) sortkey(ts)
AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
explain
SELECT
dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour
This then gives:
XN HashAggregate (cost=1512.67..1512.68 rows=1 width=9)
-> XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
Merge Cond: ("outer".dtm = "inner"."ts")
-> XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Important caveats:
I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with the sort!
I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
I think you will likely want to have a time dimension created with diststyle all and sorted. This will ensure that your joins do not generate network traffic.
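For illustration, a DISTSTYLE ALL variant of the time dimension could be created like this (a sketch only; the table name is mine, and the SELECT is the same as above):
-- hypothetical: same time dimension, replicated to every node instead of distributed
drop table if exists public.temporary_time_dimension_all;
create table public.temporary_time_dimension_all
diststyle all sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
          ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;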

You can use ROW_NUMBER() to generate a series. I use internal tables that I know to be large. FWIW, I would typically persist the time_dimension to a real table to avoid doing this repeatedly.
Here you go:
WITH events
AS ( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
,time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
ORDER BY 1
;

Related

How do I flatten the complexity of this select statement in postgresql?

I have the following table:
create table account_values
(
account_id bigint not null,
timestamp timestamp not null,
value1 numeric not null,
value2 numeric not null,
primary key (timestamp, account_id)
);
I also have the following query which, for each point of an evenly spaced generated series, produces an array containing that point's epoch in milliseconds and the value1+value2 of the row with the closest earlier timestamp:
select array [(trunc(extract(epoch from gs) * 1000))::text, COALESCE((values.value1 + values.value2), 0.000000)::text]
from generate_series((now() - '1 year'::interval)::timestamp, now(), interval '1 day') gs
left join lateral (select value1, value2
from account_values
where timestamp <= gs and account_id = ?
order by timestamp desc
limit 1) equity on (TRUE);
The issue with this method of generating such an array becomes apparent when inspecting the output of explain analyse:
Nested Loop Left Join (cost=0.45..3410.74 rows=1000 width=32) (actual time=0.134..3948.546 rows=366 loops=1)
-> Function Scan on generate_series gs (cost=0.02..10.02 rows=1000 width=8) (actual time=0.075..0.244 rows=366 loops=1)
-> Limit (cost=0.43..3.36 rows=1 width=26) (actual time=10.783..10.783 rows=1 loops=366)
-> Index Scan Backward using account_values_pkey on account_values (cost=0.43..67730.27 rows=23130 width=26) (actual time=10.782..10.782 rows=1 loops=366)
" Index Cond: ((""timestamp"" <= gs.gs) AND (account_id = 459))"
Planning Time: 0.136 ms
Execution Time: 3948.659 ms
Specifically: loops=366
This problem will only get worse if I ever decide to decrease my generated series interval time.
Is there a way to flatten this looped select into a more efficient query?
If not, what are some other approaches I can take to improving the performance?
Edit:
One hard requirement is that the result of the statement cannot be altered. For example, I don't want the range to round to the closest day. The range should always start at the second the statement is invoked, with each interval precisely one day before it.
Based on Edouard's answer:
with a(_timestamp,values_agg) as
(select _timestamp, array_agg(lpad( (value1 + value2)::text,6,'0')) as values_agg from account_values
where account_id = 1
and _timestamp <#tsrange(now()::timestamp - '1 year'::interval, now()::timestamp)
group by 1)
select jsonb_agg(jsonb_build_object
(
'_timestamp', trunc(extract(epoch from _timestamp) *1000)::text
, 'values', values_agg )
) AS item from a;
Not sure you will get the exact same result, but it should be faster:
select array [ (trunc(extract(epoch from date_trunc('day', timestamp)) * 1000))::text
, (array_agg(value1 + value2 ORDER BY timestamp DESC))[1] :: text
]
from account_values
where account_id = ?
and timestamp <# tsrange(now() - '1 year'::interval, now())
group by date_trunc('day', timestamp)

How to get a count of records by minute using a datetime column

I have a table with columns below:
Customer  Time_Start
A         01/20/2020 01:25:00
A         01/22/2020 14:15:00
A         01/20/2020 03:23:00
A         01/21/2020 20:37:00
I am trying to get a table that outputs counts by minute (including zeros) for a given day, i.e.
Customer  Time_Start           Count
A         01/20/2020 00:01:00  5
A         01/20/2020 00:02:00  2
A         01/20/2020 00:03:00  0
A         01/20/2020 00:04:00  12
I would like to have it show only 1 day for 1 customer at a time.
Here is what I have so far:
select
customer,
cast(time_start as time) + ((cast(time_start as time) - cast('00:00:00' as time)) hour(2)) as TimeStampHour,
count(*) as count
from Table_1
where customer in ('A')
group by customer, TimeStampHour
order by TimeStampHour
In Teradata 16.20 this would be a simple task using the new time series aggregation
SELECT
customer
,Cast(Begin($Td_TimeCode_Range) AS VARCHAR(16))
,Count(*)
FROM table_1
WHERE customer = 'A'
AND time_start BETWEEN TIMESTAMP'2020-01-20 00:00:00'
AND Prior(TIMESTAMP'2020-01-20 00:00:00' + INTERVAL '1' DAY)
GROUP BY TIME(Minutes(1) AND customer)
USING timecode(time_start)
FILL (0)
;
Before that, you must implement it similarly to Ramin Faracov's answer: create a list of all minutes first and then left join to it. But I prefer doing the count before joining:
WITH all_minutes AS
( -- create a list of all minutes
SELECT
Begin(pd) AS bucket
FROM
( -- EXPAND ON requires FROM and TRUNC materializes the FROM avoiding error
-- "9303. EXPAND ON clause must not be specified in a query expression with no table references."
SELECT Cast(Trunc(DATE '2020-01-20') AS TIMESTAMP(0)) AS start_date
) AS dt
EXPAND ON PERIOD(start_date, start_date + INTERVAL '1' DAY) AS pd
BY INTERVAL '1' MINUTE
)
SELECT customer
,bucket
,Coalesce(Cnt, 0)
FROM all_minutes
LEFT JOIN
(
SELECT customer
,time_start
- (Extract (SECOND From time_start) * INTERVAL '1' SECOND) AS time_minute
,Count(*) AS Cnt
FROM table_1
WHERE customer = 'A'
AND time_start BETWEEN TIMESTAMP'2020-01-20 00:00:00'
AND Prior(TIMESTAMP'2020-01-20 00:00:00' + INTERVAL '1' DAY)
GROUP BY customer, time_minute
) AS counts
ON counts.time_minute = bucket
ORDER BY bucket
;
First, we create a function that returns a table with a single field, whose rows start from the time_start value and increase in one-minute steps.
CREATE OR REPLACE FUNCTION get_dates(time_start timestamp without time zone)
RETURNS TABLE(list_dates timestamp without time zone)
LANGUAGE plpgsql
AS $function$
declare
time_end timestamp;
begin
time_end = time_start + interval '1 day';
return query
SELECT t1.dates
FROM generate_series(time_start, time_end, interval '1 min') t1(dates);
END;
$function$
;
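A quick sanity check of the function's output (this example call is mine, not part of the original answer):
-- hypothetical call: the first rows should land on whole minutes
SELECT list_dates FROM get_dates('2021-02-03 00:00:00'::timestamp) LIMIT 3;
-- 2021-02-03 00:00:00
-- 2021-02-03 00:01:00
-- 2021-02-03 00:02:00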
Note that this function returns a list of timestamps with an hour and a minute, but the seconds and milliseconds are always zero. We must join this function's result with our customer table using the timestamp fields. But our customer table's timestamp column holds fully formatted values: seconds and milliseconds are not empty, so the join condition would never match. For a correct join, all values of the timestamp column in the customer table must also have zeroed seconds and milliseconds. For this, I created an immutable function: when you pass it a fully formatted DateTime such as 2021-02-03 18:24:51.203, it returns a timestamp with the seconds and milliseconds zeroed out, in this format: 2021-02-03 18:24:00.000
CREATE OR REPLACE FUNCTION clear_seconds_from_datetime(dt timestamp without time zone)
RETURNS timestamp without time zone
LANGUAGE plpgsql
IMMUTABLE
AS $function$
DECLARE
str_date text;
str_hour text;
str_min text;
v_ret timestamp;
begin
str_date = (dt::date)::text;
str_hour = (extract(hour from dt))::text;
str_min = (extract(min from dt))::text;
str_date = str_date || ' ' || str_hour || ':' || str_min || ':' || '00';
v_ret = str_date::timestamp;
return v_ret;
END;
$function$
;
Why is our function immutable? For high performance, I want to use a function-based index in PostgreSQL, and PostgreSQL only allows immutable functions in index expressions. We use this function in the join condition. Now let's create the function-based index:
CREATE INDEX customer_date_function_idx ON customer USING btree ((clear_seconds_from_datetime(datestamp)));
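As a side note (my assumption, not part of the original answer): if datestamp is a timestamp without time zone, the built-in date_trunc('minute', ...) is also immutable, zeroes out seconds and milliseconds, and could back an equivalent expression index:
-- hypothetical alternative to clear_seconds_from_datetime using the built-in function
SELECT date_trunc('minute', timestamp '2021-02-03 18:24:51.203');   -- 2021-02-03 18:24:00
CREATE INDEX customer_date_trunc_idx ON customer USING btree ((date_trunc('minute', datestamp)));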
Now let's write the main query we need:
select
t1.list_dates,
cc.userid,
count(cc.id) as count_as
from
get_dates('2021-02-03 00:01:00'::timestamp) t1 (list_dates)
left join
customer cc on clear_seconds_from_datetime(cc.datestamp) = t1.list_dates
group by t1.list_dates, cc.userid
I inserted 30 million sample rows into the customer table for testing, and this query runs in about 33 milliseconds. Here is the query plan from the explain analyze command:
HashAggregate (cost=309264.30..309432.30 rows=16800 width=20) (actual time=26.076..26.958 rows=2022 loops=1)
Group Key: t1.list_dates, lg.userid
-> Nested Loop Left Join (cost=0.68..253482.75 rows=7437540 width=16) (actual time=18.870..24.882 rows=2023 loops=1)
-> Function Scan on get_dates t1 (cost=0.25..10.25 rows=1000 width=8) (actual time=18.699..18.906 rows=1441 loops=1)
-> Index Scan using log_date_cast_idx on log lg (cost=0.43..179.09 rows=7438 width=16) (actual time=0.003..0.003 rows=1 loops=1441)
Index Cond: (clear_seconds_from_datetime(datestamp) = t1.list_dates)
Planning Time: 0.398 ms
JIT:
Functions: 12
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 3.544 ms, Inlining 0.000 ms, Optimization 0.816 ms, Emission 16.709 ms, Total 21.069 ms
Execution Time: 31.429 ms
Result:
list_dates               group_count
2021-02-03 09:41:00.000  1
2021-02-03 09:42:00.000  3
2021-02-03 09:43:00.000  1
2021-02-03 09:44:00.000  3
2021-02-03 09:45:00.000  1
2021-02-03 09:46:00.000  5
2021-02-03 09:47:00.000  2
2021-02-03 09:48:00.000  1
2021-02-03 09:49:00.000  1
2021-02-03 09:50:00.000  1
2021-02-03 09:51:00.000  4
2021-02-03 09:52:00.000  0
2021-02-03 09:53:00.000  0
2021-02-03 09:54:00.000  2
2021-02-03 09:55:00.000  1

How can I group data into batches of 5 seconds with Postgres?

right now I am grouping data by the minute:
SELECT
date_trunc('minute', ts) ts,
...
FROM binance_trades
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
but I would like to group by 5 seconds.
I found this question: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
with the answer:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
I am using v13.3, so date_bin doesn't exist on my version of Postgres.
I don't quite understand the other answers in that question either.
Output of EXPLAIN:
Limit (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.072..76239.529 rows=13 loops=1)
-> GroupAggregate (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.071..76239.526 rows=13 loops=1)
Group Key: j.win_start
-> Sort (cost=6379566.36..6380086.58 rows=208088 width=28) (actual time=76238.000..76238.427 rows=5335 loops=1)
Sort Key: j.win_start
Sort Method: quicksort Memory: 609kB
-> Nested Loop (cost=1000.00..6356204.58 rows=208088 width=28) (actual time=23971.722..76237.055 rows=5335 loops=1)
Join Filter: (j.ts_win #> b.ts)
Rows Removed by Join Filter: 208736185
-> Seq Scan on binance_trades b (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1)
Filter: ((instrument)::text = 'ETHUSDT'::text)
Rows Removed by Filter: 126872903
-> Materialize (cost=1000.00..208323.11 rows=13 width=30) (actual time=0.000..0.001 rows=13 loops=16057040)
-> Gather (cost=1000.00..208323.05 rows=13 width=30) (actual time=3459.850..3461.076 rows=13 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on scalper_5s_intervals j (cost=0.00..207321.75 rows=5 width=30) (actual time=2336.484..3458.397 rows=4 loops=3)
Filter: ((win_start >= '2021-08-20 00:00:00'::timestamp without time zone) AND (win_start <= '2021-08-20 00:01:00'::timestamp without time zone))
Rows Removed by Filter: 5080316
Planning Time: 0.169 ms
Execution Time: 76239.667 ms
If you're only interested in a five-second distribution rather than the exact sum of a given time window (when the events really happened), you can round the timestamp down to five seconds using date_trunc() and mod(), then group by it.
SELECT
date_trunc('second',ts)-
MOD(EXTRACT(SECOND FROM date_trunc('second',ts))::int,5)*interval'1 sec',
SUM(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT' AND
ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1
Here I assume that ts and instrument are properly indexed.
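For reference, a composite index covering that WHERE clause might look like this (the index name and exact column choice are my assumptions, not from the original answer):
-- hypothetical index: equality column (instrument) first, then the range column (ts)
CREATE INDEX idx_binance_trades_instrument_ts ON binance_trades (instrument, ts);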
However, if this is a time-accuracy-sensitive analysis and you cannot afford to round the timestamps, create the time windows as tsrange values within a CTE (or subquery), then in the outer query join binance_trades.ts against them using the containment operator #>.
WITH j AS (
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i))
SELECT
j.win_start ts,
SUM(price)
FROM j
JOIN binance_trades b ON ts_win #> b.ts
GROUP BY j.win_start ORDER BY j.win_start
LIMIT 5;
There is a caveat though: this approach will get pretty slow if you're creating a 5-second series over a large time window, because you'll have to join these newly created records with the binance_trades table without an index. To overcome this issue you can create a temporary table and index it:
CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i);
CREATE INDEX idx_ts_5sec ON scalper_5s_intervals USING gist (ts_win);
CREATE INDEX idx_ts_5sec_winstart ON scalper_5s_intervals USING btree(win_start);
UNLOGGED tables are much faster than regular ones, but keep in mind that they're not crash-safe. See the documentation (emphasis mine):
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
After that your query will become much faster than the CTE approach, but still significantly slower than the first query with the rounded timestamps.
SELECT
j.win_start ts,
SUM(price)
FROM scalper_5s_intervals j
JOIN binance_trades b ON ts_win #> b.ts
WHERE j.win_start BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY j.win_start ORDER BY j.win_start
Demo: db<>fiddle

Efficient subquery for aggregate time series

I want to build a daily time series starting from a certain date and calculate a few statistics for each day. However, this query is very slow... Any way to speed it up? (For example, select the table once in the subquery and compute various stats on that table for each day.)
In code this would look like
for i, day in series:
previous_days = series[0...i]
some_calculation_a = some_operation_on(previous_days)
some_calculation_b = some_other_operation_on(previous_days)
Here is an example for a time series looking for users with <= 5 messages up to that date:
with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09',
now(), '1 day'::interval) date
),
msgs as
(
select days.date,
(select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
(select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)
select * from msgs;
Query breakdown:
CTE Scan on msgs (cost=815579.03..815583.03 rows=200 width=24)
Output: msgs.date, msgs.lt_5, msgs.eq_1
CTE days
-> Function Scan on pg_catalog.generate_series date (cost=0.01..10.01 rows=1000 width=8)
Output: date.date
Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
CTE msgs
-> Group (cost=6192.62..815569.02 rows=200 width=8)
Output: days.date, (SubPlan 2), (SubPlan 3)
Group Key: days.date
-> Merge Join (cost=6192.62..11239.60 rows=287970 width=8)
Output: days.date
Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
-> Sort (cost=69.83..72.33 rows=1000 width=8)
Output: days.date
Sort Key: days.date
-> CTE Scan on days (cost=0.00..20.00 rows=1000 width=8)
Output: days.date
-> Sort (cost=6122.79..6266.78 rows=57594 width=8)
Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
-> Seq Scan on public.daily_messages daily_messages_2 (cost=0.00..1568.94 rows=57594 width=8)
Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
SubPlan 2
-> Aggregate (cost=2016.89..2016.90 rows=1 width=32)
Output: count(daily_messages.customer_id)
-> Seq Scan on public.daily_messages (cost=0.00..2000.89 rows=6399 width=32)
Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
SubPlan 3
-> Aggregate (cost=2001.13..2001.14 rows=1 width=32)
Output: count(daily_messages_1.customer_id)
-> Seq Scan on public.daily_messages daily_messages_1 (cost=0.00..2000.89 rows=96 width=32)
Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))
In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:
SELECT created_at::date
, sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
, sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09' -- sargable!
AND created_at < now() -- probably redundant
GROUP BY 1
WINDOW w AS (ORDER BY created_at::date);
All those correlated subqueries are probably not needed. I replaced them with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:
Postgres window function and group by exception
Conditional lead/lag function PostgreSQL?
The CTEs don't help either (unnecessary overhead). You would only need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of the implications, though: your query depends on the time zone setting of the session. Details:
Ignoring timezones altogether in Rails and PostgreSQL
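A minimal illustration of that point (the SET and the call below are just a sketch, not part of the original answer): generate_series() can emit timestamptz directly, and the resulting day boundaries depend on the session time zone.
-- assumption: pinning the zone makes the generated day boundaries deterministic
SET timezone = 'UTC';
SELECT generate_series(timestamptz '2015-07-09', now(), interval '1 day') AS day;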
On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway, so there is no need for generate_series() at all. It would make sense with a LEFT JOIN. Not enough information in the question.
Related answer explaining "sargable":
Return Table Type from A function in PostgreSQL
You might replace count(customer_id) with count(*). Not enough information in the question.
Might be optimized further, but there is not enough information to be more specific in the answer.
Include days without new entries in result
SELECT day
, sum(lt_5_day) OVER w AS lt_5
, sum(eq_1_day) OVER w AS eq_1
FROM (
SELECT day::date
FROM generate_series(date '2015-07-09', current_date, interval '1 day') day
) d
LEFT JOIN (
SELECT created_at::date AS day
, count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
, count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09'
GROUP BY 1
) m USING (day)
WINDOW w AS (ORDER BY day);
Aggregate daily sums in subquery m.
Generate series of all days in time range in subquery d.
Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.

Postgres is ignoring a timestamp index, why?

I have the following tables:
users (id, network_id)
networks (id)
private_messages (id, sender_id, receiver_id, created_at)
I have indexes on users.network_id, and on all 3 columns in private_messages, but the query is skipping the indexes and taking a very long time to run. Any ideas what in the query is causing the indexes to be skipped?
EXPLAIN ANALYZE SELECT COUNT(*)
FROM "networks"
WHERE (
networks.created_at BETWEEN ((timestamp '2013-01-01')) AND (( (timestamp '2013-01-31') + interval '-1 second'))
AND (SELECT COUNT(*) FROM private_messages INNER JOIN users ON private_messages.receiver_id = users.id WHERE users.network_id = networks.id AND (private_messages.created_at BETWEEN ((timestamp '2013-03-01')) AND (( (timestamp '2013-03-31') + interval '-1 second'))) ) > 0)
Result:
Aggregate (cost=722675247.10..722675247.11 rows=1 width=0) (actual time=519916.108..519916.108 rows=1 loops=1)
-> Seq Scan on networks (cost=0.00..722675245.34 rows=703 width=0) (actual time=2576.205..519916.044 rows=78 loops=1)
Filter: ((created_at >= '2013-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-01-30 23:59:59'::timestamp without time zone) AND ((SubPlan 1) > 0))
SubPlan 1
-> Aggregate (cost=50671.34..50671.35 rows=1 width=0) (actual time=240.359..240.359 rows=1 loops=2163)
-> Hash Join (cost=10333.69..50671.27 rows=28 width=0) (actual time=233.997..240.340 rows=13 loops=2163)
Hash Cond: (private_messages.receiver_id = users.id)
-> Bitmap Heap Scan on private_messages (cost=10127.11..48675.15 rows=477136 width=4) (actual time=56.599..232.855 rows=473686 loops=1809)
Recheck Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on index_private_messages_on_created_at (cost=0.00..10007.83 rows=477136 width=0) (actual time=54.551..54.551 rows=473686 loops=1809)
Index Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Hash (cost=205.87..205.87 rows=57 width=4) (actual time=0.218..0.218 rows=2 loops=2163)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> Index Scan using index_users_on_network_id on users (cost=0.00..205.87 rows=57 width=4) (actual time=0.154..0.215 rows=2 loops=2163)
Index Cond: (network_id = networks.id)
Total runtime: 519916.183 ms
Thank you.
Let's try something different. I am only suggesting this as an "answer" because of its length and because you cannot format a comment. Let's approach the query modularly, as a series of subsets that need to be intersected. Let's see how long it takes each of these to execute (please report back). Substitute your timestamps for t1 and t2. Note how each query builds upon the prior one, making the prior one an "inline view".
EDIT: also, please confirm the columns in the Networks table.
1
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
2
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
3
select N.* from networks N
join
(
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
) as BAR
on N.id = BAR.network_id
First, I think you want an index on networks.created_at, even though right now, with over 10% of the table matching the WHERE, it probably won't be used.
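For example (the index name is my assumption):
CREATE INDEX index_networks_on_created_at ON networks (created_at);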
Next, I expect you will get better speed if you try to get as much logic as possible into one query instead of splitting some into a subquery. I believe the plan indicates iterating over each matching value of networks.id; usually an all-at-once join works better.
I think the code below is logically equivalent. If not, close.
SELECT COUNT(*)
FROM
(SELECT users.network_id FROM "networks"
JOIN users
ON users.network_id = networks.id
JOIN private_messages
ON private_messages.receiver_id = users.id
AND (private_messages.created_at
BETWEEN ((timestamp '2013-03-01'))
AND (( (timestamp '2013-03-31') + interval '-1 second')))
WHERE
networks.created_at
BETWEEN ((timestamp '2013-01-01'))
AND (( (timestamp '2013-01-31') + interval '-1 second'))
GROUP BY users.network_id)
AS main_subquery
;
My experience is that you will get the same query plan if you move the networks.created_at into the ON clause for the users-networks join. I don't think your issue is timestamps; it's the structure of the query. You may also get a better (or worse) plan by replacing the GROUP BY in the subquery with SELECT DISTINCT users.network_id.
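For concreteness, a sketch of that last variant (the same query as above; only the subquery body changes to SELECT DISTINCT):
SELECT COUNT(*)
FROM
  (SELECT DISTINCT users.network_id
   FROM "networks"
   JOIN users
     ON users.network_id = networks.id
   JOIN private_messages
     ON private_messages.receiver_id = users.id
    AND private_messages.created_at BETWEEN timestamp '2013-03-01'
                                        AND timestamp '2013-03-31' + interval '-1 second'
   WHERE networks.created_at BETWEEN timestamp '2013-01-01'
                                 AND timestamp '2013-01-31' + interval '-1 second'
  ) AS main_subquery
;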