I want to build a daily time series starting from a certain date and calculate a few statistics for each day. However, this query is very slow... Any way to speed it up? (For example, select from the table once in the subquery and compute the various stats on that result for each day.)
In code this would look like:
for i, day in series:
    previous_days = series[0...i]
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)
Here is an example for a time series counting users with fewer than 5 messages up to each date:
with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09',
now(), '1 day'::interval) date
),
msgs as
(
select days.date,
(select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
(select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)
select * from msgs;
Query plan:
CTE Scan on msgs (cost=815579.03..815583.03 rows=200 width=24)
Output: msgs.date, msgs.lt_5, msgs.eq_1
CTE days
-> Function Scan on pg_catalog.generate_series date (cost=0.01..10.01 rows=1000 width=8)
Output: date.date
Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
CTE msgs
-> Group (cost=6192.62..815569.02 rows=200 width=8)
Output: days.date, (SubPlan 2), (SubPlan 3)
Group Key: days.date
-> Merge Join (cost=6192.62..11239.60 rows=287970 width=8)
Output: days.date
Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
-> Sort (cost=69.83..72.33 rows=1000 width=8)
Output: days.date
Sort Key: days.date
-> CTE Scan on days (cost=0.00..20.00 rows=1000 width=8)
Output: days.date
-> Sort (cost=6122.79..6266.78 rows=57594 width=8)
Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
-> Seq Scan on public.daily_messages daily_messages_2 (cost=0.00..1568.94 rows=57594 width=8)
Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
SubPlan 2
-> Aggregate (cost=2016.89..2016.90 rows=1 width=32)
Output: count(daily_messages.customer_id)
-> Seq Scan on public.daily_messages (cost=0.00..2000.89 rows=6399 width=32)
Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
SubPlan 3
-> Aggregate (cost=2001.13..2001.14 rows=1 width=32)
Output: count(daily_messages_1.customer_id)
-> Seq Scan on public.daily_messages daily_messages_1 (cost=0.00..2000.89 rows=96 width=32)
Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))
In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:
SELECT created_at::date
, sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
, sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09' -- sargable!
AND created_at < now() -- probably redundant
GROUP BY 1
WINDOW w AS (ORDER BY created_at::date);
All those correlated subqueries are probably not needed. I replaced it with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:
Postgres window function and group by exception
Conditional lead/lag function PostgreSQL?
The CTEs don't help either (unnecessary overhead). You would only need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of the implications, though: your query depends on the time zone setting of the session. Details:
Ignoring timezones altogether in Rails and PostgreSQL
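To illustrate the time zone dependency (a sketch; the actual boundaries depend on your session's TimeZone setting):
-- Same literal bounds, different session time zones:
SET timezone = 'UTC';
SELECT generate_series(timestamptz '2015-07-09', timestamptz '2015-07-11', interval '1 day');

SET timezone = 'America/New_York';
SELECT generate_series(timestamptz '2015-07-09', timestamptz '2015-07-11', interval '1 day');
-- The generated instants (and therefore the day boundaries) differ between the two runs.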
On second thought, I removed generate_series() completely: as long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway, so there is no need for generate_series() at all. It would make sense with a LEFT JOIN, but there is not enough information in the question.
Related answer explaining "sargable":
Return Table Type from A function in PostgreSQL
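To illustrate "sargable" with the table at hand (a sketch, assuming a plain index on daily_messages.created_at):
-- Not sargable: the expression on the indexed column prevents a plain index scan
SELECT count(*) FROM daily_messages
WHERE  date_trunc('day', created_at) <= timestamptz '2015-07-09';

-- Sargable: the bare column can be matched against the index directly
SELECT count(*) FROM daily_messages
WHERE  created_at < timestamptz '2015-07-09' + interval '1 day';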
You might be able to replace count(customer_id) with count(*). Not enough information in the question.
Might be optimized further, but there is not enough information to be more specific in the answer.
Include days without new entries in result
SELECT day
, sum(lt_5_day) OVER w AS lt_5
, sum(eq_1_day) OVER w AS eq_1
FROM (
SELECT day::date
FROM generate_series(date '2015-07-09', current_date, interval '1 day') day
) d
LEFT JOIN (
SELECT created_at::date AS day
, count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
, count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09'
GROUP BY 1
) m USING (day)
WINDOW w AS (ORDER BY day);
Aggregate daily counts in subquery m.
Generate the series of all days in the time range in subquery d.
Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.
I have the following table:
create table account_values
(
account_id bigint not null,
timestamp timestamp not null,
value1 numeric not null,
value2 numeric not null,
primary key (timestamp, account_id)
);
I also have the following query which, for each point of an evenly spaced generated series, produces an array of the epoch and the value1 + value2 of the row with the closest timestamp at or before that point:
select array [(trunc(extract(epoch from gs) * 1000))::text, COALESCE((equity.value1 + equity.value2), 0.000000)::text]
from generate_series((now() - '1 year'::interval)::timestamp, now(), interval '1 day') gs
left join lateral (select value1, value2
from account_values
where timestamp <= gs and account_id = ?
order by timestamp desc
limit 1) equity on (TRUE);
The issue with this method of generating such an array becomes apparent when inspecting the output of explain analyse:
Nested Loop Left Join (cost=0.45..3410.74 rows=1000 width=32) (actual time=0.134..3948.546 rows=366 loops=1)
-> Function Scan on generate_series gs (cost=0.02..10.02 rows=1000 width=8) (actual time=0.075..0.244 rows=366 loops=1)
-> Limit (cost=0.43..3.36 rows=1 width=26) (actual time=10.783..10.783 rows=1 loops=366)
-> Index Scan Backward using account_values_pkey on account_values (cost=0.43..67730.27 rows=23130 width=26) (actual time=10.782..10.782 rows=1 loops=366)
Index Cond: (("timestamp" <= gs.gs) AND (account_id = 459))
Planning Time: 0.136 ms
Execution Time: 3948.659 ms
Specifically: loops=366
This problem will only get worse if I ever decide to decrease my generated series interval time.
Is there a way to flatten this looped select into a more efficient query?
If not, what are some other approaches I can take to improving the performance?
Edit:
One hard requirement is that the result of the statement cannot be altered. For example, I don't want the range to round to the closest day. The range should always be anchored to the second the statement is invoked, with each point exactly one day apart.
Based on Edouard's answer:
with a(_timestamp,values_agg) as
(select _timestamp, array_agg(lpad( (value1 + value2)::text,6,'0')) as values_agg from account_values
where account_id = 1
and _timestamp <@ tsrange(now()::timestamp - '1 year'::interval, now()::timestamp)
group by 1)
select jsonb_agg(jsonb_build_object
(
'_timestamp', trunc(extract(epoch from _timestamp) *1000)::text
, 'values', values_agg )
) AS item from a;
Not sure you will get the exact same result, but it should be faster:
select array [ (trunc(extract(epoch from date_trunc('day', timestamp)) * 1000))::text
, (array_agg(value1 + value2 ORDER BY timestamp DESC))[1] :: text
]
from account_values
where account_id = ?
and timestamp <@ tsrange(now()::timestamp - '1 year'::interval, now()::timestamp)
group by date_trunc('day', timestamp)
Right now I am grouping data by the minute:
SELECT
date_trunc('minute', ts) ts,
...
FROM binance_trades
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
but I would like to group by 5 seconds.
I found this question: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
with the answer:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
I am using v13.3, so date_bin doesn't exist on my version of Postgres.
I don't quite understand the other answers in that question either.
Output of EXPLAIN:
Limit (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.072..76239.529 rows=13 loops=1)
-> GroupAggregate (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.071..76239.526 rows=13 loops=1)
Group Key: j.win_start
-> Sort (cost=6379566.36..6380086.58 rows=208088 width=28) (actual time=76238.000..76238.427 rows=5335 loops=1)
Sort Key: j.win_start
Sort Method: quicksort Memory: 609kB
-> Nested Loop (cost=1000.00..6356204.58 rows=208088 width=28) (actual time=23971.722..76237.055 rows=5335 loops=1)
Join Filter: (j.ts_win @> b.ts)
Rows Removed by Join Filter: 208736185
-> Seq Scan on binance_trades b (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1)
Filter: ((instrument)::text = 'ETHUSDT'::text)
Rows Removed by Filter: 126872903
-> Materialize (cost=1000.00..208323.11 rows=13 width=30) (actual time=0.000..0.001 rows=13 loops=16057040)
-> Gather (cost=1000.00..208323.05 rows=13 width=30) (actual time=3459.850..3461.076 rows=13 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on scalper_5s_intervals j (cost=0.00..207321.75 rows=5 width=30) (actual time=2336.484..3458.397 rows=4 loops=3)
Filter: ((win_start >= '2021-08-20 00:00:00'::timestamp without time zone) AND (win_start <= '2021-08-20 00:01:00'::timestamp without time zone))
Rows Removed by Filter: 5080316
Planning Time: 0.169 ms
Execution Time: 76239.667 ms
If you're only interested in a five-second distribution rather than the exact sum of a given time window (anchored to when the events really happened), you can round the timestamp down to five seconds using date_trunc() and mod(), and then group by it.
SELECT
date_trunc('second',ts)-
MOD(EXTRACT(SECOND FROM date_trunc('second',ts))::int,5)*interval'1 sec',
SUM(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT' AND
ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1
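(For comparison only: on Postgres 14+ the same bucketing could be written with date_bin(), which the question mentions. This is just a sketch and is not available on v13:)
SELECT date_bin(interval '5 seconds', ts, timestamp '2021-08-19 00:00:00') AS bucket,
       SUM(price)
FROM   binance_trades
WHERE  instrument = 'ETHUSDT'
AND    ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP  BY 1;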
Here I assume that ts and instrument are properly indexed.
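If they are not, a multicolumn index matching the WHERE clause might look like this (the index name is just illustrative):
CREATE INDEX idx_binance_trades_instrument_ts ON binance_trades (instrument, ts);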
However, if the analysis is sensitive regarding time accuracy and you cannot afford rounding the timestamps, try creating the time windows within a CTE (or subquery) as tsrange values, then join binance_trades.ts against them in the outer query using the containment operator @>.
WITH j AS (
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i))
SELECT
j.win_start ts,
SUM(price)
FROM j
JOIN binance_trades b ON ts_win @> b.ts
GROUP BY j.win_start ORDER BY j.win_start
LIMIT 5;
There is a caveat, though: this approach will get pretty slow if you're creating a 5-second series over a large time window, because you have to join these newly created records with the table binance_trades without an index. To overcome this issue you can create a temporary table and index it:
CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i);
CREATE INDEX idx_ts_5sec ON scalper_5s_intervals USING gist (ts_win);
CREATE INDEX idx_ts_5sec_winstart ON scalper_5s_intervals USING btree(win_start);
UNLOGGED tables are much faster than regular ones, but keep in mind that they're not crash safe. See documentation (emphasis mine):
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
After that your query will become much faster than the CTE approach, but still significantly slower than the first query with the rounded timestamps.
SELECT
j.win_start ts,
SUM(price)
FROM scalper_5s_intervals j
JOIN binance_trades b ON ts_win @> b.ts
WHERE j.win_start BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY j.win_start ORDER BY j.win_start
Demo: db<>fiddle
Currently, I have the following raw data in Redshift:
timestamp ,lead
==================================
"2008-04-09 10:02:01.000000",true
"2008-04-09 10:03:05.000000",true
"2008-04-09 10:31:07.000000",true
"2008-04-09 11:00:05.000000",false
...
So I would like to generate aggregated data with an interval of 30 mins. My desired outcome is:
timestamp ,count
==================================
"2008-04-09 10:00:00.000000",2
"2008-04-09 10:30:00.000000",1
"2008-04-09 11:00:00.000000",0
...
I had referred to https://stackoverflow.com/a/12046382/3238864, which is valid for PostgreSQL.
I tried to mimic the code posted by using:
with thirty_min_intervals as (
select
(select min(timestamp)::date from events) + ( n || ' minutes')::interval start_time,
(select min(timestamp)::date from events) + ((n+30) || ' minutes')::interval end_time
from generate_series(0, (24*60), 30) n
)
select count(CASE WHEN lead THEN 1 END) from events e
right join thirty_min_intervals f
on e.timestamp >= f.start_time and e.timestamp < f.end_time
group by f.start_time, f.end_time
order by f.start_time;
However, I'm getting this error:
[0A000] ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
May I know what is a good way to perform aggregated data calculations over intervals of N minutes in Redshift?
Joe's answer is a great, neat solution.
I feel you should always consider how the data is distributed and sorted when you are working in Redshift. It can have a dramatic impact on performance.
Building on Joe's great answer:
I will materialise the sample events. In practice the events will be in a table.
drop table if exists public.temporary_events;
create table public.temporary_events AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
Now run explain:
explain
WITH time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
-- ORDER BY 1 Just to simplify the job a little
The output is:
XN HashAggregate (cost=999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
-> XN Hash Left Join DS_DIST_BOTH (cost=0.05..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=9)
Outer Dist Key: ('2018-11-27 17:00:35'::timestamp without time zone - ((rn.n)::double precision * '00:00:01'::interval))
Inner Dist Key: e."ts"
Hash Cond: ("outer"."?column2?" = "inner"."ts")
-> XN Subquery Scan rn (cost=0.00..14.95 rows=1 width=8)
Filter: (((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) <= '2017-02-16 11:00:00'::timestamp without time zone) AND ((('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)) - ((((("date_part"('minutes'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval))) * 60) % 1800) + "date_part"('seconds'::text, ('2018-11-27 17:00:35'::timestamp without time zone - ((n)::double precision * '00:00:01'::interval)))))::double precision * '00:00:01'::interval)) >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Limit (cost=0.00..1.95 rows=130 width=0)
-> XN Window (cost=0.00..1.95 rows=130 width=0)
-> XN Network (cost=0.00..1.30 rows=130 width=0)
Send to slice 0
-> XN Seq Scan on stl_scan (cost=0.00..1.30 rows=130 width=0)
-> XN Hash (cost=0.04..0.04 rows=4 width=9)
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Kablamo!
As Joe says, you may well use this pattern merrily without issue. However, once your data gets sufficiently large or your SQL logic gets complex, you may want to optimise. If for no other reason, you might like to understand the explain plan when you add more SQL logic to your code.
Three areas we can look at:
The Join. Make the join between both sets of data work on the same datatype. Here we join a timestamp to an interval.
Data distribution. Materialise and distribute both tables by timestamp.
Data sorting. If events is sorted by this timestamp and the time dimension is sorted by both timestamps then you may be able to complete the entire query using a merge join without any data moving and without sending the data to the leader node for aggregation.
Observe:
drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
distkey(dtm) sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;
drop table if exists public.temporary_events;
create table public.temporary_events
distkey(ts) sortkey(ts)
AS
select ts::timestamp as ts
,lead
from
( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
;
explain
SELECT
dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM public.temporary_time_dimension td
LEFT JOIN public.temporary_events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
--order by dtm_half_hour
This then gives:
XN HashAggregate (cost=1512.67..1512.68 rows=1 width=9)
-> XN Merge Left Join DS_DIST_NONE (cost=0.00..1504.26 rows=1682 width=9)
Merge Cond: ("outer".dtm = "inner"."ts")
-> XN Seq Scan on temporary_time_dimension td (cost=0.00..1500.00 rows=1682 width=16)
Filter: ((dtm_half_hour <= '2017-02-16 11:00:00'::timestamp without time zone) AND (dtm_half_hour >= '2017-02-16 09:30:00'::timestamp without time zone))
-> XN Seq Scan on temporary_events e (cost=0.00..0.04 rows=4 width=9)
Important caveats:
I've taken the ORDER BY out. Putting it back in will result in the data being sent to the leader node for sorting. If you can do away with the sort, then do away with the sort!
I'm certain that choosing timestamp as the events table sort key will NOT be ideal in many situations. I just thought I'd show what is possible.
I think you will likely want to have a time dimension created with diststyle all and sorted (see the sketch after this list). This will ensure that your joins do not generate network traffic.
You can use ROW_NUMBER() to generate a series. I use internal tables that I know to be large. FWIW, I would typically persist the time_dimension to a real table to avoid doing this repeatedly.
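For example, a diststyle all variant of the time dimension (mentioned in the caveat above) could be created like this (a sketch, reusing the generation logic from earlier):
drop table if exists public.temporary_time_dimension;
create table public.temporary_time_dimension
diststyle all sortkey(dtm, dtm_half_hour)
AS (SELECT dtm::timestamp as dtm
          ,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
    FROM /* 1 timestamp per second working backwards from NOW() */
         (SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
          FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
;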
Here you go:
WITH events
AS ( SELECT '2017-02-16 10:02:01'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:03:05'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 10:31:07'::timestamp as ts, true::boolean as lead
UNION ALL SELECT '2017-02-16 11:00:05'::timestamp as ts, false::boolean as lead)
,time_dimension
AS (SELECT dtm
,dtm - ((DATEPART(SECONDS,dtm) + (DATEPART(MINUTES,dtm)*60) % 1800) * INTERVAL '1 second') AS dtm_half_hour
FROM /* Create a series of timestamp. 1 per second working backwards from NOW(). */
/* NB: `sysdate` could be substituted for an arbitrary ending timestamp */
(SELECT DATE_TRUNC('SECONDS',sysdate) - (n * INTERVAL '1 second') AS dtm
FROM /* Generate a number sequence of 100,000 values from a large internal table */
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 100000) rn) rn)
SELECT dtm_half_hour
,COUNT(CASE WHEN lead THEN 1 END)
FROM time_dimension td
LEFT JOIN events e
ON td.dtm = e.ts
WHERE td.dtm_half_hour BETWEEN '2017-02-16 09:30:00' AND '2017-02-16 11:00:00'
GROUP BY 1
ORDER BY 1
;
I have a task: get the first, last, max, and min values from each time-based group of data. My solution works, but it is extremely slow because the row count in the table is about 50 million.
How can I improve the performance of this query:
SELECT
date_trunc('minute', t_ordered."timestamp"),
MIN (t_ordered.price),
MAX (t_ordered.price),
FIRST (t_ordered.price),
LAST (t_ordered.price)
FROM(
SELECT t.price, t."timestamp"
FROM trade t
WHERE t."timestamp" >= '2016-01-01' AND t."timestamp" < '2016-09-01'
ORDER BY t."timestamp" ASC
) t_ordered
GROUP BY 1
ORDER BY 1
FIRST and LAST are the aggregate functions from the PostgreSQL wiki.
The timestamp column is indexed.
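For reference, the wiki aggregates are essentially defined like this (a sketch from memory; check the wiki page for the canonical version):
-- first(): keeps the first non-null value encountered (STRICT skips the initial NULL state)
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
    SELECT $1;
$$;
CREATE AGGREGATE public.first (anyelement) (
    SFUNC = public.first_agg,
    STYPE = anyelement
);

-- last(): keeps overwriting the state with the latest value
CREATE OR REPLACE FUNCTION public.last_agg (anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
    SELECT $2;
$$;
CREATE AGGREGATE public.last (anyelement) (
    SFUNC = public.last_agg,
    STYPE = anyelement
);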
explain (analyze, verbose):
GroupAggregate (cost=13112830.84..33300949.59 rows=351556 width=14) (actual time=229538.092..468212.450 rows=351138 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), min(t_ordered.price), max(t_ordered.price), first(t_ordered.price), last(t_ordered.price)
Group Key: (date_trunc('minute'::text, t_ordered."timestamp"))
-> Sort (cost=13112830.84..13211770.66 rows=39575930 width=14) (actual time=229515.281..242472.677 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), t_ordered.price
Sort Key: (date_trunc('minute'::text, t_ordered."timestamp"))
Sort Method: external sort Disk: 932656kB
-> Subquery Scan on t_ordered (cost=6848734.55..7442373.50 rows=39575930 width=14) (actual time=102166.368..155540.492 rows=38721704 loops=1)
Output: date_trunc('minute'::text, t_ordered."timestamp"), t_ordered.price
-> Sort (cost=6848734.55..6947674.38 rows=39575930 width=14) (actual time=102165.836..130971.804 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Sort Key: t."timestamp"
Sort Method: external merge Disk: 993480kB
-> Seq Scan on public.trade t (cost=0.00..1178277.21 rows=39575930 width=14) (actual time=0.055..25726.038 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Filter: ((t."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (t."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9666450
Planning time: 1.663 ms
Execution time: 468949.753 ms
Maybe it can be done with window functions? I have tried, but I do not have enough knowledge to use them.
Creating a type and adequate aggregates will hopefully work better:
create type tp as (timestamp timestamp, price int);
create or replace function min_tp (tp, tp)
returns tp as $$
select least($1, $2);
$$ language sql immutable;
create aggregate min (tp) (
sfunc = min_tp,
stype = tp
);
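The max counterpart is symmetric (a sketch mirroring the min definition above, using greatest() for the composite comparison):
create or replace function max_tp (tp, tp)
returns tp as $$
select greatest($1, $2);
$$ language sql immutable;

create aggregate max (tp) (
sfunc = max_tp,
stype = tp
);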
These min and max aggregates over the composite type will reduce the query to a single loop:
select
date_trunc('minute', timestamp) as minute,
min (price) as price_min,
max (price) as price_max,
(min ((timestamp, price)::tp)).price as first,
(max ((timestamp, price)::tp)).price as last
from trade
where timestamp >= '2016-01-01' and timestamp < '2016-09-01'
group by 1
order by 1
explain (analyze, verbose):
GroupAggregate (cost=6954022.61..27159050.82 rows=287533 width=14) (actual time=129286.817..510119.582 rows=351138 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), min(price), max(price), (min(ROW("timestamp", price)::tp)).price, (max(ROW("timestamp", price)::tp)).price
Group Key: (date_trunc('minute'::text, trade."timestamp"))
-> Sort (cost=6954022.61..7053049.25 rows=39610655 width=14) (actual time=129232.165..156277.718 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), price, "timestamp"
Sort Key: (date_trunc('minute'::text, trade."timestamp"))
Sort Method: external merge Disk: 1296392kB
-> Seq Scan on public.trade (cost=0.00..1278337.71 rows=39610655 width=14) (actual time=0.035..45335.947 rows=38721704 loops=1)
Output: date_trunc('minute'::text, "timestamp"), price, "timestamp"
Filter: ((trade."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (trade."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9708857
Planning time: 0.286 ms
Execution time: 510648.395 ms
I have the following tables:
users (id, network_id)
networks (id)
private_messages (id, sender_id, receiver_id, created_at)
I have indexes on users.network_id and on all 3 columns in private_messages; however, the query is skipping the indexes and taking a very long time to run. Any ideas what is wrong with the query that is causing the indexes to be skipped?
EXPLAIN ANALYZE SELECT COUNT(*)
FROM "networks"
WHERE (
networks.created_at BETWEEN ((timestamp '2013-01-01')) AND (( (timestamp '2013-01-31') + interval '-1 second'))
AND (SELECT COUNT(*) FROM private_messages INNER JOIN users ON private_messages.receiver_id = users.id WHERE users.network_id = networks.id AND (private_messages.created_at BETWEEN ((timestamp '2013-03-01')) AND (( (timestamp '2013-03-31') + interval '-1 second'))) ) > 0)
Result:
Aggregate (cost=722675247.10..722675247.11 rows=1 width=0) (actual time=519916.108..519916.108 rows=1 loops=1)
-> Seq Scan on networks (cost=0.00..722675245.34 rows=703 width=0) (actual time=2576.205..519916.044 rows=78 loops=1)
Filter: ((created_at >= '2013-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-01-30 23:59:59'::timestamp without time zone) AND ((SubPlan 1) > 0))
SubPlan 1
-> Aggregate (cost=50671.34..50671.35 rows=1 width=0) (actual time=240.359..240.359 rows=1 loops=2163)
-> Hash Join (cost=10333.69..50671.27 rows=28 width=0) (actual time=233.997..240.340 rows=13 loops=2163)
Hash Cond: (private_messages.receiver_id = users.id)
-> Bitmap Heap Scan on private_messages (cost=10127.11..48675.15 rows=477136 width=4) (actual time=56.599..232.855 rows=473686 loops=1809)
Recheck Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on index_private_messages_on_created_at (cost=0.00..10007.83 rows=477136 width=0) (actual time=54.551..54.551 rows=473686 loops=1809)
Index Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Hash (cost=205.87..205.87 rows=57 width=4) (actual time=0.218..0.218 rows=2 loops=2163)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> Index Scan using index_users_on_network_id on users (cost=0.00..205.87 rows=57 width=4) (actual time=0.154..0.215 rows=2 loops=2163)
Index Cond: (network_id = networks.id)
Total runtime: 519916.183 ms
Thank you.
Let's try something different. I am only suggesting this as an "answer" because of its length; you cannot format a comment. Let's approach the query modularly, as a series of subsets that need to be intersected, and see how long each of these takes to execute (please report back). Substitute your timestamps for t1 and t2. Note how each query builds upon the prior one, making the prior one an "inline view".
EDIT: also, please confirm the columns in the Networks table.
1
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
2
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
3
select N.* from networks N
join
(
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
) as BAR
on N.id = BAR.network_id
First, I think you want an index on networks.created_at, even though right now, with over 10% of the table matching the WHERE, it probably won't be used.
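For example (the index name is just illustrative):
CREATE INDEX index_networks_on_created_at ON networks (created_at);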
Next, I expect you will get better speed if you try to get as much logic as possible into one query, instead of splitting some into a subquery. I believe the plan indicates iterating over each matching value of networks.id; usually an all-at-once join works better.
I think the code below is logically equivalent. If not, close.
SELECT COUNT(*)
FROM
(SELECT users.network_id FROM "networks"
JOIN users
ON users.network_id = networks.id
JOIN private_messages
ON private_messages.receiver_id = users.id
AND (private_messages.created_at
BETWEEN ((timestamp '2013-03-01'))
AND (( (timestamp '2013-03-31') + interval '-1 second')))
WHERE
networks.created_at
BETWEEN ((timestamp '2013-01-01'))
AND (( (timestamp '2013-01-31') + interval '-1 second'))
GROUP BY users.network_id)
AS main_subquery
;
My experience is that you will get the same query plan if you move the networks.created_at condition into the ON clause for the users-networks join. I don't think your issue is timestamps; it's the structure of the query. You may also get a better (or worse) plan by replacing the GROUP BY in the subquery with SELECT DISTINCT users.network_id.
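That DISTINCT variant would look like this (the same query as above, with only the subquery changed):
SELECT COUNT(*)
FROM
  (SELECT DISTINCT users.network_id FROM "networks"
   JOIN users
     ON users.network_id = networks.id
   JOIN private_messages
     ON private_messages.receiver_id = users.id
     AND (private_messages.created_at
          BETWEEN ((timestamp '2013-03-01'))
          AND (( (timestamp '2013-03-31') + interval '-1 second')))
   WHERE
     networks.created_at
     BETWEEN ((timestamp '2013-01-01'))
     AND (( (timestamp '2013-01-31') + interval '-1 second'))
  ) AS main_subquery
;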