I have a task: get the first, last, max, and min price from each time-based group of data. My solution works, but it is extremely slow because the table holds about 50 million rows.
How can I improve the performance of this query?
SELECT
date_trunc('minute', t_ordered."timestamp"),
MIN (t_ordered.price),
MAX (t_ordered.price),
FIRST (t_ordered.price),
LAST (t_ordered.price)
FROM(
SELECT t.price, t."timestamp"
FROM trade t
WHERE t."timestamp" >= '2016-01-01' AND t."timestamp" < '2016-09-01'
ORDER BY t."timestamp" ASC
) t_ordered
GROUP BY 1
ORDER BY 1
FIRST and LAST are the aggregate functions from the PostgreSQL wiki.
The timestamp column is indexed.
explain (analyze, verbose):
GroupAggregate (cost=13112830.84..33300949.59 rows=351556 width=14) (actual time=229538.092..468212.450 rows=351138 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), min(t_ordered.price), max(t_ordered.price), first(t_ordered.price), last(t_ordered.price)
Group Key: (date_trunc('minute'::text, t_ordered."timestamp"))
-> Sort (cost=13112830.84..13211770.66 rows=39575930 width=14) (actual time=229515.281..242472.677 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), t_ordered.price
Sort Key: (date_trunc('minute'::text, t_ordered."timestamp"))
Sort Method: external sort Disk: 932656kB
-> Subquery Scan on t_ordered (cost=6848734.55..7442373.50 rows=39575930 width=14) (actual time=102166.368..155540.492 rows=38721704 loops=1)
Output: date_trunc('minute'::text, t_ordered."timestamp"), t_ordered.price
-> Sort (cost=6848734.55..6947674.38 rows=39575930 width=14) (actual time=102165.836..130971.804 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Sort Key: t."timestamp"
Sort Method: external merge Disk: 993480kB
-> Seq Scan on public.trade t (cost=0.00..1178277.21 rows=39575930 width=14) (actual time=0.055..25726.038 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Filter: ((t."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (t."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9666450
Planning time: 1.663 ms
Execution time: 468949.753 ms
Maybe it can be done with window functions? I have tried, but I do not have enough knowledge to use them.
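For reference, the window-function idea might look roughly like this (a sketch only; it still has to sort the whole range, so it is not necessarily faster than the FIRST/LAST aggregates):
SELECT minute,
       min(price)       AS price_min,
       max(price)       AS price_max,
       min(price_first) AS price_first,  -- constant per minute, min() just picks it
       min(price_last)  AS price_last
FROM (
    SELECT date_trunc('minute', "timestamp") AS minute,
           price,
           first_value(price) OVER w AS price_first,
           last_value(price)  OVER w AS price_last
    FROM trade
    WHERE "timestamp" >= '2016-01-01' AND "timestamp" < '2016-09-01'
    WINDOW w AS (PARTITION BY date_trunc('minute', "timestamp")
                 ORDER BY "timestamp"
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) sub
GROUP BY minute
ORDER BY minute;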
Creating a type and adequate aggregates will hopefully work better:
create type tp as (timestamp timestamp, price int);
create or replace function min_tp (tp, tp)
returns tp as $$
select least($1, $2);
$$ language sql immutable;
create aggregate min (tp) (
sfunc = min_tp,
stype = tp
);
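The max counterpart is analogous, presumably using greatest() instead of least(); a sketch:
create or replace function max_tp (tp, tp)
returns tp as $$
select greatest($1, $2);  -- relies on built-in composite-type comparison, like least() above
$$ language sql immutable;
create aggregate max (tp) (
sfunc = max_tp,
stype = tp
);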
These min and max aggregates over tp (the max counterpart is sketched above) reduce the query to a single pass over the table:
select
date_trunc('minute', timestamp) as minute,
min (price) as price_min,
max (price) as price_max,
(min ((timestamp, price)::tp)).price as first,
(max ((timestamp, price)::tp)).price as last
from t
where timestamp >= '2016-01-01' and timestamp < '2016-09-01'
group by 1
order by 1
explain (analyze, verbose):
GroupAggregate (cost=6954022.61..27159050.82 rows=287533 width=14) (actual time=129286.817..510119.582 rows=351138 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), min(price), max(price), (min(ROW("timestamp", price)::tp)).price, (max(ROW("timestamp", price)::tp)).price
Group Key: (date_trunc('minute'::text, trade."timestamp"))
-> Sort (cost=6954022.61..7053049.25 rows=39610655 width=14) (actual time=129232.165..156277.718 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), price, "timestamp"
Sort Key: (date_trunc('minute'::text, trade."timestamp"))
Sort Method: external merge Disk: 1296392kB
-> Seq Scan on public.trade (cost=0.00..1278337.71 rows=39610655 width=14) (actual time=0.035..45335.947 rows=38721704 loops=1)
Output: date_trunc('minute'::text, "timestamp"), price, "timestamp"
Filter: ((trade."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (trade."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9708857
Planning time: 0.286 ms
Execution time: 510648.395 ms
I have the following partitioned table
Column | Type | Modifiers | Storage | Stats target | Description
---------------+-----------------------------+------------------------+---------+--------------+-------------
time | timestamp without time zone | not null | plain | |
connection_id | integer | not null | plain | |
is_authorized | boolean | not null default false | plain | |
is_active | boolean | not null default true | plain | |
Indexes:
"active_connection_time_idx" btree ("time")
Child tables: metrics.active_connection_2022_02_26t00,
metrics.active_connection_2022_02_27t00,
metrics.active_connection_2022_02_28t00,
metrics.active_connection_2022_03_01t00,
metrics.active_connection_2022_03_02t00,
metrics.active_connection_2022_04_21t00
All partitions have indexes on the time column.
I need to execute the following query:
SELECT c.connection_id, (array_agg(is_authorized order by time desc))[1], bool_or(is_active) FROM metrics.active_connection c WHERE c.time BETWEEN '2022-01-26 00:00:00' AND '2022-04-15 23:59:59' GROUP BY c.connection_id;
And I get this plan (a quick seq scan, but a slow external sort):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1878772.55..1999873.62 rows=200 width=6) (actual time=11516.621..22951.961 rows=30631 loops=1)
Group Key: c.connection_id
-> Sort (cost=1878772.55..1909047.19 rows=12109857 width=14) (actual time=11388.096..15601.938 rows=12109856 loops=1)
Sort Key: c.connection_id
Sort Method: external merge Disk: 319520kB
-> Append (cost=0.00..247108.84 rows=12109857 width=14) (actual time=0.022..5346.587 rows=12109856 loops=1)
-> Seq Scan on active_connection c (cost=0.00..0.00 rows=1 width=14) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
-> Seq Scan on active_connection_2022_02_26t00 c_1 (cost=0.00..21728.74 rows=1064849 width=14) (actual time=0.017..307.754 rows=1064849 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
......
-> Seq Scan on active_connection_2022_03_02t00 c_5 (cost=0.00..20964.04 rows=1027336 width=14) (actual time=0.018..268.314 rows=1027336 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
If I add an index on the connection_id column, I get another plan (a slow index scan, but a quick in-memory sort):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=2.23..1071044.89 rows=200 width=6) (actual time=203.337..49643.802 rows=30631 loops=1)
Group Key: c.connection_id
-> Merge Append (cost=2.23..980218.46 rows=12109857 width=14) (actual time=184.137..38926.435 rows=12109856 loops=1)
Sort Key: c.connection_id
-> Sort (cost=0.01..0.02 rows=1 width=14) (actual time=0.036..0.037 rows=0 loops=1)
Sort Key: c.connection_id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on active_connection c (cost=0.00..0.00 rows=1 width=14) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
-> Index Scan using active_connection_2022_02_26t00_conn_id on active_connection_2022_02_26t00 c_1 (cost=0.43..56013.08 rows=1064849 width=14) (actual time=6.386..1729.893 rows=1064849 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
....
-> Index Scan using active_connection_2022_03_02t00_conn_id on active_connection_2022_03_02t00 c_5 (cost=0.42..54039.14 rows=1027336 width=14) (actual time=0.062..2142.939 rows=1027336 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
Is it possible to somehow get both the quick sort and the quick seq scan?
Right now I am grouping data by the minute:
SELECT
date_trunc('minute', ts) ts,
...
FROM binance_trades
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
but I would like to group by 5 seconds.
I found this question: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
with the answer:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
I am using v13.3, so date_bin() doesn't exist in my version of Postgres.
I don't quite understand the other answers in that question either.
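For what it's worth, date_bin() can be emulated on pre-14 versions with plain interval arithmetic against a fixed origin; a minimal sketch, assuming ts is a plain timestamp column:
SELECT timestamp '2000-01-01'  -- arbitrary origin, mirroring date_bin()'s third argument
       + floor(extract(epoch FROM (ts - timestamp '2000-01-01')) / 5) * interval '5 seconds' AS ts_5s,
       sum(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT'
  AND ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1
ORDER BY 1;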
Output of EXPLAIN:
Limit (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.072..76239.529 rows=13 loops=1)
-> GroupAggregate (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.071..76239.526 rows=13 loops=1)
Group Key: j.win_start
-> Sort (cost=6379566.36..6380086.58 rows=208088 width=28) (actual time=76238.000..76238.427 rows=5335 loops=1)
Sort Key: j.win_start
Sort Method: quicksort Memory: 609kB
-> Nested Loop (cost=1000.00..6356204.58 rows=208088 width=28) (actual time=23971.722..76237.055 rows=5335 loops=1)
Join Filter: (j.ts_win #> b.ts)
Rows Removed by Join Filter: 208736185
-> Seq Scan on binance_trades b (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1)
Filter: ((instrument)::text = 'ETHUSDT'::text)
Rows Removed by Filter: 126872903
-> Materialize (cost=1000.00..208323.11 rows=13 width=30) (actual time=0.000..0.001 rows=13 loops=16057040)
-> Gather (cost=1000.00..208323.05 rows=13 width=30) (actual time=3459.850..3461.076 rows=13 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on scalper_5s_intervals j (cost=0.00..207321.75 rows=5 width=30) (actual time=2336.484..3458.397 rows=4 loops=3)
Filter: ((win_start >= '2021-08-20 00:00:00'::timestamp without time zone) AND (win_start <= '2021-08-20 00:01:00'::timestamp without time zone))
Rows Removed by Filter: 5080316
Planning Time: 0.169 ms
Execution Time: 76239.667 ms
If you're only interested in a five-second distribution rather than the exact sum for a given time window (when the events really happened), you can round the timestamp down to five seconds using date_trunc() and mod() and then group by it.
SELECT
date_trunc('second',ts)-
MOD(EXTRACT(SECOND FROM date_trunc('second',ts))::int,5)*interval'1 sec',
SUM(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT' AND
ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1
Here I assume that ts and instrument are properly indexed.
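Such an index might look like this (just a sketch; the column order is an assumption based on the equality filter on instrument and the range filter on ts):
CREATE INDEX ON binance_trades (instrument, ts);  -- assumed column/table names from the question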
However, if this is a sensitive analysis regarding time accuracy and you cannot afford rounding the timestamps, try to create the time window within a CTE (or subquery), then in the outer query create a tsrange and join binance_trades.ts with it using the containment operator #>.
WITH j AS (
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i))
SELECT
j.win_start ts,
SUM(price)
FROM j
JOIN binance_trades b ON ts_win #> b.ts
GROUP BY j.win_start ORDER BY j.win_start
LIMIT 5;
There is a caveat, though: this approach gets pretty slow if you're creating a 5-second series over a large time window, because you have to join these newly created records against binance_trades without an index. To overcome this you can create a temporary table and index it:
CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i);
CREATE INDEX idx_ts_5sec ON scalper_5s_intervals USING gist (ts_win);
CREATE INDEX idx_ts_5sec_winstart ON scalper_5s_intervals USING btree(win_start);
UNLOGGED tables are much faster than regular ones, but keep in mind that they're not crash safe. See documentation (emphasis mine):
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
After that, your query will become much faster than the CTE approach, but it will still be significantly slower than the first query with the rounded timestamps.
SELECT
j.win_start ts,
SUM(price)
FROM scalper_5s_intervals j
JOIN binance_trades b ON ts_win #> b.ts
WHERE j.win_start BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY j.win_start ORDER BY j.win_start
Demo: db<>fiddle
I want to select records whose geohash (a string) begins with b within a certain period, and order the results by num_pics, but the query is slow.
Table:
create table test (
geohash varchar(20),
num_pics integer,
dt date,
body varchar(1000)
)
Dummy data (run 5 times to insert 10m records)
insert into test
select g, v, d, b from (
select generate_series(1, 2000000) as id,
left(md5(random()::text),9) as g,
floor(random() * 100000 + 1)::int as v,
timestamp '2014-01-10 20:00:00' + random() * (timestamp '2020-01-20 20:00:00' - timestamp '2014-01-10 10:00:00') as d,
md5(random()::text) as b) a
Plus 1m rows with a geohash starting with b:
insert into test
select g, v, d, b from (
select generate_series(1, 1000000) as id,
'b' || left(md5(random()::text),9) as g,
floor(random() * 100000 + 1)::int as v,
timestamp '2014-01-10 20:00:00' + random() * (timestamp '2020-01-20 20:00:00' - timestamp '2014-01-10 10:00:00') as d,
md5(random()::text) as b) a
Index
create index idx on test(geohash, dt desc , num_pics desc)
My query
explain analyze
select *
from test
where geohash like 'b%'
and dt between timestamp '2014-02-21 00:00:00'
and timestamp '2014-02-22 00:00:00'
order by num_pics desc limit 1000
Query Plan (https://explain.depesz.com/s/XNZ)
Limit (cost=75956.07..75958.10 rows=814 width=51) (actual time=1743.841..1744.141 rows=1000 loops=1)
  -> Sort (cost=75956.07..75958.10 rows=814 width=51) (actual time=1743.839..1744.019 rows=1000 loops=1)
        Sort Key: num_pics DESC
        Sort Method: quicksort Memory: 254kB
        -> Index Scan using idx on test (cost=0.56..75916.71 rows=814 width=51) (actual time=2.943..1741.071 rows=1464 loops=1)
              Index Cond: (((geohash)::text >= 'b'::text) AND ((geohash)::text < 'c'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
              Filter: ((geohash)::text ~~ 'b%'::text)
Planning time: 279.249 ms
Execution time: 1744.194 ms
Question:
It seems to be hitting the index, but performance is still slow. Is the problem the Filter: 'b%' step? If the optimizer already translated it into geohash >= 'b' and geohash < 'c', why does it have to filter again?
Also, is this a correct way to use a multi-column B-tree index? I read that it is best to use an equality (=) operator on the first indexed column, rather than a range operator as in this case.
This is just a guess, since I haven't tested it: the query "access" is being done by the wrong column.
Rule of thumb:
Access by the most selective column.
Filter using the less selective column.
In this case geohash is not very selective, since the pattern only has one letter. If it had more than that -- say 3 or more letters -- it would be more selective. The selectivity here is one letter out of 26 (maybe only 16?): 1 / 26 ≈ 3.8%. Rather bad.
dt seems more selective in this case, since it covers a single day (out of roughly 2000 days). The selectivity is 1 / 2000 = 0.05%. Much better.
Try the following index, to see if you get faster execution time:
create index idx2 on test(dt, geohash, num_pics);
If you always want to filter on the first character, it is better to create an index for exactly that rather than a general one:
create index on test using btree(substr(geohash,1,1));
create index on test using btree(dt desc); analyze test;
explain analyze
select
from test
where substr(geohash,1,1) ='b'
and dt between timestamp '2014-02-21 00:00:00' and timestamp '2014-02-22 00:00:00'
order by num_pics desc limit 1000
Execution plan
Limit (cost=15057.49..15059.14 rows=660 width=4) (actual time=29.433..29.644 rows=1000 loops=1)
-> Sort (cost=15057.49..15059.14 rows=660 width=4) (actual time=29.431..29.564 rows=1000 loops=1)
Sort Key: num_pics DESC
Sort Method: quicksort Memory: 117kB
-> Bitmap Heap Scan on test (cost=96.93..15026.58 rows=660 width=4) (actual time=10.782..28.708 rows=1469 loops=1)
Recheck Cond: ((dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Filter: (substr((geohash)::text, 1, 1) = 'b'::text)
Rows Removed by Filter: 8470
Heap Blocks: exact=9481
-> Bitmap Index Scan on test_dt_idx (cost=0.00..96.77 rows=4433 width=0) (actual time=5.541..5.541 rows=9939 loops=1)
Index Cond: ((dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Planning time: 0.325 ms
Execution time: 30.065 ms
With the composite index, it is even better
create index on test using btree(substr(geohash,1,1), dt desc);
Limit (cost=2546.25..2547.90 rows=660 width=51) (actual time=6.188..6.679 rows=1000 loops=1)
-> Sort (cost=2546.25..2547.90 rows=660 width=51) (actual time=6.186..6.528 rows=1000 loops=1)
Sort Key: num_pics DESC
Sort Method: quicksort Memory: 255kB
-> Bitmap Heap Scan on test (cost=16.85..2515.34 rows=660 width=51) (actual time=1.896..4.740 rows=1469 loops=1)
Recheck Cond: ((substr((geohash)::text, 1, 1) = 'b'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Heap Blocks: exact=1430
-> Bitmap Index Scan on test_substr_dt_idx (cost=0.00..16.68 rows=660 width=0) (actual time=1.266..1.266 rows=1469 loops=1)
Index Cond: ((substr((geohash)::text, 1, 1) = 'b'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Planning time: 0.389 ms
Execution time: 7.052 ms
I want to build a daily time series from a certain date and calculate a few statistics for each day. However, this query is very slow... Any way to speed it up? (For example, select from the table once in a subquery and compute the various stats on it for each day.)
In code this would look like
for i, day in enumerate(series):
    previous_days = series[:i + 1]   # all days up to and including this one
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)
Here is an example for a time series looking for users with <= 5 messages up to that date:
with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09',
now(), '1 day'::interval) date
),
msgs as
(
select days.date,
(select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
(select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)
select * from msgs;
Query plan:
CTE Scan on msgs (cost=815579.03..815583.03 rows=200 width=24)
Output: msgs.date, msgs.lt_5, msgs.eq_1
CTE days
-> Function Scan on pg_catalog.generate_series date (cost=0.01..10.01 rows=1000 width=8)
Output: date.date
Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
CTE msgs
-> Group (cost=6192.62..815569.02 rows=200 width=8)
Output: days.date, (SubPlan 2), (SubPlan 3)
Group Key: days.date
-> Merge Join (cost=6192.62..11239.60 rows=287970 width=8)
Output: days.date
Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
-> Sort (cost=69.83..72.33 rows=1000 width=8)
Output: days.date
Sort Key: days.date
-> CTE Scan on days (cost=0.00..20.00 rows=1000 width=8)
Output: days.date
-> Sort (cost=6122.79..6266.78 rows=57594 width=8)
Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
-> Seq Scan on public.daily_messages daily_messages_2 (cost=0.00..1568.94 rows=57594 width=8)
Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
SubPlan 2
-> Aggregate (cost=2016.89..2016.90 rows=1 width=32)
Output: count(daily_messages.customer_id)
-> Seq Scan on public.daily_messages (cost=0.00..2000.89 rows=6399 width=32)
Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
SubPlan 3
-> Aggregate (cost=2001.13..2001.14 rows=1 width=32)
Output: count(daily_messages_1.customer_id)
-> Seq Scan on public.daily_messages daily_messages_1 (cost=0.00..2000.89 rows=96 width=32)
Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))
In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:
SELECT created_at::date
, sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
, sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09' -- sargable!
AND created_at < now() -- probably redundant
GROUP BY 1
WINDOW w AS (ORDER BY created_at::date);
All those correlated subqueries are probably not needed. I replaced them with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:
Postgres window function and group by exception
Conditional lead/lag function PostgreSQL?
The CTEs don't help either (unnecessary overhead). You would only need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of the implications, though: your query depends on the time zone setting of the session. Details:
Ignoring timezones altogether in Rails and PostgreSQL
On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway, so there is no need for generate_series() at all. It would make sense with a LEFT JOIN. Not enough information in the question.
Related answer explaining "sargable":
Return Table Type from A function in PostgreSQL
You might replace count(customer_id) with count(*). Not enough information in the question.
Might be optimized further, but there is not enough information to be more specific in the answer.
Include days without new entries in result
SELECT day
, sum(lt_5_day) OVER w AS lt_5
, sum(eq_1_day) OVER w AS eq_1
FROM (
SELECT day::date
FROM generate_series(date '2015-07-09', current_date, interval '1 day') day
) d
LEFT JOIN (
SELECT created_at::date AS day
, count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
, count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
FROM daily_messages m
WHERE created_at >= timestamptz '2015-07-09'
GROUP BY 1
) m USING (day)
WINDOW w AS (ORDER BY day);
Aggregate daily sums in subquery m.
Generate series of all days in time range in subquery d.
Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.
I have the following tables:
users (id, network_id)
networks (id)
private_messages (id, sender_id, receiver_id, created_at)
I have indexes on users.network_id and on all 3 columns in private_messages; however, the query skips the indexes and takes a very long time to run. Any ideas what in the query is causing the indexes to be skipped?
EXPLAIN ANALYZE SELECT COUNT(*)
FROM "networks"
WHERE (
networks.created_at BETWEEN ((timestamp '2013-01-01')) AND (( (timestamp '2013-01-31') + interval '-1 second'))
AND (SELECT COUNT(*) FROM private_messages INNER JOIN users ON private_messages.receiver_id = users.id WHERE users.network_id = networks.id AND (private_messages.created_at BETWEEN ((timestamp '2013-03-01')) AND (( (timestamp '2013-03-31') + interval '-1 second'))) ) > 0)
Result:
Aggregate (cost=722675247.10..722675247.11 rows=1 width=0) (actual time=519916.108..519916.108 rows=1 loops=1)
-> Seq Scan on networks (cost=0.00..722675245.34 rows=703 width=0) (actual time=2576.205..519916.044 rows=78 loops=1)
Filter: ((created_at >= '2013-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-01-30 23:59:59'::timestamp without time zone) AND ((SubPlan 1) > 0))
SubPlan 1
-> Aggregate (cost=50671.34..50671.35 rows=1 width=0) (actual time=240.359..240.359 rows=1 loops=2163)
-> Hash Join (cost=10333.69..50671.27 rows=28 width=0) (actual time=233.997..240.340 rows=13 loops=2163)
Hash Cond: (private_messages.receiver_id = users.id)
-> Bitmap Heap Scan on private_messages (cost=10127.11..48675.15 rows=477136 width=4) (actual time=56.599..232.855 rows=473686 loops=1809)
Recheck Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on index_private_messages_on_created_at (cost=0.00..10007.83 rows=477136 width=0) (actual time=54.551..54.551 rows=473686 loops=1809)
Index Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Hash (cost=205.87..205.87 rows=57 width=4) (actual time=0.218..0.218 rows=2 loops=2163)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> Index Scan using index_users_on_network_id on users (cost=0.00..205.87 rows=57 width=4) (actual time=0.154..0.215 rows=2 loops=2163)
Index Cond: (network_id = networks.id)
Total runtime: 519916.183 ms
Thank you.
Let's try something different. I am only suggesting this as an "answer" because of its length; you cannot format a comment. Let's approach the query modularly, as a series of subsets that need to be intersected, and see how long each of these takes to execute (please report back). Substitute your timestamps for t1 and t2. Note how each query builds upon the prior one, making the prior one an "inline view".
EDIT: also, please confirm the columns in the Networks table.
1
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
2
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
3
select N.* from networks N
join
(
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
) as BAR
on N.id = BAR.network_id
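If those pieces all perform well, the final count would presumably just add the networks date filter and de-duplicate; a sketch (t3 and t4 stand for the networks.created_at bounds, and this step is not part of the original three queries):
select count(*) from
(
select N.id from networks N
join
(
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
) as BAR
on N.id = BAR.network_id
where N.created_at between t3 and t4  -- assumed final filter
group by N.id
) as BAZ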
First, I think you want an index on networks.created_at, even though right now, with over 10% of the table matching the WHERE clause, it probably won't be used.
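That index would simply be (a sketch):
CREATE INDEX ON networks (created_at);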
Next, I expect you will get better speed if you put as much logic as possible into one query instead of splitting some of it into a subquery. I believe the plan indicates iterating over each matching value of networks.id; usually an all-at-once join works better.
I think the code below is logically equivalent. If not, it's close.
SELECT COUNT(*)
FROM
(SELECT users.network_id FROM "networks"
JOIN users
ON users.network_id = networks.id
JOIN private_messages
ON private_messages.receiver_id = users.id
AND (private_messages.created_at
BETWEEN ((timestamp '2013-03-01'))
AND (( (timestamp '2013-03-31') + interval '-1 second')))
WHERE
networks.created_at
BETWEEN ((timestamp '2013-01-01'))
AND (( (timestamp '2013-01-31') + interval '-1 second'))
GROUP BY users.network_id)
AS main_subquery
;
My experience is that you will get the same query plan if you move the networks.created_at condition into the ON clause of the users-networks join. I don't think your issue is the timestamps; it's the structure of the query. You may also get a better (or worse) plan by replacing the GROUP BY in the subquery with SELECT DISTINCT users.network_id.
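For illustration, that SELECT DISTINCT variant would look roughly like this (a sketch of the rewrite above, not tested):
SELECT COUNT(*)
FROM
  (SELECT DISTINCT users.network_id
   FROM "networks"
   JOIN users
     ON users.network_id = networks.id
   JOIN private_messages
     ON private_messages.receiver_id = users.id
    AND private_messages.created_at
        BETWEEN timestamp '2013-03-01'
            AND timestamp '2013-03-31' + interval '-1 second'
   WHERE networks.created_at
         BETWEEN timestamp '2013-01-01'
             AND timestamp '2013-01-31' + interval '-1 second'
  ) AS main_subquery;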