I have the following partitioned table
Column | Type | Modifiers | Storage | Stats target | Description
---------------+-----------------------------+------------------------+---------+--------------+-------------
time | timestamp without time zone | not null | plain | |
connection_id | integer | not null | plain | |
is_authorized | boolean | not null default false | plain | |
is_active | boolean | not null default true | plain | |
Indexes:
"active_connection_time_idx" btree ("time")
Child tables: metrics.active_connection_2022_02_26t00,
metrics.active_connection_2022_02_27t00,
metrics.active_connection_2022_02_28t00,
metrics.active_connection_2022_03_01t00,
metrics.active_connection_2022_03_02t00,
metrics.active_connection_2022_04_21t00
All partitions have an index on the time column.
I need to execute the following query:
SELECT c.connection_id, (array_agg(is_authorized order by time desc))[1], bool_or(is_active)
FROM metrics.active_connection c
WHERE c.time BETWEEN '2022-01-26 00:00:00' AND '2022-04-15 23:59:59'
GROUP BY c.connection_id;
And I get this plan (quick seq scans, but a slow external sort):
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1878772.55..1999873.62 rows=200 width=6) (actual time=11516.621..22951.961 rows=30631 loops=1)
Group Key: c.connection_id
-> Sort (cost=1878772.55..1909047.19 rows=12109857 width=14) (actual time=11388.096..15601.938 rows=12109856 loops=1)
Sort Key: c.connection_id
Sort Method: external merge Disk: 319520kB
-> Append (cost=0.00..247108.84 rows=12109857 width=14) (actual time=0.022..5346.587 rows=12109856 loops=1)
-> Seq Scan on active_connection c (cost=0.00..0.00 rows=1 width=14) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
-> Seq Scan on active_connection_2022_02_26t00 c_1 (cost=0.00..21728.74 rows=1064849 width=14) (actual time=0.017..307.754 rows=1064849 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
......
-> Seq Scan on active_connection_2022_03_02t00 c_5 (cost=0.00..20964.04 rows=1027336 width=14) (actual time=0.018..268.314 rows=1027336 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
If I add an index on the connection_id column, I get another plan (slow index scans, but a quick in-memory sort):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=2.23..1071044.89 rows=200 width=6) (actual time=203.337..49643.802 rows=30631 loops=1)
Group Key: c.connection_id
-> Merge Append (cost=2.23..980218.46 rows=12109857 width=14) (actual time=184.137..38926.435 rows=12109856 loops=1)
Sort Key: c.connection_id
-> Sort (cost=0.01..0.02 rows=1 width=14) (actual time=0.036..0.037 rows=0 loops=1)
Sort Key: c.connection_id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on active_connection c (cost=0.00..0.00 rows=1 width=14) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
-> Index Scan using active_connection_2022_02_26t00_conn_id on active_connection_2022_02_26t00 c_1 (cost=0.43..56013.08 rows=1064849 width=14) (actual time=6.386..1729.893 rows=1064849 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
....
-> Index Scan using active_connection_2022_03_02t00_conn_id on active_connection_2022_03_02t00 c_5 (cost=0.42..54039.14 rows=1027336 width=14) (actual time=0.062..2142.939 rows=1027336 loops=1)
Filter: (("time" >= '2022-01-26 00:00:00'::timestamp without time zone) AND ("time" <= '2022-04-15 23:59:59'::timestamp without time zone))
Is it possible to somehow get both the quick seq scans and the quick sorting?
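One lever worth trying first, suggested only by the Sort Method: external merge Disk: 319520kB line in the first plan rather than by anything else in the post, is giving the session enough work_mem for the sort to stay in memory. A minimal sketch; the value and the column aliases are illustrative additions:
-- Illustrative only: the in-memory footprint of a sort is usually a few times the
-- on-disk spill size, so the exact work_mem needed has to be found by experiment.
SET work_mem = '1GB';

SELECT c.connection_id,
       (array_agg(is_authorized ORDER BY time DESC))[1] AS last_is_authorized,  -- alias added for readability
       bool_or(is_active)                               AS any_active           -- alias added for readability
FROM metrics.active_connection c
WHERE c.time BETWEEN '2022-01-26 00:00:00' AND '2022-04-15 23:59:59'
GROUP BY c.connection_id;

RESET work_mem;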
I have the following table:
create table account_values
(
account_id bigint not null,
timestamp timestamp not null,
value1 numeric not null,
value2 numeric not null,
primary key (timestamp, account_id)
);
I also have the following query which, for each point of an evenly spaced generated series, produces an array containing the value1 + value2 of the row with the closest timestamp at or before that point:
select array [(trunc(extract(epoch from gs) * 1000))::text, COALESCE((equity.value1 + equity.value2), 0.000000)::text]
from generate_series((now() - '1 year'::interval)::timestamp, now(), interval '1 day') gs
left join lateral (select value1, value2
from account_values
where timestamp <= gs and account_id = ?
order by timestamp desc
limit 1) equity on (TRUE);
The issue with this method of generating such an array becomes apparent when inspecting the output of explain analyse:
Nested Loop Left Join (cost=0.45..3410.74 rows=1000 width=32) (actual time=0.134..3948.546 rows=366 loops=1)
-> Function Scan on generate_series gs (cost=0.02..10.02 rows=1000 width=8) (actual time=0.075..0.244 rows=366 loops=1)
-> Limit (cost=0.43..3.36 rows=1 width=26) (actual time=10.783..10.783 rows=1 loops=366)
-> Index Scan Backward using account_values_pkey on account_values (cost=0.43..67730.27 rows=23130 width=26) (actual time=10.782..10.782 rows=1 loops=366)
" Index Cond: ((""timestamp"" <= gs.gs) AND (account_id = 459))"
Planning Time: 0.136 ms
Execution Time: 3948.659 ms
Specifically: loops=366
This problem will only get worse if I ever decide to decrease my generated series interval time.
Is there a way to flatten this looped select into a more efficient query?
If not, what are some other approaches I can take to improving the performance?
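One observation worth checking before restructuring anything (it comes from reading the plan above, not from the original post): the primary key leads with timestamp, so each of the 366 lateral lookups has to walk the index backwards past other accounts' rows. An index that leads with account_id should let every lookup finish after a handful of page reads. A sketch, with a hypothetical index name:
-- Hypothetical index name. Leading with account_id matches the lateral lookup's
-- shape: account_id = ? AND timestamp <= gs ORDER BY timestamp DESC LIMIT 1.
create index if not exists account_values_account_id_timestamp_idx
    on account_values (account_id, "timestamp" desc);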
Edit: One hard requirement is that the result of the statement cannot be altered. For example, I don't want the range to be rounded to the closest day; the range should always start at the second the statement is invoked, with each interval exactly one day before the previous one.
Based on Edouard's answer:
with a(_timestamp, values_agg) as (
  select _timestamp,
         array_agg(lpad((value1 + value2)::text, 6, '0')) as values_agg
  from account_values
  where account_id = 1
    and _timestamp <@ tsrange(now()::timestamp - '1 year'::interval, now()::timestamp)
  group by 1
)
select jsonb_agg(jsonb_build_object(
         '_timestamp', trunc(extract(epoch from _timestamp) * 1000)::text,
         'values', values_agg
       )) as item
from a;
Not sure you will get exactly the same result, but it should be faster:
select array[ trunc(extract(epoch from date_trunc('day', timestamp)) * 1000)::text,
              (array_agg(value1 + value2 order by timestamp desc))[1]::text ]
from account_values
where account_id = ?
  and timestamp <@ tsrange(now()::timestamp - '1 year'::interval, now()::timestamp)
group by date_trunc('day', timestamp)
Right now I am grouping data by the minute:
SELECT
date_trunc('minute', ts) ts,
...
FROM binance_trades
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
but I would like to group by 5 seconds.
I found this question: Postgresql SQL GROUP BY time interval with arbitrary accuracy (down to milli seconds)
with the answer:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
I am using v13.3, so date_bin doesn't exist on my version of Postgres.
I don't quite understand the other answers to that question either.
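For what it's worth, those answers mostly boil down to plain epoch arithmetic, which works fine on v13; a minimal sketch, using the ts, instrument, and price columns that appear elsewhere in this question:
-- Bucket ts into 5-second steps by flooring its epoch value; no date_bin() needed.
-- Note: to_timestamp() returns timestamptz, so cast back to timestamp if required.
SELECT to_timestamp(floor(extract(epoch FROM ts) / 5) * 5) AS ts_5s,
       instrument,
       sum(price) AS total_price
FROM binance_trades
GROUP BY 1, 2
ORDER BY 1;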
Output of EXPLAIN:
Limit (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.072..76239.529 rows=13 loops=1)
-> GroupAggregate (cost=6379566.36..6388410.30 rows=13 width=48) (actual time=76238.071..76239.526 rows=13 loops=1)
Group Key: j.win_start
-> Sort (cost=6379566.36..6380086.58 rows=208088 width=28) (actual time=76238.000..76238.427 rows=5335 loops=1)
Sort Key: j.win_start
Sort Method: quicksort Memory: 609kB
-> Nested Loop (cost=1000.00..6356204.58 rows=208088 width=28) (actual time=23971.722..76237.055 rows=5335 loops=1)
Join Filter: (j.ts_win @> b.ts)
Rows Removed by Join Filter: 208736185
-> Seq Scan on binance_trades b (cost=0.00..3026558.81 rows=16006783 width=28) (actual time=0.033..30328.674 rows=16057040 loops=1)
Filter: ((instrument)::text = 'ETHUSDT'::text)
Rows Removed by Filter: 126872903
-> Materialize (cost=1000.00..208323.11 rows=13 width=30) (actual time=0.000..0.001 rows=13 loops=16057040)
-> Gather (cost=1000.00..208323.05 rows=13 width=30) (actual time=3459.850..3461.076 rows=13 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on scalper_5s_intervals j (cost=0.00..207321.75 rows=5 width=30) (actual time=2336.484..3458.397 rows=4 loops=3)
Filter: ((win_start >= '2021-08-20 00:00:00'::timestamp without time zone) AND (win_start <= '2021-08-20 00:01:00'::timestamp without time zone))
Rows Removed by Filter: 5080316
Planning Time: 0.169 ms
Execution Time: 76239.667 ms
If you're only interested in a five-second distribution rather than the exact sum of a given time window (when the events really happened), you can round the timestamps down to five-second steps using date_trunc() and mod(), and then group by that.
SELECT
date_trunc('second',ts)-
MOD(EXTRACT(SECOND FROM date_trunc('second',ts))::int,5)*interval'1 sec',
SUM(price)
FROM binance_trades
WHERE instrument = 'ETHUSDT' AND
ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY 1
Here I assume that ts and instrument are properly indexed.
However, if this analysis is sensitive to time accuracy and you cannot afford to round the timestamps, try creating the time windows as tsranges in a CTE (or subquery), and then joining binance_trades.ts against them in the outer query using the containment operator @>.
WITH j AS (
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i))
SELECT
j.win_start ts,
SUM(price)
FROM j
JOIN binance_trades b ON ts_win @> b.ts
GROUP BY j.win_start ORDER BY j.win_start
LIMIT 5;
There is a caveat though: this approach will get pretty slow if you're creating a 5-second series over a large time window, because you have to join these newly created rows against binance_trades without an index. To overcome this you can materialize the intervals in their own table and index them:
CREATE UNLOGGED TABLE scalper_5s_intervals AS
SELECT i AS win_start,tsrange(i,i+interval'5sec') AS ts_win
FROM generate_series(
(SELECT min(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),
(SELECT max(date_trunc('second',ts)) FROM binance_trades
WHERE ts BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'),interval'5 sec') j (i);
CREATE INDEX idx_ts_5sec ON scalper_5s_intervals USING gist (ts_win);
CREATE INDEX idx_ts_5sec_winstart ON scalper_5s_intervals USING btree(win_start);
UNLOGGED tables are much faster than regular ones, but keep in mind that they're not crash safe. See documentation (emphasis mine):
If specified, the table is created as an unlogged table. Data written to unlogged tables is not written to the write-ahead log (see Chapter 29), which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers. Any indexes created on an unlogged table are automatically unlogged as well.
After that your query will become much faster than the CTE approach, but still significantly slower than the first query with the rounded timestamps.
SELECT
j.win_start ts,
SUM(price)
FROM scalper_5s_intervals j
JOIN binance_trades b ON ts_win @> b.ts
WHERE j.win_start BETWEEN '2021-08-19 22:50:00' AND '2021-08-20 01:00:00'
GROUP BY j.win_start ORDER BY j.win_start
Demo: db<>fiddle
I want to select records whose geohash (a string) begins with 'b' within a certain period, and order the results by num_pics, but it is slow.
Table:
create table test (
geohash varchar(20),
num_pics integer,
dt date,
body varchar(1000)
)
Dummy data (run 5 times to insert 10M rows):
insert into test
select g, v, d, b from (
select generate_series(1, 2000000) as id,
left(md5(random()::text),9) as g,
floor(random() * 100000 + 1)::int as v,
timestamp '2014-01-10 20:00:00' + random() * (timestamp '2020-01-20 20:00:00' - timestamp '2014-01-10 10:00:00') as d,
md5(random()::text) as b) a
Plus 1M rows with geohash starting with 'b':
insert into test
select g, v, d, b from (
select generate_series(1, 1000000) as id,
'b' || left(md5(random()::text),9) as g,
floor(random() * 100000 + 1)::int as v,
timestamp '2014-01-10 20:00:00' + random() * (timestamp '2020-01-20 20:00:00' - timestamp '2014-01-10 10:00:00') as d,
md5(random()::text) as b) a
Index
create index idx on test(geohash, dt desc , num_pics desc)
My query
explain analyze
select *
from test
where geohash like 'b%'
and dt between timestamp '2014-02-21 00:00:00'
and timestamp '2014-02-22 00:00:00'
order by num_pics desc limit 1000
Query Plan (https://explain.depesz.com/s/XNZ)
Limit (cost=75956.07..75958.10 rows=814 width=51) (actual time=1743.841..1744.141 rows=1000 loops=1)
  -> Sort (cost=75956.07..75958.10 rows=814 width=51) (actual time=1743.839..1744.019 rows=1000 loops=1)
     Sort Key: num_pics DESC
     Sort Method: quicksort Memory: 254kB
     -> Index Scan using idx on test (cost=0.56..75916.71 rows=814 width=51) (actual time=2.943..1741.071 rows=1464 loops=1)
        Index Cond: (((geohash)::text >= 'b'::text) AND ((geohash)::text < 'c'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
        Filter: ((geohash)::text ~~ 'b%'::text)
Planning time: 279.249 ms
Execution time: 1744.194 ms
Question:
It seems to be hitting the index, but performance is still slow. Is the problem the Filter: 'b%' step? If the optimizer already translated it into geohash >= 'b' and geohash < 'c', why does it have to filter again?
Also, is this a correct way to use a multi-column B-tree index? I read that it is best to use an equality (=) operator on the first indexed column, rather than a range operator as in this case.
This is just a guess since I haven't tested it. The query "access" is being done by the wrong column.
Rule of thumb:
Access by the most selective column.
Filter using the less selective column.
In this case geohash is not very selective since the pattern only has one letter. If it had more than that -- say 3 or more letters -- then it would be more selective. The selectivity is: one letter out of 26 (maybe only 16?) is 1 / 26 = 3.84%. Rather bad.
It seems that dt is more selective in this case, since it covers a single day (out of 2000 days?). The selectivity is: 1 / 2000 = 0.05%. Much better.
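If you want to sanity-check those guesses against the actual data, a quick and purely illustrative way to measure the fraction of rows each predicate keeps:
-- Fraction of rows matched by each predicate (1.0 = everything, 0.0 = nothing).
select avg((geohash like 'b%')::int)                                  as geohash_fraction,
       avg((dt between date '2014-02-21' and date '2014-02-22')::int) as dt_fraction
from test;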
Try the following index, to see if you get faster execution time:
create index idx2 on test(dt, geohash, num_pics);
If you always want to query on the first character, it is better to create an index specifically for that rather than a general one:
create index on test using btree (substr(geohash, 1, 1));
create index on test using btree (dt desc);
analyze test;
explain analyze
select
from test
where substr(geohash,1,1) ='b'
and dt between timestamp '2014-02-21 00:00:00' and timestamp '2014-02-22 00:00:00'
order by num_pics desc limit 1000
Execution plan
Limit (cost=15057.49..15059.14 rows=660 width=4) (actual time=29.433..29.644 rows=1000 loops=1)
-> Sort (cost=15057.49..15059.14 rows=660 width=4) (actual time=29.431..29.564 rows=1000 loops=1)
Sort Key: num_pics DESC
Sort Method: quicksort Memory: 117kB
-> Bitmap Heap Scan on test (cost=96.93..15026.58 rows=660 width=4) (actual time=10.782..28.708 rows=1469 loops=1)
Recheck Cond: ((dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Filter: (substr((geohash)::text, 1, 1) = 'b'::text)
Rows Removed by Filter: 8470
Heap Blocks: exact=9481
-> Bitmap Index Scan on test_dt_idx (cost=0.00..96.77 rows=4433 width=0) (actual time=5.541..5.541 rows=9939 loops=1)
Index Cond: ((dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Planning time: 0.325 ms
Execution time: 30.065 ms
With the composite index, it is even better
create index on test using btree(substr(geohash,1,1), dt desc);
Limit (cost=2546.25..2547.90 rows=660 width=51) (actual time=6.188..6.679 rows=1000 loops=1)
-> Sort (cost=2546.25..2547.90 rows=660 width=51) (actual time=6.186..6.528 rows=1000 loops=1)
Sort Key: num_pics DESC
Sort Method: quicksort Memory: 255kB
-> Bitmap Heap Scan on test (cost=16.85..2515.34 rows=660 width=51) (actual time=1.896..4.740 rows=1469 loops=1)
Recheck Cond: ((substr((geohash)::text, 1, 1) = 'b'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Heap Blocks: exact=1430
-> Bitmap Index Scan on test_substr_dt_idx (cost=0.00..16.68 rows=660 width=0) (actual time=1.266..1.266 rows=1469 loops=1)
Index Cond: ((substr((geohash)::text, 1, 1) = 'b'::text) AND (dt >= '2014-02-21 00:00:00'::timestamp without time zone) AND (dt <= '2014-02-22 00:00:00'::timestamp without time zone))
Planning time: 0.389 ms
Execution time: 7.052 ms
I have a task: get the first, last, max, and min values from each time-based group of data. My solution works, but it is extremely slow because the row count in the table is about 50 million.
How can I improve the performance of this query?
SELECT
date_trunc('minute', t_ordered."timestamp"),
MIN (t_ordered.price),
MAX (t_ordered.price),
FIRST (t_ordered.price),
LAST (t_ordered.price)
FROM(
SELECT t.price, t."timestamp"
FROM trade t
WHERE t."timestamp" >= '2016-01-01' AND t."timestamp" < '2016-09-01'
ORDER BY t."timestamp" ASC
) t_ordered
GROUP BY 1
ORDER BY 1
FIRST and LAST are aggregate functions from the PostgreSQL wiki.
The timestamp column is indexed.
explain (analyze, verbose):
GroupAggregate (cost=13112830.84..33300949.59 rows=351556 width=14) (actual time=229538.092..468212.450 rows=351138 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), min(t_ordered.price), max(t_ordered.price), first(t_ordered.price), last(t_ordered.price)
Group Key: (date_trunc('minute'::text, t_ordered."timestamp"))
-> Sort (cost=13112830.84..13211770.66 rows=39575930 width=14) (actual time=229515.281..242472.677 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, t_ordered."timestamp")), t_ordered.price
Sort Key: (date_trunc('minute'::text, t_ordered."timestamp"))
Sort Method: external sort Disk: 932656kB
-> Subquery Scan on t_ordered (cost=6848734.55..7442373.50 rows=39575930 width=14) (actual time=102166.368..155540.492 rows=38721704 loops=1)
Output: date_trunc('minute'::text, t_ordered."timestamp"), t_ordered.price
-> Sort (cost=6848734.55..6947674.38 rows=39575930 width=14) (actual time=102165.836..130971.804 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Sort Key: t."timestamp"
Sort Method: external merge Disk: 993480kB
-> Seq Scan on public.trade t (cost=0.00..1178277.21 rows=39575930 width=14) (actual time=0.055..25726.038 rows=38721704 loops=1)
Output: t.price, t."timestamp"
Filter: ((t."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (t."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9666450
Planning time: 1.663 ms
Execution time: 468949.753 ms
Maybe it can be done with window functions? I have tried, but I do not have enough knowledge to use them.
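For reference, a window-function version could look roughly like the sketch below (table and column names taken from the query above); note that it still needs one big sort of the range, so it is not obviously faster than an aggregate approach:
-- One row per minute via DISTINCT; min/max over the whole minute, first/last by timestamp.
SELECT DISTINCT
       date_trunc('minute', "timestamp")  AS minute,
       min(price) OVER w                  AS price_min,
       max(price) OVER w                  AS price_max,
       first_value(price) OVER w_ord      AS price_first,
       last_value(price)  OVER (w_ord ROWS BETWEEN UNBOUNDED PRECEDING
                                               AND UNBOUNDED FOLLOWING) AS price_last
FROM trade
WHERE "timestamp" >= '2016-01-01' AND "timestamp" < '2016-09-01'
WINDOW w     AS (PARTITION BY date_trunc('minute', "timestamp")),
       w_ord AS (PARTITION BY date_trunc('minute', "timestamp") ORDER BY "timestamp")
ORDER BY 1;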
Creating a type and adequate aggregates will hopefully work better:
create type tp as (timestamp timestamp, price int);
create or replace function min_tp (tp, tp)
returns tp as $$
select least($1, $2);
$$ language sql immutable;
create aggregate min (tp) (
sfunc = min_tp,
stype = tp
);
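The matching max aggregate, which the answer mentions but does not list, would presumably follow the same pattern with greatest():
create or replace function max_tp (tp, tp)
returns tp as $$
  select greatest($1, $2);
$$ language sql immutable;

create aggregate max (tp) (
  sfunc = max_tp,
  stype = tp
);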
These min and max aggregates reduce the query to a single pass over the data:
select
date_trunc('minute', timestamp) as minute,
min (price) as price_min,
max (price) as price_max,
(min ((timestamp, price)::tp)).price as first,
(max ((timestamp, price)::tp)).price as last
from trade
where timestamp >= '2016-01-01' and timestamp < '2016-09-01'
group by 1
order by 1
explain (analyze, verbose):
GroupAggregate (cost=6954022.61..27159050.82 rows=287533 width=14) (actual time=129286.817..510119.582 rows=351138 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), min(price), max(price), (min(ROW("timestamp", price)::tp)).price, (max(ROW("timestamp", price)::tp)).price
Group Key: (date_trunc('minute'::text, trade."timestamp"))
-> Sort (cost=6954022.61..7053049.25 rows=39610655 width=14) (actual time=129232.165..156277.718 rows=38721704 loops=1)
Output: (date_trunc('minute'::text, "timestamp")), price, "timestamp"
Sort Key: (date_trunc('minute'::text, trade."timestamp"))
Sort Method: external merge Disk: 1296392kB
-> Seq Scan on public.trade (cost=0.00..1278337.71 rows=39610655 width=14) (actual time=0.035..45335.947 rows=38721704 loops=1)
Output: date_trunc('minute'::text, "timestamp"), price, "timestamp"
Filter: ((trade."timestamp" >= '2016-01-01 00:00:00'::timestamp without time zone) AND (trade."timestamp" < '2016-09-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 9708857
Planning time: 0.286 ms
Execution time: 510648.395 ms
I have the following tables:
users (id, network_id)
networks (id)
private_messages (id, sender_id, receiver_id, created_at)
I have indexes on users.network_id and on all three columns in private_messages, however the query is skipping the indexes and taking a very long time to run. Any idea what in the query is causing the indexes to be skipped?
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM "networks"
WHERE networks.created_at BETWEEN (timestamp '2013-01-01')
                              AND (timestamp '2013-01-31' + interval '-1 second')
  AND (SELECT COUNT(*)
       FROM private_messages
       INNER JOIN users ON private_messages.receiver_id = users.id
       WHERE users.network_id = networks.id
         AND private_messages.created_at BETWEEN (timestamp '2013-03-01')
                                             AND (timestamp '2013-03-31' + interval '-1 second')
      ) > 0
Result:
Aggregate (cost=722675247.10..722675247.11 rows=1 width=0) (actual time=519916.108..519916.108 rows=1 loops=1)
-> Seq Scan on networks (cost=0.00..722675245.34 rows=703 width=0) (actual time=2576.205..519916.044 rows=78 loops=1)
Filter: ((created_at >= '2013-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-01-30 23:59:59'::timestamp without time zone) AND ((SubPlan 1) > 0))
SubPlan 1
-> Aggregate (cost=50671.34..50671.35 rows=1 width=0) (actual time=240.359..240.359 rows=1 loops=2163)
-> Hash Join (cost=10333.69..50671.27 rows=28 width=0) (actual time=233.997..240.340 rows=13 loops=2163)
Hash Cond: (private_messages.receiver_id = users.id)
-> Bitmap Heap Scan on private_messages (cost=10127.11..48675.15 rows=477136 width=4) (actual time=56.599..232.855 rows=473686 loops=1809)
Recheck Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Bitmap Index Scan on index_private_messages_on_created_at (cost=0.00..10007.83 rows=477136 width=0) (actual time=54.551..54.551 rows=473686 loops=1809)
Index Cond: ((created_at >= '2013-03-01 00:00:00'::timestamp without time zone) AND (created_at <= '2013-03-30 23:59:59'::timestamp without time zone))
-> Hash (cost=205.87..205.87 rows=57 width=4) (actual time=0.218..0.218 rows=2 loops=2163)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> Index Scan using index_users_on_network_id on users (cost=0.00..205.87 rows=57 width=4) (actual time=0.154..0.215 rows=2 loops=2163)
Index Cond: (network_id = networks.id)
Total runtime: 519916.183 ms
Thank you.
Let's try something different. I am only suggesting this as an "answer" because of its length and because you cannot format a comment. Let's approach the query modularly, as a series of subsets that need to be intersected, and see how long each of these takes to execute (please report back). Substitute your timestamps for t1 and t2. Note how each query builds upon the prior one, making the prior one an "inline view".
EDIT: also, please confirm the columns in the Networks table.
Query 1:
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
Query 2:
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
Query 3:
select N.* from networks N
join
(
select U.id, U.network_id from users U
join
(
select PM.receiver_id from private_messages PM
where PM.created_at between t1 and t2
) as FOO
on U.id = FOO.receiver_id
) as BAR
on N.id = BAR.network_id
First, I think you want an index on networks.created_at, even though right now, with over 10% of the table matching the WHERE, it probably won't be used.
Next, I expect you will get better speed if you get as much logic as possible into one query instead of splitting some into a subquery. The plan indicates iterating over each matching value of networks.id; usually an all-at-once join works better.
I think the code below is logically equivalent. If not, close.
SELECT COUNT(*)
FROM
(SELECT users.network_id FROM "networks"
JOIN users
ON users.network_id = networks.id
JOIN private_messages
ON private_messages.receiver_id = users.id
AND (private_messages.created_at
BETWEEN ((timestamp '2013-03-01'))
AND (( (timestamp '2013-03-31') + interval '-1 second')))
WHERE
networks.created_at
BETWEEN ((timestamp '2013-01-01'))
AND (( (timestamp '2013-01-31') + interval '-1 second'))
GROUP BY users.network_id)
AS main_subquery
;
My experience is that you will get the same query plan if you move the networks.created_at into the ON clause for the users-networks join. I don't think your issue is timestamps; it's the structure of the query. You may also get a better (or worse) plan by replacing the GROUP BY in the subquery with SELECT DISTINCT users.network_id.
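One more variant that may be worth timing (not taken from either answer above, just the standard rewrite of a count(*) > 0 subquery): EXISTS lets the planner stop at the first matching message per network instead of counting them all.
-- Same filtering logic as the original query, expressed with EXISTS.
SELECT COUNT(*)
FROM networks n
WHERE n.created_at BETWEEN (timestamp '2013-01-01')
                       AND (timestamp '2013-01-31' + interval '-1 second')
  AND EXISTS (
        SELECT 1
        FROM private_messages pm
        JOIN users u ON pm.receiver_id = u.id
        WHERE u.network_id = n.id
          AND pm.created_at BETWEEN (timestamp '2013-03-01')
                                AND (timestamp '2013-03-31' + interval '-1 second')
      );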