I encountered a strange behaviour of the Postgres optimizer on the following query:
select count(product0_.id) as col_0_0_ from Product product0_
where product0_.active=true
and (product0_.aggregatorId is null
or product0_.aggregatorId in ($1 , $2 , $3))
Product has about 54 columns, active is a boolean having a btree index, and aggregatorId is 'varchar(15)` and has a btree index.
On this query above the index for 'aggregatorId' is not used:
Aggregate (cost=169995.75..169995.76 rows=1 width=32) (actual time=3904.726..3904.727 rows=1 loops=1)
-> Seq Scan on product product0_ (cost=0.00..165510.39 rows=1794146 width=32) (actual time=0.055..2407.195 rows=1851827 loops=1)
Filter: (active AND ((aggregatorid IS NULL) OR ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))))
Rows Removed by Filter: 542146
Total runtime: 3904.925 ms
But if we reduce the query by leaving out the null check for this column, the index gets used:
Aggregate (cost=17600.93..17600.94 rows=1 width=32) (actual time=614.933..614.935 rows=1 loops=1)
-> Index Scan using idx_prod_aggr on product product0_ (cost=0.43..17487.56 rows=45347 width=32) (actual time=19.284..594.509 rows=12099 loops=1)
Index Cond: ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))
Filter: active
Rows Removed by Filter: 49130
Total runtime: 150.255 ms
As far as I know a btree index can handle null checks, so I don't understand why the index is not used for the complete query. The product table contains about 2.3 million entries, so it is not very fast.
EDIT:
The index is very standard:
CREATE INDEX idx_prod_aggr
ON product
USING btree
(aggregatorid COLLATE pg_catalog."default");
Since there are many identical values for the column which you use in the where clause (78% of all the table rows according to your numbers), the database will conclude that it is cheaper to use full table scan than to waste additional time to read the index.
The rule of thumb in most database vendors is that index will probably not be used if it can't narrow the search down to about 5% of all the table records.
Your problem looked interesting, so I reproduced your scenario - postgres 9.1, table with 1M rows, one boolean column, one varchar column, both indexed, half of table has NULL names.
I had same explain analyze output when varchar column was not indexed. However, with index postgres uses bitmap scan on NULL condition and IN condition and then merges them with OR condition.
Then he uses seq scan on boolean condition (because indexes are separated)
explain analyze
select * from A where active is true and ((name is null) OR (name in ('1','2','3') ));
See output:
"Bitmap Heap Scan on a (cost=17.34..21.35 rows=1 width=18) (actual time=0.048..0.048 rows=0 loops=1)"
" Recheck Cond: ((name IS NULL) OR ((name)::text = ANY ('{1,2,3}'::text[])))"
" Filter: (active IS TRUE)"
" -> BitmapOr (cost=17.34..17.34 rows=1 width=0) (actual time=0.047..0.047 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..4.41 rows=1 width=0) (actual time=0.010..0.010 rows=0 loops=1)"
" Index Cond: (name IS NULL)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..12.93 rows=1 width=0) (actual time=0.036..0.036 rows=0 loops=1)"
" Index Cond: ((name)::text = ANY ('{1,2,3}'::text[]))"
"Total runtime: 0.077 ms"
This makes me think that you missed some details, if so, add them to your question.
Related
I'm trying to create some queries in order to implement a cursor pagination (something like this: https://shopify.engineering/pagination-relative-cursors) on Postgres. In my implementation I'm trying to reach an efficient pagination even with ordering NON-unique columns.
I'm struggling to do that efficiently, in particular on the query that retrieves the previous page given a specific cursor.
The table that I'm using (>3M records) to test these query is very simple, and it has this structure:
CREATE TABLE "placemarks" (
"id" serial NOT NULL DEFAULT,
"assetId" text,
"createdAt" timestamptz,
PRIMARY KEY ("id")
);
I have an index on the id field clearly and also an index on the assetId column.
This is the query I'm using for retrieving the next page given a cursor composed by the latest ID and the latest assetId:
SELECT
*
FROM
"placemarks"
WHERE
"assetId" > 'CURSOR_ASSETID'
or("assetId" = 'CURSOR_ASSETID'
AND id > CURSOR_INT_ID)
ORDER BY
"assetId",
id
LIMIT 5;
This query is actually pretty fast, it uses the indexes and it allows to handle also duplicated values on assetId by using the unique ID field in order to avoid skipping duplicated rows with same CURSOR_ASSETID values.
-> Sort (cost=25709.62..25726.63 rows=6803 width=2324) (actual time=0.128..0.138 rows=5 loops=1)
" Sort Key: ""assetId"", id"
Sort Method: top-N heapsort Memory: 45kB
-> Bitmap Heap Scan on placemarks (cost=271.29..25596.63 rows=6803 width=2324) (actual time=0.039..0.088 rows=11 loops=1)
" Recheck Cond: (((""assetId"")::text > 'CURSOR_ASSETID'::text) OR ((""assetId"")::text = 'CURSOR_ASSETID'::text))"
" Filter: (((""assetId"")::text > 'CURSOR_ASSETID'::text) OR (((""assetId"")::text = 'CURSOR_ASSETID'::text) AND (id > CURSOR_INT_ID)))"
Rows Removed by Filter: 1
Heap Blocks: exact=10
-> BitmapOr (cost=271.29..271.29 rows=6803 width=0) (actual time=0.030..0.034 rows=0 loops=1)
" -> Bitmap Index Scan on ""placemarks_assetId_key"" (cost=0.00..263.45 rows=6802 width=0) (actual time=0.023..0.023 rows=11 loops=1)"
" Index Cond: ((""assetId"")::text > 'CURSOR_ASSETID'::text)"
" -> Bitmap Index Scan on ""placemarks_assetId_key"" (cost=0.00..4.44 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=1)"
" Index Cond: ((""assetId"")::text = 'CURSOR_ASSETID'::text)"
Planning time: 0.201 ms
Execution time: 0.194 ms
The issue is when I try to get the same page but with the query that should return me the previous page:
SELECT
*
FROM
placemarks
WHERE
"assetId" < 'CURSOR_ASSETID'
or("assetId" = 'CURSOR_ASSETID'
AND id < CURSOR_INT_ID)
ORDER BY
"assetId" desc,
id desc
LIMIT 5;
With this query no indexes are used, even if it would be much faster:
Limit (cost=933644.62..933644.63 rows=5 width=2324)
-> Sort (cost=933644.62..944647.42 rows=4401120 width=2324)
" Sort Key: ""assetId"" DESC, id DESC"
-> Seq Scan on placemarks (cost=0.00..860543.60 rows=4401120 width=2324)
" Filter: (((""assetId"")::text < 'CURSOR_ASSETID'::text) OR (((""assetId"")::text = 'CURSOR_ASSETID'::text) AND (id < CURSOR_INT_ID)))"
I've noticied that by forcing the usage of indexes with SET enable_seqscan = OFF; the query appears to be using the indexes and it performs better and faster. The query plan resulting:
Limit (cost=12.53..12.54 rows=5 width=108) (actual time=0.532..0.555 rows=5 loops=1)
-> Sort (cost=12.53..12.55 rows=6 width=108) (actual time=0.524..0.537 rows=5 loops=1)
Sort Key: assetid DESC, id DESC
Sort Method: top-N heapsort Memory: 25kB
" -> Bitmap Heap Scan on ""placemarks"" (cost=8.33..12.45 rows=6 width=108) (actual time=0.274..0.340 rows=14 loops=1)"
" Recheck Cond: ((assetid < 'CURSOR_ASSETID'::text) OR (assetid = 'CURSOR_ASSETID'::text))"
" Filter: ((assetid < 'CURSOR_ASSETID'::text) OR ((assetid = 'CURSOR_ASSETID'::text) AND (id < 14)))"
Rows Removed by Filter: 1
Heap Blocks: exact=1
-> BitmapOr (cost=8.33..8.33 rows=7 width=0) (actual time=0.152..0.159 rows=0 loops=1)
" -> Bitmap Index Scan on ""placemarks_assetid_idx"" (cost=0.00..4.18 rows=6 width=0) (actual time=0.108..0.110 rows=12 loops=1)"
" Index Cond: (assetid < 'CURSOR_ASSETID'::text)"
" -> Bitmap Index Scan on ""placemarks_assetid_idx"" (cost=0.00..4.15 rows=1 width=0) (actual time=0.036..0.036 rows=3 loops=1)"
" Index Cond: (assetid = 'CURSOR_ASSETID'::text)"
Planning time: 1.319 ms
Execution time: 0.918 ms
Any clue to optimize the second query in order to use always the indexes?
Postgres DB version: 10.20
The fast performance of your first query seems to be down to luck of where your constant 'CURSOR_ASSETID' falls in the distribution of that column. Or maybe this luck is not luck but is how it will always be?
For good performance more generally, including for reverse sorting, you need to write your query with a tuple comparator, not an OR comparator.
WHERE
("assetId",id) < ('something',500000)
If you are using a version before incremental sorting was introduced in v13, or if "assetId" can have a large number of ties, then you will need a multicolumn index on ("assetId",id) to get optimal performance.
And there is no reason to decorate the index with DESC, as PostgreSQL knows how to read the index backwards. Decorating the index is needed when the two columns have different ordering than each other, as then you would need to read the undecorated index "spirally" rather than either completely forward or completely backwards. (But that wouldn't work well here anyway, as tuple comparators can't have different orderings between the columns.)
Setup
A table with one jsonb column attributes and a non unique numeric ID campaignid:
CREATE TABLE coupons (
id integer NOT NULL,
created timestamp with time zone DEFAULT now( ) NOT NULL,
campaignid bigint NOT NULL,
attributes jsonb NOT NULL
);
This table would have up to 500M rows, arbitrary key/values in attributes and hundreds of different campaignid values.
Two indexes exist on the table:
CREATE INDEX campaignid_attrs_idx ON coupons
USING gin (campaignid,attributes);
CREATE INDEX campaignid_idx ON coupons
USING btree (campaignid, deleted);
What I did
I executed the query:
SELECT COUNT(*)
FROM coupons
WHERE
(campaignid = 97 AND
attributes #> '{"CountryId": 3}');
Expected Results
I expected the index campaignid_attrs_idx on (campaignid,attributes) to be fully used and the query to complete quite fast.
Actual Result
The query took a long time (~40 seconds) to execute.
Here's the output from explain (ANALYZE, COSTS):
Aggregate (cost=32337.78..32337.79 rows=1 width=8) (actual time=39726.410..39726.414 rows=1 loops=1)
-> Bitmap Heap Scan on coupons (cost=30164.40..32332.44 rows=2136 width=0) (actual time=16893.439..39549.891 rows=1088478 loops=1)
" Recheck Cond: ((attributes #> '{""CountryId"": 3}'::jsonb) AND (campaignid = 97))"
Rows Removed by Index Recheck: 10531586
Heap Blocks: exact=138344 lossy=583282
-> BitmapAnd (cost=30164.40..30164.40 rows=2136 width=0) (actual time=16837.885..16837.887 rows=0 loops=1)
-> Bitmap Index Scan on coupons_campaignid_attrs_index (cost=0.00..1465.15 rows=178954 width=0) (actual time=9872.279..9872.279 rows=81799565 loops=1)
" Index Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
-> Bitmap Index Scan on campaignid_idx (cost=0.00..28697.93 rows=2135515 width=0) (actual time=6454.972..6454.972 rows=3088167 loops=1)
Index Cond: (campaignid = 97)
Planning Time: 0.175 ms
Execution Time: 39726.480 ms
Conclusions
It seems like the index campaignid_attrs_idx was used for the first part of the query attributes #> '{"CountryId": 3}' returning ~80M rows, while the index campaignid_idx was used on the second part of the WHERE clause campaignid = 97 in parallel returning ~3M rows. Results from both parts were intersected to arrive at a set that fulfills both conditions. Then there was a Bitmap Heap Scan which verified that the result set complies with the desired conditions which took most of the time (16893.439..39549.891)
My main question, why wasn't campaignid_attrs_idx used to filter both conditions?
EDIT: I removed the second index campaignid_attrs_idx to see if then the multicolumn index will be used for both conditions. Strangely I still see that the only one of the conditions used in the index scan. Here's the plan:
Aggregate (cost=181951.27..181951.28 rows=1 width=8) (actual time=209633.017..209633.018 rows=1 loops=1)
-> Bitmap Heap Scan on coupons (cost=1424.30..181945.81 rows=2183 width=0) (actual time=8938.605..209401.433 rows=1091580 loops=1)
" Recheck Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
Rows Removed by Index Recheck: 31487517
Filter: (campaignid = 97)
Rows Removed by Filter: 80674951
Heap Blocks: exact=121875 lossy=5572599
-> Bitmap Index Scan on coupons_campaignid_attributes_idx (cost=0.00..1423.75 rows=179434 width=0) (actual time=8908.682..8908.682 rows=81802589 loops=1)
" Index Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
Planning Time: 6.885 ms
Execution Time: 209638.234 ms
I have a function that is running too slow. I've isolated which piece of the function is slow.. a small SELECT statement:
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
ON (fam.provider_data_id = pf.provider_data_id
AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL)
LIMIT 1
When I run the function this piece of SQL generates the following query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=0.56..6.75 rows=1 width=6) (actual time=3471.004..3471.004 rows=0 loops=1)
-> Nested Loop (cost=0.56..594054.42 rows=96017 width=6) (actual time=3471.002..3471.002 rows=0 loops=1)
-> Seq Scan on image_family fam (cost=0.00..391880.08 rows=96023 width=6) (actual time=3471.001..3471.001 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
When I run the selected query in a query tool (outside of the function) the query plan looks like this:
Limit (cost=1.12..3.81 rows=1 width=6) (actual time=0.043..0.043 rows=1 loops=1)
Output: pf.image_group_id
Buffers: shared hit=11
-> Nested Loop (cost=1.12..14.55 rows=5 width=6) (actual time=0.041..0.041 rows=1 loops=1)
Output: pf.image_group_id
Inner Unique: true
Buffers: shared hit=11
-> Index Only Scan using image_family_family_id_provider_data_id_idx on programs.image_family fam (cost=0.56..1.65 rows=5 width=6) (actual time=0.024..0.024 rows=1 loops=1)
Output: fam.family_id, fam.provider_data_id
Index Cond: (fam.family_id = 8419853)
Heap Fetches: 2
Buffers: shared hit=6
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on programs.provider_file pf (cost=0.56..2.58 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
Output: pf.provider_data_id, pf.provider_file_path, pf.posted_dt, pf.file_repository_id, pf.restricted_size, pf.image_group_id, pf.is_master, pf.is_biggest
Index Cond: (pf.provider_data_id = fam.provider_data_id)
Filter: (pf.image_group_id IS NOT NULL)
Buffers: shared hit=5
Planning time: 0.809 ms
Execution time: 0.100 ms
If I disable sequence scans in the function I can get a similar query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=1.12..8.00 rows=1 width=6) (actual time=3855.722..3855.722 rows=0 loops=1)
-> Nested Loop (cost=1.12..660217.34 rows=96017 width=6) (actual time=3855.721..3855.721 rows=0 loops=1)
-> Index Only Scan using image_family_family_id_provider_data_id_idx on image_family fam (cost=0.56..458043.00 rows=96023 width=6) (actual time=3855.720..3855.720 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
Heap Fetches: 368
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
The query plans are different where the Filter functions are for the Index Only Scan. The function has more Heap Fetches and seems to treat the argument as a string casted to a numeric.
Things I've tried:
Increasing statistics (and running vacuum/analyze)
Calling the problematic piece of SQL in another function with language SQL
Add another index (the one that its using now to perform an INDEX ONLY scan)
Create a CTE for the image_family table (this did help performance but would still do a sequence scan on the image_family instead of using the index so still, too slow)
Change from executing raw SQL to using an EXECUTE ... INTO .. USING in the function.
Makeup of the two tables:
image_family:
provider_data_id: numeric(16)
family_id: int4
(rest omitted for brevity)
unique index on provider_data_id
index on family_id
I recently added a unique index on (family_id, provider_data_id) as well
Approximately 20 million rows here. Families have many provider_data_ids but not all provider_data_ids are part of families and thus aren't all in this table.
provider_file:
provider_data_id numeric(16)
image_group_id numeric(16)
(rest omitted for brevity)
unique index on provider_data_id
Approximately 32 million rows in this table. Most rows (> 95%) have a Non-Null image_group_id.
Postgres Version 10
How can I get the query performance to match whether I call it from a function or as raw SQL in a query tool?
The problem is exhibited in this line:
Filter: ((family_id)::numeric = '8419853'::numeric)
The index on family_id cannot be used because family_id is compared to a numeric value. This requires a cast to numeric, and there is no index on family_id::numeric.
Even though integer and numeric both are types representing numbers, their internal representation is quite different, and so the indexes are incompatible. In other words, the cast to numeric is like a function for PostgreSQL, and since it has no index on that functional expression, it has to resort to a scan of the whole table (or index).
The solution is simple, however: use an integer instead of a numeric parameter for the query. If in doubt, use a cast like
fam.family_id = $1::integer
Recently we changed the format of one of our tables from using a single entry in a column to having a JSONB column in the format of ["key1","key2","key3"] etc. Although we built a GIN index on the JSONB field the queries that we use on it are EXTREMELY slow (in the range of 50 minutes in explain plan). I am trying to find out a way to optimize the query and to correctly utilize the index. I pasted the query below as well as the explain plan for it. The indexed fields are visit.visitor, launch.campaign_key, launch.launch_key, visit.store_key and visits.stop JSONB field as GIN index. We are using PostgresQL 9.4
explain (analyze on) select count(subselect.visitors) as visitors,
subselect.campaign as campaign
from (
select distinct visit.visitor as visitors,
launch.campaign_key as campaign
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key)) where
visit.store_key = 'ahBzfmdlYXJsYXVuY2gtaHVi'
and launch.state = 'PRODUCTION') as subselect group by subselect.campaign
Explain results:
HashAggregate (cost=63873548.47..63873550.47 rows=200 width=88) (actual time=248617.348..248617.365 rows=58 loops=1)
Group Key: launch.campaign_key
-> HashAggregate (cost=63519670.22..63661221.52 rows=14155130 width=88) (actual time=248587.320..248616.558 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> HashAggregate (cost=63307343.27..63448894.57 rows=14155130 width=88) (actual time=248553.278..248584.868 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> Nested Loop (cost=4903.09..56997885.96 rows=1261891461 width=88) (actual time=180648.410..248550.249 rows=2085 loops=1)
Join Filter: jsonb_exists(visit.stops, (launch.launch_key)::text)
Rows Removed by Join Filter: 624114512
-> Bitmap Heap Scan on launch (cost=3213.19..126084.38 rows=169389 width=123) (actual time=32.082..317.561 rows=166121 loops=1)
Recheck Cond: ((state)::text = 'PRODUCTION'::text)
Heap Blocks: exact=56635
-> Bitmap Index Scan on launch_state_idx (cost=0.00..3170.85 rows=169389 width=0) (actual time=21.172..21.172 rows=166121 loops=1)
Index Cond: ((state)::text = 'PRODUCTION'::text)
-> Materialize (cost=1689.89..86736.04 rows=22349 width=117) (actual time=0.000..0.487 rows=3757 loops=166121)
-> Bitmap Heap Scan on visit (cost=1689.89..86624.29 rows=22349 width=117) (actual time=1.324..14.381 rows=3757 loops=1)
Recheck Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Heap Blocks: exact=3672
-> Bitmap Index Scan on visit_store_key_idx (cost=0.00..1684.31 rows=22349 width=0) (actual time=0.780..0.780 rows=3757 loops=1)
Index Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Planning time: 0.232 ms
Execution time: 248708.088 ms
I should mention the index on stops is built
CREATE INDEX ON visit USING GIN (stops)
I'm wondering if switching to building it to
CREATE INDEX ON visit USING GIN (stops->’value')
Will resolve the issue?
The wrapper function jsonb_exists() prevents the use of the gin index on visits.stops. Instead of
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key))
try
from visit
join launch on visit.stops ? launch.launch_key::text
I have an Events table with 30 million rows. The following query returns in 25 seconds
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" is null
AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') #> lower(time_range)
AND ("status" is null or (status->>'pre_processed') IS NULL)
status is a jsonb column with an index on status->>'pre_processed'. Here are the other indexes that were created on the events table. time_range is of type TSTZRANGE.
CREATE INDEX events_time_range_idx ON events USING gist (time_range);
CREATE INDEX events_lower_time_range_index on events(lower(time_range));
CREATE INDEX events_upper_time_range_index on events(upper(time_range));
CREATE INDEX events_calendar_id_index on events (calendar_id)
I'm definitely out of my comfort zone on this and am trying to reduce the query time. Here's the output of explain analyze
HashAggregate (cost=7486635.89..7486650.53 rows=1464 width=48) (actual time=26989.272..26989.306 rows=98 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7486628.57 rows=1464 width=48) (actual time=316.110..26988.941 rows=98 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1464 width=50) (actual time=316.049..26985.344 rows=98 loops=1)
Filter: ((deleted_at IS NULL) AND ((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange #> lower(time_range)))
Rows Removed by Filter: 31592898
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.50 rows=1 width=48) (actual time=0.030..0.031 rows=1 loops=98)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 1.468 ms
Execution time: 26989.370 ms
And here is the explain analyze with the events.deleted_at part of the query removed
HashAggregate (cost=7487382.57..7487398.33 rows=1576 width=48) (actual time=23880.466..23880.503 rows=115 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7487374.69 rows=1576 width=48) (actual time=16.612..23880.114 rows=115 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1576 width=50) (actual time=16.576..23876.844 rows=115 loops=1)
Filter: (((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange #> lower(time_range)))
Rows Removed by Filter: 31592881
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.44 rows=1 width=48) (actual time=0.022..0.023 rows=1 loops=115)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 0.372 ms
Execution time: 23880.571 ms
I added the index on the status column. Everything else what already there and I'm unsure how to proceed going forward. Any suggestions on how to get the query time down to a more manageable number?
The B-tree index on lower(time_range) can only be used for conditions involving the <, <=, =, >= and > operators. The #> operator may rely on these internally, but as far as the planner is concerned, this range check operation is a black box, and so it can't make use of the index.
You will need to reformulate your condition in terms of the B-tree operators, i.e.:
lower(time_range) >= '2016-04-21T12:12:36-07:00' AND
lower(time_range) < '2016-04-21T12:22:36-07:00'
So add an index for events.deleted_at to get rid of the nasty sequential scan. What does it look like after that?