Querying Postgres Table with JSONB data - sql

I have a table which stores data in a JSONB column.
Now, what I want to do is query that table and fetch the records that have specific values for a key.
This works fine:
SELECT "documents".*
FROM "documents"
WHERE (data @> '{"type": "foo"}')
But what I want to do is fetch all the rows in the table whose type is foo OR bar.
I tried this:
SELECT "documents".*
FROM "documents"
WHERE (data @> '{"type": ["foo", "bar"]}')
But this doesn't seem to work.
I also tried this:
SELECT "documents".*
FROM "documents"
WHERE (data->'type' ?| array['foo', 'bar'])
This works, but if I specify a key like data->'type', it takes away the dynamicity of the query.
BTW, I am using Ruby on Rails with Postgres, so all the queries go through ActiveRecord, like this:
Document.where("data @> ?", query)

if I specify a key like data->'type', it takes away the dynamicity of the query.
I understand you have a gin index on the column data defined like this:
CREATE INDEX ON documents USING GIN (data);
The index works for this query:
EXPLAIN ANALYSE
SELECT "documents".*
FROM "documents"
WHERE data @> '{"type": "foo"}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=30.32..857.00 rows=300 width=25) (actual time=0.639..0.640 rows=1 loops=1)
Recheck Cond: (data @> '{"type": "foo"}'::jsonb)
Heap Blocks: exact=1
-> Bitmap Index Scan on documents_data_idx (cost=0.00..30.25 rows=300 width=0) (actual time=0.581..0.581 rows=1 loops=1)
Index Cond: (data @> '{"type": "foo"}'::jsonb)
Planning time: 7.928 ms
Execution time: 0.841 ms
but not for this one:
EXPLAIN ANALYSE
SELECT "documents".*
FROM "documents"
WHERE (data->'type' ?| array['foo', 'bar']);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Seq Scan on documents (cost=0.00..6702.98 rows=300 width=25) (actual time=31.895..92.813 rows=2 loops=1)
Filter: ((data -> 'type'::text) ?| '{foo,bar}'::text[])
Rows Removed by Filter: 299997
Planning time: 1.836 ms
Execution time: 92.839 ms
Solution 1. Use the operator @> twice; the index will be used for both conditions:
EXPLAIN ANALYSE
SELECT "documents".*
FROM "documents"
WHERE data @> '{"type": "foo"}'
OR data @> '{"type": "bar"}';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=60.80..1408.13 rows=600 width=25) (actual time=0.222..0.233 rows=2 loops=1)
Recheck Cond: ((data @> '{"type": "foo"}'::jsonb) OR (data @> '{"type": "bar"}'::jsonb))
Heap Blocks: exact=2
-> BitmapOr (cost=60.80..60.80 rows=600 width=0) (actual time=0.204..0.204 rows=0 loops=1)
-> Bitmap Index Scan on documents_data_idx (cost=0.00..30.25 rows=300 width=0) (actual time=0.144..0.144 rows=1 loops=1)
Index Cond: (data @> '{"type": "foo"}'::jsonb)
-> Bitmap Index Scan on documents_data_idx (cost=0.00..30.25 rows=300 width=0) (actual time=0.059..0.059 rows=1 loops=1)
Index Cond: (data @> '{"type": "bar"}'::jsonb)
Planning time: 3.170 ms
Execution time: 0.289 ms
Solution 2. Create an additional index on (data->'type'):
CREATE INDEX ON documents USING GIN ((data->'type'));
EXPLAIN ANALYSE
SELECT "documents".*
FROM "documents"
WHERE (data->'type' ?| array['foo', 'bar']);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=30.32..857.75 rows=300 width=25) (actual time=0.056..0.067 rows=2 loops=1)
Recheck Cond: ((data -> 'type'::text) ?| '{foo,bar}'::text[])
Heap Blocks: exact=2
-> Bitmap Index Scan on documents_expr_idx (cost=0.00..30.25 rows=300 width=0) (actual time=0.035..0.035 rows=2 loops=1)
Index Cond: ((data -> 'type'::text) ?| '{foo,bar}'::text[])
Planning time: 2.951 ms
Execution time: 0.108 ms
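The expression index also covers the other jsonb existence operators on data->'type', so a single-value check stays indexable as well; a minimal sketch on the same table:
SELECT "documents".*
FROM "documents"
WHERE data->'type' ? 'foo';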
Solution 3. This is in fact a variant of solution 1, with a different format of the condition that may be more convenient for the client program:
EXPLAIN ANALYSE
SELECT "documents".*
FROM "documents"
WHERE data @> any(array['{"type": "foo"}', '{"type": "bar"}']::jsonb[]);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents (cost=60.65..1544.20 rows=600 width=26) (actual time=0.803..0.819 rows=2 loops=1)
Recheck Cond: (data @> ANY ('{"{\"type\": \"foo\"}","{\"type\": \"bar\"}"}'::jsonb[]))
Heap Blocks: exact=2
-> Bitmap Index Scan on documents_data_idx (cost=0.00..60.50 rows=600 width=0) (actual time=0.778..0.778 rows=2 loops=1)
Index Cond: (data @> ANY ('{"{\"type\": \"foo\"}","{\"type\": \"bar\"}"}'::jsonb[]))
Planning time: 2.080 ms
Execution time: 0.304 ms
(7 rows)
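A hedged sketch of how a client can pass that whole condition as a single bind parameter (the prepared-statement name below is made up for illustration); from ActiveRecord a similar placeholder-based form should work, though the exact binding of the jsonb[] array depends on the adapter:
PREPARE docs_by_type(jsonb[]) AS
SELECT "documents".*
FROM "documents"
WHERE data @> ANY($1);
EXECUTE docs_by_type(ARRAY['{"type": "foo"}', '{"type": "bar"}']::jsonb[]);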
Read more in the documentation.

Related

jsonb field indexing

I am trying to optimize the following SQL query:
select * from begin_transaction where ("group"->>'id')::bigint = '5'
Without additional indexing I get this:
Gather (cost=1000.00..91957.50 rows=4179 width=750) (actual time=0.158..218.972 rows=715002 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on begin_transaction (cost=0.00..90539.60 rows=1741 width=750) (actual time=0.020..127.525 rows=238334 loops=3)
Filter: (((""group"" ->> 'id'::text))::bigint = '5'::bigint)
Rows Removed by Filter: 40250
Planning Time: 0.039 ms
Execution Time: 235.200 ms
If I add a btree index:
CREATE INDEX begin_transaction_group_id_idx
ON public.begin_transaction USING btree (((("group"->>'id'::text))::bigint))
I receive
Bitmap Heap Scan on begin_transaction (cost=80.81..13773.97 rows=4179 width=750) (actual time=43.647..414.756 rows=715002 loops=1)
Recheck Cond: (((""group"" ->> 'id'::text))::bigint = '5'::bigint)
Rows Removed by Index Recheck: 52117
Heap Blocks: exact=50534 lossy=33026
-> Bitmap Index Scan on begin_transaction_group_id_idx (cost=0.00..79.77 rows=4179 width=0) (actual time=35.852..35.852 rows=715002 loops=1)
Index Cond: (((""group"" ->> 'id'::text))::bigint = '5'::bigint)
Planning Time: 0.045 ms
Execution Time: 429.968 ms
Any ideas how to go about increasing performance? The group field is jsonb.
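One hedged thing to check, given the lossy heap blocks in the plan above: a bitmap scan turns lossy when the bitmap no longer fits in work_mem, so retrying with a larger per-session setting shows whether that contributes to the cost (table and column names as in the question; the value is just an example):
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM begin_transaction WHERE ("group"->>'id')::bigint = '5';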

PostgreSQL: Multicolumn index (jsonb, integer) used partially with @> and = conditions

Setup
A table with one jsonb column attributes and a non-unique numeric ID campaignid:
CREATE TABLE coupons (
id integer NOT NULL,
created timestamp with time zone DEFAULT now( ) NOT NULL,
campaignid bigint NOT NULL,
attributes jsonb NOT NULL
);
This table would have up to 500M rows, arbitrary key/values in attributes and hundreds of different campaignid values.
Two indexes exist on the table:
CREATE INDEX campaignid_attrs_idx ON coupons
USING gin (campaignid,attributes);
CREATE INDEX campaignid_idx ON coupons
USING btree (campaignid, deleted);
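Note that a GIN index over a plain bigint column such as campaignid is only possible with the btree_gin extension installed (which the setup above implies); a sketch of the prerequisite, in case the index needs to be recreated elsewhere:
CREATE EXTENSION IF NOT EXISTS btree_gin;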
What I did
I executed the query:
SELECT COUNT(*)
FROM coupons
WHERE
(campaignid = 97 AND
attributes #> '{"CountryId": 3}');
Expected Results
I expected the index campaignid_attrs_idx on (campaignid,attributes) to be fully used and the query to complete quite fast.
Actual Result
The query took a long time (~40 seconds) to execute.
Here's the output from explain (ANALYZE, COSTS):
Aggregate (cost=32337.78..32337.79 rows=1 width=8) (actual time=39726.410..39726.414 rows=1 loops=1)
-> Bitmap Heap Scan on coupons (cost=30164.40..32332.44 rows=2136 width=0) (actual time=16893.439..39549.891 rows=1088478 loops=1)
" Recheck Cond: ((attributes #> '{""CountryId"": 3}'::jsonb) AND (campaignid = 97))"
Rows Removed by Index Recheck: 10531586
Heap Blocks: exact=138344 lossy=583282
-> BitmapAnd (cost=30164.40..30164.40 rows=2136 width=0) (actual time=16837.885..16837.887 rows=0 loops=1)
-> Bitmap Index Scan on coupons_campaignid_attrs_index (cost=0.00..1465.15 rows=178954 width=0) (actual time=9872.279..9872.279 rows=81799565 loops=1)
" Index Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
-> Bitmap Index Scan on campaignid_idx (cost=0.00..28697.93 rows=2135515 width=0) (actual time=6454.972..6454.972 rows=3088167 loops=1)
Index Cond: (campaignid = 97)
Planning Time: 0.175 ms
Execution Time: 39726.480 ms
Conclusions
It seems like the index campaignid_attrs_idx was used for the first part of the query, attributes @> '{"CountryId": 3}', returning ~80M rows, while the index campaignid_idx was used on the second part of the WHERE clause, campaignid = 97, in parallel, returning ~3M rows. Results from both parts were intersected to arrive at a set that fulfills both conditions. Then there was a Bitmap Heap Scan which verified that the result set complies with the desired conditions, and that took most of the time (16893.439..39549.891).
My main question: why wasn't campaignid_attrs_idx used to filter both conditions?
EDIT: I removed the second index campaignid_idx to see whether the multicolumn index would then be used for both conditions. Strangely, I still see only one of the conditions used in the index scan. Here's the plan:
Aggregate (cost=181951.27..181951.28 rows=1 width=8) (actual time=209633.017..209633.018 rows=1 loops=1)
-> Bitmap Heap Scan on coupons (cost=1424.30..181945.81 rows=2183 width=0) (actual time=8938.605..209401.433 rows=1091580 loops=1)
" Recheck Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
Rows Removed by Index Recheck: 31487517
Filter: (campaignid = 97)
Rows Removed by Filter: 80674951
Heap Blocks: exact=121875 lossy=5572599
-> Bitmap Index Scan on coupons_campaignid_attributes_idx (cost=0.00..1423.75 rows=179434 width=0) (actual time=8908.682..8908.682 rows=81802589 loops=1)
" Index Cond: (attributes #> '{""CountryId"": 3}'::jsonb)"
Planning Time: 6.885 ms
Execution Time: 209638.234 ms

Full text search query is slow on first query

I have an airports table which contains a list of nearly 4k airports. The table has a searchable column, which is a tsvector column with an index airports_searchable_index:
searchable tsvector NULL
CREATE INDEX airports_searchable_index ON airports USING gin (searchable)
Given I have an indexed document in the searchable column and I attempt to run a query against that column, I get very quick responses on my dev machine (around 3ms for the query) but around 650ms on production (using the exact same data!). The weird part is that my production machine is much stronger than my local dev machine. A query for example:
select * from "airports" where searchable ## to_tsquery('public.hebrew', 'ltn:*') order by "popularity" desc limit 100
I've opened PGAdmin and tried doing some tests. What I saw is that the first time I run the query above in a new "Query Tool Panel", it takes anywhere between 650-800ms to execute. However, on the second run it takes 30-60ms, even if I change the query term. I concluded from that that Postgres is possibly opening the document in memory for each connection and running the query against that. Since I'm using PHP to talk to my backend, every request is going to open its own connection to the DB, hence causing Postgres to constantly re-open the document.
Could it be a misconfiguration on my production server?
Here is an explain query (for production server):
Limit (cost=24.03..24.04 rows=1 width=8) (actual time=0.048..0.048 rows=1 loops=1)
Output: id, popularity
Buffers: shared hit=4
-> Sort (cost=24.03..24.04 rows=1 width=8) (actual time=0.047..0.047 rows=1 loops=1)
Output: id, popularity
Sort Key: airports.popularity DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4
-> Bitmap Heap Scan on lametayel.airports (cost=20.01..24.02 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
Output: id, popularity
Recheck Cond: (airports.searchable @@ '''ltn'':*'::tsquery)
Heap Blocks: exact=1
Buffers: shared hit=4
-> Bitmap Index Scan on airports_searchable_index (cost=0.00..20.01 rows=1 width=0) (actual time=0.036..0.036 rows=1 loops=1)
Index Cond: (airports.searchable @@ '''ltn'':*'::tsquery)
Buffers: shared hit=3
Planning time: 0.304 ms
Execution time: 0.078 ms
Here is an explain query (for development server):
Limit (cost=28.03..28.04 rows=1 width=8) (actual time=0.065..0.067 rows=1 loops=1)
Output: id, popularity
Buffers: shared hit=5
-> Sort (cost=28.03..28.04 rows=1 width=8) (actual time=0.064..0.065 rows=1 loops=1)
Output: id, popularity
Sort Key: airports.popularity DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=5
-> Bitmap Heap Scan on lametayel.airports (cost=24.01..28.02 rows=1 width=8) (actual time=0.046..0.047 rows=1 loops=1)
Output: id, popularity
Recheck Cond: (airports.searchable @@ '''ltn'':*'::tsquery)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on airports_searchable_index (cost=0.00..24.01 rows=1 width=0) (actual time=0.038..0.038 rows=1 loops=1)
Index Cond: (airports.searchable @@ '''ltn'':*'::tsquery)
Buffers: shared hit=4
Planning time: 0.534 ms
Execution time: 0.122 ms
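A hedged way to test the cold-cache explanation above: run the query with buffer statistics on a fresh connection and compare runs; shared read counts on the first execution that become shared hit on later ones point at page caching rather than a per-connection configuration problem (query taken from the question):
EXPLAIN (ANALYZE, BUFFERS)
select * from "airports"
where searchable @@ to_tsquery('public.hebrew', 'ltn:*')
order by "popularity" desc limit 100;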

Significantly different time of execution of the query due to different date value in Postgres

I have a weird case of query execution performance here. The query has date values in the WHERE clause, and the speed of execution varies with the date value.
Actually:
for dates within the last 30 days, execution takes around 3 minutes
for dates before the last 30 days, execution takes a few seconds
The query is listed below, with the date in the last 30 days range:
select
sk2_.code as col_0_0_,
bra4_.code as col_1_0_,
st0_.quantity as col_2_0_,
bat1_.forecast as col_3_0_
from
TBL_st st0_,
TBL_bat bat1_,
TBL_sk sk2_,
TBL_bra bra4_
where
st0_.batc_id=bat1_.id
and bat1_.sku_id=sk2_.id
and bat1_.bran_id=bra4_.id
and not (exists (select
1
from
TBL_st st6_,
TBL_bat bat7_,
TBL_sk sk10_
where
st6_.batc_id=bat7_.id
and bat7_.sku_id=sk10_.id
and bat7_.bran_id=bat1_.bran_id
and sk10_.code=sk2_.code
and st6_.date>st0_.date
and sk10_.acco_id=1
and st6_.date>='2017-04-20'
and st6_.date<='2017-04-30'))
and sk2_.acco_id=1
and st0_.date>='2017-04-20'
and st0_.date<='2017-04-30'
and here is the plan for the query with the date in the last 30 days range:
Nested Loop (cost=289.06..19764.03 rows=1 width=430) (actual time=3482.062..326049.246 rows=249 loops=1)
-> Nested Loop Anti Join (cost=288.91..19763.86 rows=1 width=433) (actual time=3482.023..326048.023 rows=249 loops=1)
Join Filter: ((st6_.date > st0_.date) AND ((sk10_.code)::text = (sk2_.code)::text))
Rows Removed by Join Filter: 210558
-> Nested Loop (cost=286.43..13719.38 rows=1 width=441) (actual time=4.648..2212.042 rows=2474 loops=1)
-> Nested Loop (cost=286.00..6871.33 rows=13335 width=436) (actual time=4.262..657.823 rows=666738 loops=1)
-> Index Scan using uk_TBL_sk0_account_code on TBL_sk sk2_ (cost=0.14..12.53 rows=1 width=426) (actual time=1.036..1.084 rows=50 loops=1)
Index Cond: (acco_id = 1)
-> Bitmap Heap Scan on TBL_bat bat1_ (cost=285.86..6707.27 rows=15153 width=26) (actual time=3.675..11.308 rows=13335 loops=50)
Recheck Cond: (sku_id = sk2_.id)
Heap Blocks: exact=241295
-> Bitmap Index Scan on ix_al_batc_sku_id (cost=0.00..282.07 rows=15153 width=0) (actual time=3.026..3.026 rows=13335 loops=50)
Index Cond: (sku_id = sk2_.id)
-> Index Scan using ix_al_stle_batc_id on TBL_st st0_ (cost=0.42..0.50 rows=1 width=21) (actual time=0.002..0.002 rows=0 loops=666738)
Index Cond: (batc_id = bat1_.id)
Filter: ((date >= '2017-04-20 00:00:00'::timestamp without time zone) AND (date <= '2017-04-30 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1
-> Nested Loop (cost=2.49..3023.47 rows=1 width=434) (actual time=111.345..130.883 rows=86 loops=2474)
-> Hash Join (cost=2.06..2045.18 rows=1905 width=434) (actual time=0.010..28.028 rows=54853 loops=2474)
Hash Cond: (bat7_.sku_id = sk10_.id)
-> Index Scan using ix_al_batc_bran_id on TBL_bat bat7_ (cost=0.42..1667.31 rows=95248 width=24) (actual time=0.009..11.045 rows=54853 loops=2474)
Index Cond: (bran_id = bat1_.bran_id)
-> Hash (cost=1.63..1.63 rows=1 width=426) (actual time=0.026..0.026 rows=50 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on TBL_sk sk10_ (cost=0.00..1.63 rows=1 width=426) (actual time=0.007..0.015 rows=50 loops=1)
Filter: (acco_id = 1)
-> Index Scan using ix_al_stle_batc_id on TBL_st st6_ (cost=0.42..0.50 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=135706217)
Index Cond: (batc_id = bat7_.id)
Filter: ((date >= '2017-04-20 00:00:00'::timestamp without time zone) AND (date <= '2017-04-30 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1
-> Index Scan using TBL_bra_pk on TBL_bra bra4_ (cost=0.14..0.16 rows=1 width=13) (actual time=0.003..0.003 rows=1 loops=249)
Index Cond: (id = bat1_.bran_id)
Planning time: 8.108 ms
Execution time: 326049.583 ms
Here is the same query with the date before the last 30 days range:
select
sk2_.code as col_0_0_,
bra4_.code as col_1_0_,
st0_.quantity as col_2_0_,
bat1_.forecast as col_3_0_
from
TBL_st st0_,
TBL_bat bat1_,
TBL_sk sk2_,
TBL_bra bra4_
where
st0_.batc_id=bat1_.id
and bat1_.sku_id=sk2_.id
and bat1_.bran_id=bra4_.id
and not (exists (select
1
from
TBL_st st6_,
TBL_bat bat7_,
TBL_sk sk10_
where
st6_.batc_id=bat7_.id
and bat7_.sku_id=sk10_.id
and bat7_.bran_id=bat1_.bran_id
and sk10_.code=sk2_.code
and st6_.date>st0_.date
and sk10_.acco_id=1
and st6_.date>='2017-01-20'
and st6_.date<='2017-01-30'))
and sk2_.acco_id=1
and st0_.date>='2017-01-20'
and st0_.date<='2017-01-30'
and here is the plan for the query with the date before the last 30 days range:
Hash Join (cost=576.33..27443.95 rows=48 width=430) (actual time=132.732..3894.554 rows=250 loops=1)
Hash Cond: (bat1_.bran_id = bra4_.id)
-> Merge Anti Join (cost=572.85..27439.82 rows=48 width=433) (actual time=132.679..3894.287 rows=250 loops=1)
Merge Cond: ((sk2_.code)::text = (sk10_.code)::text)
Join Filter: ((st6_.date > st0_.date) AND (bat7_.bran_id = bat1_.bran_id))
Rows Removed by Join Filter: 84521
-> Nested Loop (cost=286.43..13719.38 rows=48 width=441) (actual time=26.105..1893.523 rows=2491 loops=1)
-> Nested Loop (cost=286.00..6871.33 rows=13335 width=436) (actual time=1.159..445.683 rows=666738 loops=1)
-> Index Scan using uk_TBL_sk0_account_code on TBL_sk sk2_ (cost=0.14..12.53 rows=1 width=426) (actual time=0.035..0.084 rows=50 loops=1)
Index Cond: (acco_id = 1)
-> Bitmap Heap Scan on TBL_bat bat1_ (cost=285.86..6707.27 rows=15153 width=26) (actual time=1.741..7.148 rows=13335 loops=50)
Recheck Cond: (sku_id = sk2_.id)
Heap Blocks: exact=241295
-> Bitmap Index Scan on ix_al_batc_sku_id (cost=0.00..282.07 rows=15153 width=0) (actual time=1.119..1.119 rows=13335 loops=50)
Index Cond: (sku_id = sk2_.id)
-> Index Scan using ix_al_stle_batc_id on TBL_st st0_ (cost=0.42..0.50 rows=1 width=21) (actual time=0.002..0.002 rows=0 loops=666738)
Index Cond: (batc_id = bat1_.id)
Filter: ((date >= '2017-01-20 00:00:00'::timestamp without time zone) AND (date <= '2017-01-30 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1
-> Materialize (cost=286.43..13719.50 rows=48 width=434) (actual time=15.584..1986.953 rows=84560 loops=1)
-> Nested Loop (cost=286.43..13719.38 rows=48 width=434) (actual time=15.577..1983.384 rows=2491 loops=1)
-> Nested Loop (cost=286.00..6871.33 rows=13335 width=434) (actual time=0.843..482.864 rows=666738 loops=1)
-> Index Scan using uk_TBL_sk0_account_code on TBL_sk sk10_ (cost=0.14..12.53 rows=1 width=426) (actual time=0.005..0.052 rows=50 loops=1)
Index Cond: (acco_id = 1)
-> Bitmap Heap Scan on TBL_bat bat7_ (cost=285.86..6707.27 rows=15153 width=24) (actual time=2.051..7.902 rows=13335 loops=50)
Recheck Cond: (sku_id = sk10_.id)
Heap Blocks: exact=241295
-> Bitmap Index Scan on ix_al_batc_sku_id (cost=0.00..282.07 rows=15153 width=0) (actual time=1.424..1.424 rows=13335 loops=50)
Index Cond: (sku_id = sk10_.id)
-> Index Scan using ix_al_stle_batc_id on TBL_st st6_ (cost=0.42..0.50 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=666738)
Index Cond: (batc_id = bat7_.id)
Filter: ((date >= '2017-01-20 00:00:00'::timestamp without time zone) AND (date <= '2017-01-30 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 1
-> Hash (cost=2.10..2.10 rows=110 width=13) (actual time=0.033..0.033 rows=110 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on TBL_bra bra4_ (cost=0.00..2.10 rows=110 width=13) (actual time=0.004..0.013 rows=110 loops=1)
Planning time: 14.542 ms
Execution time: 3894.793 ms
Does anyone have an idea why this happens?
Has anyone had experience with anything similar?
Thank you very much.
Kind regards, Petar
I am not sure, but I had a similar case a while ago (on Oracle, but I guess that is not important).
In my case the difference originated in the amount of data, meaning: if you have 1% of the data in the last 30 days, it uses the indexes; when you need "older" data (the remaining 99% of the data) it decides not to use the index and to do a full scan (in the form of a nested loop rather than a hash join).
If you are sure that the data distribution is OK, then maybe try collecting statistics (it worked for me at the time). Eventually you can analyze every piece of this query to see which part exactly is the bottleneck and work from there.
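A minimal sketch of that statistics refresh, using the table names from the question:
ANALYZE TBL_st;
ANALYZE TBL_bat;
ANALYZE TBL_sk;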
BTree indexes can have some issues with dates, especially if you're removing old data from the table (i.e., deleting everything older than 90 days). It can cause the index to get lopsided, with all of the rows being down one branch of the tree. Even without removing old dates, if there are many more "new" rows than "old" rows, it can still happen.
But I don't see your query plans using an index on st0_.date, so I don't think that's the issue. If you can afford a table lock on st0_, you can test this theory by running a REINDEX operation on any indexes that contain st0_.date.
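A sketch of that test; TBL_st is the table aliased as st0_, and REINDEX locks the table while it runs:
REINDEX TABLE TBL_st;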
Instead, I think you just have a lot more rows that match the 2017-01-20 to 2017-01-30 range vs. the 2017-04-20 to 2017-04-30 range. The first doubly indented Nested Loop is the same in both queries, so I'll ignore it. The second doubly indented stanza is different, and much more expensive in the slow query:
-> Materialize (cost=286.43..13719.50 rows=48 width=434) (actual time=15.584..1986.953 rows=84560 loops=1)
-> Nested Loop (cost=286.43..13719.38 rows=48 width=434) (actual time=15.577..1983.384 rows=2491 loops=1)
-> Nested Loop (cost=286.00..6871.33 rows=13335 width=434) (actual time=0.843..482.864 rows=666738
vs
-> Nested Loop (cost=2.49..3023.47 rows=1 width=434) (actual time=111.345..130.883 rows=86 loops=2474)
-> Hash Join (cost=2.06..2045.18 rows=1905 width=434) (actual time=0.010..28.028 rows=54853 loops=2474)
Materialize can be an expensive operation that doesn't necessarily scale with the estimated cost. Take a look at https://www.postgresql.org/docs/10/static/using-explain.html, and search for "Materialize". Also note that the estimated number of rows is much higher in the slow version.
I'm not 100% sure, but I believe that tweaking the "work_mem" parameter can have some effect in this area (https://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#GUC-WORK-MEM). To test this theory, you can change that value per session using
SET LOCAL work_mem = '8MB';

Why is this query taking so long on JSONB Gin index field? Can I fix it so it actually uses the index?

Recently we changed the format of one of our tables from using a single entry in a column to having a JSONB column in the format ["key1","key2","key3"] etc. Although we built a GIN index on the JSONB field, the queries that we use on it are EXTREMELY slow (in the range of 50 minutes in the explain plan). I am trying to find a way to optimize the query and to correctly utilize the index. I pasted the query below as well as the explain plan for it. The indexed fields are visit.visitor, launch.campaign_key, launch.launch_key, visit.store_key, and the visit.stops JSONB field as a GIN index. We are using PostgreSQL 9.4.
explain (analyze on) select count(subselect.visitors) as visitors,
subselect.campaign as campaign
from (
select distinct visit.visitor as visitors,
launch.campaign_key as campaign
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key)) where
visit.store_key = 'ahBzfmdlYXJsYXVuY2gtaHVi'
and launch.state = 'PRODUCTION') as subselect group by subselect.campaign
Explain results:
HashAggregate (cost=63873548.47..63873550.47 rows=200 width=88) (actual time=248617.348..248617.365 rows=58 loops=1)
Group Key: launch.campaign_key
-> HashAggregate (cost=63519670.22..63661221.52 rows=14155130 width=88) (actual time=248587.320..248616.558 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> HashAggregate (cost=63307343.27..63448894.57 rows=14155130 width=88) (actual time=248553.278..248584.868 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> Nested Loop (cost=4903.09..56997885.96 rows=1261891461 width=88) (actual time=180648.410..248550.249 rows=2085 loops=1)
Join Filter: jsonb_exists(visit.stops, (launch.launch_key)::text)
Rows Removed by Join Filter: 624114512
-> Bitmap Heap Scan on launch (cost=3213.19..126084.38 rows=169389 width=123) (actual time=32.082..317.561 rows=166121 loops=1)
Recheck Cond: ((state)::text = 'PRODUCTION'::text)
Heap Blocks: exact=56635
-> Bitmap Index Scan on launch_state_idx (cost=0.00..3170.85 rows=169389 width=0) (actual time=21.172..21.172 rows=166121 loops=1)
Index Cond: ((state)::text = 'PRODUCTION'::text)
-> Materialize (cost=1689.89..86736.04 rows=22349 width=117) (actual time=0.000..0.487 rows=3757 loops=166121)
-> Bitmap Heap Scan on visit (cost=1689.89..86624.29 rows=22349 width=117) (actual time=1.324..14.381 rows=3757 loops=1)
Recheck Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Heap Blocks: exact=3672
-> Bitmap Index Scan on visit_store_key_idx (cost=0.00..1684.31 rows=22349 width=0) (actual time=0.780..0.780 rows=3757 loops=1)
Index Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Planning time: 0.232 ms
Execution time: 248708.088 ms
I should mention that the index on stops is built as
CREATE INDEX ON visit USING GIN (stops)
I'm wondering if switching to building it as
CREATE INDEX ON visit USING GIN ((stops->'value'))
will resolve the issue?
The wrapper function jsonb_exists() prevents the use of the GIN index on visit.stops. Instead of
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key))
try
from visit
join launch on visit.stops ? launch.launch_key::text
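For reference, a sketch of the full query from the question with only that join condition swapped in; the ? operator is one of the operators a GIN index on stops can support directly, unlike a call to jsonb_exists():
explain (analyze on) select count(subselect.visitors) as visitors,
subselect.campaign as campaign
from (
select distinct visit.visitor as visitors,
launch.campaign_key as campaign
from visit
join launch on visit.stops ? launch.launch_key::text
where visit.store_key = 'ahBzfmdlYXJsYXVuY2gtaHVi'
and launch.state = 'PRODUCTION') as subselect group by subselect.campaign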