Difference between ANY(ARRAY[..]) vs ANY(VALUES (), () ..) in PostgreSQL - sql

I am trying to workout query optimisation on id. Not sure which one way should I use. Below is the query plan using explain and cost wise looks similar.
1. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid,...]);
QUERY PLAN:
Index Scan using table1_pkey on table1 (cost=0.42..641.44 rows=76 width=835) (actual time=0.258..2.603 rows=76 loops=1)
Index Cond: (id = ANY ('{00e289b0-1ac8-451f-957f-e00bc289148e,...}'::uuid[]))
Buffers: shared hit=231 read=73
Planning Time: 0.487 ms
Execution Time: 2.715 ms)
2. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (VALUES ('00e289b0-1ac8-451f-957f-e00bc289148e'::uuid),...);
QUERY PLAN:
Nested Loop (cost=1.56..644.10 rows=76 width=835) (actual time=0.058..0.297 rows=76 loops=1)
Buffers: shared hit=304
-> HashAggregate (cost=1.14..1.90 rows=76 width=16) (actual time=0.049..0.060 rows=76 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.95 rows=76 width=16) (actual time=0.006..0.022 rows=76 loops=1)
-> Index Scan using table1_pkey on table1 (cost=0.42..8.44 rows=1 width=835) (actual time=0.002..0.003 rows=1 loops=76)
Index Cond: (id = "*VALUES*".column1)
Buffers: shared hit=304
Planning Time: 0.437 ms
Execution Time: 0.389 ms
Looks like VALUES () does some hashing and join to improve performance but not sure.
NOTE: In my practical use case, id is uuid_generate_v4() e.x. d31cddc0-1771-4de8-ad41-e6c568b39a5d but the column may not be indexed as such.
Also, I have a table of with 5-10 million records.
Which way is for the better query performance?

Both options seem reasonable. I would, however, suggest to avoid casting the column you filter on. Instead, you should cast the literal values to uuid:
SELECT *
FROM table1
WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid, ...]);
This should allow the database to take advantage of an index on column id.

Related

Postgres changing the query from index Only scan to bit map scan when data set increases

I have two same queries but with different where condition values
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 1565 and account_id = 225 and deleted_at is NULL group by survey_contact_id, relation_id;
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 888 and account_id = 12 and deleted_at is NULL group by survey_contact_id, relation_id;
When I ran this two queries they both producing different result
for first query the result
GroupAggregate (cost=0.28..8.32 rows=1 width=24) (actual time=0.016..0.021 rows=4 loops=1)
Group Key: survey_contact_id, relation_id
-> Index Only Scan using test on nomination (cost=0.28..8.30 rows=1 width=8) (actual time=0.010..0.012 rows=5 loops=1)
Index Cond: ((account_id = 225) AND (survey_id = 1565))
Heap Fetches: 5
Planning time: 0.148 ms
Execution time: 0.058 ms
and for the 2nd one
GroupAggregate (cost=11.08..11.12 rows=2 width=24) (actual time=0.015..0.015 rows=0 loops=1)
Group Key: survey_contact_id, relation_id
-> Sort (cost=11.08..11.08 rows=2 width=8) (actual time=0.013..0.013 rows=0 loops=1)
Sort Key: survey_contact_id, relation_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on nomination (cost=4.30..11.07 rows=2 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Recheck Cond: ((account_id = 12) AND (survey_id = 888) AND (deleted_at IS NULL))
-> Bitmap Index Scan on test (cost=0.00..4.30 rows=2 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((account_id = 12) AND (survey_id = 888))
Planning time: 0.149 ms
Execution time: 0.052 ms
Can anyone explain Me why Postgres is making a BitMap Scan instead of Index Only scan ?
The short version is that Postgres has a cost-based approach, so it has estimated that the cost of doing so is less in the second case, based on the statistics it has.
In your case, the total cost (estimated) of each of these queries is 8.32 and 11.12 respectively. You may be able to see the cost of the index-only scan for the second query by running set enable_bitmapscan = off.
Note that, based on its statistics, Postgres estimated that the first query would return 1 row (actually 4), and that the second would return 2 rows (actually 0).
There are several ways to get better statistics, but if analyze (or autovacuum) hasn't been run on that table for a while, that is a common cause of bad estimates. Another tell-tale that vacuum may not have been run recently (at least on this table) is the Heap Fetches: 5 you can see in the first query plan.
I'm confused by the "when data set increases" part of your question, please do add more context on that front if relevant.
Finally, if you're not already planning a PostgreSQL upgrade, I highly recommend doing so soon. 9.6 is nearly out of support, and versions 10, 11, 12, and 13 each contained a host of performance focussed features.

Even simplest postresql query takes ages in a big table with indexes

I have a problem with a very slow query execution in my psql table which has 145 602 995 rows (yep 145+ millions). I have indexes created but even simplest query takes ages to execute... For example query like SELECT COUNT(*) FROM events; takes 708 seconds (~12 minutes).
I have a column called org_id which has an index and when I try to execute query like:
EXPLAIN ANALYZE SELECT COUNT(*) FROM events WHERE org_id = 1;
Aggregate (cost=8191.76..8191.77 rows=1 width=8) (actual time=9.758..9.758 rows=1 loops=1)
-> Index Only Scan using org_id on events (cost=0.57..8179.63 rows=4853 width=0) (actual time=1.172..9.729 rows=48 loops=1)
Index Cond: (org_id = 1)
Heap Fetches: 48
Planning time: 0.167 ms
Execution time: 9.803 ms
it's using the org_id index but estimated cost was huge.
As I increase org_id number I get slower and slower execution times. It's probably related with less records for the org_ids with smaller numbers. When I get to org_id = 9 where there are a lot of records it stops using the org_id index and uses Bitmap Heap Scan and Bitmap Index Scan instead.
EXPLAIN SELECT COUNT(*) FROM events WHERE org_id = 9;
Aggregate (cost=10834654.32..10834654.33 rows=1 width=8)
-> Bitmap Heap Scan on events (cost=147380.56..10814983.35 rows=7868386 width=0)
Recheck Cond: (org_id = 9)
-> Bitmap Index Scan on org_id (cost=0.00..145413.46 rows=7868386 width=0)
Index Cond: (org_id = 9)
Is there any way of improving speed with such a big tables? One extra info is that I have 11 columns in this table where one of them is of the jsonb NOT NULL type. Just mentioning. Maybe it's important.
EDIT:
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM events;
Aggregate (cost=12873195.66..12873195.67 rows=1 width=8) (actual time=653255.247..653255.248 rows=1 loops=1)
Buffers: shared hit=292755 read=10754192
-> Seq Scan on events (cost=0.00..12507945.93 rows=146099893 width=0) (actual time=0.015..638846.285 rows=146318426 loops=1)
Buffers: shared hit=292755 read=10754192
Planning time: 0.215 ms
Execution time: 653255.315 ms

PL/pgSQL Query Plan Worse Inside Function Than Outside

I have a function that is running too slow. I've isolated which piece of the function is slow.. a small SELECT statement:
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
ON (fam.provider_data_id = pf.provider_data_id
AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL)
LIMIT 1
When I run the function this piece of SQL generates the following query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=0.56..6.75 rows=1 width=6) (actual time=3471.004..3471.004 rows=0 loops=1)
-> Nested Loop (cost=0.56..594054.42 rows=96017 width=6) (actual time=3471.002..3471.002 rows=0 loops=1)
-> Seq Scan on image_family fam (cost=0.00..391880.08 rows=96023 width=6) (actual time=3471.001..3471.001 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
When I run the selected query in a query tool (outside of the function) the query plan looks like this:
Limit (cost=1.12..3.81 rows=1 width=6) (actual time=0.043..0.043 rows=1 loops=1)
Output: pf.image_group_id
Buffers: shared hit=11
-> Nested Loop (cost=1.12..14.55 rows=5 width=6) (actual time=0.041..0.041 rows=1 loops=1)
Output: pf.image_group_id
Inner Unique: true
Buffers: shared hit=11
-> Index Only Scan using image_family_family_id_provider_data_id_idx on programs.image_family fam (cost=0.56..1.65 rows=5 width=6) (actual time=0.024..0.024 rows=1 loops=1)
Output: fam.family_id, fam.provider_data_id
Index Cond: (fam.family_id = 8419853)
Heap Fetches: 2
Buffers: shared hit=6
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on programs.provider_file pf (cost=0.56..2.58 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
Output: pf.provider_data_id, pf.provider_file_path, pf.posted_dt, pf.file_repository_id, pf.restricted_size, pf.image_group_id, pf.is_master, pf.is_biggest
Index Cond: (pf.provider_data_id = fam.provider_data_id)
Filter: (pf.image_group_id IS NOT NULL)
Buffers: shared hit=5
Planning time: 0.809 ms
Execution time: 0.100 ms
If I disable sequence scans in the function I can get a similar query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=1.12..8.00 rows=1 width=6) (actual time=3855.722..3855.722 rows=0 loops=1)
-> Nested Loop (cost=1.12..660217.34 rows=96017 width=6) (actual time=3855.721..3855.721 rows=0 loops=1)
-> Index Only Scan using image_family_family_id_provider_data_id_idx on image_family fam (cost=0.56..458043.00 rows=96023 width=6) (actual time=3855.720..3855.720 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
Heap Fetches: 368
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
The query plans are different where the Filter functions are for the Index Only Scan. The function has more Heap Fetches and seems to treat the argument as a string casted to a numeric.
Things I've tried:
Increasing statistics (and running vacuum/analyze)
Calling the problematic piece of SQL in another function with language SQL
Add another index (the one that its using now to perform an INDEX ONLY scan)
Create a CTE for the image_family table (this did help performance but would still do a sequence scan on the image_family instead of using the index so still, too slow)
Change from executing raw SQL to using an EXECUTE ... INTO .. USING in the function.
Makeup of the two tables:
image_family:
provider_data_id: numeric(16)
family_id: int4
(rest omitted for brevity)
unique index on provider_data_id
index on family_id
I recently added a unique index on (family_id, provider_data_id) as well
Approximately 20 million rows here. Families have many provider_data_ids but not all provider_data_ids are part of families and thus aren't all in this table.
provider_file:
provider_data_id numeric(16)
image_group_id numeric(16)
(rest omitted for brevity)
unique index on provider_data_id
Approximately 32 million rows in this table. Most rows (> 95%) have a Non-Null image_group_id.
Postgres Version 10
How can I get the query performance to match whether I call it from a function or as raw SQL in a query tool?
The problem is exhibited in this line:
Filter: ((family_id)::numeric = '8419853'::numeric)
The index on family_id cannot be used because family_id is compared to a numeric value. This requires a cast to numeric, and there is no index on family_id::numeric.
Even though integer and numeric both are types representing numbers, their internal representation is quite different, and so the indexes are incompatible. In other words, the cast to numeric is like a function for PostgreSQL, and since it has no index on that functional expression, it has to resort to a scan of the whole table (or index).
The solution is simple, however: use an integer instead of a numeric parameter for the query. If in doubt, use a cast like
fam.family_id = $1::integer

PostgreSQL copy big data from table to table

I have architecture when some data some times is copied to temp table, and then on command, this data with condition must be copied into one of other tables, before this run counts, deletes and updates based on object_id which the same in all tables.
The longest operation is copying - it takes to 10 minutes! on 300 000 rows.
insert into t1 (t1_f1, t1_f2, name, value) SELECT DISTINCT ON (object_id) t1_f1, t1_f2, name, value where loading_process_id = 695 - it is for example.
Can I speed up the process? Or this is bad architecture and I have to change it?
Some more - heap table can contains very much data, to copy can be some millions rows. Some fields (which used for counting or filtering) indexed in heap and in other tables.
And this is plan for not so big data
Insert on main_like (cost=2993.63..3115.51 rows=6094 width=797) (actual time=6143.194..6143.194 rows=0 loops=1)
-> Subquery Scan on "*SELECT*" (cost=2993.63..3115.51 rows=6094 width=797) (actual time=55.995..125.081 rows=6094 loops=1)
-> Unique (cost=2993.63..3024.10 rows=6094 width=796) (actual time=55.909..79.237 rows=6094 loops=1)
-> Sort (cost=2993.63..3008.86 rows=6094 width=796) (actual time=55.904..69.195 rows=6094 loops=1)
Sort Key: main_loadingprocessobjects.object_id
Sort Method: quicksort Memory: 3321kB
-> Seq Scan on main_loadingprocessobjects (cost=0.00..465.02 rows=6094 width=796) (actual time=0.578..8.285 rows=6094 loops=1)
Filter: (loading_process_id = 695)
Rows Removed by Filter: 1428
Planning time: 0.394 ms
Execution time: 6143.631 ms
Explain without insert -
Unique (cost=2993.63..3024.10 rows=6094 width=796) (actual time=48.915..52.902 rows=6094 loops=1)
-> Sort (cost=2993.63..3008.86 rows=6094 width=796) (actual time=48.911..49.959 rows=6094 loops=1)
Sort Key: object_id
Sort Method: quicksort Memory: 3321kB
-> Seq Scan on main_loadingprocessobjects (cost=0.00..465.02 rows=6094 width=796) (actual time=0.401..5.516 rows=6094 loops=1)
Filter: (loading_process_id = 695)
Rows Removed by Filter: 1428
Planning time: 0.214 ms
Execution time: 53.694 ms
main_loadingprocessobjects - is heap
main_like - is t1
There is several point that you might concern about this issue:
COPY statement in PostgreSQL is faster than insert into select statement.
Create composite index on this following query ex: (type,category).
SELECT DISTINCT ON (object_id) t1_f1, t1_f2, name, value where
type='ti' and category='added'
GROUP BY Statement is faster than DISTINCT statement.
Increase temp_buffer on postgresql.conf if you consider high usage on temp table.
Try CTE (Common Table Expresions) instead temp table.
Hope this point my help you in your future development.

Why is this query taking so long on JSONB Gin index field? Can I fix it so it actually uses the index?

Recently we changed the format of one of our tables from using a single entry in a column to having a JSONB column in the format of ["key1","key2","key3"] etc. Although we built a GIN index on the JSONB field the queries that we use on it are EXTREMELY slow (in the range of 50 minutes in explain plan). I am trying to find out a way to optimize the query and to correctly utilize the index. I pasted the query below as well as the explain plan for it. The indexed fields are visit.visitor, launch.campaign_key, launch.launch_key, visit.store_key and visits.stop JSONB field as GIN index. We are using PostgresQL 9.4
explain (analyze on) select count(subselect.visitors) as visitors,
subselect.campaign as campaign
from (
select distinct visit.visitor as visitors,
launch.campaign_key as campaign
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key)) where
visit.store_key = 'ahBzfmdlYXJsYXVuY2gtaHVi'
and launch.state = 'PRODUCTION') as subselect group by subselect.campaign
Explain results:
HashAggregate (cost=63873548.47..63873550.47 rows=200 width=88) (actual time=248617.348..248617.365 rows=58 loops=1)
Group Key: launch.campaign_key
-> HashAggregate (cost=63519670.22..63661221.52 rows=14155130 width=88) (actual time=248587.320..248616.558 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> HashAggregate (cost=63307343.27..63448894.57 rows=14155130 width=88) (actual time=248553.278..248584.868 rows=1938 loops=1)
Group Key: visit.visitor, launch.campaign_key
-> Nested Loop (cost=4903.09..56997885.96 rows=1261891461 width=88) (actual time=180648.410..248550.249 rows=2085 loops=1)
Join Filter: jsonb_exists(visit.stops, (launch.launch_key)::text)
Rows Removed by Join Filter: 624114512
-> Bitmap Heap Scan on launch (cost=3213.19..126084.38 rows=169389 width=123) (actual time=32.082..317.561 rows=166121 loops=1)
Recheck Cond: ((state)::text = 'PRODUCTION'::text)
Heap Blocks: exact=56635
-> Bitmap Index Scan on launch_state_idx (cost=0.00..3170.85 rows=169389 width=0) (actual time=21.172..21.172 rows=166121 loops=1)
Index Cond: ((state)::text = 'PRODUCTION'::text)
-> Materialize (cost=1689.89..86736.04 rows=22349 width=117) (actual time=0.000..0.487 rows=3757 loops=166121)
-> Bitmap Heap Scan on visit (cost=1689.89..86624.29 rows=22349 width=117) (actual time=1.324..14.381 rows=3757 loops=1)
Recheck Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Heap Blocks: exact=3672
-> Bitmap Index Scan on visit_store_key_idx (cost=0.00..1684.31 rows=22349 width=0) (actual time=0.780..0.780 rows=3757 loops=1)
Index Cond: ((store_key)::text = 'ahBzfmdlYXJsYXVuY2gtaHVicg8LEgVTdG9yZRinzbKcDQw'::text)
Planning time: 0.232 ms
Execution time: 248708.088 ms
I should mention the index on stops is built
CREATE INDEX ON visit USING GIN (stops)
I'm wondering if switching to building it to
CREATE INDEX ON visit USING GIN (stops->’value')
Will resolve the issue?
The wrapper function jsonb_exists() prevents the use of the gin index on visits.stops. Instead of
from visit
join launch on (jsonb_exists(visit.stops, launch.launch_key))
try
from visit
join launch on visit.stops ? launch.launch_key::text