How can I optimize an index for a substring of a column?
For example, take a column postal_code storing a string of 5 characters. If most of my queries filter on the first two characters, a plain index on this column is not useful.
What if I create an index only on the substring:
CREATE INDEX ON index.annonces_parsed (left(postal_code, 2))
Is this a good solution, or is it better to add a new column storing only the substring and index that?
A query using this index could be:
select *
from index.cities
where left(postal_code, 2) = '83' -- will it use the index on the substring?
Thanks so much
I have a test table with 20 million records.
Test 1
CREATE INDEX test_a1_idx ON test (a1)
explain analyze
select * from test
where left(a1, 2) = '58'
Gather (cost=1000.00..103565.05 rows=40000 width=12) (actual time=0.429..468.428 rows=89712 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test (cost=0.00..98565.05 rows=16667 width=12) (actual time=0.114..407.330 rows=29904 loops=3)
Filter: ("left"(a1, 2) = '58'::text)
Rows Removed by Filter: 2636765
Planning Time: 0.424 ms
Execution Time: 470.472 ms
explain analyze
select * from test
where a1 like '58%'
Gather (cost=1000.00..99284.01 rows=80523 width=12) (actual time=0.990..337.339 rows=89712 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test (cost=0.00..90231.71 rows=33551 width=12) (actual time=0.233..278.740 rows=29904 loops=3)
Filter: (a1 ~~ '58%'::text)
Rows Removed by Filter: 2636765
Planning Time: 0.092 ms
Execution Time: 339.259 ms
Test 2
CREATE INDEX test_a1_idx1 ON test (left(a1, 2))
explain analyze
select * from test
where left(a1, 2) = '58'
Bitmap Heap Scan on test (cost=446.43..49455.46 rows=40000 width=12) (actual time=10.507..206.800 rows=89712 loops=1)
Recheck Cond: ("left"(a1, 2) = '58'::text)
Heap Blocks: exact=38298
-> Bitmap Index Scan on test_a1_idx1 (cost=0.00..436.43 rows=40000 width=0) (actual time=5.450..5.450 rows=89712 loops=1)
Index Cond: ("left"(a1, 2) = '58'::text)
Planning Time: 0.501 ms
Execution Time: 209.217 ms
explain analyze
select * from test
where a1 like '58%'
Gather (cost=1000.00..99284.01 rows=80523 width=12) (actual time=0.341..334.759 rows=89712 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on test (cost=0.00..90231.71 rows=33551 width=12) (actual time=0.110..287.313 rows=29904 loops=3)
Filter: (a1 ~~ '58%'::text)
Rows Removed by Filter: 2636765
Planning Time: 0.067 ms
Execution Time: 336.762 ms
Result:
Note that the database cannot use a plain column index when the condition wraps the column in a function. An expression (functional) index that matches the expression gives very good performance in these cases. As the plans show, the expression index helps left(a1, 2) = '58' but not a1 like '58%', because the LIKE condition does not match the indexed expression.
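If the LIKE form of the query also needs index support, a separate btree index with the text_pattern_ops operator class (a standard PostgreSQL operator class for left-anchored pattern matching) is one option; the index name here is illustrative:

```sql
-- An ordinary btree index with text_pattern_ops supports
-- left-anchored LIKE patterns regardless of collation.
CREATE INDEX test_a1_like_idx ON test (a1 text_pattern_ops);

-- This prefix query can then use test_a1_like_idx:
SELECT * FROM test WHERE a1 LIKE '58%';
```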
I am trying to optimize the following SQL query:
select * from begin_transaction where ("group"->>'id')::bigint = '5'
Without additional indexing I get this:
Gather (cost=1000.00..91957.50 rows=4179 width=750) (actual time=0.158..218.972 rows=715002 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on begin_transaction (cost=0.00..90539.60 rows=1741 width=750) (actual time=0.020..127.525 rows=238334 loops=3)
Filter: ((("group" ->> 'id'::text))::bigint = '5'::bigint)
Rows Removed by Filter: 40250
Planning Time: 0.039 ms
Execution Time: 235.200 ms
If I add a btree index:
CREATE INDEX begin_transaction_group_id_idx
ON public.begin_transaction USING btree (((("group"->>'id'::text))::bigint))
I get:
Bitmap Heap Scan on begin_transaction (cost=80.81..13773.97 rows=4179 width=750) (actual time=43.647..414.756 rows=715002 loops=1)
Recheck Cond: ((("group" ->> 'id'::text))::bigint = '5'::bigint)
Rows Removed by Index Recheck: 52117
Heap Blocks: exact=50534 lossy=33026
-> Bitmap Index Scan on begin_transaction_group_id_idx (cost=0.00..79.77 rows=4179 width=0) (actual time=35.852..35.852 rows=715002 loops=1)
Index Cond: ((("group" ->> 'id'::text))::bigint = '5'::bigint)
Planning Time: 0.045 ms
Execution Time: 429.968 ms
Any ideas how to increase performance? The group field is jsonb.
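One detail worth noting in the plan above is Heap Blocks: exact=50534 lossy=33026: the bitmap did not fit in work_mem, so whole pages had to be rechecked row by row (the "Rows Removed by Index Recheck" line). A sketch of one thing to try; the value is illustrative, not a recommendation:

```sql
-- A lossy bitmap means work_mem was too small to hold one bit per
-- matching row; raising it for the session can make the bitmap
-- exact and remove the recheck overhead.
SET work_mem = '128MB';  -- illustrative value

SELECT * FROM begin_transaction WHERE ("group"->>'id')::bigint = 5;
```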
I have two identical queries with different WHERE condition values:
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 1565 and account_id = 225 and deleted_at is NULL group by survey_contact_id, relation_id;
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 888 and account_id = 12 and deleted_at is NULL group by survey_contact_id, relation_id;
When I run these two queries they produce different plans.
For the first query the result is:
GroupAggregate (cost=0.28..8.32 rows=1 width=24) (actual time=0.016..0.021 rows=4 loops=1)
Group Key: survey_contact_id, relation_id
-> Index Only Scan using test on nomination (cost=0.28..8.30 rows=1 width=8) (actual time=0.010..0.012 rows=5 loops=1)
Index Cond: ((account_id = 225) AND (survey_id = 1565))
Heap Fetches: 5
Planning time: 0.148 ms
Execution time: 0.058 ms
and for the second one:
GroupAggregate (cost=11.08..11.12 rows=2 width=24) (actual time=0.015..0.015 rows=0 loops=1)
Group Key: survey_contact_id, relation_id
-> Sort (cost=11.08..11.08 rows=2 width=8) (actual time=0.013..0.013 rows=0 loops=1)
Sort Key: survey_contact_id, relation_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on nomination (cost=4.30..11.07 rows=2 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Recheck Cond: ((account_id = 12) AND (survey_id = 888) AND (deleted_at IS NULL))
-> Bitmap Index Scan on test (cost=0.00..4.30 rows=2 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((account_id = 12) AND (survey_id = 888))
Planning time: 0.149 ms
Execution time: 0.052 ms
Can anyone explain why Postgres uses a Bitmap Scan instead of an Index Only Scan?
The short version is that Postgres has a cost-based approach, so it has estimated that the cost of doing so is less in the second case, based on the statistics it has.
In your case, the total cost (estimated) of each of these queries is 8.32 and 11.12 respectively. You may be able to see the cost of the index-only scan for the second query by running set enable_bitmapscan = off.
Note that, based on its statistics, Postgres estimated that the first query would return 1 row (actually 4), and that the second would return 2 rows (actually 0).
There are several ways to get better statistics, but if analyze (or autovacuum) hasn't been run on that table for a while, that is a common cause of bad estimates. Another tell-tale that vacuum may not have been run recently (at least on this table) is the Heap Fetches: 5 you can see in the first query plan.
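If you want to check how fresh the statistics are, one way is the standard pg_stat_user_tables view (the table name comes from the question):

```sql
-- Check statistics freshness; stale or NULL timestamps are a
-- common cause of bad row estimates.
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'nomination';

-- Refresh the statistics manually if they are stale:
ANALYZE nomination;
```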
I'm confused by the "when data set increases" part of your question, please do add more context on that front if relevant.
Finally, if you're not already planning a PostgreSQL upgrade, I highly recommend doing so soon. 9.6 is nearly out of support, and versions 10, 11, 12, and 13 each contained a host of performance-focused features.
We have a query that runs on several different child tables created hourly and inherited from a base table, say tab_name1 and tab_name2. The query was working fine, but suddenly it started to perform badly for all the child tables created since a particular date.
This is the query, which works fine up to tab_name_20180621*; I'm not sure what happened after that.
SELECT
*
FROM
tab_name_201806220300
WHERE
id = 201806220300
AND col1 IN (
SELECT
col1
FROM
tab_name2_201806220300
WHERE
uid = 5452
AND id = 201806220300
);
The EXPLAIN ANALYZE output shows something like this; there's a huge difference in execution time.
#1
Nested Loop Semi Join (cost=0.00..84762.11 rows=1 width=937) (actual time=117.599..117.599 rows=0 loops=1)
Join Filter: (tab_name_201806210100.col1 = tab_name2_201806210100.col1)
-> Seq Scan on tab_name_201806210100 (cost=0.00..31603.56 rows=1 width=937) (actual time=117.596..117.596 rows=0 loops=1)
Filter: (log_id = '201806220100'::bigint)
Rows Removed by Filter: 434045
-> Materialize (cost=0.00..53136.74 rows=1454 width=41) (never executed)
-> Seq Scan on tab_name2_201806210100 (cost=0.00..53129.47 rows=1454 width=41) (never executed)
Filter: ((uid = 5452) AND (log_id = '201806210100'::bigint))
Planning time: 1.490 ms
Execution time: 117.723 ms
#2
Nested Loop Semi Join (cost=0.00..10299.31 rows=48 width=1476) (actual time=1082.255..47041.945 rows=671 loops=1)
Join Filter: (tab_name_201806220100.col1 = tab_name2_201806220100.col1)
Rows Removed by Join Filter: 252444174
-> Seq Scan on tab_name_201806220100 (cost=0.00..4023.69 rows=95 width=1476) (actual time=0.008..36.292 rows=64153 loops=1)
Filter: (log_id = '201806220100'::bigint)
-> Materialize (cost=0.00..6274.19 rows=1 width=32) (actual time=0.000..0.264 rows=3935 loops=64153)
-> Seq Scan on tab_name2_201806220100 (cost=0.00..6274.19 rows=1 width=32) (actual time=0.464..55.475 rows=3960 loops=1)
Filter: ((log_id = '201806220100'::bigint) AND (uid = 5452))
Rows Removed by Filter: 140592
Planning time: 1.024 ms
Execution time: 47042.234 ms
I don't know what to infer from this and how to proceed from here. Could you help me?
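One thing that stands out in plan #2 is the estimation error: the planner expects rows=1 from tab_name2_201806220100 but gets 3,960, and rows=95 from the outer table but gets 64,153. If these hourly child tables are queried soon after creation, autovacuum may not have analyzed them yet, which can lead the planner into a bad nested-loop plan. A sketch of a manual check worth trying (table names from the question):

```sql
-- Freshly created child tables may have no statistics yet.
-- Analyzing them right after load gives the planner realistic
-- row counts to work with.
ANALYZE tab_name_201806220100;
ANALYZE tab_name2_201806220100;
```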
Hello everyone, I hope all of you are doing well. I'm working on a time-critical search engine and I want to know whether there is a way to stream matching records one by one for a given query.
In my case I have a table of 40 million records and I want to search by a search string, but if the search string is rare or lengthy it takes a long time to scan all the records before returning a result, which is too costly.
I already use indexing (tsvector, GIN) but it doesn't help.
Search Query:
explain analyze
Select "SNR_SKU",
"SNR_Title",
"SNR_ModelNo",
"SNR_Brand",
"SNR_UPC",
"SNR_Available",
"SNR_ProductURL",
"SNR_ImageURL",
"SNR_Description",
"SNR_isShow",
"SNR_Date",
"SNR_Category",
"SNR_Condition",
"SNR_SubCategory",
"SNR_PriceBefore",
"SNR_CustomerReviews",
"SNR_Price"
from (
Select *,
similarity('apple iphone 5s 16gb',"SNR_Title") as rank
from products_allproducts
Where "SNR_Title" ~ 'apple'
AND "SNR_Title" ~ 'iphone'
AND "SNR_Title" ~ '5s'
AND "SNR_Title" ~ '16gb'
ORDER BY rank DESC LIMIT 36 OFFSET 0
) AS rankTable;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on ranktable (cost=1974848.30..1974848.32 rows=1 width=450) (actual time=16316.394..16316.395 rows=1 loops=1)
-> Limit (cost=1974848.30..1974848.31 rows=1 width=490) (actual time=16316.392..16316.393 rows=1 loops=1)
-> Sort (cost=1974848.30..1974848.31 rows=1 width=490) (actual time=16316.391..16316.391 rows=1 loops=1)
Sort Key: (similarity('apple iphone 5s 16gb'::text, (products_allproducts."SNR_Title")::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Gather (cost=1000.00..1974848.29 rows=1 width=490) (actual time=16302.006..16316.383 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on products_allproducts (cost=0.00..1973848.19 rows=1 width=490) (actual time=11275.609..16294.657 rows=0 loops=3)
Filter: ((("SNR_Title")::text ~ 'apple'::text) AND (("SNR_Title")::text ~ 'iphone'::text) AND (("SNR_Title")::text ~ '5s'::text) AND (("SNR_Title")::text ~ '16gb'::text))
Rows Removed by Filter: 6722677
Planning time: 0.902 ms
Execution time: 16462.318 ms
(13 rows)
Thanks.
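The plan shows the `~` regex conditions and `similarity()` falling back to a parallel sequential scan; a tsvector GIN index cannot serve those operators. The pg_trgm extension's trigram operator class can index LIKE/ILIKE, regex (`~`), and similarity searches on text columns. A sketch under that assumption (the index name is illustrative):

```sql
-- pg_trgm provides trigram-based GIN indexes that can serve
-- LIKE/ILIKE, regex (~), and similarity searches.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX products_title_trgm_idx
    ON products_allproducts
    USING gin ("SNR_Title" gin_trgm_ops);

-- Pattern and similarity conditions can then use the index:
SELECT "SNR_Title"
FROM products_allproducts
WHERE "SNR_Title" ILIKE '%iphone%';
```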
I have a function that is running too slow. I've isolated which piece of the function is slow.. a small SELECT statement:
SELECT image_group_id
FROM programs.image_family fam
JOIN programs.provider_file pf
ON (fam.provider_data_id = pf.provider_data_id
AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL)
LIMIT 1
When I run the function this piece of SQL generates the following query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=0.56..6.75 rows=1 width=6) (actual time=3471.004..3471.004 rows=0 loops=1)
-> Nested Loop (cost=0.56..594054.42 rows=96017 width=6) (actual time=3471.002..3471.002 rows=0 loops=1)
-> Seq Scan on image_family fam (cost=0.00..391880.08 rows=96023 width=6) (actual time=3471.001..3471.001 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
When I run the selected query in a query tool (outside of the function) the query plan looks like this:
Limit (cost=1.12..3.81 rows=1 width=6) (actual time=0.043..0.043 rows=1 loops=1)
Output: pf.image_group_id
Buffers: shared hit=11
-> Nested Loop (cost=1.12..14.55 rows=5 width=6) (actual time=0.041..0.041 rows=1 loops=1)
Output: pf.image_group_id
Inner Unique: true
Buffers: shared hit=11
-> Index Only Scan using image_family_family_id_provider_data_id_idx on programs.image_family fam (cost=0.56..1.65 rows=5 width=6) (actual time=0.024..0.024 rows=1 loops=1)
Output: fam.family_id, fam.provider_data_id
Index Cond: (fam.family_id = 8419853)
Heap Fetches: 2
Buffers: shared hit=6
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on programs.provider_file pf (cost=0.56..2.58 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
Output: pf.provider_data_id, pf.provider_file_path, pf.posted_dt, pf.file_repository_id, pf.restricted_size, pf.image_group_id, pf.is_master, pf.is_biggest
Index Cond: (pf.provider_data_id = fam.provider_data_id)
Filter: (pf.image_group_id IS NOT NULL)
Buffers: shared hit=5
Planning time: 0.809 ms
Execution time: 0.100 ms
If I disable sequence scans in the function I can get a similar query plan:
Query Text: SELECT image_group_id FROM programs.image_family fam JOIN programs.provider_file pf ON (fam.provider_data_id = pf.provider_data_id AND fam.family_id = $1 AND pf.image_group_id IS NOT NULL) LIMIT 1
Limit (cost=1.12..8.00 rows=1 width=6) (actual time=3855.722..3855.722 rows=0 loops=1)
-> Nested Loop (cost=1.12..660217.34 rows=96017 width=6) (actual time=3855.721..3855.721 rows=0 loops=1)
-> Index Only Scan using image_family_family_id_provider_data_id_idx on image_family fam (cost=0.56..458043.00 rows=96023 width=6) (actual time=3855.720..3855.720 rows=0 loops=1)
Filter: ((family_id)::numeric = '8419853'::numeric)
Rows Removed by Filter: 19204671
Heap Fetches: 368
-> Index Scan using "IX_DBO_PROVIDER_FILE_1" on provider_file pf (cost=0.56..2.11 rows=1 width=12) (never executed)
Index Cond: (provider_data_id = fam.provider_data_id)
Filter: (image_group_id IS NOT NULL)
The query plans differ in the Filter applied to the Index Only Scan: inside the function there are more Heap Fetches, and the argument seems to be treated as a value cast to numeric.
Things I've tried:
Increasing statistics (and running vacuum/analyze)
Calling the problematic piece of SQL in another function with language SQL
Adding another index (the one it is now using to perform an Index Only Scan)
Creating a CTE for the image_family table (this did help performance, but it would still do a sequential scan on image_family instead of using the index, so it was still too slow)
Changing from executing raw SQL to using an EXECUTE ... INTO ... USING in the function.
Makeup of the two tables:
image_family:
provider_data_id: numeric(16)
family_id: int4
(rest omitted for brevity)
unique index on provider_data_id
index on family_id
I recently added a unique index on (family_id, provider_data_id) as well
Approximately 20 million rows here. Families have many provider_data_ids but not all provider_data_ids are part of families and thus aren't all in this table.
provider_file:
provider_data_id numeric(16)
image_group_id numeric(16)
(rest omitted for brevity)
unique index on provider_data_id
Approximately 32 million rows in this table. Most rows (> 95%) have a Non-Null image_group_id.
Postgres Version 10
How can I get the query performance to match whether I call it from a function or as raw SQL in a query tool?
The problem is exhibited in this line:
Filter: ((family_id)::numeric = '8419853'::numeric)
The index on family_id cannot be used because family_id is compared to a numeric value. This requires a cast to numeric, and there is no index on family_id::numeric.
Even though integer and numeric both are types representing numbers, their internal representation is quite different, and so the indexes are incompatible. In other words, the cast to numeric is like a function for PostgreSQL, and since it has no index on that functional expression, it has to resort to a scan of the whole table (or index).
The solution is simple, however: use an integer instead of a numeric parameter for the query. If in doubt, use a cast like
fam.family_id = $1::integer
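A sketch of how this could look in the function itself; the function name and signature are illustrative (only the table and column names come from the question):

```sql
-- Declaring the parameter as integer keeps the comparison
-- integer = integer, so the index on family_id qualifies
-- and no cast to numeric is introduced.
CREATE OR REPLACE FUNCTION find_image_group(p_family_id integer)
RETURNS numeric
LANGUAGE sql STABLE AS $$
    SELECT pf.image_group_id
    FROM programs.image_family fam
    JOIN programs.provider_file pf
      ON fam.provider_data_id = pf.provider_data_id
    WHERE fam.family_id = p_family_id
      AND pf.image_group_id IS NOT NULL
    LIMIT 1;
$$;
```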