I'm performing full text search within PgSQL with queries like that (the aim is to return list of geocoded cities depending on searched city name):
SELECT *
FROM cities
WHERE city_tsvector ## to_tsquery('Paris')
AND postcode LIKE '75%'
This query performs quite fast on my database (425626 entries in the cities table): around 100ms. So far, so good.
Now, I have to perform this search, on 400 cities at the same time.
400 x 100ms = 40 seconds, which is way too long for my users.
I'm trying to write a single query, to perform this search in one go. One detail: the cities I must search for are not stored in database.
So I wrote this kind of query :
SELECT DISTINCT ON (myid) *
FROM unnest(
array[19977,19978,19979, (and so on)]::int[],
array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
array['67','95','67','44', (and so on))]::text[]
) AS t(myid, cityname,mypostcode)
LEFT JOIN cities gc2 ON gc2.city_tsvector ## to_tsquery(cityname) AND gc2.postcode LIKE CONCAT(mypostcode,'%')
ORDER BY myid
;
The result is just catastrophic: for the same searches, the query is taking 4 times slower!
Is it possible to perform this kind of query so that the execution takes less time?
Thanks
EDIT
Here is the table cities structure (425626 lines):
EDIT with #The-Impaler answer:
Option #1: index on the first two characters of postcode
query takes 11 seconds
EXPLAIN VERBOSE :
Unique (cost=71133.21..71138.53 rows=100 width=40)
Output: t.myid, (st_astext(gc2.gps_coordinates))
-> Sort (cost=71133.21..71135.87 rows=1064 width=40)
Output: t.myid, (st_astext(gc2.gps_coordinates))
Sort Key: t.myid
-> Hash Right Join (cost=2.26..71079.72 rows=1064 width=40)
Output: t.myid, st_astext(gc2.gps_coordinates)
Hash Cond: (left((gc2.postcode)::text, 2) = t.mypostcode)
Join Filter: (gc2.city_tsvector ## to_tsquery(t.cityname))
-> Seq Scan on public.geo_cities gc2 (cost=0.00..13083.26 rows=425626 width=69)
Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
-> Hash (cost=1.01..1.01 rows=100 width=72)
Output: t.myid, t.cityname, t.mypostcode
-> Function Scan on t (cost=0.01..1.01 rows=100 width=72)
Output: t.myid, t.cityname, t.mypostcode
Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)
Option #2: Use = instead of LIKE when filtering by postcode
query takes 12 seconds
EXPLAIN VERBOSE:
Unique (cost=71665.25..71670.57 rows=100 width=40)
Output: t.myid, (st_astext(gc2.gps_coordinates))
-> Sort (cost=71665.25..71667.91 rows=1064 width=40)
Output: t.myid, (st_astext(gc2.gps_coordinates))
Sort Key: t.myid
-> Hash Right Join (cost=2.26..71611.75 rows=1064 width=40)
Output: t.myid, st_astext(gc2.gps_coordinates)
Hash Cond: ((substring((gc2.postcode)::text, 1, 2))::text = t.mypostcode)
Join Filter: (gc2.city_tsvector ## to_tsquery(t.cityname))
-> Seq Scan on public.geo_cities gc2 (cost=0.00..13083.26 rows=425626 width=69)
Output: gc2.id, gc2.country_code, gc2.city, gc2.postcode, gc2.gps_coordinates, gc2.administrative_level_1_name, gc2.administrative_level_1_code, gc2.administrative_level_2_name, gc2.administrative_level_2_code, gc2.administrative_level_ (...)
-> Hash (cost=1.01..1.01 rows=100 width=72)
Output: t.myid, t.cityname, t.mypostcode
-> Function Scan on t (cost=0.01..1.01 rows=100 width=72)
Output: t.myid, t.cityname, t.mypostcode
Function Call: unnest('{289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,289148879225,28914887922 (...)
PART ONE - Enforcing Index Range Scan on postcode.
The key optimization is to make each search to work on 4000 rows and not 400k rows, This should be easy if the filtering by the first two digits of postcode is done appropriately.
Option #1: Create an index on the first two characters of postcode:
create index ix1 on cities (substring(postcode from 1 for 2));
Option #2: Use = instead of LIKE when filtering by postcode. You'll need to create a pseudo-column minipostcode and create an index on it:
create or replace function minipostcode(cities)
returns char(2) as $$
select substring($1.postcode from 1 for 2)
$$ stable language sql;
create index ix_cities_minipostcode on cities(minipostcode(cities));
And your SQL will change to:
SELECT DISTINCT ON (myid) *
FROM unnest(
array[19977,19978,19979, (and so on)]::int[],
array['SAULXURES','ARGENTEUIL','OBERHOFFEN&SUR&MODER', (and so on)]::text[],
array['67','95','67','44', (and so on))]::text[]
) AS t(myid, cityname,mypostcode)
LEFT JOIN cities gc2 ON gc2.city_tsvector ## to_tsquery(cityname)
AND gc2.minipostcode = mypostcode
ORDER BY myid
;
See the gc2.minipostcode = there?
Check the execution plan
Get the execution plan for each one of the options above and compare using:
explain verbose <my-query>
Please post the execution plan of both options.
PART TWO - Reducing the number of scans.
Once you are sure it's using Index Range Scan on postcode you can further optimize it.
Considering your search has 400 values, each postcode is probably repeated around 4 times each. Then, do a single Index Range Scan for each postcode, You'll reduce your execution time by 75% just by doing this.
However, you can't do this using pure SQL, you'll need to pre-process your SQL generating a single query per postcode. You won't use unnest anymore but you'll prebuild the SQL in your application language.
For example, since 67 shows up twice in your example it should be combined into a single Index Range Scan, resulting in something like:
select from cities gc2
where (gc2.city_tsvector ## to_tsquery('SAULXURES')
or gc2.city_tsvector ## to_tsquery('OBERHOFFEN&SUR&MODER'))
and gc2.minipostcode = '67'
union
(next select for another postcode here, and so on...)
This is much more optimized than your unnest option since it executes at most 100 Index Range Scans, even if you increase your search to 1000 or 5000 criteria.
Try this, and post the new execution plan.
Related
I am designing a table that has a jsonb column realizing permissions with the following format:
[
{"role": 5, "perm": "view"},
{"role": 30, "perm": "edit"},
{"role": 52, "perm": "view"}
]
TL;DR
How do I convert such jsonb value into an SQL array of integer roles? In this example, it would be '{5,30,52}'::int[]. I have some solutions but none are fast enough. Keep reading...
Each logged-in user has some roles (one or more). The idea is to filter the records using the overlap operator (&&) on int[].
SELECT * FROM data WHERE extract_roles(access) && '{1,5,17}'::int[]
I am looking for the extract_roles function/expression that can also be used in the definition of an index:
CREATE INDEX data_roles ON data USING gin ((extract_roles(access)))
jsonb in Postgres seems to have broad support for building and transforming but less for extracting values - SQL arrays in this case.
What I tried:
create or replace function extract_roles(access jsonb) returns int[]
language sql
strict
parallel safe
immutable
-- with the following bodies:
-- (0) 629ms
select translate(jsonb_path_query_array(access, '$.role')::text, '[]', '{}')::int[]
-- (1) 890ms
select array_agg(r::int) from jsonb_path_query(access, '$.role') r
-- (2) 866ms
select array_agg((t ->> 'role')::int) from jsonb_array_elements(access) as x(t)
-- (3) 706ms
select f1 from jsonb_populate_record(row('{}'::int[]), jsonb_build_object('f1', jsonb_path_query_array(access, '$.role'))) as x (f1 int[])
When the index is used, the query is fast. But there are two problems with these expressions:
some of the functions are only stable and not immutable; this also applies to cast. Am I allowed to mark my function as immutable? The immutability is required by the index definition.
they are slow; the planner does not use the index in some scenarios, and then the query can become really slow (times above are on a table with 3M records):
explain (analyse)
select id, access
from data
where extract_roles(access) && '{-3,99}'::int[]
order by id
limit 100
with the following plan (same for all variants above; prefers scanning the index associated with the primary key, gets sorted results and hopes that it finds 100 of them soon):
Limit (cost=1000.45..2624.21 rows=100 width=247) (actual time=40.668..629.193 rows=100 loops=1)
-> Gather Merge (cost=1000.45..476565.03 rows=29288 width=247) (actual time=40.667..629.162 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using data_pkey on data (cost=0.43..472184.44 rows=12203 width=247) (actual time=25.522..513.463 rows=35 loops=3)
Filter: (extract_roles(access) && '{-3,99}'::integer[])
Rows Removed by Filter: 84918
Planning Time: 0.182 ms
Execution Time: 629.245 ms
Removing the LIMIT clause is paradoxically fast:
Gather Merge (cost=70570.65..73480.29 rows=24938 width=247) (actual time=63.263..75.710 rows=40094 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=69570.63..69601.80 rows=12469 width=247) (actual time=59.870..61.569 rows=13365 loops=3)
Sort Key: id
Sort Method: external merge Disk: 3744kB
Worker 0: Sort Method: external merge Disk: 3232kB
Worker 1: Sort Method: external merge Disk: 3160kB
-> Parallel Bitmap Heap Scan on data (cost=299.93..68722.36 rows=12469 width=247) (actual time=13.823..49.336 rows=13365 loops=3)
Recheck Cond: (extract_roles(access) && '{-3,99}'::integer[])
Heap Blocks: exact=9033
-> Bitmap Index Scan on data_roles (cost=0.00..292.44 rows=29926 width=0) (actual time=9.429..9.430 rows=40094 loops=1)
Index Cond: (extract_roles(access) && '{-3,99}'::integer[])
Planning Time: 0.234 ms
Execution Time: 77.719 ms
Is there any better and faster way to extract int[] from a jsonb? Because I cannot rely on the planner always using the index. Playing with COST of the extract_roles function helps a bit (planner starts using the index for LIMIT 1000) but even an insanely high value does not force the index for LIMIT 100.
Comments:
If there is not, I will probably store the information in another column roles int[], which is fast but takes extra space and requires extra treatment (can be solved using generated columns on Postgres 12+, which Azure still does not provide, or a trigger, or in the application logic).
Looking into the future, will there be any better support in Postgres 15? Maybe JSON_QUERY but I don’t see any immediate improvement because its RETURNING clause probably refers to the whole result and not its elements.
Maybe jsonb_populate_record could also consider non-composite types (its signature allows it) such as:
select jsonb_populate_record(null::int[], '[123,456]'::jsonb)
The two closest questions are:
Extract integer array from jsonb within postgres 9.6
Cast postgresql jsonb value as array of int and remove element from it
Reaction to suggested normalization:
Normalization is probably not viable. But let's follow the train of thoughts.
I assume that the extra table would look like this: *_perm (id, role, perm). There would be an index on id and another index on role.
Because a user has multiple roles, it could join multiple records for the same id, which would cause multiplication of the records in the data table and force a group by aggregation.
A group by is bad for performance because it prevents some optimizations. I am designing a building block. So there can be for example two data tables at play:
select pd.*, jsonb_agg(to_jsonb(pp))
from posts_data pd
join posts_perm pp on pd.id = pp.id
where exists(
select 1
from comments_data cd on cd.post_id = pd.id
join comments_perm cp on cp.id = cd.id
where cd.reputation > 100
and cp.role in (3,34,52)
-- no group by needed due to semi-join
)
and cp.role in (3,34,52)
group by pd.id
order by pd.title
limit 10
If I am not mistaken, this query will require the aggregation of all records before they are sorted. No index can help here. That will never be fast with millions of records. Moreover, there is non-trivial logic behind group by usage - it is not always needed.
What if we did not need to return the permissions but only cared about its existence?
select pd.*
from posts_data pd
where exists(
select 1
from posts_perm pp on pd.id = pp.id
where cp.role in (3,34,52)
)
and exists(
select 1
from comments_data cd on cd.post_id = pd.id
where exists(
select 1
from comments_perm cp on cp.id = cd.id
where cp.role in (3,34,52)
)
and cd.reputation > 100
)
order by pd.title
limit 10
Then we don't need any aggregation - the database will simply issue a SEMI-JOIN. If there is an index on title, the database may consider using it. We can even fetch the permissions in the projection; something like this:
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp on pd.id = pp.id) perm
...
Where a nested-loop join will be issued for only the few (10) records. I will test this approach.
Another option is to keep the data in both tables - the data table would only store an int[] of roles. Then we save a JOIN and only fetch from the permission table at the end. Now we need an index that supports array operations - GIN.
select pd.*, (select jsonb_agg(to_jsonb(pp)) from posts_perm pp on pd.id = pp.id) perm
from posts_data pd
where pd.roles && '{3,34,52}'::int[]
and exists(
select 1
from comments_data cd on cd.post_id = pd.id
where cd.roles && '{3,34,52}'::int[]
and cd.reputation > 100
)
order by pd.title
limit 10
Because we always aggregate all permissions for the returned records (their interpretation is in the application and does not matter that we return all of them), we can store the post_perms as a json. Because we never need to work with the values in SQL, storing them directly in the data table seems reasonable.
We will need to support some bulk-sharing operations later that update the permissions for many records, but that is much rarer than selects. Because of this we could favor jsonb instead.
The projection does not need the select of permissions anymore:
select pd.*
...
But now the roles column is redundant - we have the same information in the same table, just in JSON format. If we can write a function that extracts just the roles, we can directly index it.
And we are back at the beginning. But it looks like the extract_roles function is never going to be fast, so we need to keep roles column.
Another reason for keeping permissions in the same table is the possibility of combining multiple indices using Bitmap And and avoiding a join.
There will be a huge bias in the roles. Some are going to be present on almost all rows (admin can edit everything), others will be rare (John Doe can only access these 3 records that were explicitly shared with him). I am not sure how well statistics will work on the int[] approach but so far my tests show that the GIN index is used when the role is infrequent (high selectivity).
It looks like the core problem here is the classic one with WHERE...ORDER BY...LIMIT, that the planner assumes all of the qualifying rows are scattered evenly throughout the ordering. But that isn't the case here: rows meeting your && condition are selectively deficient in low-numbered "id". So it has to walk that index far farther than it thought it would need to before it catches the LIMIT.
There is nothing you can do (in any version) to get the planner to estimate that better. You could just prevent that index from being used by rewriting it to order by id+0. But then it wouldn't use that plan even when it would truly be faster, like the admin who is on everything. (Which by the way seems like a bad idea--an exceptional user should probably be handled exceptionally, not shoehorned into the normal system).
The immutable extraction function certainly is slow, but if the above planning problem were fixed that wouldn't matter. Making the function faster would probably require some compiled code, and Azure surely would not let you link the .so file into their managed server.
Because the JSON has a regular structure (int, text), I also considered two alternative storages:
create a composite type role of (int, text) and store the array role[]; extract_roles function is still needed;
store two arrays int[] and text[].
The latter one won for the following reasons:
smallest disk space (important for queries that require seq scan);
no need for extract_roles function - the int array is stored directly;
no need for functional index;
easy append (but same is true for JSON);
the library that I am using (jOOQ) has a good binding for arrays so working with them may even be more pleasant than with a JSON.
Disadvantages are:
harder remove - need to unnest and reaggregate.
I have a table in which I want to search by a prefix of the primary key. The primary key has values like 03.000221.1, 03.000221.2, 03.000221.3, etc. and I want to retrieve all that begin with 03.000221..
My first thought was to filter with LIKE '03.000221.%', thinking Postgres would be smart enough to look up 03.000221. in the index and perform a range scan from that point. But no, this performs a sequential scan.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..253626.34 rows=78 width=669)
Workers Planned: 2
-> Parallel Seq Scan on ... (cost=0.00..252618.54 rows=32 width=669)
Filter: ((id ~~ '03.000221.%'::text)
JIT:
Functions: 2
Options: Inlining false, Optimization false, Expressions true, Deforming true
If I do an equivalent operation using a plain >= and < range, e. g. id >= '03.000221.' and id < '03.000221.Z' it does use the index:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ... on ... (cost=0.56..8.58 rows=1 width=669)
Index Cond: ((id >= '03.000221.'::text) AND (id < '03.000221.Z'::text))
But this is dirtier and it seems to me that Postgres should be able to deduce it can do an equivalent index range lookup with LIKE. Why doesn't it?
PostgreSQL will do this if you are build the index with text_pattern_ops operator, or if you are using the C collation.
If you are using some random other collation, PostgreSQL can't deduce much of anything about it. Observe this, in the very common "en_US.utf8" collation.
select * from (values ('03.000221.1'), ('03.0002212'), ('03.000221.3')) f(x) order by x;
x
-------------
03.000221.1
03.0002212
03.000221.3
Which then naturally leads to this wrong answer with your query:
select * from (values ('03.000221.1'), ('03.0002212'), ('03.000221.3')) f(id)
where ((id >= '03.000221.'::text) AND (id < '03.000221.Z'::text))
id
-------------
03.000221.1
03.0002212
03.000221.3
I have to extract DB to external DB server for licensed software.
DB has to be Postgres and I cannot change select query from application (cannot change source code).
Table (it has to be 1 table) holds around 6,5M rows and has unique values in main column (prefix).
All requests are read request, no inserts/update/delete, and there are ~200k selects/day with peaks of 15 TPS.
Select query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
AND company = 0 and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC
LIMIT 1;
Explain analyze shows following
Limit (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
-> Sort (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
Sort Method: quicksort Memory: 25kB
-> Seq Scan on table (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms`
Slowest part of query is:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table
WHERE '00436641997142' LIKE prefix
which generates 1,6s (tested only this part of query)
Part of query tested separately:
Seq Scan on table (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
Filter: ('004366491997142'::text ~~ (prefix)::text)
Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms
About data itself:
column "prefix" has identical first several digits (first 5) and rest are different, unique ones.
Postgres version is 9.5
I've changed following settings of Postgres:
random-page-cost = 40
effective_cashe_size = 4GB
shared_buffer = 4GB
work_mem = 1GB
I have tried with several index types (unique, gin, gist, hash), but in all cases indexes are not used (as stated in explain above) and result speed is same.
I've also did, but no visible improvements:
vacuum analyze verbose table
Please recommend settings of DB and/or index configuration in order to speed up execution time of this query.
Current HW is
i5, SSD, 16GB RAM on Win7, but I have option to buy stronger HW.
As I understood, for cases where read (no inserts/updates) is dominant, faster CPU cores are much more important than number of cores or disk speed > please, confirm.
Add-on 1:
After adding 9 indexes, index is not used also.
Add-on 2:
1) I found out reason for not using index: word order in query in part like is reason. if query would be:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
it uses index.
notice difference:
... WHERE '00436641997142%' like prefix ...
query which uses index correctly:
... WHERE prefix like '00436641997142%' ...
since I cannot change query itself, any idea how to overcome this? I can change data and Postgres settings, but not query itself.
2) Also, I intalled Postgres 9.6 version in order to use parallel seq.scan. In this case, parallel scan is used only if last part of query is ommited. So, query:
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
uses parallel mode.
Any idea how to force original query (I cannot change query):
SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix
AND company = 0 and
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1
to use parallel seq. scan?
It's too hard to make an index for queries like strin LIKE pattern because wildcards (% and _) can stand everywhere.
I can suggest one risky solution:
Slightly redesign the table - make it indexable. Add two more column prefix_low and prefix_high of fixed width - for example char(32), or any arbitrary length enough for the task. Also add one smallint column for prefix length. Fill them with lowest and highest values matching prefix and prefix length. For example:
select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
prefix_low | prefix_high | prefix_length
----------------------------------+---------------------------------------+-----
00436641997142000000000000000000 | 00436641997142999999999999999999 | 14
Make index with these values
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
Check modified requests against table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0 or "Day" is null )
ORDER BY prefix_length DESC
LIMIT 1
Check how well it works with indexes, try to tune it - add/remove index for prefix_length add it to between index and so on.
Now you need to rewrite queries to database. Install PgBouncer and PgBouncer-RR patch. It allows you rewrite queries on-fly with easy python code like in example:
import re
def rewrite_query(username, query):
q1=r"""^SELECT [^']*'(?P<id>\d+)%'[^'] ORDER BY (?P<position>\('%' in prefix\) ASC, char_length\(prefix\) LIMIT """
if not re.match(q1, query):
return query # nothing to do with other queries
else:
new_query = # ... rewrite query here
return new_query
Run pgBouncer and connect it to DB. Try to issue different queries like your application does and check how they are getting rewrited. Because you deal with text you have to tweak regexps to match all required queries and rewrite them properly.
When proxy is ready and debugged reconnect your application to pgBouncer.
Pro:
no changes to application
no changes to basic structure of DB
Contra:
extra maintenance - you need triggers to keep all new columns with actual data
extra tools to support
rewrite uses regexp so it's closely tied to particular queries issued by your app. You need to run it for some time and make robust rewrite rules.
Further development:
highjack parsed query tree in pgsql itself https://wiki.postgresql.org/wiki/Query_Parsing
If I understand your problem correctly, creating proxy server which rewrites queries could be solution here.
Here is an example from another question.
Then you could change "LIKE" to "=" in your query, and it would run a lot faster.
You should change your index by adding proper operator class, according to documentation:
The operator classes text_pattern_ops, varchar_pattern_ops, and
bpchar_pattern_ops support B-tree indexes on the types text, varchar,
and char respectively. The difference from the default operator
classes is that the values are compared strictly character by
character rather than according to the locale-specific collation
rules. This makes these operator classes suitable for use by queries
involving pattern matching expressions (LIKE or POSIX regular
expressions) when the database does not use the standard "C" locale.
As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
id | url
---------
1 | "facebook.com/user?query=hello"
2 | "stackoverflow.com/question/?query=postgres"
3 | "facebook.com/videos?"
4 | "facebook.com/user?query="
So, this is not the best example, but essentially there is a field in the table I'm querying that is a URL and contains query inputs as well, I want to only select urls that have something after query=
so I would want to get rows 1 and 2.
It would be even cooler if I could also capture what might follow after? (hello, postgres)
Thank you for any insights or help
Here is more index-friendly solution:
drop table if exists foo.t;
create table foo.t (x bigserial, url varchar);
insert into foo.t(url) values
('facebook.com/user?query=hello'),
('stackoverflow.com/question/?query1=postgres'),
('facebook.com/videos?'),
('facebook.com/user?query=');
insert into foo.t(url) select url from foo.t, generate_series(1,100000);
create index iturl on foo.t(position('?query=' in url), length(url));
explain analyze select * from foo.t where position('?query=' in url) > 0 and length(url) > position('?query=' in url) + 7;
And result is:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=3736.79..12211.16 rows=67005 width=37) (actual time=62.072..407.323 rows=100001 loops=1)
Recheck Cond: ("position"((url)::text, '?query='::text) > 0)
Filter: (length((url)::text) > ("position"((url)::text, '?query='::text) + 7))
Rows Removed by Filter: 100001
Heap Blocks: exact=3449
-> Bitmap Index Scan on iturl (cost=0.00..3720.04 rows=201015 width=0) (actual time=60.473..60.473 rows=200002 loops=1)
Index Cond: ("position"((url)::text, '?query='::text) > 0)
Planning time: 0.512 ms
Execution time: 423.587 ms
(9 rows)
Note that I changed your initial data to make index selectable (1/4 hits).
You can readily select the rows using:
where url like '%query=_%'
You can get what comes after in several ways. Here is an easy way:
select (case when url like '%query=_%'
then substr(url, position('query=' in url) + 6)
end) as querystring
from pageviews;
You can use the wild card % with a like clause.
SELECT * FROM person WHERE email like '%gmail.com%';
This will get all the emails with gmail.com in email column from person table.
The below part may be out of context for this question but if someone wants to use regex
~ character can be used to match a regex e.g.
SELECT * FROM person WHERE email ~ '[A-Z]';
This will list all the rows where email column contains uppercase letters
I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.
Query #1:
SELECT id
FROM subsource_position
WHERE
id NOT IN (SELECT position_id FROM subsource)
This comes back with the following plan:
QUERY PLAN
-------------------------------------------------------------------------------
Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..2566.50 rows=101500 width=4)
-> Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4)
Query #2:
SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;
Plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------
SetOp Except (cost=24760.35..25668.66 rows=95997 width=4)
-> Sort (cost=24760.35..25214.50 rows=181663 width=4)
Sort Key: "*SELECT* 1".id
-> Append (cost=0.00..6406.26 rows=181663 width=4)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4)
-> Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4)
-> Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)
I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?
Additional data:
=> select count(*) from subsource;
count
-------
85158
(1 row)
=> select count(*) from subsource_position;
count
-------
93261
(1 row)
Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.
Edit 2: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.
Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.
Query #1 is not the elegant way for doing this... (NOT) IN SELECT is fine for a few entries, but it can't use indexes (Seq Scan).
Not having EXCEPT, the alternative is to use a JOIN (HASH JOIN):
SELECT sp.id
FROM subsource_position AS sp
LEFT JOIN subsource AS s ON (s.position_id = sp.id)
WHERE
s.position_id IS NULL
EXCEPT appeared in Postgres long time ago... But using MySQL I believe this is still the only way, using indexes, to achieve this.
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.
Your queries are not functionally equivalent so any comparison of their query plans is meaningless.
Your first query is, in set theory terms, this:
{subsource.position_id} - {subsource_position.id}
^ ^ ^ ^
but your second is this:
{subsource_position.id} - {subsource.position_id}
^ ^ ^ ^
And A - B is not the same as B - A for arbitrary sets A and B.
Fix your queries to be semantically equivalent and try again.
If id and position_id are both indexed (either on their own or first column in a multi-column index), then two index scans are all that are necessary - it's a trivial sorted-merge based set algorithm.
Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.
(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u on the command line in seconds. Database less than 50MB when exported with pg_dump.)
PS: more interesting comment here:
more work has been put into optimizing
EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially
less useful due to its unintuitive but spec-mandated handling of NULLs.
We're not going to apologize for that, and we're not going to regard it as a bug.
What it comes down to is that except is different to not in with respect to null handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize it.
The second query makes usage of the HASH JOIN feature of postgresql. This is much faster then the Seq Scan of the first one.