I have a few tables from which I need to get the data related to foo. The tables have about 10^8 rows.
So I need to get all rows from these tables where the column contains the substring 'foo'.
select * from bar where my_col like '%foo%';
I know this is slow so I check the possible values:
select distinct my_col from bar where my_col like '%foo%';
-- => ('xx_foo', 'yy_foo', 'xx_foo_xx', 'foo' ... 'xx_foo_yy')
The number of possible values varies between 3 and 20.
Now how slow is '%foo%' really?
select * from bar where my_col like '%foo%';
-- or
select * from bar where my_col in('foo', 'xx_foo' ... 'foo_yy'); -- list_size = 20
Any general rule on when to use what, or is testing the speed for different cases the only way to go?
Edit: I do not own the table and no index exists on the column. So it needs to do a full table scan no matter what.
If you use %foo%, you will get a full-table scan, which is slow.
If you use IN with a list of values, then an index can be used if one exists on the column in the condition.
So, if you are able to, you should avoid using %foo%. Depending on how often new values appear in the table, you might consider using an extra table that holds the distinct values, using it when querying your main table, and updating it whenever a new distinct value comes into play (if that is possible in your design).
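A minimal sketch of that idea (the helper table name and the VARCHAR length are assumptions for illustration):
-- helper table holding the distinct values that contain 'foo'
CREATE TABLE bar_foo_values (my_col VARCHAR(100) PRIMARY KEY);

INSERT INTO bar_foo_values
SELECT DISTINCT my_col FROM bar WHERE my_col LIKE '%foo%';

-- later queries can use the (potentially indexable) IN / join form instead of LIKE '%foo%'
SELECT b.*
FROM bar b
WHERE b.my_col IN (SELECT my_col FROM bar_foo_values);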
A search using the LIKE operator will always lead to a table scan when the pattern starts with a %. When using the IN operator, an index can be used (if it exists), as long as the values match no more than a few percent of the rows in the table. Check the cardinality concept:
http://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29
The DBMS knows about the cardinalities by keeping statistics about the tables. If your column has high cardinality and an index on it, then an index scan is likely when using the IN operator. To update the statistics, issue an ANALYZE command.
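For example (PostgreSQL syntax; other systems have their own equivalents), you can refresh the statistics and then check which plan the optimizer chooses:
-- refresh table statistics so cardinality estimates are up to date
ANALYZE bar;

-- inspect the plan chosen for the IN variant
EXPLAIN SELECT * FROM bar WHERE my_col IN ('foo', 'xx_foo', 'xx_foo_xx');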
Related
The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100 ms. As far as I understand the execution plan, the runtime is independent of the row count, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a row can be joined, and then returns immediately.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on the row count), since the two tables are "joined completely" before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment, test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
Why does Postgres not recognize that it could use an Index Scan Backward on a row-by-row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
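A sketch of such an index on the underlying table (the index name is arbitrary):
-- sort order matches ORDER BY id DESC NULLS LAST, so the scan can stop at the first row
CREATE INDEX test1_id_desc_nulls_last_idx ON test1 (id DESC NULLS LAST);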
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly, regardless of index sort order; an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and a good test case.
I tested it in Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (it is slow without the view too)
The JOIN can filter some rows (unfortunately not in your test, but with a longer md5, 5-6, yes)
Other basically equivalent select statements do not solve your problem either (inner query or EXISTS)
I managed to get an index-only scan, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because of the PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Then the second query uses just indexes.
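To verify that the plan actually changes, one way (just a verification sketch) is:
-- check whether the plan now uses the indexes only
EXPLAIN (ANALYZE, BUFFERS) SELECT MAX(id) FROM testview;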
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can achieve some performance gain, because you created the indexes before the inserts.
I think the reason is the two aims of the DB.
The first aim is to optimize for just a few rows, so it runs a Nested Loop. You can force it with LIMIT x.
The second aim is to optimize for the whole table: run the query fast over the whole table.
In this situation the Postgres optimizer didn't notice that a simple MAX could run with a NESTED LOOP. Or perhaps Postgres cannot push the limit into an aggregate clause (it can only run on the whole partial select that is filtered by the query).
And this is very expensive. But you have the possibility to use other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you too.
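For instance, a hedged sketch of a window-function rewrite; whether the planner handles it better would have to be tested:
-- take the top id via row_number() instead of MAX()
SELECT id
FROM (
  SELECT id, row_number() OVER (ORDER BY id DESC) AS rn
  FROM testview
) t
WHERE rn = 1;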
I have a query where I am searching for a range of user accounts. Every time I run the query I use the first 5 digits of multiple ID numbers and search based on those. I wanted to know whether there is any other way to rewrite this query for a user ID range when we use more than 10 user IDs to search. Is there going to be a huge performance hit with this kind of search in the query?
example:
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (B.col_id like '12345%'
OR B.col_id like '47474%'
OR B.col_id like '59598%');
I am using Oracle 11g.
Actually it is not important how many UserIDs you pass to the query. The most important part is the selectivity of your query; in other words: how many rows your query will return, and how many rows there are in your tables. If the number of returned rows is relatively small, then it is a good idea to create an index on column B.col_id. There is also nothing bad about using an OR condition. Basically each OR will add one more INDEX RANGE SCAN to the execution plan, with a final CONCATENATION (but you'd rather check your actual plan to be sure). If the total cost of all those operations is lower than a full table scan, then the Oracle CBO will use your index. Otherwise, if you select >= 20-30% of the data at once, a full table scan is most likely to happen, and you should be even less worried about the ORs, because all data will be read anyway and comparing each value against your multiple conditions won't add much overhead.
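To check the actual plan in Oracle, you can do something like this (using the example query from the question):
EXPLAIN PLAN FOR
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (B.col_id like '12345%'
OR B.col_id like '47474%'
OR B.col_id like '59598%');

-- look for INDEX RANGE SCAN + CONCATENATION vs. TABLE ACCESS FULL
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);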
Generally the use of LIKE makes it harder for Oracle to use indexes, and with a leading wildcard it becomes impossible.
If the query is going to be reused, consider creating a synthetic column with the first 5 characters of COL_ID. Put a non-unique index on it. Put your search keys in a table and join that to that column.
There may be a way to do it with a functional index on the first 5 characters.
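A sketch of that idea (the index name and the choice of a 5-character prefix are up to you):
-- function-based index on the first 5 characters of col_id
CREATE INDEX table2_colid_prefix_idx ON table2 (SUBSTR(col_id, 1, 5));

-- the query has to use the same expression for the index to be considered
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and SUBSTR(B.col_id, 1, 5) in ('12345', '47474', '59598');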
I don't know if the performance will be better or not, but another way to write this is with a union:
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '12345%')
union all
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '47474%') -- you did mean A.col_id again, right?
union all
select A.col1,B.col2,B.col3
from table1 A,table2 B
where A.col2=B.col2
and (A.col_id like '59598%'); -- and A.col_id here too?
In terms of performance, how does the LIKE operator behave when applied to strings with multiple % placeholders?
for example:
select A from table_A where A like 'A%'
takes the same time to select as
select A from table_A where A like 'A%%'
???
Your queries:
select A from table_A where A like 'A%'
and
select A from table_A where A like 'A%%'
^ optimizer will remove second redundant %
are equivalent, the optimizer will remove the second % in the second query
just like it would remove the 1=1 from:
select A from table_A where A like 'A%%' and 1=1
However, this query is very different:
select A from table_A where A like '%A%'
When using 'A%' it will use the index to find everything starting with an A, like a person using a phone book would quickly look for the start of a name. However, when using '%A%' it will scan the entire table looking for anything containing an A, which is slower and makes no use of the index. It is like having to find every name in the phone book that contains an A; that would take a while!
It will treat them the same. If there is an index on column A, it will use that index just as it would with a single wildcard. However, if you were to add a leading wildcard, that would force a table scan regardless of whether an index exists or not.
For the most part the pattern that you're using will not affect the performance of the query. The key to the performance for this is the appropriate use of indexes. In your example, an index on the column will work well because it will seek values that start with 'A', then match the full pattern. There may be some more-challenging patterns around, but the performance difference is negligible between them.
There is one important condition where the wildcard character will hurt performance, and that is when it is at the beginning of the pattern. For example, '%A' will gain no benefit from an index, because it indicates you want to match any value that starts with any valid character. All rows must be evaluated to meet this criterion.
I have a MySQL table with 3 fields:
Location
Variable
Value
I frequently use the following query:
SELECT *
FROM Table
WHERE Location = '$Location'
AND Variable = '$Variable'
ORDER BY Location, Variable
I have over a million rows in my table and queries are somewhat slow. Would it increase query speed if I added a field VariableLocation, which is the Variable and the Location combined? I would be able to change the query to:
SELECT *
FROM Table
WHERE VariableLocation = '$Location$Variable'
ORDER BY VariableLocation
I would add a covering index, for columns location and variable:
ALTER TABLE tablename
ADD INDEX (variable, location);
...though if the variable & location pairs are unique, they should be the primary key.
Combining the columns will likely cause more grief than it's worth. For example, if you need to pull out records by location or variable only, you'd have to substring the values in a subquery.
Try adding an index which covers the two fields. You should then still get a performance boost, but also keep your data understandable: combining the two columns would make it look as if they belong together, when you would only be doing it for performance.
I would advise against combining the fields. Instead, create an index that covers both fields in the same order as your ORDER BY clause:
ALTER TABLE tablename ADD INDEX (location, variable);
Combined indices and keys are only used in queries that involve all fields of the index or a subset of these fields read from left to right. Or in other words: If you use location in a WHERE condition, this index would be used, but ordering by variable would not use the index.
When trying to optimize queries, the EXPLAIN command is quite helpful: EXPLAIN in mysql docs
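For example, with the suggested index in place you can verify that MySQL actually uses it ("tablename" and the placeholder values come from the examples above):
-- "key" should show the (location, variable) index; "type" should be "ref"
EXPLAIN SELECT *
FROM tablename
WHERE Location = '$Location' AND Variable = '$Variable'
ORDER BY Location, Variable;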
Correction update, courtesy of @paxdiablo:
A column in the table will make no difference. All you need is an index over both columns and the MySQL engine will use that. Adding a column to the table is actually worse than that, since it breaks 3NF and wastes space. See http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html, which states: "SELECT * FROM tbl_name WHERE col1=val1 AND col2=val2; If a multiple-column index exists on col1 and col2, the appropriate rows can be fetched directly."
I have a index on a column and it is correctly used when the query is
select * from Table where x = 'somestring'
However it seems to be not used when the query is something like
select * from Table where x != 'someotherstring'
Is this normal or am I missing something else in the query? The actual query is of course much larger and so it could be caused by some other factor. Any other ideas why an index would not be used in a query?
This is normal. An index will only be used if you have an '=' condition. Searching an index for a != condition is not effective.
Similarly, this may use the index (in Oracle)
select * from Table where x like 'some%'
but this wouldn't
select * from Table where x like '%thing%'
Also,
select * from Table where x between 1 and 10 will use the index
but not
select * from Table where x not between 1 and 10
This is absolutely normal. An index is used to look up something exact. Where would you start in a dictionary if I asked you to find all the words that do not start with 'S'?
you can always do this.
select * from Table a
where not exists (select * from Table b where b.x = 'somestring' and a.key = b.key)
It may use an index if the index is clustering and there are not so many different values of the indexed attribute (so we can quickly decide which blocks we may skip). But if the indexed attribute is, say, a key, then using an index in this case makes absolutely no sense.
That is indeed normal - to use the index, you need an exact match (like the "=" equals operator), or something like a range query.
A query that defines a "negative" criteria (NOT something or another) typically can't be satisfied by an index lookup - you'll have to look up everything except a certain value. That doesn't work nicely - typically, a full table scan (clustered index scan in SQL Server) will be quicker, just checking for the criteria to be matched (or not matched, in that case).
I think that a != condition can use an index (in MSSQL). According to the execution plan in MSSQL, if I have an index on a single field and I apply a WHERE clause on that field, one with != and one with =, they both result in the same execution plan, both using an index seek.
You didn't say what database engine you are using.
MS SQL Server, for example, has both Equality indexes and Inequality indexes.
The latter are used when the not equal operator is in play.