SELECT by string prefix using an index

You have a column foo, of some string type, with an index on that column. You want to SELECT from the table WHERE the foo column has the prefix 'pre'. Obviously, the index should be able to help here.
Here is the most obvious way to search by prefix:
SELECT * FROM tab WHERE foo LIKE 'pre%';
Unfortunately, this does not get optimized to use the index (in Oracle or Postgres, at least).
The following, however, does work:
SELECT * FROM tab WHERE 'pre' <= foo AND foo < 'prf';
But are there better ways to accomplish this, or are there ways of making the above more elegant? In particular:
I need a function from 'pre' to 'prf', but this has to work for any underlying collation. Also, it's more complicated than above, because if searching for e.g. 'prz' then the upper bound would have to be 'psa', and so on.
Can I abstract this into a stored function/procedure and still hit the index? So I could write something like ... WHERE prefix('pre', foo);?
Answers for all DBMSes appreciated.
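The "function from 'pre' to 'prf'" asked about above can be sketched outside the database. This is a minimal Python sketch that assumes plain code point ordering (roughly what a binary or "C" collation gives); it does not handle linguistic collations, and the name prefix_upper_bound is hypothetical.

```python
from typing import Optional

MAX_CODE_POINT = 0x10FFFF  # largest Unicode code point


def prefix_upper_bound(prefix: str) -> Optional[str]:
    """Return the smallest string that sorts after every string starting
    with `prefix`, assuming code point ordering. None means "no upper
    bound": the prefix is empty or consists only of maximal characters."""
    chars = list(prefix)
    while chars:
        last = ord(chars[-1])
        if last < MAX_CODE_POINT:
            # Bump the last character; everything with the prefix sorts below this.
            chars[-1] = chr(last + 1)
            return "".join(chars)
        # Last character is already maximal: drop it and carry to the left.
        chars.pop()
    return None


print(prefix_upper_bound("pre"))  # prf
print(prefix_upper_bound("prz"))  # pr{
```

Under code point ordering the bound for 'prz' comes out as 'pr{' rather than 'psa', because '{' is the character right after 'z'. The result would be used as the exclusive upper bound: foo >= prefix AND foo < bound.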

The database is quite important here. It so happens that SQL Server does this optimization for like.
One way is to do something like this:
where foo >= 'pre' and foo <= 'pre' + '~'
'~' has the largest 7-bit ASCII value of any printable character, so it sorts after basically anything else. This, however, may be a problem if you are using wide characters or a non-standard character set.
You cannot abstract this into a function, because use of a function generally precludes the use of indexes. If you are always looking at the first three characters, then in Oracle you can create an index on those three characters (something called a "function-based index").

How about
select * from tab where foo between 'pre' and 'prf' and foo != 'prf'
This enables the index in the same way. The RDBMS must be pretty dumb not to use an index for that.
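As a rough check of the two range forms above, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (the thread is about Oracle, Postgres and SQL Server, whose planners may behave differently), and the index name idx_tab_foo and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (foo TEXT)")
conn.execute("CREATE INDEX idx_tab_foo ON tab (foo)")  # hypothetical index name
conn.executemany(
    "INSERT INTO tab (foo) VALUES (?)",
    [("pre",), ("prefix",), ("prf",), ("pra",), ("present",), ("post",)],
)

# Half-open range from the question.
range_sql = "SELECT foo FROM tab WHERE foo >= 'pre' AND foo < 'prf' ORDER BY foo"
# BETWEEN form from the answer, excluding the upper endpoint explicitly.
between_sql = (
    "SELECT foo FROM tab "
    "WHERE foo BETWEEN 'pre' AND 'prf' AND foo != 'prf' ORDER BY foo"
)

range_rows = [row[0] for row in conn.execute(range_sql)]
between_rows = [row[0] for row in conn.execute(between_sql)]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + range_sql))

print(range_rows)
print(between_rows)
print(plan)
```

Both queries return the same rows, and in SQLite the plan text names the index. The exact plan wording varies between SQLite versions.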

Related

Is this possible with the SELECT LIKE specification?

Looking at this link: SQL SELECT LIKE
What if you were searching for a name that starts with H and ends with dinger?
Would I use:
SELECT NAME LIKE
'H_dinger'
'H...dinger' or
'H%dinger' ?
I'll assume H_dinger would think there is only 1 character in between, but I don't know what it is -- so I'm searching for it.
H...dinger isn't valid.
And H%dinger seems like it would check it all, but on the site, that isn't even listed?

You would use %, which is the variable-sized wildcard.
But you need to get the syntax right, such as with:
select NAME from TABLE where NAME like 'H%dinger'
Keep in mind that queries using % may be a performance issue (depending on how it's used and the DBMS engine). It can prevent the efficient use of indexes to speed up queries. It probably won't matter for small tables but it's something to keep in mind if you ever need to scale.
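To see the difference between _ and % concretely, here is a small sketch using Python's built-in sqlite3 module. The table name people and the sample names are invented for the example, and SQLite's LIKE is case-insensitive for ASCII by default, which other engines may not be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Hdinger",), ("Hadinger",), ("Heidinger",), ("Hadingers",), ("Schrodinger",)],
)


def names_like(pattern):
    """Return all names matching a LIKE pattern, sorted for stable output."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in rows]


one_char = names_like("H_dinger")  # _ matches exactly one character
any_len = names_like("H%dinger")   # % matches zero or more characters

print(one_char)
print(any_len)
```

Only 'Hadinger' has exactly one character between 'H' and 'dinger', while the % pattern also accepts zero characters ('Hdinger') and several ('Heidinger').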

For an Oracle NUMBER datatype, LIKE operator vs BETWEEN..AND operator

Assume mytable is an Oracle table and it has a field called id. The datatype of id is NUMBER(8). Compare the following queries:
select * from mytable where id like '715%'
and
select * from mytable where id between 71500000 and 71599999
I would think the second is more efficient, since I think "number comparison" would require fewer assembly language instructions than "string comparison". I need a confirmation or correction. Please confirm or correct, and add any further comments related to either operator.
UPDATE: I forgot to mention 1 important piece of info. id in this case must be an 8-digit number.

If you only want values between 71500000 and 71599999, then yes, the second one is much more efficient. The first one would also return shorter values such as 7150-7159, 71500-71599 and so forth. You would either need to sift through unnecessary results or write another couple of lines of code to filter them out. The second option is definitely more efficient for what you seem to want to do.
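The difference in result sets can be reproduced with a small sketch using Python's built-in sqlite3 module. SQLite stands in for Oracle here (it also converts the number to text for LIKE, but its planner and type system differ), and the sample ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER)")
sample_ids = [715, 7150, 71500, 71500000, 71512345, 71599999, 71600000]
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(i,) for i in sample_ids])

# LIKE compares the textual form of the number, so any value whose digits
# start with 715 matches, regardless of its length.
like_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM mytable WHERE id LIKE '715%' ORDER BY id")
]

# BETWEEN compares numerically and only keeps the 8-digit range.
between_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM mytable WHERE id BETWEEN 71500000 AND 71599999 ORDER BY id"
    )
]

print(like_ids)
print(between_ids)
```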

It seems like the execution plan on the second query is more efficient.
The first query is doing a full table scan of the id's, whereas the second query is not.
(My test data and the execution plans of the first and second queries were posted as images.)

I don't like the idea of using LIKE with a numeric column.
Also, it may not give the results you are looking for.
If you have a value of 715000000, it will show up in the query result, even though it is larger than 71599999.
Also, I do not like between on principle.
If a thing is between two other things, it should not include those two other things. But this is just a personal annoyance.
I prefer to use >= and <=. This avoids confusion when I read the query. In addition, sometimes I have to change the query to something like >= a and < c. If I started by using the between operator, I would have to rewrite it when I don't want to be inclusive.
Harv

In addition to the other points raised, using LIKE in the manner you suggest would cause Oracle to not use any indexes on the ID column due to the implicit conversion of the data from number to character, resulting in a full table scan when using LIKE versus an index range scan when using BETWEEN. Assuming, of course, you have an index on ID. Even if you don't, however, Oracle will have to do the type conversion on each value it scans in the LIKE case, which it won't have to do in the other.

You can use a math function instead. Otherwise you have to use the to_char function in order to use LIKE, and that will cause performance problems.
select * from mytable where floor(id /100000) = 715
or
select * from mytable where floor(id /100000) = TO_NUMBER('715') -- this is parametric
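The arithmetic behind the floor(id / 100000) trick, and the equivalent BETWEEN bounds, can be worked out as follows. This is a sketch in Python; the helper name numeric_prefix_range is hypothetical, and it assumes the ids always have a fixed number of digits (8 in the question).

```python
from typing import Tuple


def numeric_prefix_range(prefix: int, total_digits: int) -> Tuple[int, int]:
    """Return the inclusive (low, high) range of `total_digits`-digit
    numbers whose leading digits are `prefix`."""
    prefix_digits = len(str(prefix))
    if prefix <= 0 or prefix_digits > total_digits:
        raise ValueError("prefix must be positive and no longer than total_digits")
    # Number of values sharing the prefix, e.g. 10**5 for a 3-digit prefix of 8 digits.
    scale = 10 ** (total_digits - prefix_digits)
    low = prefix * scale
    return low, low + scale - 1


low, high = numeric_prefix_range(715, 8)
print(low, high)          # 71500000 71599999
print(71512345 // 100000)  # 715, same as floor(id / 100000) for positive ids
```

The divisor 100000 in the queries above is the same scale value: 10 to the power of (8 - 3).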

How to create simple fuzzy search with PostgreSQL only?

I have a little problem with the search functionality on my RoR-based site. I have many Products with some CODEs. A code can be any string like "AB-123-lHdfj". Right now I use the ILIKE operator to find products:
Product.where("code ILIKE ?", "%" + params[:search] + "%")
It works fine, but it can't find products when the search term is written like "AB123-lHdfj" or "AB123lHdfj".
What should I do about this? Maybe Postgres has some string normalization function, or some other method that can help me?

Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.
Example:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
Try this query sample: (with your own object names and data of course)
SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10
This query says:
Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back all rows where the value of code is within 3 single-character edits of 'AB123-lHdfj'...
Note: if you get an error like:
function levenshtein(character varying, unknown) does not exist
Install the fuzzystrmatch extension using:
test=# CREATE EXTENSION fuzzystrmatch;
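For readers who want to see what the edit distance actually measures, here is a plain Python sketch of the classic dynamic-programming algorithm with unit costs for insert, delete and substitute. It is an illustration only, not the fuzzystrmatch implementation.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance between two strings with unit-cost insertions,
    deletions and substitutions (row-by-row dynamic programming)."""
    # previous[j] = distance between the processed part of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete s_char
                    current[j - 1] + 1,                   # insert t_char
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("GUMBO", "GAMBOL"))             # 2
print(levenshtein("AB-123-lHdfj", "AB123-lHdfj"))  # 1
print(levenshtein("AB-123-lHdfj", "AB123lHdfj"))   # 2
```

Each missing dash counts as one edit, which is why a threshold of 3 in the query above catches both spellings from the question.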

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.
If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:
WHERE code ILIKE '%AB-123-lHdfj%'
Use:
WHERE code ILIKE '%AB%123%lHdfj%'
Or, dynamically:
WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'
% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:
WHERE code ~* 'AB.?123.?lHdfj'
.? ... 0 or 1 characters
Or:
WHERE code ~* 'AB\-?123\-?lHdfj'
\-? ... 0 or 1 dashes
You may want to escape special characters in LIKE or regexp patterns. See:
Escape function for regular expression or LIKE patterns
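Building those patterns from user input could look roughly like the following Python sketch. The helper names to_like_pattern and optional_dash_regex are hypothetical, Python's re module stands in for the Postgres regex engine (the two dialects are similar but not identical), and the LIKE helper deliberately does not escape % or _ in the input.

```python
import re


def to_like_pattern(term: str) -> str:
    """Turn 'AB-123-lHdfj' into '%AB%123%lHdfj%' (dashes become wildcards)."""
    return "%" + term.replace("-", "%") + "%"


def optional_dash_regex(term: str) -> "re.Pattern":
    """Build a case-insensitive regex in which every dash is optional.
    Each part is escaped so other special characters are matched literally."""
    parts = [re.escape(part) for part in term.split("-")]
    return re.compile("-?".join(parts), re.IGNORECASE)


pattern = optional_dash_regex("AB-123-lHdfj")
codes = ["AB-123-lHdfj", "AB123-lHdfj", "AB123lHdfj", "AB-124-lHdfj"]
matches = [code for code in codes if pattern.search(code)]

print(to_like_pattern("AB-123-lHdfj"))  # %AB%123%lHdfj%
print(pattern.pattern)                  # AB-?123-?lHdfj
print(matches)
```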
If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:
There is full text search, of course. But this may be an overkill in your case.
A more likely candidate is trigram-matching with the additional module pg_trgm. See:
Using Levenshtein function on each element in a tsvector?
PostgreSQL LIKE query performance variations
Related blog post by Depesz
It can be combined with LIKE, ILIKE, ~, or ~* since PostgreSQL 9.1.
Also interesting in this context: the similarity() function or % operator of that module.
Last but not least you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column and search with terms transformed the same way.
Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_trgm from above.
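The normalization idea can be sketched in a few lines of Python. The function name normalize_code is hypothetical, and the rule used here (lowercase, keep only letters and digits) is one possible choice rather than anything prescribed above.

```python
def normalize_code(code: str) -> str:
    """Lowercase the code and drop everything that is not a letter or digit."""
    return "".join(ch for ch in code.lower() if ch.isalnum())


# Normalized values would be stored in the extra column (or expression index).
stored = {code: normalize_code(code) for code in ["AB-123-lHdfj", "XY-999-qqq"]}


def search(term: str):
    """Return the original codes whose normalized form contains the normalized term."""
    needle = normalize_code(term)
    return [code for code, normalized in stored.items() if needle in normalized]


print(normalize_code("AB1-23-lHdfj"))  # ab123lhdfj
print(search("AB123lHdfj"))            # ['AB-123-lHdfj']
print(search("ab123-LHDFJ"))           # ['AB-123-lHdfj']
```

The important part is that the stored value and the search term go through exactly the same transformation.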
Overview of pattern-matching techniques:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

SQL Contains - only match at start

For some reason I cannot find the answer on Google! With the SQL CONTAINS function, how can I tell it to match only at the beginning of a string? I.e., I am looking for the full-text equivalent of
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!

You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/

You can use CONTAINS with a LIKE subquery for matching only a start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS(edition, '"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set.

The only solution I can think of is to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)

I am facing a similar issue. This is what I have implemented as a workaround.
I made another table and copied into it only the rows matching LIKE 'some_term%'.
On this new table I then set up the full-text search.
Please let me know if you have tried a better approach.

Placing index columns on the left of a mysql WHERE statement?

I was curious since I read it in a doc. Does writing
select * from CONTACTS where id = '098' and name like 'Tom%';
speed up the query as opposed to
select * from CONTACTS where name like 'Tom%' and id = '098';
The first has an indexed column on the left side. Does it actually speed things up or is it superstition?
Using php and mysql

Check the query plans with explain. They should be exactly the same.
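The same kind of check can be tried quickly with Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here, so this illustrates the method rather than MySQL's actual plans. The index name idx_contacts_id is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_contacts_id ON contacts (id)")  # hypothetical index name


def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


indexed_first = plan("SELECT * FROM contacts WHERE id = '098' AND name LIKE 'Tom%'")
indexed_last = plan("SELECT * FROM contacts WHERE name LIKE 'Tom%' AND id = '098'")

print(indexed_first)
print(indexed_last)
print(indexed_first == indexed_last)
```

In SQLite both orderings produce the same plan and both use the index on id. The planner decides which predicate to drive the lookup with, not the order in which the conditions are written.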

This is purely superstition. I see no reason that either query would differ in speed. If it was an OR query rather than an AND query, however, then I could see that having it on the left may speed things up.

Interesting question. I tried this once, and the query plans are the same (using EXPLAIN).
But considering short-circuit evaluation, I was wondering too why there is no difference (or does MySQL fully evaluate boolean expressions?)

You may be misremembering or misreading something else, regarding which side of the string literal the wildcard is on in a LIKE predicate. Putting the wildcard on the right (as in your example) allows the query engine to use any index that might exist on the column you are searching (in this case, name). But if you put the wildcard on the left,
select * from CONTACTS where name like '%Tom' and id = '098';
then the engine cannot use any existing index and must do a complete table scan.
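The reason can be illustrated with a sorted Python list standing in for an ordered index (a loose analogy, not how any particular engine stores its B-tree). A prefix maps to one contiguous slice that binary search can find, while a suffix can appear anywhere in the sort order. The helper names and sample names are invented, and the prefix helper assumes a non-empty prefix and code point ordering.

```python
import bisect

# Sorted keys, as an ordered index would keep them.
names = sorted(["Thomas", "Tom", "Tomas", "Tommy", "Tony", "Atom", "Bottom"])


def starts_with(sorted_names, prefix):
    """Find all names with the given non-empty prefix using two binary searches."""
    # Smallest string that sorts after everything starting with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_left(sorted_names, upper)
    return sorted_names[lo:hi]


def ends_with(sorted_names, suffix):
    """No ordering helps here: every entry has to be inspected."""
    return [name for name in sorted_names if name.endswith(suffix)]


print(starts_with(names, "Tom"))  # ['Tom', 'Tomas', 'Tommy']
print(ends_with(names, "tom"))    # ['Atom', 'Bottom']
```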
