Postgres LIKE query performance for single-word columns

DB-Type: PostgreSQL
DB-Version: 11
We have a column which always has a single word as its value. The maximum length is 10 characters.
The value in this column is always unique within the table.
We never update this column; rows are only ever inserted with it.
We would like to enable LIKE queries on this column.
Should we consider the PostgreSQL pg_trgm extension with a GIN index, or will a normal index suffice in this case?
The queries will be like this:
select * from my_table where my_column like '%abc%';
The question arises from the fact that TRGM is quite powerful when full text search is required for long text with many words, but we wanted to know whether it will also be better than a normal index in the single-word scenario.

A trigram index is the only index that could help with a LIKE query with a leading wildcard. For short search strings like the one you show, it may still be slow if the trigram occurs in many words. But that's the best you can get.
For a LIKE condition without a wildcard in the beginning, a b-tree index might well be faster.
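For completeness, a minimal sketch of the trigram setup, using the table and column names from the question (the index name is invented):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- A GIN trigram index can serve LIKE '%abc%' despite the leading wildcard:
CREATE INDEX my_table_my_column_trgm_idx ON my_table USING gin (my_column gin_trgm_ops);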

A "regular" index (b-tree) will generally be able to resolve:
where x like 'abcdefghij'
where x = 'abcdefghij'
It can also be used for prefix matches:
where x like 'abcd%'
However, it cannot be used when the pattern starts with a wildcard:
where x like '%hij'
So, whether the index is used depends on how you are going to use it. If the pattern starts with wildcards then a GIN index could be used.
I should add that regardless of the index, there are considerations if you want case-independence or are mixing collations.
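To make this concrete, a small sketch (the table and index names are invented; note that for LIKE prefix matching a b-tree needs either the C collation or the text_pattern_ops operator class):

CREATE INDEX my_table_x_idx ON my_table (x text_pattern_ops);
SELECT * FROM my_table WHERE x LIKE 'abcd%';  -- can use the b-tree index
SELECT * FROM my_table WHERE x LIKE '%hij';   -- cannot; needs a trigram GIN index instead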

I think you have a fundamental (but kind of common) misunderstanding here:
The question arises from the fact that TRGM is quite powerful when full text search is required for long text with many words
No, that is what Full Text Search is for, which is very different than pg_trgm.
pg_trgm is fairly bad at long text with many words (not as bad since 9.6 as it was before that, but still not its strongest point), but it is good at this very thing you want.
The problem is that you need to have trigrams to start with. If your query were changed to like '%ab%', then pg_trgm would probably be worse than not having the index at all. So it might be worthwhile to validate the query on the app or client side to reject such patterns.
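You can see what the index has to work with via pg_trgm's show_trgm() function; a minimal illustration (the exact output formatting may differ):

SELECT show_trgm('abc');  -- yields trigrams such as '  a', ' ab', 'abc', 'bc '
-- A pattern like '%ab%' contains no complete three-character run,
-- so the index cannot narrow the search and may be slower than a plain scan.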

Related

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database; each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked into full text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution.
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution: n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, if you have millions of rows it will be slow, because indexes offer no significant help for a pattern with a leading wildcard.
You might want to combine multiple strategies, such as first retrieving records where the value of name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%') and then adding middle-of-the-string hits in a second, non-indexed pass, as sketched below. Humans are significantly slower than computers at interpreting strings, so the user may not notice.
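A rough sketch of that two-pass approach (the movies table and name column are assumed for illustration):

-- Pass 1: prefix matches, which can use an index on name:
SELECT * FROM movies WHERE name LIKE 'Unchai%'
UNION ALL
-- Pass 2: middle-of-string matches, excluding pass-1 rows; this part scans:
SELECT * FROM movies WHERE name LIKE '%Unchai%' AND name NOT LIKE 'Unchai%';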
This question is very much related to autocomplete in forms; you will find several threads on that.
Basically, you will need a special kind of index: a space-partitioning tree. PostgreSQL supports such index structures with SP-GiST indexes. You will find a bunch of useful material if you google for that.
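As an unverified sketch of what such an index declaration looks like in PostgreSQL (the table and column names are invented):

-- SP-GiST builds a radix (space-partitioning) tree over the text values:
CREATE INDEX names_name_spgist_idx ON names USING spgist (name);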

Which performs better in a query, 'LIKE' or '='?

I am using asp.net mvc3.
Which of the two, '=' or 'LIKE', will perform faster in my SQL query?
= will be faster than LIKE because testing against exact matches is faster in SQL. The LIKE expression needs to scan the string and search for occurrences of the given expression for each row. Of course that's not a reason not to use LIKE. Databases are pretty optimized and unless you discover that this is a performance bottleneck for your application you should not be afraid of using it. Doing premature optimizations like this is not good.
As Darin says, searching for equality is likely to be faster - it allows better use of indexes etc. It partly depends on the kind of LIKE operation - leading substring LIKE queries (WHERE Name LIKE 'Fred%') can be optimized in the database with cunning indexes (you'd need to check whether any special work is needed to enable that in your database). Trailing substring matches could potentially be optimized in the same sort of way, but I don't know whether most databases handle this. Arbitrary matches (e.g. WHERE Name LIKE '%Fred%Bill%') would be very hard to optimize.
However, you should really be driven by functionality - do you need pattern-based matching, or exact equality? Given that they don't do the same thing, which results do you want? If you have a LIKE pattern which doesn't specify any wildcards, I would hope that the query optimizer could notice that and give you the appropriate optimization anyway - although you'd want to test that.
If you're wondering whether or not to include pattern-matching functionality, you'll need to work out whether your users are happy to have that for occasional "power searches" at the cost of speed - without knowing much about your use case, it's hard to say...
Equal and LIKE are different operators, so they are not directly comparable:
Equal is an exact match.
LIKE is pattern matching.
That said, LIKE without wildcards should run the same as equal. But you wouldn't run that.
And it depends on indexes. Without an index, every row will need to be examined for either operator.
Note: LIKE '%something' can never be optimised by a regular index (trigram indexes, discussed above, are the exception).
Equal is fastest
Then LIKE 'something%'
LIKE '%something' is slowest.
The last one has to go through the entire column to find a match. Hence it's the slowest one.
As you are talking about using them interchangeably I assume the desired semantics are equality.
i.e. You are talking about queries such as
WHERE bar = 'foo' vs WHERE bar LIKE 'foo'
This contains no wildcards, so the queries are semantically equivalent. However, you should probably use the former because:
It is clearer what the intent of the query is.
In this case the search term does not contain any characters of particular significance to the LIKE operator but if you wanted to search for bar = '10% off' you would need to escape these characters when using LIKE.
Trailing spaces are significant in LIKE comparisons but not with = (tested on SQL Server and MySQL; not sure what the standard says here).
You don't specify the RDBMS; in the case of SQL Server, here are a few possible scenarios.
1. The bar column is not indexed.
In that case both queries will involve a full scan of all rows. There might be some minor difference in CPU time because of the different semantics around how trailing space should be treated.
2. The bar column has a non unique index.
In that case the = query will seek into the index where bar = 'foo' and then follow the index along until it finds the first row where bar <> 'foo'. The LIKE query will seek into the index where bar >= 'foo', following the index along until it finds the first row where bar > 'foo'.
3. The bar column has a unique index.
In that case the = query will seek into the index where bar = 'foo' and return that row if it exists and not scan any more. The LIKE query will still do the range seek on Start: bar >= 'foo' End: bar <= 'foo' so will still examine the next row.

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but it might also apply to other DBMSs, since indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be evaluated using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL; -- holds REVERSE(col)
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);       -- index for prefix lookups
UPDATE my_table SET col_reverse = REVERSE(col);                     -- populate existing rows
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to a feasible amount, it'll cause a heavy performance hit. You might want to consider a full text search solution instead, or some other specialized solution.
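For the MySQL case discussed here, the usual specialized solution is a FULLTEXT index; a minimal sketch (the index name is invented; note it matches whole words, so it is not a drop-in replacement for arbitrary '%foo%' substring matching):

ALTER TABLE my_table ADD FULLTEXT INDEX ft_col (col);
SELECT * FROM my_table WHERE MATCH(col) AGAINST ('foo');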
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly the size of the data), the indexes, and how the LIKE is used (with or without a prefix wildcard)!
You are asking too general a question.

What are some ways I can improve the performance of a regular expression query in PostgreSQL 8?

I'm performing a regular expression match on a column of type character varying(256) in PostgreSQL 8.3.3. The column currently has no indices. I'd like to improve the performance of this query if I can.
Will adding an index help? Are there other things I can try to help improve performance?
You cannot create an index that will speed up any generic regular expression; however, if you have one or a limited number of regular expressions that you are matching against, you have a few options.
As Paul Tomblin mentions, you can use an extra column or columns to indicate whether or not a given row matches that regex or regexes. That column can be indexed, and queried efficiently.
If you want to go further than that, this paper discusses an interesting-sounding technique for indexing against regular expressions, which involves looking for long substrings in the regex and indexing based on whether those are present in the text, in order to generate candidate matches. That filters down the number of rows that you actually need to check the regex against. You could probably implement this using GiST indexes, though that would be a non-trivial amount of work.
An index can't do anything with a regular expression. You're going to have to do a full table scan.
If at all possible (say, if you're querying for the same regex all the time), you could add a column that specifies whether the row matches that regex, and maintain it on inserts and updates.
Regex matches do not perform well on fairly big text columns. Try to accomplish this without the regex, or do the match in code if the dataset is not large.
This may be one of those times when you don't want to use RegEx. What does your reg-ex code look like? Maybe that's a way to speed it up.
If you have a limited set of regexes to match against, you could create a table with the primary key of your table and a field indicating whether the row matches each regex, update it with a trigger, and then index your table's key in that table. This trades a small decrease in insert and update speed for a probably large speed increase in selects.
Alternatively, you could write a function that compares your field to that regex (or even pass the regex along with the field to the function), then create a functional index on your table against that function; a sketch follows below. This also assumes a fixed set of regexes (but you can add new regex matches more easily this way).
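A hedged sketch of the functional-index idea in PostgreSQL (the docs table, body column, and the regex are all hypothetical; $1 rather than a named parameter keeps it compatible with old releases like 8.3):

CREATE FUNCTION matches_my_regex(text) RETURNS boolean AS $$
  SELECT $1 ~ '^[0-9]{3}-[0-9]{4}$';  -- the fixed regex being matched
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX docs_regex_idx ON docs (matches_my_regex(body));

-- Queries phrased through the function can use the index:
SELECT * FROM docs WHERE matches_my_regex(body);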
If the regex is dynamically created from user input, you might have to live with the table scan, or change the user app to produce a simpler search like field LIKE 'value%', which can use an index on field ('%value%' wouldn't).
If you do manage to reduce your needs to a simple LIKE query, look up indexes with text_pattern_ops to speed those up.
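For example (hypothetical table t with text column field):

-- text_pattern_ops lets a b-tree serve LIKE prefix patterns regardless of collation:
CREATE INDEX t_field_pattern_idx ON t (field text_pattern_ops);
SELECT * FROM t WHERE field LIKE 'value%';  -- can use the index
SELECT * FROM t WHERE field LIKE '%value%'; -- cannot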

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
Full text search is likely to be quicker, since it benefits from an index of words that it uses to look up records, whereas LIKE generally needs a full table scan.
In some cases LIKE will be more accurate: LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, both would likely have been good results.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) that you couldn't possibly implement in any reasonable time-frame using LIKE. The real question is whether you need those features.
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full text search query is much faster, especially when working with lots of data in various columns.
Additionally you will have language-specific search support. E.g. German umlauts like "ü" in "über" will also be found when stored as "ueber". Also you can use synonyms, where you can automatically expand search queries, or replace or substitute specific phrases.
In some cases LIKE will be more accurate: LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, both would likely have been good results.
That is not correct. The full text search syntax lets you specify "how" you want to search, e.g. by using the CONTAINS statement you can use exact term matching as well as fuzzy matching, weights, etc.
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if you don't start your pattern with %. Example: LIKE 'Santa M%' is good! LIKE '%Maria' is bad, and can cause a Table or Index Scan, because this can't be indexed in the standard way.
This is very important: Full-Text Index updates are asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with Full-Text Search where you expect the new data to appear, you might not get the data immediately. Depending on your configuration, you may have to wait a few seconds or a day. Generally, Full-Text Indexes are populated when your system does not have many requests.
It will perform better, but unless you have a lot of data you won't notice the difference. A SQL full text search index lets you use operators that are more advanced than a simple "LIKE" operation, but if all you do is the equivalent of a LIKE operation against your full text index, then your results will be the same.
Imagine that you also let users enter notes/descriptions for their DVDs.
In that case it will be good to allow searching by description.
Full text search will do a better job there.
You may get slightly better results, or else at least have an easier implementation with full text indexing. But it depends on how you want it to work ...
What I have in mind is that if you are searching for two words, with LIKE you would have to manually implement (for example) a method to rank rows that match both words higher in the list. A fulltext index should do this for you, and allow you to influence the weightings too, using the relevant syntax.
To make full text search in SQL Server behave like LIKE:
First, you have to create a stoplist and assign it to your table:
CREATE FULLTEXT STOPLIST [MyStopList];  -- an empty stoplist, so no words are filtered out
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList]
GO
Second, use the following T-SQL script:
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1,mt.ColumnName2,mt.ColumnName3), N'"*search text s*"')
If you are not just searching for English words - say you search for a Chinese word - then how your FTS tokenizes words will make a big difference, as I showed in an example here: https://stackoverflow.com/a/31396975/301513. But I don't know how SQL Server tokenizes Chinese words; does it do a good job with that?