How to create simple fuzzy search with PostgreSQL only?

I have a little problem with search functionality on my RoR based site. I have many Products, each with a code. The code can be any string, like "AB-123-lHdfj". Right now I use the ILIKE operator to find products:
Product.where("code ILIKE ?", "%" + params[:search] + "%")
It works fine, but it can't find the product when the search term is written like "AB123-lHdfj" or "AB123lHdfj".
What should I do about this? Maybe Postgres has some string normalization function, or some other method to help me?

Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.
Example:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
Try this query sample: (with your own object names and data of course)
SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10
This query says:
Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back all rows where the value of code is within 3 single-character edits of 'AB123-lHdfj'.
Note: if you get an error like:
function levenshtein(character varying, unknown) does not exist
Install the fuzzystrmatch extension using:
test=# CREATE EXTENSION fuzzystrmatch;
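To make the edit-distance idea concrete, here is a minimal Python sketch of what a Levenshtein function computes and how the query above filters and sorts by it. It is an illustration only, not the fuzzystrmatch implementation; the helper name `fuzzy_top` and the sample codes are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] = distance between the prefix of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def fuzzy_top(codes, term, max_distance=3, limit=10):
    """Hypothetical helper mirroring WHERE ... <= 3 ORDER BY ... LIMIT 10."""
    hits = sorted(
        (levenshtein(code, term), code)
        for code in codes
        if levenshtein(code, term) <= max_distance
    )
    return [code for _, code in hits[:limit]]


if __name__ == "__main__":
    print(levenshtein("GUMBO", "GAMBOL"))
    print(fuzzy_top(["AB-123-lHdfj", "ZZ-999-xxxxx", "AB123lHdfj"], "AB123-lHdfj"))
```

Like the SQL version, this computes the distance for every candidate, which is why the approach scales poorly on large tables.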

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.
If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:
WHERE code ILIKE '%AB-123-lHdfj%'
Use:
WHERE code ILIKE '%AB%123%lHdfj%'
Or, dynamically:
WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'
% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:
WHERE code ~* 'AB.?123.?lHdfj'
.? ... 0 or 1 characters
Or:
WHERE code ~* 'AB\-?123\-?lHdfj'
\-? ... 0 or 1 dashes
You may want to escape special characters in LIKE or regexp patterns. See:
Escape function for regular expression or LIKE patterns
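If the pattern is built in application code rather than in SQL, the same idea can be sketched in Python. This assumes the default backslash escape character for LIKE in PostgreSQL; the function name `like_pattern` is made up for illustration. The result should still be passed as a bind parameter, not concatenated into the SQL string.

```python
def like_pattern(term: str, separator: str = "-", escape: str = "\\") -> str:
    """Build a '%...%' LIKE pattern where each separator matches anything.

    Wildcards already present in the user input are escaped first, so that
    only the wildcards added here act as wildcards.
    """
    escaped = (
        term.replace(escape, escape + escape)  # escape the escape character itself
            .replace("%", escape + "%")
            .replace("_", escape + "_")
    )
    return "%" + escaped.replace(separator, "%") + "%"


if __name__ == "__main__":
    print(like_pattern("AB-123-lHdfj"))
```

The output would then be used as the parameter in `code ILIKE ?`.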
If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:
There is full text search, of course. But this may be an overkill in your case.
A more likely candidate is trigram-matching with the additional module pg_trgm. See:
Using Levenshtein function on each element in a tsvector?
PostgreSQL LIKE query performance variations
Related blog post by Depesz
Trigram indexes can be combined with LIKE and ILIKE since PostgreSQL 9.1, and with ~ and ~* since PostgreSQL 9.3.
Also interesting in this context: the similarity() function or % operator of that module.
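For intuition about what trigram similarity measures, here is a rough Python approximation. It assumes the behaviour described in the pg_trgm documentation (lowercasing, splitting on non-alphanumerics, padding each word with two leading spaces and one trailing space, then shared trigrams divided by all distinct trigrams), and it only handles ASCII letters and digits. It is a sketch, not the module's actual code.

```python
import re


def trigrams(text: str) -> set:
    """Return the set of padded trigrams for every alphanumeric word in text."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(similarity("AB-123-lHdfj", "AB123lHdfj"))
```

Because the dashes split the code into separate words, the dashed and undashed spellings share only part of their trigrams, so the score is well below 1 even though the codes look nearly identical.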
Last but not least you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column and search with terms transformed the same way.
Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_trgm from above.
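The normalization idea can be sketched in a few lines of Python. The rule used here (lowercase, drop everything except ASCII letters and digits) is an assumption; the real rule depends on what your codes look like, and the function name `normalize_code` is hypothetical.

```python
import re


def normalize_code(code: str) -> str:
    """Lowercase the code and strip every character that is not a-z or 0-9."""
    return re.sub(r"[^0-9a-z]", "", code.lower())


def matches(stored_code: str, search_term: str) -> bool:
    """Substring match on the normalized forms of both strings."""
    return normalize_code(search_term) in normalize_code(stored_code)


if __name__ == "__main__":
    for term in ("AB-123-lHdfj", "AB123-lHdfj", "AB123lHdfj"):
        print(term, "->", normalize_code(term), matches("AB-123-lHdfj", term))
```

The important point is that the stored value and the search term go through exactly the same transformation, whether that happens in the application or in an IMMUTABLE SQL function.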
Overview of pattern-matching techniques:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Related

Using index on `LIKE :varname || '%'` in firebird

I have a query
SELECT DISTINCT FKDOCUMENT
FROM PNTM_DOCUMENTS_FT_INDEX
WHERE WORD LIKE 'sometext%'
PLAN SORT ((PNTM_DOCUMENTS_FT_INDEX INDEX (IX_PNTM_DOCUMENTS_FT_INDEX)))
And it works okay.
But when I try to use a concatenated string with LIKE, Firebird does not use indexes:
SELECT DISTINCT FKDOCUMENT
FROM PNTM_DOCUMENTS_FT_INDEX
WHERE WORD LIKE 'sometext' || '%'
PLAN SORT ((PNTM_DOCUMENTS_FT_INDEX NATURAL))
How can I force it to use indexes?
The short answer, as ain already commented, is to use STARTING [WITH] instead of LIKE if you don't need a like pattern, but always want to do a prefix search. So:
WHERE WORD STARTING WITH 'sometext' -- No %!
or
WHERE WORD STARTING WITH :param
As far as I know this is exactly what Firebird does with LIKE 'sometext%'. This will use an index when available, and you don't need to escape it for presence of like pattern symbols. The downside is that you can't use like pattern symbols.
Now as to why Firebird doesn't use an index when you use
WHERE WORD LIKE :param || '%' -- (or LIKE :param, for that matter)
or
WHERE WORD LIKE 'sometext' || '%'
The first case is easily explained: statement preparation is done separately from execution. Firebird needs to take into account the possibility that the parameter value starts with a _ or - worse - a %, and it can't use an index for that.
As to the second case, it should be possible to optimize it to the equivalent of LIKE 'sometext%', but Firebird probably considers anything that is not a plain literal as not optimizable. For this specific example it would be possible to decide it should be optimizable, but this is a very specific exception (usually one doesn't concatenate literals like this; most of the time one or more 'black boxes' like columns, functions, case statements etc. are involved).

String matching in PostgreSQL

I need to implement a regular expression (as I understand) matching in PostgreSQL 8.4. It seems regular expression matching are only available in 9.0+.
My need is:
When I give an input 14.1 I need to get these results:
14.1.1
14.1.2
14.1.Z
...
But exclude:
14.1.1.1
14.1.1.K
14.1.Z.3.A
...
The pattern is not limited to a single character. There is always a possibility that a pattern like this will be presented: 14.1.1.2K, 14.1.Z.13.A2 etc., because the pattern is provided by the user. The application has no control over the pattern (it's not a version number).
Any idea how to implement this in Postgres 8.4?
After one more question my issue was solved:
Escaping a LIKE pattern or regexp string in Postgres 8.4 inside a stored procedure
Regular expression matching has been in Postgres practically forever, at least since version 7.1. Use these operators:
~ !~ ~* !~*
For an overview, see:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
The point in your case seems to be to disallow more dots:
SELECT *
FROM tbl
WHERE version LIKE '14.1.%' -- for performance
AND version ~ '^14\.1\.[^.]+$'; -- for correct result
The LIKE expression is redundant, but it is going to improve performance dramatically, even without index. You should have an index, of course.
The LIKE expression can use a basic text_pattern_ops index, while the regular expression cannot, at least in Postgres 8.4.
Or with COLLATE "C" since Postgres 9.1. See:
Is there a difference between text_pattern_ops and COLLATE "C"?
PostgreSQL LIKE query performance variations
[^.] in the regex pattern is a character class that excludes the dot (.). So more characters are allowed, just no more dots.
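The effect of the pattern can be checked outside the database. The following Python sketch applies the same regular expression to the sample values from the question; the helper name `direct_children` is made up, and `re.escape` stands in for whatever escaping you apply to user input on the SQL side.

```python
import re


def direct_children(prefix: str, versions):
    """Keep values that are the prefix, a dot, and one more dot-free segment."""
    # Equivalent in spirit to: version ~ '^14\.1\.[^.]+$'
    pattern = re.compile(re.escape(prefix) + r"\.[^.]+")
    return [v for v in versions if pattern.fullmatch(v)]


if __name__ == "__main__":
    sample = ["14.1.1", "14.1.2", "14.1.Z", "14.1.1.1", "14.1.1.K",
              "14.1.Z.3.A", "14.1.2K", "14.10.1", "14.1"]
    print(direct_children("14.1", sample))
```

Note that 14.10.1 is rejected because the dot after the prefix is escaped and therefore has to be a literal dot.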
Performance
To squeeze out top performance for this particular query you could add a specialized index:
CREATE INDEX tbl_special_idx ON tbl
((length(version) - length(replace(version, '.', ''))), version text_pattern_ops);
And use a matching query, the same as above, just replace the last line with:
AND length(version) - length(replace(version, '.', '')) = 2
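The dot-counting expression is easy to verify by hand. This small Python sketch mirrors `length(version) - length(replace(version, '.', ''))` and combines it with the prefix test; the function names are illustrative only.

```python
def dot_count(version: str) -> int:
    """Number of dots, computed the same way as the SQL expression."""
    return len(version) - len(version.replace(".", ""))


def children_by_dot_count(prefix: str, versions):
    """Values starting with 'prefix.' that have exactly one more dot than prefix."""
    wanted = dot_count(prefix) + 1
    return [v for v in versions
            if v.startswith(prefix + ".") and dot_count(v) == wanted]


if __name__ == "__main__":
    print(dot_count("14.1.Z"), dot_count("14.1.1.1"))
    print(children_by_dot_count("14.1", ["14.1.1", "14.1.Z", "14.1.1.1", "14.10.1"]))
```

One difference from the regular expression is that this variant also accepts a value with an empty last segment, such as `14.1.`.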
You can't do regex matching, but I believe you can do like operators so:
SELECT * FROM table WHERE version LIKE '14.1._';
Will match any row with a version of '14.1.' followed by a single character. This should match your examples. Note that this will not match just '14.1', if you needed this as well. You could do this with an OR.
SELECT * FROM table WHERE version LIKE '14.1._' OR version = '14.1';
Regex matching should be possible with Postgresql-8.4 like this:
SELECT * FROM table WHERE version ~ '^14\.1\..$';

SELECT by string prefix using an index

You have a column foo, of some string type, with an index on that column. You want to SELECT from the table WHERE the foo column has the prefix 'pre'. Obviously, the index should be able to help here.
Here is the most obvious way to search by prefix:
SELECT * FROM tab WHERE foo LIKE 'pre%';
Unfortunately, this does not get optimized to use the index (in Oracle or Postgres, at least).
The following, however, does work:
SELECT * FROM tab WHERE 'pre' <= foo AND foo < 'prf';
But are there better ways to accomplish this, or are there ways of making the above more elegant? In particular:
I need a function from 'pre' to 'prf', but this has to work for any underlying collation. Also, it's more complicated than above, because if searching for e.g. 'prz' then the upper bound would have to be 'psa', and so on.
Can I abstract this into a stored function/procedure and still hit the index? So I could write something like ... WHERE prefix('pre', foo);?
Answers for all DBMSes appreciated.
The database is quite important here. It so happens that SQL Server does this optimization for like.
One way is to do something like this:
where foo >= 'pre' and foo <= 'pre' + '~'
'~' has the largest 7-bit ASCII value of a printable character, so it is basically bigger than anything else. This however, may be a problem if you are using wide characters or a non-standard character set.
You cannot abstract this into a function, because use of a function generally precludes the use of indexes. If you are always looking at the first three characters, then in Oracle you can create an index on those three characters (something called a "function-based index").
How about
select * from tab where foo between 'pre' and 'prf' and foo != 'prf'
this enables the index same way. The RDBMS must be pretty dumb not to use an index for that.
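The 'pre' to 'prf' mapping asked about in the question can be sketched in Python, with a sorted list standing in for the index. This assumes plain code-point ordering (comparable to a "C" collation); under linguistic collations the next string is not found by incrementing a code point, so treat this as an illustration of the range idea only. All names are hypothetical.

```python
from bisect import bisect_left

MAX_CHAR = chr(0x10FFFF)  # largest code point; it cannot be incremented


def prefix_upper_bound(prefix: str):
    """Smallest string greater than every string that starts with prefix.

    Returns None when no such bound exists (empty prefix or only MAX_CHAR).
    """
    trimmed = prefix.rstrip(MAX_CHAR)
    if not trimmed:
        return None
    return trimmed[:-1] + chr(ord(trimmed[-1]) + 1)


def prefix_search(sorted_values, prefix: str):
    """Range scan over a sorted list: prefix <= value < upper bound."""
    lo = bisect_left(sorted_values, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_values) if upper is None else bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]


if __name__ == "__main__":
    values = sorted(["pra", "pre", "prefix", "present", "prez", "prf"])
    print(prefix_upper_bound("pre"))
    print(prefix_search(values, "pre"))
```

In code-point order the exclusive bound for 'prz' is 'pr{' rather than 'psa', which is fine because the bound only has to sort after every string with that prefix.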

How to find strings which are similar to given string in SQL server?

I have a SQL server table which contains several string columns. I need to write an application which gets a string and search for similar strings in SQL server table.
For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:
1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری
I've searched the net for a solution but I have found nothing useful. I've read this question , but this will not help me because:
I am using MS SQL Server not MySQL
my table contents are in Persian, so I can't use Levenshtein distance and similar methods
I prefer an SQL Server only solution, not an indexing or daemon based solution.
The best solution would be a solution which help us sort result by similarity, but, its optional.
Do you have any suggestion for that?
Thanks
MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?
SELECT * FROM table WHERE input LIKE '%مختار%'
Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?
-- This part finds the string you want to search for
DECLARE @MyString nvarchar(max)
SET @MyString = (SELECT column FROM table
WHERE **LOGIC TO FIND THE STRING GOES HERE**)
-- This part searches for that string and ranks matches by length difference
SELECT searchColumn, ABS(LEN(searchColumn) - LEN(@MyString)) AS Similarity
FROM table WHERE searchColumn LIKE '%' + @MyString + '%'
ORDER BY Similarity, searchColumn
The similarity part is something like the thing you posted. If the strings are "more similar" meaning that they have a similar length, they will be higher on the results query.
The absolute part can be avoided obviously but I did it just in case.
Hope that helps =-)
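The ranking used in that query, substring match first and then ordering by difference in length, can be tried on the sample data from the question with a short Python sketch. The function name is made up, and this is the heuristic from the answer above, not a real similarity measure.

```python
def rank_by_length_difference(candidates, term: str):
    """Keep candidates containing term; closest length first, ties by value."""
    hits = [c for c in candidates if term in c]
    return sorted(hits, key=lambda c: (abs(len(c) - len(term)), c))


if __name__ == "__main__":
    names = ["شهاب الدین مختاری", "شهاب", "مختاری", "شهاب مختاری"]
    print(rank_by_length_difference(names, "مختار"))
    print(rank_by_length_difference(names, "مختر"))
```

The second call returns nothing: "مختر" is not a substring of any stored value, so a LIKE-based approach cannot cover the misspelled input from the question.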
Besides the LIKE operator, you can use a condition such as WHERE instr(columnname, search) > 0 (in SQL Server the corresponding function is CHARINDEX(search, columnname)); however, this is generally slower. It returns the starting position of one string within another, so searching ABCDEFG for CD returns 3. Since 3 > 0, the record is returned. In the case you've described, though, LIKE seems to be the best solution.
The general problem is that in some languages the same letter has different written forms at the beginning, in the middle and at the end of a word, and thus different codes. We can try to use specific Persian collations, but in general this will not help.
The second option is to use SQL Server's full-text search abilities, but again, if it has no special language module for the language, it is much less useful.
The most general way is to do your own language processing, which is a very complex task. The following keywords and Google can help you understand the size of the problem: DLP, words and terms, bigrams, n-grams, grammar and morphology, inflection.
Try to use the Built-in Soundex() And Difference() functions. I hope they work fine for Persian.
Look at the following reference:
http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
The Similarity() function helps you sort results by similarity (as you asked in your question). It can also use algorithms other than Levenshtein edit distance, depending on the value of the @method parameter:
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm
The LIKE operator may not do what he is asking for. For example, say I have the value 'please, i want to ask a question' in a database record, and in my query I want to find a similar match for 'Can i ask a question, please'. LIKE can only test this with patterns such as '%[your sentence]' or '[your sentence]%', so it is not advisable for string similarity: sentences vary, and your LIKE logic may not fetch the matching records. It is better to use naive Bayes text classification, assigning labels to your sentences, or you can try the semantic search function in MS SQL Server.

SQL Contains - only match at start

For some reason I cannot find the answer on Google! With the SQL CONTAINS function, how can I tell it to match only at the beginning of a string? I.e., I am looking for the full-text equivalent of
LIKE 'some_term%'.
I know I can use like, but since I already have the full-text index set up, AND the table is expected to have thousands of rows, I would prefer to use Contains.
Thanks!
You want something like this:
Rather than specify multiple terms, you can use a 'prefix term' if the
terms begin with the same characters. To use a prefix term, specify
the beginning characters, then add an asterisk (*) wildcard to the end
of the term. Enclose the prefix term in double quotes. The following
statement returns the same results as the previous one.
-- Search for all terms that begin with 'storm'
SELECT StormID, StormHead, StormBody FROM StormyWeather
WHERE CONTAINS(StormHead, '"storm*"')
http://www.simple-talk.com/sql/learn-sql-server/full-text-indexing-workbench/
You can use CONTAINS with a LIKE subquery for matching only a start:
SELECT *
FROM (
SELECT *
FROM myTable WHERE CONTAINS(edition, '"Alice in wonderland"')
) AS S1
WHERE S1.edition LIKE 'Alice in wonderland%'
This way, the slow LIKE query will be run against a smaller set
The only solution I can think of is to actually prepend a unique word to the beginning of every field in the table.
e.g. Update every row so that 'xfirstword ' appears at the start of the text (e.g. Field1). Then you can search for CONTAINS(Field1, 'NEAR ((xfirstword, "TERM*"),0)')
Pretty crappy solution, especially as we know that the full text index stores the actual position of each word in the text (see this link for details: http://msdn.microsoft.com/en-us/library/ms142551.aspx)
I am facing a similar issue. This is what I have implemented as a workaround.
I have made another table and pulled in only the rows matching LIKE 'some_term%'.
On this new table I have then implemented the full-text search.
Please let me know if you have tried a better approach.