Postgresql prefix wildcard for full text - sql

I am trying to run a fulltext query using Postgresql that can cater for partial matches using wildcards.
It seems easy enough to have a postfix wildcard after the search term, however I cannot figure out how to specify a prefix wildcard.
For example, I can perform a postfix search easily enough using something like..
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") ## to_tsquery('simple', 'don:*') )
should return results matching "London"
However I cant seem to do a prefix search like...
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") ## to_tsquery('simple', ':*don') )
Ideally I'd like to have a wildcard prefixed to the front and end of the search term, something like...
SELECT "t1".*
FROM "t1"
WHERE (to_tsvector('simple', "t1"."city") ## to_tsquery('simple', ':*don:*') )
I can use a LIKE condition however I was hoping to benefit from the performance of the full text search features in Postgres.

Full text search is good for finding words, not substrings.
For substring searches you'd better use like '%don%' with pg_trgm extension available from PostgreSQL 9.1 and using gin (column_name gin_trgm_ops) or using gist (column_name gist_trgm_ops) indexes. But your index would be very big (even several times bigger than your table) and write performance not very good.
There's a very good example of using pg_trgm for substring search on select * from depesz blog.

One wild and crazy way of doing it would be to create a tsvector index of all your documents, reversed. And reverse your queries for postfix search too.
This is essentially what Solr does with its ReversedWildcardFilterFactory
select
reverse('brown fox')::tsvector ## (reverse('rown') || ':*')::tsquery --true

Related

PostgreSQL Full Text Search with substrings

I'm trying to create the fastest way to search millions (80+ mio) of records in a PostgreSQL (version 9.4), over multiple columns.
I would like to try and use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search followed https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like some more flexible way to search.
Currently, if I have a column containing ex. "Volvo" and one containing "Blue" I am able to find the record with the search string "volvo blue", but I would like to also find the record using "volvo blu" as if I used LIKE and "%blu%'.
Is that possible with full text search?
The only option to something like this is by using the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to an third-party solution.

SQL Server full text search and spaces

I have a column with a product names. Some names look like ‘ab-cd’ ‘ab cd’
Is it possible to use full text search to get these names when user types ‘abc’ (without spaces) ? The like operator is working for me, but I’d like to know if it’s possible to use full text search.
If you want to use FTS to find terms that are adjacent to each other, like words separated by a space you should use a proximity term.
You can define a proximity term by using the NEAR keyword or the ~ operator in the search expression, as documented here.
So if you want to find ab followed immediately by cd you could use the expression,
'NEAR((ab,cd), 0)'
searching for the word ab followed by the word cd with 0 terms in-between.
No, unfortunately you cannot make such search via full-text. You can only use LIKE in that case LIKE ('ab%c%')
EDIT1:
You can create a view (WITH SCHEMABINDING!) with some id and column name in which you want to search:
CREATE VIEW dbo.ftview WITH SCHEMABINDING
AS
SELECT id,
REPLACE(columnname,' ','') as search_string
FROM YourTable
Then create index
CREATE UNIQUE CLUSTERED INDEX UCI_ftview ON dbo.ftview (id ASC)
Then create full-text search index on search_string field.
After that you can run CONTAINS query with "abc*" search and it will find what you need.
EDIT2:
But it wont help if search_string does not start with your search term.
For example:
ab c d -> abcd and you search cd
No. Full Text Search is based on WORDS and Phrases. It does not store the original text. In fact, depending on configuration it will not even store all words - there are so called stop words that never go into the index. Example: in english the word "in" is not selective enough to be considered worth storing.
Some names look like ‘ab-cd’ ‘ab cd’
Those likely do not get stored at all. At least the 2nd example is actually 2 extremely short words - quite likely they get totally ignored.
So, no - full text search is not suitable for this.

SQL Server Full Text Search matches part of a word, even without wildcard

Take this query:
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle"')
And these two text for the columns being search:
http://textuploader.com/5bg5r
http://textuploader.com/5bg59
Why does that one match? I cannot for find an exact match in either of those texts. And as far as I know only the partial match should show up if I use the following query:
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle*"')
Anyone know what's going on?
Full-Text works on selected language grammar and vocabulary basis, not on simple character comparison like LIKE would do. Each language defines stemmers and word breakers. I can't say weather øgle is a full word by itself and how is your FT index treating that ø. My suspicion is that your index is not created with Danish language rules. If your index is indeed using the correct language, then you need to check the stemmer and breakers rules in use for that language.
Update
Actually I think is simpler. The presence of "" makes the search term a prefix term, event without an *. MSDN is a bit ambiguous here, because for example in Performing Prefix Searches it states:
When the prefix term is a phrase, each token making up the phrase is considered a separate prefix term. All rows that have words beginning with the prefix terms will be returned. For example, the prefix term "light bread*" will find rows with text of either "light breaded," "lightly breaded," or "light bread", but will not return "Lightly toasted bread".
Note how light in the example is a prefix and does not require light*. I do not have a system to test, so is a bit of speculation on my side, but I suspect that CONTAINS will consider "øgle" as a case insensitive prefix search and then your text contains two matches for Øgledronning and Øgledronningens.
Change COLLATE Latin1_General_CS_AS
for example query will look like
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle*"')
AND [Description] COLLATE Latin1_General_CS_AS LIKE '%"øgle*"%'

How to create simple fuzzy search with PostgreSQL only?

I have a little problem with search functionality on my RoR based site. I have many Produts with some CODEs. This code can be any string like "AB-123-lHdfj". Now I use ILIKE operator to find products:
Product.where("code ILIKE ?", "%" + params[:search] + "%")
It works fine, but it can't find product with codes like "AB123-lHdfj", or "AB123lHdfj".
What should I do for this? May be Postgres has some string normalization function, or some other methods to help me?
Postgres provides a module with several string comparsion functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.
Example:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
Try this query sample: (with your own object names and data of course)
SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10
This query says:
Give me the top 10 results of all data from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is less than 3. You will get back all rows where the value of code is within 3 characters difference to 'AB123-lHdfj'...
Note: if you get an error like:
function levenshtein(character varying, unknown) does not exist
Install the fuzzystrmatch extension using:
test=# CREATE EXTENSION fuzzystrmatch;
Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row. That's expensive and cannot use an index. The "accelerated" variant levenshtein_less_equal() is faster for long strings, but still slow without index support.
If your requirements are as simple as the example suggests, you can still use LIKE. Just replace any - in your search term with % in the WHERE clause. So instead of:
WHERE code ILIKE '%AB-123-lHdfj%'
Use:
WHERE code ILIKE '%AB%123%lHdfj%'
Or, dynamically:
WHERE code ILIKE '%' || replace('AB-123-lHdfj', '-', '%') || '%'
% in LIKE patterns stands for 0-n characters. Or use _ for exactly one character. Or use regular expressions for a smarter match:
WHERE code ~* 'AB.?123.?lHdfj'
.? ... 0 or 1 characters
Or:
WHERE code ~* 'AB\-?123\-?lHdfj'
\-? ... 0 or 1 dashes
You may want to escape special characters in LIKE or regexp patterns. See:
Escape function for regular expression or LIKE patterns
If your actual problem is more complex and you need something faster then there are various options, depending on your requirements:
There is full text search, of course. But this may be an overkill in your case.
A more likely candidate is trigram-matching with the additional module pg_trgm. See:
Using Levenshtein function on each element in a tsvector?
PostgreSQL LIKE query performance variations
Related blog post by Depesz
Can be combined it with LIKE, ILIKE, ~, or ~* since PostgreSQL 9.1.
Also interesting in this context: the similarity() function or % operator of that module.
Last but not least you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj --> ab123lhdfj, save it in an additional column and search with terms transformed the same way.
Or use an index on the expression instead of the redundant column. (Involved functions must be IMMUTABLE.) Possibly combine that with pg_tgrm from above.
Overview of pattern-matching techniques:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

How to disable word stemming in SQL Server Full Text Search?

I need to do a Full Text Search with NO word stemming...I tried to wrap the term I'm searching in double quotes, but no joy... I still get results like "bologna" when I search for '"bolognalo"'
Any help is appreciated..
Switch from using FREETEXT to CONTAINS.
I assume that you're currently using FREETEXT because stemming is automatically applied to FREETEXT queries, whereas CONTAINS doesn't use stemming by default.
A second, inferior, option is to specify language neutrality in your FREETEXT query:
SELECT *
FROM my_table
WHERE FREETEXT(my_column, 'my search', LANGUAGE 0x0)
If you use this then no other language-specific rules will be applied either (eg, word breaking, stopwords etc).
After too many days spend in try, finally
I can do this:
I recreate catalog setting the language to 0 (neutral)
CREATE FULLTEXT INDEX ON table_name
(DescriptionField LANGUAGE 0)
KEY INDEX idx_DescriptionField
ON catalog_name
and after in each query with contains I set the language to 0
select * from table_name where contains(DescriptionField,'bolognolo',LANGUAGE 0)
Before I couldn't do this because I didn't do the first step
Thank you very much!
Maybe setting the language of the fulltextindex to neutral will do the trick...
(Though, then you'd never get stemming at all...)