Full-text search with PHP / SQL

I do not get any results for the following query:
"SELECT * FROM test2 WHERE MATCH(txt) AGAINST('hello' IN BOOLEAN MODE)"
while test2 looks like:
id | txt
1 | ...
2 | ...
3 | ...
4 | ...
.. | ...
txt is a TEXT column about 30 characters long and has a FULLTEXT index. I have about 16 records (a tiny db), and the word hello appears in almost every record's txt along with other words. I just wanted to learn how full-text search works, but I get zero results and I can't understand why.

There are two reasons why you might not be getting any results.
Reason 1: your search word 'hello' occurs in too many rows.

A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.

Source: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
(Note that the 50% threshold applies only to natural-language searches; a query IN BOOLEAN MODE, like yours, skips it, which makes reason 2 worth checking first.)
Reason 2: your search word 'hello' is on the stop-word list.
Any word on the stopword list will never match!
Source: http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html
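The 50% rule is easy to illustrate outside MySQL. The following is a toy document-frequency check in plain Python, not MySQL's actual implementation; the sample rows and the helper function are invented for the example. A term appearing in half or more of the rows is simply dropped, which is exactly what makes 'hello' unmatchable in natural-language mode:

```python
# Toy illustration of MySQL's 50% rule in natural-language full-text mode.
# This is NOT MySQL's implementation, just the document-frequency idea behind it;
# the sample rows are invented.
rows = [
    "hello world",
    "hello there",
    "hello again",
    "goodbye world",
]

def matches_natural_language(term, rows):
    """Return rows containing `term`, unless it appears in >= 50% of them."""
    hits = [r for r in rows if term in r.split()]
    if len(hits) >= len(rows) / 2:  # too common: treated as noise, no match
        return []
    return hits

print(matches_natural_language("hello", rows))    # [] - 'hello' is in 3 of 4 rows
print(matches_natural_language("goodbye", rows))  # ['goodbye world']
```

Boolean mode skips this frequency check, which is why stopwords (reason 2) are the usual suspect there.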

Related

Postgres regex rowwise on large comma separated text string

I have created a table with various columns to filter on, such as id, date, and regex_col in the examples below. The goal of this database is to let the user filter down to the json_b_value they are looking for. The current database is not very large, around 100M rows.
I have pulled the row names out of json_b_value to create regex_col, with the thought that I can index regex_col in some way and let users regex-search for the json_b_value they are looking for. The text in regex_col is stored as one large comma-separated string, with the number of words ranging from 10 to 150.
id | date | regex_col                    | json_b_value
-- | ---- | ---------------------------- | ------------
1  | 2019 | 'some','stuff','to','search' | json
2  | 2018 | 'different','stuff','other'  | json
3  | 2019 | 'lots','of','stuff'          | json
The user will interact and search this column using a selectize.js dropdown. A separate table takes all of the comma separated words from regex_col and binds them all together rowwise, like below. Then words matching their search will populate as they type, any words not matching anything will result in a null.
search_words |
'some'
'stuff'
'to'
'search'
What would be an effective way to index the regex_col? Is this the optimal way to do this, should I even be creating the regex_col or should I be trying to optimize around the json_b_value?
example of json value for id 1 below
[{"regex_col":"some","current":100,"previous":200},{"regex_col":"stuff","current":200,"previous":400},{"regex_col":"to","current":300,"previous":600},{"regex_col":"search","current":400,"previous":800}]
There can be a lot of factors, but in most cases the best solution is the following:
Create a new table with two columns
id, term
and then populate
1 'some'
1 'stuff'
1 'to'
1 'search'
2 'different'
2 'stuff'
2 'other'
3 'lots'
3 'of'
3 'stuff'
Now put an index on this table and you are good to go.
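As a concrete sketch of that suggestion, here it is in SQLite via Python's sqlite3 (chosen for portability; the table name doc_terms and the sample rows are invented, and in Postgres the CREATE INDEX would look essentially the same):

```python
import sqlite3

# Normalized (id, term) lookup table suggested above, sketched in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_terms (id INTEGER, term TEXT)")
conn.executemany(
    "INSERT INTO doc_terms VALUES (?, ?)",
    [
        (1, "some"), (1, "stuff"), (1, "to"), (1, "search"),
        (2, "different"), (2, "stuff"), (2, "other"),
        (3, "lots"), (3, "of"), (3, "stuff"),
    ],
)
# The index turns "which rows contain this word?" into an index seek
# instead of a regex/LIKE scan over one long comma-separated string.
conn.execute("CREATE INDEX idx_doc_terms_term ON doc_terms (term)")

ids = [r[0] for r in conn.execute(
    "SELECT DISTINCT id FROM doc_terms WHERE term = ? ORDER BY id", ("stuff",))]
print(ids)  # [1, 2, 3]
```

The matching ids can then be joined back to the main table to fetch json_b_value.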

Efficiently return words that match, or whose synonym(s), match a keyword

I have a database of industry-specific terms, each of which may have zero or more synonyms. Users of the system can search for terms by keyword and the results should include any term that contains the keyword or that has at least one synonym that contains the keyword. The result should then include the term and ONLY ONE of the matching synonyms.
Here's the setup... I have a term table with 2 fields: id and term. I also have a synonym table with 3 fields: id, termId, and synonym. So there would be data like:
term Table
id | term
-- | -----
1 | dog
2 | cat
3 | bird
synonym Table
id | termId | synonym
-- | ------ | --------
1 | 1 | canine
2 | 1 | man's best friend
3 | 2 | feline
A keyword search for (the letter) "i" should return the following as a result:
id | term | synonym
-- | ------ | --------
1 | dog | canine <- because of the "i" in "canine"
2 | cat | feline <- because of the "i" in "feline"
3 | bird | <- because of the "i" in "bird"
Notice how, even though both "dog" synonyms contain the letter "i", only one was returned in the result (doesn't matter which one).
Because I need to return all matches from the term table regardless of whether or not there's a synonym and I need no more than 1 matching synonym, I'm using an OUTER APPLY as follows:
SELECT
    term.id,
    term.term,
    synonyms.synonym
FROM
    term
    OUTER APPLY (
        SELECT TOP 1
            term.id,
            synonym.synonym
        FROM
            synonym
        WHERE
            term.id = synonym.termId
            AND synonym.synonym LIKE @keyword
    ) AS synonyms
WHERE
    term.term LIKE @keyword
    OR synonyms.synonym LIKE @keyword
There are indexes on term.term, synonym.termId and synonym.synonym. @keyword is always something like '%foo%'. The problem is that, with close to 50,000 terms (not that much for databases, I know, but...), the performance is horrible. Any thoughts on how this can be done more efficiently?
Just a note, one thing I had thought to try was flattening the synonyms into a comma-delimited list in the term table so that I could get around the OUTER APPLY. Unfortunately though, that list can easily exceed 900 characters which would then prevent SQL Server from adding an index to that column. So that's a no-go.
Thanks very much in advance.
You've got a lot of unnecessary logic in there, and there's no telling what execution plan SQL Server will come up with. It's simpler, and more efficient, to split this up into two separate db calls and then merge them in your code:
Get matches based on synonyms:
SELECT
    term.id,
    term.term,
    synonyms.synonym
FROM
    term
    INNER JOIN synonym AS synonyms ON term.id = synonyms.termId
WHERE
    synonyms.synonym LIKE @keyword
Get matches based on terms:
SELECT
    term.id,
    term.term
FROM
    term
WHERE
    term.term LIKE @keyword
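Here is a runnable sketch of the two-call approach using SQLite via Python's sqlite3 (OUTER APPLY is SQL Server-specific, so the one-synonym-per-term pick is done with MIN(...) and GROUP BY instead, and the merge happens in application code); the sample data mirrors the question's tables:

```python
import sqlite3

# Data matching the question's term/synonym tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER, term TEXT);
CREATE TABLE synonym (id INTEGER, termId INTEGER, synonym TEXT);
INSERT INTO term VALUES (1, 'dog'), (2, 'cat'), (3, 'bird');
INSERT INTO synonym VALUES
    (1, 1, 'canine'), (2, 1, 'man''s best friend'), (3, 2, 'feline');
""")
keyword = "%i%"

# Call 1: terms with a matching synonym; MIN picks one arbitrary synonym per term.
syn_rows = conn.execute("""
    SELECT t.id, t.term, MIN(s.synonym)
    FROM term t JOIN synonym s ON t.id = s.termId
    WHERE s.synonym LIKE ?
    GROUP BY t.id, t.term
""", (keyword,)).fetchall()

# Call 2: terms that match the keyword directly.
term_rows = conn.execute(
    "SELECT id, term FROM term WHERE term LIKE ?", (keyword,)).fetchall()

# Merge in application code: direct matches first (no synonym),
# then synonym matches add their single synonym.
merged = {tid: (term, None) for tid, term in term_rows}
for tid, term, syn in syn_rows:
    merged[tid] = (term, syn)

for tid in sorted(merged):
    print(tid, *merged[tid])
# 1 dog canine
# 2 cat feline
# 3 bird None
```

This reproduces the expected result from the question: each matching term appears once, with at most one matching synonym.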
Regarding "flattening the synonyms into a comma-delimited list in the term table": have you considered using the Full-Text Search feature? It would be much faster, even as your data grows.
You can put all synonyms (comma-delimited) in the "synonym" column and put a full-text index on it.
If you also want results that match the synonyms of the words, I recommend using FREETEXT. This is an example:
SELECT Title, Text FROM [dbo].[Post] WHERE FREETEXT(Title, 'phone')
The previous query will match words related to 'phone' by meaning, not just the exact word. It will also compare the inflectional forms of the words. In this case it will return any title that has 'mobile', 'telephone', 'smartphone', etc.
Take a look at this article about SQL Server Full Text Search, hope it helps

Using Lucene Fuzzy search with a word that has no aliases

I want to do searches using fuzzy search. Using Luke to help me, if I search for a word that has aliases (e.g. similar words), it all works as expected.
However, if I enter a search term that doesn't have any similar words (e.g. a serial code), the search fails and I get no results, even though the term should be valid.
Do I need to structure my search in a different way? Why don't I get the same in the second search as the first, but with only one "term"?
You have not specified the Lucene version, so I will assume you are using 6.x.x.
The behavior you are seeing is the correct behavior of Lucene fuzzy search.
Refer to this, and I quote:
At most, this query will match terms up to 2 edits.
This roughly (though not precisely) means that two strings differing by at most two single-character edits, at any positions, will be returned as a match when using FuzzyQuery.
Below is sample output from one of my simple Java programs that illustrates this.
Let's say three indexed docs have a field with values
"123456787", "123456788", "123456789" (appended 7, 8 and 9 to 12345678).
Results:
No hits found for search string -> 123456 (edit distance = 3, last 3 digits are missing)
3 docs found!! for search string -> 1234567 (edit distance = 2)
3 docs found!! for search string -> 12345678 (edit distance = 1)
1 doc found!! for search string -> 1236787 (edit distance = 2 for the found one; missing 4, 5 and the last digit for the remaining two documents)
No hits found for search string -> 123678789 (edit distance = 4, missing 4, 5 and the last two digits)
So you should read more about Edit Distance.
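The hit counts above follow directly from plain Levenshtein (edit) distance combined with FuzzyQuery's default maximum of 2 edits. A quick pure-Python check (not Lucene code, just the distance metric) reproduces them:

```python
from functools import lru_cache

def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions each cost 1."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(
            d(i - 1, j) + 1,                           # deletion
            d(i, j - 1) + 1,                           # insertion
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitution
        )
    return d(len(a), len(b))

docs = ["123456787", "123456788", "123456789"]
for query in ["123456", "1234567", "12345678", "1236787", "123678789"]:
    # FuzzyQuery's default maximum is 2 edits.
    hits = [doc for doc in docs if levenshtein(query, doc) <= 2]
    print(query, "->", len(hits), "hits")
```

The loop prints 0, 3, 3, 1 and 0 hits respectively, matching the sample output above.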
If your requirement is to match N continuous characters without worrying about edit distance, then N-gram indexing using NGramTokenizer is the way to go.
See this too for more about N-grams.
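For contrast, here is a minimal sketch of the n-gram idea in plain Python (not Lucene's NGramTokenizer, and simplified to require every query gram to be present): contiguous substrings match regardless of edit distance.

```python
def ngrams(text, n=3):
    """All contiguous character n-grams of `text` (what an n-gram tokenizer emits)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Index the grams of each document (same sample values as above).
index = {doc: ngrams(doc) for doc in ["123456787", "123456788", "123456789"]}

def search(query, n=3):
    """Match docs that contain every n-gram of the query (a contiguous-run match)."""
    q = ngrams(query, n)
    return [doc for doc, grams in index.items() if q <= grams]

print(search("123456"))   # all three docs: '123456' occurs contiguously in each
print(search("1236787"))  # []: the gram '236' never occurs in any document
```

Note how this flips the earlier results: "123456" (which FuzzyQuery missed at edit distance 3) matches everything, while "1236787" (edit distance 2) matches nothing.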

"to_tsquery" on tsvector yields different results when using "simple" and "english"?

I've been enlisted to help on a project, and I'm diving back into PostgreSQL after not working with it for several years. Lack of use aside, I've never used tsvector fields before, and now I find myself facing a bug based on them. I read the documentation on the field type and its purpose, but I'm having a hard time digging up documentation on how 'simple' differs from 'english' as the first parameter to to_tsquery().
Example
> SELECT to_tsvector('mortgag') @@ to_tsquery('simple', 'mortgage')
?column?
----------
f
(1 row)
> SELECT to_tsvector('mortgag') @@ to_tsquery('english', 'mortgage')
?column?
----------
t
(1 row)
I would think they should both return true, but obviously the first does not - why?
The FTS utilizes dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.
So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
Let us look at some output from ts_debug and see what's going on with the dictionaries:
=> select * from ts_debug('english', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+-----------
asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}
=> select * from ts_debug('simple', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | mortgage | {simple} | simple | {mortgage}
Notice that simple uses the simple dictionary whereas english uses the english_stem dictionary.
The simple dictionary:
operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
The simple dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:
=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
to_tsquery | to_tsquery
------------+-------------
'mortgage' | 'mortgages'
The simple dictionary is too simple to even handle simple plurals.
So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:
12.6.6. Snowball Dictionary
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
And just below that we see the english_stem dictionary:
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
So the english_stem dictionary stems words and we can see that happen:
=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
to_tsquery | to_tsquery
------------+------------
'mortgag' | 'mortgag'
Executive summary: 'simple' implies simple-minded literal matching, while 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag, and that gives you your match.

How do I create sql query for searching partial matches?

I have a set of items in the db. Each item has a name and a description. I need to implement a search facility that takes a number of keywords and returns distinct items which have at least one of the keywords matching a word in the name or description.
For example, I have three items in the db:
1. item1:
name: magic marker
description: a writing device which makes erasable marks on whiteboard
2. item2:
name: pall mall cigarettes
description: cigarette named after a street in london
3. item3:
name: XPigment Liner
description: for writing and drawing
A search using keyword 'writing' should return magic marker and XPigment Liner
A search using keyword 'mall' should return the second item
I tried using the LIKE keyword and the IN keyword separately. For the IN keyword to work, the query has to be
SELECT DISTINCT name FROM mytable WHERE name IN ('pall mall cigarettes')
but
SELECT DISTINCT name FROM mytable WHERE name IN ('mall')
will return 0 rows.
I couldn't figure out how to make a query that accommodates both the name and description columns and allows partial word match..
Can somebody help?
Update:
I created the table through Hibernate and, for the description field, used the javax.persistence @Lob annotation. Examining the table using psql shows:
...
id | bigint | not null
description | text |
name | character varying(255) |
...
One of the records in the table is like,
id | description | name
21 | 133414 | magic marker
First of all, this approach won't scale in the large; you'll need a separate index from words to items (like an inverted index).
If your data is not large, you can do
SELECT DISTINCT(name) FROM mytable WHERE name LIKE '%mall%' OR description LIKE '%mall%'
using OR if you have multiple keywords.
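Here is a runnable sketch of that approach using SQLite through Python's sqlite3, loaded with the three sample items from the question; the OR clauses are generated, one name/description pair per keyword:

```python
import sqlite3

# The three sample items from the question, in an in-memory SQLite db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, name TEXT, description TEXT);
INSERT INTO mytable VALUES
    (1, 'magic marker', 'a writing device which makes erasable marks on whiteboard'),
    (2, 'pall mall cigarettes', 'cigarette named after a street in london'),
    (3, 'XPigment Liner', 'for writing and drawing');
""")

def search(keywords):
    """OR together one "name LIKE ? OR description LIKE ?" pair per keyword."""
    clause = " OR ".join("name LIKE ? OR description LIKE ?" for _ in keywords)
    params = [p for kw in keywords for p in (f"%{kw}%", f"%{kw}%")]
    rows = conn.execute(f"SELECT DISTINCT name FROM mytable WHERE {clause}", params)
    return [row[0] for row in rows]

print(sorted(search(["writing"])))  # ['XPigment Liner', 'magic marker']
print(sorted(search(["mall"])))     # ['pall mall cigarettes']
```

Because the keywords are bound as parameters, the generated query stays safe against SQL injection even though the clause text is built dynamically.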
This may work as well (note that CHARINDEX is SQL Server syntax; the PostgreSQL equivalent is STRPOS or POSITION):
SELECT *
FROM mytable
WHERE CHARINDEX('mall', name) > 0
   OR CHARINDEX('mall', description) > 0