Select closest match in dictionary order - sql

Supporse I have the fllowing table with words from a dictionary:
Word
---
cat
dog
giraffe
zebra
I would like to find a word, and if it doesn't exist, the closest before it in dictionary order, e.g. aardvark would return nothing, cat would return cat, cow would return cat, horse would return giraffe.
This should be relatively straightforward to search for using a BTREE index but I haven't figured out a way to do it. I'm using sqlite for this, but other engines are also acceptable.
I'm only interested in the dictionary order, i.e. the query should work exactly with the above examples. Other similarity metrics are of course nice, but are irrelevant to this question.

Assuming that you have declared the column with the correct collation for dictionary order (which might be the default, or COLLATE NOCASE, or a user-defined collation), getting an exact match is trivial:
SELECT Word FROM Dictionary WHERE Word = ?
and getting the closest before is easy:
SELECT MAX(Word) FROM Dictionary WHERE Word < ?
To get only the first result of these two queries, combine them with UNION ALL, and use LIMIT 1 so that the second query is ignored if the first one succeeds:
SELECT Word FROM Dictionary WHERE Word = ?
UNION ALL
SELECT MAX(Word) FROM Dictionary WHERE Word < ?
LIMIT 1

for approx match
select a.word,b.word from dictionary a, words b
where (b.word like '%'+right(a.word,2) )
or
for exact match
select a.word,b.word from dictionary a, words b
where soundex(a.word)=soundex(b.word) or

this may helps you
select a.word,b.word from dictionary a, words b
where difference (a.word,'DOG') in(3,4)
or you can use soundex function

Related

PostgreSQL - Query keyword patterns in columns in table

we all know in SQL we can query a column (lets say, column "breeds") for a certain word like "dog" via a query like this:
select breeds
from myStackOverflowDBTable
where breeds = 'dog'
However, say I had many more columns with much more data, say millions of records, and I did not want to find a word, but rather the most common keyword pattern or wildcard expression, a query like this:
SELECT *
FROM myStackOverflowDBTable
WHERE address LIKE '%alb%'"
Is there an efficient way to find these 'patterns' inside the columns using SQL? I need to find the most common substring so-to-speak, per the query above, say the wildcard string "alb" appeared the most in a "location" column that had words like Albany, Albuquerque, Alabama, obviously querying the words directly would yield 0 results but querying on that wildcard keyword pattern would yield many, but I want to find the most repeating or most frequent wildcard/keyword pattern/regex expression/substring (however you want to define it) for a given column - is there an easy way to do this without querying a million test queries and doing it manually???
Well, if you want to find three character patterns, you could extract all 3-character patterns, aggregate and count:
select substr(t.address, gs.i, 3) as ngram_3, count(*)
from t cross join lateral
generate_series(1, length(address) - 3, 1) gs(i)
group by ngram_3
order by count(*) desc
limit 100;

Index for comparing to beginning of every word in a column

So I have a table
id | name | gender
---+-----------------+-------
0 | Markus Meskanen | M
1 | Jack Jackson | M
2 | Jane Jackson | F
And I've created an index
CREATE INDEX people_name_idx ON people (LOWER(name));
And then I query with
SELECT * FROM people WHERE name LIKE LOWER('Jack%');
Where %(name)s is the user's input. However, it now matches only to the beginning of the whole column, but I'd like it to match to the beginning of any of the words. I'd prefer not to use '%Jack%' since it would also result into invalid results from the middle of the word.
Is there a way to create an index so that each word gets a separate row?
Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.
Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.
Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).
There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).
These can speed up the searches that you are trying to do.
The pg_trgm module does exactly what you want.
You need to create either:
CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);
Or:
CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);
See the difference here.
After that, these queries could use one of the indexes above:
SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';
As #GordonLinoff answered, full text search is also capable of searching by prefix matches. But FTS is not designed to do that efficiently, it is best in matching lexemes. Though if you want to achieve the best performace, I advise you to give it a try too & measure each. In FTS, your query looks something like this:
SELECT * FROM people WHERE to_tsvector('english', name) ## to_tsquery('english', 'Jack:*');
Note: however if your query filter (Jack) comes from user input, both of these queries above needs some protection (i.e. in the ILIKE one you need to escape % and _ characters, in the regexp one you need to escape a lot more, and in the FTS one, well you'll need to parse the query with some parser & generate a valid FTS' tsquery query, because to_tsquery() will give you an error if its parameter is not valid. And in plainto_tsquery() you cannot use a prefix matching query).
Note 2: the regexp variant with name ~* '\mJack' will work best with english names. If you want to use the whole range of unicode (i.e. you want to use characters, like æ), you'll need a slightly different pattern. Something like:
SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';
This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.
You can use Regex expressions to find text inside name:
create table ci(id int, name text);
insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');
select id, name
from ci
where name ~ 'Pe+'
;
Returns:
1 John McEnroe Blackbird Petrus
Or can use something similar where substring(name, <regex exp>) is not null
Check it here: http://rextester.com/LHA16094
If you know that the words are space separated, You can do
SELECT * FROM people WHERE name LIKE LOWER('Jack%') or name LIKE LOWER(' Jack%') ;
For more control you can use RegEx with MySQl
see https://dev.mysql.com/doc/refman/5.7/en/regexp.html

oracle text definescore with accum and Query rewriting

I am using Oracle text to search in a corpus of sentences
I want the scoring to be as counting the discrete occurrences only,
Example : My Query is ( dog cat table )
If it found the term " dog " it must count 1 even if the sentence has more than one "dog" term. If it found " dog cat " it must count 2 ... etc
I used this query, but it gives me 51 if it finds the two terms. I need to accumulate the discrete occurrences. So I want to override the behaviour of the scoring algorithm of Oracle Text.
select /*+ FIRST_ROWS(1)*/ sentence_id
,score(1) as sc
, isn
,sentence_length
from plag_docsentences
where contains(PROCESSED_TEXT,'DEFINESCORE(dog, DISCRETE*.01)
,DEFINESCORE(cat, DISCRETE*.01)'
,1)>0
order by score(1) desc
OK, I Solved that Issue.
suppose I find 2 terms out of 3, the score will be 67
which means ( 2/3=67 ) this is the default behavior of oracle text scoring alg.
so I derived an equation to find the number of occurrences (i.e number of terms in query found in the corpus sentence)
as follows:
x/query_lenght = score/100
then
x=query_lenght * score/100
this will find the number of matching words between the query and the corpus query
I hope this will help reasearchers in IR.

Oracle 'Contains' / 'Group' function return incorrect value

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces output below:
The question is:
Why does the SCORE(1) produce 9? As I recall that CONTAINS function returns number of occurrences of search_string (in this case '%sul%').
I expect the output should be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
And can someone please explain me what is the third parameter for?
Thanks in advance :)
The reason your second query is returning no rows is, you are looking for word sul in your search. Contains will not do pattern search unless you tell it to, it searches for words which you specified as your second paramter. To look for patterns, you will have to use wildcards, as you did in your first example.
Now, coming to the third parameter in CONTAINS - it is label and is just used to label the score operator. You should use the third parameter when you use SCORE in your SELECT list. It's importance is more clear when there are multiple SCORE operators
Quoting directly from documentaion
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news WHERE CONTAINS
(news.title, 'Oracle', 10) > 0 OR CONTAINS (news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);
The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Think of a google search. If you search for the term Oracle you will not find (directly) any result that may help to explain your scoring value questioning, so we can consider this term a "noise" to your expectations. But if you search for the term Oracle Text Scoring Algorithm you will find your answer in the first google result.
And about your other questionings, I think that #Incognito already gives them a good answer.

search criteria difference between Like vs Contains() in oracle

I created a table with two columns.I inserted two rows.
id name
1 narsi reddy
2 narei sia
one is simply number type and another one is CLOB type.So i decided to use indexing on that. I queried on that by using contains.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected 2 would come,but not. But if i give same with like it's given what i wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
finally i understood that like is searching whole document or paragraph but contains is looking in words.
so how can i get required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher - it recognises two wildcards (%) and (_) which match zero-or-more, or exactly-one, character respectively. In your case, %a%e% matches two records in your table - it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched" - no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.