Oracle 'Contains' / 'Group' function return incorrect value - sql

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces output below:
The question is:
Why does SCORE(1) produce 9? As I recall, the CONTAINS function returns the number of occurrences of search_string (in this case '%sul%').
I expect the output should be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
Also, can someone please explain what the third parameter is for?
Thanks in advance :)

The reason your second query returns no rows is that you are searching for the word sul. CONTAINS will not do a pattern search unless you tell it to; it searches for the words you specify in the second parameter. To look for patterns, you have to use wildcards, as you did in your first example.
Now, coming to the third parameter in CONTAINS: it is a label, used only to tie the CONTAINS clause to the SCORE operator. You should use the third parameter when you use SCORE in your SELECT list. Its importance is clearer when there are multiple SCORE operators.
Quoting directly from the documentation:
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news WHERE CONTAINS
(news.title, 'Oracle', 10) > 0 OR CONTAINS (news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);

The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Think of a Google search. If you search for the term Oracle, you will not (directly) find any result that helps explain your scoring question, so we can consider this term "noise" relative to your expectations. But if you search for Oracle Text Scoring Algorithm, you will find your answer in the first Google result.
As for your other questions, I think @Incognito has already answered them well.
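To make the inverse-frequency idea in this answer concrete, here is a minimal Python sketch. It assumes the per-term formula is 3f(1 + log(N/n)), which is how I recall the Oracle Text documentation stating it; the function name, the choice of natural logarithm, and the absence of any final scaling to the 0-100 range are assumptions for illustration only, so it will not reproduce the exact 9 seen in the question.

```python
import math


def inverse_frequency_score(f: int, n: int, N: int) -> float:
    """Raw inverse-frequency score for one term in one document.

    f: occurrences of the term in the document
    n: number of documents containing the term
    N: total number of documents in the set
    Assumption: natural log; the real engine also converts to an integer score.
    """
    if f <= 0 or n <= 0:
        return 0.0
    return 3 * f * (1 + math.log(N / n))


# A term found in few documents scores higher than one found in many.
rare = inverse_frequency_score(f=1, n=2, N=100)
common = inverse_frequency_score(f=1, n=90, N=100)
print(rare > common)
```

The point is only the shape of the behaviour: the score grows with occurrences inside the document and shrinks as the term becomes common across the document set, so it is not a plain occurrence count.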

Related

oracle text definescore with accum and Query rewriting

I am using Oracle Text to search a corpus of sentences.
I want the score to count discrete occurrences only.
Example: my query is (dog cat table).
If it finds the term "dog", it must count 1 even if the sentence contains "dog" more than once. If it finds "dog" and "cat", it must count 2, and so on.
I used this query, but it gives me 51 if it finds the two terms. I need to accumulate the discrete occurrences, so I want to override the default behaviour of the Oracle Text scoring algorithm.
select /*+ FIRST_ROWS(1)*/ sentence_id
,score(1) as sc
, isn
,sentence_length
from plag_docsentences
where contains(PROCESSED_TEXT,'DEFINESCORE(dog, DISCRETE*.01)
,DEFINESCORE(cat, DISCRETE*.01)'
,1)>0
order by score(1) desc
OK, I solved the issue.
Suppose I find 2 terms out of 3: the score will be 67,
which means 2/3 ≈ 67%. This is the default behaviour of the Oracle Text scoring algorithm.
So I derived an equation to find the number of occurrences (i.e. the number of query terms found in the corpus sentence)
as follows:
x / query_length = score / 100
then
x = query_length * score / 100
This gives the number of matching words between the query and the corpus sentence.
I hope this will help researchers in IR.
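The equation above can be sketched in Python as follows. The function name is hypothetical, and the rounding step is my assumption: because the score is reported as an integer percentage (67 rather than 66.67), the product has to be rounded to recover a whole number of terms.

```python
def matched_terms(score: int, query_length: int) -> int:
    """Invert score = 100 * x / query_length to recover x.

    Rounds to the nearest integer because the reported score is itself rounded.
    """
    return round(query_length * score / 100)


# Two of three query terms found gives a score of 67.
print(matched_terms(67, 3))
```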

Sql Server Contains search not Giving Result as expected

select * from table1 where contains(searchWord,"*comfort*")
I want the result to include
Uncomfortable
where the search word is in the middle, but it only shows
comfort xyz
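You do not need the CONTAINS function here. CONTAINS searches for precise or fuzzy (less precise) matches to single words and phrases, words within a certain distance of one another, or weighted matches in SQL Server. Its wildcard support is limited to prefix terms such as "comfort*", so a word like Uncomfortable, where the text appears mid-word, is not matched.
A simple LIKE predicate gives the required result: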
select * from table1 where searchWord like '%comfort%'
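The difference between the two behaviours can be illustrated with a small Python sketch. Both function names are hypothetical, and the word splitting is a simplification of what a full-text word breaker does; it is meant only to show why a prefix-term search finds "comfort xyz" but not "Uncomfortable", while a substring search finds both.

```python
def prefix_term_match(text: str, prefix: str) -> bool:
    """True if any whitespace-separated word starts with the prefix."""
    return any(word.lower().startswith(prefix.lower()) for word in text.split())


def substring_match(text: str, fragment: str) -> bool:
    """True if the fragment occurs anywhere in the text, like LIKE '%fragment%'."""
    return fragment.lower() in text.lower()


for value in ["comfort xyz", "Uncomfortable"]:
    print(value, prefix_term_match(value, "comfort"), substring_match(value, "comfort"))
```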

search criteria difference between Like vs Contains() in oracle

I created a table with two columns and inserted two rows.
id name
1 narsi reddy
2 narei sia
One column is a plain NUMBER type and the other is a CLOB, so I decided to create a text index on it and query it using CONTAINS.
Query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected two rows to come back, but only one did. If I run the same pattern with LIKE, it returns what I wanted.
Query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
Finally I understood that LIKE searches the whole document or paragraph, whereas CONTAINS looks within words.
So how can I get the required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher. It recognises two wildcards, % and _, which match zero or more characters and exactly one character respectively. In your case, %a%e% matches two records in your table: it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched", with no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.
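The word-level versus whole-string distinction in this answer can be sketched in Python. This is an illustration only: the regex `a.*e` stands in for the pattern '%a%e%', splitting on whitespace stands in for the text index's tokenisation, and all three function names are hypothetical.

```python
import re

# Rough regex equivalent of the pattern '%a%e%': an 'a' followed later by an 'e'.
PATTERN = re.compile(r"a.*e")


def like_match(text: str) -> bool:
    """Pattern applied to the whole string, as LIKE does."""
    return PATTERN.search(text) is not None


def word_match(text: str) -> bool:
    """Pattern applied to each word separately, as the CONTAINS query behaves."""
    return any(PATTERN.search(word) for word in text.split())


def combined_match(text: str) -> bool:
    """Index-style candidate test ('%a% AND %e%') followed by the LIKE filter."""
    words = text.split()
    candidate = any("a" in w for w in words) and any("e" in w for w in words)
    return candidate and like_match(text)


rows = {1: "narsi reddy", 2: "narei sia"}
for row_id, name in rows.items():
    print(row_id, like_match(name), word_match(name), combined_match(name))
```

Row 1 fails the word-level test because no single word holds an 'a' followed by an 'e', yet it passes the combined test, which is the behaviour the suggested WHERE clause is aiming for.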

Custom SQL sort by

Use:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time it also matches against a list of other postcodes which are in its radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14, then RG18, etc.). I'm unsure whether specifying descending will remove the ordering or not.
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do, since FIELD() returns 0 for them and those records will otherwise appear before any explicitly ordered records, that is, before 'RG20' in the example above.)
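As a language-neutral illustration of the prefix-ranking idea in these answers, here is a minimal Python sketch. The function names and every sample postcode other than 'RG20 7TT' are made up for the example. Unlisted prefixes are given rank 99, mirroring the Else 99 branch of the CASE answer, so they sort after the listed ones instead of before them.

```python
PREFIX_ORDER = ["RG20", "RG14", "RG18", "RG17", "RG28", "OX12", "OX11"]
RANK = {prefix: position for position, prefix in enumerate(PREFIX_ORDER, start=1)}


def sort_key(postcode: str) -> int:
    """Rank a postcode by the part before the space; unlisted prefixes go last."""
    outward = postcode.split(" ")[0]
    return RANK.get(outward, 99)


def order_postcodes(postcodes: list[str]) -> list[str]:
    """Return postcodes in the custom prefix order (stable for equal ranks)."""
    return sorted(postcodes, key=sort_key)


print(order_postcodes(["OX11 1AA", "SW1A 1AA", "RG14 2BB", "RG20 7TT"]))
```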
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alphanumeric... It is a lot easier to insert "5A" between "5" and "6" than it is to insert a number into a sequence of integers.
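The lookup-table approach described in this answer can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table names addresses and sort_order, the column names, and the function name are all hypothetical, and SQLite stands in for whichever database is actually in use.

```python
import sqlite3


def ordered_postcodes(postcodes: list[str], prefix_order: list[str]) -> list[str]:
    """Sort postcodes by joining to a table that stores the custom sequence."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE addresses (postcode TEXT)")
    conn.execute("CREATE TABLE sort_order (prefix TEXT PRIMARY KEY, sequence INTEGER)")
    conn.executemany("INSERT INTO addresses VALUES (?)", [(p,) for p in postcodes])
    conn.executemany(
        "INSERT INTO sort_order VALUES (?, ?)",
        [(prefix, seq) for seq, prefix in enumerate(prefix_order, start=1)],
    )
    rows = conn.execute(
        """
        SELECT a.postcode
        FROM addresses AS a
        LEFT JOIN sort_order AS s ON s.prefix = substr(a.postcode, 1, 4)
        -- rows with no matching prefix have a NULL sequence and are pushed last
        ORDER BY s.sequence IS NULL, s.sequence, a.postcode
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(ordered_postcodes(["OX11 1AA", "SW1A 1AA", "RG20 7TT", "RG14 2BB"], ["RG20", "RG14", "OX11"]))
```

Changing the order then means updating rows in sort_order rather than editing the query.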
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,1,4),"RG20RG14RG18...",1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.

Related rows based on text columns

Given that I have a table with a TEXT column (MySQL or SQLite), is it possible to use the value of that column to find similar rows with somewhat related text values?
For example, if I wanted to find rows related to row_3, both 1 and 2 would match:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
I know that I could use FULLTEXT or FTS3 if I had a key word I wanted to MATCH against the column values - but I'm just trying to find text that is somewhat related among the rows.
MySQL supports a fulltext search option called QUERY EXPANSION. The idea is that you search for a keyword, it finds a row, and then it uses the words in that row as keywords, to search for more matching rows.
SELECT ... FROM StudiesTable WHERE MATCH(description_text)
AGAINST ('sports' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION);
Read about it here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
You're using the wrong hammer to pound that screw in. A single string in a database column isn't the way to store that data. You can't easily get at the part you care about, which is the individual words.
There is a lot of research into the problem of comparison of text. If you're serious about this need, you'll want to start reading about the variety of techniques in that problem domain.
The first clue is that you want to access / index the data not by complete text string, but by word or sentence fragment (unless you're interested in words that are spelled similarly being matched together, which is harder).
As an example of one technique, generate a chain out of your sentences by grabbing overlapping sets of three words, and store the chain. Then you can search for entries that have a large number of chain segments in common. A set of chain segments for your statements above would be:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
this is about (3 matches)
is about sports
is about study (2 matches)
about study and
study and sports
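The chaining technique above can be sketched in Python as follows. The function names are hypothetical, and lowercasing plus whitespace splitting are simplifying assumptions about how the sentences would be tokenised.

```python
def chain_segments(sentence: str, size: int = 3) -> set[str]:
    """Return the set of overlapping word sequences of the given size."""
    words = sentence.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def shared_segments(a: str, b: str) -> int:
    """Count the chain segments two sentences have in common."""
    return len(chain_segments(a) & chain_segments(b))


row_1 = "this is about sports"
row_2 = "this is about study"
row_3 = "this is about study and sports"
print(shared_segments(row_1, row_3), shared_segments(row_2, row_3))
```

In a database, each segment would be stored as its own row keyed by the source row id, so that finding related rows becomes a join and a count instead of a string comparison.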
Maybe it would be enough to take each relevant word in the base row (more than 4 letters? or checked against a list of common words?), use those words as keywords for the fulltext search, and build a tmp table (id, row_matched_id, count) to record the matches for each row, adding 1 to count on each match. At the end, the tmp table holds all the rows that matched and how many times they matched (how many relevant words were the same). If you want to run it once against the whole database and keep the results, use a persisted table, add a column for the id of the base row, and run the search for each new row inserted (or updated) to update the results table.
Using this results table, you can quickly find the rows matching the most words of the base row without doing the search again.
Edit: with this you can "score" the results. For example, if you count x relevant words in the base row, you can calculate a score in % as (matches / x * 100) and filter out all results with, for example, less than 50% matches. In your example, row_1 and row_2 would each give 50% if you consider only words with more than 4 letters as relevant, or 67% if you consider all the words.