Oracle Text DEFINESCORE with ACCUM and query rewriting

I am using Oracle Text to search a corpus of sentences.
I want the score to count only the distinct query terms that occur in a sentence.
Example: my query is (dog cat table).
If the sentence contains the term "dog", it must count 1, even if "dog" appears more than once. If it contains "dog" and "cat", it must count 2, and so on.
I used the query below, but it gives me 51 when it finds the two terms. I need to accumulate the discrete occurrences, so I want to override the default behaviour of the Oracle Text scoring algorithm.
select /*+ FIRST_ROWS(1)*/ sentence_id
,score(1) as sc
, isn
,sentence_length
from plag_docsentences
where contains(PROCESSED_TEXT,'DEFINESCORE(dog, DISCRETE*.01)
,DEFINESCORE(cat, DISCRETE*.01)'
,1)>0
order by score(1) desc

OK, I solved the issue.
Suppose 2 of the 3 query terms are found: the score will be 67,
that is, 2/3 expressed as a rounded percentage (2/3 ≈ 0.67). This is the default behaviour of the Oracle Text scoring algorithm.
So I derived an equation for the number of occurrences (i.e. the number of query terms found in the corpus sentence)
as follows:
x / query_length = score / 100
then
x = query_length * score / 100
This gives the number of words that match between the query and the corpus sentence.
I hope this will help researchers in IR.
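As a minimal sketch of the equation above: because the score is a rounded percentage, x rarely comes out as an exact integer, so rounding to the nearest whole number is needed. The function name and the rounding step are my own assumptions, not part of Oracle Text.

```python
def matched_terms(score, query_length):
    """Recover x from x / query_length = score / 100.

    Rounding is an assumption: the score is an integer percentage,
    so query_length * score / 100 is only approximately whole.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return round(query_length * score / 100)
```

For a 3-term query, a score of 67 maps back to 2 matched terms and a score of 100 maps back to 3.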

Related

Structuring BigQuery with large array of data as input

I am interested in obtaining the most frequent word associations for a particular word via BigQuery's trigram data. For example, in Google's Ngram Viewer I can input great *, which gives me the words that most frequently follow "great", such as "great deal", then "great and" and "great many". My goal is to do this for a large list of words, so that I can query word1 * all the way to word10000 *.
Following the discussion on this SO answer, I was led to BigQuery's publicly available trigram data. What I can't figure out at this point is how to use this service with an array of words as input, either as a file or by pasting them in. Any assistance is much appreciated, thanks.
Here is how you would find the 10 most frequent words to follow "great":
SELECT second, SUM(cell.page_count) total
FROM [publicdata:samples.trigrams]
WHERE first = "great"
group by 1
order by 2 desc
limit 10
This results in:
second        total
---------------------
deal          3048832
and           1689911
,             1576341
a             1019511
number         984993
many           875974
importance     805215
part           739409
.              700694
as             628978
If you want to limit the results to specific years, say between 1820 and 1840, you can also restrict on cell.value (which is the year of publication):
SELECT second, SUM(cell.page_count) total FROM [publicdata:samples.trigrams]
WHERE first = "great" and cell.value between '1820' and '1840'
group by 1
order by 2 desc
limit 10
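The question asks about a whole list of words rather than a single one. One possible shape for that, sketched here against a toy in-memory SQLite table rather than BigQuery itself, is to filter with IN (...) and group by both the first and the second word. The table, the column names first_word / second_word, and the rows are all made up for illustration; whether the same shape scales to 10000 words on the real trigram table is an assumption I have not tested.

```python
import sqlite3

# Toy stand-in for the trigram data: (first_word, second_word, page_count).
# These rows are invented for illustration; they are not BigQuery results.
ROWS = [
    ("great", "deal", 30),
    ("great", "and", 17),
    ("great", "deal", 5),
    ("small", "town", 12),
    ("small", "number", 20),
    ("tiny", "house", 9),
]

def top_followers(words, limit_per_word=1):
    """Return {first_word: [(second_word, total), ...]} for each requested word."""
    words = list(words)
    if not words:
        return {}
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trigrams (first_word TEXT, second_word TEXT, page_count INTEGER)"
    )
    conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?)", ROWS)
    placeholders = ", ".join("?" for _ in words)
    query = (
        "SELECT first_word, second_word, SUM(page_count) AS total "
        "FROM trigrams "
        f"WHERE first_word IN ({placeholders}) "
        "GROUP BY first_word, second_word "
        "ORDER BY first_word, total DESC"
    )
    result = {}
    for first, second, total in conn.execute(query, words):
        followers = result.setdefault(first, [])
        # Rows arrive best-first per word, so keep only the leading ones.
        if len(followers) < limit_per_word:
            followers.append((second, total))
    conn.close()
    return result
```

Trimming to the top N per word is done in Python here to keep the SQL simple.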

Select closest match in dictionary order

Suppose I have the following table with words from a dictionary:
Word
---
cat
dog
giraffe
zebra
I would like to find a word, and if it doesn't exist, the closest before it in dictionary order, e.g. aardvark would return nothing, cat would return cat, cow would return cat, horse would return giraffe.
This should be relatively straightforward to search for using a BTREE index but I haven't figured out a way to do it. I'm using sqlite for this, but other engines are also acceptable.
I'm only interested in the dictionary order, i.e. the query should work exactly with the above examples. Other similarity metrics are of course nice, but are irrelevant to this question.
Assuming that you have declared the column with the correct collation for dictionary order (which might be the default, or COLLATE NOCASE, or a user-defined collation), getting an exact match is trivial:
SELECT Word FROM Dictionary WHERE Word = ?
and getting the closest before is easy:
SELECT MAX(Word) FROM Dictionary WHERE Word < ?
To get only the first result of these two queries, combine them with UNION ALL, and use LIMIT 1 so that the second query is ignored if the first one succeeds:
SELECT Word FROM Dictionary WHERE Word = ?
UNION ALL
SELECT MAX(Word) FROM Dictionary WHERE Word < ?
LIMIT 1
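Here is a small sketch that runs the combined query above against the example table, using Python's built-in sqlite3 module. The helper name is hypothetical. Note that the query relies on the rows of the first SELECT coming back before those of the second; SQLite behaves that way in practice, but SQL does not guarantee an order without ORDER BY. Also, MAX() over no rows yields a single NULL row, so "nothing" shows up as None here.

```python
import sqlite3

def closest_at_or_before(conn, word):
    """Exact match if present, otherwise the closest earlier word, otherwise None."""
    row = conn.execute(
        "SELECT Word FROM Dictionary WHERE Word = ? "
        "UNION ALL "
        "SELECT MAX(Word) FROM Dictionary WHERE Word < ? "
        "LIMIT 1",
        (word, word),
    ).fetchone()
    return row[0] if row else None

# Example table from the question, using the default (binary) collation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dictionary (Word TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO Dictionary VALUES (?)",
    [("cat",), ("dog",), ("giraffe",), ("zebra",)],
)
```

Calling closest_at_or_before(conn, "cow") gives "cat", and "horse" gives "giraffe", matching the examples in the question.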
For an approximate match on word endings (SQL Server syntax):
select a.word, b.word from dictionary a, words b
where b.word like '%' + right(a.word, 2)
or, for a phonetic match, compare SOUNDEX codes:
select a.word, b.word from dictionary a, words b
where soundex(a.word) = soundex(b.word)
This may also help: DIFFERENCE compares the SOUNDEX codes of two strings and returns 0 to 4, with 4 being the closest match:
select a.word, b.word from dictionary a, words b
where difference(a.word, 'DOG') in (3, 4)

Oracle 'Contains' / 'Group' function returns incorrect value

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces the output below:
The question is:
Why does SCORE(1) produce 9? As I recall, the CONTAINS function returns the number of occurrences of the search string (in this case '%sul%').
I expected the output to be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
And can someone please explain what the third parameter is for?
Thanks in advance :)
Your second query returns no rows because it searches for the word sul. CONTAINS does not do a pattern search unless you tell it to; it searches for the words you specify in the second parameter. To look for patterns you have to use wildcards, as you did in your first example.
Now, the third parameter of CONTAINS: it is a label, used only to tie a SCORE operator to its CONTAINS clause. You should supply it when you use SCORE in your SELECT list. Its importance is clearer when there are multiple SCORE operators.
Quoting directly from the documentation:
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news
WHERE CONTAINS(news.title, 'Oracle', 10) > 0 OR CONTAINS(news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);
The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
Think of a Google search. If you search for the term Oracle alone, you will not (directly) find any result that helps explain your score, so that term is "noise" relative to what you are looking for. But if you search for Oracle Text Scoring Algorithm, you will find your answer in the first Google result.
As for your other questions, I think @Incognito has already answered them well.
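To make the inverse-frequency idea a bit more concrete, here is a rough sketch. It assumes the form 3f(1 + log(N/n)) as I recall it from Oracle's description of the algorithm, where f is the number of occurrences in the document, N the number of rows in the table and n the number of rows containing the term. The base-10 logarithm, the cap at 100 and the function name are my assumptions, so check the official documentation before relying on exact values.

```python
import math

def inverse_frequency_score(f, total_docs, docs_with_term):
    """Illustrative inverse-frequency score, capped at 100.

    Assumed form: 3 * f * (1 + log10(N / n)). This is a sketch of the idea,
    not a verified reimplementation of Oracle Text.
    """
    if f <= 0 or docs_with_term <= 0:
        return 0.0
    raw = 3 * f * (1 + math.log10(total_docs / docs_with_term))
    return min(raw, 100.0)
```

Under these assumptions a single occurrence never scores 1 merely because it occurred once: the result depends on how rare the term is across the whole table, so the same term in the same row scores higher when fewer rows contain it.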

Weighted Keyword Search

Hello: I want to do a "weighted search" on products that are tagged with keywords.
(So: not a fulltext search, but an n-to-m relation.) So here it is:
Table 'product':
sku - the primary key
name
Table 'keywords':
kid - keyword id
keyword_de - German language String (e.g. 'Hund','Katze','Maus')
keyword_en - English language String (e.g. 'Dog','Cat','Mouse')
Table 'product_keyword' (the cross-table)
sku \__ combined primary key
kid /
What I want is to get a score for all products that at least "contain" one relevant keyword. If I search for ('Dog','Elephant','Maus') I want that
Dog credits a score of 1.003,
Elephant of 1.002
Maus of 1.001
So the least important search term starts at 1.001, and each more important term adds another 0.001. That way, a lower score limit of 3.0 would equal an "AND" query (all three keywords must be found), and a lower score limit of 1.0 would equal an "OR". Anything in between would be a more or less close match. In particular, by sorting on this score, the most relevant search results would come first (regardless of the lower limit)...
I guess I will have to do something with
IF( keyword1 == 'dog', 1.001, 0) + IF...
maybe inside a SUM() and probably with a GROUP BY at the end of a JOIN over the cross table, eh? But I am fairly clueless how to tackle this.
What would be feasible is to get the keyword ids from the keywords table beforehand. That's a cheap query. So the keywords table can be ignored, and it's all about the cross table and the product table...
I have PHP at hand to automatically prepare a fairly lengthy SQL statement, but I would like to avoid multiple further SQL statements. In particular, since I will limit the query outcome (most often to "LIMIT 0, 20") for paged results, looping a very large number of intermediate results through a script would be no good...
DANKESCHÖN, if you can help me on this :-)
I think a lot of this is in the Lucene engine (http://lucene.apache.org/java/docs/index.html), which is available for PHP in the Zend Framework: http://framework.zend.com/manual/en/zend.search.lucene.html.
EDIT:
If you want to do the weighted thing you are talking about, I guess you could use something like this:
select p.sku,
       sum(case k.keyword_en
             when 'Dog' then 1001
             when 'Cat' then 1002
             when 'Mouse' then 1003
             else 0
           end) as totalscore  -- integer weights; divide by 1000 for the 1.00x scale
from product p  -- the table is named 'product' in the question
inner join product_keyword pk on p.sku = pk.sku
inner join keywords k on k.kid = pk.kid
where k.keyword_en in ('Dog', 'Cat', 'Mouse')
group by p.sku
(Edit 2: forgot the group by clause.)
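To show how the lower score limit from the question could sit on top of this kind of query, here is a sketch using Python's sqlite3 module with made-up sample data. The function names, the SKUs and the subquery used for the threshold are my own assumptions; the weights follow the question (least important term 1.001, plus 0.001 for each more important one).

```python
import sqlite3

def weighted_search(conn, keywords, min_score=1.0):
    """keywords: most important first. Returns [(sku, score)], best match first."""
    keywords = list(keywords)
    if not keywords:
        return []
    n = len(keywords)
    # The last keyword gets 1.001, the one before it 1.002, and so on.
    weights = [(kw, 1 + (n - i) / 1000) for i, kw in enumerate(keywords)]
    case_sql = " ".join("WHEN ? THEN ?" for _ in weights)
    in_sql = ", ".join("?" for _ in keywords)
    query = (
        "SELECT sku, totalscore FROM ("
        f"  SELECT pk.sku AS sku, SUM(CASE k.keyword_en {case_sql} ELSE 0 END) AS totalscore"
        "  FROM product_keyword pk"
        "  JOIN keywords k ON k.kid = pk.kid"
        f"  WHERE k.keyword_en IN ({in_sql})"
        "  GROUP BY pk.sku"
        ") WHERE totalscore >= ? ORDER BY totalscore DESC"
    )
    params = [v for pair in weights for v in pair] + keywords + [min_score]
    return conn.execute(query, params).fetchall()

def demo_connection():
    """Build a tiny in-memory database with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (kid INTEGER PRIMARY KEY, keyword_en TEXT)")
    conn.execute("CREATE TABLE product_keyword (sku TEXT, kid INTEGER, PRIMARY KEY (sku, kid))")
    conn.executemany(
        "INSERT INTO keywords VALUES (?, ?)",
        [(1, "Dog"), (2, "Elephant"), (3, "Mouse"), (4, "Cat")],
    )
    conn.executemany(
        "INSERT INTO product_keyword VALUES (?, ?)",
        [("A1", 1), ("A1", 2), ("A1", 3), ("B2", 1), ("C3", 4)],
    )
    return conn
```

With this data, a minimum score of 1.0 behaves like "OR" and returns A1 and B2 in that order, while a minimum of 3.0 behaves like "AND" and returns only A1. A LIMIT for paging could be appended to the outer query.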

Related rows based on text columns

Given a table with a TEXT column (MySQL or SQLite), is it possible to use the value of that column to find other rows with somewhat related text values?
For example, if I wanted to find rows related to row_3, both row_1 and row_2 would match:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
I know that I could use FULLTEXT or FTS3 if I had a keyword I wanted to MATCH against the column values, but I'm just trying to find text that is somewhat related among the rows.
MySQL supports a fulltext search option called QUERY EXPANSION. The idea is that you search for a keyword, it finds a row, and then it uses the words in that row as keywords, to search for more matching rows.
SELECT ... FROM StudiesTable WHERE MATCH(description_text)
AGAINST ('sports' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION);
Read about it here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
You're using the wrong hammer to pound that screw in. A single string in a database column isn't the way to store that data. You can't easily get at the part you care about, which is the individual words.
There is a lot of research into the problem of comparison of text. If you're serious about this need, you'll want to start reading about the variety of techniques in that problem domain.
The first clue is that you want to access / index the data not by complete text string, but by word or sentence fragment (unless you're interested in words that are spelled similarly being matched together, which is harder).
As an example of one technique, generate a chain out of your sentences by grabbing overlapping sets of three words, and store the chain. Then you can search for entries that have a large number of chain segments in common. A set of chain segments for your statements above would be:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
this is about (3 matches)
is about sports
is about study (2 matches)
about study and
study and sports
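The chain technique above can be sketched in a few lines of Python. The function names are hypothetical, and splitting on whitespace with lowercasing is a simplifying assumption (real text would need punctuation handling).

```python
def chain_segments(text, size=3):
    """Return the set of overlapping word groups ("chain segments") in text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shared_segments(a, b, size=3):
    """Count the chain segments two texts have in common."""
    return len(chain_segments(a, size) & chain_segments(b, size))
```

Comparing row_3 with row_2 gives 2 shared segments ("this is about", "is about study"), while row_3 and row_1 share only "this is about". Storing the segments in an indexed table keyed by row id would let the database do this counting with a join and GROUP BY.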
Maybe it would be enough to take each relevant word in the base row (more than 4 letters? or not on a list of common words?), use those words as keywords for the fulltext search, and build a temporary table (id, row_matched_id, count) that records the matches for each row, adding 1 to count on every match. At the end the temporary table holds all the rows that matched and how many times they matched (how many relevant words they share with the base row). If you want to run it once against the whole database and keep the results, use a persistent table, add a column for the id of the base row, and run the search for each inserted (or updated) row to update the results table.
Using this results table you can quickly find the rows that match the most words of the base row without running the search again.
Edit: with this you can "score" the results. For example, if the base row has x relevant words, you can calculate a score in % as (matches / x * 100) and filter out all results below, say, 50%. In your example, row_1 and row_2 would each score 67% if you consider all the words (4 of 6), and also 67% if only words with more than 4 letters count (2 of 3, since "about" has 5 letters); they would score 50% if "about" is also excluded as a common word (1 of 2).
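A minimal sketch of that percentage score, done in plain Python rather than with a results table. The function names, the whitespace tokenisation and the example common-word list are assumptions for illustration.

```python
def relevant_words(text, min_letters=0, common=()):
    """Words longer than min_letters that are not on the common-word list."""
    return {
        w for w in text.lower().split()
        if len(w) > min_letters and w not in common
    }

def match_score(base, other, min_letters=0, common=()):
    """Percentage of the base row's relevant words that also occur in other."""
    base_words = relevant_words(base, min_letters, common)
    if not base_words:
        return 0.0
    other_words = set(other.lower().split())
    return 100 * len(base_words & other_words) / len(base_words)
```

With row_3 as the base row, row_1 scores about 67% when all words count (4 of 6). If the relevant words are narrowed to "study" and "sports", for example with common={"this", "is", "about", "and"}, row_1 scores 50%.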