Related rows based on text columns - sql

Given that I have a table with a column of TEXT in it (MySQL or SQLite), is it possible to use the value of that column to find similar rows with somewhat related text values?
For example, if I wanted to find rows related to row_3, both 1 and 2 would match:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
I know that I could use FULLTEXT or FTS3 if I had a key word I wanted to MATCH against the column values - but I'm just trying to find text that is somewhat related among the rows.

MySQL supports a fulltext search option called QUERY EXPANSION. The idea is that you search for a keyword, it finds a row, and then it uses the words in that row as keywords, to search for more matching rows.
SELECT ... FROM StudiesTable WHERE MATCH(description_text)
AGAINST ('sports' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION);
Read about it here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html

You're using the wrong hammer to pound that screw in. A single string in a database column isn't the way to store that data. You can't easily get at the part you care about, which is the individual words.
There is a lot of research into the problem of comparison of text. If you're serious about this need, you'll want to start reading about the variety of techniques in that problem domain.
The first clue is that you want to access / index the data not by complete text string, but by word or sentence fragment (unless you're interested in words that are spelled similarly being matched together, which is harder).
As an example of one technique, generate a chain out of your sentences by grabbing overlapping sets of three words, and store the chain. Then you can search for entries that have a large number of chain segments in common. A set of chain segments for your statements above would be:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
Chain segments:
this is about (3 matches)
is about sports
is about study (2 matches)
about study and
study and sports
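As a rough illustration of the chain idea above, here is a minimal Python sketch that builds the overlapping three-word segments and counts how many segments each row shares with a base row. The function names (shingles, shared_segments) are made up for this example, and a real system would store the segments in an indexed table rather than recompute them in memory.

```python
def shingles(text, size=3):
    """Return the set of overlapping `size`-word segments in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def shared_segments(base_id, rows, size=3):
    """Count the segments each other row shares with the base row, best first."""
    base = shingles(rows[base_id], size)
    counts = {
        other_id: len(base & shingles(text, size))
        for other_id, text in rows.items()
        if other_id != base_id
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


rows = {
    "row_1": "this is about sports",
    "row_2": "this is about study",
    "row_3": "this is about study and sports",
}

related = shared_segments("row_3", rows)
print(related)  # [('row_2', 2), ('row_1', 1)]
```

With a segment size of three, row_2 shares two segments with row_3 and row_1 shares one, so both count as related but row_2 ranks higher.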

Maybe it would be enough to take each relevant word in the base row (more than 4 letters? or anything not in a list of common words?), use those words as keywords for the fulltext search, and build a tmp table (id, row_matched_id, count) that records the matches for each row, adding 1 to count each time it matches. At the end, the tmp table holds every row that matched and how many times it matched (how many relevant words were the same). If you want to run it once against the whole database and keep the results, use a persisted table, add a column for the id of the base row, and run the search for each newly inserted (or updated) row to update the results table.
Using this results table you can find quickly the rows matching more words of the base row without doing the search again.
Edit: with this you can "score" the results. For example, if you count x relevant words in the base row, you can calculate a score in % as (matches / x * 100) and filter out all results with, say, less than 50% matches. In your example, row_1 and row_2 would each score 50% if "about" is treated as a common word (so only "study" and "sports" count as relevant), or 67% if you consider all the words.
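The scoring idea can be sketched in a few lines of Python. This is only an in-memory illustration under assumptions: the stop-word list is invented for the example, and match_scores is a hypothetical helper, not part of any fulltext engine.

```python
def relevant_words(text, stop_words):
    """Lower-cased words of `text` that are not in the stop-word list."""
    return {word for word in text.lower().split() if word not in stop_words}


def match_scores(base_id, rows, stop_words=frozenset()):
    """Score every other row as the percentage of the base row's relevant words it shares."""
    base = relevant_words(rows[base_id], stop_words)
    if not base:
        return {}  # nothing relevant to compare against
    scores = {}
    for other_id, text in rows.items():
        if other_id == base_id:
            continue
        shared = base & relevant_words(text, stop_words)
        scores[other_id] = round(100 * len(shared) / len(base))
    return scores


rows = {
    "row_1": "this is about sports",
    "row_2": "this is about study",
    "row_3": "this is about study and sports",
}

# Assumed stop-word list, for illustration only.
common = frozenset({"this", "is", "about", "and"})

all_words = match_scores("row_3", rows)
filtered = match_scores("row_3", rows, common)
print(all_words)  # {'row_1': 67, 'row_2': 67}
print(filtered)   # {'row_1': 50, 'row_2': 50}
```

Persisting these scores in a results table, as described above, avoids recomputing them on every lookup.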

Related

postgreSQL: Remove last character in a VARCHAR until match is found

I am looking for an efficient way to query a PostgreSQL database by removing the right-most character in a string until a match is found. For example, if my dialing number is 442079285200, then it should strip characters from the end of the sequence, eventually matching UNITED KINGDOM-LONDON 44207.
442079285200 -> No match
44207928520 -> No match
4420792852 -> No match
442079285 -> No match
44207928 -> No match
4420792 -> No match
442079 -> No match
44207 -> Matches UNITED KINGDOM-LONDON
v_destination_rates
destination             dialing_code  current_rate  rounding_rule
INMARSAT                870           10.8239       1-1-1
INTERNATIONAL NETWORKS  882           10.8239       1-1-1
INTERNATIONAL NETWORKS  883           10.8239       1-1-1
IRIDIUM                 521844207     5.1167        1-1-1
UNITED KINGDOM-LONDON   44207         0.0056        1-1-1
I know one way of doing this is to loop over the number of characters in the dialing number (n) and do a select query for the left-most n characters. I haven't successfully run my query, but I believe it would look something similar to:
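DO
$do$
DECLARE
m varchar := '442079285200';
v_destination text;
BEGIN
-- REVERSE is needed to count down from the full length to 1
FOR counter IN REVERSE length(m)..1 LOOP
-- a SELECT inside PL/pgSQL needs an INTO target; it also sets FOUND
SELECT destination INTO v_destination
FROM v_destination_rates
WHERE dialing_code = left(m, counter);
EXIT WHEN FOUND; -- stop at the longest matching prefix
END LOOP;
RAISE NOTICE 'destination: %', v_destination;
END
$do$;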
I'm wondering if there is a more efficient way of performing this query, perhaps with the LIKE wildcard operator? We have thousands of dialing numbers to match to approximately 20 000 dialing_codes so a less expensive operation would be preferred.
You haven't said whether dialing_number is coming from a table / sanitized user input / something else.
For simplicity I'll assume it comes from a table contacts and that you want to return everything in contacts and every column in v_destination_rates joined as you describe.
Without using pl/pgSQL:
SELECT
*
FROM contacts c
LEFT JOIN v_destination_rates vdr
-- the dialing number must start with the dialing code
ON c.dialing_number::TEXT LIKE vdr.dialing_code::TEXT || '%'
I've tested this on a table of 9,000 records, which I assume is about as large as or larger than the lookup table v_destination_rates, and matched 16 sample integers in less than a tenth of a second.
You could possibly get even better performance if the dialing code is already type TEXT, and indexed lexicographically since that's how you're searching here.
I generally avoid regular expressions and LIKE/SIMILAR TO searches whenever possible; they are slow. In this case they are completely avoidable: you can use substring and length and do an equality match instead. (Yes, two function calls versus a regex is a toss-up.) The following does just that, and when multiple matches occur it selects the longest match.
with dialing_number (dn) as
( values ('442079285200') )
select dr.*
from dialing_number
join v_destination_rates dr
on dr.dialing_code = substring(dn,1, length(dr.dialing_code))
order by length(dr.dialing_code) desc
limit 1 ;
Concerning performance, a search set of 20,000 is a very small number of entries to search. Out of curiosity, I generated random dialing_code values until the above query took more than 1 second. That occurred at 4,755,006 rows, searched in 1.07 seconds.
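To try the longest-prefix idea without a PostgreSQL server, here is a small sketch using Python's built-in sqlite3 module. It is an approximation under assumptions: SQLite's substr() stands in for PostgreSQL's substring(), dialing_code is stored as TEXT, and the plain "UNITED KINGDOM" row is a hypothetical addition (with a made-up rate) so that two prefixes compete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v_destination_rates ("
    "destination TEXT, dialing_code TEXT, current_rate REAL, rounding_rule TEXT)"
)
conn.executemany(
    "INSERT INTO v_destination_rates VALUES (?, ?, ?, ?)",
    [
        ("INMARSAT", "870", 10.8239, "1-1-1"),
        ("IRIDIUM", "521844207", 5.1167, "1-1-1"),
        ("UNITED KINGDOM-LONDON", "44207", 0.0056, "1-1-1"),
        # Hypothetical row, not in the original table, to show longest-match selection.
        ("UNITED KINGDOM", "44", 0.01, "1-1-1"),
    ],
)


def longest_prefix_match(conn, dialing_number):
    """Return (destination, dialing_code) for the longest code that prefixes the number."""
    return conn.execute(
        "SELECT destination, dialing_code FROM v_destination_rates "
        "WHERE substr(?, 1, length(dialing_code)) = dialing_code "
        "ORDER BY length(dialing_code) DESC LIMIT 1",
        (dialing_number,),
    ).fetchone()


print(longest_prefix_match(conn, "442079285200"))  # ('UNITED KINGDOM-LONDON', '44207')
print(longest_prefix_match(conn, "441234567890"))  # ('UNITED KINGDOM', '44')
print(longest_prefix_match(conn, "999"))           # None
```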
"I'm wondering if there is a more efficient way of performing this query"
Yes, parse the number to get the country and dialing code. There are any number of existing libraries to do this. Then concatenate them and search.
For example, 442079285200 is the country code 44 and the dialing code 20 (207 is obsolete). Then you'd search for '4420'.
Note: 870, 882, and 883 are not dialing codes, they are country codes. And Iridium is 881. Mixing up country codes with dialing codes will probably cause more problems down the road, you may be better off separating them in your table.

SQL count query number of lines per row

I am trying to count the number of URLs stored in a field in SQL. I have googled but cannot find anything!
So for example this could be in field "url" row 1 / id 1
url/32432
url/32434
So for example this could be field "url" in row 2 / id 2
url/32432
url/32488
url/32477
So if you were to run the query the count would be 5. There is no comma in between them, only space.
Kind Regards
Scott
This is a very bad layout for data. If you have multiple urls per id, then they should be stored as separate rows in another table.
But, sometimes we are stuck with other people's bad design decisions. You can do something like this:
select (length(replace(urls, 'url', 'urlx')) - length(urls)) as num_urls
Note that the specific functions for length() and replace() might vary, depending on the database.
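The length-difference trick can be checked quickly with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name pages is invented, the column is named url as in the question, and every entry is assumed to contain the literal substring 'url' exactly once (real URLs would need a different marker, such as counting the separating spaces).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO pages (id, url) VALUES (?, ?)",
    [
        (1, "url/32432 url/32434"),
        (2, "url/32432 url/32488 url/32477"),
    ],
)

# Replacing 'url' with 'urlx' adds one character per occurrence,
# so the growth in length equals the number of occurrences.
count_expr = "length(replace(url, 'url', 'urlx')) - length(url)"

per_row = conn.execute(
    f"SELECT id, {count_expr} AS num_urls FROM pages ORDER BY id"
).fetchall()
total = conn.execute(f"SELECT sum({count_expr}) FROM pages").fetchone()[0]

print(per_row)  # [(1, 2), (2, 3)]
print(total)    # 5
```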

Oracle 'Contains' / 'Group' function return incorrect value

I have this query:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, '%sul%', 1) > 0
It produces output below:
The question is:
Why does the SCORE(1) produce 9? As I recall that CONTAINS function returns number of occurrences of search_string (in this case '%sul%').
I expect the output should be:
Sullivan 1
Sully 1
But when I try this syntax:
SELECT last_name, SCORE(1)
FROM Employees
WHERE CONTAINS(last_name, 'sul', 1) >0;
It returns 0 rows selected.
And can someone please explain me what is the third parameter for?
Thanks in advance :)
The reason your second query returns no rows is that you are searching for the word sul. CONTAINS will not do a pattern search unless you tell it to; it searches for the words you specify in the second parameter. To look for patterns, you have to use wildcards, as you did in your first example.
Now, coming to the third parameter in CONTAINS: it is a label, and is just used to label the SCORE operator. You should use the third parameter when you use SCORE in your SELECT list. Its importance is clearer when there are multiple SCORE operators.
Quoting directly from the documentation:
label
Specify a number to identify the score produced by the query.
Use this number to identify the CONTAINS clause which returns this
score.
Example
Single CONTAINS
When the SCORE operator is called (for example, in a SELECT clause),
the CONTAINS clause must reference the score label value as in the
following example:
SELECT SCORE(1), title from newsindex
WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
Multiple CONTAINS
Assume that a news database stores and indexes the title and body of
news articles separately. The following query returns all the
documents that include the words Oracle in their title and java in
their body. The articles are sorted by the scores for the first
CONTAINS (Oracle) and then by the scores for the second CONTAINS
(java).
SELECT title, body, SCORE(10), SCORE(20) FROM news WHERE CONTAINS
(news.title, 'Oracle', 10) > 0 OR CONTAINS (news.body, 'java', 20) > 0
ORDER BY SCORE(10), SCORE(20);
The Oracle Text Scoring Algorithm does not score by simply counting the number of occurrences. It uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
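To make the inverse-frequency idea concrete, here is a generic Python sketch. It is not Oracle's implementation: the exact constants, the logarithm base and the scaling to an integer score are all assumptions left out here, and inverse_frequency_score is a hypothetical helper that only shows why a rare term outscores a common one.

```python
import math


def inverse_frequency_score(term, document, documents):
    """Term frequency in `document`, weighted up when few documents contain the term."""
    term = term.lower()
    frequency = document.lower().split().count(term)
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if frequency == 0 or containing == 0:
        return 0.0
    return frequency * (1 + math.log(len(documents) / containing))


documents = [
    "oracle text scoring",
    "oracle database",
    "oracle sql tips",
    "java tips",
]

common_term = inverse_frequency_score("oracle", documents[0], documents)
rare_term = inverse_frequency_score("scoring", documents[0], documents)
print(common_term < rare_term)  # True: 'scoring' is rarer across the set, so it scores higher
```

Because the score depends on the whole document set and not just on the matching row, a single occurrence of a term does not have to produce a score of 1.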
Think of a Google search. If you search for the term Oracle, you will not (directly) find any result that helps explain your score, so we can consider this term "noise" for your purposes. But if you search for Oracle Text Scoring Algorithm, you will find your answer in the first Google result.
As for your other questions, I think @Incognito has already answered them well.

PostgreSQL, find strings differ by n characters

Suppose I have a table like this
id data
1 0001
2 1000
3 2010
4 0120
5 0020
6 0002
id is primary key, data is fixed length string where characters could be 0, 1, 2.
Is there a way to build an index so I can quickly find strings that differ by n characters from a given string? For example, for string 0001 and n = 1, I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
But it is not very fast. Slow with big tables, because it can't use an index.
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere, the module is using a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
data LIKE '1_10' OR
data LIKE '11_0' OR
data LIKE '111_');
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
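The LIKE patterns for a single difference can be generated instead of written by hand. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (so no trigram index is involved) and the sample data from the question; one_off_patterns and find_one_off are hypothetical helper names.

```python
import sqlite3


def one_off_patterns(value):
    """LIKE patterns with one position replaced by the single-character wildcard."""
    return [value[:i] + "_" + value[i + 1:] for i in range(len(value))]


def find_one_off(conn, value):
    """Rows of table `a` whose data differs from `value` in exactly one position."""
    patterns = one_off_patterns(value)
    where = " OR ".join("data LIKE ?" for _ in patterns)
    sql = f"SELECT id, data FROM a WHERE data <> ? AND ({where}) ORDER BY id"
    return conn.execute(sql, [value, *patterns]).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, "0001"), (2, "1000"), (3, "2010"), (4, "0120"), (5, "0020"), (6, "0002")],
)

print(one_off_patterns("1110"))    # ['_110', '1_10', '11_0', '111_']
print(find_one_off(conn, "0001"))  # [(6, '0002')]
```

The number of patterns grows with the string length, and for n differences it grows combinatorially, which is why this approach only suits short strings and small n.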
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.

search criteria difference between Like vs Contains() in oracle

I created a table with two columns and inserted two rows.
id name
1 narsi reddy
2 narei sia
One column is simply a NUMBER type and the other is a CLOB, so I decided to create an index on it. I then queried it using CONTAINS.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected both rows to be returned, but only row 2 was. If I use the same pattern with LIKE, I get what I wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
Finally I understood that LIKE searches the whole document or paragraph, but CONTAINS looks at individual words.
So how can I get the required output?
LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher - it recognises two wildcards (%) and (_) which match zero-or-more, or exactly-one, character respectively. In your case, %a%e% matches two records in your table - it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched" - no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.
I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.
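The difference between whole-string and word-at-a-time matching can be imitated in plain Python. This sketch does not use Oracle Text at all: it only mimics the behaviour described in this answer by applying the same wildcard pattern either to the full value or to each word separately, and all function names are invented for the example.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern (% and _) into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def like_match(text, pattern):
    """Whole-string match, as LIKE does."""
    return like_to_regex(pattern).fullmatch(text) is not None


def any_word_matches(text, pattern):
    """Word-at-a-time match, imitating how the wildcard term is described to behave."""
    return any(like_match(word, pattern) for word in text.split())


def combined(text):
    """Word-level candidates ('%a%' AND '%e%') filtered by the whole-string pattern."""
    return (
        any_word_matches(text, "%a%")
        and any_word_matches(text, "%e%")
        and like_match(text, "%a%e%")
    )


names = ["narsi reddy", "narei sia"]
print([n for n in names if like_match(n, "%a%e%")])        # both rows
print([n for n in names if any_word_matches(n, "%a%e%")])  # ['narei sia']
print([n for n in names if combined(n)])                   # both rows
```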