SQL performance: LIKE and UTL_MATCH.EDIT_DISTANCE

I have two tables: OSD_SPLIT, which contains 180K records with words to be flagged, and TRANSPLIT, which holds each sentence split on spaces. For example, for the sentence "My name is XYZ", TRANSPLIT would have 4 rows: My, name, is, XYZ.
I have two SQL statements. The first matches words at a one-letter distance: if TRANSPLIT has a row SHARK and OSD_SPLIT has the word SHRAK, I would like them to match, i.e. edit_distance is 1 and the lengths are the same.
The second looks for OSD_SPLIT words inline within TRANSPLIT: if TRANSPLIT has XSHARKX, I would like it to match SHARK, because SHARK occurs inline within XSHARKX.
Both statements are below, and I would like to improve their performance. Both do full table scans on the OSD_SPLIT table, I guess because I am using edit_distance and like '%%'. Any suggestions for improving performance?
SELECT DISTINCT OS.STOP_DSC
FROM OSD_SPLIT OS, TRANSPLIT TS
WHERE TS.TXTSPLIT1 <> OS.STOP_DSC
AND TS.ORGTXT = TS.TXTSPLIT1
AND LENGTH(TS.TXTSPLIT1) > 1
AND LENGTH(TS.TXTSPLIT1) = LENGTH(OS.STOP_DSC)
AND UTL_MATCH.EDIT_DISTANCE(TS.TXTSPLIT1, OS.STOP_DSC) = 1
SELECT DISTINCT OS.STOP_DSC
FROM OSD_SPLIT OS, TRANSPLIT TS
WHERE TS.TXTSPLIT1 <> OS.STOP_DSC
AND TS.TXTSPLIT1 LIKE '%'||OS.STOP_DSC||'%'
AND LENGTH(TS.TXTSPLIT1) >= LENGTH(OS.STOP_DSC)
AND LENGTH(TS.TXTSPLIT1) >= 3 AND LENGTH(OS.STOP_DSC) >= 3
AND ((LENGTH(TS.TXTSPLIT1) - LENGTH(REPLACE(TS.TXTSPLIT1,OS.STOP_DSC,'')))/LENGTH(TS.TXTSPLIT1))*100 >= 80
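One observation that may help with the first query: when two strings have the same length, a Levenshtein distance of 1 can only come from a single substitution, so the expensive cross join can in principle be replaced by a hash lookup on "wildcard keys". Note also that SHARK and SHRAK differ in two positions, so a Levenshtein-style distance such as UTL_MATCH.EDIT_DISTANCE gives 2 for that pair, not 1. The following is only a Python sketch of the idea, not Oracle code; the function names and the sample words are made up for illustration.

```python
from collections import defaultdict


def wildcard_keys(word):
    """Yield the word with each position in turn replaced by a placeholder."""
    for i in range(len(word)):
        yield word[:i] + "\0" + word[i + 1:]


def build_index(stop_words):
    """Map every wildcard key to the set of stop words that produce it."""
    index = defaultdict(set)
    for word in stop_words:
        for key in wildcard_keys(word):
            index[key].add(word)
    return index


def one_substitution_matches(token, index):
    """Stop words of the same length that differ from token in exactly one position."""
    found = set()
    for key in wildcard_keys(token):
        found |= index.get(key, set())
    found.discard(token)  # mirrors TS.TXTSPLIT1 <> OS.STOP_DSC
    return found


# Hypothetical sample data standing in for OSD_SPLIT.STOP_DSC.
index = build_index(["SHARK", "SHRAK", "SHIRT"])
matches = one_substitution_matches("SHARP", index)   # one substitution away from SHARK
swapped = one_substitution_matches("SHRAK", index)   # transposition: two positions differ
print(matches, swapped)
```

Each token costs a number of lookups equal to its length instead of one comparison per stop word. Whether something similar pays off inside the database (for example with a precomputed key table) would need testing on the real data.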

Related

How to write a SQL query to calculate percentages based on values across different tables?

Suppose I have a database containing two tables, similar to below:
Table 1:
tweet_id tweet
1 Scrap the election results
2 The election was great!
3 Great stuff
Table 2:
politician tweet_id
TRUE 1
FALSE 2
FALSE 3
I'm trying to write a SQL query which returns the percentage of tweets that contain the word 'election' broken down by whether they were a politician or not.
So for instance here, the first 2 tweets in Table 1 contain the word election. By looking at Table 2, you can see that tweet_id 1 was written by a politician, whereas tweet_id 2 was written by a non-politician.
Hence, the result of the SQL query should return 50% for politicians and 50% for non-politicians (i.e. two tweets contained the word 'election', one by a politician and one by a non-politician).
Any ideas how to write this in SQL?
You could do this by creating one subquery to return all election tweets, and one subquery to return all election tweets by politicians, then join.
Here is a sample. Note that you may need to cast the totals to decimals before dividing (depending on which SQL provider you are working in).
select
politician_tweets.total / election_tweets.total
from
(
select
count(tweet) as total
from
table_1
join table_2 on table_1.tweet_id = table_2.tweet_id
where
tweet like '%election%'
) election_tweets
join
(
select
count(tweet) as total
from
table_1
join table_2 on table_1.tweet_id = table_2.tweet_id
where
tweet like '%election%' and
politician = 1
) politician_tweets
on 1 = 1
You can use aggregation like this:
select t2.politician, avg( case when t.tweet like '%election%' then 1.0 else 0 end) as election_ratio
from tweets t join
table2 t2
on t.tweet_id = t2.tweet_id
group by t2.politician;
Here is a db<>fiddle.
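Note that the two answers compute slightly different things: the aggregation query returns, for each group, the fraction of that group's tweets that mention the word, while the question asks how the tweets mentioning the word are split between the groups (50% / 50% in the sample). The following sketch shows the second reading. It uses Python's sqlite3 module only so that it is runnable as is; the table names follow the first answer, and storing politician as 1/0 is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (tweet_id INTEGER PRIMARY KEY, tweet TEXT);
CREATE TABLE table_2 (politician INTEGER, tweet_id INTEGER);
INSERT INTO table_1 VALUES
    (1, 'Scrap the election results'),
    (2, 'The election was great!'),
    (3, 'Great stuff');
INSERT INTO table_2 VALUES (1, 1), (0, 2), (0, 3);  -- 1 = TRUE, 0 = FALSE
""")

# Share of all 'election' tweets written by each group.
# Multiplying by 100.0 forces floating-point division.
rows = conn.execute("""
SELECT t2.politician,
       100.0 * COUNT(*) /
       (SELECT COUNT(*) FROM table_1 WHERE tweet LIKE '%election%') AS pct
FROM table_1 t1
JOIN table_2 t2 ON t1.tweet_id = t2.tweet_id
WHERE t1.tweet LIKE '%election%'
GROUP BY t2.politician
ORDER BY t2.politician
""").fetchall()

for politician, pct in rows:
    print(politician, pct)
```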

In SQL, how to check if a string is the substring of any other string in the same table?

I have a table full of strings (TEXT) and I would like to get all the strings that are substrings of any other string in the same table. For example, if I had these three strings in my table:
WORD WORD_ID
cup 0
cake 1
cupcake 2
As result of my query I would like to get something like this:
WORD WORD_ID SUBSTRING SUBSTRING_ID
cupcake 2 cup 0
cupcake 2 cake 1
I know that I could do this with two loops (using Python or JS) by looping over every word in my table and match it against every word in the same table, but I'm not sure how this can be done using SQL (PostgreSQL for that matter).
Use self-join:
select w1.word, w1.word_id, w2.word, w2.word_id
from words w1
join words w2
on w1.word <> w2.word
and w1.word like format('%%%s%%', w2.word);
word | word_id | word | word_id
---------+---------+------+---------
cupcake | 2 | cup | 0
cupcake | 2 | cake | 1
(2 rows)
Problem
The task has the potential to stall your database server for tables of non-trivial size, since it's an O(N²) problem as long as you cannot utilize an index for it.
In a sequential scan you have to check every possible combination of two rows, that's n * (n-1) / 2 combinations - Postgres will run n * (n-1) tests since it's not easy to rule out reverse duplicate combinations. If you are satisfied with the first match, it gets cheaper - how much depends on data distribution. For many matches, Postgres will find a match for a row early and can skip testing the rest. For few matches, most of the checks have to be performed anyway.
Either way, performance deteriorates rapidly with the number of rows in the table. Test each query with EXPLAIN ANALYZE and 10, 100, 1000 etc. rows in the table to see for yourself.
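To make the quadratic growth concrete, here is a small Python sketch (not Postgres code, and the function name is made up) that performs the same pairwise test as the un-indexed self-join and counts how many tests it runs:

```python
def substring_pairs(words):
    """Return (word, substring) pairs plus the number of pairwise tests performed."""
    tests = 0
    pairs = []
    for w1 in words:
        for w2 in words:
            if w1 == w2:
                continue  # same role as w1.word <> w2.word
            tests += 1
            if w2 in w1:  # same role as w1.word LIKE '%' || w2.word || '%'
                pairs.append((w1, w2))
    return pairs, tests


pairs, tests = substring_pairs(["cup", "cake", "cupcake"])
print(pairs)  # the two cupcake rows from the question
print(tests)  # n * (n-1) tests for n distinct words
```

With 3 words that is 6 tests; with 1000 distinct words it is already 999,000.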
Solution
Create a trigram index on word - preferably GIN.
CREATE INDEX tbl_word_trgm_gin_idx ON tbl USING gin (word gin_trgm_ops);
Details:
PostgreSQL LIKE query performance variations
The queries in both answers so far wouldn't use the index even if you had it. Use a query that can actually work with this index:
To list all matches (according to the question body):
Use a LATERAL CROSS JOIN:
SELECT t2.word_id, t2.word, t1.word_id, t1.word
FROM tbl t1
, LATERAL (
SELECT word_id, word
FROM tbl
WHERE word_id <> t1.word_id
AND word like format('%%%s%%', t1.word)
) t2;
To just get rows that have any match (according to your title):
Use an EXISTS semi-join:
SELECT t1.word_id, t1.word
FROM tbl t1
WHERE EXISTS (
SELECT 1
FROM tbl
WHERE word_id <> t1.word_id
AND word like format('%%%s%%', t1.word)
);
I would approach this as:
select w1.word_id, w1.word, w2.word_id as substring_id, w2.word as substring
from words w1 join
words w2
on w1.word like '%' || w2.word || '%' and w1.word <> w2.word;
Note: this is probably a bit faster than doing the loop in the application. However, this query will be implemented as a nested loop in Postgres, so it won't be blazingly fast.

Searching for a number in a database column where the column contains a series of numbers separated by a delimiter '&' in SQLite

My table structure is as follows :
id category
1 1&2&3
2 18&2&1
3 11
4 1&11
5 3&1
6 1
My question: I need a SQL query which generates the following result set when the user searches for category 1:
id category
1 1&2&3
2 18&2&1
4 1&11
5 3&1
6 1
But I am getting all the rows, not just the expected ones.
I have tried the regexp and like operators, but with no success.
select * from mytable where category like '%1%'
select * from mytable where category regexp '([.]*)(1)(.*)'
I really don't know much about regexp; I just found it.
So please help me out.
For matching a list item separated by &, use:
SELECT * FROM mytable WHERE '&'||category||'&' LIKE '%&1&%';
this will match entire item (ie, only 1, not 11, ...), whether it is at list beginning, middle or end.
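The delimiter-wrapping trick can be checked against the sample data with a short script. This is a sketch using Python's sqlite3 module; the helper name ids_in_category is made up, and the search value is passed as a bound parameter rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "1&2&3"), (2, "18&2&1"), (3, "11"), (4, "1&11"), (5, "3&1"), (6, "1")],
)


def ids_in_category(conn, wanted):
    """Ids of rows whose '&'-separated list contains wanted as a whole item."""
    pattern = f"%&{wanted}&%"
    rows = conn.execute(
        "SELECT id FROM mytable WHERE '&' || category || '&' LIKE ? ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]


ids = ids_in_category(conn, "1")
print(ids)
```

Wrapping both the column and the search term in the delimiter is what keeps 11 and 18 from matching a search for 1.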

Count the number of rows that contain a letter/number

What I am trying to achieve is straightforward, however it is a little difficult to explain and I don't know if it is actually even possible in Postgres. I am at a fairly basic level: SELECT, FROM, WHERE, LEFT JOIN ON, HAVING, etc., the basic stuff.
I am trying to count the number of rows that contain a particular letter/number and display that count against the letter/number.
i.e. how many rows have entries that contain an "a/A" (case insensitive).
The table I'm querying is a list of film names. All I want to do is group and count 'a-z' and '0-9' and output the totals. I could run 36 queries sequentially:
SELECT filmname FROM films WHERE filmname ilike '%a%'
SELECT filmname FROM films WHERE filmname ilike '%b%'
SELECT filmname FROM films WHERE filmname ilike '%c%'
And then run pg_num_rows on the result to find the number I require, and so on.
I know how intensive like is and ilike even more so I would prefer to avoid that. Although the data (below) has upper and lower case in the data, I want the result sets to be case insensitive. i.e "The Men Who Stare At Goats" the a/A,t/T and s/S wouldn't count twice for the resultset. I can duplicate the table to a secondary working table with the data all being strtolower and working on that set of data for the query if it makes the query simpler or easier to construct.
An alternative could be something like
SELECT sum(length(regexp_replace(filmname, '[^X|^x]', '', 'g'))) FROM films;
for each letter combination but again 36 queries, 36 datasets, I would prefer if I could get the data in a single query.
Here is a short data set of 14 films from my set (which actually contains 275 rows)
District 9
Surrogates
The Invention Of Lying
Pandorum
UP
The Soloist
Cloudy With A Chance Of Meatballs
The Imaginarium of Doctor Parnassus
Cirque du Freak: The Vampires Assistant
Zombieland
9
The Men Who Stare At Goats
A Christmas Carol
Paranormal Activity
If I manually lay out each letter and number in a column and then register if that letter appears in the film title by giving it an x in that column and then count them up to produce a total I would have something like this below. Each vertical column of x's is a list of the letters in that filmname regardless of how many times that letter appears or its case.
The result for the short set above is:
A x x xxxx xxx 9
B x x 2
C x xxx xx 6
D x x xxxx 6
E xx xxxxx x 8
F x xxx 4
G xx x x 4
H x xxxx xx 7
I x x xxxxx xx 9
J 0
K x 1
L x xx x xx 6
M x xxxx xxx 8
N xx xxxx x x 8
O xxx xxx x xxx 10
P xx xx x 5
Q x 1
R xx x xx xxx 7
S xx xxxx xx 8
T xxx xxxx xxx 10
U x xx xxx 6
V x x x 3
W x x 2
X 0
Y x x x 3
Z x 1
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 x x 2
In the example above, each column is a "filmname". As you can see, column 5 marks only a "u" and a "p" and column 11 marks only a "9". The final column is the tally for each letter.
I want to build a query somehow that gives me the result rows: A 9, B 2, C 6, D 6, E 8, etc., taking into account every row entry extracted from my films column. If a letter doesn't appear in any row I would like a zero.
I don't know if this is even possible or whether to do it systematically in php with 36 queries is the only possibility.
In the current dataset there are 275 entries and it grows by around 8.33 a month (100 a year). I predict it will reach around 1000 rows by 2019 by which time I will be no doubt using a completely different system so I don't need to worry about working with a huge dataset to trawl through.
The current longest title is "Percy Jackson & the Olympians: The Lightning Thief" at 50 chars (yes, poor film I know ;-) and the shortest is 1, "9".
I am running version 9.0.0 of Postgres.
Apologies if I've said the same thing multiple times in multiple ways, I am trying to get as much information out so you know what I am trying to achieve.
If you need any clarification or larger datasets to test with please just ask and I'll edit as needs be.
Suggestions are VERY welcome.
Edit 1
Erwin Thanks for the edits/tags/suggestions. Agree with them all.
Fixed the missing "9" typo as suggested by Erwin. Manual transcribe error on my part.
kgrittn, Thanks for the suggestion but I am not able to update the version from 9.0.0. I have asked my provider if they will try to update.
Response
Thanks for the excellent reply Erwin
Apologies for the delay in responding but I have been trying to get your query to work and learning the new keywords to understand the query you created.
I adjusted the query to adapt into my table structure but the result set was not as expected (all zeros) so I copied your lines directly and had the same result.
In both cases the result set lists all 36 rows with the appropriate letters/numbers, but every row shows zero as the count (ct).
I have tried to deconstruct the query to see where it may be falling over.
The result of
SELECT DISTINCT id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
is "No rows found". Perhaps it ought to when extracted from the wider query, I'm not sure.
When I removed the unnest function the result was 14 rows all with "NULL"
If I adjust the function
COALESCE(y.ct, 0) to COALESCE(y.ct, 4)
then my dataset responds all with 4's for every letter instead of zeros as explained previously.
Having briefly read up on COALESCE the "4" being the substitute value I am guessing that y.ct is NULL and being substituted with this second value (this is to cover rows where the letter in the sequence is not matched, i.e if no films contain a 'q' then the 'q' column will have a zero value rather than NULL?)
The database I tried this on was SQL_ASCII and I wondered if that was somehow a problem but I had the same result on one running version 8.4.0 with UTF-8.
Apologies if I've made an obvious mistake but I am unable to return the dataset I require.
Any thoughts?
Again, thanks for the detailed response and your explanations.
This query should do the job:
Test case:
CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');
Query:
SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM (
SELECT chr(generate_series(97, 122)) AS letter -- a-z in UTF8!
UNION ALL
SELECT generate_series(0, 9)::text -- 0-9
) l
LEFT JOIN (
SELECT letter, count(id) AS ct
FROM (
SELECT DISTINCT -- count film once per letter
id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
) x
GROUP BY 1
) y USING (letter)
ORDER BY 1;
This requires PostgreSQL 9.1! Consider the release notes:
Change string_to_array() so a NULL separator splits the string into
characters (Pavel Stehule)
Previously this returned a null value.
You can use regexp_split_to_table(lower(film), ''), instead of unnest(string_to_array(lower(film), NULL)) (works in versions pre-9.1!), but it is typically a bit slower and performance degrades with long strings.
I use generate_series() to produce the [a-z0-9] as individual rows. And LEFT JOIN to the query, so every letter is represented in the result.
Use DISTINCT to count every film once.
Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.
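For cross-checking the query result outside the database, the same logic (lower-case each title, count each film once per character, keep a zero for characters that never occur) can be sketched in a few lines of Python. This is only an illustration of the counting rule, not a replacement for the SQL; the function name is made up.

```python
import string

FILMS = [
    "District 9", "Surrogates", "The Invention Of Lying", "Pandorum", "UP",
    "The Soloist", "Cloudy With A Chance Of Meatballs",
    "The Imaginarium of Doctor Parnassus",
    "Cirque du Freak: The Vampires Assistant", "Zombieland", "9",
    "The Men Who Stare At Goats", "A Christmas Carol", "Paranormal Activity",
]


def films_per_character(titles):
    """Count, for each of a-z and 0-9, how many titles contain it (case-insensitive)."""
    # Start every character at zero so absent ones still appear in the result.
    counts = {ch: 0 for ch in string.ascii_lowercase + string.digits}
    for title in titles:
        # set() plays the role of DISTINCT: a film counts once per character.
        for ch in set(title.lower()):
            if ch in counts:
                counts[ch] += 1
    return counts


counts = films_per_character(FILMS)
for ch, ct in counts.items():
    print(ch, ct)
```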
A fairly simple solution which only requires a single table scan would be the following.
SELECT
'a', SUM( (title ILIKE '%a%')::integer),
'b', SUM( (title ILIKE '%b%')::integer),
'c', SUM( (title ILIKE '%c%')::integer)
FROM film
I left the other 33 characters as a typing exercise for you :)
BTW 1000 rows is tiny for a PostgreSQL database. It's beginning to get large when the DB is larger than the memory in your server.
edit: had a better idea
SELECT chars.c, COUNT(title)
FROM (VALUES ('a'), ('b'), ('c')) as chars(c)
LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c
You could also replace the (VALUES ('a'), ('b'), ('c')) as chars(c) part with a reference to a table containing the list of characters you are interested in.
This will give you the result in a single row, with one column for each matching letter and digit.
SELECT
SUM(CASE WHEN POSITION('a' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "A",
SUM(CASE WHEN POSITION('b' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "B",
SUM(CASE WHEN POSITION('c' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "C",
...
SUM(CASE WHEN POSITION('z' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "Z",
SUM(CASE WHEN POSITION('0' IN filmname) > 0 THEN 1 ELSE 0 END) AS "0",
SUM(CASE WHEN POSITION('1' IN filmname) > 0 THEN 1 ELSE 0 END) AS "1",
...
SUM(CASE WHEN POSITION('9' IN filmname) > 0 THEN 1 ELSE 0 END) AS "9"
FROM films;
A similar approach to Erwin's, but maybe more comfortable in the long run:
Create a table with each character you're interested in:
CREATE TABLE char (name char (1), id serial);
INSERT INTO char (name) VALUES ('a');
INSERT INTO char (name) VALUES ('b');
INSERT INTO char (name) VALUES ('c');
Then grouping over its values is easy:
SELECT char.name, COUNT(*)
FROM char, film
WHERE film.name ILIKE '%' || char.name || '%'
GROUP BY char.name
ORDER BY char.name;
Don't worry about ILIKE.
I'm not 100% happy about using the keyword 'char' as a table name, but I haven't had bad experiences so far. On the other hand it is the natural name. Maybe if you translate it to another language - like 'zeichen' in German - you avoid ambiguities.

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac") we sort its letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
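A minimal sketch of that pre-processing and lookup, in Python rather than SQL (the function names are made up, and a dictionary stands in for the sorted_text table):

```python
from collections import defaultdict


def signature(word):
    """Letters of the word in sorted order, e.g. 'cat' -> 'act'."""
    return "".join(sorted(word.lower()))


def build_anagram_index(words):
    """Map each sorted-letter signature to the words that share it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index


index = build_anagram_index(["act", "cat", "tom", "moot"])
found = index.get(signature("tac"), [])
print(found)
```

This finds only words that use every input letter exactly once. Words built from a subset of the letters would need additional lookups, for example one per subset of the input.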
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that the rows you retrieve are in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
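The bitmask idea can be tried outside the database too. The sketch below is in Python and assumes only the letters a-z matter; the function names are made up. It also shows the limitation the question raised: because each letter contributes one bit, duplicate letters are not distinguished.

```python
def letter_mask(word):
    """Bitmask with bit 0 set for 'a', bit 1 for 'b', ... bit 25 for 'z'."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def uses_only(word, letters):
    """True if every letter of word occurs in letters (counts are ignored)."""
    return letter_mask(word) & ~letter_mask(letters) == 0


words = ["act", "cat", "tom", "moot", "mote"]
masks = {word: letter_mask(word) for word in words}
print(masks["act"] == masks["cat"])   # same letters, different order
print(masks["tom"] == masks["moot"])  # the duplicate 'o' is not visible in the mask
print(uses_only("moot", "tom"))       # True even though only one 'o' is available
```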
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word, (LEN(wls.word) - SUM(LettersInWord))
from
(
select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
from word w
cross join wordletters wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(LettersInWord))
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(LettersInWord))
from
(
select w.word,
(case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= maxcount
then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
else maxcount end) as LettersInWord
from word w
cross join
(
select letter, count(*) as maxcount
from wordletters wl
group by letter
) wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins. This might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.
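For comparison, the rule "a word qualifies if it needs no letter more often than the input supplies it" can be sketched with per-letter counts in Python. This is only an illustration of the counting logic the SQL above is aiming for, with made-up function names; it is not a claim about how the queries behave on a particular server.

```python
from collections import Counter


def can_build(word, letters):
    """True if word can be spelled from letters, respecting duplicate letters."""
    need = Counter(word.lower())
    have = Counter(letters.lower())
    return all(have[ch] >= n for ch, n in need.items())


def buildable_words(words, letters):
    """Words from the list that can be spelled with the given letters."""
    return [word for word in words if can_build(word, letters)]


words = ["act", "cat", "tom", "moot", "mote"]
found = buildable_words(words, "tomo")
print(found)
```

With the letters "tomo" both "tom" and "moot" qualify, while with "tom" only "tom" does, because "moot" needs a second "o".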