In SQL, how to check if a string is a substring of any other string in the same table? - sql

I have a table full of strings (TEXT) and I'd like to get all the strings that are substrings of any other string in the same table. For example, if I had these three strings in my table:
WORD WORD_ID
cup 0
cake 1
cupcake 2
As result of my query I would like to get something like this:
WORD WORD_ID SUBSTRING SUBSTRING_ID
cupcake 2 cup 0
cupcake 2 cake 1
I know that I could do this with two loops (using Python or JS) by looping over every word in my table and match it against every word in the same table, but I'm not sure how this can be done using SQL (PostgreSQL for that matter).

Use self-join:
select w1.word, w1.word_id, w2.word, w2.word_id
from words w1
join words w2
on w1.word <> w2.word
and w1.word like format('%%%s%%', w2.word);
word | word_id | word | word_id
---------+---------+------+---------
cupcake | 2 | cup | 0
cupcake | 2 | cake | 1
(2 rows)
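If you want to experiment with this self-join without a Postgres instance handy, here is a sketch using Python's sqlite3, with '%' || w2.word || '%' standing in for Postgres's format('%%%s%%', w2.word):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, word_id INTEGER)")
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [("cup", 0), ("cake", 1), ("cupcake", 2)])

# Self-join: keep pairs where w1 contains w2 as a substring
# (and the two rows hold different words).
rows = conn.execute("""
    SELECT w1.word, w1.word_id, w2.word, w2.word_id
    FROM words w1
    JOIN words w2
      ON w1.word <> w2.word
     AND w1.word LIKE '%' || w2.word || '%'
    ORDER BY w2.word_id
""").fetchall()

for row in rows:
    print(row)
```

This prints the same two pairs as the Postgres output above: cupcake/cup and cupcake/cake.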

Problem
The task has the potential to stall your database server for tables of non-trivial size, since it's an O(N²) problem as long as you cannot utilize an index for it.
In a sequential scan you have to check every possible combination of two rows - that's n * (n-1) / 2 combinations - and Postgres will actually run n * (n-1) tests, since it's not easy to rule out reverse duplicate combinations. If you are satisfied with the first match, it gets cheaper - how much depends on data distribution. With many matches, Postgres finds a match for a row early and can skip testing the rest. With few matches, most of the checks have to be performed anyway.
Either way, performance deteriorates rapidly with the number of rows in the table. Test each query with EXPLAIN ANALYZE and 10, 100, 1000 etc. rows in the table to see for yourself.
Solution
Create a trigram index on word - preferably GIN.
CREATE INDEX tbl_word_trgm_gin_idx ON tbl USING gin (word gin_trgm_ops);
Details:
PostgreSQL LIKE query performance variations
The queries in both answers so far wouldn't use the index even if you had it. Use a query that can actually work with this index:
To list all matches (according to the question body):
Use a LATERAL CROSS JOIN:
SELECT t2.word_id, t2.word, t1.word_id, t1.word
FROM tbl t1
   , LATERAL (
     SELECT word_id, word
     FROM tbl
     WHERE word_id <> t1.word_id
     AND word like format('%%%s%%', t1.word)
     ) t2;
To just get rows that have any match (according to your title):
Use an EXISTS semi-join:
SELECT t1.word_id, t1.word
FROM tbl t1
WHERE EXISTS (
     SELECT 1
     FROM tbl
     WHERE word_id <> t1.word_id
     AND word like format('%%%s%%', t1.word)
     );
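Outside Postgres you lose format() and LATERAL, but the EXISTS semi-join translates almost verbatim; a quick sqlite3 sketch (with '%' || ... || '%' in place of format()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (word_id INTEGER, word TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)",
                 [(0, "cup"), (1, "cake"), (2, "cupcake")])

# Semi-join: keep rows whose word appears inside some *other* row's word.
rows = conn.execute("""
    SELECT t1.word_id, t1.word
    FROM tbl t1
    WHERE EXISTS (
        SELECT 1
        FROM tbl
        WHERE word_id <> t1.word_id
          AND word LIKE '%' || t1.word || '%'
    )
""").fetchall()
print(rows)   # 'cup' and 'cake' are substrings of 'cupcake'
```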

I would approach this as:
select w1.word_id, w1.word, w2.word_id as substring_id, w2.word as substring
from words w1 join
words w2
on w1.word like '%' || w2.word || '%' and w1.word <> w2.word;
Note: this is probably a bit faster than doing the loop in the application. However, this query will be implemented as a nested loop in Postgres, so it won't be blazingly fast.

Related

Select rows that do not contain a word from another table

I have a table with one word in each row and a table with some text in each row. I need
to select from the second table only those rows that do not contain any words from the first table.
For example:
Table with constraint words
constraint_word
example
apple
orange
mushroom
car
qwerty
Table with text
text
word1. apple; word3, example
word1, apple, word2. car
word1 word2 orange word3
mushroomword1 word2 word3
word1 car
qwerty
Nothing should be selected in this case, because every row in the second table contains words from the first table.
I only have an idea to use CROSS JOIN to achieve this:
SELECT DISTINCT text FROM text_table CROSS JOIN words_table
WHERE CONTAINS(text, constraint_word ) = 0
Is there a way to do it without using CROSS JOIN?
contains means Oracle Text; cross join means a Cartesian product (usually a performance nightmare).
One option which avoids both of these is the instr function (which checks for the existence of the constraint_word in text, but this time using an inner join) and the minus set operator.
Something like this, using sample data you posted:
SQL> select * from text_table;
TEXT
---------------------------
word1.apple; word3, example
word1, apple, word2.car
word1 word2 orange word3
mushroomword1 word2 word3
word1 car
qwerty
6 rows selected.
SQL> select * From words_table;
CONSTRAI
--------
example
apple
orange
mushroom
car
qwerty
6 rows selected.
SQL>
As you said, initially the query shouldn't return anything, because all constraint_words exist in the texts:
SQL> select c.text
2 from text_table c
3 minus
4 select b.text
5 from words_table a join text_table b on instr(b.text, a.constraint_word) > 0;
no rows selected
Let's modify one of text rows:
SQL> update text_table set text = 'xxx' where text = 'qwerty';
1 row updated.
What's the result now?
SQL> select c.text
2 from text_table c
3 minus
4 select b.text
5 from words_table a join text_table b on instr(b.text, a.constraint_word) > 0;
TEXT
---------------------------
xxx
SQL>
Right; text we've just modified.
Your idea is fine, since you need to test all words for each text.
This is what CROSS JOIN does - it produces every combination (a Cartesian product).
We can even be more restrictive for better performance and use INNER JOIN, or the shorthand JOIN.
See also: CROSS JOIN vs INNER JOIN in SQL
Additionally you need to filter for the text records with no matches at all. This means the count of non-matches over all combinations per text is at its maximum (= the number of constraint_words, here 6).
This filter can be done using GROUP BY with HAVING:
-- text without any constaint_word
SELECT t.text, count(*)
FROM text_table t
JOIN words_table w ON CONTAINS(t.text, w.constraint_word, 1) = 0
GROUP BY t.text
HAVING count(*) = (SELECT count(*) FROM words_table)
;
It will output:
text                      | count(*)
--------------------------+----------
mushroomword1 word2 word3 |        6
Try the demo on SQL Fiddle
Entire-word vs partial matches
Note that 'mushroom' from the constraint words is not matched by CONTAINS, because it is contained as part of a word, not as an entire word.
For partial-matches you can use INSTR as answered by Littlefoot.
See also
Use string contains function in oracle SQL query
How does contains() in PL-SQL work?
Oracle context indexes
Creating and Maintaining Oracle Text Indexes
I believe this works (I think the issue with the CROSS JOIN route is that it includes any texts that don't contain at least one of the words--not just texts that don't contain any):
SELECT DISTINCT text FROM text_table WHERE (SELECT COUNT(*) FROM words_table WHERE CONTAINS(text, constraint_word)) = 0;
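The same requirement (texts containing none of the constraint words) can also be phrased as a NOT EXISTS anti-join, which avoids both MINUS and the Cartesian product. A sketch in Python's sqlite3, whose instr() works like Oracle's (1-based position, 0 if absent):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words_table (constraint_word TEXT)")
conn.execute("CREATE TABLE text_table (text TEXT)")
conn.executemany("INSERT INTO words_table VALUES (?)",
                 [("example",), ("apple",), ("orange",),
                  ("mushroom",), ("car",), ("qwerty",)])
conn.executemany("INSERT INTO text_table VALUES (?)",
                 [("word1. apple; word3, example",),
                  ("word1, apple, word2. car",),
                  ("word1 word2 orange word3",),
                  ("mushroomword1 word2 word3",),
                  ("word1 car",),
                  ("qwerty",)])

QUERY = """
    SELECT t.text
    FROM text_table t
    WHERE NOT EXISTS (
        SELECT 1 FROM words_table w
        WHERE instr(t.text, w.constraint_word) > 0
    )
"""

print(conn.execute(QUERY).fetchall())   # [] - every text contains some word

# After the same update as in the answer, only the changed row comes back.
conn.execute("UPDATE text_table SET text = 'xxx' WHERE text = 'qwerty'")
print(conn.execute(QUERY).fetchall())   # [('xxx',)]
```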

SQL performance like and UTL_MATCH.EDIT_DISTANCE

I have 2 tables: OSD_SPLIT, which contains 180K records with words to be flagged, and TRANSPLIT, which has sentences split up on spaces, e.g. for the sentence 'My name is XYZ', TRANSPLIT would have 4 rows: My, name, is, XYZ.
I have 2 SQLs, one to match within 1 letter of distance: e.g. if TRANSPLIT has a row SHARK and OSD_SPLIT has the word SHRAK, I would like them to match, i.e. the edit distance is 1 and the lengths are the same.
The other SQL looks for OSD_SPLIT words inline within TRANSPLIT: e.g. if TRANSPLIT has XSHARKX, I would like it to match SHARK, because SHARK exists inline within XSHARKX.
Below are the 2 SQLs. I would like to improve performance; both SQLs are doing full table scans on the OSD_SPLIT table, I guess because I am using edit_distance and like '%%'. Any suggestions on improving performance?
SELECT DISTINCT OS.STOP_DSC
FROM OSD_SPLIT OS, TRANSPLIT TS
WHERE TS.TXTSPLIT1 <> OS.STOP_DSC
AND TS.ORGTXT = TS.TXTSPLIT1
AND LENGTH(TS.TXTSPLIT1) > 1
AND LENGTH(TS.TXTSPLIT1) = LENGTH(OS.STOP_DSC)
AND UTL_MATCH.EDIT_DISTANCE(TS.TXTSPLIT1, OS.STOP_DSC) = 1
SELECT DISTINCT OS.STOP_DSC
FROM OSD_SPLIT OS, TRANSPLIT TS
WHERE TS.TXTSPLIT1 <> OS.STOP_DSC
AND TS.TXTSPLIT1 LIKE '%'||OS.STOP_DSC||'%'
AND LENGTH(TS.TXTSPLIT1) >= LENGTH(OS.STOP_DSC)
AND LENGTH(TS.TXTSPLIT1) >= 3 AND LENGTH(OS.STOP_DSC) >= 3
AND ((LENGTH(TS.TXTSPLIT1) - LENGTH(REPLACE(TS.TXTSPLIT1,OS.STOP_DSC,'')))/LENGTH(TS.TXTSPLIT1))*100 >= 80
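One cheap way to prune the edit-distance work in the first query (not from the original post, just a note): once the lengths are already required to be equal, Levenshtein distance 1 reduces to "exactly one position differs", which is a much cheaper test than a general edit-distance call. A stdlib-only Python sketch of that check, with made-up sample words:

```python
def dist1_same_len(a: str, b: str) -> bool:
    # For equal-length strings, Levenshtein distance 1 is equivalent to
    # exactly one differing position (a single substitution): an insert or
    # delete would change the length, so no other single edit applies.
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

words = ["SHIRK", "STORK", "SHARKS", "SHARK"]
print([w for w in words if dist1_same_len("SHARK", w)])  # ['SHIRK']
```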

Random() in Redshift CTE returns wildly incorrect results under certain conditions

(Cross posting this from the AWS forums...)
Need a fairly sizable chunk of dummy data for this. I used this list of English words: http://www.mieliestronk.com/corncob_lowercase.txt
I'm seeing a MASSIVE difference in the number of results I get for seemingly equivalent queries involving the random() function within a CTE in Amazon Redshift. (I'm trying to take a random sample - one query returns an actual sample as expected, the other basically just returns the entire list of items I was trying to sample.)
Can somebody take a look at this? Am I doing something wrong? Is there another issue here?
/* Create tables to hold words */
create table main_words(word varchar(max));
create table couple_words(word varchar(max));
/* Get some words */
copy main_words
from 'S3 LOCATION OF CORNCOB FILE'
credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY'
csv;
/* Put a few in another table */
insert into
couple_words
select top 5000
word
from
main_words;
/* Returns about 500 results */
with the_cte as
(
select
word,
random() as random_value
from
main_words
where
word not in (select word from couple_words)
)
select
count(*)
from
the_cte
where
random_value > .99;
/* Returns about 58,000 results (basically, the whole list) */
with the_cte as
(
select
word
from
main_words
where
word not in (select word from couple_words)
and random() > .99
)
select
count(*)
from
the_cte;
/* Clean up */
drop table if exists main_words;
drop table if exists couple_words;
Have you tried it on a different server?
I just created a sample on SQL Fiddle with 100 rows plus random() > 0.9 and the results are very similar.
First CTE
| count |
|-------|
| 4 |
Second CTE
| count |
|-------|
| 13 |
Average count(*) with 10 runs
| CTE 1 | CTE 2 |
|-------|-------|
|   8.3 |   9.8 |
I suspect some funky query rewriting. If you have to have the inner query, you can use LIMIT 2147483647 inside and see what comes up.

Count particular substring text within column

I have a Hive table, titled 'UK.Choices' with a column, titled 'Fruit', with each row as follows:
AppleBananaAppleOrangeOrangePears
BananaKiwiPlumAppleAppleOrange
KiwiKiwiOrangeGrapesAppleKiwi
etc.
etc.
There are 2.5M rows and the rows are much longer than the above.
I want to count the number of instances that the word 'Apple' appears.
For example above, it is:
Number of 'Apple'= 5
My sql so far is:
select 'Fruit' from UK.Choices
Then in chunks of 300,000 rows I copy and paste into Excel, where I'm more proficient and able to do this using formulas. The problem is that it takes up to an hour and a half to generate each chunk of 300,000 rows.
Anyone know a quicker way to do this bypassing Excel? I can do simple things like counts using where clauses, but something like the above is a little beyond me right now. Please help.
Thank you.
I think I am 2 years too late, but since I was looking for the same answer and finally managed to solve it, I thought it would be a good idea to post it here.
Here is how I do it.
Solution 1:
+-----------------------------------+---------------------------+-------------+-------------+
| Fruits | Transform 1 | Transform 2 | Final Count |
+-----------------------------------+---------------------------+-------------+-------------+
| AppleBananaAppleOrangeOrangePears | #Banana#OrangeOrangePears | ## | 2 |
| BananaKiwiPlumAppleAppleOrange | BananaKiwiPlum##Orange | ## | 2 |
| KiwiKiwiOrangeGrapesAppleKiwi | KiwiKiwiOrangeGrapes#Kiwi | # | 1 |
+-----------------------------------+---------------------------+-------------+-------------+
Here is the code for it:
SELECT length(regexp_replace(regexp_replace(fruits, "Apple", "#"), "[A-Za-z]", "")) as number_of_apples
FROM fruits;
You may have numbers or other special characters in your fruits column and you can just modify the second regexp to incorporate that. Just remember that in hive to escape a character you may need to use \\ instead of just one \.
Solution 2:
SELECT size(split(fruits,"Apple"))-1 as number_of_apples
FROM fruits;
This first splits the string using "Apple" as a separator, producing an array. The size function then gives the size of that array. Note that the size of the array is one more than the number of separators.
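The split-and-count trick in Solution 2 is easy to verify outside Hive, since Python's str.split behaves the same way: n occurrences of the separator yield n + 1 pieces. Using the sample rows from the question:

```python
rows = [
    "AppleBananaAppleOrangeOrangePears",
    "BananaKiwiPlumAppleAppleOrange",
    "KiwiKiwiOrangeGrapesAppleKiwi",
]

# len(split) - 1 == number of occurrences of the separator in the string
counts = [len(r.split("Apple")) - 1 for r in rows]
print(counts)        # per-row counts: [2, 2, 1]
print(sum(counts))   # total: 5, matching the example
```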
This is straight-forward if you have any delimiter ( eg: comma ) between the fruit names. The idea is to split the column into an array, and explode the array into multiple rows using the 'explode' function.
SELECT fruit, count(1) as count FROM
( SELECT
explode(split(Fruit, ',')) as fruit
FROM UK.Choices ) X
GROUP BY fruit
From your example, it looks like fruits are delimited by capital letters. One idea is to split the column based on capital letters, assuming there are no fruits with the same suffix.
SELECT fruit_suffix, count(1) as count FROM
( SELECT
explode(split(Fruit, '[A-Z]')) as fruit_suffix
FROM UK.Choices ) X
WHERE fruit_suffix <> ''
GROUP BY fruit_suffix
The downside is that, the output will not have first letter of the fruit,
pple - 5
range - 4
I think you want to run this in one select, and use the Hive if UDF to sum the different cases. Something like the following...
select sum( if( fruit like '%Apple%' , 1, 0 ) ) as apple_count,
sum( if( fruit like '%Orange%', 1, 0 ) ) as orange_count
from UK.Choices
where ID > start and ID < end;
This conditional aggregation avoids the need for a join.
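For what it's worth, the conditional-sum idea ports to dialects without Hive's if() by using CASE; a sqlite3 sketch with the sample rows from the question (note this counts rows containing the word at least once, not total occurrences):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Choices (Fruit TEXT)")
conn.executemany("INSERT INTO Choices VALUES (?)", [
    ("AppleBananaAppleOrangeOrangePears",),
    ("BananaKiwiPlumAppleAppleOrange",),
    ("KiwiKiwiOrangeGrapesAppleKiwi",),
])

# CASE stands in for Hive's if(condition, 1, 0)
row = conn.execute("""
    SELECT SUM(CASE WHEN Fruit LIKE '%Apple%'  THEN 1 ELSE 0 END) AS apple_rows,
           SUM(CASE WHEN Fruit LIKE '%Orange%' THEN 1 ELSE 0 END) AS orange_rows
    FROM Choices
""").fetchone()
print(row)  # (3, 3) - all three sample rows contain both words
```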
No experience of Hive, I'm afraid, so this may or may not work. But on SQLServer, Oracle etc I'd do something like this:
Assuming that you have an int PK called ID on the row, something along the lines of:
select AppleCount, OrangeCount, AppleCount - OrangeCount score
from
(
select count(*) as AppleCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Apple%'
) a,
(
select count(*) as OrangeCount
from UK.Choices
where ID > start and ID < end
and Fruit like '%Orange%'
) o
I'd leave the division by the total count to the end, when you have all the rows in the spreadsheet and can count them there.
However, I'd urgently ask my boss to let me change the Fruit field to be a table with an FK to Choices and one fruit name per row. Unless this is something you can't do in Hive, this design is something that makes kittens cry.
PS I'd missed that you wanted the count of occurrences of Apple, which this won't do. I'm leaving my answer up, because I reckon that my 'However...' para is actually a good answer. :(

Finding what words a set of letters can create?

I am trying to write some SQL that will accept a set of letters and return all of the possible words it can make. My first thought was to create a basic three table database like so:
Words -- contains 200k words in real life
------
1 | act
2 | cat
Letters -- contains the whole alphabet in real life
--------
1 | a
3 | c
20 | t
WordLetters --First column is the WordId and the second column is the LetterId
------------
1 | 1
1 | 3
1 | 20
2 | 3
2 | 1
2 | 20
But I'm a bit stuck on how I would write a query that returns words that have an entry in WordLetters for every letter passed in. It also needs to account for words that have two of the same letter. I started with this query, but it obviously does not work:
SELECT DISTINCT w.Word
FROM Words w
INNER JOIN WordLetters wl
ON wl.LetterId = 20 AND wl.LetterId = 3 AND wl.LetterId = 1
How would I write a query to return only words that contain all of the letters passed in and accounting for duplicate letters?
Other info:
My Word table contains close to 200,000 words which is why I am trying to do this on the database side rather than in code. I am using the enable1 word list if anyone cares.
Ignoring, for the moment, the SQL part of the problem, the algorithm I'd use is fairly simple: start by taking each word in your dictionary, and producing a version of it with the letters in sorted order, along with a pointer back to the original version of that word.
This would give a table with entries like:
sorted_text word_id
act 123 /* we'll assume `act` was word number 123 in the original list */
act 321 /* we'll assume 'cat' was word number 321 in the original list */
Then when we receive an input (say, "tac"), we sort its letters, look it up in our table of sorted letters joined to the table of the original words, and that gives us a list of the words that can be created from that input.
If I were doing this, I'd have the tables for that in a SQL database, but probably use something else to pre-process the word list into the sorted form. Likewise, I'd probably leave sorting the letters of the user's input to whatever I was using to create the front-end, so SQL would be left to do what it's good at: relational database management.
If you use the solution you provide, you'll need to add an order column to the WordLetters table. Without that, there's no guarantee that the rows you retrieve will be in the same order you inserted them.
However, I think I have a better solution. Based on your question, it appears that you want to find all words with the same component letters, independent of order or number of occurrences. This means that you have a limited number of possibilities. If you translate each letter of the alphabet into a different power of two, you can create a unique value for each combination of letters (aka a bitmask). You can then simply add together the values for each letter found in a word. This will make matching the words trivial, as all words with the same letters will map to the same value. Here's an example:
WITH letters
AS (SELECT Cast('a' AS VARCHAR) AS Letter,
1 AS LetterValue,
1 AS LetterNumber
UNION ALL
SELECT Cast(Char(97 + LetterNumber) AS VARCHAR),
Power(2, LetterNumber),
LetterNumber + 1
FROM letters
WHERE LetterNumber < 26),
words
AS (SELECT 1 AS wordid, 'act' AS word
UNION ALL SELECT 2, 'cat'
UNION ALL SELECT 3, 'tom'
UNION ALL SELECT 4, 'moot'
UNION ALL SELECT 5, 'mote')
SELECT wordid,
word,
Sum(distinct LetterValue) as WordValue
FROM letters
JOIN words
ON word LIKE '%' + letter + '%'
GROUP BY wordid, word
As you'll see if you run this query, "act" and "cat" have the same WordValue, as do "tom" and "moot", despite the difference in number of characters.
What makes this better than your solution? You don't have to build a lot of non-words to weed them out. This will constitute a massive savings of both storage and processing needed to perform the task.
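The bit-per-letter signature in this answer is easy to prototype outside SQL. A Python sketch of the same idea (like SUM(DISTINCT LetterValue) in the query, it ignores how often a letter repeats, so 'tom' and 'moot' collide deliberately):

```python
def letter_mask(word: str) -> int:
    # OR together one bit per distinct letter ('a' -> bit 0, 'b' -> bit 1, ...).
    # Repeated letters set the same bit again, so the mask is a set signature.
    mask = 0
    for ch in word.lower():
        mask |= 1 << (ord(ch) - ord("a"))
    return mask

for w in ["act", "cat", "tom", "moot", "mote"]:
    print(w, letter_mask(w))
# 'act' and 'cat' share a mask, as do 'tom' and 'moot'; 'mote' differs.
```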
There is a solution to this in SQL. It involves using a trick to count the number of times that each letter appears in a word. The following expression counts the number of times that 'a' appears:
select len(word) - len(replace(word, 'a', ''))
The idea is to count the total of all the letters in the word and see if that matches the overall length:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    select w.word, (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) as LettersInWord
    from word w
    cross join wordletters wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))
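As a quick sanity check, the len/replace counting expression at the core of this query behaves like the following Python equivalent (a toy example, not part of the original answer):

```python
def count_letter(word: str, letter: str) -> int:
    # SQL: LEN(word) - LEN(REPLACE(word, letter, ''))
    # Removing every copy of the letter and comparing lengths
    # yields the number of occurrences.
    return len(word) - len(word.replace(letter, ""))

print(count_letter("banana", "a"))  # 3
print(count_letter("banana", "n"))  # 2
```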
This particular solution allows multiple occurrences of a letter. I'm not sure if this was desired in the original question or not. If we want up to a certain number of occurrences, then we might do the following:
select wls.word, (LEN(wls.word) - SUM(wls.LettersInWord))
from
(
    select w.word,
           (case when (LEN(w.word) - LEN(replace(w.word, wl.letter, ''))) <= wl.maxcount
                 then (LEN(w.word) - LEN(replace(w.word, wl.letter, '')))
                 else wl.maxcount end) as LettersInWord
    from word w
    cross join
    (
        select letter, count(*) as maxcount
        from wordletters
        group by letter
    ) wl
) wls
group by wls.word
having (LEN(wls.word) = SUM(wls.LettersInWord))
If you want an exact match to the letters, then the case statement should use " = maxcount" instead of " <= maxcount".
In my experience, I have actually seen decent performance with small cross joins, so this might actually work server-side. There are two big advantages to doing this work on the server. First, it takes advantage of the parallelism on the box. Second, a much smaller set of data needs to be transferred across the network.