Select rows that do not contain a word from another table - sql

I have a table with one word in each row and a table with some text in each row. I need
to select from the second table only those rows that do not contain words from the first table.
For example:
Table with constraint words
constraint_word
example
apple
orange
mushroom
car
qwerty
Table with text
text
word1. apple; word3, example
word1, apple, word2. car
word1 word2 orange word3
mushroomword1 word2 word3
word1 car
qwerty
Nothing should be selected in this case, because every row in the second table contains words from the first table.
My only idea is to use CROSS JOIN to achieve this:
SELECT DISTINCT text FROM text_table CROSS JOIN words_table
WHERE CONTAINS(text, constraint_word ) = 0
Is there a way to do it without using CROSS JOIN?

CONTAINS means Oracle Text; a cross join means a Cartesian product (usually a performance nightmare).
One option which avoids both is the INSTR function (which checks whether constraint_word occurs in text, this time in an inner join) combined with the MINUS set operator.
Something like this, using sample data you posted:
SQL> select * from text_table;
TEXT
---------------------------
word1.apple; word3, example
word1, apple, word2.car
word1 word2 orange word3
mushroomword1 word2 word3
word1 car
qwerty
6 rows selected.
SQL> select * From words_table;
CONSTRAI
--------
example
apple
orange
mushroom
car
qwerty
6 rows selected.
SQL>
As you said, the query initially shouldn't return anything, because every row in text_table contains at least one constraint_word:
SQL> select c.text
2 from text_table c
3 minus
4 select b.text
5 from words_table a join text_table b on instr(b.text, a.constraint_word) > 0;
no rows selected
Let's modify one of text rows:
SQL> update text_table set text = 'xxx' where text = 'qwerty';
1 row updated.
What's the result now?
SQL> select c.text
2 from text_table c
3 minus
4 select b.text
5 from words_table a join text_table b on instr(b.text, a.constraint_word) > 0;
TEXT
---------------------------
xxx
SQL>
Right; the text we've just modified.
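If you don't have an Oracle instance at hand, the same idea can be tried out roughly like this. This is only a sketch: it assumes SQLite (through Python's sqlite3 module) as a stand-in, where EXCEPT plays the role of Oracle's MINUS and instr() behaves like INSTR for this purpose. The data is the sample from the question.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words_table (constraint_word TEXT);
    CREATE TABLE text_table (text TEXT);
    INSERT INTO words_table VALUES
        ('example'), ('apple'), ('orange'), ('mushroom'), ('car'), ('qwerty');
    INSERT INTO text_table VALUES
        ('word1. apple; word3, example'),
        ('word1, apple, word2. car'),
        ('word1 word2 orange word3'),
        ('mushroomword1 word2 word3'),
        ('word1 car'),
        ('qwerty');
""")

# All texts, minus the texts that contain at least one constraint word.
# EXCEPT is SQLite's spelling of Oracle's MINUS.
QUERY = """
    SELECT c.text FROM text_table c
    EXCEPT
    SELECT b.text
    FROM words_table a
    JOIN text_table b ON instr(b.text, a.constraint_word) > 0
"""

before = [row[0] for row in conn.execute(QUERY)]
conn.execute("UPDATE text_table SET text = 'xxx' WHERE text = 'qwerty'")
after = [row[0] for row in conn.execute(QUERY)]
print(before, after)
```

Note that instr() does partial matching, so 'mushroomword1' counts as containing 'mushroom'.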

Your idea is fine, since you need to test all words for each text.
This is what CROSS JOIN does - a combination (cartesian product).
We can even be more restrictive for better performance and use INNER JOIN, or the shorthand JOIN.
See also: CROSS JOIN vs INNER JOIN in SQL
Additionally, you need to keep only those text records that have no match at all. That is the case when the count of non-matches per text reaches its maximum (= the number of constraint_words, here 6).
This filter can be done using GROUP BY with HAVING:
-- text without any constraint_word
SELECT t.text, count(*)
FROM text_table t
JOIN words_table w ON CONTAINS(t.text, w.constraint_word, 1) = 0
GROUP BY t.text
HAVING count(*) = (SELECT count(*) FROM words_table)
;
It will output:
text                       count(*)
mushroomword1 word2 word3  6
Try the demo on SQL Fiddle
Entire-word vs partial matches
Note that 'mushroom' from the constraint words is not matched by CONTAINS, because it occurs only as part of a word ('mushroomword1'), not as an entire word.
For partial-matches you can use INSTR as answered by Littlefoot.
See also
Use string contains function in oracle SQL query
How does contains() in PL-SQL work?
Oracle context indexes
Creating and Maintaining Oracle Text Indexes

I believe this works (I think the issue with the CROSS JOIN route is that it includes any text that is missing at least one of the words, not just texts that contain none of them):
SELECT DISTINCT text FROM text_table WHERE (SELECT COUNT(*) FROM words_table WHERE CONTAINS(text, constraint_word) > 0) = 0;
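To see the difference between the two routes, here is a small sketch. It assumes SQLite via Python's sqlite3 as a stand-in, uses instr() in place of CONTAINS (so matching is partial, not whole-word), and adds the 'xxx' row from the other answer as the one text that contains none of the words.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle, instr() stands in for CONTAINS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words_table (constraint_word TEXT);
    CREATE TABLE text_table (text TEXT);
    INSERT INTO words_table VALUES
        ('example'), ('apple'), ('orange'), ('mushroom'), ('car'), ('qwerty');
    INSERT INTO text_table VALUES
        ('word1. apple; word3, example'),
        ('word1, apple, word2. car'),
        ('word1 word2 orange word3'),
        ('mushroomword1 word2 word3'),
        ('word1 car'),
        ('qwerty'),
        ('xxx');
""")

# CROSS JOIN route: keeps a text as soon as ONE word is missing from it.
naive = [row[0] for row in conn.execute("""
    SELECT DISTINCT text
    FROM text_table CROSS JOIN words_table
    WHERE instr(text, constraint_word) = 0
""")]

# Correlated subquery route: keeps a text only if NO word occurs in it.
strict = [row[0] for row in conn.execute("""
    SELECT DISTINCT text
    FROM text_table
    WHERE (SELECT COUNT(*) FROM words_table
           WHERE instr(text, constraint_word) > 0) = 0
""")]

print(len(naive), strict)
```

The cross join version returns every text here, because each one lacks at least one of the six words; the subquery version returns only 'xxx'.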

sql how to convert multi select field to rows with totals

I have a table with a field whose contents are a concatenated list of selections from a multi-select form. I would like to convert the data in this field into another table where each row has the text of a selection and a count of the number of times this selection was made.
eg.
Original table:
id selections
1 A;B
2 B;D
3 A;B;D
4 C
I would like to get the following out:
selection count
A 2
B 3
C 1
D 2
I could easily do this with split and maps in javascript etc, but not sure how to approach it in SQL. (I use Postgresql) The goal is to use the second table to plot a graph in Google Data Studio.
A much simpler solution:
select regexp_split_to_table(selections, ';'), count(*)
from test_table
group by 1
order by 1;
You can use a lateral join and handy set-returning function regexp_split_to_table() to unnest the strings to rows, then aggregate and count:
select x.selection, count(*) cnt
from mytable t
cross join lateral regexp_split_to_table(t.selections, ';') x(selection)
group by x.selection
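If you want to check the expected counts without a Postgres server, something like the following might do. Be aware that this is a different technique: SQLite has no regexp_split_to_table(), so this sketch (Python's sqlite3, my assumption) splits the strings with a recursive CTE instead. The table and column names are the ones from the first query above.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; a recursive CTE replaces
# regexp_split_to_table(), which SQLite does not have.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_table (id INTEGER, selections TEXT);
    INSERT INTO test_table VALUES
        (1, 'A;B'), (2, 'B;D'), (3, 'A;B;D'), (4, 'C');
""")

counts = conn.execute("""
    WITH RECURSIVE split(selection, rest) AS (
        -- seed: nothing split off yet, list terminated with ';'
        SELECT '', selections || ';' FROM test_table
        UNION ALL
        -- peel off the part before the first ';' on each step
        SELECT substr(rest, 1, instr(rest, ';') - 1),
               substr(rest, instr(rest, ';') + 1)
        FROM split
        WHERE rest <> ''
    )
    SELECT selection, count(*) FROM split
    WHERE selection <> ''
    GROUP BY selection
    ORDER BY selection
""").fetchall()

print(counts)
```

UNION ALL matters here: plain UNION would merge identical intermediate rows coming from different ids and undercount.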

Getting rows for values greater by 3 numbers or letters

I am trying to come up with a query that returns rows where the distance between two letters is one or more for the chosen letter.
For example:
I have two columns with letters, Column A and Column B. I want to return the rows where Column B is greater than Column A by one or more letters.
It's not clear to me, when you say "greater", whether you mean that the distance between any two letters is 2 or 3 (Column B can be alphabetically before or after Column A, by a distance of 2 or 3), or that Column B has to be alphabetically after Column A, by a distance of 2 or 3.
Because I'm not certain what you're talking about, I present two options. Read the "if" rule and choose the one that applies to your situation, then use the query under it:
If columnA is D and columnB can be any of: A B F G
SELECT * FROM table WHERE ABS(ASCII(columna) - ASCII(columnb)) IN (2,3)
If columnA is D and columnB can be any of: F G
SELECT * FROM table WHERE ASCII(columnb) - ASCII(columna) IN (2,3)
Edit1: Per your later comment, you are now saying that the distance is not just 2 or 3 letters (the first line of your question states "2 or 3") but any number of letters distance equal to or greater than 2:
SELECT * FROM table WHERE ASCII(columnb) - ASCII(columna) >= 2
Overall the technique isn't much different to the above queries and there are many ways to specify what you want:
SELECT * FROM table
WHERE
ASCII(columnb) - ASCII(columna)
BETWEEN <some_number_here> AND <other_number_here>
Ultimately the most important thing is to note the use of ASCII function, which gives us the ascii char code of the first letter in a string:
ASCII('ABCD') => 65
And we can use maths on this to work out whether a letter's distance from 'A' is more than 1, etc.
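A quick way to play with this arithmetic, as a sketch only: the snippet assumes SQLite via Python's sqlite3, where the function is called unicode() rather than ASCII(), and a made-up table named letters with columna fixed to 'D'.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; its unicode() function
# returns the code point of the first character, like ASCII() for plain letters.
# The table name "letters" and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE letters (columna TEXT, columnb TEXT);
    INSERT INTO letters VALUES
        ('D', 'A'), ('D', 'B'), ('D', 'E'), ('D', 'F'), ('D', 'G');
""")

def column_b(where):
    """Return the columnb values of the rows matching the given condition."""
    sql = f"SELECT columnb FROM letters WHERE {where} ORDER BY columnb"
    return [row[0] for row in conn.execute(sql)]

# Distance of 2 or 3 in either direction.
either_side = column_b("ABS(unicode(columna) - unicode(columnb)) IN (2, 3)")
# Column B must come 2 or 3 letters after column A.
after_only = column_b("unicode(columnb) - unicode(columna) IN (2, 3)")
# Open-ended: column B at least 1 letter after column A.
at_least_one = column_b("unicode(columnb) - unicode(columna) >= 1")

print(either_side, after_only, at_least_one)
```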
Probably also worth noting that ASCII() works on single-byte ASCII characters. If your data is multibyte (Unicode), you might need to use ORD() instead.
Edit2: Your latest edit to the question revises the limit to "B greater than A by one or more", which is equivalent to >= 1.
The question seems not to have a clear spec, please treat the answer as a guide for the general technique:
--for an open ended distance, ascii chars
SELECT * FROM table WHERE ASCII(columnb) - ASCII(columna) >= <some_distance>
--for an open ended distance, unicode
SELECT * FROM table
WHERE ORD(columnb) - ORD(columna) >= <some_distance>
--for a definite range of distances (replace … appropriately)
SELECT * FROM table
WHERE ... BETWEEN <some_distance> AND <some_other_distance>
This will work if column B must be exactly 2 letters after column A:
select * from table_name where ascii(col_1)+2=ascii(col_2);
You can use something like this if you need it to be exactly 2 or 3 letters greater:
select ColumnA, ColumnB from table_name where ASCII(ColumnB) - ASCII(ColumnA) in (2,3)
If you want all rows where the difference is 2 or more, then use this:
select ColumnA, ColumnB from table_name where ASCII(ColumnB) - ASCII(ColumnA) >= 2
This is where you can put ASCII into action:
select * from SampleTable where (ASCII(sampleTable.ColumnB) - ASCII(ColumnA)) >= 2;

Find rows that contain all words in any order

My application is built in vb.net with SQL Server Compact as the database so I'm unable to use a full-text index.
Here's my data...
MainTable field1
A B C
B G C
X Y Z
C P B
Search term = B C
Expected Results = any combination of the search term = Rows 1, 2, 4
Here's what I'm currently doing...
I'm permuting the search term B C into an array containing %B%C% and %C%B% and inserting those values into field1 of tempTable.
So my SQL looks like this:
SELECT * FROM MainTable INNER JOIN tempTable ON MainTable.field1 LIKE tempTable.field1
In this simple example, it does return the expected results correctly. However, my search term can contain more values. For example, the 6 search terms B C D E F G have 720 different permutations, and as more search terms are used the number of permutations grows factorially, which is not good.
Is there a better way to do this?
The following will work for your example above:
Select * from table where field1 like '%[BC]%'
But it will also return strings that contain ONLY "B" or "C". Do you need both characters in any order or one or more?
EDIT: Then the following would work:
Select * from test_data where col1 LIKE '%Apple%' and col1 like '%Dog%'
See the demo here: http://rextester.com/edit/LNDQ49764
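With one LIKE condition per term, the query grows linearly with the number of terms instead of needing every permutation. A rough sketch of building it, assuming SQLite through Python's sqlite3 instead of SQL Server Compact and vb.net, and an id column I added so the rows can be numbered:

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server Compact; the id column is
# added here only to identify the rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MainTable (id INTEGER, field1 TEXT);
    INSERT INTO MainTable VALUES
        (1, 'A B C'), (2, 'B G C'), (3, 'X Y Z'), (4, 'C P B');
""")

def rows_containing_all(terms):
    """Return ids of rows whose field1 contains every term, in any order."""
    # One "field1 LIKE ?" per term, joined with AND; values are bound as
    # parameters rather than pasted into the SQL string.
    where = " AND ".join("field1 LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    sql = f"SELECT id FROM MainTable WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]

matches = rows_containing_all(["B", "C"])
print(matches)
```

Six search terms then mean six conditions, not 720 patterns. Keep in mind that LIKE matches substrings, so a term like B would also match inside a longer word.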

In SQL, how to check if a string is the substring of any other string in the same table?

I have a table full of strings (TEXT) and I would like to get all the strings that are substrings of any other string in the same table. For example, if I had these three strings in my table:
WORD WORD_ID
cup 0
cake 1
cupcake 2
As result of my query I would like to get something like this:
WORD WORD_ID SUBSTRING SUBSTRING_ID
cupcake 2 cup 0
cupcake 2 cake 1
I know that I could do this with two loops (using Python or JS) by looping over every word in my table and match it against every word in the same table, but I'm not sure how this can be done using SQL (PostgreSQL for that matter).
Use self-join:
select w1.word, w1.word_id, w2.word, w2.word_id
from words w1
join words w2
on w1.word <> w2.word
and w1.word like format('%%%s%%', w2.word);
  word   | word_id | word | word_id
---------+---------+------+---------
 cupcake |       2 | cup  |       0
 cupcake |       2 | cake |       1
(2 rows)
Problem
The task has the potential to stall your database server for tables of non-trivial size, since it's an O(N²) problem as long as you cannot utilize an index for it.
In a sequential scan you have to check every possible combination of two rows, that's n * (n-1) / 2 combinations. Postgres will run n * (n-1) tests, since it's not easy to rule out reverse duplicate combinations. If you are satisfied with the first match, it gets cheaper; how much depends on data distribution. For many matches, Postgres will find a match for a row early and can skip testing the rest. For few matches, most of the checks have to be performed anyway.
Either way, performance deteriorates rapidly with the number of rows in the table. Test each query with EXPLAIN ANALYZE and 10, 100, 1000 etc. rows in the table to see for yourself.
Solution
Create a trigram index on word - preferably GIN.
CREATE INDEX tbl_word_trgm_gin_idx ON tbl USING gin (word gin_trgm_ops);
Details:
PostgreSQL LIKE query performance variations
The queries in both answers so far wouldn't use the index even if you had it. Use a query that can actually work with this index:
To list all matches (according to the question body):
Use a LATERAL CROSS JOIN:
SELECT t2.word_id, t2.word, t1.word_id, t1.word
FROM tbl t1
, LATERAL (
SELECT word_id, word
FROM tbl
WHERE word_id <> t1.word_id
AND word like format('%%%s%%', t1.word)
) t2;
To just get rows that have any match (according to your title):
Use an EXISTS semi-join:
SELECT t1.word_id, t1.word
FROM tbl t1
WHERE EXISTS (
SELECT 1
FROM tbl
WHERE word_id <> t1.word_id
AND word like format('%%%s%%', t1.word)
);
I would approach this as:
select w1.word_id, w1.word, w2.word_id as substring_id, w2.word as substring
from words w1 join
words w2
on w1.word like '%' || w2.word || '%' and w1.word <> w2.word;
Note: this is probably a bit faster than doing the loop in the application. However, this query will be implemented as a nested loop in Postgres, so it won't be blazingly fast.
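For a quick sanity check of the self-join and the EXISTS variant on the sample data, here is a sketch. It assumes SQLite via Python's sqlite3 rather than PostgreSQL, so no trigram index is involved and nothing here says anything about performance.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; only the query logic is shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word TEXT, word_id INTEGER);
    INSERT INTO words VALUES ('cup', 0), ('cake', 1), ('cupcake', 2);
""")

# Every (word, substring) pair: self-join with a LIKE pattern built from w2.
pairs = conn.execute("""
    SELECT w1.word, w1.word_id, w2.word, w2.word_id
    FROM words w1
    JOIN words w2
      ON w1.word <> w2.word
     AND w1.word LIKE '%' || w2.word || '%'
    ORDER BY w2.word_id
""").fetchall()

# Only the words that occur inside some other word: EXISTS semi-join.
substrings = conn.execute("""
    SELECT t1.word_id, t1.word
    FROM words t1
    WHERE EXISTS (
        SELECT 1 FROM words t2
        WHERE t2.word_id <> t1.word_id
          AND t2.word LIKE '%' || t1.word || '%'
    )
    ORDER BY t1.word_id
""").fetchall()

print(pairs, substrings)
```

One caveat that applies to the LIKE-based queries in general: a word containing % or _ would be treated as a pattern, so such characters would need escaping.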

SQL - Select the longest substrings

I have data like this:
AB
ABC
ABCD
ABCDE
EF
EFG
IJ
IJK
IJKL
and I just want to get ABCDE, EFG, IJKL. How can I do that in Oracle SQL?
The strings have a minimum length of 2 but no fixed length; it can be anywhere from 2 to 100.
In the event that you mean "longest string for each sequence of strings", the answer is a little different -- you are not guaranteed that all have a length of 4. Instead, you want to find the strings where adding a letter isn't another string.
select t.str
from table t
where not exists (select 1
from table t2
where substr(t2.str, 1, length(t.str)) = t.str and
length(t2.str) = length(t.str) + 1
);
Do note that performance of this query will not be great if you have even a moderate number of rows.
Select all rows where the string is not a substring of any other row. It's not clear if this is what you want though.
select t.str
from table t
where not exists (
select 1
from table t2
where t2.str <> t.str and -- ignore the row itself
instr(t2.str, t.str) > 0 -- t.str occurs inside another string
);
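Both NOT EXISTS approaches can be compared on the sample data with a sketch like this one. It assumes SQLite via Python's sqlite3 in place of Oracle (substr, length and instr behave the same way for this data) and a hypothetical table name strings, since table itself is a reserved word.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; "strings" is a made-up table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE strings (str TEXT);
    INSERT INTO strings VALUES
        ('AB'), ('ABC'), ('ABCD'), ('ABCDE'),
        ('EF'), ('EFG'),
        ('IJ'), ('IJK'), ('IJKL');
""")

# Keep a string when no other string extends it by exactly one character.
longest_per_sequence = [row[0] for row in conn.execute("""
    SELECT t.str FROM strings t
    WHERE NOT EXISTS (
        SELECT 1 FROM strings t2
        WHERE substr(t2.str, 1, length(t.str)) = t.str
          AND length(t2.str) = length(t.str) + 1
    )
    ORDER BY t.str
""")]

# Keep a string when it does not occur inside any other string.
not_a_substring = [row[0] for row in conn.execute("""
    SELECT t.str FROM strings t
    WHERE NOT EXISTS (
        SELECT 1 FROM strings t2
        WHERE t2.str <> t.str
          AND instr(t2.str, t.str) > 0
    )
    ORDER BY t.str
""")]

print(longest_per_sequence, not_a_substring)
```

On this data the two give the same answer; they would differ if a sequence had a gap (say ABCD missing), or if a string appeared in the middle of another one.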