SQL: Select strings which have equal words - sql

Suppose I have a table of strings, like this:
VAL
-----------------
Content of values
Values identity
Triple combo
my combo
sub-zero combo
I want to find strings which have equal words. The result set should be like
VAL MATCHING_VAL
------------------ ------------------
Content of values Values identity
Triple combo My combo
Triple combo sub-zero combo
or at least something like this.
Can you help?

One method is to use a hack for regular expressions:
select t1.val, t2.val
from t t1 join
t t2
on regexp_like(t1.val, replace(t2.val, ' ', '|');
You might want the case to be identical as well:
on regexp_like(lower(t1.val), replace(lower(t2.val), ' ', '|');

You could use a combination of SUBSTRING and LIKE.
use charIndex(" ") to split the words up in the substring if thats what you want to do.

Using some of the [oracle internal similiarity] found in UTL_Match (https://docs.oracle.com/database/121/ARPLS/u_match.htm#ARPLS71219) matching...
This logic is more for matching names or descriptions that are 'Similar' and where phonetic spellings or typo's may cause the records not to match.
By adjusting the .5 below you can see how the %'s get you closer and closer to perfect matches.
with cte as (
select 'Content of values' val from dual union all
select 'Values identity' val from dual union all
select 'triple combo' from dual union all
select 'my combo'from dual union all
select 'sub-zero combo'from dual)
select a.*, b.*, utl_match.edit_distance_similarity(a.val, b.val) c, UTL_MATCH.JARO_WINKLER(a.val,b.val) JW
from cte a
cross join cte b
where UTL_MATCH.JARO_WINKLER(a.val,b.val) > .5
order by utl_match.edit_distance_similarity(a.val, b.val) desc
and screenshot of query/output.
Or we could use an inner join and > if we only want one way compairisons...
select a.*, b.*, utl_match.edit_distance_similarity(a.val, b.val) c, UTL_MATCH.JARO_WINKLER(a.val,b.val) JW
from cte a
inner join cte b
on A.Val > B.Val
where utl_match.jaro_winkler(a.val,b.val) > .5
order by utl_match.edit_distance_similarity(a.val, b.val) desc
this returns the 3 desired records.
But this does not explicitly check each any word matches. which was your base requirement. I just wanted you to be aware of alternatives.

Related

Count for a list of items with zero for does not exist

If I have a table t1 with:
my_col
------
foo
foo
bar
And I have a list with foo and hello
How can I get:
my_col | count
-------|-------
foo | 2
hello | 0
If I just do
SELECT my_col, COUNT(*)
FROM t1
WHERE my_col in ('foo', 'hello')
GROUP BY my_col
I get
my_col | count
-------|------
foo | 2
without any value for hello.
I'm specifically wanting this to be in reference to a list of items because this will be called in a program where the list is a variable.
Ideally you should maintain a separate table with all the possible column values which you want to appear in your report. In the absence of that, we can try using a CTE here:
WITH cte AS (
SELECT 'foo' AS my_col UNION ALL
SELECT 'bar' UNION ALL
SELECT 'hello'
)
SELECT
a.my_col,
COUNT(b.my_col) AS count
FROM cte a
LEFT JOIN t1 b
ON a.my_col = b.my_col
WHERE
a.my_col IN ('foo', 'hello')
GROUP BY
a.my_col;
Demo
Here's yet another way, using values:
select
t2.my_col, count (t1.my_col)
from
(values ('foo'), ('hello')) as t2 (my_col)
left join t1 on t1.my_col = t2.my_col
group by
t2.my_col
Note that count (t1.my_col) returns a 0 for "hello" since nulls are not counted. count (*) by contast would have returned 1 for "hello" because it was counting the row.
You can turn your list into a set of rows and use a LEFT JOIN, like :
SELECT x.val, COUNT(t.my_col)
FROM
(SELECT 'foo' val UNION SELECT 'hello') x
LEFT JOIN t ON t.my_col = x.val
GROUP BY x.val
Postgres solution:
One way is to place the 'list' into an ARRAY, and then convert the ARRAY into a column using unnest. Then perform a left join on that column with the other table and perform a count.
WITH t1 AS (
SELECT 'foo' AS my_col UNION ALL
SELECT 'foo' UNION ALL
SELECT 'bar'
)
SELECT
a.my_col,
COUNT(b.my_col) AS count
FROM unnest(ARRAY['foo', 'hello']) a (my_col)
LEFT JOIN t1 b
ON a.my_col = b.my_col
GROUP BY
a.my_col;
The issue I had with the other answers is that (while they they helped me get to the solution) they did not provide a solution where the items of interest were in a single list (which isn't an actual sql term, so the fault is on me).
However, my real use case is to perform a native query using java and hibernate, and unfortunately the above does not work because the typing cannot be determined. Instead I converted my list into a single string and used string_to_array in place of the ARRAY function.
So the solution that worked best for my use case is below (but at this point, the other answers would be just as correct since I'm now having to do manual string manipulation, but I'm leaving this here for the sake of posterity)
WITH t1 AS (
SELECT 'foo' AS my_col UNION ALL
SELECT 'foo' UNION ALL
SELECT 'bar'
)
SELECT
a.my_col,
COUNT(b.my_col) AS count
FROM unnest(string_to_array('foo, hello', ',')) a (my_col)
LEFT JOIN t1 b
ON a.my_col = b.my_col
GROUP BY
a.my_col;

SQL: Return a count of 0 with count(*)

I am using WinSQL to run a query on a table to count the number of occurrences of literal strings. When trying to do a count on a specific set of strings, I still want to see if some values return a count of 0. For example:
select letter, count(*)
from table
where letter in ('A', 'B', 'C')
group by letter
Let's say we know that 'A' occurs 3 times, 'B' occurs 0 times, and 'C' occurs 5 times. I expect to have a table returned as such:
letter count
A 3
B 0
C 5
However, the table never returns a row with a 0 count, which results like so:
letter count
A 3
C 5
I've looked around and saw some articles mentioning the use of joins, but I've had no luck in correctly returning a table that looks like the first example.
You can create an in-line table containing all letters that you look for, then LEFT JOIN your table to it:
select t1.col, count(t2.letter)
from (
select 'A' AS col union all select 'B' union all select 'C'
) as t1
left join table as t2 on t1.col = t2.letter
group by t1.col
on many platforms you can now use the values statement instead of union all to create your "in line" table - like this
select t.letter, count(mytable.letter)
from ( values ('A'),('B'),('C') ) as t(letter)
left join mytable on t.letter = mytable.letter
group by t.letter
I'm not that familiar with WinSQL, but it's not pretty if you don't have the values that you want in the left most column in a table somewhere. If you did, you could use a left join and a conditional. Without it, you can do something like this:
SELECT all_letters.letter, IFNULL(letter_count.letter_count, 0)
FROM
(
SELECT 'A' AS letter
UNION
SELECT 'B' AS letter
UNION
SELECT 'C' AS letter
) all_letters
LEFT JOIN
(SELECT letter, count(*) AS letter_count
FROM table
WHERE letter IN ('A', 'B', 'C')
GROUP BY letter) letter_count
ON all_letters.letter = letter_count.letter

SQL Customized search with special characters

I am creating a key-wording module where I want to search data using the comma separated words.And the search is categorized into comma , and minus -.
I know a relational database engine is designed from the principle that a cell holds a single value and obeying to this rule can help for performance.But in this case table is already running and have millions of data and can't change the table structure.
Take a look on the example what I exactly want to do is
I have a main table name tbl_main in SQL
AS_ID KWD
1 Man,Businessman,Business,Office,confidence,arms crossed
2 Man,Businessman,Business,Office,laptop,corridor,waiting
3 man,business,mobile phone,mobile,phone
4 Welcome,girl,Greeting,beautiful,bride,celebration,wedding,woman,happiness
5 beautiful,bride,wedding,woman,girl,happiness,mobile phone,talking
6 woman,girl,Digital Tablet,working,sitting,online
7 woman,girl,Digital Tablet,working,smiling,happiness,hand on chin
If search text is = Man,Businessman then result AS_ID is =1,2
If search text is = Man,-Businessman then result AS_ID is =3
If search text is = woman,girl,-Working then result AS_ID is =4,5
If search text is = woman,girl then result AS_ID is =4,5,6,7
What is the best why to do this, Help is much appreciated.Thanks in advance
I think you can easily solve this by creating a FULL TEXT INDEX on your KWD column. Then you can use the CONTAINS query to search for phrases. The FULL TEXT index takes care of the punctuation and ignores the commas automatically.
-- If search text is = Man,Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND "Businessman"')
-- If search text is = Man,-Businessman then the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"Man" AND NOT "Businessman"')
-- If search text is = woman,girl,-Working the query will be
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "girl" AND NOT "working"')
To search the multiple words (like the mobile phone in your case) use the quoted phrases:
SELECT AS_ID FROM tbl_main
WHERE CONTAINS(KWD, '"woman" AND "mobile phone"')
As commented below the quoted phrases are important in all searches to avoid bad searches in the case of e.g. when a search term is "tablet working" and the KWD value is woman,girl,Digital Tablet,working,sitting,online
There is a special case for a single - search term. The NOT cannot be used as the first term in the CONTAINS. Therefore, the query like this should be used:
-- If search text is = -Working the query will be
SELECT AS_ID FROM tbl_main
WHERE NOT CONTAINS(KWD, '"working"')
Here is my attempt using Jeff Moden's DelimitedSplit8k to split the comma-separated values.
First, here is the splitter function (check the article for updates of the script):
CREATE FUNCTION [dbo].[DelimitedSplit8K](
#pString VARCHAR(8000), #pDelimiter CHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,E2(N) AS (SELECT 1 FROM E1 a, E1 b)
,E4(N) AS (SELECT 1 FROM E2 a, E2 b)
,cteTally(N) AS(
SELECT TOP (ISNULL(DATALENGTH(#pString), 0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
,cteStart(N1) AS(
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(#pString, t.N, 1) = #pDelimiter
),
cteLen(N1, L1) AS(
SELECT
s.N1,
ISNULL(NULLIF(CHARINDEX(#pDelimiter, #pString, s.N1),0) - s.N1, 8000)
FROM cteStart s
)
SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(#pString, l.N1, l.L1)
FROM cteLen l
Here is the complete solution:
-- search parameter
DECLARE #search_text VARCHAR(8000) = 'woman,girl,-working'
-- split comma-separated search parameters
-- items starting in '-' will have a value of 1 for exclude
DECLARE #search_values TABLE(ItemNumber INT, Item VARCHAR(8000), Exclude BIT)
INSERT INTO #search_values
SELECT
ItemNumber,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN LTRIM(RTRIM(STUFF(Item, 1, 1 ,''))) ELSE LTRIM(RTRIM(Item)) END,
CASE WHEN LTRIM(RTRIM(Item)) LIKE '-%' THEN 1 ELSE 0 END
FROM dbo.DelimitedSplit8K(#search_text, ',') s
;WITH CteSplitted AS( -- split each KWD to separate rows
SELECT *
FROM tbl_main t
CROSS APPLY(
SELECT
ItemNumber, Item = LTRIM(RTRIM(Item))
FROM dbo.DelimitedSplit8K(t.KWD, ',')
)x
)
SELECT
cs.AS_ID
FROM CteSplitted cs
INNER JOIN #search_values sv
ON sv.Item = cs.Item
GROUP BY cs.AS_ID
HAVING
-- all parameters should be included (Relational Division with no Remainder)
COUNT(DISTINCT cs.Item) = (SELECT COUNT(DISTINCT Item) FROM #search_values WHERE Exclude = 0)
-- no exclude parameters
AND SUM(CASE WHEN sv.Exclude = 1 THEN 1 ELSE 0 END) = 0
SQL Fiddle
This one uses a solution from the Relational Division with no Remainder problem discussed in this article by Dwain Camps.
From what you've described, you want the keywords that are included in the search text to be a match in the KWD column, and those that are prefixed with a - to be excluded.
Despite the data existing in this format, it still makes most sense to normalize the data, and then query based on the existence or non existence of the keywords.
To do this, in very rough terms:-
Create two additional tables - Keyword and tbl_Main_Keyword. Keyword contains a distinct list of each of the possible keywords and tbl_Main_Keyword contains a link between each record in tbl_Main to each Keyword record where there's a match. Ensure to create an index on the text field for the keyword (e.g. the Keyword.KeywordText column, or whatever you call it), as well as the KeywordID field in the tbl_Main_Keyword table. Create Foreign Keys between tables.
Write some DML (or use a separate program, such as a C# program) to iterate through each record, parsing the text, and inserting each distinct keyword encountered into the Keyword table. Create a relationship to the row for each keyword in the tbl_main record.
Now, for searching, parse out the search text into keywords, and compose a query against the tbl_Main_Keyword table containing both a WHERE KeywordID IN and WHERE KeywordID NOT IN clause, depending on whether there is a match.
Take note to consider whether the case of each keyword is important to your business case, and consider the collation (case sensitive or insensitive) accordingly.
I would prefer cha's solution, but here's another solution:
declare #QueryParts table (q varchar(1000))
insert into #QueryParts values
('woman'),
('girl'),
('-Working')
select AS_ID
from tbl_main
inner join #QueryParts on
(q not like '-%' and ',' + KWD + ',' like '%,' + q + ',%') or
(q like '-%' and ',' + KWD + ',' not like '%,' + substring(q, 2, 1000) + ',%')
group by AS_ID
having COUNT(*) = (select COUNT(*) from #QueryParts)
With such a design, you would have two tables. One that defines the IDs and a subtable that holds the set of keywords per search string.
Likewise, you would transform the search strings into two tables, one for strings that should match and one for negated strings. Assuming that you put this in a stored procedure, these tables would be table-value parameters.
Once you have this set up, the query is simple to write:
SELECT M.AS_ID
FROM tbl_main M
WHERE (SELECT COUNT(*)
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM #searchwords)) =
(SELECT COUNT(*) FROM #searchwords)
AND NOT EXISTS (SELECT *
FROM tbl_keywords K
WHERE K.AS_ID = M.AS_ID
AND K.KWD IN (SELECT word FROM #minuswords))

Double IN Statements in SQL

Just curious about the IN statement in SQL.
I know I can search multiple columns with one value by doing
'val1' IN (col1,col2)
And can search a column for multiple values
col1 IN ('val1','val2')
But is there a way to do both of these simultaneously, without restorting to an repeating AND / OR in the SQl? I am looking to do this in the most scalable way, so independent of how many vals / cols i need to search in.
So essentially:
('val1','val2') IN (col1,col2)
but valid.
You could do something like this (which I've also put on SQLFiddle):
-- Test data:
WITH t(col1, col2) AS (
SELECT 'val1', 'valX' UNION ALL
SELECT 'valY', 'valZ'
)
-- Solution:
SELECT *
FROM t
WHERE EXISTS (
SELECT 1
-- Join all columns with all values to see if any column matches any value
FROM (VALUES(t.col1),(t.col2)) t1(col)
JOIN (VALUES('val1'),('val2')) t2(val)
ON col = val
)
Of course, one could argue, which version is more concise.
Yes, for example you can do this in Oracle:
select x, y from (select 1 as x, 2 as y from dual)
where (x,y) in (select 1 as p, 2 as q from dual)

Help with T-sql special sorting rules

I have a field likeļ¼š
SELECT * FROM
(
SELECT 'A9t' AS sortField UNION ALL
SELECT 'A10t' UNION ALL
SELECT 'A11t' UNION ALL
SELECT 'AB9F' UNION ALL
SELECT 'AB10t' UNION ALL
SELECT 'AB11t'
) t ORDER BY sortField
and the result is:
sortField
---------
A10t
A11t
A9t
AB10t
AB11t
AB9F
Actually I need is to combine the string and number sorting rules:
sortField
---------
A9t
A10t
A11t
AB9F
AB10t
AB11t
SELECT *
FROM (
SELECT 'A9t' AS sortField UNION ALL
SELECT 'A10t' UNION ALL
SELECT 'A11t' UNION ALL
SELECT 'AB9F' UNION ALL
SELECT 'AB10t' UNION ALL
SELECT 'AB11t'
)
t
ORDER BY LEFT(sortField,PATINDEX('%[0-9]%',sortField)-1) ,
CAST(substring(sortField,PATINDEX('%[0-9]%',sortField),1 + PATINDEX('%[0-9][A-Z]%',sortField) -PATINDEX('%[0-9]%',sortField) ) AS INT),
substring(sortField,PATINDEX('%[0-9][A-Z]%',sortField) + 1,LEN(sortField))
If the first character is always a letter, try:
SELECT * FROM
(
SELECT 'A9t' AS sortField UNION ALL
SELECT 'A10t' UNION ALL
SELECT 'A11t'
) t ORDER BY substring(sortField,2,len(sortField)-1) desc
I would say that you have combined the alpha and numeric sort. But what I think you're asking is that you want to sort letters in ascending order and numbers in descending order, and that might be hard to do in a nice looking way. The previous answers will not working for your problem, the problem is that Martin Smith's solution doesn't take strings with two letters as prefix and Parkyprg doesn't sort numbers before letters as you ask for.
What you need to do is to use a custom order, see example here: http://www.emadibrahim.com/2007/05/25/custom-sort-order-in-a-sql-statement/, but that is a tedious way to do it.
EDIT: Martins Smith's solution is updated and works just fine!