fuzzy join in SQL - sql

I was hoping someone could shed some light for me on my issue.
I need to be able to join the following two tables together in SQL
Values in table 1 for some column
QWERTY10
QAZWSXEDCR10
QAZWSXED1230
Values in table 2 for some column
QWERTY20
QAZWSXEDCR20
QAZWSXED1240
the result that I need is
QWERTY100000 QWERTY200000
QAZWSXEDCR10 QAZWSXEDCR20
QAZWSXED1230 QAZWSXED1240
Now, for QWERTY10 to be linked to QWERTY20, I need to join on the first 6 characters of the value in the field,
but for QAZWSXEDCR10 to be linked to QAZWSXEDCR20, I need to join on the first 10 characters of the value in the field. If I join on the first 6 characters only, I will get duplicates, something like this:
QAZWSXEDCR10 QAZWSXEDCR20
QAZWSXEDCR10 QAZWSXED1240
QAZWSXED1230 QAZWSXEDCR20
QAZWSXED1230 QAZWSXED1240
I also need QAZWSXED1230 to be linked to QAZWSXED1240, and there I need to join on 8 characters to make it work.
I'm having a hard time figuring out how to join my data together. I would like to avoid doing 10 different joins, each based on a different number of characters,
e.g. joining on 6 characters first and, if that is not successful, joining on 7, 8, 9 and 10. There must be a different way...
Can someone recommend a solution here?
KR

As mentioned in Milney's comment, PATINDEX may help by finding the position of the first number (or special character, if applicable) in the string. You can then construct a substring of the matching portions of the strings:
select table1.col as col1,
table2.col as col2
from table1
inner join
table2
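-- the pattern needs % wildcards; starting at 0 keeps only the characters before the first digit
on substring( table1.col, 0, patindex( '%[0-9]%', table1.col ) ) =
substring( table2.col, 0, patindex( '%[0-9]%', table2.col ) )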

This is a modification of Alex's answer, just to handle the case where one or both values do not contain a digit:
select t1.col as col1, t2.col as col2
from table1 t1 inner join
table2 t2
-- subtract 1 so the first digit itself is not part of the compared prefix
on left(t1.col, patindex('%[0-9]%', t1.col+'0') - 1) = left(t2.col, patindex('%[0-9]%', t2.col+'0') - 1);
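To sanity-check the expected pairs outside the database, the "prefix before the first digit" rule can be sketched in Python. This only illustrates the matching logic and is not T-SQL; the helper names `prefix_before_digit` and `join_on_prefix` are made up for this example.

```python
import re

def prefix_before_digit(value):
    """Return the part of value before its first digit (the whole value if it has none)."""
    match = re.search(r"[0-9]", value)
    return value if match is None else value[:match.start()]

def join_on_prefix(left_values, right_values):
    """Pair every left value with every right value that has the same prefix."""
    return [
        (left, right)
        for left in left_values
        for right in right_values
        if prefix_before_digit(left) == prefix_before_digit(right)
    ]

# Sample values from the question
table1 = ["QWERTY10", "QAZWSXEDCR10", "QAZWSXED1230"]
table2 = ["QWERTY20", "QAZWSXEDCR20", "QAZWSXED1240"]

pairs = join_on_prefix(table1, table2)
for left, right in pairs:
    print(left, right)
```

With the sample values, each value in table1 pairs with exactly one value in table2, because the three prefixes (QWERTY, QAZWSXEDCR, QAZWSXED) are all different.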

I think this will help:
Create table #table1 ( strValue varchar(100) )
Create table #table2 ( strValue varchar(100) )
Insert Into #table1 ( strValue ) Values
('QWERTY10'), ('QAZWSXEDCR10'),('QAZWSXED1230')
Insert Into #table2 ( strValue ) Values
('QWERTY20'), ('QAZWSXEDCR20'),('QAZWSXED1240')
Declare @MaxlengthT1 int, @MaxlengthT2 int
SELECT @MaxlengthT1 = MAX(LEN(strValue)) FROM #table1
SELECT @MaxlengthT2 = MAX(LEN(strValue)) FROM #table2
select a.strValue + REPLICATE('0',@MaxlengthT1 - LEN(a.strValue)) as col1,
b.strValue + REPLICATE('0',@MaxlengthT1 - LEN(b.strValue)) as col2
from #table1 a
inner join
#table2 b
on substring( a.strValue, 0, patindex( '%[0-9]%', a.strValue )) =
substring( b.strValue, 0, patindex( '%[0-9]%', b.strValue ))
DROP TABLE #table1
DROP TABLE #table2

Related

Combine multiple rows into one by coalescing one column's value as CSV from two tables

I'll divide this into three parts:
What I have:
I have two tables Table1 and Table2.
Table1
ObjectName | Status
A | Active
C | Active
Table2
ParentObjectType | ChildObjectType
X | A
Y | C
Z | A
M | C
What I want:
I want to write a stored procedure that gives a result that looks something like this:
ObjectName | Status | ParentObjectName
A | Active | X, Z
C | Active | Y, M
What I have tried: I tried using the STUFF function and I'm getting a weird result.
Here's the query:
SELECT
ObjectName,
Status,
STUFF((SELECT '; ' + table2.ParentObjectType
FROM table1
INNER JOIN table2 ON table1.[ObjectName] = table2.[ChildObjectType]
FOR XML PATH('')), 1, 1, '') [ParentObjectName]
FROM
table1
Output
ObjectName | Status | ParentObjectName
A | Active | X, Z, Y, M
C | Active | X, Z, Y, M
Any help here is highly appreciated as I'm light handed on SQL and this is driving me nuts!
You are missing a WHERE condition in your subquery that ties it to the row of the outer table.
Also, I assume this is a typo: in Table2 you have the column ChildObjectType, but in your link you are joining on table2.[ChildObjectName].
SELECT
ObjectName,
Status,
-- the separator ', ' is two characters long, so STUFF removes the first two
STUFF((SELECT ', ' + table2.ParentObjectType
FROM table1
INNER JOIN table2 ON table1.[ObjectName] = table2.[ChildObjectName]
WHERE Table1.ObjectName = src.ObjectName
FOR XML PATH('')), 1, 2, '') [ParentObjectName]
FROM
table1 src
Note: STRING_AGG is available starting with SQL Server 2017 (14.x).
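To see what the correlated subquery is meant to produce, here is a rough Python sketch of the same grouping, using the sample rows from the question. It is only an illustration of the expected result, not a replacement for the SQL; the function name `parents_by_child` is hypothetical.

```python
from collections import defaultdict

def parents_by_child(links):
    """Collect the parent names of each child into one comma-separated string."""
    grouped = defaultdict(list)
    for parent, child in links:
        grouped[child].append(parent)
    return {child: ", ".join(parents) for child, parents in grouped.items()}

# Sample rows from the question: (ObjectName, Status) and (ParentObjectType, ChildObjectType)
table1 = [("A", "Active"), ("C", "Active")]
table2 = [("X", "A"), ("Y", "C"), ("Z", "A"), ("M", "C")]

csv_by_child = parents_by_child(table2)

# One output row per Table1 row, carrying only the parents of that object
result = [(name, status, csv_by_child.get(name, "")) for name, status in table1]
for row in result:
    print(row)
```

The key point is the same as in the SQL fix: each object only collects the parents linked to it, which is what the WHERE condition inside the subquery enforces.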
This helped me realize I didn't have this saved in my snippets, thanks! Be careful: FOR XML PATH returns XML-encoded text, so "&" becomes "&amp;". The example below shows that you can add , TYPE to your FOR XML statement; this returns an xml data type, from which you can query the text with value('.', ...).
I personally tend to favor subqueries below the FROM, so this also shows an alternative style for joining the data, via a WHERE clause inside the APPLY reference:
DECLARE @tt1 TABLE ( ObjectName VARCHAR(10), StatusValue VARCHAR(20) )
INSERT INTO @tt1
SELECT 'A','Active'
UNION ALL SELECT 'C','Active'
UNION ALL SELECT 'D&E','Active'
DECLARE @tt2 TABLE ( A VARCHAR(100), B VARCHAR(100) )
INSERT INTO @tt2 SELECT 'X','A'
INSERT INTO @tt2 SELECT 'Y','C'
INSERT INTO @tt2 SELECT 'Z','A'
INSERT INTO @tt2 SELECT 'M','C'
INSERT INTO @tt2 SELECT 'E&F','D&E' --sample "&" that should NOT render "&amp;"
INSERT INTO @tt2 SELECT '"G"','D&E'
INSERT INTO @tt2 SELECT 'F>G','C' --sample ">" that should NOT render "&gt;"
SELECT
tt1.*,
f1.*
FROM
(SELECT ObjectName,StatusValue FROM @tt1) tt1
OUTER APPLY (SELECT
COALESCE(STUFF(
(SELECT ',' + CAST(tt2.A AS VARCHAR(10))
FROM
@tt2 tt2 WHERE tt2.B = tt1.ObjectName FOR XML PATH(''), TYPE ).value('.','nvarchar(max)'), 1,1,''),'') [csv1] ) f1
I'm assuming that you are on a SQL server version that does not have string aggregating functions?
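The XML-encoding caveat from this answer can be reproduced outside SQL Server. The Python sketch below uses the standard library's `xml.sax.saxutils` as a stand-in for what FOR XML PATH does to markup characters; it is an analogy under that assumption, not the server's actual implementation.

```python
from xml.sax.saxutils import escape, unescape

# Two of the sample values from the answer that contain markup characters
values = ["E&F", "F>G"]

# Roughly what plain FOR XML PATH('') hands back: markup characters are entity-encoded
encoded = ",".join(escape(value) for value in values)

# Roughly what adding TYPE and calling .value('.', ...) recovers: the original text
decoded = unescape(encoded)

print(encoded)  # E&amp;F,F&gt;G
print(decoded)  # E&F,F>G
```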

Split a column with comma delimiter

I have a table with 3 columns with the data given below.
ID | Col1 | Col2 | Status
1 | 8007590006 | 8002240001,8002170828 | I
2 | 8002170828 | 8002000004 | I
3 | 8002000001 | 8002240001 | I
4 | 8769879809 | 8002000001 | I
5 | 8769879809 | 8002000001 | I
Col2 can contain multiple comma delimited values. I need to update status to C if there is a value in col2 that is also present in col1.
For example, for ID = 1, col2 contains 8002170828 which is present in Col1, ID = 2. So, status = 'C'
From what I tried, I know it won't work where there are multiple values as I need to split that data and get individual values and then apply update.
UPDATE Table1
SET STATUS = 'C'
WHERE Col1 IN (SELECT Col2 FROM Table1)
If you are using SQL Server 2016 or later, then STRING_SPLIT comes in handy:
WITH cte AS (
SELECT ID, Col1, value AS Col2
FROM Table1
CROSS APPLY STRING_SPLIT(Col2, ',')
)
UPDATE t1
SET Status = 'C'
FROM Table1 t1
INNER JOIN cte t2
ON t1.Col1 = t2.Col2;
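As a cross-check of which rows the update above should touch, here is a small Python sketch of the same idea: split every Col2 value on commas, then mark each row whose Col1 appears among the split values. The rows are the sample data from the question; the function name `mark_referenced` is made up for this illustration.

```python
def mark_referenced(rows):
    """Set Status to 'C' for rows whose Col1 appears in any (split) Col2 value."""
    referenced = {
        part.strip()
        for row in rows
        for part in row["Col2"].split(",")
    }
    for row in rows:
        if row["Col1"] in referenced:
            row["Status"] = "C"
    return rows

rows = [
    {"ID": 1, "Col1": "8007590006", "Col2": "8002240001,8002170828", "Status": "I"},
    {"ID": 2, "Col1": "8002170828", "Col2": "8002000004", "Status": "I"},
    {"ID": 3, "Col1": "8002000001", "Col2": "8002240001", "Status": "I"},
    {"ID": 4, "Col1": "8769879809", "Col2": "8002000001", "Status": "I"},
    {"ID": 5, "Col1": "8769879809", "Col2": "8002000001", "Status": "I"},
]

statuses = [(row["ID"], row["Status"]) for row in mark_referenced(rows)]
print(statuses)
```

Under this reading, rows 2 and 3 are updated, because their Col1 values occur in the Col2 lists of other rows.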
This answer is intended as a supplement to Tim's answer
As you don't have the native string split that came in 2016 we can make one:
CREATE FUNCTION dbo.STRING_SPLIT
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT y.i.value('(./text())[1]', 'nvarchar(4000)') as value
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
--credits to sqlperformance.com for the majority of this code - https://sqlperformance.com/2012/07/t-sql-queries/split-strings
Now Tim's answer should work for you, so I won't repeat it here.
I chose an XML-based approach because it performs well, and your data seems sane and won't contain any XML characters. If it ever does contain XML characters such as >, which would break the parsing, they should be escaped before the split and unescaped afterwards.
If you aren't allowed to create functions, you can extract everything between the RETURN and the GO, insert it into Tim's query, tweak the variable names to be column names, and it will still work.

How to compare two tables with one table having a temporary column name

I have two tables, both with columns ID_Number and version. However, the format of the ID_number is different. My goal is to compare the two tables and output any rows with the same ID_number but different versions.
The format difference is as follows:
Eg. ID number in first table is "23-4567".
The corresponding ID number in the second table is "1023004567".
"10" is added before the "23" and "-" is replaced with "00". It's the same for all the IDs.
First, I used SUBSTRING to make the ID_Number from table1 match the ID_Number from table2, named the new column newIDNumber, and compared table1's newIDNumber with table2's ID_Number. The code to convert ID_Number in table1 for the above example is:
Eg. '10' + SUBSTRING(@ID_number, 1, 2) + '00' + SUBSTRING(@ID_number, 4,4) AS newIDNumber
Then I wrote the following code to check the version difference:
SELECT ID_Number, version, "10" + SUBSTRING(@ID_number, 1, 2) + "00" + SUBSTRING(@ID_number, 4,4) (From first table) AS newIDNumber
FROM table1
WHERE (NOT EXISTS
(SELECT ID_Number, version
FROM table2
WHERE (table1.newIDNumber= table2.ID_Number) AND (table1.version = table2.version)
)
)
It outputs an error saying "Unknown column 'table1.newIDNumber' in 'where clause'". How am I able to do compare without disrupting the database (inserting newIDNumber column to table1)?
Would "declare newIDNumber" work?
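SELECT
ID_Number
, version
FROM
table1
WHERE
NOT EXISTS
(
SELECT 1
FROM table2
WHERE
-- CONCAT avoids "+" acting as numeric addition (MySQL) and double quotes being read as identifiers (SQL Server)
CONCAT('10', SUBSTRING(table1.ID_number, 1, 2),
'00', SUBSTRING(table1.ID_number, 4, 4)) = table2.ID_Number
AND table1.version = table2.version
)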
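The ID conversion described in the question ("10" in front, "-" replaced by "00") can be checked quickly in Python. This is a minimal sketch of the stated goal (same ID, different version); only the pair "23-4567" / "1023004567" comes from the question, while the other rows, the version numbers and the helper name `to_table2_id` are invented for illustration.

```python
def to_table2_id(table1_id):
    """Convert an ID such as '23-4567' into the table2 format '1023004567'."""
    prefix, suffix = table1_id.split("-")
    return "10" + prefix + "00" + suffix

# Hypothetical (ID_Number, version) rows; only the first ID pair is from the question
table1 = [("23-4567", 1), ("23-4568", 2)]
table2 = [("1023004567", 1), ("1023004568", 3)]

versions_in_table2 = dict(table2)

# Rows whose converted ID exists in table2 but with a different version
mismatches = []
for id_number, version in table1:
    other_version = versions_in_table2.get(to_table2_id(id_number))
    if other_version is not None and other_version != version:
        mismatches.append((id_number, version, other_version))

print(mismatches)
```

Unlike a plain NOT EXISTS on ID and version together, this sketch skips IDs that have no counterpart in table2 at all, which seems closer to "same ID_number but different versions".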
You should provide some sample data.
Try this:
declare @t table(col1 varchar(50))
insert into @t values('23-4567')
declare @t1 table(col1 varchar(50))
insert into @t1 values('1023004567')
;With CTE as
(
select t.col1
from @t t
inner join @t1 t1
on substring(t.col1,1,charindex('-',t.col1)-1)=
substring(t1.col1,charindex('10',t1.col1)+2,charindex('00',t1.col1)-3)
where NOT EXISTS(
select * from @t1 t1
where substring(t.col1,charindex('-',t.col1)+1,len(col1))
=substring(t1.col1,charindex('00',t1.col1)+2,len(col1))
)
)
select * from cte t

How to check if first five characters of one field match another?

Assuming I have the following table:
AAAAAA
AAAAAB
CCCCCC
How could I craft a query that would let me know that AAAAAA and AAAAAB are similar (as they share five characters in a row)? Ideally I would like to write this as a query that checks whether the two fields share five characters in a row anywhere in the string, but this seems outside the scope of SQL and something I should write in a C# application?
Ideally the query would add another column that displays: Similar to 'AAAAAA', 'AAAAAB'
I suggest you do not try to violate 1NF by introducing a multi-valued attribute.
Noting that SUBSTRING is highly portable:
WITH T
AS
(
SELECT *
FROM (
VALUES ('AAAAAA'),
('AAAAAB'),
('CCCCCC')
) AS T (data_col)
)
SELECT T1.data_col,
T2.data_col AS data_col_similar_to
FROM T AS T1, T AS T2
WHERE T1.data_col < T2.data_col
AND SUBSTRING(T1.data_col, 1, 5)
= SUBSTRING(T2.data_col, 1, 5);
Alternatively:
T1.data_col LIKE SUBSTRING(T2.data_col, 1, 5) + '%';
This will find all matches, including those in the middle of the word. It will not perform well on a big table.
declare @t table(a varchar(20))
insert @t select 'AAAAAA'
insert @t select 'AAAAAB'
insert @t select 'CCCCCC'
insert @t select 'ABCCCCC'
insert @t select 'DDD'
declare @compare smallint = 5
;with cte as
(
select a, left(a, @compare) suba, 1 h
from @t
union all
select a, substring(a, h + 1, @compare), h+1
from cte where cte.h + @compare <= len(a)
)
select t.a, cte.a match from @t t
-- if you don't want the null matches, remove the 'left' from this join
left join cte on charindex(suba, t.a) > 0 and t.a <> cte.a
group by t.a, cte.a
Result:
a match
-------------------- ------
AAAAAA AAAAAB
AAAAAB AAAAAA
ABCCCCC CCCCCC
CCCCCC ABCCCCC
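Since the question also asks whether this belongs in application code, here is a rough Python sketch of the same "five characters in a row, anywhere in the string" check. It uses the sample values from the answer above; the function name `shares_run` is made up, and values shorter than the window (such as 'DDD') are simply treated as matching nothing.

```python
from itertools import combinations

def shares_run(a, b, length=5):
    """True if a and b have a common substring of exactly `length` characters."""
    windows = {a[i:i + length] for i in range(len(a) - length + 1)}
    return any(window in b for window in windows)

values = ["AAAAAA", "AAAAAB", "CCCCCC", "ABCCCCC", "DDD"]

# Each unordered pair is checked once
similar_pairs = [(a, b) for a, b in combinations(values, 2) if shares_run(a, b)]
for a, b in similar_pairs:
    print(a, b)
```

Like the recursive CTE, this compares every pair of values, so it is only practical for small inputs.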
You can use left to compare the first five characters and you can use for xml path to concatenate the similar strings to one column.
declare @T table
(
ID int identity primary key,
Col varchar(10)
)
insert into @T values
('AAAAAA'),
('AAAAAB'),
('AAAAAC'),
('CCCCCC')
select Col,
stuff((select ','+T2.Col
from @T as T2
where left(T1.Col, 5) = left(T2.Col, 5) and
T1.ID <> T2.ID
for xml path(''), type).value('.', 'varchar(max)'), 1, 1, '') as Similar
from @T as T1
Result:
Col Similar
---------- -------------------------
AAAAAA AAAAAB,AAAAAC
AAAAAB AAAAAA,AAAAAC
AAAAAC AAAAAA,AAAAAB
CCCCCC NULL

How to make 2 rows into single row in sql

I have a query
example
Title Description
A XYZ
A ABC
Now I want a SQL query that returns a single row.
Output :
Title Description
A XYZ | ABC
Declare @tbl table(Title nvarchar(1),[Description] nvarchar(100))
Insert into @tbl values('A','XYZ');
Insert into @tbl values('A','ABC');
Insert into @tbl values('A','PQR');
DECLARE @CSVList varchar(100)
SELECT @CSVList = COALESCE(@CSVList + ' | ', '') +
[Description]
FROM @tbl
WHERE Title='A'
SELECT @CSVList
declare @table table (i int, a varchar(10))
insert into @table
select 1, 'ABC' union all
select 1, 'XYZ' union all
select 2, '123'
select t.i,
max(stuff(d.i, 1, 1, '')) [iList]
from @table t
cross
apply ( select '|' + a
from @table [tt]
where t.i = tt.i
for xml path('')
) as d(i)
group
by t.i;
In MySQL there is a GROUP_CONCAT function that can help you.
Use it like this:
SELECT Title,GROUP_CONCAT(Description) FROM table_name GROUP BY Title
The output will be
Title Description
A XYZ,ABC
Then you can replace "," with "|" if you want (this can be done with the REPLACE function).
For 2 rows you can self join in SQL Server. This avoids the assorted "concatenate rows into a column" tricks. You can use a LEFT JOIN and NULL handling too for 1 or 2 rows
SELECT
T1.Title,
T1.Description + '|' + T2.Description
FROM
MyTable T1
JOIN
MyTable T2 ON T1.Title = T2.Title
SELECT
T1.Title,
T1.Description + ISNULL('|' + T2.Description, '') -- (COALESCE for the pedants)
FROM
MyTable T1
LEFT JOIN
MyTable T2 ON T1.Title = T2.Title
If you are using SQL Server, try this: How to return 1 single row data from 2 different tables with dynamic contents in sql