Find Last Record in Chain - a Customer Merge Process - sql

I am importing customer data from another vendor's system, and we have merge processes that identify potential duplicate customer accounts and then merge them if they meet certain criteria, such as same first name, last name, SSN and DOB. In this process, I am seeing that we are creating chains: for instance, Customer A is merged to Customer B, who is then merged to Customer C.
What I am hoping to do is identify these chains and update each customer record to point to the last record in the chain. So in my example above, Customer A and Customer B would both have Customer C's ID in their MergedTo field.
CustID FName LName CustStatusType isMerged MergedTo
1 Kevin Smith M 1 2
2 Kevin Smith M 1 3
3 Kevin Smith M 1 4
4 Kevin Smith O 0 NULL
5 Mary Jones O 0 NULL
6 Wyatt Earp M 1 7
7 Wyatt Earp O 1 NULL
8 Bruce Wayn M 1 10
9 Brice Wayne M 1 10
10 Bruce Wane M 1 11
11 Bruce Wayne O 1 NULL
CustStatusType indicates if the customer account is open ("O") or merged ("M"). And then we have an isMerged field as a BIT field that indicates whether the account has been merged and finally the MergedTo field that indicates what customer account the record was merged to.
With the example provided, what I would like to achieve is to have CustIDs 1 and 2 have their MergedTo set to 4, while CustID 3, which already points to 4, could either be updated or left as is. CustIDs 4 through 7 are fine and do not need to be updated. CustIDs 8 through 10 should be set to 11, as in the table below.
CustID FName LName CustStatusType isMerged MergedTo
1 Kevin Smith M 1 4
2 Kevin Smith M 1 4
3 Kevin Smith M 1 4
4 Kevin Smith O 0 NULL
5 Mary Jones O 0 NULL
6 Wyatt Earp M 1 7
7 Wyatt Earp O 1 NULL
8 Bruce Wayn M 1 11
9 Brice Wayne M 1 11
10 Bruce Wane M 1 11
11 Bruce Wayne O 1 NULL
I haven't been able to figure out how to achieve this with TSQL - suggestions?
Test Data:
DROP TABLE IF EXISTS #Customers;
CREATE TABLE #Customers
(
CustomerID INT ,
FirstName VARCHAR (25) ,
LastName VARCHAR (25) ,
CustomerStatusTypeID VARCHAR (1) ,
isMerged BIT ,
MergedTo INT
);
INSERT INTO #Customers
VALUES ( 1, 'Kevin', 'Smith', 'M', 1, 2 ) ,
( 2, 'Kevin', 'Smith', 'M', 1, 3 ) ,
( 3, 'Kevin', 'Smith', 'M', 1, 4 ) ,
( 4, 'Kevin', 'Smith', 'O', 0, NULL ) ,
( 5, 'Mary', 'Jones', 'O', 0, NULL ) ,
( 6, 'Wyatt', 'Earp', 'M', 1, 7 ) ,
( 7, 'Wyatt', 'Earp', 'O', 1, NULL ) ,
( 8, 'Bruce', 'Wayn', 'M', 1, 10 ) ,
( 9, 'Brice', 'Wayne', 'M', 1, 10 ) ,
( 10, 'Bruce', 'Wane', 'M', 1, 11 ) ,
( 11, 'Bruce', 'Wayne', 'O', 1, NULL );
SELECT *
FROM #Customers;
DROP TABLE #Customers;

For your example soundex() seems good enough. It returns a code, that is based on the word's pronunciation in English. Use it on the first and last name to join the customer table and a subquery which queries the customer table adding the row_number() partitioned by the Soundex of the names and order descending by the ID -- to number the "latest" record with 1. For the join condition use the Soundex of the names, a row number of 1 and of course inequality of the IDs.
UPDATE c1
SET c1.mergedto = x.customerid
FROM #customers c1
LEFT JOIN (SELECT c2.customerid,
soundex(c2.firstname) sefn,
soundex(c2.lastname) seln,
row_number() OVER (PARTITION BY soundex(c2.firstname),
soundex(c2.lastname)
ORDER BY c2.customerid DESC) rn
FROM #customers c2) x
ON x.sefn = soundex(c1.firstname)
AND x.seln = soundex(c1.lastname)
AND x.rn = 1
AND x.customerid <> c1.customerid;
I don't really get the concept behind the customerstatustypeid and ismerged columns. As I understand it, they're both derived from whether mergedto is null or not, but neither the sample data nor the expected result supports that. Since these columns apparently don't change between your sample input and output, I guess it's all right that I just left them alone.
If Soundex proves to be insufficient for your needs you may want to look for other string distance metrics, like the Levenshtein distance. AFAIK there's no implementation of that included in SQL Server, but search engines may turn up implementations by third parties, or maybe there's something that can be used via CLR. Or you roll your own, of course.
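If you do roll your own, the classic dynamic-programming form of the Levenshtein distance is short. Below is a minimal sketch in Python rather than T-SQL, only to illustrate the metric. The function name `levenshtein` is made up for this example, and how you would expose it to SQL Server (CLR, a scalar function, or preprocessing outside the database) is left open.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


# The misspelled names in the sample data are each one edit away from the clean value.
print(levenshtein("Wayn", "Wayne"), levenshtein("Wane", "Wayne"), levenshtein("Brice", "Bruce"))
```

A small threshold (for example, distance 1 on both names) would then play the role that equal Soundex codes play in the query above; the right threshold is a judgment call for your data.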

The query below finds the latest CustomerID that matches each customer and returns that ID in the Ref column:
select *
, Ref = (select top 1 CustomerID from #Customers where soundex(FirstName) = soundex(ma.FirstName) and soundex(LastName) = soundex(ma.LastName) order by CustomerID desc)
from #Customers ma
Using the update below, you can update the MergedTo column:
;with ct as (
select *
, Ref = (select top 1 CustomerID from #Customers where soundex(FirstName) = soundex(ma.FirstName) and soundex(LastName) = soundex(ma.LastName) order by CustomerID desc)
from #Customers ma
)
update c1
set c1.MergedTo = iif(c1.CustomerID = ct.Ref, null, ct.Ref)
from #Customers c1
inner join ct on ct.CustomerID = c1.CustomerID

Recursion can be used for this:
WITH CTE as
(
SELECT P.CustomerID, P.MergedTo, CAST(P.CustomerID AS VarChar(Max)) as Levels
FROM #Customers P
WHERE P.MergedTo IS NULL
UNION ALL
SELECT P1.CustomerID, P1.MergedTo, M.Levels + ', ' + CAST(P1.CustomerID AS VarChar(Max))
FROM #Customers P1
INNER JOIN CTE M ON M.CustomerID = P1.MergedTo
)
SELECT
CustomerID
, MergedTo
, x -- "end of chain"
, Levels
FROM CTE
CROSS APPLY (
SELECT LEFT(levels,charindex(',',levels+',')-1) x
) a
WHERE MergedTo IS NOT NULL
Result:
+----+------------+----------+----+------------+
| | CustomerID | MergedTo | x | levels |
+----+------------+----------+----+------------+
| 1 | 10 | 11 | 11 | 11, 10 |
| 2 | 8 | 10 | 11 | 11, 10, 8 |
| 3 | 9 | 10 | 11 | 11, 10, 9 |
| 4 | 6 | 7 | 7 | 7, 6 |
| 5 | 3 | 4 | 4 | 4, 3 |
| 6 | 2 | 3 | 4 | 4, 3, 2 |
| 7 | 1 | 2 | 4 | 4, 3, 2, 1 |
+----+------------+----------+----+------------+
Note the string levels is formed by the recursion, and in the manner this is concatenated the first part will be the "end of chain" (see column x). That first part is extracted using a cross apply although using an apply isn't essential.
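As a variation on the same recursion, the end of the chain can be carried down as its own column instead of being parsed out of the levels string. The sketch below runs against SQLite through Python's sqlite3 module purely so that it is self-contained; it is not T-SQL (SQL Server does not use the RECURSIVE keyword), and the names `Chain`, `FinalID` and `end_of_chain` are invented for this example.

```python
import sqlite3

# (CustomerID, MergedTo) pairs from the question's test data.
ROWS = [(1, 2), (2, 3), (3, 4), (4, None), (5, None), (6, 7),
        (7, None), (8, 10), (9, 10), (10, 11), (11, None)]


def end_of_chain(rows):
    """Return {CustomerID: last CustomerID in its merge chain} for merged rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, MergedTo INTEGER)")
    conn.executemany("INSERT INTO Customers VALUES (?, ?)", rows)
    query = """
        WITH RECURSIVE Chain (CustomerID, FinalID) AS (
            -- Anchor: an unmerged row is the end of its own chain.
            SELECT CustomerID, CustomerID FROM Customers WHERE MergedTo IS NULL
            UNION ALL
            -- Step: a row merged into an already resolved row inherits its FinalID.
            SELECT c.CustomerID, ch.FinalID
            FROM Customers AS c
            JOIN Chain AS ch ON c.MergedTo = ch.CustomerID
        )
        SELECT CustomerID, FinalID FROM Chain WHERE CustomerID <> FinalID
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


final_ids = end_of_chain(ROWS)
print(final_ids)
```

With the sample data this maps 1, 2 and 3 to 4, 6 to 7, and 8, 9 and 10 to 11. In T-SQL the same CTE could presumably be joined to #Customers on CustomerID in an UPDATE to write FinalID into MergedTo.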

Related

Displaying whole table after stripping characters in SQL Server

This question has 2 parts.
Part 1
I have a table "Groups":
group_ID person
-----------------------
1 Person 10
2 Person 11
3 Jack
4 Person 12
Note that not all data in the "person" column have the same format.
In SQL Server, I have used the following query to strip the "Person " characters out of the person column:
SELECT
REPLACE([person],'Person ','')
AS [person]
FROM Groups
I did not use UPDATE in the query above as I do not want to alter the data in the table.
The query returned this result:
person
------
10
11
12
However, I would like this result instead:
group_ID person
-------------------
1 10
2 11
3 Jack
4 12
What should be my query to achieve this result?
Part 2
I have another table "Details":
detail_ID group1 group2
-------------------------------
100 1 2
101 3 4
From the intended result in Part 1, where the numbers in the "person" column correspond to those in "group1" and "group2" of table "Details", how do I selectively convert the numbers in "person" to integers and join them with "Details"?
Note that all data under "person" in Part 1 are strings (nvarchar(100)).
Here is the intended query output:
detail_ID group1 group2
-------------------------------
100 10 11
101 Jack 12
Note that I do not wish to permanently alter anything in both tables and the intended output above is just a result of a SELECT query.
I don't think first part will be a problem here. Your query is working fine with your expected result.
Schema:
CREATE TABLE #Groups (group_ID INT, person VARCHAR(50));
INSERT INTO #Groups
SELECT 1,'Person 10'
UNION ALL
SELECT 2,'Person 11'
UNION ALL
SELECT 3,'Jack'
UNION ALL
SELECT 4,'Person 12';
CREATE TABLE #Details(detail_ID INT,group1 INT, group2 INT);
INSERT INTO #Details
SELECT 100, 1, 2
UNION ALL
SELECT 101, 3, 4 ;
Part 1:
Your query already does the stripping; it just needs group_ID in the select list to give the result you expect:
SELECT group_ID,REPLACE([person],'Person ','') AS person
FROM #Groups
+----------+--------+
| group_ID | person |
+----------+--------+
| 1 | 10 |
| 2 | 11 |
| 3 | Jack |
| 4 | 12 |
+----------+--------+
Part 2:
;WITH CTE AS(
SELECT group_ID
,REPLACE([person],'Person ','') AS person
FROM #Groups
)
SELECT D.detail_ID, G1.person, G2.person
FROM #Details D
INNER JOIN CTE G1 ON D.group1 = G1.group_ID
INNER JOIN CTE G2 ON D.group2 = G2.group_ID
Result:
+-----------+--------+--------+
| detail_ID | person | person |
+-----------+--------+--------+
| 100 | 10 | 11 |
| 101 | Jack | 12 |
+-----------+--------+--------+
Try following query, it should give you the desired output.
;WITH MT AS
(
SELECT
group_ID, REPLACE([person],'Person ','') AS [person]
FROM Groups
)
SELECT D.detail_ID, MT1.person AS group1, MT2.person AS group2
FROM
Details D
INNER JOIN MT MT1 ON MT1.group_ID = D.group1
INNER JOIN MT MT2 ON MT2.group_ID = D.group2
The first query works
declare @T table (id int primary key, name varchar(10));
insert into @T values
(1, 'Person 10')
, (2, 'Person 11')
, (3, 'Jack')
, (4, 'Person 12');
declare @G table (id int primary key, grp1 int, grp2 int);
insert into @G values
(100, 1, 2)
, (101, 3, 4);
with cte as
( select t.id, t.name, ltrim(rtrim(replace(t.name, 'person', ''))) as sp
from @T t
)
-- select * from cte order by cte.id;
select g.id, c1.sp as grp1, c2.sp as grp2
from @G g
join cte c1
on c1.id = g.grp1
join cte c2
on c2.id = g.grp2
order
by g.id;
id grp1 grp2
----------- ----------- -----------
100 10 11
101 Jack 12

Recursive CTE with three tables

I'm using SQL Server 2008 R2 SP1.
I would like to recursively find the first non-null manager for a certain organizational unit by "walking up the tree".
I have one table containing organizational units, "ORG"; one table containing the parent of each org. unit in "ORG", let's call that table "ORG_PARENTS"; and one table containing the manager of each organizational unit, let's call that table "ORG_MANAGERS".
ORG has a column ORG_ID:
ORG_ID
1
2
3
ORG_PARENTS has two columns.
ORG_ID, ORG_PARENT
1, NULL
2, 1
3, 2
MANAGERS has two columns.
ORG_ID, MANAGER
1, John Doe
2, Jane Doe
3, NULL
I'm trying to create a recursive query that will find the first non-null manager for a certain organizational unit.
Basically if I do a query today for the manager for ORG_ID=3 I will get NULL.
SELECT MANAGER FROM ORG_MANAGERS WHERE ORG_ID = '3'
I want the query to use the ORG_PARENTS table to get the parent for ORG_ID=3, in this case get "2" and repeat the query against the ORG_MANAGERS table with ORG_ID=2 and return in this example "Jane Doe".
In case the query also returns NULL I want to repeat the process with the parent of ORG_ID=2, i.e. ORG_ID=1 and so on.
My CTE attempts so far have failed, one example is this:
WITH BOSS (MANAGER, ORG_ID, ORG_PARENT)
AS
( SELECT m.MANAGER, m.ORG_ID, p.ORG_PARENT
FROM dbo.MANAGERS m INNER JOIN
dbo.ORG_PARENTS p ON p.ORG_ID = m.ORG_ID
UNION ALL
SELECT m1.MANAGER, m1.ORG_ID, b.ORG_PARENT
FROM BOSS b
INNER JOIN dbo.MANAGERS m1 ON m1.ORG_ID = b.ORG_PARENT
)
SELECT * FROM BOSS WHERE ORG_ID = 3
It returns:
Msg 530, Level 16, State 1, Line 4
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
MANAGER ORG_ID ORG_PARENT
NULL 3 2
You need to keep track of the original ID you start with. Try this:
DECLARE @ORG_PARENTS TABLE (ORG_ID INT, ORG_PARENT INT )
DECLARE @MANAGERS TABLE (ORG_ID INT, MANAGER VARCHAR(100))
INSERT @ORG_PARENTS (ORG_ID, ORG_PARENT)
VALUES (1, NULL)
, (2, 1)
, (3, 2)
INSERT @MANAGERS (ORG_ID, MANAGER)
VALUES (1, 'John Doe')
, (2, 'Jane Doe')
, (3, NULL)
;
WITH BOSS
AS
(
SELECT m.MANAGER, m.ORG_ID AS ORI, m.ORG_ID, p.ORG_PARENT, 1 cnt
FROM @MANAGERS m
INNER JOIN @ORG_PARENTS p
ON p.ORG_ID = m.ORG_ID
UNION ALL
SELECT m1.MANAGER, b.ORI, m1.ORG_ID, OP.ORG_PARENT, cnt +1
FROM BOSS b
INNER JOIN @ORG_PARENTS AS OP
ON OP.ORG_ID = b.ORG_PARENT
INNER JOIN @MANAGERS m1
ON m1.ORG_ID = OP.ORG_ID
)
SELECT *
FROM BOSS
WHERE ORI = 3
Results in:
+----------+-----+--------+------------+-----+
| MANAGER | ORI | ORG_ID | ORG_PARENT | cnt |
+----------+-----+--------+------------+-----+
| NULL | 3 | 3 | 2 | 1 |
| Jane Doe | 3 | 2 | 1 | 2 |
| John Doe | 3 | 1 | NULL | 3 |
+----------+-----+--------+------------+-----+
General tips:
Don't predefine the columns of a CTE; it's not necessary, and makes maintenance annoying.
With a recursive CTE, always keep a counter so you can limit the recursion depth and keep track of how deep you are.
edit:
By the way, if you want the first not null manager, you can do for example (there are many ways) this:
SELECT BOSS.*
FROM BOSS
INNER JOIN (
SELECT BOSS.ORI
, MIN(BOSS.cnt) cnt
FROM BOSS
WHERE BOSS.MANAGER IS NOT NULL
GROUP BY BOSS.ORI
) X
ON X.ORI = BOSS.ORI
AND X.cnt = BOSS.cnt
WHERE BOSS.ORI IN (3)
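Another way to get the first non-null manager is to stop climbing as soon as one is found, rather than walking the whole tree and filtering afterwards. The sketch below uses SQLite through Python's sqlite3 module so it can run on its own; it is not SQL Server syntax, and the function name `first_manager` is hypothetical.

```python
import sqlite3


def first_manager(org_id):
    """Walk up ORG_PARENTS from org_id and return the first non-null MANAGER."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ORG_PARENTS (ORG_ID INTEGER, ORG_PARENT INTEGER);
        CREATE TABLE MANAGERS (ORG_ID INTEGER, MANAGER TEXT);
        INSERT INTO ORG_PARENTS VALUES (1, NULL), (2, 1), (3, 2);
        INSERT INTO MANAGERS VALUES (1, 'John Doe'), (2, 'Jane Doe'), (3, NULL);
    """)
    row = conn.execute("""
        WITH RECURSIVE BOSS (ORG_ID, ORG_PARENT, MANAGER, cnt) AS (
            -- Anchor: the unit we were asked about.
            SELECT p.ORG_ID, p.ORG_PARENT, m.MANAGER, 1
            FROM ORG_PARENTS AS p
            JOIN MANAGERS AS m ON m.ORG_ID = p.ORG_ID
            WHERE p.ORG_ID = ?
            UNION ALL
            -- Step: move to the parent, but only while no manager has been found.
            SELECT p.ORG_ID, p.ORG_PARENT, m.MANAGER, b.cnt + 1
            FROM BOSS AS b
            JOIN ORG_PARENTS AS p ON p.ORG_ID = b.ORG_PARENT
            JOIN MANAGERS AS m ON m.ORG_ID = p.ORG_ID
            WHERE b.MANAGER IS NULL
        )
        SELECT MANAGER FROM BOSS WHERE MANAGER IS NOT NULL ORDER BY cnt LIMIT 1
    """, (org_id,)).fetchone()
    conn.close()
    return row[0] if row else None


print(first_manager(3))
```

For ORG_ID 3 the anchor row has a NULL manager, so the recursion takes one step to ORG_ID 2 and stops there. The cnt column is kept in line with the tip above about tracking depth.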

Compare values in SQL using CASE, if match return and exit from CASE statement [closed]

Person table:
PersonID Name
1 David
2 Victor
Phone Number table:
PersonID Phonetype PhoneNumber
1 7 7735821547
1 6 7731245263
1 5 7731426587
1 4 7731243654
1 8 7731241478
1 2 7731423658
1 1 7731427485
2 1 7731547841
Priority to pull number by phonetype: 4 then 6 then 7 then 5 then 8 then 2 then 1
I want to pull one phone number for each person, but when I use a left join it returns multiple rows. I only want one number for each person.
Outputs should be:
PersonID Name PhoneNumber
1 David 7731243654
2 Victor 7731547841
I will use Postgres for this example, with row_number() and a CTE; if you have MySQL you will need a workaround.
You need a table priority
CREATE TABLE priority
("PhoneType" int, "Priority" int)
;
INSERT INTO priority
("PhoneType", "Priority")
VALUES
(7, 1), (6, 2),
(1, 3), (2, 4),
(3, 5), (4, 6),
(5, 7), (8, 8),
(9, 9) ;
Then assign a row number to each phone type according to priority
WITH cte as (
SELECT
p.*,
pr."Priority",
row_number() over (partition by "PersonID" ORDER BY "Priority") as rn
FROM person p
JOIN priority pr
ON p."PhoneType" = pr."PhoneType"
ORDER BY pr."Priority"
)
SELECT
c."PersonID",
c."PhoneType",
c."PhoneNumber",
CASE rn
WHEN 1 THEN 1
ELSE NULL
END as rn
FROM cte c
OUTPUT
| PersonID | PhoneType | PhoneNumber | rn |
|----------|-----------|-------------|--------|
| 1 | 7 | 7735487695 | 1 |
| 1 | 1 | 7731234569 | (null) |
| 1 | 5 | 7731547895 | (null) |
NOTE: I also changed type 6 => 5 in your sample to highlight even more how the priority is working
After your edit, a SQL Server version without a priority table:
With Priority as (
SELECT 7 as PhoneType, 1 as Priority UNION ALL
SELECT 6 as PhoneType, 2 as Priority UNION ALL
SELECT 1 as PhoneType, 3 as Priority UNION ALL
SELECT 2 as PhoneType, 4 as Priority UNION ALL
SELECT 3 as PhoneType, 5 as Priority UNION ALL
SELECT 4 as PhoneType, 6 as Priority UNION ALL
SELECT 5 as PhoneType, 7 as Priority UNION ALL
SELECT 8 as PhoneType, 8 as Priority UNION ALL
SELECT 9 as PhoneType, 9 as Priority
),
cte as (
SELECT
p.*,
pr.Priority,
row_number() over (partition by PersonID ORDER BY Priority) as rn
FROM Person p
JOIN Priority pr
ON p.PhoneType = pr.PhoneType
)
SELECT
c.PersonID,
c.PhoneType,
c.PhoneNumber
FROM cte c
WHERE rn = 1
OUTPUT
| PersonID | PhoneType | PhoneNumber |
|----------|-----------|-------------|
| 1 | 7 | 7735821547 |
| 2 | 1 | 7731547841 |
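The priority values used above are illustrative and do not follow the order stated in the question (4, then 6, 7, 5, 8, 2, 1). As a sketch of the same lookup idea with that order, the example below uses SQLite through Python's sqlite3 module so it is self-contained, and picks the minimum rank per person with a correlated subquery instead of row_number(). The table name `Phone` and the column `Rnk` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER, Name TEXT);
    CREATE TABLE Phone (PersonID INTEGER, PhoneType INTEGER, PhoneNumber TEXT);
    INSERT INTO Person VALUES (1, 'David'), (2, 'Victor');
    INSERT INTO Phone VALUES
        (1, 7, '7735821547'), (1, 6, '7731245263'), (1, 5, '7731426587'),
        (1, 4, '7731243654'), (1, 8, '7731241478'), (1, 2, '7731423658'),
        (1, 1, '7731427485'), (2, 1, '7731547841');
    -- Lower Rnk wins; order taken from the question: 4, 6, 7, 5, 8, 2, 1.
    CREATE TABLE Priority (PhoneType INTEGER, Rnk INTEGER);
    INSERT INTO Priority VALUES (4, 1), (6, 2), (7, 3), (5, 4), (8, 5), (2, 6), (1, 7);
""")

best_phones = conn.execute("""
    SELECT pe.PersonID, pe.Name, ph.PhoneNumber
    FROM Person AS pe
    JOIN Phone AS ph ON ph.PersonID = pe.PersonID
    JOIN Priority AS pr ON pr.PhoneType = ph.PhoneType
    -- Keep only the phone whose rank is the best one available for this person.
    WHERE pr.Rnk = (SELECT MIN(pr2.Rnk)
                    FROM Phone AS ph2
                    JOIN Priority AS pr2 ON pr2.PhoneType = ph2.PhoneType
                    WHERE ph2.PersonID = pe.PersonID)
    ORDER BY pe.PersonID
""").fetchall()
conn.close()
print(best_phones)
```

With the question's data this returns the type 4 number for David and the type 1 number for Victor, which matches the output the question asks for.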
You have too many WHEN statements. If you only want to do something for value 7 you should only have one WHEN statement:
CASE
WHEN PhoneType = 7 and PhoneNbr is not null THEN 1
ELSE NULL
END AS RN

Tables for recursive hierarchial data in SQL [duplicate]

Possible Duplicate:
Simplest way to do a recursive self-join in SQL Server?
I have to create a table in SQL that will consist of groups of items/products. Each new group will be created under one of the pre-defined groups or one of the groups previously formed. I want to keep all this data in a SQL table. So far, I have thought of creating a table like this:
Group ID
Group Name
Group Under (this will store the ID of the group that this group belongs to)
But this can only refer to the next level up; how will I get to know the super-parent of this group?
For example:
I have groups A, B, C.
A has further subgroups A1, A2, A3.
A1 has further subgroups, A11, A12, A13.
How will I get the information about the super-parent group, i.e. A, from A11 or A22 or A33?
Let me know if the problem is not clear..
Assuming T-SQL and MSSQLServer (you didn't specify), and given that your Group table should look something like this:
Id | Name | ParentId
---+------+---------
1 | A | NULL
2 | B | NULL
3 | C | NULL
4 | A1 | 1
5 | A2 | 1
6 | A3 | 1
7 | A11 | 4
8 | A12 | 4
9 | A13 | 4
You can use the following recursive CTE to find the top-level ancestor of a given group, say 'A12':
WITH [Group](Id, Name, ParentId) AS
(
SELECT 1, 'A' , NULL UNION
SELECT 2, 'B' , NULL UNION
SELECT 3, 'C' , NULL UNION
SELECT 4, 'A1' , 1 UNION
SELECT 5, 'A2' , 1 UNION
SELECT 6, 'A3' , 1 UNION
SELECT 7, 'A11', 4 UNION
SELECT 8, 'A12', 4 UNION
SELECT 9, 'A13', 4
), q AS
(
SELECT
*
FROM
[Group]
WHERE
[Name] = 'A12' -- Given 'A12' as the child
UNION ALL
SELECT
g.*
FROM
[Group] g
JOIN
q
ON
q.ParentId = g.Id
)
SELECT
*
FROM
q
WHERE
ParentId IS NULL
This query returns:
Id | Name | ParentId
---+------+---------
1 | A | NULL

Query for missing elements

I have a table with the following structure:
timestamp | name | value
0 | john | 5
1 | NULL | 3
8 | NULL | 12
12 | john | 3
33 | NULL | 4
54 | pete | 1
180 | NULL | 4
400 | john | 3
401 | NULL | 4
592 | anna | 2
Now what I am looking for is a query that will give me the sum of the values for each name, treating the NULLs in between (ordered by timestamp) as the first non-null name down the list, as if the table were as follows:
timestamp | name | value
0 | john | 5
1 | john | 3
8 | john | 12
12 | john | 3
33 | pete | 4
54 | pete | 1
180 | john | 4
400 | john | 3
401 | anna | 4
592 | anna | 2
and I would query SUM(value), name from this table group by name. I have thought and tried, but I can't come up with a proper solution. I have looked at recursive common table expressions, and think the answer may lie in there, but I haven't been able to properly understand those.
These tables are just examples, and I don't know the timestamp values in advance.
Could someone give me a hand? Help would be very much appreciated.
With Inputs As
(
Select 0 As [timestamp], 'john' As Name, 5 As value
Union All Select 1, NULL, 3
Union All Select 8, NULL, 12
Union All Select 12, 'john', 3
Union All Select 33, NULL, 4
Union All Select 54, 'pete', 1
Union All Select 180, NULL, 4
Union All Select 400, 'john', 3
Union All Select 401, NULL, 4
Union All Select 592, 'anna', 2
)
, NamedInputs As
(
Select I.timestamp
, Coalesce (I.Name
, (
Select I3.Name
From Inputs As I3
Where I3.timestamp = (
Select Max(I2.timestamp)
From Inputs As I2
Where I2.timestamp < I.timestamp
And I2.Name Is not Null
)
)) As name
, I.value
From Inputs As I
)
Select NI.name, Sum(NI.Value) As Total
From NamedInputs As NI
Group By NI.name
Btw, what would be orders of magnitude faster than any query would be to first correct the data. I.e., update the name column to have the proper value, make it non-nullable and then run a simple Group By to get your totals.
Additional Solution
Select Coalesce(I.Name, I2.Name), Sum(I.value) As Total
From Inputs As I
Left Join (
Select I1.timestamp, MAX(I2.Timestamp) As LastNameTimestamp
From Inputs As I1
Left Join Inputs As I2
On I2.timestamp < I1.timestamp
And I2.Name Is Not Null
Group By I1.timestamp
) As Z
On Z.timestamp = I.timestamp
Left Join Inputs As I2
On I2.timestamp = Z.LastNameTimestamp
Group By Coalesce(I.Name, I2.Name)
You don't need CTE, just a simple subquery.
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)), t.value
from inputs t
And summing from here
select name, SUM(value) as totalValue
from
(
select t.timestamp, ISNULL(t.name, (
select top(1) i.name
from inputs i
where i.timestamp < t.timestamp
and i.name is not null
order by i.timestamp desc
)) as name, t.value
from inputs t
) N
group by name
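Note that the queries above fill each NULL from the previous named row (timestamp less than the current one), whereas the expected table in the question fills it from the next named row down the list. The sketch below shows both directions side by side, using SQLite through Python's sqlite3 module so that it runs standalone. The column name `ts` (in place of timestamp) and the function `totals` are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputs (ts INTEGER, name TEXT, value INTEGER)")
conn.executemany("INSERT INTO inputs VALUES (?, ?, ?)", [
    (0, "john", 5), (1, None, 3), (8, None, 12), (12, "john", 3), (33, None, 4),
    (54, "pete", 1), (180, None, 4), (400, "john", 3), (401, None, 4), (592, "anna", 2),
])


def totals(fill_from):
    """Sum value per name, filling NULL names from the 'next' or 'previous' named row."""
    if fill_from not in ("next", "previous"):
        raise ValueError("fill_from must be 'next' or 'previous'")
    # 'next' looks at later timestamps and takes the closest one; 'previous' mirrors it.
    op, order = (">", "ASC") if fill_from == "next" else ("<", "DESC")
    query = f"""
        SELECT filled, SUM(value)
        FROM (
            SELECT COALESCE(t.name, (
                       SELECT i.name
                       FROM inputs AS i
                       WHERE i.name IS NOT NULL AND i.ts {op} t.ts
                       ORDER BY i.ts {order}
                       LIMIT 1)) AS filled,
                   t.value
            FROM inputs AS t
        )
        GROUP BY filled
    """
    return dict(conn.execute(query).fetchall())


print(totals("next"))      # matches the filled table shown in the question
print(totals("previous"))  # matches the direction used by the queries above
```

In T-SQL the equivalent change would be to flip the comparison to `i.timestamp > t.timestamp` and the ordering to ascending in the `top(1)` subquery.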
I hope I'm not going to be embarrassed by offering you this little recursive CTE query of mine as a solution to your problem.
;WITH
numbered_table AS (
SELECT
timestamp, name, value,
rownum = ROW_NUMBER() OVER (ORDER BY timestamp)
FROM your_table
),
filled_table AS (
SELECT
timestamp,
name,
value,
rownum -- carried along so the recursive step can join on it
FROM numbered_table
WHERE rownum = 1
UNION ALL
SELECT
nt.timestamp,
name = ISNULL(nt.name, ft.name),
nt.value,
nt.rownum
FROM numbered_table nt
INNER JOIN filled_table ft ON nt.rownum = ft.rownum + 1
)
SELECT *
FROM filled_table
/* or go ahead aggregating instead */