How to group-concatenate multiple columns? - sql

Assume this table:
PurchaseID | Customer | Product | Method
-----------|----------|----------|--------
1 | John | Computer | Credit
2 | John | Mouse | Cash
3 | Will | Computer | Credit
4 | Will | Mouse | Cash
5 | Will | Speaker | Cash
6 | Todd | Computer | Credit
I want to generate a report on each customer of what they bought, and their payment methods.
But I want that report to be one row per customer, such as:
Customer | Products | Methods
---------|--------------------------|--------------
John | Computer, Mouse | Credit, Cash
Will | Computer, Mouse, Speaker | Credit, Cash
Todd | Computer | Credit
What I've found so far is to group-concatenate using the XML PATH method, such as:
SELECT
    p.Customer,
    STUFF((
        SELECT ', ' + xp.Product
        FROM Purchases xp
        WHERE xp.Customer = p.Customer
        FOR XML PATH('')), 1, 2, '') AS Products,
    STUFF((
        SELECT ', ' + xp.Method
        FROM Purchases xp
        WHERE xp.Customer = p.Customer
        FOR XML PATH('')), 1, 2, '') AS Methods
FROM Purchases p
GROUP BY p.Customer
This gives me the result, but my concern is the speed of it.
At first glance there are three different selects going on here, two of which are evaluated once per outer row, scanning Purchases again each time. Eventually this would slow down quickly (roughly quadratically) as the table grows.
So, is there a way to do this with better performance?
I also want to add even more columns to aggregate; should I repeat this STUFF() block for every column? That doesn't sound fast to me.
Suggestions?
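For comparison, on SQL Server 2017 and later the same report can be written with STRING_AGG; a minimal sketch, assuming the Purchases table above:
-- Sketch, assuming SQL Server 2017+. Note it keeps duplicates:
-- Will's methods come out as "Credit, Cash, Cash" unless you de-duplicate first.
SELECT Customer,
       STRING_AGG(Product, ', ') AS Products,
       STRING_AGG(Method, ', ') AS Methods
FROM Purchases
GROUP BY Customer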

Just an idea:
DECLARE @t TABLE (
Customer VARCHAR(50),
Product VARCHAR(50),
Method VARCHAR(50),
INDEX ix CLUSTERED (Customer)
)
INSERT INTO @t (Customer, Product, Method)
VALUES
('John', 'Computer', 'Credit'),
('John', 'Mouse', 'Cash'),
('Will', 'Computer', 'Credit'),
('Will', 'Mouse', 'Cash'),
('Will', 'Speaker', 'Cash'),
('Todd', 'Computer', 'Credit')
SELECT t.Customer
, STUFF(CAST(x.query('a/text()') AS NVARCHAR(MAX)), 1, 2, '')
, STUFF(CAST(x.query('b/text()') AS NVARCHAR(MAX)), 1, 2, '')
FROM (
SELECT DISTINCT Customer
FROM @t
) t
OUTER APPLY (
SELECT DISTINCT [a] = CASE WHEN id = 'a' THEN ', ' + val END
, [b] = CASE WHEN id = 'b' THEN ', ' + val END
FROM @t t2
CROSS APPLY (
VALUES ('a', t2.Product)
, ('b', t2.Method)
) t3 (id, val)
WHERE t2.Customer = t.Customer
FOR XML PATH(''), TYPE
) t2 (x)
Output:
Customer Product Method
---------- -------------------------- ------------------
John Computer, Mouse Cash, Credit
Todd Computer Credit
Will Computer, Mouse, Speaker Cash, Credit
Another idea with more performance benefits:
IF OBJECT_ID('tempdb.dbo.#EntityValues') IS NOT NULL
DROP TABLE #EntityValues
DECLARE @Values1 VARCHAR(MAX)
, @Values2 VARCHAR(MAX)
SELECT Customer
, Product
, Method
, RowNum = ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY 1/0)
, Values1 = CAST(NULL AS VARCHAR(MAX))
, Values2 = CAST(NULL AS VARCHAR(MAX))
INTO #EntityValues
FROM @t
UPDATE #EntityValues
SET
@Values1 = Values1 =
CASE WHEN RowNum = 1
THEN Product
ELSE @Values1 + ', ' + Product
END
, @Values2 = Values2 =
CASE WHEN RowNum = 1
THEN Method
ELSE @Values2 + ', ' + Method
END
SELECT Customer
, Values1 = MAX(Values1)
, Values2 = MAX(Values2)
FROM #EntityValues
GROUP BY Customer
But with some limitations:
Customer Values1 Values2
------------- ----------------------------- ----------------------
John Computer, Mouse Credit, Cash
Todd Computer Credit
Will Computer, Mouse, Speaker Credit, Cash, Cash
Also check my old post about string aggregation:
http://www.codeproject.com/Articles/691102/String-Aggregation-in-the-World-of-SQL-Server

Another solution is the CLR method for group concatenation. Aaron Bertrand has done a performance comparison on this here.
If you can deploy CLR, then download the script from https://orlando-colamatteo.github.io/ms-sql-server-group-concat-sqlclr/ (it is free); all the details are in the documentation.
Your query then simply becomes:
SELECT Customer, dbo.GROUP_CONCAT(Product), dbo.GROUP_CONCAT(Method)
FROM Purchases
GROUP BY Customer
This query is short, easy to remember and use. The XML method also does the job, but remembering the code is a bit difficult (at least for me), and it brings problems like XML entitization, which can certainly be solved, along with some other pitfalls described in his blog.
Also, from a performance viewpoint, using .query is time consuming; I had the same performance issues. You can find the question I raised about it here: https://dba.stackexchange.com/questions/125771/multiple-column-concatenation
Check the version 2 given by Kenneth Fisher (a nested XML concatenation method) or the unpivot/pivot method suggested by spaghettidba.
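For reference, the usual way to avoid the entitization problem without deploying CLR is to add TYPE to FOR XML PATH and pull the text back out with .value(); a minimal sketch, assuming the Purchases table from the question:
-- Sketch: TYPE plus .value() returns the concatenated text without XML-escaping characters such as & or <.
SELECT p.Customer,
       STUFF((SELECT ', ' + xp.Product
              FROM Purchases xp
              WHERE xp.Customer = p.Customer
              FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '') AS Products
FROM Purchases p
GROUP BY p.Customer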

This is one of the use cases for recursive CTEs (Common Table Expressions). You can learn more here https://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
;
WITH CTE1 (PurchaseID, Customer, Product, Method, RowID)
AS
(
SELECT
PurchaseID, Customer, Product, Method,
ROW_NUMBER() OVER (PARTITION BY Customer ORDER BY Customer)
FROM
#tbl
/* This table holds source data. I omitted declaring and inserting
data into it because that's not important. */
)
, CTE2 (PurchaseID, Customer, Product, Method, RowID)
AS
(
SELECT
PurchaseID, Customer,
CONVERT(VARCHAR(MAX), Product),
CONVERT(VARCHAR(MAX), Method),
1
FROM
CTE1
WHERE
RowID = 1
UNION ALL
SELECT
CTE2.PurchaseID, CTE2.Customer,
CONVERT(VARCHAR(MAX), CTE2.Product + ',' + CTE1.Product),
CONVERT(VARCHAR(MAX), CTE2.Method + ',' + CTE1.Method),
CTE2.RowID + 1
FROM
CTE2 INNER JOIN CTE1
ON CTE2.Customer = CTE1.Customer
AND CTE2.RowID + 1 = CTE1.RowID
)
SELECT Customer, MAX(Product) AS Products, MAX(Method) AS Methods
FROM CTE2
GROUP BY Customer
Output:
Customer Products Methods
John Computer,Mouse Credit,Cash
Todd Computer Credit
Will Computer,Mouse,Speaker Credit,Cash,Cash
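One caveat with the recursive approach: SQL Server stops recursion at 100 levels by default, so a customer with more than about 100 purchases would make the query fail unless the limit is lifted on the final statement, for example:
SELECT Customer, MAX(Product) AS Products, MAX(Method) AS Methods
FROM CTE2
GROUP BY Customer
OPTION (MAXRECURSION 0) -- 0 removes the recursion limit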

SQL - Separating a merge field into separate fields based on delimiters

I have a table with an nvarchar(max) column including a merged text like below:
ID MyString
61 Team:Finance,Accounting,HR,Country:Global,
62 Country:Germany,
63 Team:Legal,
64 Team:Finance,Accounting,Country:Global,External:Tenants,Partners,
65 External:Vendors,
What I need is to create another table for each item having the Team, Country and External values separated into 3 different columns.
Id Team Country External
61 Finance,Accounting,HR Global NULL
62 NULL Germany NULL
63 Legal NULL NULL
64 Finance,Accounting Global Tenants,Partners
65 NULL NULL Vendors
What is the most efficient way to do it? I've been trying to use STRING_SPLIT but couldn't manage it.
Any help would be appreciated.
Please try the following solution.
The data resembles JSON, so we'll compose proper JSON via a few REPLACE() function calls.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT PRIMARY KEY, tokens NVARCHAR(MAX));
INSERT INTO @tbl (ID, tokens) VALUES
(61, 'Team:Finance,Accounting,HR,Country:Global,'),
(62, 'Country:Germany,'),
(63, 'Team:Legal,'),
(64, 'Team:Finance,Accounting,Country:Global,External:Tenants,Partners,'),
(65, 'External:Vendors,');
-- DDL and sample data population, end
SELECT *
FROM @tbl
CROSS APPLY OPENJSON('{"' + REPLACE(REPLACE(REPLACE(TRIM(',' FROM tokens), ':', '": "')
,',Country', '", "Country')
,',External', '", "External') + '"}')
WITH
(
Team VARCHAR(100) '$.Team',
Country VARCHAR(100) '$.Country',
[External] VARCHAR(100) '$.External'
) AS u;
Output
+----+-------------------------------------------------------------------+-----------------------+---------+------------------+
| ID | tokens | Team | Country | External |
+----+-------------------------------------------------------------------+-----------------------+---------+------------------+
| 61 | Team:Finance,Accounting,HR,Country:Global, | Finance,Accounting,HR | Global | NULL |
| 62 | Country:Germany, | NULL | Germany | NULL |
| 63 | Team:Legal, | Legal | NULL | NULL |
| 64 | Team:Finance,Accounting,Country:Global,External:Tenants,Partners, | Finance,Accounting | Global | Tenants,Partners |
| 65 | External:Vendors, | NULL | NULL | Vendors |
+----+-------------------------------------------------------------------+-----------------------+---------+------------------+
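To make the REPLACE() chain concrete: for ID 61 it turns the token string into the following JSON document (worked out by hand from the query above), which OPENJSON then shreds into the three columns:
{"Team": "Finance,Accounting,HR", "Country": "Global"}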
Firstly, let me repeat my comments here. SQL Server is the last place you should be doing this; its string manipulation is poor and you have a severely denormalised design, with denormalised data containing denormalised data. Fixing your design to a normalised approach must be a priority, as leaving your data in this state is only going to make things harder the further you go down this rabbit hole.
One method you could use to achieve this, however, would be a JSON splitter and some string re-aggregation, but it is really ugly. The choice of having the "column" and "row" delimiters both be a comma (,) makes this a complete mess, and I am not going to explain what it's doing because you just should not be doing this.
WITH YourTable AS(
SELECT *
FROM (VALUES(61,'Team:Finance,Accounting,HR,Country:Global,'),
(62,'Country:Germany,'),
(63,'Team:Legal,'),
(64,'Team:Finance,Accounting,Country:Global,External:Tenants,Partners,'),
(65,'External:Vendors,'))V(ID,MyString)),
PartiallyNormal AS(
SELECT YT.ID,
CONVERT(int,LEAD(OJC.[Key],1,OJC.[Key]) OVER (PARTITION BY ID ORDER BY OJC.[Key], OJV.[Key])) AS ColumnNo,
OJV.[value],
CONVERT(int,OJC.[key]) AS [key]
FROM YourTable YT
CROSS APPLY OPENJSON(CONCAT('["', REPLACE(YT.MyString,':','","'),'"]')) OJC
CROSS APPLY OPENJSON(CONCAT('["', REPLACE(OJC.[value],',','","'),'"]')) OJV),
WithNames AS(
SELECT ID,
ColumnNo,
[value],
[key],
FIRST_VALUE(PN.[Value]) OVER (PARTITION BY ID, ColumnNo ORDER BY [Key]) AS ColumnName
FROM PartiallyNormal PN)
SELECT ID,
TRIM(',' FROM STRING_AGG(CASE ColumnName WHEN 'Team' THEN NULLIF([value],'''') END,',') WITHIN GROUP (ORDER BY [key])) AS Team, --TRIM because I've not taken the time to work out why there is sometimes a trailing comma
TRIM(',' FROM STRING_AGG(CASE ColumnName WHEN 'Country' THEN NULLIF([value],'''') END,',') WITHIN GROUP (ORDER BY [key])) AS Country,
TRIM(',' FROM STRING_AGG(CASE ColumnName WHEN 'External' THEN NULLIF([value],'''') END,',') WITHIN GROUP (ORDER BY [key])) AS [External]
FROM WithNames WN
WHERE [value] <> [ColumnName]
GROUP BY ID
ORDER BY ID;
db<>fiddle
STRING_SPLIT in SQL Server 2017 doesn't tell us the order of the items in the list, so it can't be used here on its own.
Only SQL Server 2022 adds a parameter to STRING_SPLIT that reports the order of the items.
Until that version of SQL Server, the most efficient method would likely be CLR: write your parser in C# and call your function via CLR.
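For illustration, the SQL Server 2022 form with the ordinal enabled would look like this (a sketch, assuming a variable holding one MyString value):
DECLARE @s nvarchar(max) = N'Team:Finance,Accounting,HR,Country:Global,';
SELECT value, ordinal          -- ordinal preserves the position of each item
FROM STRING_SPLIT(@s, ':', 1)  -- the third argument (enable_ordinal) is new in SQL Server 2022
ORDER BY ordinal;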
Another option is:
splitting the string using the STRING_SPLIT function on the colon
extracting consecutive strings using the LAG function
removing the string identifiers (Team, Country and External)
aggregating on the ID to remove NULL values
Here's the query:
WITH cte AS (
SELECT ID,
LAG(value) OVER(PARTITION BY ID ORDER BY (SELECT 1)) AS prev_value,
value
FROM tab
CROSS APPLY STRING_SPLIT(MyString, ':')
)
SELECT ID,
MAX(CASE WHEN prev_value LIKE 'Team'
THEN REPLACE(value, ',Country', '') END) AS [Team],
MAX(CASE WHEN prev_value LIKE '%Country'
THEN LEFT(value, LEN(value)-1) END) AS [Country],
MAX(CASE WHEN prev_value LIKE '%External'
THEN LEFT(value, LEN(value)-1) END) AS [External]
FROM cte
GROUP BY ID
Check the demo here.

Check if multiple values exist in a table

I am trying to get data from table by checking some conditions:
Table Detail
CODE | PRODUCT |
FWD 4X4 | PROD1 |
Table Header
CODE | GROUP |
FWD | AAA |
4X4 | AAA |
FWD | CCC |
Expected Result
CODE | PRODUCT | GROUP |
FWD 4X4 | PROD1 | AAA |
Because group AAA has two codes: FWD & 4X4. Group CCC is not qualified because it has only one code.
Is it possible to do this with a SQL query? I've tried split string and cross apply, but haven't come close.
Maybe I will use a programming language if it's too complex, since I am not really good with SQL.
The code combination may become longer too (3 words or more).
Thanks in advance.
Use the STRING_AGG function if your SQL Server version is 2017 or later:
select a.code, a.product, b.[group] from tabledetail a
inner join
(SELECT [group], STRING_AGG (code, ' ') as c
FROM tableheader
GROUP BY [group]) b on a.code=b.c
for lower versions you can use the STUFF function:
select a.code, a.product, b.[group] from tabledetail a
inner join
(SELECT [group], c = STUFF(
(SELECT ' ' + code
FROM tableheader t1
WHERE t1.[group] = t2.[group]
FOR XML PATH (''))
, 1, 1, '') from tableheader t2
group by [group]) b on a.code=b.c
build the schema
create table Detail (CODE varchar(300) ,PRODUCT varchar(100) );
insert into Detail values ('FWD 4X4','PROD1');
insert into Detail values ('FWD','PROD2');
insert into Detail values ('FWD 4X4 FM','PROD3');
create table Header (CODE varchar(300) ,[GROUP] varchar(100) );
insert into Header values ('FWD','AAA');
insert into Header values ('4X4','AAA');
insert into Header values ('FWD','CCC');
insert into Header values ('4X4','DDD');
insert into Header values ('FM','DDD');
insert into Header values ('FWD','DDD');
solution sql
select d.CODE, d.PRODUCT, h.[GROUP]
from Detail d
inner join Header h on CHARINDEX(' ' +h.code+ ' ', ' ' + d.Code + ' ') > 0
inner join (
select [Group],count(Code) GroupCodesCount
from Header
Group By [Group]
) GroupCodes on GroupCodes.[Group] = h.[GROUP]
group by d.CODE, d.PRODUCT, h.[GROUP],GroupCodesCount
having len(d.CODE) - len(replace(d.CODE, ' ', '')) +1 = count(h.code) and count(h.code) = GroupCodesCount
output result
CODE PRODUCT GROUP
FWD PROD2 CCC
FWD 4X4 PROD1 AAA
FWD 4X4 FM PROD3 DDD
I join the Detail table with Header wherever the Detail code contains at least one code of the group (CHARINDEX(' ' + h.code + ' ', ' ' + d.Code + ' ') > 0), then join to the derived table that counts the codes of each group. After that I group by product and group, then filter to return only the groups whose code count matches exactly: having len(d.CODE) - len(replace(d.CODE, ' ', '')) + 1 = count(h.code) and count(h.code) = GroupCodesCount.
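For example, the word-count expression in that HAVING clause counts code words by comparing lengths with and without spaces:
-- 'FWD 4X4 FM' is 10 characters long and contains 2 spaces, so 10 - 8 + 1 = 3 code words
select len('FWD 4X4 FM') - len(replace('FWD 4X4 FM', ' ', '')) + 1 as CodeCount -- returns 3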
Hope it helps

How to synthesize attribute for joined tables

I have a view defined like this:
CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts
FROM [dbo].AllContracts AS C
INNER JOIN [dbo].AllContracts AS CC
ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND C.AssociatedUser IS NULL
AND C.UniqueID <> CC.UniqueID
It basically finds contracts where, for example, the first name and the birthday match. This works great. Now I want to add a synthetic attribute to each row, with the value taken from only one source row.
Let me give you an example to make it clearer. Suppose I have the following table:
UniqueID | FirstName | LastName | Birthday
1 | Peter | Smith | 1980-11-04
2 | Peter | Gray | 1980-11-04
3 | Peter | Gray-Smith| 1980-11-04
4 | Frank | May | 1985-06-09
5 | Frank-Paul| May | 1985-06-09
6 | Gina | Ericson | 1950-11-04
The resulting view should look like this:
UniqueID | PossiblyMatchingContracts | SyntheticID
1 | 2 | PeterSmith1980-11-04
1 | 3 | PeterSmith1980-11-04
2 | 1 | PeterSmith1980-11-04
2 | 3 | PeterSmith1980-11-04
3 | 1 | PeterSmith1980-11-04
3 | 2 | PeterSmith1980-11-04
4 | 5 | FrankMay1985-06-09
5 | 4 | FrankMay1985-06-09
6 | NULL | NULL [or] GinaEricson1950-11-04
Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.
Is it clear what I mean? Any ideas how this could be done in SQL?
Maybe it helps to elaborate a bit on the actual use case:
I am importing contracts from different systems. To account for the possibility of typos, or people who have married but whose last name was only updated in one system, I need to find so-called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies that if contract A matches contract B, contract B also matches contract A.
The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The first catch is that there shall be only one user object for multiple matching contracts; thus I'm creating these matches in the view. The second catch is that the creation of user objects happens by workflows, which run in parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check whether there is already a matching user object, or another workflow which is about to create said user object. Because the workflow engine is extremely slow compared to SQL, the workflows should not repeat the whole matching test. So the idea is to let each workflow check only the SyntheticID.
I have solved it with a multi-step approach:
1. Create the list of possible 1st-level matches for each contract
2. Create the base groups list, assigning a different group to each contract (as if they were not related to anybody)
3. Iterate the matches list, updating the group list when more contracts need to be added to a group
4. Recursively build up the SyntheticID from the final group list
5. Output results
First of all, let me explain what I have understood, so you can tell me whether my approach is correct or not.
1) Matching propagates in "cascade"
I mean, if "Peter Smith" is grouped up with "Peter Gray", it means that all Smiths and all Grays are related (if they have the same birth date), so Luke Smith can be in the same group as John Gray.
2) I have not understood what you mean by "birth name"
You say contracts match on "first, last or birth name"; sorry, I'm Italian, I thought birth name and first name were the same, and in your data there is no such column. Maybe it is related to the dash symbol between names?
When FirstName is Frank-Paul, does it mean it should match both Frank and Paul?
When LastName is Gray-Smith, does it mean it should match both Gray and Smith?
In the following code I have simply ignored this problem, but it could be handled if needed (I already gave it a try, breaking the names apart, unpivoting them and treating them as a double match, as sketched below).
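A rough sketch of that idea (not used in the steps below), assuming SQL Server 2016+ where STRING_SPLIT is available: break each hyphenated name into its parts, so that 'Gray-Smith' can also match 'Gray' and 'Smith', then treat two contracts as a possible match when they share the birthday and any name part.
-- Hypothetical helper: explode hyphenated first/last names into parts (uses the @cli table variable from Step Zero below)
select c.UniqueID, c.Birthday, n.part_type, s.value as name_part
from @cli c
cross apply (values ('F', c.FirstName), ('L', c.LastName)) n (part_type, full_name)
cross apply string_split(n.full_name, '-') s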
Step Zero: some declarations and base data preparation
declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int
;with
t as (
select *
from (values
(1 , 'Peter' , 'Smith' , '1980-11-04'),
(2 , 'Peter' , 'Gray' , '1980-11-04'),
(3 , 'Peter' , 'Gray-Smith', '1980-11-04'),
(4 , 'Frank' , 'May' , '1985-06-09'),
(5 , 'Frank-Paul', 'May' , '1985-06-09'),
(6 , 'Gina' , 'Ericson' , '1950-11-04')
) x (UniqueID , FirstName , LastName , Birthday)
)
insert into @cli
select * from t
Step One: Create the list of possible 1st level matches for each contract
;with
p as(select UniqueID, Birthday, FirstName, LastName from #cli),
m as (
select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
from p
join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m
Step Two: Create the base groups list
insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
Step Three: Iterate the matches list updating the group list
Only loop on contracts that have possible matches, and update only when needed.
set @id1 = 0
while not(@id1 is null) begin
set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
if not(@id1 is null) begin
set @g = (select grp from @grp where id=@id1)
update g set grp= @g
from @grp g
inner join @comb c on g.id = c.id2
where c.id2<>@id1 and c.id1=@id1
and grp<>@g
update @comb set done=1 where id1=@id1
end
end
Step Four: Build up the SyntheticID
Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.
;with
c as(
select c.*, g.grp
from @cli c
join @grp g on g.id = c.UniqueID
),
d as (
select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
from (
select distinct c.grp g, 1 t, FirstName s from c
union
select distinct c.grp, 2, LastName from c
) l
),
r as (
select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
from d
join @cli t on t.UniqueID=d.g
where n1=1
union all
select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') + d.s as varchar(1000)), r.n1, r.n2
from d
join r on r.g = d.g and r.n1=d.n1-1
)
insert into @str_id
select g, Names
from r
where n2=1
Step Five: Output results
select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp
Here are the results:
UniqueID PossibleMatchingContract SyntheticID
1 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
1 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
4 5 1985-06-09_Frank,Frank-Paul_May
5 4 1985-06-09_Frank,Frank-Paul_May
6 NULL 1950-11-04_Gina_Ericson
I think that in this way the resulting SyntheticID should also be "unique" for each group
This creates a synthetic value and is easy to change to suit your needs.
DECLARE @T TABLE (
UniqueID INT
,FirstName VARCHAR(200)
,LastName VARCHAR(200)
,Birthday DATE
)
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 1,'Peter','Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 2,'Peter','Gray','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 3,'Peter','Gray-Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 4,'Frank','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 5,'Frank-Paul','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 6,'Gina','Ericson','1950-11-04'
DECLARE @PossibleMatches TABLE (UniqueID INT,[PossibleMatch] INT,SynKey VARCHAR(2000)
)
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,t2.UniqueID,'Ln=' + t1.LastName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,pm.UniqueID,'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
LEFT JOIN @PossibleMatches pm on pm.UniqueID=t1.UniqueID
WHERE pm.UniqueID IS NULL
SELECT *
FROM @PossibleMatches
ORDER BY UniqueID,[PossibleMatch]
I think this will work for you
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C INNER JOIN
[dbo].AllContracts AS CC ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN(
SELECT UniqueID FROM [dbo].DefinitiveMatches)
AND C.AssociatedUser IS NULL
You can try this:
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C
INNER JOIN
[dbo].AllContracts AS CC
ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND
C.AssociatedUser IS NULL
This will generate one extra row (because we left out C.UniqueID <> CC.UniqueID) but will give you a good solution.
Following is an example with some sample data extracted from your original post. The idea: generate all SyntheticIDs in a CTE, query all records with a "PossibleMatch", and union it with all records which are not yet included:
DECLARE @t TABLE(
UniqueID int
,FirstName nvarchar(20)
,LastName nvarchar(20)
,Birthday datetime
)
INSERT INTO @t VALUES (1, 'Peter', 'Smith', '1980-11-04');
INSERT INTO @t VALUES (2, 'Peter', 'Gray', '1980-11-04');
INSERT INTO @t VALUES (3, 'Peter', 'Gray-Smith', '1980-11-04');
INSERT INTO @t VALUES (4, 'Frank', 'May', '1985-06-09');
INSERT INTO @t VALUES (5, 'Frank-Paul', 'May', '1985-06-09');
INSERT INTO @t VALUES (6, 'Gina', 'Ericson', '1950-11-04');
WITH ctePrep AS(
SELECT UniqueID, FirstName, LastName, BirthDay,
ROW_NUMBER() OVER (PARTITION BY FirstName, BirthDay ORDER BY FirstName, BirthDay) AS k,
FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
),
cteKeys AS(
SELECT FirstName, BirthDay, SyntheticID
FROM ctePrep
WHERE k = 1
),
cteFiltered AS(
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
keys.SyntheticID
FROM @t AS C
JOIN @t AS CC ON C.FirstName = CC.FirstName
AND C.Birthday = CC.Birthday
JOIN cteKeys AS keys ON keys.FirstName = c.FirstName
AND keys.Birthday = c.Birthday
WHERE C.UniqueID <> CC.UniqueID
)
SELECT UniqueID, PossiblyMatchingContracts, SyntheticID
FROM cteFiltered
UNION ALL
SELECT UniqueID, NULL, FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
WHERE UniqueID NOT IN (SELECT UniqueID FROM cteFiltered)
Hope this helps. The result looked OK to me:
UniqueID PossiblyMatchingContracts SyntheticID
---------------------------------------------------------------
2 1 PeterSmith1980-11-04
3 1 PeterSmith1980-11-04
1 2 PeterSmith1980-11-04
3 2 PeterSmith1980-11-04
1 3 PeterSmith1980-11-04
2 3 PeterSmith1980-11-04
4 NULL FrankMay1985-06-09
5 NULL Frank-PaulMay1985-06-09
6 NULL GinaEricson1950-11-04
Tested in SSMS, it works perfectly. :)
--create table structure
create table #temp
(
uniqueID int,
firstname varchar(15),
lastname varchar(15),
birthday date
)
--insert data into the table
insert #temp
select 1, 'peter','smith','1980-11-04'
union all
select 2, 'peter','gray','1980-11-04'
union all
select 3, 'peter','gray-smith','1980-11-04'
union all
select 4, 'frank','may','1985-06-09'
union all
select 5, 'frank-paul','may','1985-06-09'
union all
select 6, 'gina','ericson','1950-11-04'
select * from #temp
--solution is as below
select ab.uniqueID
, PossiblyMatchingContracts
, c.firstname+c.lastname+cast(c.birthday as varchar) as synID
from
(
select a.uniqueID
, case
when a.uniqueID < min(b.uniqueID)over(partition by a.uniqueid)
then a.uniqueID
else min(b.uniqueID)over(partition by a.uniqueid)
end as SmallestID
, b.uniqueID as PossiblyMatchingContracts
from #temp a
left join #temp b
on (a.firstname = b.firstname OR a.lastname = b.lastname) AND a.birthday = b.birthday AND a.uniqueid <> b.uniqueID
) as ab
left join #temp c
on ab.SmallestID = c.uniqueID
Say we have the following table (a VIEW in your case):
UniqueID PossiblyMatchingContracts SyntheticID
1 2 G1
1 3 G2
2 1 G3
2 3 G4
3 1 G4
3 4 G6
4 5 G7
5 4 G8
6 NULL G9
In your case you can set the initial SyntheticID to a string like PeterSmith1980-11-04, using the UniqueID for each line. Here is a recursive CTE query: it divides all lines into unconnected groups and selects MAX(SyntheticID) within the current group as the new SyntheticID for all lines in that group.
WITH CTE AS
(
SELECT CAST(','+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' as Varchar(MAX)) as GroupCont,
SyntheticID
FROM PossiblyMatchingContracts
UNION ALL
SELECT CAST(GroupCont+CAST(UniqueID AS Varchar(100)) +','+ CAST(PossiblyMatchingContracts as Varchar(100))+',' AS Varchar(MAX)) as GroupCont,
pm.SyntheticID
FROM CTE
JOIN PossiblyMatchingContracts as pm
ON
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
OR
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
)
AND NOT
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
AND
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
)
)
SELECT pm.UniqueID,
pm.PossiblyMatchingContracts,
ISNULL(
(SELECT MAX(SyntheticID) FROM CTE WHERE
(
CTE.GroupCont LIKE '%,'+CAST(pm.UniqueID AS Varchar(100))+',%'
OR
CTE.GroupCont LIKE '%,'+CAST(pm.PossiblyMatchingContracts AS Varchar(100))+',%'
))
,pm.SyntheticID) as SyntheticID
FROM PossiblyMatchingContracts pm

SQL SELECT with JOINS and Multiple Rows for one Column

The issue at hand is that I have a basic query (from a document software database) selecting from multiple tables with some LEFT JOINs, but one column in question needs to come from a table where there are multiple results per unique document ID (DocGUID).
CURRENT QUERY
SELECT doc.[Doc #]
, '' AS 'Authors'
, ud.[Lead Author]
, doc.[Title]
, ud.[Publication]
, ud.[Citation]
, ud.[Year]
, ud.[Month]
, ud.[Comments]
, notes.[Note]
FROM [tblDocuments] doc
LEFT JOIN [tblNotes] notes ON notes.[DocGUID] = doc.[DocGUID]
LEFT JOIN [tblUserData] ud ON ud.[MasterGUID] = doc.[DocGUID]
WHERE doc.[DocGUID] = '12345678'
As you can see, I have simply queried '' for "Authors". Here's where my issue comes in. I have a table named tblMultiValues where there are two or more authors listed per DocGUID.
Table Example: (for tblMultiValues)
|------|-------------|-------------|-------------------|
| Id | DocGUID | FieldName | Value |
|------|-------------|-------------|-------------------|
| 123 | 12345678 | Authors | Collins, Nick |
| 456 | 12345678 | Authors | Williams, Robert |
| 321 | 87654321 | Authors | Smith, Kate |
| 654 | 87654321 | Authors | Hanks, Tom |
|------|-------------|-------------|-------------------|
So, what I want to show for the 2nd column of 'Authors', is the result of:
Collins, Nick; Williams, Robert
Specifically for DocGUID of '12345678'
How might one go about doing this, mixed in with the query that is already built?
(I hope this was enough info... if more is needed, please advise).
-Nick
:::EDIT:::
I was able to get things running with the following code... (very well guided by the answer given by @mohan111):
SELECT DISTINCT
STUFF((
SELECT '; ' + mv2.Value
FROM [dbo].[tblMultiValues] mv2
WHERE mv1.DocGUID = mv2.DocGUID
FOR XML PATH ('')),1,2,'') AS 'Authors', mv1.FieldName, mv1.DocGUID
INTO #TempMultiVal
FROM [dbo].[tblMultiValues] mv1
SELECT doc.[Doc #]
, tmv.[Authors]
, ud.[Lead Author]
, doc.[Title]
, ud.[Publication]
, ud.[Citation]
, ud.[Year]
, ud.[Month]
, ud.[Comments]
, notes.[Note]
FROM [tblDocuments] doc
LEFT JOIN [tblNotes] notes ON notes.[DocGUID] = doc.[DocGUID]
LEFT JOIN [tblUserData] ud ON ud.[MasterGUID] = doc.[DocGUID]
LEFT JOIN #TempMultiVal tmv ON tmv.DoCGUID = doc.[DocGUID]
DROP TABLE #TempMultiVal
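On SQL Server 2017 or later, the temp-table step could likely be replaced by a correlated STRING_AGG subquery; a sketch, assuming the same tblMultiValues columns as above:
SELECT doc.[Doc #]
, (SELECT STRING_AGG(mv.Value, '; ')
   FROM [dbo].[tblMultiValues] mv
   WHERE mv.DocGUID = doc.[DocGUID]
   AND mv.FieldName = 'Authors') AS [Authors]
, ud.[Lead Author]
, doc.[Title]
, notes.[Note]
FROM [tblDocuments] doc
LEFT JOIN [tblNotes] notes ON notes.[DocGUID] = doc.[DocGUID]
LEFT JOIN [tblUserData] ud ON ud.[MasterGUID] = doc.[DocGUID]
WHERE doc.[DocGUID] = '12345678'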
Declare @table TABLE
(
Id INT,
DocGUID int,
FieldName VARCHAR(25),
Value VARCHAR(200)
);
INSERT INTO @table
( Id,
DocGUID,
FieldName,
Value
)
VALUES
(123,12345678,'Authors','Collins, Nick'),
(456,12345678,'Authors','Williams, Robert'),
(321,87654321,'Authors','Smith, Kate'),
(654,87654321,'Authors','Hanks, Tom');
Select distinct DocGUID,
(SELECT
Substring((SELECT ', ' + CAST(i.id AS VARCHAR(1024))
FROM
@table i
WHERE i.DocGUID = tt.DocGUID
ORDER BY i.id
FOR XML PATH('')), 3, 10000000) AS list) AS ID,
FieldName,
STUFF((Select distinct t.Value + ','
from @table t
where t.DocGUID = tt.DocGUID
FOR XML PATH(''),TYPE).value('.', 'NVARCHAR(MAX)')
, 1, 0, ' ') from @table tt

Flattening 1-to-many relationship

My current schema looks like this:
PersonType (PersonTypeID, Name, Description)
Person (PersonID,PersonTypeID, FirstName, ... )
PersonDynamicField (PersonDynamicFieldID, PersonTypeID, Name, Description, DataType, DefaultValue, ...)
PersonDynamicFieldValue (PersonDynamicFieldValueID, PersonDynamicFieldID, PersonID, Value, ...)
That is, a person is of a certain type, for example Customer. For each PersonType, additional fields to store about that type can be added dynamically. For a Customer, we might want to add fields to PersonDynamicField such as LikesChocolate, FavoriteColor, HappinessScale, etc. The values for those fields would then be stored in PersonDynamicFieldValue.
I hope my writing makes sense.
What I would like to do, is a query that can flatten this structure and return a result looking like this:
PersonID, PersonTypeID, FirstName, LikesChocolate, FavoriteColor, HappinessScale
1, 2, Robert, 1, Green, 9
2, 2, John, 0, Orange, 5
...
I'm kind of stuck and don't really know where to even start.
Can you help?
In order to get the result that you want there are several ways that you can convert the rows of data into columns.
Starting in SQL Server 2005, you can use the PIVOT function. The basic structure of the code will be:
SELECT personid, persontypeid, firstname,[FavoriteColor],[HappinessScale],[LikesChocolate]
from
(
select p.personid, p.persontypeid, p.firstname, f.name fields, v.value
from person p
inner join persontype pt
on p.persontypeid = pt.persontypeid
left join PersonDynamicField f
on p.PersonTypeID = f.PersonTypeID
left join PersonDynamicFieldValue v
on f.PersonDynamicFieldID = v.PersonDynamicFieldID
and p.personid = v.personid
) x
pivot
(
max(value)
for fields in ([FavoriteColor],[HappinessScale],[LikesChocolate])
) p;
See SQL Fiddle with Demo. The one issue that you are going to have with PIVOT is that it requires the values being converted to columns to be known when the query is written. For your situation this seems impossible, since the fields can change. As a result, you will have to use dynamic SQL to get the result:
DECLARE @cols AS NVARCHAR(MAX),
@query AS NVARCHAR(MAX)
select @cols = STUFF((SELECT distinct ',' + QUOTENAME(Name)
from PersonDynamicField
where PersonTypeID = 2
FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)')
,1,1,'')
set @query = 'SELECT personid, persontypeid, firstname,' + @cols + '
from
(
select p.personid,
p.persontypeid,
p.firstname,
f.name fields,
v.value
from person p
inner join persontype pt
on p.persontypeid = pt.persontypeid
left join PersonDynamicField f
on p.PersonTypeID = f.PersonTypeID
left join PersonDynamicFieldValue v
on f.PersonDynamicFieldID = v.PersonDynamicFieldID
and p.personid = v.personid
) x
pivot
(
max(value)
for fields in (' + @cols + ')
) p '
execute(@query);
See SQL Fiddle with Demo. These will give a result:
| PERSONID | PERSONTYPEID | FIRSTNAME | FAVORITECOLOR | HAPPINESSSCALE | LIKESCHOCOLATE |
-----------------------------------------------------------------------------------------
| 1 | 2 | Robert | Green | 9 | 1 |
| 2 | 2 | John | Orange | 5 | 0 |
What you want is commonly called a pivot and SQL Server has an operation that may help. Take a look at this article on MSDN for examples: http://msdn.microsoft.com/en-us/library/ms177410(v=sql.105).aspx