SELECT DISTINCT for data groups - sql

I have following table:
ID Data
1 A
2 A
2 B
3 A
3 B
4 C
5 D
6 A
6 B
etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:
DataID Data
101 A
102 A
102 B
103 C
104 D
So DataID 102 would resemble data (A,B), DataID 103 would resemble data (C), etc. In order to be able to rewrite my original table in this form:
ID DataID
1 101
2 102
3 102
4 103
5 104
6 102
How can I do that?
PS. Code to generate the first table:
CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'

In my opinion You have to create a custom aggregate that concatenates data (in case of strings CLR approach is recommended for perf reasons).
Then I would group by ID and select distinct from the grouping, adding a row_number()function or add a dense_rank() your choice. Anyway it should look like this
with groupings as (
select concat(data) groups
from Table1
group by ID
)
select groups, rownumber() over () from groupings

The following query using CASE will give you the result shown below.
From there on, getting the distinct datagroups and proceeding further should not really be a problem.
SELECT
id,
MAX(CASE data WHEN 'A' THEN data ELSE '' END) +
MAX(CASE data WHEN 'B' THEN data ELSE '' END) +
MAX(CASE data WHEN 'C' THEN data ELSE '' END) +
MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM t1
GROUP BY id
ID DataGroups
1 A
2 AB
3 AB
4 C
5 D
6 AB
However, this kind of logic will only work in case you the "Data" values are both fixed and known before hand.
In your case, you do say that is the case. However, considering that you also say that they are 1000 of them, this will be frankly, a ridiculous looking query for sure :-)
LuckyLuke's suggestion above would, frankly, be the more generic way and probably saner way to go about implementing the solution though in your case.

From your sample data (having added the missing 2,'A' tuple, the following gives the renumbered (and uniqueified) data:
with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
select ID,data
from #t1 where id in (select id from NonDups)
), Renumbered as (
select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered
Giving:
1 A
2 A
2 B
3 C
4 D
I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.

Just to share my own dirty solution that I'm using for the moment:
SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY (
SELECT CAST(Data AS VARCHAR) + ','
FROM #t1 t2
WHERE t2.id = t1.id
ORDER BY Data ASC
FOR XML PATH('') )
D ( Data )
And then going analog to LuckyLuke's solution.

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
id_1
id_2
1
a
2
a
2
b
3
b
4
c
5
c
6
d
6
e
7
f
I would like to be able recursively join then aggregate rows in order to find each disconnected sub-graph represented by these links - that is each collection of IDs that are linked together:
The desired output for the example above would look something like this:
id_1_coll
id_2_coll
1, 2, 3
a, b
4, 5
c
6
d, e
7
f
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even although there is no explicit link row because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 such that we get all the links from id_1 to itself then use a recursive common table expression to traverse all the possible paths between id_1 values then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are
Remodel the relationship into a series of self-joins for id_1
Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
id_1
linked_id
lowest_linked_id
1
2
1
2
1
1
2
3
2
3
2
2
4
5
4
5
4
4
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
which we can now link back to the original table for the id_2 values then aggregate (3.) as shown in the complete query below
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
grp
id1_coll
id2_coll
1
[1,2,3]
[a,b]
4
[4,5]
[c]
6
[6]
[d,e]
7
[7]
[f]
Limitations/Issues
Unfortunately this approach is inneficient (we have to traverse every single pathway before aggregating it back together) and fails with the real-world case where we have several million join rows. When trying to execute on this data BigQuery runs up a huge "Slot time consumed" then eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id in already in the list of linked_ids we dont need to check it further).
Using ROW_NUMBER() the query is as the follow:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not help your Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from #GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates as done by #RomanPekar
here (which I also tested). The main consideration seems to be performance. I'd assume dbms have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.

How to add rows to a specific number multiple times in the same query

I already asked for help on a part of my problem here.
I used to get 10 rows no matter if there are filled or not. But now I'm facing something else where I need to do it multiple times in the same query result.
WITH NUMBERS AS
(
SELECT 1 rowNumber
UNION ALL
SELECT 2
UNION ALL
SELECT 3
UNION ALL
SELECT 4
UNION ALL
SELECT 5
UNION ALL
SELECT 6
UNION ALL
SELECT 7
UNION ALL
SELECT 8
UNION ALL
SELECT 9
UNION ALL
SELECT 10
)
SELECT DISTINCT sp.SLC_ID, c.rowNumber, c.PCE_ID
FROM SELECT_PART sp
LEFT JOIN (
SELECT b.*
FROM NUMBERS
LEFT OUTER JOIN (
SELECT a.*
FROM (
SELECT SELECT_PART.SLC_ID, ROW_NUMBER() OVER (ORDER BY SELECT_PART.SLC_ID) as
rowNumber, SELECT_PART.PCE_ID
FROM SELECT_PART
WHERE SELECT_PART.SLC_ID = (must be the same as sp.SLC_ID and can''t hardcode it)
) a
) b
ON b.rowNumber = NUMBERS.rowNumber
) c ON c.SLC_ID = sp.SLC_ID
ORDER BY sp.SLC_ID, c.rowNumber
It works fine for the first 10 lines, but next SLC_ID only got 1 empty line
I need it to be like that
SLC_ID rowNumer PCE_ID
1 1 0001
1 2 0002
1 3 NULL
1 ... ...
1 10 NULL
2 1 0011
2 2 0012
2 3 0013
2 ... ...
2 10 0020
3 1 0021
3 ... ...
Really need it that way to build a report.
Instead of manually building a query-specific number list where you have to include every possible number you need (1 through 10 in this case), create a numbers table.
DECLARE #UpperBound INT = 1000000;
;WITH cteN(Number) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) - 1
FROM sys.all_columns AS s1
CROSS JOIN sys.all_columns AS s2
)
SELECT [Number] INTO dbo.Numbers
FROM cteN WHERE [Number] <= #UpperBound;
CREATE UNIQUE CLUSTERED INDEX CIX_Number ON dbo.Numbers([Number])
WITH
(
FILLFACTOR = 100, -- in the event server default has been changed
DATA_COMPRESSION = ROW -- if Enterprise & table large enough to matter
);
Source: mssqltips
Alternatively, since you can't add data, use a table that already exists in SQL Server.
WITH NUMBERS AS
(
SELECT DISTINCT Number as rowNumber FROM master..spt_values where type = 'P'
)
SELECT DISTINCT sp.SLC_ID, c.rowNumber, c.PCE_ID
FROM SELECT_PART sp
LEFT JOIN (
SELECT b.*
FROM NUMBERS
LEFT OUTER JOIN (
SELECT a.*
FROM(
SELECT SELECT_PART.SLC_ID, ROW_NUMBER() OVER (ORDER BY SELECT_PART.SLC_ID) as
rowNumber, SELECT_PART.PCE_ID
FROM SELECT_PART
WHERE SELECT_PART.SLC_ID = (must be the same as sp.SLC_ID and can''t hardcode it)
) a
) b
ON b.rowNumber = NUMBERS.rowNumber
) c ON c.SLC_ID = sp.SLC_ID
ORDER BY sp.SLC_ID, c.rowNumber
NOTE: Max value for this solution is 2047

Comparing Column Values and returning ERROR or OK

I'm needing to verify a source system with a destination system and ensure the values are matching between them. The problem is the source system is a total mess and is proving hard to validate.
I've got the following sample data where they should all be OK, but they're showing as ERROR. Does anyone know a way of doing a comparison that would result as an OK for all for the below?
CREATE TABLE #testdata (
ID INT
,ValueSource VARCHAR(800)
,ValueDestination VARCHAR(800)
,Value_Varchar_Check AS (
CASE
WHEN coalesce(ValueSource, '0') = coalesce(ValueDestination, '0')
THEN 'OK'
ELSE 'ERROR'
END
)
)
INSERT INTO #testdata (
ID
,ValueSource
,ValueDestination
)
SELECT 1
,'hepatitis c,other (specify)' 'hepatitis c, other (specify)'
UNION ALL
SELECT 2
,'lung problems / asthma,lung problems / asthma'
,'lung problems / asthma'
UNION ALL
SELECT 3
,'lung problems / asthma,diabetes'
,'diabetes, lung problems / asthma'
UNION ALL
SELECT 4
,'seizures/epilepsy,hepatitis c,seizures/epilepsy'
,'hepatitis c, seizures/epilepsy'
I don't think you can write this as a generated column as it is quite a tricky thing to compute. If you are using SQL Server 2016 or later, you can use STRING_SPLIT to convert the ValueSource and ValueDestination values into tables and then sort them alphabetically using a query like this:
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueSource, ',')
For ValueSource, this produces:
ID value rn
1 hepatitis c 1
1 other (specify) 2
2 lung problems / asthma 1
3 diabetes 1
3 lung problems / asthma 2
4 hepatitis c 1
4 seizures/epilepsy 2
You can then FULL OUTER JOIN those two tables on ID, value and rn, and detect an error when there are null values from either side (since that implies that the values for a given ID and rn don't match):
WITH t1 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueSource, ',')
),
t2 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueDestination, ',')
)
SELECT COALESCE(t1.ID, t2.ID) AS ID,
CASE WHEN COUNT(CASE WHEN t1.value IS NULL OR t2.value IS NULL THEN 1 END) > 0 THEN 'Error'
ELSE 'OK'
END AS Status
FROM t1
FULL OUTER JOIN t2 ON t2.ID = t1.ID AND t2.rn = t1.rn AND t2.value = t1.value
GROUP BY COALESCE(t1.ID, t2.ID)
Output (for your sample data):
ID Status
1 OK
2 OK
3 OK
4 OK
Demo on SQLFiddle
You can then use the entire query above as a CTE (call it t3) to update your original table:
UPDATE t
SET t.Value_Varchar_Check = t3.Status
FROM testdata t
JOIN t3 ON t.ID = t3.ID
Output:
ID ValueSource ValueDestination Value_Varchar_Check
1 hepatitis c,other (specify) hepatitis c, other (specify) OK
2 lung problems / asthma,lung problems / asthma lung problems / asthma OK
3 lung problems / asthma,diabetes diabetes, lung problems / asthma OK
4 seizures/epilepsy,hepatitis c,seizures/epilepsy hepatitis c, seizures/epilepsy OK
Demo on SQLFiddle

Get every combination of sort order and value of a csv

If I have a string with numbers separated by commas, like this:
Declare #string varchar(20) = '123,456,789'
And would like to return every possible combination + sort order of the values by doing this:
Select Combination FROM dbo.GetAllCombinations(#string)
Which would in result return this:
123
456
789
123,456
456,123
123,789
789,123
456,789
789,456
123,456,789
123,789,456
456,789,123
456,123,789
789,456,123
789,123,456
As you can see not only is every combination returned, but also each combination+sort order as well. The example shows only 3 values separated by commas, but should parse any amount--Recursive.
The logic needed would be somewhere in the realm of using a WITH CUBE statement, but the problem with using WITH CUBE (in a table structure instead of CSV of course), is that it won't shuffle the order of the values 123,456 456,123 etc., and will only provide each combination, which is only half of the battle.
Currently I have no idea what to try. If someone can provide some assistance it would be appreciated.
I use a User Defined Table-valued Function called split_delimiter that takes 2 values: the #delimited_string and the #delimiter_type.
CREATE FUNCTION [dbo].[split_delimiter](#delimited_string VARCHAR(8000), #delimiter_type CHAR(1))
RETURNS TABLE AS
RETURN
WITH cte10(num) AS
(
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,cte100(num) AS
(
SELECT 1
FROM cte10 t1, cte10 t2
)
,cte10000(num) AS
(
SELECT 1
FROM cte100 t1, cte100 t2
)
,cte1(num) AS
(
SELECT TOP (ISNULL(DATALENGTH(#delimited_string),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM cte10000
)
,cte2(num) AS
(
SELECT 1
UNION ALL
SELECT t.num+1
FROM cte1 t
WHERE SUBSTRING(#delimited_string,t.num,1) = #delimiter_type
)
,cte3(num,[len]) AS
(
SELECT t.num
,ISNULL(NULLIF(CHARINDEX(#delimiter_type,#delimited_string,t.num),0)-t.num,8000)
FROM cte2 t
)
SELECT delimited_item_num = ROW_NUMBER() OVER(ORDER BY t.num)
,delimited_value = SUBSTRING(#delimited_string, t.num, t.[len])
FROM cte3 t;
Using that I was able to parse the CSV to a table and join it back to itself multiple times and use WITH ROLLUP to get the permutations you are looking for.
WITH Numbers as
(
SELECT delimited_value
FROM dbo.split_delimiter('123,456,789',',')
)
SELECT CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR)
FROM Numbers as Nums1
LEFT JOIN Numbers as Nums2
ON Nums2.delimited_value not in (Nums1.delimited_value)
LEFT JOIN Numbers as Nums3
ON Nums3.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value)
LEFT JOIN Numbers as Nums4
ON Nums4.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value, Nums3.delimited_value)
GROUP BY CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR) WITH ROLLUP
If you will potentially have more than 3 or 4, you'll want to expand your code accordingly.

Find overlapping sets of data in a table

I need to identify duplicate sets of data and give those sets who's data is similar a group id.
id threshold cost
-- ---------- ----------
1 0 9
1 100 7
1 500 6
2 0 9
2 100 7
2 500 6
I have thousands of these sets, most are the same with different id's. I need find all the like sets that have the same thresholds and cost amounts and give them a group id. I'm just not sure where to begin. Is the best way to iterate and insert each set into a table and then each iterate through each set in the table to find what already exists?
This is one of those cases where you can try to do something with relational operators. Or, you can just say: "let's put all the information in a string and use that as the group id". SQL Server seems to discourage this approach, but it is possible. So, let's characterize the groups using:
select d.id,
(select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
from data d2
where d2.id = d.id
for xml path ('')
order by threshold
) as groupname
from data d
group by d.id;
Oh, I think that solves your problem. The groupname can serve as the group id. If you want a numeric id (which is probably a good idea, use dense_rank():
select d.id, dense_rank() over (order by groupname) as groupid
from (select d.id,
(select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
from data d2
where d2.id = d.id
for xml path ('')
order by threshold
) as groupname
from data d
group by d.id
) d;
Here's the solution to my interpretation of the question:
IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;
;
WITH BaseTable AS
(
SELECT 1 id, 0 as threshold, 9 as cost
UNION SELECT 1, 100, 7
UNION SELECT 1, 500, 6
UNION SELECT 2, 0, 9
UNION SELECT 2, 100, 7
UNION SELECT 2, 500, 6
UNION SELECT 3, 1, 9
UNION SELECT 3, 100, 7
UNION SELECT 3, 500, 6
)
, BaseCTE AS
(
SELECT
id
--,dense_rank() over (order by threshold, cost ) as GroupId
,
(
SELECT CAST(TblGrouping.threshold AS varchar(8000)) + '/' + CAST(TblGrouping.cost AS varchar(8000)) + ';'
FROM BaseTable AS TblGrouping
WHERE TblGrouping.id = BaseTable.id
ORDER BY TblGrouping.threshold, TblGrouping.cost
FOR XML PATH ('')
) AS MultiGroup
FROM BaseTable
GROUP BY id
)
,
CTE AS
(
SELECT
*
,DENSE_RANK() OVER (ORDER BY MultiGroup) AS GroupId
FROM BaseCTE
)
SELECT *
INTO #tempGrouping
FROM CTE
-- SELECT * FROM #tempGrouping;
UPDATE BaseTable
SET BaseTable.GroupId = #tempGrouping.GroupId
FROM BaseTable
INNER JOIN #tempGrouping
ON BaseTable.Id = #tempGrouping.Id
IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;
Where BaseTable is your table, and and you don't need the CTE "BaseTable", because you have a data table.
You may need to take extra-precautions if your threshold and cost fields can be NULL.