Eliminating null values in union - sql

I'm doing a query across databases with an identical structure, to show a mapping from a source value to a target value.
Every one of my databases has a table with two columns: source and target
DB1
Source | Target
A | X
A | Y
B | NULL
C | NULL
DB2
Source | Target
A | NULL
A | Y
B | Z
So my query is
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
What I'm getting is
Source | Target
A | X
A | Y
B | NULL
C | NULL
B | Z
A | NULL
But I'm only interested in the target being NULL, if there is no other mapping present.
So I'm looking for the following result:
Source | Target
A | X
A | Y
C | NULL
B | Z
How can I easily eliminate the highlighted rows A | NULL and B | NULL from my results?
I've seen a few answers suggesting using MAX(Target), but that won't work for me since I can have multiple valid mappings for a single source (A | X and A | Y)

Something like this would work: assign a rank based on whether Target is NULL, and select only the first rank:
SELECT TOP(1) WITH TIES UN.Source
, UN.Target
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
ORDER BY DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END)

You might find it easier to think in terms of minimums:
with data as (
select Source, Target from DB1.<table> union
select Source, Target from DB2.<table>
), qualified as (
select *,
case when Target is not null or min(Target) over (partition by Source) is null
then 1 end as Keep
from data
)
select Source, Target from qualified where Keep = 1;

For completeness, here's the solution I went with, based on the answer of @HoneyBadger, including the suggestion made by @MartinSmith in the comments:
SELECT * FROM
(SELECT UN.Source
, UN.Target
, DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END) as ranking
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
) UN2
WHERE UN2.ranking = 1
ORDER BY UN2.Source, UN2.Target
This solution selects only the records that have a DENSE_RANK of 1, avoiding the TOP(1) WITH TIES.
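A quick way to check the ranking approach against the sample data is to run it in SQLite from Python. This is only a sketch under assumptions: db1_map and db2_map are hypothetical stand-ins for DB1.table and DB2.table, and SQLite (3.25 or later, for window functions) stands in for SQL Server.

```python
import sqlite3

# Hypothetical stand-ins for DB1.table and DB2.table, loaded with the sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE db1_map (Source TEXT, Target TEXT);
    CREATE TABLE db2_map (Source TEXT, Target TEXT);
    INSERT INTO db1_map VALUES ('A', 'X'), ('A', 'Y'), ('B', NULL), ('C', NULL);
    INSERT INTO db2_map VALUES ('A', NULL), ('A', 'Y'), ('B', 'Z');
""")

QUERY = """
    SELECT Source, Target
    FROM (
        SELECT UN.Source, UN.Target,
               -- Non-NULL targets rank 1; NULL targets rank 1 only when
               -- the source has no non-NULL target at all.
               DENSE_RANK() OVER (
                   PARTITION BY UN.Source
                   ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END
               ) AS ranking
        FROM (
            SELECT Source, Target FROM db1_map
            UNION
            SELECT Source, Target FROM db2_map
        ) AS UN
    ) AS UN2
    WHERE ranking = 1
    ORDER BY Source, Target
"""

rows = conn.execute(QUERY).fetchall()
for source, target in rows:
    print(source, target)
```

With the sample rows this keeps A | X, A | Y, B | Z and C | NULL, and drops A | NULL and B | NULL.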

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
|id_1|id_2|
|----|----|
|1|a|
|2|a|
|2|b|
|3|b|
|4|c|
|5|c|
|6|d|
|6|e|
|7|f|
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together.
The desired output for the example above would look something like this:
|id_1_coll|id_2_coll|
|---------|---------|
|1, 2, 3|a, b|
|4, 5|c|
|6|d, e|
|7|f|
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, use a recursive common table expression to traverse all the possible paths between id_1 values, and then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
|id_1|linked_id|lowest_linked_id|
|----|---------|----------------|
|1|2|1|
|2|1|1|
|2|3|2|
|3|2|2|
|4|5|4|
|5|4|4|
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
which we can now link back to the original table for the id_2 values then aggregate (3.) as shown in the complete query below
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
|grp|id1_coll|id2_coll|
|---|--------|--------|
|1|[1,2,3]|[a,b]|
|4|[4,5]|[c]|
|6|[6]|[d,e]|
|7|[7]|[f]|
Limitations/Issues
Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails in the real-world case, where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id already in the list of linked_ids, we don't need to check it further).
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not help with the problems described under Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem, as referenced in the other answers and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates, as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.
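To make the "graph-walking" framing concrete: the groups the question asks for are the connected components of a graph whose nodes are the id_1 and id_2 values. The sketch below is Python, not BigQuery, and uses a union-find structure, which none of the answers above use; it is only meant to illustrate how groups can be merged as links are read instead of enumerating every path. The function name connected_groups is made up for this example.

```python
from collections import defaultdict


def connected_groups(links):
    """Group id_1 and id_2 values that can reach each other through the links."""
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag each value with its column so an id_1 can never collide with an id_2.
    for id_1, id_2 in links:
        union(("id_1", id_1), ("id_2", id_2))

    # Collect both ID collections per component root.
    groups = defaultdict(lambda: (set(), set()))
    for id_1, id_2 in links:
        root = find(("id_1", id_1))
        groups[root][0].add(id_1)
        groups[root][1].add(id_2)

    return sorted((sorted(ones), sorted(twos)) for ones, twos in groups.values())


links = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]
groups = connected_groups(links)
for id_1_coll, id_2_coll in groups:
    print(id_1_coll, id_2_coll)
```

Each link is processed once, so the work grows roughly linearly with the number of rows. Whether the data can be exported for this kind of processing is a separate question that the sketch does not address.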

How to add rows to a specific number multiple times in the same query

I already asked for help on a part of my problem here.
There I got 10 rows back whether they were filled or not. But now I'm facing something else: I need to do it multiple times in the same query result.
WITH NUMBERS AS
(
SELECT 1 rowNumber
UNION ALL
SELECT 2
UNION ALL
SELECT 3
UNION ALL
SELECT 4
UNION ALL
SELECT 5
UNION ALL
SELECT 6
UNION ALL
SELECT 7
UNION ALL
SELECT 8
UNION ALL
SELECT 9
UNION ALL
SELECT 10
)
SELECT DISTINCT sp.SLC_ID, c.rowNumber, c.PCE_ID
FROM SELECT_PART sp
LEFT JOIN (
SELECT b.*
FROM NUMBERS
LEFT OUTER JOIN (
SELECT a.*
FROM (
SELECT SELECT_PART.SLC_ID, ROW_NUMBER() OVER (ORDER BY SELECT_PART.SLC_ID) as
rowNumber, SELECT_PART.PCE_ID
FROM SELECT_PART
WHERE SELECT_PART.SLC_ID = (must be the same as sp.SLC_ID and can''t hardcode it)
) a
) b
ON b.rowNumber = NUMBERS.rowNumber
) c ON c.SLC_ID = sp.SLC_ID
ORDER BY sp.SLC_ID, c.rowNumber
It works fine for the first 10 rows, but the next SLC_ID only gets 1 empty row.
I need the result to look like this:
SLC_ID rowNumber PCE_ID
1 1 0001
1 2 0002
1 3 NULL
1 ... ...
1 10 NULL
2 1 0011
2 2 0012
2 3 0013
2 ... ...
2 10 0020
3 1 0021
3 ... ...
Really need it that way to build a report.
Instead of manually building a query-specific number list where you have to include every possible number you need (1 through 10 in this case), create a numbers table.
DECLARE @UpperBound INT = 1000000;
;WITH cteN(Number) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY s1.[object_id]) - 1
FROM sys.all_columns AS s1
CROSS JOIN sys.all_columns AS s2
)
SELECT [Number] INTO dbo.Numbers
FROM cteN WHERE [Number] <= @UpperBound;
CREATE UNIQUE CLUSTERED INDEX CIX_Number ON dbo.Numbers([Number])
WITH
(
FILLFACTOR = 100, -- in the event server default has been changed
DATA_COMPRESSION = ROW -- if Enterprise & table large enough to matter
);
Source: mssqltips
Alternatively, since you can't add data, use a table that already exists in SQL Server.
WITH NUMBERS AS
(
SELECT DISTINCT Number as rowNumber FROM master..spt_values where type = 'P'
)
SELECT DISTINCT sp.SLC_ID, c.rowNumber, c.PCE_ID
FROM SELECT_PART sp
LEFT JOIN (
SELECT b.*
FROM NUMBERS
LEFT OUTER JOIN (
SELECT a.*
FROM(
SELECT SELECT_PART.SLC_ID, ROW_NUMBER() OVER (ORDER BY SELECT_PART.SLC_ID) as
rowNumber, SELECT_PART.PCE_ID
FROM SELECT_PART
WHERE SELECT_PART.SLC_ID = (must be the same as sp.SLC_ID and can''t hardcode it)
) a
) b
ON b.rowNumber = NUMBERS.rowNumber
) c ON c.SLC_ID = sp.SLC_ID
ORDER BY sp.SLC_ID, c.rowNumber
NOTE: Max value for this solution is 2047
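The query above still contains the question's placeholder, so here is a small sketch of one way to get 10 rows per SLC_ID without hardcoding the ID. It runs in SQLite from Python as a stand-in for SQL Server, and it rests on two assumptions that are not in the original post: the row number restarts for each SLC_ID (PARTITION BY SLC_ID), and rows are ordered by PCE_ID. The sample rows are a shortened, made-up subset of the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SELECT_PART (SLC_ID INTEGER, PCE_ID TEXT);
    INSERT INTO SELECT_PART VALUES
        (1, '0001'), (1, '0002'),
        (2, '0011'), (2, '0012'), (2, '0013');
""")

QUERY = """
    WITH RECURSIVE NUMBERS(rowNumber) AS (
        -- Numbers 1..10; a numbers table would serve the same purpose.
        SELECT 1
        UNION ALL
        SELECT rowNumber + 1 FROM NUMBERS WHERE rowNumber < 10
    ),
    numbered AS (
        -- Assumption: numbering restarts per SLC_ID, ordered by PCE_ID.
        SELECT SLC_ID, PCE_ID,
               ROW_NUMBER() OVER (PARTITION BY SLC_ID ORDER BY PCE_ID) AS rowNumber
        FROM SELECT_PART
    ),
    ids AS (
        SELECT DISTINCT SLC_ID FROM SELECT_PART
    )
    SELECT ids.SLC_ID, NUMBERS.rowNumber, numbered.PCE_ID
    FROM ids
    CROSS JOIN NUMBERS              -- every SLC_ID gets all 10 slots
    LEFT JOIN numbered              -- empty slots stay NULL
           ON numbered.SLC_ID = ids.SLC_ID
          AND numbered.rowNumber = NUMBERS.rowNumber
    ORDER BY ids.SLC_ID, NUMBERS.rowNumber
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The design choice is to cross join the distinct SLC_ID values with the numbers first and only then left join the real rows, so the padding is applied once per SLC_ID rather than once for the whole result.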

Consolidate information (time series) from two tables

MS SQL Server
I have two tables with different accounts from the same customer:
Table1:
ID | ACCOUNT | FROM | TO
1 | A | 01.10.2019 | 01.12.2019
1 | A | 01.02.2020 | 09.09.9999
and table2:
ID | ACCOUNT | FROM | TO
1 | B | 01.12.2019 | 01.01.2020
As a result I want a table that summarizes the history of this customer and shows when he had an active account and when he didn't.
Result:
ID | FROM | TO | ACTIV Y/N
1 | 01.10.2019 | 01.01.2020 | Y
1 | 02.01.2020 | 31.01.2020 | N
1 | 01.02.2020 | 09.09.9999 | Y
Can someone help me with some ideas how to proceed?
This is a typical gaps-and-islands problem, and it's usually not easy to solve.
You can achieve your goal using the query below; I will explain it a little.
You can test on this db<>fiddle.
First of all... I have unified your two tables into one to simplify the query.
-- ##table1
select 1 as ID, 'A' as ACCOUNT, convert(date,'2019-10-01') as F, convert(date,'2019-12-01') as T into ##table1
union all
select 1 as ID, 'A' as ACCOUNT, convert(date,'2020-02-01') as F, convert(date,'9999-09-09') as T
-- ##table2
select 1 as ID, 'B' as ACCOUNT, convert(date,'2019-12-01') as F, convert(date,'2020-01-01') as T into ##table2
-- ##table3
select * into ##table3 from ##table1 union all select * from ##table2
You can then get your gaps and island using, for example, a query like this.
It combines recursive cte to generate a calendar (cte_cal) and lag and lead operations to get the previous/next record information to build the gaps.
with
cte_cal as (
select min(F) as D from ##table3
union all
select dateadd(day,1,D) from cte_cal where D <= '2021-01-01'
),
table4 as (
select t1.ID, t1.ACCOUNT, t1.F, isnull(t2.T, t1.T) as T, lag(t2.F, 1,null) over (order by t1.F) as SUP
from ##table3 t1
left join ##table3 t2
on t1.T=t2.F
)
select
ID,
case when T = D then F else D end as "FROM",
isnull(dateadd(day,-1,lead(D,1,null) over (order by D)),'9999-09-09') as "TO",
case when case when T = D then F else D end = F then 'Y' else 'N' end as "ACTIV Y/N"
from (
select *
from cte_cal c
cross apply (
select t.*
from table4 t
where t.SUP is null
and (
c.D = t.T or
c.D = dateadd(day,1,t.T)
)
) t
union all
select F, * from table4 where T = '9999-09-09'
) p
order by 1
option (maxrecursion 0)
Dates like '9999-09-09' must be treated as exceptions; otherwise I would have to generate a calendar up to that date, and the query would take a long time to run.
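For comparison, the same gaps-and-islands logic can be sketched outside SQL without a calendar at all. The Python below is an illustration under assumptions, not a translation of the query above: it handles the periods of a single ID, treats periods that overlap, touch, or are one day apart as one active stretch, and the name activity_timeline is made up for this example.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 9, 9)  # the question's marker for "still open"
ONE_DAY = timedelta(days=1)


def activity_timeline(periods):
    """Return (from, to, 'Y'/'N') rows for one customer's (start, end) periods."""
    # Step 1: merge overlapping or adjacent periods into islands.
    merged = []
    for start, end in sorted(periods):
        # Note: adding a day would overflow only for date.max, not for OPEN_END.
        if merged and start <= merged[-1][1] + ONE_DAY:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Step 2: emit each island, preceded by the gap that separates it
    # from the previous island.
    timeline = []
    for i, (start, end) in enumerate(merged):
        if i > 0:
            gap_start = merged[i - 1][1] + ONE_DAY
            timeline.append((gap_start, start - ONE_DAY, "N"))
        timeline.append((start, end, "Y"))
    return timeline


# Periods of customer 1 from Table1 and table2 combined.
periods = [
    (date(2019, 10, 1), date(2019, 12, 1)),
    (date(2020, 2, 1), OPEN_END),
    (date(2019, 12, 1), date(2020, 1, 1)),
]
timeline = activity_timeline(periods)
for row in timeline:
    print(row)
```

Because the open-ended date is just another end value here, it needs no special handling.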

SQL query to get column names if it has specific value

I have a table whose columns each hold a flag ('Y' or 'N'). For a given row, I have to select the names of the columns that have a specific value.
My Table:
Name|sub-1|sub-2|sub-3|sub-4|sub-5|sub-6|
-----------------------------------------
Tom | Y | | Y | Y | | Y |
Jim | Y | Y | | | Y | Y |
Ram | | Y | | Y | Y | |
So I need to find all the subs that have the 'Y' flag for a particular Name.
For example:
If I select Tom, I need to get the list of 'Y' column names in the query output.
Subs
____
sub-1
sub-3
sub-4
sub-6
Your help is much appreciated.
The problem is that your database model is not normalized. If it was properly normalized the query would be easy. So the workaround is to normalize the model "on-the-fly" to be able to make the query:
select col_name
from (
select name, sub_1 as val, 'sub_1' as col_name
from the_table
union all
select name, sub_2, 'sub_2'
from the_table
union all
select name, sub_3, 'sub_3'
from the_table
union all
select name, sub_4, 'sub_4'
from the_table
union all
select name, sub_5, 'sub_5'
from the_table
union all
select name, sub_6, 'sub_6'
from the_table
) t
where name = 'Tom'
and val = 'Y'
The above is standard SQL and should work on any (relational) DBMS.
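Writing one SELECT per column by hand gets tedious, so the UNION ALL query can also be generated from a list of column names. The sketch below does that in Python and runs the result in SQLite as a stand-in database. Assumptions not in the original: the empty cells of the sample table are stored as NULL, the hyphenated column names are kept and quoted, and the column list is a fixed constant (it is pasted into the SQL text, so it must never come from user input).

```python
import sqlite3

# Fixed, trusted list of flag columns from the sample table.
COLUMNS = ["sub-1", "sub-2", "sub-3", "sub-4", "sub-5", "sub-6"]

conn = sqlite3.connect(":memory:")
column_defs = ", ".join('"{0}" TEXT'.format(col) for col in COLUMNS)
conn.execute("CREATE TABLE the_table (name TEXT, {0})".format(column_defs))
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Tom", "Y", None, "Y", "Y", None, "Y"),
        ("Jim", "Y", "Y", None, None, "Y", "Y"),
        ("Ram", None, "Y", None, "Y", "Y", None),
    ],
)

# One SELECT per column, glued together with UNION ALL.
ROW_TEMPLATE = """SELECT name, "{col}" AS val, '{col}' AS col_name FROM the_table"""
unpivoted = " UNION ALL ".join(ROW_TEMPLATE.format(col=col) for col in COLUMNS)

QUERY = """
    SELECT col_name
    FROM ({0}) AS t
    WHERE name = ? AND val = 'Y'
    ORDER BY col_name
""".format(unpivoted)

# The name is passed as a bound parameter, not pasted into the SQL.
subs = [row[0] for row in conn.execute(QUERY, ("Tom",))]
print(subs)
```

For Tom this yields sub-1, sub-3, sub-4 and sub-6, matching the output requested in the question.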
The code below works for me.
select t.Subs from (select name, u.subs,u.val
from TableName s
unpivot
(
val
for subs in ([sub-1], [sub-2], [sub-3], [sub-4], [sub-5], [sub-6])
) u where u.val='Y') T
where t.name='Tom'
Somehow I am near to the solution. I can get for all rows. (I just used 2 columns)
select col from ( select col, case s.col when 'sub-1' then sub-1 when 'sub-2' then sub-2 end AS val from mytable cross join ( select 'sub-1' AS col union all select 'sub-2' ) s ) s where val ='Y'
It gives the columns for all rows. I need the same data for a single row. For example, if I select "Tom", I need the column names that have the 'Y' value.
I'm answering this under a few assumptions. The first is that you KNOW the names of the columns of the table in question. The second is that this is SQL Server. Oracle and MySQL have ways of doing this, but I don't know the syntax for them.
Anyway, what I'd do is perform an UNPIVOT on the data.
There are a lot of parens there, so to explain: the actual UNPIVOT statement (aliased as unpvt) takes the data and twists the columns into rows, and the SELECT associated with it provides the data that is being returned. Here I used the 'Name', and placed the column names under the 'Subs' column and the corresponding value into the 'Val' column. To be precise, I'm talking about this part of the code:
SELECT [Name], [Subs], [Val]
FROM
(SELECT [Name], [Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6]
FROM pvt) p
UNPIVOT
([Val] FOR [Subs] IN
([Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6])
)AS unpvt
My next step was to make that a 'sub-select' where I could find the specific name and val that was being hunted for. That would leave you with a SQL Statement that looks something along these lines
SELECT [Name], [Subs], [Val]
FROM (
SELECT [Name], [Subs], [Val]
FROM
(SELECT [Name], [Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6]
FROM pvt) p
UNPIVOT
([Val] FOR [Subs] IN
([Sub-1], [Sub-2], [Sub-3], [Sub-4], [Sub-5], [Sub-6])
)AS unpvt
) AS pp
WHERE 1 = 1
AND pp.[Val] = 'Y'
AND pp.[Name] = 'Tom'
select col from (
select col,
case s.col
when 'sub-1' then sub-1
when 'sub-2' then sub-2
when 'sub-3' then sub-3
when 'sub-4' then sub-4
when 'sub-5' then sub-5
when 'sub-6' then sub-6
end AS val
from mytable
cross join
(
select 'sub-1' AS col union all
select 'sub-2' union all
select 'sub-3' union all
select 'sub-4' union all
select 'sub-5' union all
select 'sub-6'
) s on name="Tom"
) s
where val ='Y'
I included the join condition as
on name="Tom"

SELECT DISTINCT for data groups

I have the following table:
ID Data
1 A
2 A
2 B
3 A
3 B
4 C
5 D
6 A
6 B
etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:
DataID Data
101 A
102 A
102 B
103 C
104 D
So DataID 102 would represent data (A,B), DataID 103 would represent data (C), etc. I want this in order to be able to rewrite my original table in this form:
ID DataID
1 101
2 102
3 102
4 103
5 104
6 102
How can I do that?
PS. Code to generate the first table:
CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'
In my opinion, you have to create a custom aggregate that concatenates data (for strings, the CLR approach is recommended for performance reasons).
Then I would group by ID and select distinct from the grouping, adding a row_number() or dense_rank() function, your choice. Anyway, it should look like this:
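with groupings as (
-- concat() stands for the custom concatenating aggregate described above; it is not a built-in
select distinct concat(data) as groups
from Table1
group by ID
)
select groups, row_number() over (order by groups) as DataID from groupings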
The following query using CASE will give you the result shown below.
From there on, getting the distinct datagroups and proceeding further should not really be a problem.
SELECT
id,
MAX(CASE data WHEN 'A' THEN data ELSE '' END) +
MAX(CASE data WHEN 'B' THEN data ELSE '' END) +
MAX(CASE data WHEN 'C' THEN data ELSE '' END) +
MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM t1
GROUP BY id
ID DataGroups
1 A
2 AB
3 AB
4 C
5 D
6 AB
However, this kind of logic will only work if the "Data" values are both fixed and known beforehand.
In your case, you do say that is the case. However, considering that you also say that there are 1000 of them, this will frankly be a ridiculous-looking query for sure :-)
LuckyLuke's suggestion above would, frankly, be the more generic and probably saner way to go about implementing the solution in your case.
From your sample data (having added the missing 2,'A' tuple), the following gives the renumbered (and uniquified) data:
with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
select ID,data
from #t1 where id in (select id from NonDups)
), Renumbered as (
select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered
Giving:
1 A
2 A
2 B
3 C
4 D
I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.
Just to share my own dirty solution that I'm using for the moment:
SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY (
SELECT CAST(Data AS VARCHAR) + ','
FROM #t1 t2
WHERE t2.id = t1.id
ORDER BY Data ASC
FOR XML PATH('') )
D ( Data )
And then proceeding analogously to LuckyLuke's solution.
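The idea shared by these answers (build one canonical value per ID from its data, then number the distinct values) can be sketched in a few lines of Python. This is an illustration only, not any answerer's code: the function name number_data_groups is made up, and numbering the groups from 101 in order of first appearance by ID is an assumption taken from the example output in the question.

```python
from collections import defaultdict


def number_data_groups(rows, first_data_id=101):
    """Map each distinct set of data values to a DataID, and each id to its DataID."""
    # Collect the data values belonging to each id.
    members = defaultdict(set)
    for row_id, data in rows:
        members[row_id].add(data)

    data_ids = {}        # canonical group (sorted tuple) -> DataID
    id_to_data_id = {}   # original id -> DataID
    for row_id in sorted(members):
        # Sorting makes (A, B) and (B, A) the same group.
        key = tuple(sorted(members[row_id]))
        if key not in data_ids:
            data_ids[key] = first_data_id + len(data_ids)
        id_to_data_id[row_id] = data_ids[key]
    return data_ids, id_to_data_id


rows = [(1, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "B"),
        (4, "C"), (5, "D"), (6, "A"), (6, "B")]
data_ids, id_to_data_id = number_data_groups(rows)
print(data_ids)
print(id_to_data_id)
```

In SQL, the sorted tuple corresponds to the ordered, concatenated string produced by the custom aggregate or the FOR XML PATH trick, and the numbering corresponds to DENSE_RANK() over that string.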