SQL: The multi-part identifier could not be bound with OPENJSON

I'm trying to get objects from JSON with this query:
SELECT
co.contract_number
, co.objectId id1
, cbs.id id2
, co.summary
FROM (
SELECT
c.contract_number
, cb.summary
, cbo.id objectId
FROM
pas.contract C
CROSS APPLY
OPENJSON(c.common_body, '$')
WITH (
summary NVARCHAR(MAX) '$.summary' AS JSON
, objects NVARCHAR(MAX) '$.objects' AS JSON
) cb
CROSS APPLY OPENJSON(cb.objects, '$')
WITH (
id UNIQUEIDENTIFIER '$.id'
) cbo
) co
CROSS APPLY OPENJSON(co.summary, '$.insuredObjects')
WITH (
id UNIQUEIDENTIFIER '$.objectId'
) cbs
But here's the problem: the rows are duplicated (2 objects from cb.objects x 2 objects from co.summary.insuredObjects):
contract_number  id1  id2
---------------  ---  ---
2200001459       1    1
2200001459       1    2
2200001459       2    1
2200001459       2    2
Expected result (the objects are matched to each other 1 to 1):
contract_number  id1  id2
---------------  ---  ---
2200001459       1    1
2200001459       2    2
So I replaced CROSS APPLY with LEFT JOIN:
...
LEFT JOIN OPENJSON(co.summary, '$.insuredObjects')
WITH (
id UNIQUEIDENTIFIER '$.objectId'
) cbs ON cbs.id = co.objectId
But this query causes an error:
The multi-part identifier "co.summary" could not be bound.
Is there a way to get the expected result without errors?

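The table source of a LEFT JOIN can't refer to columns from the tables on its left (here co.summary); only APPLY is allowed to do that, which is why the identifier can't be bound. That's not going to work unless you restructure it with a subquery or a CTE. If you're just looking for the rows where the two IDs are equal, the simplest thing would be to keep the CROSS APPLY and add a WHERE clause to get rid of the duplicates.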
WHERE co.objectId = cbs.id
Also, I don't know the structure of the JSON, but the nested OPENJSON calls seem unnecessary. You can factor that out.
SELECT c.contract_number, o.id AS id1, cb.objectId AS id2
FROM pas.contract c
CROSS APPLY OPENJSON(c.common_body, '$.summary.insuredObjects')
WITH (
objectId int '$.objectId'
) cb
CROSS APPLY
OPENJSON(c.common_body, '$.objects')
WITH (
id int '$.id'
) o
WHERE cb.objectId = o.id
I wouldn't say I'm an OPENJSON expert, so I imagine there may even be a way to get it down to one call. Good luck.
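For what it's worth, if you really want LEFT JOIN semantics (keep an object even when no insured object matches it), OUTER APPLY with the filter inside the applied subquery is the usual substitute, because APPLY may reference columns from the left side. A minimal sketch, assuming the JSON looks roughly like the sample below and using int ids as in the query above; the table variable and its contents are made up for illustration, and OPENJSON needs compatibility level 130 or higher:

```sql
-- Hypothetical stand-in for pas.contract with an assumed JSON shape.
DECLARE @contract TABLE (contract_number varchar(20), common_body nvarchar(max));
INSERT INTO @contract (contract_number, common_body) VALUES
    ('2200001459',
     N'{"summary":{"insuredObjects":[{"objectId":1},{"objectId":2}]},"objects":[{"id":1},{"id":2}]}');

SELECT c.contract_number, o.id AS id1, s.objectId AS id2
FROM @contract AS c
CROSS APPLY OPENJSON(c.common_body, '$.objects')
     WITH (id int '$.id') AS o
-- OUTER APPLY keeps the row from "o" even when the subquery returns nothing,
-- which is the behaviour a LEFT JOIN ... ON would have given.
OUTER APPLY (
    SELECT i.objectId
    FROM OPENJSON(c.common_body, '$.summary.insuredObjects')
         WITH (objectId int '$.objectId') AS i
    WHERE i.objectId = o.id
) AS s;
```

With the sample document this should return one row per object, each paired with its matching insured object; an object without a match would come back with id2 as NULL.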

Related

SQL Algorithm for grouping all "equivalent" strings

I have an input table with two columns, each holding a string representing an id_number, and these columns are called id1 and id2. The two id_numbers that appear as a pair in any given row are defined as being equivalent to each other. If one of those id_numbers also appears in another row, then all the strings in both rows are equivalent to each other, etc. The goal is to return a table with two columns, one containing all unique id_numbers and another identifying their grouping by equivalency.
Sample Data:
create table input_table (
id1 varchar(100),
id2 varchar(100)
)
insert into input_table(id1,id2)
values
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m')
Expected Output:
| Id | Grouping |
| a | 1 |
| b | 1 |
| c | 1 |
| d | 1 |
| f | 2 |
| g | 2 |
| k | 2 |
| l | 3 |
| m | 3 |
To further explain the results:
Row 1 tells us a=b so they are assigned to group 1
Row 2 tells us b=c, since b is already in group 1, c is also assigned to group 1
Row 3 tells us d=a, since a is already in group 1, d is also assigned to group 1
Row 4 tells us a=b, which we already know so we don't need to do anything
Row 5 tells us f=g, since neither are in an existing group, we assign them to group 2
etc.
Your table describes an undirected graph, where the set (distinct values) of all Id1 and Id2 values over all tuples ("rows") represents the nodes and the tuples themselves represent the edges (as they connect, or link, these Id values in a relationship). That means we can apply graph-theory techniques to solve your problem, which can be stated as finding the disconnected components within your graph and assigning each of them an identifier (as each disconnected component represents one set of connected nodes).
...with that formalism out of the way, I'll say I don't think it's necessarily correct to use the word "equivalent" or "equivalence": just because a binary relation (each of your tuples) is transitive (i.e. nondirected node edges are a transitive relation) it says nothing about what the nodes themselves represent, so I'm guessing you meant to say the Id1/Id2 values represent things-that-are-equivalent in your problem-domain, which is fair, but in the context of my answer I'll refrain from using that term (not least because "set equivalence" is something else entirely)
ANYWAY...
Step 1: Normalize your bad data:
Assuming that those Id1/Id2 values are comparable values (as in: every value can be sorted consistently and deterministically by a less-than comparison), such as strings or integers, then do that first so we can generate a normalized representation of your data.
This normalized representation of your data would be the same as your current data, except that:
Every row's Id1 value must be < (i.e. less-than) its Id2 value.
There are no duplicate rows.
So if there's a row where Id2 < Id1 you should swap the values, so ( 'd', 'a' ) becomes ( 'a', 'd' ), and the duplicate ('a', 'b') row would be removed.
This also means that every (disconnected) component in the graph we're representing now has a minimum node which we can treat as the (arbitrary) "root" of that component - which means we will have a target to look for from any node (but we don't know which values are minimums in each component yet, hang on...)
To aid identification, let's call the smaller value in a pair the Smol value and the larger one the Big value.
(You can do this step as another CTE step, but for clarity I'm using a separate table, #normalized, like so):
CREATE TABLE #horribleInputData (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO #horribleInputData ( Id1, Id2 ) VALUES
( 'a', 'b' ),
( 'b', 'c' ),
( 'd', 'a' ),
( 'a', 'b' ),
( 'f', 'g' ),
( 'f', 'k' ),
( 'l', 'm' );
--
CREATE TABLE #normalized (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO #normalized ( Smol, Big )
SELECT
DISTINCT -- Exclude duplicate rows.
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol, -- Sort the values horizontally, not vertically.
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
#horribleInputData
WHERE
Id1 <> Id2; -- Exclude zero-information rows.
After running the above, the #normalized table looks like this:
Smol  Big
----  ---
a     b
a     d
b     c
f     g
f     k
l     m
Step 2: Designate the minimum node as root in each disconnected component:
If we treat the node with the minimum value in each disconnected component as the "root" of a directed graph, then we can find connected nodes by finding a path from each node to that root. But what else defines a root? In this case, it is a node with no incoming Smol-to-Big reference, i.e. a Smol value that never appears as a Big value. So we can simply take #normalized and do an anti-join to itself: for every Smol, see if that Smol is a Big to any other Smol; if none exists then that Smol is the Smollest, and is therefore a root:
SELECT
DISTINCT
l.Smol AS Smollest
FROM
#normalized AS l
LEFT OUTER JOIN #normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL;
Which gives us this output:
Smollest
a
f
l
...which tells us that there's 3 disconnected-components in this graph, and what those minimum nodes are.
Step 3: the srsbsns part
Which means we can now try to trace a route from each node to a root using a recursive CTE, which is a technique in SQL to traverse hierarchical directed graphs (either from top-to-bottom, or bottom-to-top), so it's a good thing we converted the data into a directed graph first.
For example:
SQL Server CTE and recursion example
SQL Server 2012 CTE Find Root or Top Parent of Hierarchical Data
Identifying equivalent sets in SQL Server
Like so:
-- This `smolRoots` CTE is the same as the query from Step 2 above:
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
-- This is an anti-join (a LEFT OUTER JOIN combined with a IS NULL predicate):
#normalized AS l
LEFT OUTER JOIN #normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
),
-- Generate a simple flat list of every Id1/Id2 value:
everyNode AS (
SELECT DISTINCT Smol AS Id FROM #normalized
UNION
SELECT DISTINCT Big AS Id FROM #normalized
),
-- Now do the tree-walk, it's like a nature-walk but more math-y:
recursiveCte AS (
-- Each root (Smol) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smollest AS Big,
s.Smollest AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then recursively UNION ALL (concatenate) all other rows that can be connected by their Smol-to-Big values:
SELECT
f.Smol,
f.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
#normalized AS f
INNER JOIN recursiveCte AS r ON f.Smol = r.Big
)
The above recursiveCte, if evaluated as-is, will return this output:
Smol  Big  SmolRoot  Depth
----  ---  --------  -----
NULL  a    a         0
NULL  f    f         0
NULL  l    l         0
l     m    l         1
f     g    f         1
f     k    f         1
a     b    a         1
a     d    a         1
b     c    a         2
Notice how for each row, each Smol and Big value is mapped to one of the 3 SmolRoot values identified in Step 2.
...so I hope you can now start to see how this works. But we're not done yet...
Step 3.2: I lied about Step 3 being Step 3, it's really Step 3.1:
We still need to query the recursiveCte's results to convert those SmolRoot values into unique identifiers for each disconnected component (i.e. each set), so let's generate a new 1-based int value for each distinct value in SmolRoot. We can do this with ROW_NUMBER(), but you could also use GENERATE_SERIES, and I'm sure other techniques exist to do this too:
WITH smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
)
And numberForEachRoot looks like this:
SmolRoot  DisconnectedComponentNumber
--------  ---------------------------
a         1
f         2
l         3
Step 5: There is no Step 4
Now ignore all of the above and start over with that everyNode CTE that was tucked away in Step 3: take the everyNode CTE and JOIN it to recursiveCte and numberForEachRoot to get the actual output you're after:
WITH
smolRoots AS ( /* same as above */ ),
everyNode AS ( /* same as above */ ),
recursiveCte AS ( /* same as above */ ),
numberForEachRoot AS ( /* same as above */ )
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
Which gives us...
Id  DisconnectedComponentNumber
--  ---------------------------
a   1
b   1
c   1
d   1
f   2
g   2
k   2
l   3
m   3
This technique also works on graphs with cycles, though you might get duplicate output rows in that case, but adjusting the above query to filter those out is a trivial exercise left to the reader.
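One caveat worth spelling out: the root-finding step assumes each component has exactly one Smol value that never appears as a Big. With pairs such as ('a','c') and ('b','c'), both a and b qualify as roots, so c would land in two groups. A rough alternative sketch that does not depend on that assumption walks the edges in both directions, tracks the visited path to survive cycles, and labels each node with the smallest node it can reach. The names (@input, edges, walk) are mine, ids are assumed not to contain '/', '%' or '_', and since it enumerates paths it is only meant for small inputs (the default MAXRECURSION of 100 also caps the path length):

```sql
DECLARE @input TABLE (id1 varchar(100) NOT NULL, id2 varchar(100) NOT NULL);
INSERT INTO @input (id1, id2) VALUES
    ('a','b'), ('b','c'), ('d','a'), ('a','b'), ('f','g'), ('f','k'), ('l','m');

WITH edges AS (
    -- Store every pair in both directions so the graph is undirected.
    SELECT id1 AS src, id2 AS dst FROM @input
    UNION
    SELECT id2, id1 FROM @input
),
walk AS (
    -- Anchor: every node can reach itself.
    SELECT DISTINCT
        src AS start_id,
        src AS node,
        CAST('/' + src + '/' AS varchar(8000)) AS visited
    FROM edges
    UNION ALL
    -- Step: follow an edge to a node that is not yet on this path.
    SELECT
        w.start_id,
        e.dst,
        CAST(w.visited + e.dst + '/' AS varchar(8000))
    FROM walk AS w
    INNER JOIN edges AS e ON e.src = w.node
    WHERE w.visited NOT LIKE '%/' + e.dst + '/%'
)
SELECT
    start_id AS Id,
    -- The smallest reachable node identifies the component.
    DENSE_RANK() OVER ( ORDER BY MIN(node) ) AS [Grouping]
FROM walk
GROUP BY start_id
ORDER BY Id;
```

For the sample data this should put a, b, c, d in group 1, f, g, k in group 2, and l, m in group 3. For anything large, an iterative "propagate the minimum label until nothing changes" loop over a temp table would probably scale better than path enumeration.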
Also it seems I spent too much time on this answer, that means the Vyvanse is working tonight. And yup, it's just gone 4am where I am right now, good job, me. Now where's my Ambien gone?
Step ∞: Just give me the solution I can copy-and-paste into my CS homework and/or Nissan car firmware
Alright, you asked for it...
CREATE TABLE #horribleInputData (
Id1 char(1) NOT NULL,
Id2 char(1) NOT NULL
);
INSERT INTO #horribleInputData ( Id1, Id2 ) VALUES
('a','b'),
('b','c'),
('d','a'),
('a','b'),
('f','g'),
('f','k'),
('l','m'),
-- Also adding this to show that it works for cycles in graphs too:
( '0', '1' ),
( '1', '2' ),
( '2', '3' ),
( '3', '0' );
-------------
-- 1. Normalize to a table form that's easier to work with: given that transitivity exists, enforce `Id1 < Id2` so we can have a "direction" to look in.
-- This can be a CTE step instead of a TABLE too, if you dare. I'm curious what the execution plan would be in that case.
CREATE TABLE #normalized (
Smol char(1) NOT NULL,
Big char(1) NOT NULL,
INDEX IDX1 ( Smol ),
INDEX IDX2 ( Big ),
CHECK ( Smol < Big )
);
INSERT INTO #normalized ( Smol, Big )
SELECT
DISTINCT
CASE WHEN Id1 < Id2 THEN Id1 ELSE Id2 END AS Smol,
CASE WHEN Id1 < Id2 THEN Id2 ELSE Id1 END AS Big
FROM
#horribleInputData
WHERE
Id1 <> Id2 -- Exclude zero-information rows.
ORDER BY
Smol, -- Make it easier to read, interactively.
Big;
/*
Smol Big
----------
a b
a d
b c
f g
f k
l m
*/
-- Also, just gonna bury this in here and see what happens:
DECLARE @superImportantPart nvarchar(300) = CONVERT( nvarchar(300), 0x4800450059002000450056004500520059004F004E00450020004900200043004F0050005900200041004E004400200050004100530054004500200043004F00440045002000460052004F004D00200053005400410043004B004F0056004500520046004C004F005700200057004900540048004F0055005400200055004E004400450052005300540041004E00440049004E00470020005700480041005400200049005400200044004F004500530020004F005200200048004F005700200049005400200057004F0052004B00530020004F005400480045005200570049005300450020004900200057004F0055004C004400200048004100560045002000520045004D004F005600450044002000540048004900530020005000520049004E0054002000530054004100540045004D0045004E005400 );
RAISERROR( @superImportantPart, /*severity:*/ 0, /*state:*/ 1 ) WITH NOWAIT;
-- Then trace a route from every Big to its smallest connected Smol.
-- Each Big sharing the same Smol is in the same connected graph, the set of distinct Smol nodes identifies each output set.
-- 1. Get all roots first: these will be `Smol` nodes that never appear in `Big`.
WITH smolRoots AS (
SELECT
DISTINCT
l.Smol AS Smollest
FROM
#normalized AS l
LEFT OUTER JOIN #normalized AS r ON l.Smol = r.Big
WHERE
r.Big IS NULL
/*
Smollest
-----
a
f
l
*/
),
everyNode AS (
SELECT DISTINCT Smol AS Id FROM #normalized
UNION
SELECT DISTINCT Big AS Id FROM #normalized
),
-- The tree-walk:
recursiveCte AS (
-- Each root (Smol) value:
SELECT
CONVERT( char(1), NULL ) AS Smol,
s.Smollest AS Big,
s.Smollest AS SmolRoot,
0 AS Depth
FROM
smolRoots AS s
UNION ALL
-- Then the magic happens...
SELECT
n.Smol,
n.Big,
r.SmolRoot,
r.Depth + 1 AS Depth
FROM
#normalized AS n
INNER JOIN recursiveCte AS r ON n.Smol = r.Big
/*
Smol Big SmolRoot Depth
-----------------------------
NULL a a 0
NULL f f 0
NULL l l 0
l m l 1
f g f 1
f k f 1
a b a 1
a d a 1
b c a 2
*/
),
numberForEachRoot AS ( -- Generate distinct numbers (from 1, 2, ...) for each SmolRoot number, i.e. a number for each disconnected-component or disjoint graph:
SELECT
SmolRoot,
ROW_NUMBER() OVER ( ORDER BY SmolRoot ) AS DisconnectedComponentNumber
FROM
recursiveCte AS r
GROUP BY
r.SmolRoot
/*
SmolRoot DisconnectedComponentNumber
-------------------
a 1
f 2
l 3
*/
)
-- Then ignore all of the above and start with `everyNode` and JOIN it to `recursiveCte` and `numberForEachRoot`:
SELECT
e.Id,
n.DisconnectedComponentNumber
FROM
everyNode AS e
INNER JOIN recursiveCte AS r ON e.Id = r.Big
INNER JOIN numberForEachRoot AS n ON r.SmolRoot = n.SmolRoot
ORDER BY
e.Id;
/*
Id DisconnectedComponentNumber
a 1
f 2
l 3
m 3
g 2
k 2
b 1
d 1
c 1
*/

Is there a method to simply transpose a table in SQL. This table contains Numeric and Varchar values

I would like to know how to transpose very simply a table in SQL. There is no sum or calculations to do.
This table contains Numeric and Varchar values.
Meaning, I have a table of 2 rows x 195 columns. I would like to have the same table with 195 rows x 2 columns (maybe 3 columns)
time_index  legal_entity_code  cohort  ...  ...
----------  -----------------  ------  ---  ---
0           AAA                50      ...  ...
1           BBB                55      ...  ...
TO
Element            time_index_0  time_index_1
-----------------  ------------  ------------
legal_entity_code  AAA           BBB
cohort             50            55
...                ...           ...
...                ...           ...
I have created this piece of code for testing
SELECT time_index, ValueT, FieldName
FROM (select legal_entity_code, cohort, time_index from ifrs17.output_bba where id in (1349392,1349034)) as T
UNPIVOT
(
ValueT
FOR FieldName in ([legal_entity_code],[cohort])
) as P
but I receive this error message :
The type of column "cohort" conflicts with the type of other columns specified in the UNPIVOT list.
I would recommend using apply for this. I don't fully follow the specified results because the query and the sample data are inconsistent in their naming.
I'm pretty sure you want:
select o.time_index, v.*
from ifrs17.output_bba o cross apply
(values ('Name1', o.name1),
('Value1', convert(varchar(max), o.value1)),
('Name2', o.name2)
) v(name, value)
where o.id in (1349392,1349034);
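Applied to the columns named in the question, that might look like the sketch below. The table variable, its column types, and the two sample rows are assumptions based on the question's sample; the point is that every value is converted to one common type, which is exactly what UNPIVOT was complaining about. The conditional aggregation then turns the two time_index values into columns.

```sql
-- Hypothetical stand-in for ifrs17.output_bba with assumed column types.
DECLARE @output_bba TABLE (
    id int NOT NULL,
    time_index int NOT NULL,
    legal_entity_code varchar(10) NOT NULL,
    cohort int NOT NULL
);
INSERT INTO @output_bba (id, time_index, legal_entity_code, cohort) VALUES
    (1349392, 0, 'AAA', 50),
    (1349034, 1, 'BBB', 55);

SELECT
    v.Element,
    MAX(CASE WHEN o.time_index = 0 THEN v.Val END) AS time_index_0,
    MAX(CASE WHEN o.time_index = 1 THEN v.Val END) AS time_index_1
FROM @output_bba AS o
CROSS APPLY (VALUES
    -- One row per column to unpivot, all converted to the same type.
    ('legal_entity_code', CONVERT(varchar(100), o.legal_entity_code)),
    ('cohort',            CONVERT(varchar(100), o.cohort))
) AS v (Element, Val)
WHERE o.id IN (1349392, 1349034)
GROUP BY v.Element;
```

With 195 columns the VALUES list gets long, but it can be generated once from sys.columns and pasted in.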
Gordon's approach is correct and certainly more performant. +1
However, if you want to dynamically unpivot 195 columns without having to list them all, consider the following:
Note: if not 2016+ ... there is a similar XML approach.
Example or dbFiddle
Select Element = [Key]
,Time_Index_0 = max(case when time_index=0 then value end)
,Time_Index_1 = max(case when time_index=1 then value end)
From (
Select [time_index]
,B.*
From YourTable A
Cross Apply (
Select [Key]
,Value
From OpenJson( (Select A.* For JSON Path,Without_Array_Wrapper ) )
Where [Key] not in ('time_index')
) B
) A
Group By [Key]
Returns
Element Time_Index_0 Time_Index_1
cohort 50 55
legal_entity_code AAA BBB

How do I combine multiple parent-child relationships with different lengths using T-SQL?

Summary
In an Azure database (using SQL Server Management Studio 17, so T-SQL) I seek to concatenate multiple parent-child relationships of different lengths.
Base Table
My table is of this form:
ID parent
1 2
2 NULL
3 2
4 3
5 NULL
Feel free to use this code to generate and fill it:
CREATE TABLE #t (
ID int,
parent int
)
INSERT #t VALUES
( 1, 2 ),
( 2, NULL ),
( 3, 2 ),
( 4, 3 ),
( 5, NULL )
Issue
How do I receive a table with the path concatenation as shown in the following table?
ID path parentcount
1 2->1 1
2 2 0
3 2->3 1
4 2->3->4 2
5 5 0
Detail
The real table has many more rows and the longest path should contain ~15 IDs. So it would be ideal to find a solution that is dynamic in the aspect of parent count definition.
Also: I do not necessarily need the column 'parentcount', so feel free to skip that in answers.
select @@version:
Microsoft SQL Azure (RTM) - 12.0.2000.8
You can use a recursive CTE for this:
with cte as (
select id, parent, convert(varchar(max), concat(id, '')) as path, 0 as parentcount
from #t t
union all
select cte.id, t.parent, convert(varchar(max), concat(t.id, '->', path)), parentcount + 1
from cte join
#t t
on cte.parent = t.id
)
select top (1) with ties *
from cte
order by row_number() over (partition by id order by parentcount desc);
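The query above walks upward from every row and then keeps the longest path per id. As a variation, here is a sketch that starts at the rows without a parent and walks downward, so each row is produced exactly once and no TOP WITH TIES step is needed. It uses the sample data from the question in a table variable of my own naming, and assumes the data contains no cycles:

```sql
DECLARE @t TABLE (ID int NOT NULL, parent int NULL);
INSERT INTO @t (ID, parent) VALUES
    (1, 2), (2, NULL), (3, 2), (4, 3), (5, NULL);

WITH cte AS (
    -- Anchor: rows without a parent are the roots; their path is just their own ID.
    SELECT ID, CONVERT(varchar(max), ID) AS path, 0 AS parentcount
    FROM @t
    WHERE parent IS NULL
    UNION ALL
    -- Step: append each child to the path of its parent.
    SELECT t.ID,
           CONVERT(varchar(max), CONCAT(cte.path, '->', t.ID)),
           cte.parentcount + 1
    FROM @t AS t
    INNER JOIN cte ON t.parent = cte.ID
)
SELECT ID, path, parentcount
FROM cte
ORDER BY ID;
```

For the five sample rows this should reproduce the table in the question (2->1, 2, 2->3, 2->3->4, 5). Paths of about 15 IDs are well within the default recursion limit of 100.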
Clearly Gordon nailed it with a recursive CTE, but here is another option using the HierarchyID data type.
Example
Create Table #YourTable ([ID] int,[parent] int)
Insert Into #YourTable Values
(1,2)
,(2,NULL)
,(3,2)
,(4,3)
,(5,NULL)
;with cteP as (
Select ID
,Parent
,HierID = convert(hierarchyid,concat('/',ID,'/'))
From #YourTable
Where Parent is Null
Union All
Select ID = r.ID
,Parent = r.Parent
,HierID = convert(hierarchyid,concat(p.HierID.ToString(),r.ID,'/'))
From #YourTable r
Join cteP p on r.Parent = p.ID
)
Select ID
,Parent
,[Path] = HierID.GetDescendant ( null , null ).ToString()
,ParentCount = HierID.GetLevel() - 1
From cteP A
Order By A.HierID

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
CREATE TABLE #values (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM #values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM #values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from #values v
group by v.GroupID, v.SubSet
)
delete v
from #values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from #values
From Checksum_Agg:
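The CHECKSUM_AGG result does not depend on the order of the rows in the table.
This is because it combines the values with an order-independent operation (it appears to behave like a bitwise XOR rather than a sum), so 1, 2, 3 and 3, 2, 1 give the same result. The downside is that quite different sets of values can also produce the same checksum.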
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
CREATE TABLE #values (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO #values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM #values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from #Values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from #Values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from #values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from #values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a performance quirk. If you find that the query engine is running 'maybeBadSubsets' for each row of #values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
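If the false positive rate is a concern, the comparison can also be done exactly, without any hashing: two subsets are identical when they have the same number of rows and every value of one is found in the other. A sketch of that idea follows, assuming (GroupID, SubSet, Value) is unique and using a table variable with the sample data; the CTE names are mine, and I haven't measured how it behaves on 10+ million rows (an index on GroupID, SubSet, Value would likely matter a lot). A checksum could still be added to the join as a cheap pre-filter.

```sql
DECLARE @values TABLE (GroupID int NOT NULL, SubSet varchar(1) NOT NULL, [Value] int NOT NULL);
INSERT INTO @values (GroupID, SubSet, [Value]) VALUES
    (1,'a',1),(1,'a',2),(1,'a',3), (1,'b',1),(1,'b',3),(1,'b',5), (1,'c',1),(1,'c',3),(1,'c',5),
    (2,'a',1),(2,'a',2),(2,'a',3), (2,'b',2),(2,'b',4),(2,'b',6), (2,'c',1),(2,'c',3),(2,'c',6);

WITH counts AS (
    SELECT GroupID, SubSet, COUNT(*) AS cnt
    FROM @values
    GROUP BY GroupID, SubSet
),
dupes AS (
    -- c2 duplicates c1 when both have the same size and every value matches.
    SELECT c2.GroupID, c2.SubSet
    FROM counts AS c1
    INNER JOIN counts AS c2
        ON  c2.GroupID = c1.GroupID
        AND c2.SubSet  > c1.SubSet      -- keep the lowest SubSet of each duplicate set
        AND c2.cnt     = c1.cnt
    INNER JOIN @values AS v1
        ON  v1.GroupID = c1.GroupID AND v1.SubSet = c1.SubSet
    INNER JOIN @values AS v2
        ON  v2.GroupID = c2.GroupID AND v2.SubSet = c2.SubSet
        AND v2.[Value] = v1.[Value]
    GROUP BY c1.GroupID, c1.SubSet, c2.GroupID, c2.SubSet, c1.cnt
    HAVING COUNT(*) = c1.cnt            -- all values matched
)
DELETE v
FROM @values AS v
WHERE EXISTS (
    SELECT 1
    FROM dupes AS d
    WHERE d.GroupID = v.GroupID AND d.SubSet = v.SubSet
);

SELECT * FROM @values ORDER BY GroupID, SubSet, [Value];
```

On the sample data this should remove only subset 'c' of GroupID 1 and leave GroupID 2 untouched.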

Postgres self-join recursive CTE ancestry chain

I have a pilates_bill table representing direct ancestry (not a tree structure)
bill_id (pk) | previous_bill_id (self-join fk)
=============+================================
1            | 2
2            | 3
3            | 4
4            | 5
5            | NULL
Need to produce a list (parent / grandparent / etc) of all ancestors for any given row (below examples start with 1).
Obtaining a list of bill_ids with ancestor chain using recursive CTE
WITH RECURSIVE chain(from_id, to_id) AS (
SELECT NULL::integer, 1 -- starting id
UNION
SELECT c.to_id, pilates_bill.previous_bill_id
FROM chain c
LEFT OUTER JOIN pilates_bill ON (pilates_bill.bill_id = to_id)
WHERE c.to_id IS NOT NULL
)
SELECT from_id FROM chain WHERE from_id IS NOT NULL;
Result 1,2,3,4,5 as expected
But now when I try to produce table rows in order of ancestry the result is broken
SELECT * FROM pilates_bill WHERE bill_id IN
(
WITH RECURSIVE chain(from_id, to_id) AS (
SELECT NULL::integer, 1
UNION
SELECT c.to_id, pilates_bill.previous_bill_id
FROM chain c
LEFT OUTER JOIN pilates_bill ON (pilates_bill.bill_id = to_id)
WHERE c.to_id IS NOT NULL
)
SELECT from_id FROM chain WHERE from_id IS NOT NULL
)
Row order is 5,1,2,3,4
What am I doing wrong here?
The rows returned by a SQL query come back in no guaranteed order unless you specify an ORDER BY.
You can calculate depth by keeping track of it in the recursive CTE:
WITH RECURSIVE chain(from_id, to_id, depth) AS
(
SELECT NULL::integer
, 1
, 1
UNION
SELECT c.to_id
, pb.previous_bill_id
, depth + 1
FROM chain c
LEFT JOIN
pilates_bill pb
ON pb.bill_id = c.to_id
WHERE c.to_id IS NOT NULL
)
SELECT *
FROM chain
ORDER BY
depth
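To get the actual table rows back in ancestry order, the chain can be joined to pilates_bill and sorted by that depth. A minimal sketch for PostgreSQL, with a temp table standing in for the real one; the sample rows, including a (4, 5) row, are an assumption based on the results described in the question, and the data is assumed to contain no cycles (UNION ALL would loop forever otherwise):

```sql
CREATE TEMP TABLE pilates_bill (
    bill_id          integer PRIMARY KEY,
    previous_bill_id integer
);
INSERT INTO pilates_bill (bill_id, previous_bill_id) VALUES
    (1, 2), (2, 3), (3, 4), (4, 5), (5, NULL);

WITH RECURSIVE chain(bill_id, depth) AS (
    SELECT 1, 1                      -- starting id
    UNION ALL
    -- Step from the current bill to the one it points at.
    SELECT pb.previous_bill_id, c.depth + 1
    FROM chain c
    JOIN pilates_bill pb ON pb.bill_id = c.bill_id
    WHERE pb.previous_bill_id IS NOT NULL
)
SELECT pb.*
FROM chain c
JOIN pilates_bill pb ON pb.bill_id = c.bill_id
ORDER BY c.depth;
```

The ORDER BY has to be on the outermost query; an IN (...) subquery only decides which rows qualify, never the order they come back in.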