Find overlapping sets of data in a table


I need to identify duplicate sets of data and give the sets whose data matches the same group id.
id threshold cost
-- ---------- ----------
1 0 9
1 100 7
1 500 6
2 0 9
2 100 7
2 500 6
I have thousands of these sets; most are the same, just with different ids. I need to find all the like sets that have the same thresholds and cost amounts and give them a group id. I'm just not sure where to begin. Is the best way to iterate and insert each set into a table, and then iterate through each set in the table to find what already exists?

This is one of those cases where you can try to do something with relational operators. Or, you can just say: "let's put all the information in a string and use that as the group id". SQL Server seems to discourage this approach, but it is possible. So, let's characterize the groups using:
select d.id,
       (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
        from data d2
        where d2.id = d.id
        order by threshold  -- ORDER BY has to come before FOR XML
        for xml path ('')
       ) as groupname
from data d
group by d.id;
Oh, I think that solves your problem. The groupname can serve as the group id. If you want a numeric id (which is probably a good idea), use dense_rank():
select d.id, dense_rank() over (order by groupname) as groupid
from (select d.id,
             (select cast(threshold as varchar(8000)) + '-' + cast(cost as varchar(8000)) + ';'
              from data d2
              where d2.id = d.id
              order by threshold  -- ORDER BY has to come before FOR XML
              for xml path ('')
             ) as groupname
      from data d
      group by d.id
     ) d;
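The same idea can be checked outside the database. The following is a minimal Python sketch, not part of the original answer: it builds one canonical string per id from the sorted (threshold, cost) pairs and then numbers the distinct strings the way dense_rank() would. The function name assign_group_ids and the extra set with id 3 are illustrative assumptions.

```python
from collections import defaultdict

def assign_group_ids(rows):
    """Map each id to a dense group number; ids whose (threshold, cost)
    sets are identical share a number."""
    pairs_by_id = defaultdict(list)
    for set_id, threshold, cost in rows:
        pairs_by_id[set_id].append((threshold, cost))

    # Canonical signature: pairs sorted by threshold (then cost), joined like the SQL string.
    signature = {
        set_id: "".join(f"{t}-{c};" for t, c in sorted(pairs))
        for set_id, pairs in pairs_by_id.items()
    }

    # Dense rank over the distinct signatures, ordered as plain strings.
    rank = {sig: n for n, sig in enumerate(sorted(set(signature.values())), start=1)}
    return {set_id: rank[sig] for set_id, sig in signature.items()}

rows = [
    (1, 0, 9), (1, 100, 7), (1, 500, 6),
    (2, 0, 9), (2, 100, 7), (2, 500, 6),
    (3, 1, 9), (3, 100, 7), (3, 500, 6),  # hypothetical extra set that differs
]
print(assign_group_ids(rows))
```

Sorting the pairs before joining them is what makes the signature independent of row order, which is the role ORDER BY plays inside the FOR XML subquery.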

Here's the solution to my interpretation of the question:
IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;
;
WITH BaseTable AS
(
SELECT 1 id, 0 as threshold, 9 as cost
UNION SELECT 1, 100, 7
UNION SELECT 1, 500, 6
UNION SELECT 2, 0, 9
UNION SELECT 2, 100, 7
UNION SELECT 2, 500, 6
UNION SELECT 3, 1, 9
UNION SELECT 3, 100, 7
UNION SELECT 3, 500, 6
)
, BaseCTE AS
(
SELECT
id
--,dense_rank() over (order by threshold, cost ) as GroupId
,
(
SELECT CAST(TblGrouping.threshold AS varchar(8000)) + '/' + CAST(TblGrouping.cost AS varchar(8000)) + ';'
FROM BaseTable AS TblGrouping
WHERE TblGrouping.id = BaseTable.id
ORDER BY TblGrouping.threshold, TblGrouping.cost
FOR XML PATH ('')
) AS MultiGroup
FROM BaseTable
GROUP BY id
)
,
CTE AS
(
SELECT
*
,DENSE_RANK() OVER (ORDER BY MultiGroup) AS GroupId
FROM BaseCTE
)
SELECT *
INTO #tempGrouping
FROM CTE
-- SELECT * FROM #tempGrouping;
UPDATE BaseTable
SET BaseTable.GroupId = #tempGrouping.GroupId
FROM BaseTable
INNER JOIN #tempGrouping
ON BaseTable.Id = #tempGrouping.Id
IF OBJECT_ID('tempdb..#tempGrouping') IS NOT NULL DROP Table #tempGrouping;
Here BaseTable stands for your own table, so you don't need the CTE named "BaseTable"; you already have a data table.
You may need to take extra precautions if your threshold and cost fields can be NULL.

Related

SQL Server Order first by ParentID, then Child

I'm currently dealing with a database my company is phasing out, and we're trying to build a quick and dirty interface so that people can easily extract some data. A major problem with this database however, is that the primary assets are all recorded in one large table in order of when they were created, not how they relate to one another.
The gist of the database is shown below:
ParentAssetID ChildAssetID AssetName
------------------------------------
84 2 abc
35 1 cdf
956 35 PARENT35
84 1 ghi
956 3 PARENT3
35 3 jkl
956 84 PARENT84
3 5 mno
I would like to, using a select statement, output this ordered in such a way so that it appears as below:
ParentAssetID ChildAssetID AssetName
------------------------------------
956 3 PARENT3
3 5 mno
956 35 PARENT35
35 1 cdf
35 3 jkl
956 84 PARENT84
84 1 ghi
84 2 abc
As you can see, the data is first sorted by the ChildAssetID, and then each child of that asset is sorted below it. It's a pain to deal with, and that's one of the reasons why we're trying to get rid of it.
Currently, all I've got is the following:
select ParentAssetID, ChildAssetID, AssetName from dbo.Assets order by ParentAssetID
However, this only groups the child assets all together without their parent headings at the start - they're all the way down the bottom at 956, grouped with their parent's children. Is there any way to sort the table like this so it's easily human readable, or will this job have to be done by hand?

For your example this could work:
SELECT t1.*
FROM elbat t1
ORDER BY CASE
WHEN NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.childassetid = t1.parentassetid) THEN
t1.childassetid
ELSE
t1.parentassetid
END,
CASE
WHEN NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.childassetid = t1.parentassetid) THEN
0
ELSE
1
END,
t1.childassetid;
The first CASE gets all children and their parent together, the second makes sure the parent is atop and then the children are sorted. If the levels in your real table are any deeper than in the example though, this might no longer work. But maybe you can make something out of it anyways.
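To see why the two CASE expressions produce the requested order, here is a small Python sketch of the same three-part sort key, run against the sample rows from the question. It is an illustration only; the function name order_parent_then_children is made up, and like the query it handles just one level of nesting.

```python
def order_parent_then_children(rows):
    """Sort (parent_id, child_id, name) rows so each top-level row is
    followed by its direct children. Handles one level of nesting only."""
    child_ids = {child for _, child, _ in rows}

    def sort_key(row):
        parent, child, _ = row
        if parent not in child_ids:
            # Top-level row: group under its own id and sort first (0).
            return (child, 0, child)
        # Child row: group under its parent's id and sort after it (1).
        return (parent, 1, child)

    return sorted(rows, key=sort_key)

rows = [
    (84, 2, "abc"), (35, 1, "cdf"), (956, 35, "PARENT35"), (84, 1, "ghi"),
    (956, 3, "PARENT3"), (35, 3, "jkl"), (956, 84, "PARENT84"), (3, 5, "mno"),
]
for row in order_parent_then_children(rows):
    print(row)
```

The first key element plays the role of the first CASE, the 0/1 flag is the second CASE, and the child id is the final tie-breaker.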

You can achieve this using a recursive CTE:
;with cte as
(
select
ParentAssetID,
ChildAssetID,
AssetName,
cast(row_number()over(partition by ParentAssetID order by AssetName) as varchar(max)) as [path],
0 as level,
row_number()over(partition by ParentAssetID order by AssetName) / power(10.0,0) as x
from Assets
where ParentAssetID =956
union all
select
t.ParentAssetID,
t.ChildAssetID,
t.AssetName,
[path] +'-'+ cast(row_number()over(partition by t.ParentAssetID order by t.AssetName) as varchar(max)),
level+1,
x + row_number()over(partition by t.ParentAssetID order by t.AssetName) / power(10.0,level+1)
from
cte
join Assets t on cte.ChildAssetID = t.ParentAssetID
)
select
ParentAssetID,
ChildAssetID,
AssetName,
[path],
x
from cte
order by x

Your data is a bit awkward, because "mno" has a parent of "3" and "3" is associated with two parent ids.
Other than this, you appear to want to order by the path to the top. You can do this with a recursive CTE:
with cte as (
select a.parentassetid, a.childassetid, a.assetname,
convert(varchar(max), concat(format(a.parentassetid, '0000'), format(a.childassetid, '0000'))) as path, 1 as lev
from assets a
where not exists (select 1 from assets ap where a.parentassetid = ap.childassetid)
union all
select a.parentassetid, a.childassetid, a.assetname,
convert(varchar(max), concat(cte.path, '/', format(a.childassetid, '0000'))), lev + 1
from cte join
assets a
on cte.childassetid = a.parentassetid
where lev < 10
)
select *
from cte
order by path;
This doesn't produce exactly what you want, because "mno" is duplicated. I would assume that is a transcription error.
If this is not a transcription error and you want the first time that a row occurs, you can use:
select cte.*
from (select cte.*,
row_number() over (partition by parentassetid, childassetid order by lev asc) as seqnum
from cte
) cte
where seqnum = 1
order by path
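For readers who want to trace the path-based ordering by hand, the sketch below walks the same sample rows in Python. It is an assumption-laden illustration rather than a translation of the query: paths are kept as tuples of ids instead of zero-padded strings, and the name walk_paths is invented. It does show where the duplicated "mno" row comes from.

```python
def walk_paths(rows, max_depth=10):
    """Return (path, row) pairs for every row reachable from a root row.
    A root row is one whose parent id never appears as a child id."""
    child_ids = {child for _, child, _ in rows}
    children_of = {}
    for row in rows:
        children_of.setdefault(row[0], []).append(row)

    result = []
    # Anchor step: start with the root rows, path = (parent, child).
    frontier = [((p, c), (p, c, n)) for p, c, n in rows if p not in child_ids]
    depth = 1
    while frontier and depth <= max_depth:
        result.extend(frontier)
        # Recursive step: extend each path by the rows hanging under its child id.
        frontier = [
            (path + (row[1],), row)
            for path, parent_row in frontier
            for row in children_of.get(parent_row[1], [])
        ]
        depth += 1
    return sorted(result)

rows = [
    (84, 2, "abc"), (35, 1, "cdf"), (956, 35, "PARENT35"), (84, 1, "ghi"),
    (956, 3, "PARENT3"), (35, 3, "jkl"), (956, 84, "PARENT84"), (3, 5, "mno"),
]
for path, row in walk_paths(rows):
    print(path, row[2])
```

Because child id 3 appears under both 956 and 35, the row for "mno" is reached along two different paths and therefore shows up twice.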

Testing the answer from @Krishna Muppalla (https://stackoverflow.com/a/59174634/956364)
Is there a way to not use the power(10) functions? I've never seen them used to sort like this!
drop table if exists #Assets;
create table #Assets( [ParentAssetID] int, [ChildAssetID] int, [AssetName] varchar(30) );
insert into #Assets( [ParentAssetID], [ChildAssetID], [AssetName] )
select 84, 2, 'abc' union all
select 35, 1, 'cdf' union all
select 956, 35, 'PARENT35' union all
select 84, 1, 'ghi' union all
select 956, 3, 'PARENT3' union all
select 35, 3, 'jkl' union all
select 956, 84, 'PARENT84' union all
select 3, 5, 'mno';
declare @one float = 1; --I don't know if power(10.0,0) was being recomputed each call.
with [cte] as (
select
[ParentAssetID],
[ChildAssetID],
[AssetName],
0 [level],
row_number() over( partition by [ParentAssetID] order by [AssetName] ) / @one [x]
from #Assets
where [ParentAssetID] = 956 --this is bad. How do we get around this?
union all
select
t.[ParentAssetID],
t.[ChildAssetID],
t.[AssetName],
[level] + 1,
[x] + row_number() over( partition by t.[ParentAssetID] order by t.[AssetName] ) / power(10.0, [level] + 1) [x]
from [cte]
join #Assets t on cte.[ChildAssetID] = t.[ParentAssetID]
)
select
[ParentAssetID],
[ChildAssetID],
[AssetName]
,[x]
from [cte]
order by [x];

Is this what you are looking for?
select
ParentAssetID,
ChildAssetID,
AssetName
from dbo.Assets
order by ParentAssetID desc, ChildAssetID asc;

Get every combination of sort order and value of a csv

If I have a string with numbers separated by commas, like this:
Declare @string varchar(20) = '123,456,789'
And would like to return every possible combination + sort order of the values by doing this:
Select Combination FROM dbo.GetAllCombinations(@string)
Which would in result return this:
123
456
789
123,456
456,123
123,789
789,123
456,789
789,456
123,456,789
123,789,456
456,789,123
456,123,789
789,456,123
789,123,456
As you can see not only is every combination returned, but also each combination+sort order as well. The example shows only 3 values separated by commas, but should parse any amount--Recursive.
The logic needed would be somewhere in the realm of using a WITH CUBE statement, but the problem with using WITH CUBE (in a table structure instead of CSV of course), is that it won't shuffle the order of the values 123,456 456,123 etc., and will only provide each combination, which is only half of the battle.
Currently I have no idea what to try. If someone can provide some assistance it would be appreciated.

I use a user-defined table-valued function called split_delimiter that takes two values: @delimited_string and @delimiter_type.
CREATE FUNCTION [dbo].[split_delimiter](@delimited_string VARCHAR(8000), @delimiter_type CHAR(1))
RETURNS TABLE AS
RETURN
WITH cte10(num) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
,cte100(num) AS
(
    SELECT 1
    FROM cte10 t1, cte10 t2
)
,cte10000(num) AS
(
    SELECT 1
    FROM cte100 t1, cte100 t2
)
,cte1(num) AS
(
    SELECT TOP (ISNULL(DATALENGTH(@delimited_string),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM cte10000
)
,cte2(num) AS
(
    SELECT 1
    UNION ALL
    SELECT t.num+1
    FROM cte1 t
    WHERE SUBSTRING(@delimited_string,t.num,1) = @delimiter_type
)
,cte3(num,[len]) AS
(
    SELECT t.num
    ,ISNULL(NULLIF(CHARINDEX(@delimiter_type,@delimited_string,t.num),0)-t.num,8000)
    FROM cte2 t
)
SELECT delimited_item_num = ROW_NUMBER() OVER(ORDER BY t.num)
,delimited_value = SUBSTRING(@delimited_string, t.num, t.[len])
FROM cte3 t;
Using that I was able to parse the CSV to a table and join it back to itself multiple times and use WITH ROLLUP to get the permutations you are looking for.
WITH Numbers as
(
SELECT delimited_value
FROM dbo.split_delimiter('123,456,789',',')
)
SELECT CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR)
FROM Numbers as Nums1
LEFT JOIN Numbers as Nums2
ON Nums2.delimited_value not in (Nums1.delimited_value)
LEFT JOIN Numbers as Nums3
ON Nums3.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value)
LEFT JOIN Numbers as Nums4
ON Nums4.delimited_value not in (Nums1.delimited_value, Nums2.delimited_value, Nums3.delimited_value)
GROUP BY CAST(Nums1.delimited_value AS VARCHAR)
,ISNULL(CAST(Nums2.delimited_value AS VARCHAR),'')
,ISNULL(CAST(Nums3.delimited_value AS VARCHAR),'')
,CAST(Nums4.delimited_value AS VARCHAR) WITH ROLLUP
If you will potentially have more than 3 or 4, you'll want to expand your code accordingly.
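Since the question asks for any number of values, it may help to see the target result generated without a fixed number of joins. The Python sketch below is only a reference for what the output should contain; the function name all_ordered_combinations is an assumption, and it is not a T-SQL solution.

```python
from itertools import permutations

def all_ordered_combinations(csv_text):
    """Return every ordering of every non-empty subset of the
    comma-separated values, shortest combinations first."""
    values = [v.strip() for v in csv_text.split(",") if v.strip()]
    return [
        ",".join(combo)
        for size in range(1, len(values) + 1)
        for combo in permutations(values, size)
    ]

result = all_ordered_combinations("123,456,789")
print(len(result))
for line in result:
    print(line)
```

For three values this yields 3 + 6 + 6 = 15 rows, matching the list in the question. The count grows factorially with the number of values, so any implementation will become slow for long lists.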

SQL order by included character and string

I have a table and I want to sort it by the joint_no column. The column's values look like this:
FW-1
FW-2
.
.
.
FW-13
FW-R1
FW-1A
When I order them, I get these results:
FW-1
FW-10
FW-11
FW-12
FW-13
FW-1A
.
.
FW-R1
I want the SQL query to return this result instead:
FW-1
FW-1A
FW-2
FW-3
..
FW-13
FW-R1
Can anybody help me?

If you can do it, I'd advise you to renumber the values so that the 'logical' order sticks to the alphabetical order. F-1 will then be updated to F-01, or F-001.
If you cannot do it, add a field that will be populated with the 'ordered' form of your code. You'll then be able to order by the F-001 column and still display the F-1 value.
Otherwise ordering your records will rapidly become your nightmare.

Using Patindex to find the first numeric expression as first sort field, then extracting the numeric part as integer as second sortfield and using the whole string as third sort field you might get the desired result.
Declare @a Table (c varchar(50))
Insert Into @a
Select 'FW-1'
Union Select 'FW-10'
Union Select 'FW-11'
Union Select 'FW-12'
Union Select 'FW-13'
Union Select 'FW-1A'
Union Select 'FW-2'
Union Select 'FW-3'
Union Select 'FW-R1'
Union Select 'FW-A1'
;With CTE as
(Select 1 as ID
 Union All
 Select ID + 1 from CTE where ID < 100
)
Select * from
(
    Select c
    ,PATINDEX('%[0-9]%',c) as s1
    ,(Select Cast(
        (Select Case
            When SUBSTRING(c, ID, 1) LIKE '[0-9]'
            Then SUBSTRING(c, ID, 1)
            Else ''
         End
         From (Select * from CTE) AS X(ID)
         Where ID <= LEN(c)
         For XML PATH(''))
      as int)
     )
     as s2
    from
    @a
) x
order by
s1,s2,c
With the output:
FW-1 4 1 -1
FW-1A 4 1 -1A
FW-2 4 2 -2
FW-3 4 3 -3
FW-10 4 10 -10
FW-11 4 11 -11
FW-12 4 12 -12
FW-13 4 13 -13
FW-A1 5 1 A1
FW-R1 5 1 R1
If the leading part is not fixed (FW-) you might need to add one additional sort field
Declare @a Table (c varchar(50))
Insert Into @a
Select 'FW-1'
Union Select 'FW-10'
Union Select 'FW-11'
Union Select 'FW-12'
Union Select 'FW-13'
Union Select 'FW-1A'
Union Select 'FW-2'
Union Select 'FW-3'
Union Select 'FW-R1'
Union Select 'FW-A1'
Union Select 'AB-A1'
Union Select 'AB-11'
;With CTE as
(Select 1 as ID
 Union All
 Select ID + 1 from CTE where ID < 100
)
Select * from
(
    Select c
    ,SubString(c,1,PATINDEX('%[0-9]%',c)-1) as S0
    ,PATINDEX('%[0-9]%',c) as s1
    ,(Select Cast(
        (Select Case
            When SUBSTRING(c, ID, 1) LIKE '[0-9]'
            Then SUBSTRING(c, ID, 1)
            Else ''
         End
         From (Select * from CTE) AS X(ID)
         Where ID <= LEN(c)
         For XML PATH(''))
      as int)
     )
     as s2
    from
    @a
) x
order by
s0,s1,s2,c
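The four sort fields can be easier to follow when written as a plain key function. The Python sketch below mirrors them (s0 = text before the first digit, s1 = position of the first digit, s2 = all digits read as one integer, then the whole string). The name natural_key is invented, and the handling of values without any digit is my own assumption, since the query above does not cover that case.

```python
import re

def natural_key(code):
    """Build the sort key (s0, s1, s2, c) for one value."""
    match = re.search(r"[0-9]", code)
    # s1: 1-based position of the first digit, 0 when there is none (like PATINDEX).
    s1 = match.start() + 1 if match else 0
    # s0: everything before the first digit (assumption: the whole value if no digit).
    s0 = code[: s1 - 1] if match else code
    # s2: all digits in the value, concatenated and read as one integer.
    digits = "".join(re.findall(r"[0-9]", code))
    s2 = int(digits) if digits else 0
    return (s0, s1, s2, code)

values = [
    "FW-1", "FW-10", "FW-11", "FW-12", "FW-13", "FW-1A",
    "FW-2", "FW-3", "FW-R1", "FW-A1", "AB-A1", "AB-11",
]
for value in sorted(values, key=natural_key):
    print(value)
```

One limitation carries over from the SQL: because s2 concatenates every digit in the value, a code such as FW-1A2 would be treated as the number 12.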

How to get the deepest levels of a hierarchical sql query

I'm using SQL Server 2008.
Say I have a recursive hierarchy table, SalesRegion, with SalesRegionId and ParentSalesRegionId. What I need is, given a specific SalesRegion (anywhere in the hierarchy), to retrieve ALL the records at the BOTTOM level.
For example:
SalesRegion, ParentSalesRegionId
1, null
1-1, 1
1-2, 1
1-1-1, 1-1
1-1-2, 1-1
1-2-1, 1-2
1-2-2, 1-2
1-1-1-1, 1-1-1
1-1-1-2, 1-1-1
1-1-2-1, 1-1-2
1-2-1-1, 1-2-1
(In my table I have sequential numbers; these dashed numbers are only for clarity.)
So, if the user enters 1-1, I need to retrieve all records with SalesRegion 1-1-1-1 or 1-1-1-2 or 1-1-2-1 (and NOT 1-2-2). Similarly, if the user enters 1-1-2-1, I need to retrieve just 1-1-2-1.
I have a CTE query that retrieves everything below 1-1, but that includes rows that I don't want:
WITH SaleLocale_CTE AS (
SELECT SL.SaleLocaleId, SL.SaleLocaleName, SL.AccountingLocationID, SL.LocaleTypeId, SL.ParentSaleLocaleId, 1 AS Level /*Added as a workaround*/
FROM SaleLocale SL
WHERE SL.Deleted = 0
AND (@SaleLocaleId IS NULL OR SaleLocaleId = @SaleLocaleId)
UNION ALL
SELECT SL.SaleLocaleId, SL.SaleLocaleName, SL.AccountingLocationID, SL.LocaleTypeId, SL.ParentSaleLocaleId, Level + 1 AS Level
FROM SaleLocale SL
INNER JOIN SaleLocale_CTE SLCTE ON SLCTE.SaleLocaleId = SL.ParentSaleLocaleId
WHERE SL.Deleted = 0
)
SELECT *
FROM SaleLocale_CTE
Thanks in advance!
Alejandro.

I found a quick way to do this, but I'd rather the answer to be in a single query. So if you can think of one, please share! If I like it better, I'll vote for it as the best answer.
I added a "Level" column in my previous query (I'll edit the question so this answer is clear), and used it to get the last level and then delete the ones I don't need.
INSERT INTO #SaleLocales
SELECT *
FROM SaleLocale_GetChilds(@SaleLocaleId)
SELECT @LowestLevel = MAX(Level)
FROM #SaleLocales
DELETE #SaleLocales
WHERE Level <> @LowestLevel

Building off your post:
; WITH CTE AS
(
    SELECT *
    FROM SaleLocale_GetChilds(@SaleLocaleId)
)
SELECT a.*
FROM CTE a
JOIN
(
    SELECT MAX(level) AS level
    FROM CTE
) b
ON a.level = b.level
There were a few edits in there. Kept hitting post...

Are you looking for something like this:
declare @SalesRegion as table ( SalesRegion int, ParentSalesRegionId int )
insert into @SalesRegion ( SalesRegion, ParentSalesRegionId ) values
  ( 1, NULL ), ( 2, 1 ), ( 3, 1 ),
  ( 4, 3 ), ( 5, 3 ),
  ( 6, 5 )
; with CTE as (
  -- Get the root(s).
  select SalesRegion, CAST( SalesRegion as varchar(1024) ) as Path
    from @SalesRegion
    where ParentSalesRegionId is NULL
  union all
  -- Add the children one level at a time.
  select SR.SalesRegion, CAST( CTE.Path + '-' + cast( SR.SalesRegion as varchar(10) ) as varchar(1024) )
    from CTE inner join
      @SalesRegion as SR on SR.ParentSalesRegionId = CTE.SalesRegion
  )
select *
  from CTE
  where Path like '1-3%'

I haven't tried this on a serious dataset, so I'm not sure how it'll perform, but I believe it solves your problem:
WITH SaleLocale_CTE AS (
    SELECT SL.SaleLocaleId, SL.SaleLocaleName, SL.AccountingLocationID, SL.LocaleTypeId, SL.ParentSaleLocaleId, CASE WHEN EXISTS (SELECT 1 FROM SaleLocale SL2 WHERE SL2.ParentSaleLocaleId = SL.SaleLocaleId) THEN 1 ELSE 0 END AS HasChildren
    FROM SaleLocale SL
    WHERE SL.Deleted = 0
    AND (@SaleLocaleId IS NULL OR SaleLocaleId = @SaleLocaleId)
    UNION ALL
    SELECT SL.SaleLocaleId, SL.SaleLocaleName, SL.AccountingLocationID, SL.LocaleTypeId, SL.ParentSaleLocaleId, CASE WHEN EXISTS (SELECT 1 FROM SaleLocale SL2 WHERE SL2.ParentSaleLocaleId = SL.SaleLocaleId) THEN 1 ELSE 0 END AS HasChildren
    FROM SaleLocale SL
    INNER JOIN SaleLocale_CTE SLCTE ON SLCTE.SaleLocaleId = SL.ParentSaleLocaleId
    WHERE SL.Deleted = 0
)
SELECT *
FROM SaleLocale_CTE
WHERE HasChildren = 0
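Note that "rows without children" and "rows on the maximum level" are not always the same set, so the answers in this thread can return different results on an unbalanced tree. The Python sketch below makes the difference concrete using the dashed sample hierarchy from the question; all function names are invented for illustration, the Deleted flag is ignored, and the hierarchy is assumed to contain no cycles.

```python
def descendants_with_level(parent_of, start):
    """Return {region: level} for start (level 1) and everything below it."""
    levels = {start: 1}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for region, parent in parent_of.items():
            if parent == current:
                levels[region] = levels[current] + 1
                frontier.append(region)
    return levels

def leaves_below(parent_of, start):
    """Regions at or below start that have no children of their own."""
    has_children = set(parent_of.values())
    found = descendants_with_level(parent_of, start)
    return sorted(r for r in found if r not in has_children)

def deepest_below(parent_of, start):
    """Regions at or below start that sit on the maximum level."""
    levels = descendants_with_level(parent_of, start)
    deepest = max(levels.values())
    return sorted(r for r, lvl in levels.items() if lvl == deepest)

parent_of = {
    "1": None, "1-1": "1", "1-2": "1",
    "1-1-1": "1-1", "1-1-2": "1-1", "1-2-1": "1-2", "1-2-2": "1-2",
    "1-1-1-1": "1-1-1", "1-1-1-2": "1-1-1", "1-1-2-1": "1-1-2", "1-2-1-1": "1-2-1",
}
print(leaves_below(parent_of, "1-1"))
print(leaves_below(parent_of, "1-2"))
print(deepest_below(parent_of, "1-2"))
```

Starting from 1-1 both approaches agree. Starting from 1-2, the childless rows are 1-2-1-1 and 1-2-2, while the maximum-level filter keeps only 1-2-1-1, so it is worth deciding which of the two meanings of "bottom level" is actually wanted.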

SELECT DISTINCT for data groups

I have the following table:
ID Data
1 A
2 A
2 B
3 A
3 B
4 C
5 D
6 A
6 B
etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:
DataID Data
101 A
102 A
102 B
103 C
104 D
So DataID 102 would represent data (A,B), DataID 103 would represent data (C), etc. The goal is to be able to rewrite my original table in this form:
ID DataID
1 101
2 102
3 102
4 103
5 104
6 102
How can I do that?
PS. Code to generate the first table:
CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'

In my opinion, you have to create a custom aggregate that concatenates data (for strings, a CLR approach is recommended for performance reasons).
Then I would group by ID and select distinct from the grouping, adding a row_number() or a dense_rank(), your choice. Anyway, it should look like this:
with groupings as (
    select distinct dbo.concat(data) as groups  -- dbo.concat is the custom aggregate you create, not a built-in
    from Table1
    group by ID
)
select groups, row_number() over (order by groups) as DataID
from groupings

The following query using CASE will give you the result shown below.
From there on, getting the distinct datagroups and proceeding further should not really be a problem.
SELECT
id,
MAX(CASE data WHEN 'A' THEN data ELSE '' END) +
MAX(CASE data WHEN 'B' THEN data ELSE '' END) +
MAX(CASE data WHEN 'C' THEN data ELSE '' END) +
MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM t1
GROUP BY id
ID DataGroups
1 A
2 AB
3 AB
4 C
5 D
6 AB
However, this kind of logic will only work if the "Data" values are both fixed and known beforehand.
In your case, you do say that they are. However, considering that you also say there are 1000 of them, this would frankly be a ridiculous-looking query :-)
LuckyLuke's suggestion above would be the more generic and probably saner way to implement the solution in your case.

From your sample data (having added the missing 2,'A' tuple), the following gives the renumbered (and uniqueified) data:
with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
select ID,data
from #t1 where id in (select id from NonDups)
), Renumbered as (
select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered
Giving:
1 A
2 A
2 B
3 C
4 D
I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.

Just to share my own dirty solution that I'm using for the moment:
SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY (
SELECT CAST(Data AS VARCHAR) + ','
FROM #t1 t2
WHERE t2.id = t1.id
ORDER BY Data ASC
FOR XML PATH('') )
D ( Data )
And then proceeding analogously to LuckyLuke's solution.
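As a cross-check of the expected result, the following Python sketch derives both target tables from the sample rows in the question. It treats each ID's values as a set and hands out DataIDs in order of first appearance, starting at 101 as in the question; the function name number_data_groups and the first_id parameter are assumptions for illustration.

```python
def number_data_groups(rows, first_id=101):
    """Return (groups, mapping): groups maps DataID -> sorted values,
    mapping maps each original ID -> DataID."""
    values_by_id = {}
    for row_id, value in rows:
        values_by_id.setdefault(row_id, set()).add(value)

    data_id_of = {}   # frozenset of values -> DataID
    mapping = {}      # original ID -> DataID
    for row_id in sorted(values_by_id):
        group = frozenset(values_by_id[row_id])
        if group not in data_id_of:
            # First time this exact set of values is seen: give it the next DataID.
            data_id_of[group] = first_id + len(data_id_of)
        mapping[row_id] = data_id_of[group]

    groups = {data_id: sorted(group) for group, data_id in data_id_of.items()}
    return groups, mapping

rows = [
    (1, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "B"),
    (4, "C"), (5, "D"), (6, "A"), (6, "B"),
]
groups, mapping = number_data_groups(rows)
print(groups)
print(mapping)
```

Using a set per ID means the order of the rows and any repeated values do not affect which group an ID lands in, which is the same guarantee the ordered concatenation provides in SQL.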
