SQL Merge table rows based on IN criteria - sql

I have a query result like the following:
Code Counts1 Counts2 TotalCounts
1 10 20 30
4 15 18 33
5 5 14 19
... ... ... ...
What I am trying to achieve is merging the counts for all rows where Code (the column the counts are grouped on) is IN (1,4). However, in all my research, all I found were methods to merge rows based on a common value in each row (same id, etc.).
Is there a way to merge rows based on IN criteria, so I know if I should research it further?

How about a union?
select
1 as Code,
sum(Counts1) as Counts1,
sum(Counts2) as Counts2,
sum(TotalCounts) as TotalCounts
from
YourTable
where
code in (1,4)
union
select *
from
YourTable
where
code not in(1,4)

Assuming you will have numerous groupings (see the @Groupings mapping table),
you can have dynamic groupings via a LEFT JOIN.
Example
Declare @YourTable Table ([Code] varchar(50),[Counts1] int,[Counts2] int,[TotalCounts] int)
Insert Into @YourTable Values
(1,10,20,30)
,(4,15,18,33)
,(5,5,14,19)
Declare @Groupings Table (Code varchar(50),Grp int)
Insert Into @Groupings values
(1,1)
,(4,1)
select code = Isnull(B.NewCode,A.Code)
,Counts1 = sum(Counts1)
,Counts2 = sum(Counts2)
,TotalCounts = sum(TotalCounts)
From @YourTable A
Left Join (
Select *
,NewCode = (Select Stuff((Select ',' + Code From @Groupings Where Grp=B1.Grp For XML Path ('')),1,1,'') )
From @Groupings B1
) B on (A.Code=B.Code)
Group By Isnull(B.NewCode,A.Code)
Returns
code Counts1 Counts2 TotalCounts
1,4 25 38 63
5 5 14 19
If it helps with the Visualization, the subquery generates
Code Grp NewCode
1 1 1,4
4 1 1,4

Sum the counts and remove Code from the select list. Add a new column that puts codes 1 and 4 in the same group using a CASE expression (let's call it groupN), then GROUP BY groupN.
You are correct that grouping has to be based on a common value; creating the new column is what gives those rows one.
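A minimal sketch of the CASE-based grouping described above. It runs the query through Python's standard-library sqlite3 module rather than SQL Server, so it is only an illustration; the table name YourTable, the label '1,4' and the column name GroupN are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (Code INTEGER, Counts1 INTEGER, Counts2 INTEGER, TotalCounts INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [(1, 10, 20, 30), (4, 15, 18, 33), (5, 5, 14, 19)],
)

# The derived table adds GroupN: codes 1 and 4 share one label,
# every other code keeps its own value as the label.
grouped = conn.execute(
    """
    SELECT GroupN,
           SUM(Counts1)     AS Counts1,
           SUM(Counts2)     AS Counts2,
           SUM(TotalCounts) AS TotalCounts
    FROM (
        SELECT CASE WHEN Code IN (1, 4) THEN '1,4' ELSE CAST(Code AS TEXT) END AS GroupN,
               Counts1, Counts2, TotalCounts
        FROM YourTable
    ) AS t
    GROUP BY GroupN
    ORDER BY GroupN
    """
).fetchall()

for row in grouped:
    print(row)
```

The CASE expression sits in a derived table so that the outer query can group on the GroupN column by name, which should carry over to engines that do not allow a select-list alias in GROUP BY.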

Related

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM @values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
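The grouping logic described above can be sketched outside the database to make the intent explicit. This is only an illustration in Python, not a replacement for the SQL: the function name surviving_subsets is hypothetical, and the sample rows mirror the INSERT statement from the question.

```python
from collections import defaultdict

sample_rows = [
    (1, 'a', 1), (1, 'a', 2), (1, 'a', 3),
    (1, 'b', 1), (1, 'b', 3), (1, 'b', 5),
    (1, 'c', 1), (1, 'c', 3), (1, 'c', 5),
    (2, 'a', 1), (2, 'a', 2), (2, 'a', 3),
    (2, 'b', 2), (2, 'b', 4), (2, 'b', 6),
    (2, 'c', 1), (2, 'c', 3), (2, 'c', 6),
]

def surviving_subsets(rows):
    """Return the (GroupID, SubSet) pairs to keep, one per distinct value list."""
    values = defaultdict(list)
    for group_id, subset, value in rows:
        values[(group_id, subset)].append(value)

    keep = {}
    for (group_id, subset), vals in values.items():
        # The sorted tuple plays the role of the comma-delimited string.
        signature = (group_id, tuple(sorted(vals)))
        # Keep the smallest subset name per signature, like MIN(x.SubSet).
        if signature not in keep or subset < keep[signature]:
            keep[signature] = subset
    return sorted((group_id, subset) for (group_id, _), subset in keep.items())

print(surviving_subsets(sample_rows))
```

With the sample rows, subset 'c' of GroupID 1 is the only one dropped, because its value list matches subset 'b'.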
I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from @values v
group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in
the table.
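This is because the aggregate combines the values in an order-independent way, much like a sum: 1 + 2 + 3 = 3 + 2 + 1 = 6. The same property means that different sets of values can collide, just as 3 + 3 also equals 6.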
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from @values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from @values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
    select groupId,
           subset,
           cs = checksum_agg(value)
    from @values v
    group by groupId,
             subset
),
maybeBadSubsets as (
    select groupId,
           subset,
           cs,
           deleteSubset =
               case
                   when row_number() over (
                            partition by groupId, cs
                            order by subset
                        ) > 1
                   then 1
               end
    from getChecksums
)
delete v
from @values v
where exists (
    select 0
    from maybeBadSubsets mbs
    where v.groupId = mbs.groupId
      and v.SubSet = mbs.subset
      and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a performance quirk. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
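The suggestion above to use the checksum only to narrow down candidates can be sketched as a two-step check: bucket the subsets by a cheap order-independent checksum, then confirm real duplicates with an exact comparison. This is an illustration in Python under stated assumptions: the XOR fold is a stand-in for a cheap aggregate and is not claimed to be what CHECKSUM_AGG computes, and the function name duplicate_subsets is hypothetical.

```python
from collections import defaultdict
from functools import reduce
from operator import xor

checksum_rows = [
    (1, 'a', 1), (1, 'a', 2), (1, 'a', 3),
    (1, 'b', 1), (1, 'b', 3), (1, 'b', 5),
    (1, 'c', 1), (1, 'c', 3), (1, 'c', 5),
    (2, 'a', 1), (2, 'a', 2), (2, 'a', 3),
    (2, 'b', 2), (2, 'b', 4), (2, 'b', 6),
    (2, 'c', 1), (2, 'c', 3), (2, 'c', 6),
]

def duplicate_subsets(rows):
    """Return the (GroupID, SubSet) pairs that exactly duplicate an earlier subset."""
    values = defaultdict(list)
    for group_id, subset, value in rows:
        values[(group_id, subset)].append(value)

    # Step 1: cheap, order-independent checksum (assumed stand-in: XOR fold).
    buckets = defaultdict(list)
    for (group_id, subset), vals in values.items():
        buckets[(group_id, reduce(xor, vals, 0))].append(subset)

    # Step 2: only subsets sharing a bucket are compared exactly.
    to_delete = []
    for (group_id, _), subsets in buckets.items():
        seen = set()
        for subset in sorted(subsets):
            exact = tuple(sorted(values[(group_id, subset)]))
            if exact in seen:
                to_delete.append((group_id, subset))
            else:
                seen.add(exact)
    return sorted(to_delete)

print(duplicate_subsets(checksum_rows))
```

In GroupID 2, the value lists 1,2,3 and 2,4,6 both fold to 0 under XOR, so they land in the same bucket; the exact comparison in step 2 is what stops subset 'b' from being deleted as a false positive.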

Rotate rows into columns with column names not coming from the row

I've looked at some answers but none of them seem to be applicable to me.
Basically I have this result set:
RowNo | Id | OrderNo |
1 101 1
2 101 10
I just want to convert this to
| Id | OrderNo_0 | OrderNo_1 |
101 1 10
I know I should probably use PIVOT. But the syntax is just not clear to me.
There are always exactly two order numbers per Id.
And if you want to use PIVOT then the following works with the data provided:
declare @Orders table (RowNo int, Id int, OrderNo int)
insert into @Orders (RowNo, Id, OrderNo)
select 1, 101, 1 union all select 2, 101, 10
select Id, [1] OrderNo_0, [2] OrderNo_1
from (
select RowNo, Id, OrderNo
from @Orders
) SourceTable
pivot (
sum(OrderNo)
for RowNo in ([1],[2])
) as PivotTable
Reference: https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
Note: To build each row in the result set, the pivot function groups by the columns that are not being pivoted. Therefore you need an aggregate function on the column that is being pivoted. You won't notice it in this instance because you have unique rows to start with, but if you had multiple rows with the same RowNo and Id you would find that the aggregation comes into play.
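The grouping and aggregation behaviour in the note can be illustrated with conditional aggregation, which produces the same shape of result without the PIVOT keyword. This sketch runs through Python's standard-library sqlite3 module rather than SQL Server; the table name Orders and the extra duplicate row are assumptions added for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (RowNo INTEGER, Id INTEGER, OrderNo INTEGER)")
conn.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [(1, 101, 1), (2, 101, 10)])

PIVOT_SQL = """
SELECT Id,
       SUM(CASE WHEN RowNo = 1 THEN OrderNo END) AS OrderNo_0,
       SUM(CASE WHEN RowNo = 2 THEN OrderNo END) AS OrderNo_1
FROM Orders
GROUP BY Id
"""

# Unique (RowNo, Id) pairs: the aggregate has one value per cell.
unique_result = conn.execute(PIVOT_SQL).fetchall()
print(unique_result)

# A second row with the same RowNo and Id makes the aggregation visible.
conn.execute("INSERT INTO Orders VALUES (1, 101, 5)")
duplicate_result = conn.execute(PIVOT_SQL).fetchall()
print(duplicate_result)
```

After the duplicate row is added, OrderNo_0 becomes the sum of both values for RowNo 1, which is the effect the note warns about.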
As you say there are only ever two order numbers per ID, you could join the results set to itself on the ID column. For the purposes of the example below, I'm assuming your results set is merely selecting from a single Orders table, but it should be easy enough to replace this with your existing query.
SELECT o1.ID, o1.OrderNo AS [OrderNo_0], o2.OrderNo AS [OrderNo_1]
FROM Orders AS o1
INNER JOIN Orders AS o2
ON (o1.ID = o2.ID AND o1.OrderNo < o2.OrderNo) -- "<" rather than "<>" so each ID is returned once, not once per ordering
Given your sample data, the simplest approach is to use the MIN and MAX functions.
SELECT Id,min(OrderNo) OrderNo_0,MAX(OrderNo) OrderNo_1
FROM T
GROUP BY Id

T-SQL Combine Ranges Based On Value

I am using SQL Server 2012 and have been struggling with this query for hours. I am trying to aggregate mile post ranges based off the value in the Value column. The results should have unique segments with the highest value from the Value field for each segment. Here's an example:
Mile_Marker_Start | Mile_Marker_End | Value
0 100 5
50 150 6
100 200 10
75 300 9
150 200 7
And here's the result I'm looking for:
Mile_Marker_Start | Mile_Marker_End | Value
0 50 5
50 75 6
75 100 9
100 200 10
200 300 9
As you can see, the row with a value of 9 got split into 2 rows because Value 10 was bigger. Also, the row with Value 7 does not display because Value 10 was bigger. Can this be done without using a cursor? Any help would be much appreciated.
Thanks
I believe the following now does what you need. I'd recommend running all the parts separately so you can see what they do and how they work.
DECLARE @input AS TABLE
(Mile_Marker_Start int, Mile_Marker_End int, Value int)
INSERT INTO @input VALUES
(0,100,5), (50,150,6), (100,200,10), (75,300,9), (150,200,7)
DECLARE @staging as table
(Mile_Marker int)
INSERT INTO @staging
SELECT Mile_Marker_Start from @input
UNION -- this will remove duplicates
SELECT Mile_Marker_End from @input
; -- we need semi-colon for the following CTE
-- this CTE gets the right values, but the rows aren't "collapsed"
WITH all_markers AS
(
    SELECT
        groups.Mile_Marker_Start,
        groups.Mile_Marker_End,
        max(i3.Value) Value
    FROM
    (
        SELECT
            s1.Mile_Marker Mile_Marker_Start,
            min(s2.Mile_Marker) Mile_Marker_End
        FROM
            @staging s1
            JOIN @staging s2 ON
                s1.Mile_Marker < s2.Mile_Marker
        GROUP BY
            s1.Mile_Marker
    ) as groups
    JOIN @input i3 ON
        i3.Mile_Marker_Start < groups.Mile_Marker_End AND
        i3.Mile_Marker_End > groups.Mile_Marker_Start
    GROUP BY
        groups.Mile_Marker_Start,
        groups.Mile_Marker_End
)
SELECT
    MIN(collapse.Mile_Marker_Start) as Mile_Marker_Start,
    MAX(collapse.Mile_Marker_End) as Mile_Marker_End,
    collapse.Value
FROM
(-- Subquery gets IDs for the groups we're collapsing together
    SELECT
        am.*,
        ROW_NUMBER() OVER (ORDER BY am.Mile_Marker_Start) - ROW_NUMBER() OVER (PARTITION BY am.Value ORDER BY am.Mile_Marker_Start) GroupID
    FROM
        all_markers am
) AS COLLAPSE
GROUP BY
    collapse.GroupID,
    collapse.Value
ORDER BY
    MIN(collapse.Mile_Marker_Start)
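The three stages of the query above (collect every boundary, take the highest Value covering each elementary segment, collapse neighbouring segments with the same Value) can be traced in a few lines of procedural code. This is a sketch in Python for following the logic, not a substitute for the set-based query; the function name combine_ranges is hypothetical.

```python
def combine_ranges(ranges):
    """ranges: list of (start, end, value). Returns non-overlapping segments
    carrying the highest value that covers each one."""
    # Stage 1: every start or end is a potential boundary (duplicates removed).
    points = sorted({p for start, end, _ in ranges for p in (start, end)})

    merged = []
    for seg_start, seg_end in zip(points, points[1:]):
        # Stage 2: highest value among the input ranges overlapping this segment.
        covering = [v for start, end, v in ranges if start < seg_end and end > seg_start]
        if not covering:
            continue  # a gap that no input range covers
        top = max(covering)
        # Stage 3: extend the previous segment when it touches this one
        # and carries the same value; otherwise start a new segment.
        if merged and merged[-1][2] == top and merged[-1][1] == seg_start:
            merged[-1] = (merged[-1][0], seg_end, top)
        else:
            merged.append((seg_start, seg_end, top))
    return merged

mile_posts = [(0, 100, 5), (50, 150, 6), (100, 200, 10), (75, 300, 9), (150, 200, 7)]
for segment in combine_ranges(mile_posts):
    print(segment)
```

For the sample data this yields the five segments the question asks for. Unlike the ROW_NUMBER difference used in the query, the sketch also checks that two segments actually touch before merging them, which only matters when the input leaves gaps.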
Since you are on 2012 you could maybe use LEAD. Here is my code, but as noted on your question by @stevelovell, we need clarification on how you are getting your result table.
--test data
declare @tablename TABLE
(
Mile_Marker_Start int,
Mile_Marker_End int,
Value int
);
insert into @tablename
values(0,100, 5),
(50,150, 6),
(100,200,10),
(75,300, 9),
(150,200, 7);
--query
select *
from @tablename
order by Mile_Marker_Start
select Mile_Marker_Start,
case when lead(mile_marker_start) over(order by mile_marker_start) < Mile_Marker_End THEN
lead(mile_marker_start) over(order by mile_marker_start)
ELSE
Mile_marker_end
END
AS MILE_MARKER_END,
Value
from @tablename
order by Mile_Marker_Start
Once you update your notes I will come back and update my answer.
Update: I wasn't able to get LEAD and the other windowing functions to work with your requirements, given the way you need to move up and down the table comparing current and calculated values...

SQL Server match and count on substring

I am using SQL Server 2008 R2 and have a table like this:
ID Record
1 IA12345
2 IA33333
3 IA33333
4 IA44444
5 MO12345
I am trying to put together some SQL to return the two rows that contain IA12345 and MO12345. So, I need to match on the partial string of the column "Record". What is complicating my SQL is that I don't want to return matches like IA33333 and IA33333. Clear as mud?
I am getting twisted up in substrings, group by, count and the like!
SELECT ID, Record FROM Table WHERE Record LIKE '%12345'
Select *
from MyTable
where Record like '%12345%'
This will find repeating digits and/or runs, for example 333, 123 or 321.
Think of it as Rummy 500.
Declare @YourTable table (ID int,Record varchar(25))
Insert Into @YourTable values
( 1,'IA12345'),
( 2,'IA33333'),
( 3,'IA33333'),
( 4,'IA44444'),
( 5,'MO12345'),
( 6,'M785256') -- Will be excluded because there is no pattern
Declare @Num table (Num int);Insert Into @Num values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
Select Distinct A.*
From @YourTable A
Join (
Select Patt=replicate(Num,3) from @Num
Union All
Select Patt=right('000'+cast((Num*100+Num*10+Num)+12 as varchar(5)),3) from @Num where Num<8
Union All
Select Patt=reverse(right('000'+cast((Num*100+Num*10+Num)+12 as varchar(5)),3)) from @Num where Num<8
) B on CharIndex(Patt,Record)>0
Returns
ID Record
1 IA12345
2 IA33333
3 IA33333
4 IA44444
5 MO12345
EDIT
I should add that if runs of 3 are too small, it is a small matter to tweak the sub-queries so that 333 becomes 3333 and 123 becomes 1234.
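The pattern test itself, including the adjustable run length mentioned in the edit, can be sketched as a small function. This is an illustration in Python of the matching rule only (repeated digits, ascending runs, descending runs), not of the join; the function name has_run and its length parameter are hypothetical.

```python
DIGITS = "0123456789"

def has_run(record, length=3):
    """True if record contains `length` consecutive digits that are all equal,
    ascend by one each step (123), or descend by one each step (321)."""
    for i in range(len(record) - length + 1):
        window = record[i:i + length]
        if any(ch not in DIGITS for ch in window):
            continue
        # Differences between neighbouring digits: all 0, all +1, or all -1.
        steps = {int(b) - int(a) for a, b in zip(window, window[1:])}
        if steps in ({0}, {1}, {-1}):
            return True
    return False

for record in ["IA12345", "IA33333", "IA44444", "MO12345", "M785256"]:
    print(record, has_run(record))
```

Calling has_run(record, 4) applies the stricter rule from the edit, where 333 must become 3333 and 123 must become 1234.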

How do you find a missing number in a table field starting from a parameter and incrementing sequentially?

Let's say I have an sql server table:
NumberTaken CompanyName
2 Fred
3 Fred
4 Fred
6 Fred
7 Fred
8 Fred
11 Fred
I need an efficient way to pass in a parameter [StartingNumber] and to count from [StartingNumber] sequentially until I find a number that is missing.
For example notice that 1, 5, 9 and 10 are missing from the table.
If I supplied the parameter [StartingNumber] = 1, it would check whether 1 exists; if it does, it would check whether 2 exists, and so on. Since 1 is missing, 1 would be returned here.
If [StartingNumber] = 6, the function would return 9.
In c# pseudo code it would basically be:
int ctr = [StartingNumber]
while([SELECT NumberTaken FROM tblNumbers Where NumberTaken = ctr] != null)
ctr++;
return ctr;
The problem with that code is that it seems really inefficient if there are thousands of numbers in the table. Also, I can write it in C# code or in a stored procedure, whichever is more efficient.
Thanks for the help
Fine, if this question isn't going to be closed, I may as well Copy and paste my answer from the other one:
I called my table Blank, and used the following:
declare @StartOffset int = 2
; With Missing as (
select @StartOffset as N where not exists(select * from Blank where ID = @StartOffset)
), Sequence as (
select @StartOffset as N from Blank where ID = @StartOffset
union all
select b.ID from Blank b inner join Sequence s on b.ID = s.N + 1
)
select COALESCE((select N from Missing),(select MAX(N)+1 from Sequence))
You basically have two cases - either your starting value is missing (so the Missing CTE will contain one row), or it's present, so you count forwards using a recursive CTE (Sequence), and take the max from that and add 1
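The two cases can be checked against the values from the question with a small harness. This sketch uses Python's standard-library sqlite3 module rather than SQL Server, and it folds the Missing case into the final COALESCE (an empty Sequence means the starting value itself is missing), so it is a variation on the query above rather than a copy of it. The function name first_missing is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Blank (ID INTEGER NOT NULL, Name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO Blank VALUES (?, 'Fred')",
    [(n,) for n in (2, 3, 4, 6, 7, 8, 11)],
)

FIRST_MISSING_SQL = """
WITH RECURSIVE Sequence(N) AS (
    -- anchor: present only when the starting value exists
    SELECT ID FROM Blank WHERE ID = :start
    UNION ALL
    -- step: follow the chain while the next number exists
    SELECT b.ID FROM Blank b JOIN Sequence s ON b.ID = s.N + 1
)
SELECT COALESCE(MAX(N) + 1, :start) FROM Sequence
"""

def first_missing(start):
    """First number >= start that is not in Blank.ID."""
    return conn.execute(FIRST_MISSING_SQL, {"start": start}).fetchone()[0]

for start in (1, 2, 6, 11):
    print(start, first_missing(start))
```

One caveat when carrying this back to SQL Server: recursive CTEs there stop with an error after 100 levels by default, so a long unbroken run of numbers would need an OPTION (MAXRECURSION n) hint on the final statement.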
Tables:
create table Blank (
ID int not null,
Name varchar(20) not null
)
insert into Blank(ID,Name)
select 2 ,'Fred' union all
select 3 ,'Fred' union all
select 4 ,'Fred' union all
select 6 ,'Fred' union all
select 7 ,'Fred' union all
select 8 ,'Fred' union all
select 11 ,'Fred'
go
I would create a temp table containing all numbers from StartingNumber to EndNumber and LEFT JOIN to it to receive the list of rows not contained in the temp table.
If NumberTaken is indexed you could do it with a join on the same table:
select T.NumberTaken -1 as MISSING_NUMBER
from myTable T
left outer join myTable T1
on T.NumberTaken= T1.NumberTaken+1
where T1.NumberTaken is null and t.NumberTaken >= STARTING_NUMBER
order by T.NumberTaken
EDIT
Edited to get 1 too
1> select 1+ID as ID from #b as b
where not exists (select 1 from #b where ID = 1+b.ID)
2> go
ID
-----------
5
9
12
Take max(1+ID) and/or add your starting value to the where clause, depending on what you actually want.