Count length of consecutive duplicate values for each id - sql

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!

First of all, we need a way to define how the rows are ordered. For example, in your sample data there is no way to be sure that the 'first' row (1, 1) will always be displayed before the 'second' row (1, 0).
That's why I have added an identity column to my sample data. In your real case, the rows can be ordered by a row ID, a date column or something else, but you need to ensure they can be sorted by a unique criterion.
So, the task is pretty simple:
calculate a switch flag - set whenever the value changes
calculate groups - a running sum of the switch flags
calculate row numbers within each group
That's it. I have used common table expressions and left all columns in so that the logic is easy to follow. You are free to break this into separate statements and remove some of the columns.
DECLARE @DataSource TABLE
(
[RowID] INT IDENTITY(1, 1)
,[ID] INT
,[value] INT
);
INSERT INTO @DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM @DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];
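For readers who want to trace the three steps outside the database, here is a minimal Python sketch of the same switch / group / row-number logic. It is only an illustration: the function name number_runs and the tuple layout are made up for this example, and the first element of each tuple stands in for the identity column.

```python
# Rows as (row_id, id, value); row_id stands in for the identity column.
ROWS = [(1, 1, 1), (2, 1, 0), (3, 1, 0), (4, 1, 1), (5, 1, 1), (6, 1, 1),
        (7, 2, 0), (8, 2, 1), (9, 2, 0), (10, 2, 0)]


def number_runs(rows):
    """Return (row_id, id, value, switch, group, group_row_id) for each row."""
    previous = {}   # id -> value of the previous row for that id (the LAG)
    group = {}      # id -> running sum of switch flags (the windowed SUM)
    position = {}   # (id, group) -> rows seen so far (the ROW_NUMBER)
    result = []
    for row_id, id_, value in sorted(rows):
        # The first row of an id has no previous value, so it always switches.
        switch = 0 if previous.get(id_) == value else 1
        group[id_] = group.get(id_, 0) + switch
        key = (id_, group[id_])
        position[key] = position.get(key, 0) + 1
        result.append((row_id, id_, value, switch, group[id_], position[key]))
        previous[id_] = value
    return result


for row in number_runs(ROWS):
    print(row)
```

The last element of each printed tuple corresponds to the GroupRowID column of the query.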

You want results that depend on the actual ordering of the rows in the data source. In SQL you operate on relations, and only sometimes on an ordered set of a relation's rows. Your desired end result is not well-defined in terms of SQL unless you introduce an additional column in your source table over which the data is ordered (e.g. an auto-increment or a timestamp column).
Note: this answers the original question and doesn't take into account additional timestamp column mentioned in the comment. I'm not updating my answer since there is already an accepted answer.

One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i id value seq
1 1 1 1
2 1 0 1
3 1 0 2
4 1 1 1
5 1 1 2
6 1 1 3
7 2 0 1
8 2 1 1
9 2 0 1
10 2 0 2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e.g. a table with an identity column in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table. In my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error, as the default for MAXRECURSION is 100. In this case you should uncomment the OPTION setting behind my SELECT clause above. You can either set it to a higher value (as shown) or switch the limit off completely by setting it to 0.
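The recursion above boils down to "look at the row before and either continue its count or restart at 1". A rough Python equivalent, assuming the rows are (i, id, value) tuples with a gap-free i (the name carry_forward is made up for this sketch):

```python
def carry_forward(rows):
    """rows: (i, id, value) tuples where i runs 1, 2, 3, ... without gaps."""
    numbered = []
    for i, id_, value in sorted(rows):
        if numbered and numbered[-1][1] == id_ and numbered[-1][2] == value:
            seq = numbered[-1][3] + 1   # same id and value as the row before
        else:
            seq = 1                     # first row, or a change of id/value
        numbered.append((i, id_, value, seq))
    return numbered


SAMPLE = [(1, 1, 1), (2, 1, 0), (3, 1, 0), (4, 1, 1), (5, 1, 1), (6, 1, 1),
          (7, 2, 0), (8, 2, 1), (9, 2, 0), (10, 2, 0)]

for row in carry_forward(SAMPLE):
    print(row)
```

Each step depends on the result of the previous one, which is why the SQL version needs one recursion level per row and can hit the MAXRECURSION limit.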

IMHO, this is easier to do with a cursor and a loop.
Maybe there is also a way to do the job with a self-join:
declare @t table (id int, val int)
insert into @t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id , val , num ) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from @t
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2

Related

sql make geometric sequence from series of bit values

I have this table:
declare @Table table (value int)
insert @Table select 0
insert @Table select 1
insert @Table select 1
insert @Table select 1
insert @Table select 0
insert @Table select 1
insert @Table select 1
Now I need to write a SELECT query that adds a column. This column should produce a geometric sequence whenever there is a series of 1s in the value column.
This would be the result:
I would phrase this as an arithmetic problem. First, your problem suggests that the ordering of rows is important. Hence, you need a column to specify the ordering. I assume there is an id column with this information.
Then to create the groups where the sequences start, do a cumulative sum of the 0s -- all the 1 are in the same group. Given the data you can express this as sum(1 - value) over (order by id).
Then just use arithmetic:
select t.*,
value * power(2, row_number() over (partition by grp order by id) - 1) as generatedsequence
from (select t.*, sum(1 - value) over (order by id) as grp
from @table t
) t;
Here is a db<>fiddle.
The arithmetic is that you want to enumerate the values in the group and then raise 2 to that power (except when value is 0). So the subquery returns:
id value grp
1 0 1
2 1 1
3 1 1
4 1 1
5 0 2
6 1 2
7 1 2
The row_number() then enumerates the values within each grp.
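To make the arithmetic concrete, here is a small Python sketch of the same two steps (running sum of 1 - value, then 2 raised to the row number minus one). The list order is assumed to be the id order, and the function name is invented for the example.

```python
from itertools import accumulate

VALUES = [0, 1, 1, 1, 0, 1, 1]   # assumed to be ordered by id 1..7

# grp = running sum of (1 - value): every 0 opens a new group
GROUPS = list(accumulate(1 - v for v in VALUES))


def generated_sequence(values, groups):
    seen = {}      # grp -> rows seen so far, i.e. row_number() within grp
    result = []
    for value, grp in zip(values, groups):
        seen[grp] = seen.get(grp, 0) + 1
        # value is 0 or 1, so multiplying zeroes out the rows that hold a 0
        result.append(value * 2 ** (seen[grp] - 1))
    return result


print(GROUPS)                               # [1, 1, 1, 1, 2, 2, 2]
print(generated_sequence(VALUES, GROUPS))   # [0, 2, 4, 8, 0, 2, 4]
```

Because the 0 row is itself the first row of its group, the first 1 after it gets row number 2 and therefore the value 2.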
OK.. first things first, in a database there is no inherent ordering of the data within a table. Therefore, to do what you want, you will need to make a field to sort/order on. In this case, I'm using an IDENTITY field called 'SortID'.
CREATE TABLE #Table (SortID int IDENTITY(1,1), BitValue bit);
INSERT INTO #Table (BitValue)
VALUES (0), (1), (1), (1), (0), (1), (1);
This gives a table with the following starting data
SortID BitValue
1 0
2 1
3 1
4 1
5 0
6 1
7 1
Now, to solve the problem
One way to do it is via a recursive CTE - where the value of the current row is based on the values of the previous rows.
However, recursive CTEs can have performance issues (they're loops, basically) so it's better to do a set-based approach if possible.
In this case, as you want a geometric sequence which is 2 to the power of the relevant row number, we don't need the previous rows to calculate this row - we only need to know the row number
The following approach
Uses a CTE to make a new field called 'GroupNum' which is used to group the rows together. Every time a row has a BitValue of 0, it increments the GroupNum by 1.
In your example, the first four rows would have GroupNum = 1, the remaining three would have GroupNum = 2
Follows the above with a window function - partitioning by those group numbers, and getting the row_number (minus one) within each group.
The final result is set as the power of a variable @a to the relevant row_number.
To match your example, I have used @a = 2 as the base for the POWER function.
DECLARE @a int;
SET @a = 2;
WITH Grouped_BitValues AS
(SELECT SortID, BitValue,
CASE WHEN BitValue = 0 THEN 1 ELSE 0 END AS NewGrpFlag,
SUM(CASE WHEN BitValue = 0 THEN 1 ELSE 0 END) OVER (ORDER BY SortID) AS GroupNum
FROM #Table
)
SELECT BitValue, POWER(@a, ROW_NUMBER() OVER (PARTITION BY GroupNum ORDER BY SortID) -1) AS Geometric_Sequence
FROM Grouped_BitValues
ORDER BY SortID;
And here are the results
BitValue Geometric_Sequence
0 1
1 2
1 4
1 8
0 1
1 2
1 4
Note that in your question, 2^0 should be 1, not 0, for a proper geometric sequence. If you want 0 instead, you'd need to wrap the Geometric_Sequence calculation in a CASE expression (e.g., CASE WHEN BitValue = 0 THEN 0 ELSE POWER(...) END AS Geometric_Sequence).
Here is a db<>fiddle with
the setup
the answer
the components of the answer (e.g., the CTE, and calculations) to demonstrate how it's calculated

SQL group three columns into one

I have a table with three columns:
[ID] [name] [link]
1 sample_name_1 sample_link_1
2 sample_name_2 sample_link_2
3 sample_name_3 sample_link_3
I need to somehow group them into one column, so the ideal result is this:
[one_column]
1
sample_name_1
sample_link_1
2
sample_name_2
sample_link_2
3
sample_name_3
sample_link_3
Does anyone have any suggestions on where to look and how to get it done in SQL Server?
You may try to use VALUES table value constructor with CROSS APPLY:
Table:
CREATE TABLE MyTable (
ID int,
name varchar(50),
link varchar(50)
)
INSERT INTO MyTable (ID, name, link)
VALUES
(1, 'sample_name_1', 'sample_link_1'),
(2, 'sample_name_2', 'sample_link_2'),
(3, 'sample_name_3', 'sample_link_3')
Statement:
SELECT v.one_column
FROM MyTable t
CROSS APPLY (VALUES
(1, CONVERT(varchar(50), ID)),
(2, CONVERT(varchar(50), name)),
(3, CONVERT(varchar(50), link))
) v (rn, one_column)
ORDER BY t.ID, v.rn
Result:
one_column
1
sample_name_1
sample_link_1
2
sample_name_2
sample_link_2
3
sample_name_3
sample_link_3
While this is something you should do in your presentation layer (i.e. your app or Website) you can do this in SQL:
select one_column
from
(
select cast(id as varchar(10)) as one_column, id as sortkey1, 1 as sortkey2 from mytable
union all
select name as one_column, id as sortkey1, 2 as sortkey2 from mytable
union all
select link as one_column, id as sortkey1, 3 as sortkey2 from mytable
) unioned
order by sortkey1, sortkey2;
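Both answers rely on the same idea: emit one output row per source column and carry two sort keys (the ID and the column's position) so the pieces stay together. Since the second answer suggests doing this in the presentation layer anyway, here is a minimal Python sketch of it. The function name and the tuple layout are assumptions made for the example.

```python
ROWS = [
    (1, "sample_name_1", "sample_link_1"),
    (2, "sample_name_2", "sample_link_2"),
    (3, "sample_name_3", "sample_link_3"),
]


def to_one_column(rows):
    keyed = []
    for id_, name, link in rows:
        # (sortkey1, sortkey2, text): the ID groups, the position orders
        keyed.append((id_, 1, str(id_)))
        keyed.append((id_, 2, name))
        keyed.append((id_, 3, link))
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [text for _, _, text in keyed]


for line in to_one_column(ROWS):
    print(line)
```

The ID has to be converted to text because a single column can only hold one type, which is what the CONVERT and CAST calls do in the SQL versions.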

SQL Server query to find where all preceding numbers are not included per each ID for a specific column

I am having a hard time trying to explain this succinctly but basically I need to query Table A for each ID number and find where in the positions column there are missing sequential numbers for each specific ID. If there is a position 7 for a certain ID, then there should be a 6, 5, 4, 3, 2, 1 position for that ID as well. Each ID can have anywhere from 1-15 position records.
Does anyone have any suggestions on the best way to go about this?
Edited to Add:
There is only one ID column, it is called GlobalID. There is only one Positions column. The end result is that I will update an Issues column with a code specific to the problem, it will populate with PositionsIncorrect for each GlobalID record where there is an incorrect sequence of numbers in the Positions column.
If you just want to identify the gaps, you can use lead() in a subquery to get the value of the next position for the same id, and then do the comparison in the outer query:
select *
from (
select
id,
position,
lead(position) over(partition by id order by position) lead_position
from tableA
) x
where lead_position is not null and lead_position != position + 1
This will return one row for each record of the same id where the next record is not in sequence, along with the position of the next record.
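As a rough illustration of what the lead() comparison does, here is the same check in Python. The data and the name find_gaps are invented for the example.

```python
from itertools import groupby

# (id, position) pairs; hypothetical sample data
RECORDS = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (2, 3), (3, 2), (3, 3)]


def find_gaps(records):
    """Return (id, position, next_position) wherever the next one is not +1."""
    gaps = []
    for id_, group in groupby(sorted(records), key=lambda rec: rec[0]):
        positions = [position for _, position in group]
        # Pair every position with its successor, like lead() over the partition
        for position, lead_position in zip(positions, positions[1:]):
            if lead_position != position + 1:
                gaps.append((id_, position, lead_position))
    return gaps


print(find_gaps(RECORDS))   # [(1, 2, 4)]
```

Note that id 3 starts at position 2 and is not reported: comparing each row with its successor only finds gaps between existing positions, so a sequence that does not start at 1 would need a separate check (for example on the minimum position per id).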
Something like this will show which positions are missing:
DECLARE @t table
(
ID int
, Position int
)
INSERT INTO @t (ID, Position)
VALUES
(1, 4)
, (1, 15)
, (2, 3)
, (2, 10)
;
WITH cte
AS
(
SELECT
ID
, MIN(Position) Position
, MAX(Position) MaxPosition
FROM @t
GROUP BY ID
UNION ALL
SELECT
ID
, Position + 1
, MaxPosition
FROM cte
WHERE Position + 1 <= MaxPosition
)
SELECT
C.ID
, C.Position
, CAST(CASE WHEN T.ID IS NULL THEN 1 ELSE 0 END AS bit) Missing
FROM
cte C
LEFT JOIN @t T ON
C.ID = T.ID
AND C.Position = T.Position
ORDER BY
ID
, Position
OPTION (MAXRECURSION 0)
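The recursive CTE generates every position between the smallest and largest one per ID and then left-joins the real rows against that list. A small Python sketch of the same idea, using the sample data from the query (the function name is made up):

```python
RECORDS = [(1, 4), (1, 15), (2, 3), (2, 10)]   # (ID, Position)


def missing_flags(records):
    """Return (id, position, missing) for every position from min to max."""
    by_id = {}
    for id_, position in records:
        by_id.setdefault(id_, set()).add(position)
    result = []
    for id_ in sorted(by_id):
        present = by_id[id_]
        # Walk min..max like the recursive member of the CTE does
        for position in range(min(present), max(present) + 1):
            result.append((id_, position, position not in present))
    return result


for row in missing_flags(RECORDS):
    print(row)
```

Like the query, this starts at the smallest existing position, so positions 1 to 3 for ID 1 are not listed. If every ID is expected to start at 1, the range would have to start at 1 instead of at the minimum.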

Remove duplicated subsets from very large table

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example so I can hopefully expand that out to what I'm working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 4
2 b 5
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c'; it doesn't matter which, since both contain Values 1,3,5. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ Million records...you can imagine it would be very slow (I was later informed of the number of records, the sample data I was given was much smaller)...:
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet)
FROM (
SELECT t1.GroupID, t1.SubSet
, NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value]) FROM @values t2 WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet ORDER BY t2.[Value] FOR XML PATH(''))
FROM @values t1
GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and Subset and concatenating all of the values into a comma delimited string...and then taking that and grouping on GroupID and Value list, and taking the MIN subset.
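For clarity, this is the same logic written as a short Python sketch (the names are made up for the illustration): build an ordered, comma-delimited signature per (GroupID, SubSet) and keep the smallest SubSet for each signature.

```python
ROWS = [
    (1, "a", 1), (1, "a", 2), (1, "a", 3),
    (1, "b", 1), (1, "b", 3), (1, "b", 5),
    (1, "c", 1), (1, "c", 3), (1, "c", 5),
    (2, "a", 1), (2, "a", 2), (2, "a", 3),
    (2, "b", 2), (2, "b", 4), (2, "b", 6),
    (2, "c", 1), (2, "c", 3), (2, "c", 6),
]


def subsets_to_keep(rows):
    """Map (group_id, value_list) -> smallest subset name with those values."""
    values = {}
    for group_id, subset, value in rows:
        values.setdefault((group_id, subset), []).append(value)
    keep = {}
    # Visiting keys in sorted order means the first subset seen is the MIN
    for group_id, subset in sorted(values):
        signature = ",".join(str(v) for v in sorted(values[(group_id, subset)]))
        keep.setdefault((group_id, signature), subset)
    return keep


for key, subset in subsets_to_keep(ROWS).items():
    print(key, subset)
```

Sorting the values before joining them matters: two subsets only count as duplicates if their ordered value lists are identical.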
I'd go with something like this:
;with cte as
(
select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
from @values v
group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
from cte c1
join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
)x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values
From Checksum_Agg:
The CHECKSUM_AGG result does not depend on the order of the rows in the table.
This is because it is a sum of the values: 1 + 2 + 3 = 3 + 2 + 1 = 3 + 3 = 6.
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
(2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
select distinct GroupId, Subset
from @Values ),
GroupConcatenatedValues as (
select GroupId, Subset, Convert( VarBinary(256), (
select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
from @Values as V
where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
order by Value
for XML Path('') ), 2 ) as GroupedBinary
from DistinctGroups as DG )
-- To see the intermediate results from the CTE you can use one of the
-- following two queries instead of the last select :
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
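The idea behind the query (order the values, pack each one into a fixed 4-byte form, concatenate, then hash) can be sketched in Python as follows. This is only an illustration: it swaps in SHA-256 from hashlib rather than MD4, the big-endian packing is an assumption meant to resemble the Binary(4) cast, and the function name is invented.

```python
import hashlib
import struct


def subset_digest(values):
    """Hash the ordered values of one subset as fixed-width 4-byte integers."""
    # Sorting first makes the digest independent of row order, while the
    # fixed width keeps e.g. [1, 23] and [12, 3] from producing equal input.
    packed = b"".join(struct.pack(">i", value) for value in sorted(values))
    return hashlib.sha256(packed).hexdigest()


print(subset_digest([1, 3, 5]) == subset_digest([5, 3, 1]))   # True
print(subset_digest([1, 3, 5]) == subset_digest([1, 2, 3]))   # False
```

Subsets with equal digests are then candidates for removal; if even a tiny collision risk is unacceptable, the packed bytes themselves can be compared instead of the digest.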
You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
select groupId,
subset,
cs = checksum_agg(value)
from @values v
group by groupId,
subset
),
maybeBadSubsets as (
select groupId,
subset,
cs,
deleteSubset =
case
when row_number() over (
partition by groupId, cs
order by subset
) > 1
then 1
end
from getChecksums
)
delete v
from @values v
where exists (
select 0
from maybeBadSubsets mbs
where v.groupId = mbs.groupId
and v.SubSet = mbs.subset
and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a performance quirk. If you find that the query engine is running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
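To see why small integers are a hard case for any order-independent aggregate, here is a Python sketch. It does not reproduce CHECKSUM_AGG (its exact algorithm is not stated in this thread); XOR and a plain sum are used purely as stand-ins, applied to the GroupID 2 subsets from the sample data.

```python
from functools import reduce
from operator import xor


def xor_agg(values):
    """Order-independent stand-in aggregate: XOR of all values."""
    return reduce(xor, values, 0)


# GroupID 2 subsets from the sample data: all three are different sets
SUBSETS = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [1, 3, 6]}

for name, values in SUBSETS.items():
    print(name, xor_agg(values), sum(values))
# a 0 6
# b 0 12
# c 4 10
```

Under XOR, subsets a and b look identical although they are not, while their sums differ. That is the motivation for pairing the checksum with a second aggregate (as the avg in an earlier answer does) or for hashing each value before aggregating.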

Sql Server Flatten Rows to Columns

I have a table that looks similar to this
RowNumber Value colIdx
1 A 1
1 Shimano Dura-Ace 2
2 B 1
2 SRAM eTap 2
3 D 1
3 Campagnolo Super Record 2
I want to flatten rows, and so far I have come up with the following
SELECT Rownumber,
stuff(
(SELECT DISTINCT ': ' + cast(value AS varchar(MAX))
FROM groupsets t2
WHERE t2.Rownumber = t1.Rownumber
FOR XML PATH('')),1,1,'')
FROM groupsets t1
GROUP BY Rownumber
ORDER BY Rownumber
However, the following is produced - I want for the single character to always prefix the value.
RowNumber Value
1 A: Shimano Dura-Ace
2 B: SRAM eTap
3 Campagnolo Super Record: D
I have created a SQL Fiddle here. I'm not sure how to order by colIdx without needing to expose it?
The expected output is:
RowNumber Value
1 A: Shimano Dura-Ace
2 B: SRAM eTap
3 D: Campagnolo Super Record
Datasets in SQL Server are never guaranteed to be returned in any specific order without using an ORDER BY clause.
If you need to guarantee that the single character will be returned first, you'll need to use an ORDER BY. For example:
SELECT Rownumber,
STUFF(CONVERT(varchar(MAX),(SELECT DISTINCT ': ' + [value] --Is the DISTINCT required here?
--Also, the CAST is not required, that goes on the outside of the SELECT, as you can see
FROM groupsets t2
WHERE t2.Rownumber = t1.Rownumber
ORDER BY LEN([value]) ASC
FOR XML PATH(''))),1,1,'')
FROM groupsets t1
GROUP BY Rownumber
ORDER BY Rownumber;
While digging a bit I found a new feature in SQL Server 2017 (and Azure). Here's a query that works using a CTE + STRING_AGG (the new feature).
WITH groupsetsOrdered AS
(
SELECT top 100000 rownumber, [value], [colIdx]
FROM groupsets
ORDER BY rownumber, colidx
)
select rownumber as [RowNumber], string_agg([value], ': ') as [Value]
from groupsetsOrdered
group by rownumber
order by rownumber
Dataset like:
CREATE TABLE groupsets
([Rownumber] varchar(1), [Value] varchar(max), [colidx] int)
;
INSERT INTO groupsets
([Rownumber], [Value], [colidx])
VALUES
('1', 'A',1),
('1', 'Shimano Dura-Ace',2),
('2', 'SRAM eTap',2),
('2', 'B',1),
('3', 'D',1),
('3', 'Campagnolo Super Record',2)
;
Result:
rownumber Value
1 A: Shimano Dura-Ace
2 B: SRAM eTap
3 D: Campagnolo Super Record
(Fiddle: http://sqlfiddle.com/#!18/707ec/9/0)
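What both answers are after is "concatenate per Rownumber, ordered by colidx, without showing colidx". As a sanity check of that intent, here is a minimal Python sketch using the dataset above (the function name is made up). In T-SQL the ordering can also be stated directly on the aggregate, since STRING_AGG accepts a WITHIN GROUP (ORDER BY ...) clause.

```python
ROWS = [
    ("1", "A", 1),
    ("1", "Shimano Dura-Ace", 2),
    ("2", "SRAM eTap", 2),
    ("2", "B", 1),
    ("3", "D", 1),
    ("3", "Campagnolo Super Record", 2),
]


def flatten(rows):
    """Join the values of each rownumber with ': ', ordered by colidx."""
    grouped = {}
    for rownumber, value, colidx in rows:
        grouped.setdefault(rownumber, []).append((colidx, value))
    return [
        # colidx is only used for sorting and never reaches the output
        (rownumber, ": ".join(value for _, value in sorted(grouped[rownumber])))
        for rownumber in sorted(grouped)
    ]


for row in flatten(ROWS):
    print(row)
```

Ordering by colidx is more robust than ordering by the length of the value, which only works as long as the prefix is always the shorter string.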