Remove duplicated subsets from very large table - SQL

The data I'm working with is fairly complicated, so I'm just going to provide a simpler example that I can hopefully expand to fit what I'm actually working on.
Note: I've already found a way to do it, but it's extremely slow and not scalable. It works great on small datasets, but if I applied it to the actual tables it needs to run on, it would take forever.
I need to remove entire duplicate subsets of data within a table. Removing duplicate rows is easy, but I'm stuck finding an efficient way to remove duplicate subsets.
Example:
GroupID Subset Value
------- ---- ----
1 a 1
1 a 2
1 a 3
1 b 1
1 b 3
1 b 5
1 c 1
1 c 3
1 c 5
2 a 1
2 a 2
2 a 3
2 b 2
2 b 4
2 b 6
2 c 1
2 c 3
2 c 6
So in this example, from GroupID 1, I would need to remove either subset 'b' or subset 'c', doesn't matter which since both contain Values 1,2,3. For GroupID 2, none of the sets are duplicated, so none are removed.
Here's the code I used to solve this on a small scale. It works great, but when applied to 10+ million records you can imagine how slow it would be (I was only told the real record count later; the sample data I was given was much smaller):
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6)
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value]
SELECT x.GroupID, x.NameValues, MIN(x.SubSet) AS KeptSubSet
FROM (
    SELECT t1.GroupID, t1.SubSet
         , NameValues = (SELECT ',' + CONVERT(VARCHAR(10), t2.[Value])
                         FROM @values t2
                         WHERE t1.GroupID = t2.GroupID AND t1.SubSet = t2.SubSet
                         ORDER BY t2.[Value]
                         FOR XML PATH(''))
    FROM @values t1
    GROUP BY t1.GroupID, t1.SubSet
) x
GROUP BY x.GroupID, x.NameValues
All I'm doing here is grouping by GroupID and SubSet and concatenating all of the values into a comma-delimited string... and then taking that, grouping on GroupID and the value list, and taking the MIN SubSet.
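(Side note for anyone on SQL Server 2017 or newer: STRING_AGG can build the same comma-delimited list without the FOR XML PATH trick. A sketch against the same @values table:)
SELECT x.GroupID, x.NameValues, MIN(x.SubSet) AS KeptSubSet
FROM (
    SELECT v.GroupID, v.SubSet,
           -- STRING_AGG concatenates the ordered values directly
           STRING_AGG(CONVERT(VARCHAR(10), v.[Value]), ',') WITHIN GROUP (ORDER BY v.[Value]) AS NameValues
    FROM @values v
    GROUP BY v.GroupID, v.SubSet
) x
GROUP BY x.GroupID, x.NameValues;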

I'd go with something like this:
;with cte as
(
    select v.GroupID, v.SubSet, checksum_agg(v.Value) h, avg(v.Value) a
    from @values v
    group by v.GroupID, v.SubSet
)
delete v
from @values v
join
(
    select c1.GroupID, case when c1.SubSet > c2.SubSet then c1.SubSet else c2.SubSet end SubSet
    from cte c1
    join cte c2 on c1.GroupID = c2.GroupID and c1.SubSet <> c2.SubSet and c1.h = c2.h and c1.a = c2.a
) x on v.GroupID = x.GroupID and v.SubSet = x.SubSet
select *
from @values

From the CHECKSUM_AGG documentation:
The CHECKSUM_AGG result does not depend on the order of the rows in the table.
This is because it is a sum of the values, so the order doesn't matter: 1 + 2 + 3 = 3 + 2 + 1 = 6. It also means that different sets can collide: 3 + 3 = 6 as well.
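You can see the order-insensitivity for yourself with a couple of one-liners:
-- Same multiset in a different order: both statements return the same aggregate.
SELECT CHECKSUM_AGG(v) FROM (VALUES (1),(2),(3)) t(v);
SELECT CHECKSUM_AGG(v) FROM (VALUES (3),(2),(1)) t(v);
Different multisets can still collide, which is why the answer above pairs checksum_agg with avg() as a second check.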
HashBytes is designed to produce a different value for two inputs that differ only in the order of the bytes, as well as other differences. (There is a small possibility that two inputs, perhaps of wildly different lengths, could hash to the same value. You can't take an arbitrary input and squeeze it down to an absolutely unique 16-byte value.)
The following code demonstrates how to use HashBytes to return a hash for each GroupId/Subset.
-- Thanks for the sample data!
DECLARE @values TABLE (GroupID INT NOT NULL, SubSet VARCHAR(1) NOT NULL, [Value] INT NOT NULL)
INSERT INTO @values (GroupID, SubSet, [Value])
VALUES (1,'a',1),(1,'a',2),(1,'a',3) ,(1,'b',1),(1,'b',3),(1,'b',5) ,(1,'c',1),(1,'c',3),(1,'c',5),
       (2,'a',1),(2,'a',2),(2,'a',3) ,(2,'b',2),(2,'b',4),(2,'b',6) ,(2,'c',1),(2,'c',3),(2,'c',6);
SELECT *
FROM @values v
ORDER BY v.GroupID, v.SubSet, v.[Value];
with
DistinctGroups as (
    select distinct GroupId, Subset
    from @values ),
GroupConcatenatedValues as (
    select GroupId, Subset, Convert( VarBinary(256), (
        select Convert( VarChar(8000), Cast( Value as Binary(4) ), 2 ) AS [text()]
        from @values as V
        where V.GroupId = DG.GroupId and V.SubSet = DG.SubSet
        order by Value
        for XML Path('') ), 2 ) as GroupedBinary
    from DistinctGroups as DG )
-- To see the intermediate results from the CTEs, run one of the
-- following two queries instead of the final select:
-- select * from DistinctGroups;
-- select * from GroupConcatenatedValues;
select GroupId, Subset, GroupedBinary, HashBytes( 'MD4', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;
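MD4 is presumably chosen here for speed, but all HashBytes algorithms other than SHA2_256 and SHA2_512 are deprecated from SQL Server 2016 onward (they still work). If that's a concern, SHA2_256 is a drop-in replacement in the final select:
select GroupId, Subset, GroupedBinary, HashBytes( 'SHA2_256', GroupedBinary ) as Hash
from GroupConcatenatedValues
order by GroupId, Subset;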

You can use checksum_agg() over a set of rows. If the checksums are the same, this is strong evidence that the 'values' columns are equal within the grouped fields.
In the 'getChecksums' cte below, I group by the group and subset, with a checksum based on your 'value' column.
In the 'maybeBadSubsets' cte, I put a row_number over each aggregation just to identify the 2nd+ row in the event the checksums match.
Finally, I delete any subgroups so identified.
with
getChecksums as (
    select groupId,
           subset,
           cs = checksum_agg(value)
    from @values v
    group by groupId,
             subset
),
maybeBadSubsets as (
    select groupId,
           subset,
           cs,
           deleteSubset =
               case
                   when row_number() over (
                            partition by groupId, cs
                            order by subset
                        ) > 1
                   then 1
               end
    from getChecksums
)
delete v
from @values v
where exists (
    select 0
    from maybeBadSubsets mbs
    where v.groupId = mbs.groupId
      and v.SubSet = mbs.subset
      and mbs.deleteSubset = 1
);
I don't know what the exact likelihood is for checksums to match. If you're not comfortable with the false positive rate, you can still use it to eliminate some branches in a more algorithmic approach in order to vastly improve performance.
Note: CTEs can have a quirk performance-wise. If you find that the query engine is re-running 'maybeBadSubsets' for each row of @values, you may need to put its results into a temp table or table variable before using it. But I believe with 'exists' you're okay as far as that goes.
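On the point about using the checksum only to prune the work: one hedged pattern is to treat equal checksums purely as candidates and verify each candidate pair with an exact set comparison (EXCEPT in both directions) before deleting. A sketch along those lines:
with getChecksums as (
    select groupId, subset, cs = checksum_agg(value)
    from @values
    group by groupId, subset
),
candidatePairs as (
    -- only pairs with equal checksums are considered at all
    select c1.groupId, keepSubset = c1.subset, dropSubset = c2.subset
    from getChecksums c1
    join getChecksums c2
      on c1.groupId = c2.groupId and c1.cs = c2.cs and c1.subset < c2.subset
)
delete v
from @values v
where exists (
    select 1
    from candidatePairs cp
    where cp.groupId = v.groupId
      and cp.dropSubset = v.subset
      -- exact verification: neither subset contains a value the other lacks
      and not exists (select [Value] from @values a
                      where a.groupId = cp.groupId and a.subset = cp.keepSubset
                      except
                      select [Value] from @values b
                      where b.groupId = cp.groupId and b.subset = cp.dropSubset)
      and not exists (select [Value] from @values b
                      where b.groupId = cp.groupId and b.subset = cp.dropSubset
                      except
                      select [Value] from @values a
                      where a.groupId = cp.groupId and a.subset = cp.keepSubset)
);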
EDIT:
I didn't catch it, but as the OP noticed, checksum_agg seems to perform very poorly in terms of false hits/misses. I suspect it might be due to the simplicity of the input. I changed
cs = checksum_agg(value)
above to
cs = checksum_agg(convert(int,hashbytes('md5', convert(char(1),value))))
and got better results. But I don't know how it would perform on larger datasets.
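One more caveat with that expression: convert(char(1), value) only survives single-digit values, because SQL Server renders an int that doesn't fit its target char length as '*', so every value of 10 or more would feed an identical input to the hash. Converting to binary(4) instead sidesteps that (an untested variation on the same idea):
cs = checksum_agg(convert(int, hashbytes('md5', convert(binary(4), value))))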

Related

Finding the id's which include multiple criteria in long format

Suppose I have a table like this:
id tagId
-- -----
1  1
1  2
1  5
2  1
2  5
3  2
3  4
3  5
3  8
I want to select ids where tagId includes both 2 and 5. For this fake data set, it should return 1 and 3.
I tried,
select id from [dbo].[mytable] where tagId IN(2,5)
But it matches rows having either 2 or 5, not ids having both. I also did not want to keep my table in wide format, since tagId is dynamic; it can reach any number of columns. I also considered filtering with two different queries and (somehow) finding their intersection, but since I may search for more than two values inside tagId in real life, that sounds inefficient to me.
I am sure that this is something faced before when tag searching. What do you suggest? Changing table format?
One option is to count the number of distinct tagIds (from the ones you're looking for) each id has:
SELECT id
FROM [dbo].[mytable]
WHERE tagId IN (2,5)
GROUP BY id
HAVING COUNT(DISTINCT tagId) = 2
This is actually a Relational Division With Remainder question.
First, you have to place your input into proper table format. I suggest you use a Table Valued Parameter if executing from client code. You can also use a temp table or table variable.
DECLARE @ids TABLE (tagId int PRIMARY KEY);
INSERT @ids VALUES (2), (5);
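If you go the TVP route from client code instead, the table type would look something like this (type and procedure names are just illustrative):
CREATE TYPE dbo.TagIdList AS TABLE (tagId int PRIMARY KEY);
-- a procedure can then accept the list as a READONLY parameter:
-- CREATE PROCEDURE dbo.FindIdsWithAllTags @ids dbo.TagIdList READONLY AS ...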
There are a number of different solutions to this type of question.
Classic double-negative EXISTS
SELECT DISTINCT
       mt.Id
FROM mytable mt
WHERE NOT EXISTS (SELECT 1
                  FROM @ids i
                  WHERE NOT EXISTS (SELECT 1
                                    FROM mytable mt2
                                    WHERE mt2.id = mt.id
                                      AND mt2.tagId = i.tagId)
);
This is not usually efficient though.
Comparing to the total number of IDs to match
SELECT mt.id
FROM mytable mt
JOIN @ids i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = (SELECT COUNT(*) FROM @ids);
This is much more efficient. You can also do this using a window function; it may be more or less efficient, YMMV.
SELECT mt.Id
FROM mytable mt
JOIN (
    SELECT *,
           total = COUNT(*) OVER ()
    FROM @ids i
) i ON i.tagId = mt.tagId
GROUP BY mt.id
HAVING COUNT(*) = MIN(i.total);
Another solution involves cross-joining everything and checking how many matches there are using conditional aggregation:
SELECT mt.id
FROM (
    SELECT
        mt.id,
        mt.tagId,
        matches = SUM(CASE WHEN i.tagId = mt.tagId THEN 1 END),
        total = COUNT(*)
    FROM mytable mt
    CROSS JOIN @ids i
    GROUP BY
        mt.id,
        mt.tagId
) mt
GROUP BY mt.id
HAVING SUM(matches) = MIN(total)
   AND MIN(matches) >= 0;
db<>fiddle
There are other solutions too; see High Performance Relational Division in SQL Server.

Count length of consecutive duplicate values for each id

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!
First of all, we need a way to define how the rows are ordered. For example, in your sample data there is no way to be sure that the 'first' row (1, 1) will always be displayed before the 'second' row (1, 0).
That's why in my sample data I have added an identity column. In your real case the details can be ordered by a row ID, a date column or something else, but you need to ensure the rows can be sorted by some unique criteria.
So, the task is pretty simple:
1. calculate a trigger switch - set when the value changes
2. calculate groups from the running total of switches
3. calculate row numbers within each group
That's it. I have used common table expressions and left all columns in, to make the logic easy to follow. You are free to break this into separate statements and remove some of the columns.
DECLARE @DataSource TABLE
(
    [RowID] INT IDENTITY(1, 1)
   ,[ID] INT
   ,[value] INT
);
INSERT INTO @DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM @DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];
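The same [Group] can also be derived with the classic 'difference of row numbers' gaps-and-islands trick, which saves one window pass (a sketch against the same @DataSource):
WITH Islands AS
(
    SELECT *
          -- rows of one uninterrupted value-run share the same RowID-minus-rank difference
          ,[RowID] - ROW_NUMBER() OVER (PARTITION BY [ID], [value] ORDER BY [RowID]) AS [Island]
    FROM @DataSource
)
SELECT [RowID], [ID], [value]
      ,ROW_NUMBER() OVER (PARTITION BY [ID], [value], [Island] ORDER BY [RowID]) AS [GroupRowID]
FROM Islands
ORDER BY [RowID];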
You want results that depend on the actual ordering of rows in the data source. In SQL you operate on relations, sometimes on ordered sets of rows. Your desired end result is not well-defined in terms of SQL unless you introduce an additional column in your source table over which your data is ordered (e.g. an auto-increment or timestamp column).
Note: this answers the original question and doesn't take into account additional timestamp column mentioned in the comment. I'm not updating my answer since there is already an accepted answer.
One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i id value seq
1 1 1 1
2 1 0 1
3 1 0 2
4 1 1 1
5 1 1 2
6 1 1 3
7 2 0 1
8 2 1 1
9 2 0 1
10 2 0 2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e.g. a table with an identity column in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table; in my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error as the default for MAXRECURSION is 100. In this case you should uncomment the OPTION setting behind my SELECT clause above. You can either set it to a higher value (like shown) or switch it off completely by setting it to 0.
IMHO, this is easier to do with a cursor and a loop, but maybe there is a way to do the job with a self-join:
declare @t table (id int, val int)
insert into @t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id , val , num ) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from @t
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2

I need to be able to generate non-repetitive 8 character random alphanumeric for 2.5 million records

I need to be able to apply unique 8 character strings per row on a table that has almost 2.5 million records.
I have tried this:
UPDATE MyTable
SET [UniqueID]=SUBSTRING(CONVERT(varchar(255), NEWID()), 1, 8)
This works, but when I check the uniqueness of the IDs, I find duplicates:
SELECT [UniqueID], COUNT([UniqueID])
FROM NicoleW_CQ_2019_Audi_CR_Always_On_2019_T1_EM
GROUP BY [UniqueID]
HAVING COUNT([UniqueID]) > 1
I really would just like to update the table, as above, with just a simple line of code, if possible.
Here's a way that uses a temporary table to assure uniqueness.
Create and fill a temporary table with unique random 8-character codes.
The SQL below uses a FOR XML trick to generate the codes in BASE62: [A-Za-z0-9]
Examples: 8Phs7ZYl, ugCKtPqT, U9soG39q
A GUID only uses the characters [0-9A-F].
For 8 characters that can generate 16^8 = 4294967296 combinations.
With BASE62 there are 62^8 = 2.183401056e14 combinations.
So the odds that a duplicate is generated are significantly lower with BASE62.
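(A quick birthday-problem estimate: drawing n = 2.5 million codes from N = 62^8 ≈ 2.18e14 possibilities, the expected number of collisions is roughly n^2 / (2N) = (2.5e6)^2 / (2 × 2.18e14) ≈ 0.014, so the DISTINCT in the insert below only rarely has anything to discard.)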
The temp table should have an equal or larger number of records than the destination table.
This example only generates 100000 codes. But you get the idea.
IF OBJECT_ID('tempdb..#tmpRandoms') IS NOT NULL DROP TABLE #tmpRandoms;
CREATE TABLE #tmpRandoms (
ID INT PRIMARY KEY IDENTITY(1,1),
[UniqueID] varchar(8),
CONSTRAINT UC_tmpRandoms_UniqueID UNIQUE ([UniqueID])
);
WITH DIGITS AS
(
select n
from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)
),
NUMS AS
(
select (d5.n*10000 + d4.n*1000 + d3.n*100 + d2.n * 10 + d1.n) as n
from DIGITS d1
cross join DIGITS d2
cross join DIGITS d3
cross join DIGITS d4
cross join DIGITS d5
)
INSERT INTO #tmpRandoms ([UniqueID])
SELECT DISTINCT LEFT(REPLACE(REPLACE((select CAST(NEWID() as varbinary(16)), n FOR XML PATH(''), BINARY BASE64),'+',''),'/',''), 8) AS [UniqueID]
FROM NUMS;
Then update your table with it
WITH CTE AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, [UniqueID]
FROM YourTable
)
UPDATE t
SET t.[UniqueID] = tmp.[UniqueID]
FROM CTE t
JOIN #tmpRandoms tmp ON tmp.ID = t.RN;
A test on rextester here
Can you just use numbers and assign a randomish value?
with toupdate as (
select t.*,
row_number() over (order by newid()) as random_enough
from mytable t
)
update toupdate
set UniqueID = right(concat('00000000', random_enough), 8);
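Unlike the GUID-substring approach, uniqueness here is guaranteed by construction: ROW_NUMBER() hands out each number exactly once, and NEWID() only shuffles which row receives which number. The question's own duplicate check should come back empty afterwards:
SELECT [UniqueID], COUNT([UniqueID])
FROM MyTable
GROUP BY [UniqueID]
HAVING COUNT([UniqueID]) > 1; -- no rows expected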
See: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/a289ed64-2038-415e-9f5d-ae84e50fe702/generate-random-string-of-length-5-az09?forum=transactsql
Alter DECLARE @s char(5) and SELECT TOP (5) c1 in that code to get the length you want.

How can I get a random number generated in a CTE not to change in JOIN?

The problem
I'm generating a random number for each row in a table #Table_1 in a CTE, using this technique. I'm then joining the results of the CTE on another table, #Table_2. Instead of getting a random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!
CREATE TABLE #Table_1 (Id INT)
CREATE TABLE #Table_2 (MyId INT, ParentId INT)
INSERT INTO #Table_1
VALUES (1), (2), (3)
INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)
;WITH RandomCTE AS
(
SELECT Id, (ABS(CHECKSUM(NEWID())) % 5) AS RandomNumber
FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
ON r.Id = t.ParentId
The results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 2
1 3 0
1 4 3
2 1 4
2 2 0
2 3 0
3 1 3
The desired results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 1
1 3 1
1 4 1
2 1 4
2 2 4
2 3 4
3 1 3
What I tried
I tried to obscure the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.
What I don't want to do
I'd like to avoid using a temporary table to store the results of the CTE.
How can I generate a random number for a table and preserve that random number in a join without using temporary storage?
This seems to do the trick:
WITH CTE AS(
SELECT Id, (ABS(CHECKSUM(NEWID())) % 5) AS RandomNumber
FROM #Table_1),
RandomCTE AS(
SELECT Id,
RandomNumber
FROM CTE
GROUP BY ID, RandomNumber)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
ON r.Id = t.ParentId;
It looks like SQL Server is aware that, outside the CTE, RandomNumber is effectively just NEWID() with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a new value for each row it is referenced in. The GROUP BY clause in the second CTE therefore forces the engine to materialise a value for RandomNumber so that it can perform the grouping.
Per the quote in this answer
The optimizer does not guarantee timing or number of executions of
scalar functions. This is a long-established tenet. It's the
fundamental 'leeway' that allows the optimizer enough freedom to gain
significant improvements in query-plan execution.
If it is important for your application that the random number be evaluated once and only once you should calculate it up front and store it into a temp table.
Anything else is not guaranteed and so is irresponsible to add into your application's code base - as even if it works now it may break as a result of a schema change/execution plan change/version upgrade/CU install.
For example, Lamu's answer breaks if a unique index is added to #Table_1 (Id).
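For completeness, the up-front calculation recommended above is only a couple of statements; a sketch using a scratch temp table (the #RandomNumbers name is just illustrative):
-- compute the random number once per row and persist it
SELECT Id, ABS(CHECKSUM(NEWID())) % 5 AS RandomNumber
INTO #RandomNumbers
FROM #Table_1;
-- the join now sees stable, already-materialized values
SELECT r.Id, t.MyId, r.RandomNumber
FROM #RandomNumbers r
INNER JOIN #Table_2 t ON r.Id = t.ParentId;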
How about not using a real random number at all? Use rand() with a seed:
WITH RandomCTE AS (
SELECT Id,
CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) as RandomNumber
FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;
The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.
Here is the db<>fiddle.

How to do this data transformation

This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each activity sequence in a group, I want to find pairs of Actions where NextAction.Serial = FirstAction.Serial + k, and count how many times each pair happens.
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given that the input table contains millions of entries?
tful, this should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
  select
    GroupId, Serial, 'a', "Action"
  from Actions as A
  union all
  select
    GroupId, Serial-@k, 'b', "Action"
  from Actions as B
), Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
select
  a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various @k values on stable data, this may work better in the long run:
declare @k int = 1;
select
  GroupId, Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select
  GroupId, Serial-@k, 'b', "Action"
from Actions as B;
go
create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go
with Pivoted(GroupId,Serial,a,b) as (
  select GroupId,Serial,a,b
  from ActionsDoubled
  pivot (
    max("Action") for Tag in ([a],[b])
  ) as P
)
select
  a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
  ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;
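For the common k = 1 case specifically, a LEAD()-based variant avoids the self-join entirely. Note this sketch only works for k = 1, since LEAD looks exactly one row ahead in serial order; for general k the join above is still the way to go:
WITH NextActions AS (
    SELECT GroupId, Serial, Action,
           LEAD(Serial) OVER (PARTITION BY GroupId ORDER BY Serial) AS NextSerial,
           LEAD(Action) OVER (PARTITION BY GroupId ORDER BY Serial) AS NextAction
    FROM Activities
)
SELECT Action AS FirstAction, NextAction, COUNT(*) AS Frequency
FROM NextActions
WHERE NextSerial = Serial + 1 -- discards pairs that cross a gap in Serial
GROUP BY Action, NextAction;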
The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood and when the query has to do a scan of the entire table, this actually ends up being many times slower than if you did small chunks which effectively cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
    a1.Action AS FirstAction,
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM
    Activities a1 JOIN Activities a2
    ON (a1.groupid = a2.groupid
        AND a2.Serial = a1.Serial + @k)
WHERE
    a1.groupid = 1
GROUP BY
    a1.Action,
    a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
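If not, something along these lines covers both sides of the self-join (index name is just illustrative):
CREATE INDEX IX_Activities_GroupId_Serial
    ON Activities (GroupId, Serial)
    INCLUDE (Action);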