I have a table with repeated codes. I need to clean the table by removing the repeats, while keeping at least one of them in the table.
My table is this:
FriendlyFunctionCode MemberFirmId FunctionLevel3Desc
1 Value1 Value2
1 Value2 Value3
2 Value4 Value5
I need something like this: (It doesn't matter which row is left, just to have at least one)
FriendlyFunctionCode MemberFirmId FunctionLevel3Desc
1 Value1 Value2
2 Value4 Value5
I have this query, but the performance is awful
SELECT MemberFirmId, FriendlyFunctionCode
INTO #ToDeleteRepeated
FROM [dbo].[FirmFunction]
GROUP BY MemberFirmId, FriendlyFunctionCode
HAVING COUNT(1) > 1
DECLARE @Code VARCHAR(100), @Desc VARCHAR(250)
WHILE ((SELECT COUNT(1) FROM #ToDeleteRepeated) > 0)
BEGIN
SELECT TOP 1 @Code = FriendlyFunctionCode FROM #ToDeleteRepeated
WHILE ((SELECT COUNT(1) FROM [FirmFunction] WHERE FriendlyFunctionCode = @Code) > 0)
BEGIN
SELECT TOP 1 @Desc = FunctionLevel3Desc FROM [FirmFunction] WHERE FriendlyFunctionCode = @Code
DELETE FROM [FirmFunction] WHERE FriendlyFunctionCode = @Code AND FunctionLevel3Desc = @Desc
END
END
Any suggestions?
WITH CTE AS (SELECT MemberFirmId, FriendlyFunctionCode,
ROW_NUMBER() over (PARTITION by FriendlyFunctionCode ORDER BY FriendlyFunctionCode ) AS RN
FROM [dbo].[FirmFunction]
)
DELETE CTE WHERE CTE.RN >1
Delete using CTE with row_number()
;with cte as (
select *, row_number() over(partition by friendlyfunctioncode order by memberfirmid) rn
from deletingtable)
delete from cte where rn > 1
This executes with the execution plan below:
Table/Clustered Index Scan --> Sort (if there is no index) --> Segment --> Sequence Project --> Filter, and then the Delete.
If there is a proper index on FriendlyFunctionCode, it executes faster, in a single scan.
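For reference, a supporting index might look something like this (just a sketch; the index name is an assumption, and whether it is worth creating depends on your workload):
CREATE NONCLUSTERED INDEX IX_FirmFunction_FriendlyFunctionCode
ON dbo.FirmFunction (FriendlyFunctionCode);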
You could use a windowing function like this. It saves having to use a cursor (cursors don't perform well in SQL Server). You can run the inner select on its own to see what it's doing with the row number.
Test Data
CREATE TABLE #TestData (FriendlyFunctionCode int, MemberFirmId nvarchar(10), FunctionLevel3Desc nvarchar(10))
INSERT INTO #TestData
VALUES
(1,'Value1','Value2')
,(1,'Value2','Value3')
,(2,'Value4','Value5')
Query
SELECT
a.FriendlyFunctionCode
,a.MemberFirmId
,a.FunctionLevel3Desc
INTO #SavedData
FROM
(
SELECT
FriendlyFunctionCode
,MemberFirmId
,FunctionLevel3Desc
,ROW_NUMBER() OVER(PARTITION BY FriendlyFunctionCode ORDER BY FriendlyFunctionCode) RowNum
FROM #TestData
) a
WHERE a.RowNum = 1
TRUNCATE TABLE #TestData
INSERT INTO #TestData (FriendlyFunctionCode, MemberFirmId, FunctionLevel3Desc)
SELECT
FriendlyFunctionCode
,MemberFirmId
,FunctionLevel3Desc
FROM #SavedData
DROP TABLE #SavedData
Result
FriendlyFunctionCode MemberFirmId FunctionLevel3Desc
1 Value1 Value2
2 Value4 Value5
You can just use MAX and group on the FunctionCode.
SELECT
FriendlyFunctionCode,
MAX(MemberFirmId) as MemberFirmId,
MAX(FunctionLevel3Desc) as FunctionLevel3Desc
INTO #StagingTable
FROM
FirmFunction
GROUP BY
FriendlyFunctionCode
Then truncate your table and select back into it... or just create a new table altogether and insert the distinct (MAX) records into it.
TRUNCATE TABLE FirmFunction
INSERT INTO FirmFunction (FriendlyFunctionCode,MemberFirmId,FunctionLevel3Desc)
SELECT * FROM #StagingTable
This is less safe than creating a table (FirmFunction2, for example) with the same schema as your original, inserting into it, and then renaming it....
SELECT TOP 1 * INTO FirmFunction2 FROM FirmFunction WHERE 1=0
INSERT INTO FirmFunction2 (FriendlyFunctionCode, MemberFirmId, FunctionLevel3Desc)
SELECT
FriendlyFunctionCode,
MAX(MemberFirmId) as MemberFirmId,
MAX(FunctionLevel3Desc) as FunctionLevel3Desc
FROM
FirmFunction
GROUP BY
FriendlyFunctionCode
Then you can check the data in FirmFunction2 and, if you are satisfied, rename it after dropping the other table.
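A minimal sketch of that last step (assuming nothing else, such as foreign keys or views, depends on the original table):
BEGIN TRANSACTION;
DROP TABLE dbo.FirmFunction;
EXEC sp_rename 'dbo.FirmFunction2', 'FirmFunction';
COMMIT TRANSACTION;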
Hello dear Stackoverflow SQL gurus.
Using this simple data model:
create table test(Id INT, Field1 char(1), Field2 varchar(max));
insert into test (id, Field1) values (1, 'a');
insert into test (id, Field1) values (2, 'b');
insert into test (id, Field1) values (3, 'c');
insert into test (id, Field1) values (4, 'd');
I'm able to update Field2 with the previous row's Field2 value concatenated with the current Id and Field1, using a simple T-SQL anonymous block like this:
BEGIN
DECLARE @CurrentId INT;
DECLARE @CurrentField1 char(1);
DECLARE @Field2 varchar(max) = NULL;
DECLARE cur CURSOR FOR
SELECT id, Field1
FROM test
ORDER BY id;
OPEN cur
FETCH NEXT FROM cur INTO @CurrentId, @CurrentField1;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @Field2 = CONCAT(@Field2, @CurrentId, @CurrentField1);
UPDATE test
SET Field2 = @Field2
WHERE Id = @CurrentId;
FETCH NEXT FROM cur INTO @CurrentId, @CurrentField1;
END
CLOSE cur;
DEALLOCATE cur;
END
GO
Giving me the desired result:
select * from test;
Id Field1 Field2
1 a 1a
2 b 1a2b
3 c 1a2b3c
4 d 1a2b3c4d
I want to achieve the same result with a single UPDATE command to avoid the CURSOR.
I thought it was possible with the LAG() function:
UPDATE test set Field2 = NULL; --reset data
UPDATE test
SET Field2 = NewValue.NewField2
FROM (
SELECT CONCAT(Field2, Id, ISNULL(LAG(Field2,1) OVER (ORDER BY Id), '')) AS NewField2,
Id
FROM test
) NewValue
WHERE test.Id = NewValue.Id;
But this gives me this:
select * from test;
Id Field1 Field2
1 a 1
2 b 2
3 c 3
4 d 4
Field2 is not correctly updated with Id+Field1+(previous Field2).
The update result makes sense to me, because when the LAG() function re-selects the value from the table, that value has not yet been updated.
Do you think there is a way to do this with a single SQL statement?
One method is with a recursive Common Table Expression (rCTE) to iterate through the data. This assumes that all values of Id are sequential:
WITH rCTE AS(
SELECT Id,
Field1,
CONVERT(varchar(MAX),CONCAT(ID,Field1)) AS Field2
FROM dbo.test
WHERE ID = 1
UNION ALL
SELECT t.Id,
t.Field1,
CONVERT(varchar(MAX),CONCAT(r.Field2,t.Id,t.Field1)) AS Field2
FROM dbo.test t
JOIN rCTe r ON t.id = r.Id + 1)
SELECT *
FROM rCTe;
If they aren't sequential, you can use a CTE to row number the rows first:
WITH RNs AS(
SELECT Id,
Field1,
ROW_NUMBER() OVER (ORDER BY ID) AS RN
FROM dbo.Test),
rCTE AS(
SELECT Id,
Field1,
CONVERT(varchar(MAX),CONCAT(ID,Field1)) AS Field2,
RN
FROM RNs
WHERE ID = 1
UNION ALL
SELECT RN.Id,
RN.Field1,
CONVERT(varchar(MAX),CONCAT(r.Field2,RN.Id,RN.Field1)) AS Field2,
RN.RN
FROM RNs RN
JOIN rCTe r ON RN.RN = r.RN + 1)
SELECT Id,
Field1,
Field2
FROM rCTe;
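One caveat with the recursive approach: SQL Server's default recursion limit is 100, so for more than 100 rows you would likely need to raise it on the outer query, for example:
SELECT Id,
Field1,
Field2
FROM rCTe
OPTION (MAXRECURSION 0);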
Unfortunately, SQL Server does not (yet) support string_agg() as a window function.
Instead, you can use cross apply to calculate the values:
select t.*, t2.new_field2
from test t cross apply
(select string_agg(concat(id, field1), '') within group (order by id) as new_field2
from test t2
where t2.id <= t.id
) t2;
For an update:
with toupdate as (
select t.*, t2.new_field2
from test t cross apply
(select string_agg(concat(id, field1), '') within group (order by id) as new_field2
from test t2
where t2.id <= t.id
) t2
)
update toupdate
set field2 = new_field2;
Here is a db<>fiddle.
Note: This works for small tables, but it would not be optimal on large tables. But then again, on large tables, the string would quickly become unwieldy.
I've got a SQL Server db with quite a few dupes in it. Removing the dupes manually is just not going to be fun, so I was wondering if there is any sort of sql programming or scripting I can do to automate it.
Below is my query that returns the ID and the Code of the duplicates.
select a.ID, a.Code
from Table1 a
inner join (
SELECT Code
FROM Table1 GROUP BY Code HAVING COUNT(Code)>1)
x on x.Code= a.Code
I'll get a return like this, for example:
5163 51727
5164 51727
5165 51727
5166 51728
5167 51728
5168 51728
This snippet shows three returns for each ID/Code (so a primary "good" record and two dupes). However, this isn't always the case. There can be up to [n] dupes, although 2-3 seems to be the norm.
I just want to somehow loop through this result set and delete everything but one record. THE RECORDS TO DELETE ARE ARBITRARY, as any of them can be "kept".
You can use row_number to drive your delete.
ie
CREATE TABLE #table1
(id INT,
code int
);
WITH cte AS
(select a.ID, a.Code, ROW_NUMBER() OVER(PARTITION BY Code ORDER BY ID) AS rn
from #Table1 a
)
DELETE x
FROM #table1 x
JOIN cte ON x.id = cte.id
WHERE cte.rn > 1
But...
If you are going to be doing a lot of deletes from a very large table you might be better off to select out the rows you need into a temp table & then truncate your table and re-insert the rows you need.
It keeps the transaction log from getting hammered, keeps your clustered index from getting fragmented, and should be quicker too!
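As a rough sketch of that truncate-and-reload idea (column names are taken from the question; a real table would likely have more columns to carry across, and if ID is an IDENTITY column you would also need SET IDENTITY_INSERT):
SELECT ID, Code
INTO #KeepRows
FROM (
    SELECT ID, Code,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY ID) AS rn
    FROM Table1
) s
WHERE rn = 1;

TRUNCATE TABLE Table1;

INSERT INTO Table1 (ID, Code)
SELECT ID, Code FROM #KeepRows;

DROP TABLE #KeepRows;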
It is actually very simple:
DELETE FROM Table1
WHERE ID NOT IN
(SELECT MAX(ID)
FROM Table1
GROUP BY CODE)
Self-join solution with a performance test vs. the CTE approach.
create table codes(
id int IDENTITY(1,1) NOT NULL,
code int null,
CONSTRAINT [PK_codes_id] PRIMARY KEY CLUSTERED
(
id ASC
))
declare @counter int, @code int
set @counter = 1
set @code = 1
while (@counter <= 1000000)
begin
print ABS(Checksum(NewID()) % 1000)
insert into codes(code) select ABS(Checksum(NewID()) % 1000)
set @counter = @counter + 1
end
GO
set statistics time on;
delete a
from codes a left join(
select MIN(id) as id from codes
group by code) b
on a.id = b.id
where b.id is null
set statistics time off;
--set statistics time on;
-- WITH cte AS
-- (select a.id, a.code, ROW_NUMBER() OVER(PARTITION by code ORDER BY id) AS rn
-- from codes a
-- )
-- delete x
-- FROM codes x
-- JOIN cte ON x.id = cte.id
-- WHERE cte.rn > 1
--set statistics time off;
Performance test results:
With Join:
SQL Server Execution Times:
CPU time = 3198 ms, elapsed time = 3200 ms.
(999000 row(s) affected)
With CTE:
SQL Server Execution Times:
CPU time = 4197 ms, elapsed time = 4229 ms.
(999000 row(s) affected)
It's basically done like this:
WITH CTE_Dup AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY SalesOrderno, ItemNo ORDER BY SalesOrderno, ItemNo)
AS ROW_NO
from dbo.SalesOrderDetails
)
DELETE FROM CTE_Dup WHERE ROW_NO > 1;
NOTICE: you MUST include in the PARTITION BY all the fields that define a duplicate!
Here is another example:
CREATE TABLE #Table (C1 INT,C2 VARCHAR(10))
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (2,'Oracle')
SELECT * FROM #Table
;WITH Delete_Duplicate_Row_cte
AS (SELECT ROW_NUMBER() OVER(PARTITION BY C1, C2 ORDER BY C1, C2) ROW_NUM, *
FROM #Table )
DELETE FROM Delete_Duplicate_Row_cte WHERE ROW_NUM > 1
SELECT * FROM #Table
Is there any way to delete all the rows in a table except one (random) row, without specifying any column names in the DELETE statement?
I'm trying to do something like this:
CREATE TABLE [dbo].[DeleteExceptTop1]([Id] INT)
INSERT [dbo].[DeleteExceptTop1] SELECT 1
INSERT [dbo].[DeleteExceptTop1] SELECT 2
INSERT [dbo].[DeleteExceptTop1] SELECT 3
SELECT * FROM [dbo].[DeleteExceptTop1]
DELETE
FROM [dbo].[DeleteExceptTop1]
EXCEPT
SELECT TOP 1 * FROM [dbo].[DeleteExceptTop1]
SELECT * FROM [dbo].[DeleteExceptTop1]
The final SELECT should yield one row (could be any of the three).
;WITH CTE AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT newid())) AS RN
FROM [dbo].[DeleteExceptTop1]
)
DELETE FROM CTE
WHERE RN > 1
Or, similar to @abatishchev's answer but with more variety in the ordering and avoiding deprecated constructs.
DECLARE @C INT
SELECT @C = COUNT(*) - 1
FROM [dbo].[DeleteExceptTop1]
IF @C > 0
BEGIN
WITH CTE AS
(
SELECT TOP(@C) *
FROM [dbo].[DeleteExceptTop1]
ORDER BY NEWID()
)
DELETE FROM CTE;
END
Or a final way that uses EXCEPT and assumes no duplicate rows and that all columns are of datatypes compatible with the EXCEPT operator
/*Materialise TOP 1 to ensure only evaluated once*/
SELECT TOP(1) *
INTO #T
FROM [dbo].[DeleteExceptTop1]
ORDER BY NEWID()
;WITH CTE AS
(
SELECT *
FROM [dbo].[DeleteExceptTop1] T1
WHERE EXISTS(
SELECT *
FROM #T
EXCEPT
SELECT T1.*)
)
DELETE FROM CTE;
DROP TABLE #T
Try:
declare @c int
select @c = count(*) - 1 from [dbo].[DeleteExceptTop1]
IF @c > 0
BEGIN
set ROWCOUNT @c
delete from [dbo].[DeleteExceptTop1]
END
No.
You need to use a column name (such as that of the primary key) to identify which rows you want to remove.
"random row" has no meaning in SQL except its data. If you want to delete everything except some row, you must differentiate that row from the others you with to DELETE
EXCEPT works by comparing the DISTINCT values in the row.
EDIT: If you can specify the primary key then this is a trivial matter. You can simply DELETE where the PK <> your "random" selection or NOT IN your "random" selection(s).
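For example, a minimal sketch of that idea against the sample table, assuming Id serves as the key:
DELETE FROM [dbo].[DeleteExceptTop1]
WHERE Id NOT IN (SELECT TOP 1 Id FROM [dbo].[DeleteExceptTop1]);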
EDIT: Apparently I'm wrong about the need to specify any column name; you can do it using the assigned ROW_NUMBER. But I'm not going to delete my answer, because it references your use of EXCEPT, which was discussed in the comments. You cannot do it without deriving some column like that from ROW_NUMBER.
You could do something like this (SQL 2008)
DECLARE @Original TABLE ([Id] INT)
INSERT INTO @Original(ID) VALUES(1)
INSERT INTO @Original(ID) VALUES(2)
INSERT INTO @Original(ID) VALUES(3)
SELECT * FROM @Original;
WITH CTE AS
(SELECT ROW_NUMBER() OVER(ORDER BY ID) AS ROW, ID FROM @Original)
DELETE O
FROM @Original O
INNER JOIN CTE ON O.ID = CTE.ROW
WHERE CTE.ROW > 1
SELECT * FROM @Original
It seems like the simplest answer may be the best. The following should work:
Declare @count int
Set @count = (Select count(*) from DeleteExceptTop1) - 1
Delete top (@count) from DeleteExceptTop1
I know it has been answered but what about?
DELETE
FROM [dbo].[DeleteExceptTop1]
Where Id not in (
SELECT TOP 1 * FROM [dbo].[DeleteExceptTop1])
How can I find subsets of data over multiple rows in SQL?
I want to count the number of occurrences of a string (or number) before another string is found and then count the number of times this string occurs before another one is found.
All these strings can be in random order.
This is what I want to achieve:
I have one table with one column (columnx) with data like this:
A
A
B
C
A
B
B
The result I want from the query should be like this:
2 A
1 B
1 C
1 A
2 B
Is this even possible in SQL, or would it be easier to just write a little C# app to do this?
Since, as per your comment, you can add a column that will unambiguously define the order in which the columnx values go, you can try the following query (provided the SQL product you are using supports CTEs and ranking functions):
WITH marked AS (
SELECT
columnx,
sortcolumn,
grp = ROW_NUMBER() OVER ( ORDER BY sortcolumn)
- ROW_NUMBER() OVER (PARTITION BY columnx ORDER BY sortcolumn)
FROM data
)
SELECT
columnx,
COUNT(*)
FROM marked
GROUP BY
columnx,
grp
ORDER BY
MIN(sortcolumn)
;
You can see the method in work on SQL Fiddle.
If sortcolumn is an auto-increment integer column that is guaranteed to have no gaps, you can replace the first ROW_NUMBER() expression with just sortcolumn. But, I guess, that cannot be guaranteed in general. Besides, you might indeed want to sort on a timestamp instead of an integer.
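A sketch of that variant, under the assumption that sortcolumn really is gap-free:
WITH marked AS (
  SELECT
    columnx,
    sortcolumn,
    grp = sortcolumn
        - ROW_NUMBER() OVER (PARTITION BY columnx ORDER BY sortcolumn)
  FROM data
)
SELECT
  columnx,
  COUNT(*)
FROM marked
GROUP BY
  columnx,
  grp
ORDER BY
  MIN(sortcolumn);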
I don't think you can do it with a single SELECT.
You can use a cursor:
create table my_Strings
(
my_string varchar(50)
)
insert into my_strings values('A'),('A'),('B'),('C'),('A'),('B'),('B') -- this multi-row VALUES syntax requires SQL Server 2008 or later
--select my_String from my_strings
declare @temp_result table(
string varchar(50),
nr int)
declare @myString varchar(50)
declare @myLastString varchar(50)
declare @nr int
set @myLastString='A' --set this with the value of your FIRST string on the table
set @nr=0
DECLARE string_cursor CURSOR
FOR
SELECT my_string as aux_column FROM my_strings
OPEN string_cursor
FETCH NEXT FROM string_cursor into @myString
WHILE @@FETCH_STATUS = 0 BEGIN
if (@myString = @myLastString) begin
set @nr=@nr+1
set @myLastString=@myString
end else begin
insert into @temp_result values (@myLastString, @nr)
set @myLastString=@myString
set @nr=1
end
FETCH NEXT FROM string_cursor into @myString
END
insert into @temp_result values (@myLastString, @nr)
CLOSE string_cursor;
DEALLOCATE string_cursor;
select * from @temp_result
Result:
A 2
B 1
C 1
A 1
B 2
Try this:
;with sample as (
select 'A' as columnx
union all
select 'A'
union all
select 'B'
union all
select 'C'
union all
select 'A'
union all
select 'B'
union all
select 'B'
), data
as (
select columnx,
Row_Number() over(order by (select 0)) id
from sample
) , CTE as (
select * ,
Row_Number() over(order by (select 0)) rno from data
) , result as (
SELECT d.*
, ( SELECT MAX(ID)
FROM CTE c
WHERE NOT EXISTS (SELECT * FROM CTE
WHERE rno = c.rno-1 and columnx = c.columnx)
AND c.ID <= d.ID) AS g
FROM data d
)
SELECT columnx,
COUNT(1) cnt
FROM result
GROUP BY columnx,
g
Result :
columnx cnt
A 2
B 1
C 1
A 1
B 2
Consider the following SQL Server query:
DECLARE @Table TABLE(
Wages FLOAT
)
INSERT INTO @Table SELECT 20000
INSERT INTO @Table SELECT 15000
INSERT INTO @Table SELECT 10000
INSERT INTO @Table SELECT 45000
INSERT INTO @Table SELECT 50000
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(ORDER BY Wages DESC) RowID
FROM @Table
) sub
WHERE RowID = 3
The result of the query would be 20000. That's fine. Now I have to find the result of this query:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(ORDER BY Wages DESC) RowID
FROM @Table
) sub
WHERE RowID = 6
It will not give any result, because there are only 5 rows in the table.
So now my question is:
What is the easiest way to find out whether a SQL query returns a result or not?
Use @@ROWCOUNT > 0
So as a simple example,
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(ORDER BY Wages DESC) RowID
FROM @Table
) sub
WHERE RowID = 6
IF @@ROWCOUNT > 0 BEGIN
RETURN 1
END
ELSE BEGIN
RETURN 0
END
For more information, see the documentation for @@ROWCOUNT.
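Note that RETURN with a value is only valid inside a stored procedure, so in practice you would wrap the check in one. A minimal sketch (the procedure name and the permanent table standing in for the table variable are assumptions, not from the original post):
CREATE PROCEDURE dbo.HasRowAtPosition @Position INT
AS
BEGIN
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER(ORDER BY Wages DESC) RowID
        FROM dbo.Wages -- hypothetical permanent table holding the same data as the table variable
    ) sub
    WHERE RowID = @Position;

    IF @@ROWCOUNT > 0
        RETURN 1;
    RETURN 0;
END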
Like this:
SELECT @@ROWCOUNT