I have written a stored procedure whose result set includes duplicates. I tried ROW_NUMBER but could not get it to work. DISTINCT removes the duplicates, but then the procedure no longer returns the large number of records required (about 700,000). Is there another way, using RANK or GROUP BY, to remove the duplicates?
I have not managed to get GROUP BY working.
My ROW_NUMBER attempt did not work either (you can see where it is commented out).
CREATE PROCEDURE [report].[get_foodDetails]
    @foodgroup_id INT,
    @shop_id INT = 0,
    @product_id INT = 0,
    @maxrows INT = 600,
    @expiry INT = 1,
    @productactive INT = 1,
    @expiryPeriod DATETIME = '9999-12-31 23:59:59'
AS
BEGIN
    IF (@expiryPeriod >= '9999-12-31')
    BEGIN
        SET @expiryPeriod = GETDATE()
    END
    SELECT
        -- dp.RowNumber
        ISNULL([FoodType], '') AS [Foodtype],
        ISNULL([FoodColour], '') AS [FoodColour],
        ISNULL([FoodBarcode], '') AS [FoodBarcode],
        ISNULL([FoodArticleNum], 0) AS [FoodArticleNum],
        ISNULL([FoodShelfLife], '9999-12-31') AS [FoodShelfLIFe]
    INTO
        #devfood
    FROM
        report.[GetOrderList] (@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows ) dp
    INNER JOIN
        food_group fg ON fg.food_group_id = it.item_FK_item_group_id
    SELECT TOP(@maxrows) *
    FROM #devfood
    ORDER BY [device_packet_created_date]
END
The procedure should return around 700,000 records. It currently does, but with duplicates. With DISTINCT only about 20,000 records come back (with no duplicates).
The sample code below is from a presentation I've used to demonstrate CTEs. This is a common mechanism for removing duplicates and is very fast. In this case the duplicates are removed directly from the table. If that is not your objective you could use a temp table or a prior chained CTE. Note that the important thing is which columns you partition by. If, in the example, you partitioned by only [flower] you would not see both the red rose and the white rose.
-------------------------------------------------
if object_id(N'[flower].[order]', N'U') is not null
drop table [flower].[order];
go
create table [flower].[order]
(
[id] int identity(1, 1) not null constraint [flower.order.id.clustered_primary_key] primary key clustered
, [flower] nvarchar(128)
, [color] nvarchar(128)
, [count] int
);
go
insert into [flower].[order]
([flower]
, [color]
, [count])
values (N'rose',N'red',5),
(N'rose',N'red',3),
(N'rose',N'white',2),
(N'rose',N'red',1),
(N'rose',N'red',9),
(N'marigold',N'yellow',2),
(N'marigold',N'yellow',9),
(N'marigold',N'yellow',4),
(N'chamomile',N'amber',9),
(N'chamomile',N'amber',4),
(N'lily',N'white',12);
go
select [flower]
, [color]
from [flower].[order];
go
--
-------------------------------------------------
with [duplicate_finder]([name], [color], [sequence])
as (select [flower]
, [color]
, row_number()
over (
partition by [flower], [color]
order by [flower] desc) as [sequence]
from [flower].[order])
delete from [duplicate_finder]
where [sequence] > 1;
--
-- no duplicates
-------------------------------------------------
select [flower]
, [color]
from [flower].[order];
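To check the point about partition columns without a SQL Server instance, here is a minimal sketch that swaps in SQLite through Python's built-in sqlite3 module. It assumes SQLite 3.25 or later (for window functions); the table name flower_order, the column qty and the helper surviving_rows are made up for the illustration. SQLite cannot delete through a CTE, so the sketch only selects the rows that would survive the delete.

```python
import sqlite3

ROWS = [
    ("rose", "red", 5), ("rose", "red", 3), ("rose", "white", 2),
    ("rose", "red", 1), ("rose", "red", 9), ("marigold", "yellow", 2),
    ("marigold", "yellow", 9), ("marigold", "yellow", 4),
    ("chamomile", "amber", 9), ("chamomile", "amber", 4), ("lily", "white", 12),
]

def surviving_rows(partition_cols):
    """Return the (flower, color) pairs whose row_number() is 1 in their partition."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table flower_order "
        "(id integer primary key, flower text, color text, qty integer)"
    )
    con.executemany(
        "insert into flower_order (flower, color, qty) values (?, ?, ?)", ROWS
    )
    # partition_cols is a trusted, hard-coded column list (hypothetical helper only)
    sql = f"""
        with duplicate_finder as (
            select flower, color,
                   row_number() over (partition by {partition_cols} order by id) as seq
            from flower_order
        )
        select flower, color
        from duplicate_finder
        where seq = 1
        order by flower, color
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result

by_flower_and_color = surviving_rows("flower, color")
by_flower_only = surviving_rows("flower")
print(by_flower_and_color)
print(by_flower_only)
```

Partitioning by both columns keeps the white rose as its own group; partitioning by flower alone folds it into the red roses, so it disappears.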
I know you said you tried ROW_NUMBER, but did you try it either of these ways?
First, a CTE. The CTE here is just your existing query, but with a ROW_NUMBER windowing function attached. For each duplicate iteration of a record, it will add one to RowNumber. With the next unique group of records, RowNumber resets to 1.
After the pull, only take the records with a RowNumber = 1. I use this all the time for deleting dupes out of the underlying record set, but it works well to just identify them as well.
WITH NoDupes AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY
ISNULL(FoodType, '')
,ISNULL(FoodColour, '')
,ISNULL(FoodBarcode, '')
,ISNULL(FoodArticleNum, 0)
,ISNULL(FoodShelfLife, '9999-12-31')
ORDER BY
(
SELECT
0
)
) AS RowNumber
,ISNULL(FoodType, '') AS Foodtype
,ISNULL(FoodColour, '') AS FoodColour
,ISNULL(FoodBarcode, '') AS FoodBarcode
,ISNULL(FoodArticleNum, 0) AS FoodArticleNum
,ISNULL(FoodShelfLife, '9999-12-31') AS FoodShelfLIFe
FROM
report.GetOrderList(@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows) AS dp
INNER JOIN
food_group AS fg
ON
fg.food_group_id = it.item_FK_item_group_id
)
SELECT
nd.Foodtype
,nd.FoodColour
,nd.FoodBarcode
,nd.FoodArticleNum
,nd.FoodShelfLIFe
INTO
#devfood
FROM
NoDupes AS nd
WHERE
nd.RowNumber = 1;
Alternatively (and shorter) you could try SELECT TOP (1) WITH TIES, using that same ROW_NUMBER function to order the record set. The TOP (1) WITH TIES part functionally does the same thing as the CTE, returning only the first record of each set of duplicates.
SELECT
TOP (1) WITH TIES
ISNULL(FoodType, '') AS Foodtype
,ISNULL(FoodColour, '') AS FoodColour
,ISNULL(FoodBarcode, '') AS FoodBarcode
,ISNULL(FoodArticleNum, 0) AS FoodArticleNum
,ISNULL(FoodShelfLife, '9999-12-31') AS FoodShelfLIFe
INTO
#devfood
FROM
report.GetOrderList(@foodgroup_id, @product_id, @productactive, @expiry, @expiryPeriod, @shop_id, @maxrows) AS dp
INNER JOIN
food_group AS fg
ON
fg.food_group_id = it.item_FK_item_group_id
ORDER BY
ROW_NUMBER() OVER (PARTITION BY
ISNULL(FoodType, '')
,ISNULL(FoodColour, '')
,ISNULL(FoodBarcode, '')
,ISNULL(FoodArticleNum, 0)
,ISNULL(FoodShelfLife, '9999-12-31')
ORDER BY
(
SELECT
0
)
);
The CTE is maybe a little clearer in its intention for the next person who looks at the code, but the TOP might perform a little better.
I've got a SQL Server db with quite a few dupes in it. Removing the dupes manually is just not going to be fun, so I was wondering if there is any sort of sql programming or scripting I can do to automate it.
Below is my query that returns the ID and the Code of the duplicates.
select a.ID, a.Code
from Table1 a
inner join (
SELECT Code
FROM Table1 GROUP BY Code HAVING COUNT(Code)>1)
x on x.Code= a.Code
I'll get a return like this, for example:
5163 51727
5164 51727
5165 51727
5166 51728
5167 51728
5168 51728
This snippet shows three rows for each Code (so a primary "good" record and two dupes). However, this isn't always the case. There can be up to [n] dupes, although 2-3 seems to be the norm.
I just want to somehow loop through this result set and delete everything but one record. THE RECORDS TO DELETE ARE ARBITRARY, as any of them can be "kept".
You can use ROW_NUMBER to drive your delete, for example:
CREATE TABLE #table1
(id INT,
code int
);
WITH cte AS
(select a.ID, a.Code, ROW_NUMBER() OVER(PARTITION by COdE ORDER BY ID) AS rn
from #Table1 a
)
DELETE x
FROM #table1 x
JOIN cte ON x.id = cte.id
WHERE cte.rn > 1
But...
If you are going to delete a lot of rows from a very large table, you might be better off selecting the rows you need into a temp table, then truncating your table and re-inserting those rows.
That keeps the transaction log from getting hammered and the clustered index from fragmenting, and should be quicker too!
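The copy-out, clear and re-insert idea can be sketched as below. This is only an illustration, run against SQLite through Python's sqlite3 module rather than SQL Server, and it assumes SQLite 3.25 or later. SQLite has no TRUNCATE, so an unqualified DELETE stands in for it, which means the sketch shows the shape of the steps and says nothing about the logging or fragmentation benefits. The table name table1 and the temp table keepers are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table table1 (id integer primary key, code integer)")
con.executemany(
    "insert into table1 (id, code) values (?, ?)",
    [(5163, 51727), (5164, 51727), (5165, 51727),
     (5166, 51728), (5167, 51728), (5168, 51728)],
)
con.commit()

# 1. Copy one row per code (the lowest id, an arbitrary choice) into a temp table.
con.execute("""
    create temp table keepers as
    select id, code
    from (select id, code,
                 row_number() over (partition by code order by id) as rn
          from table1)
    where rn = 1
""")

# 2. Empty the original table (stand-in for TRUNCATE TABLE).
con.execute("delete from table1")

# 3. Put the kept rows back and clean up.
con.execute("insert into table1 (id, code) select id, code from keepers")
con.execute("drop table keepers")
con.commit()

remaining = con.execute("select id, code from table1 order by id").fetchall()
print(remaining)
```

On a real system you would also want to think about foreign keys, identity columns and permissions before truncating, none of which this sketch covers.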
It is actually very simple:
DELETE FROM Table1
WHERE ID NOT IN
(SELECT MAX(ID)
FROM Table1
GROUP BY CODE)
Self join solution with a performance test VS cte.
create table codes(
id int IDENTITY(1,1) NOT NULL,
code int null,
CONSTRAINT [PK_codes_id] PRIMARY KEY CLUSTERED
(
id ASC
))
declare @counter int, @code int
set @counter = 1
set @code = 1
while (@counter <= 1000000)
begin
print ABS(Checksum(NewID()) % 1000)
insert into codes(code) select ABS(Checksum(NewID()) % 1000)
set @counter = @counter + 1
end
GO
set statistics time on;
delete a
from codes a left join(
select MIN(id) as id from codes
group by code) b
on a.id = b.id
where b.id is null
set statistics time off;
--set statistics time on;
-- WITH cte AS
-- (select a.id, a.code, ROW_NUMBER() OVER(PARTITION by code ORDER BY id) AS rn
-- from codes a
-- )
-- delete x
-- FROM codes x
-- JOIN cte ON x.id = cte.id
-- WHERE cte.rn > 1
--set statistics time off;
Performance test results:
With Join:
SQL Server Execution Times:
CPU time = 3198 ms, elapsed time = 3200 ms.
(999000 row(s) affected)
With CTE:
SQL Server Execution Times:
CPU time = 4197 ms, elapsed time = 4229 ms.
(999000 row(s) affected)
It's basically done like this:
WITH CTE_Dup AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY SalesOrderno, ItemNo ORDER BY SalesOrderno, ItemNo)
AS ROW_NO
FROM dbo.SalesOrderDetails
)
DELETE FROM CTE_Dup WHERE ROW_NO > 1;
Note: the PARTITION BY list must include every column that defines a duplicate.
Here is another example:
CREATE TABLE #Table (C1 INT,C2 VARCHAR(10))
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (1,'SQL Server')
INSERT INTO #Table VALUES (2,'Oracle')
SELECT * FROM #Table
;WITH Delete_Duplicate_Row_cte
AS (SELECT ROW_NUMBER()OVER(PARTITION BY C1, C2 ORDER BY C1,C2) ROW_NUM,*
FROM #Table )
DELETE FROM Delete_Duplicate_Row_cte WHERE ROW_NUM > 1
SELECT * FROM #Table
Imagine the following two tables:
create table MainTable (
MainId integer not null, -- This is the index
Data varchar(100) not null
)
create table OtherTable (
MainId integer not null, -- MainId, Name combined are the index.
Name varchar(100) not null,
Status tinyint not null
)
Now I want to select all the rows from MainTable, while combining all the rows that match each MainId from OtherTable into a single field in the result set.
Imagine the data:
MainTable:
1, 'Hi'
2, 'What'
OtherTable:
1, 'Fish', 1
1, 'Horse', 0
2, 'Fish', 0
I want a result set like this:
MainId, Data, Others
1, 'Hi', 'Fish=1,Horse=0'
2, 'What', 'Fish=0'
What is the most elegant way to do this?
(Don't worry about the comma being in front or at the end of the resulting string.)
There is no really elegant way to do this in Sybase. Here is one method, though:
select
mt.MainId,
mt.Data,
Others = stuff((
max(case when seqnum = 1 then ','+Name+'='+cast(status as varchar(255)) else '' end) +
max(case when seqnum = 2 then ','+Name+'='+cast(status as varchar(255)) else '' end) +
max(case when seqnum = 3 then ','+Name+'='+cast(status as varchar(255)) else '' end)
), 1, 1, '')
from MainTable mt
left outer join
(select
ot.*,
row_number() over (partition by MainId order by status desc) as seqnum
from OtherTable ot
) ot
on mt.MainId = ot.MainId
group by
mt.MainId, mt.Data
That is, it enumerates the values in the second table. It then does conditional aggregation to get each value, using the stuff() function to handle the extra comma. The above works for the first three values. If you want more, then you need to add more clauses.
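For anyone who wants to see the enumerate-then-aggregate trick produce the requested output, here is a small sketch run on SQLite through Python's sqlite3 module (assuming SQLite 3.25 or later). It is not Sybase: `||` replaces `+` for concatenation and substr(..., 2) stands in for stuff(..., 1, 1, ''). Like the query above, it only handles the first three values per MainId.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table MainTable (MainId integer not null, Data text not null);
    create table OtherTable (MainId integer not null, Name text not null,
                             Status integer not null);
    insert into MainTable values (1, 'Hi'), (2, 'What');
    insert into OtherTable values (1, 'Fish', 1), (1, 'Horse', 0), (2, 'Fish', 0);
""")

rows = con.execute("""
    select mt.MainId, mt.Data,
           -- each max(case ...) picks out the value numbered 1, 2 or 3;
           -- substr(..., 2) then drops the leading comma
           substr(
               max(case when seqnum = 1 then ',' || Name || '=' || Status else '' end) ||
               max(case when seqnum = 2 then ',' || Name || '=' || Status else '' end) ||
               max(case when seqnum = 3 then ',' || Name || '=' || Status else '' end),
               2) as Others
    from MainTable mt
    left join (select o.*,
                      row_number() over (partition by MainId order by Status desc) as seqnum
               from OtherTable o) ot
      on mt.MainId = ot.MainId
    group by mt.MainId, mt.Data
    order by mt.MainId
""").fetchall()
print(rows)
```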
Well, here is how I implemented it in Sybase 13.x. This code has the advantage of not being limited to a number of Names.
create proc
as
declare
    @MainId int,
    @Name varchar(100),
    @Status tinyint
create table #OtherTable (
    MainId int not null,
    CombStatus varchar(250) not null
)
declare OtherCursor cursor for
    select
        MainId, Name, Status
    from
        OtherTable
open OtherCursor
fetch OtherCursor into @MainId, @Name, @Status
while (@@sqlstatus = 0) begin -- run until there are no more
    if exists (select 1 from #OtherTable where MainId = @MainId) begin
        -- append to the string already built for this MainId
        update #OtherTable
        set CombStatus = CombStatus + ','+@Name+'='+convert(varchar, @Status)
        where
            MainId = @MainId
    end else begin
        -- first Name seen for this MainId: start a new string
        insert into #OtherTable (MainId, CombStatus)
        select
            MainId = @MainId,
            CombStatus = @Name+'='+convert(varchar, @Status)
    end
    fetch OtherCursor into @MainId, @Name, @Status
end
close OtherCursor
select
mt.MainId,
mt.Data,
ot.CombStatus
from
MainTable mt
left join #OtherTable ot
on mt.MainId = ot.MainId
But it does have the disadvantage of using a cursor and a working table, which can - at least with a lot of data - make the whole process slow.
I sure hope someone can help me out with this issue. I have been searching for hours to find it but I am coming up empty.
In this example I have two columns in my table
GRP_ID Desc
My group ID is the way I will identify that these products are of the same type, and desc is what I want to find all the common words.
So here is my table
GRP_ID Desc
-------------------------------
2 Red Hat
2 Green Hat
2 Yellow Hat
3 Boots Large Brown
3 Boots Medium Red
3 Boots Medium Brown
What I want as a result of the query would be the following
GRP_ID Desc
-----------------------
2 Hat
3 Boots
So what I want is all the words that appear in every string in the group or the common words in the group.
I think you'd need to create a mapping table for GRP_ID and products - e.g. Hat and Boots.
CREATE TABLE GroupProductMapping (
GRP_ID INT NOT NULL, -- I'm assuming its an Int
ProductDesc VARCHAR(50) NOT NULL
)
SELECT a.GRP_ID,
b.ProductDesc AS [Desc]
FROM {Table_Name} a
INNER JOIN GroupProductMapping b ON a.GRP_ID = b.GRP_ID
Alternatively, if you don't have too many products, you could use CASE in your SELECT clause, e.g.
SELECT
GRP_ID,
CASE GRP_ID
WHEN 2 THEN 'Hat'
WHEN 3 THEN 'Boots'
END AS [Desc]
FROM {Table_Name}
{Table_Name} is the name of your original table.
Ideally you would normalise your data and store the words in a separate table.
However for your immediate requirements, you first need to provide a UDF to split 'desc' into words. I poached this function:
-- this function splits the provided strings on a delimiter
-- similar to .Net string.Split.
-- I'm sure there are alternatives (such as calling string.Split through
-- a CLR function).
CREATE FUNCTION [dbo].[Split]
(
    @RowData NVARCHAR(MAX),
    @Delimeter NVARCHAR(MAX)
)
RETURNS @RtnValue TABLE
(
    ID INT IDENTITY(1,1),
    Data NVARCHAR(MAX)
)
AS
BEGIN
    DECLARE @Iterator INT
    SET @Iterator = 1
    DECLARE @FoundIndex INT
    SET @FoundIndex = CHARINDEX(@Delimeter,@RowData)
    WHILE (@FoundIndex>0)
    BEGIN
        -- emit the text before the delimiter, then cut it off the front
        INSERT INTO @RtnValue (data)
        SELECT
            Data = LTRIM(RTRIM(SUBSTRING(@RowData, 1, @FoundIndex - 1)))
        SET @RowData = SUBSTRING(@RowData,
            @FoundIndex + DATALENGTH(@Delimeter) / 2,
            LEN(@RowData))
        SET @Iterator = @Iterator + 1
        SET @FoundIndex = CHARINDEX(@Delimeter, @RowData)
    END
    -- whatever remains after the last delimiter is the final item
    INSERT INTO @RtnValue (Data)
    SELECT Data = LTRIM(RTRIM(@RowData))
    RETURN
END
Then you need to split the descriptions and do some grouping (which you would also do if the data was normalised)
-- get the count of each grp_id
with group_count as
(
select grp_id, count(*) cnt from [Group]
group by grp_id
),
-- get the count of each word in each grp_id
group_word_count as
(
select count(*) cnt, grp_id, data from
(
select * from [group] g
cross apply dbo.Split(g.[Desc], ' ')
)
t
group by grp_id, data
)
-- return rows where number of grp_id = number of words in grp_id
select gwc.GRP_ID, gwc.Data [Desc] from group_word_count gwc
inner join group_count gc on gwc.GRP_ID = gc.GRP_ID and gwc.cnt = gc.cnt
Where [Group] is your table.
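As a cross-check of the counting logic (a word is common to a group when it occurs in as many rows as the group has), here is a minimal sketch in plain Python using the sample data from the question. The function name common_words is made up. One deliberate difference from the SQL: it counts a word at most once per description, so a word repeated inside a single row cannot inflate its count.

```python
from collections import Counter, defaultdict

rows = [
    (2, "Red Hat"), (2, "Green Hat"), (2, "Yellow Hat"),
    (3, "Boots Large Brown"), (3, "Boots Medium Red"), (3, "Boots Medium Brown"),
]

def common_words(rows):
    """Return {grp_id: sorted words that appear in every description of the group}."""
    group_count = Counter()                  # number of rows per grp_id
    group_word_count = defaultdict(Counter)  # rows containing each word, per grp_id
    for grp_id, desc in rows:
        group_count[grp_id] += 1
        # set() counts each word once per row
        for word in set(desc.split(" ")):
            group_word_count[grp_id][word] += 1
    return {
        grp_id: sorted(w for w, n in words.items() if n == group_count[grp_id])
        for grp_id, words in group_word_count.items()
    }

result = common_words(rows)
print(result)
```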
I am still so new to all this and I think I may have not done this the best way. I have a Table Valued function that I wrote, but I think that it could be written as a view.
The big catch as to why I used a table val function is that if the select query returns no results then I wanted to return a "default" row that showed empty values along with the timestamp and I didn't know how to do that in a view.
I'm betting the experts here know how to. Here's the function:
alter FUNCTION [dbo].[GetCurrentRTBindingConstraints]()
RETURNS
@CurrentBindingConstraints table (
CONSTRAINTNAME [nvarchar] (120),
MKTHOUR_EST [dateTime],
MARGINALVALUE [nvarchar] (20)
)
AS
BEGIN
INSERT INTO @CurrentBindingConstraints
select * from
OPENQUERY(UDS9, 'select
CONSTRAINTNAME, MKTHOUR -(5/24) as MKTHOUR_EST,MARGINALVALUE
from UDS9.MKTPLANCONSTRAINT mpc
where MARGINALVALUE != 0.00 and mpc.caseid=(SELECT caseid FROM uds9.MktCase
WHERE casestartinterval=(SELECT MAX(casestartinterval) FROM uds9.MktCase WHERE casestate=5 AND studymodeid=5)
AND casestate=5 AND studymodeid=5)')
DECLARE @cnt INT
SELECT @cnt = COUNT(*) FROM @CurrentBindingConstraints
IF @cnt = 0
INSERT INTO @CurrentBindingConstraints (
[CONSTRAINTNAME],
[MKTHOUR_EST],
[MARGINALVALUE])
VALUES ('None',dbo.RoundTime(dbo.GetGMTtoEST(getutcdate())),'None')
RETURN
END
You can use a common table expression (CTE) and a ranking function as follows:
;with Defaulted as (
select 'none' as Col1,CURRENT_TIMESTAMP as Col2,'none' as Col3,1 as init -- This is your default row
union all
select name,DATEADD(day,-1,CURRENT_TIMESTAMP),name,0 from sys.objects -- This is where you query for real rows
), Ranked as (
select Col1,Col2,Col3,RANK() OVER (ORDER BY init) as rnk from Defaulted
)
select * from Ranked where rnk = 1
The above is just an example - you'd need to replace the two selects inside the first CTE with your real queries, and should use column names rather than select *. It works because the ranking function (RANK()) is able to assess the result set as a whole.
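To see both cases (real rows present, and no rows at all), here is a small sketch of the same UNION ALL plus RANK() pattern, run on SQLite through Python's sqlite3 module (assuming SQLite 3.25 or later). The table real_data and the function rows_or_default are invented for the illustration.

```python
import sqlite3

def rows_or_default(real_rows):
    """Return the real rows, or a single default row when there are none."""
    con = sqlite3.connect(":memory:")
    con.execute("create table real_data (col1 text, col3 text)")
    con.executemany("insert into real_data values (?, ?)", real_rows)
    result = con.execute("""
        with Defaulted as (
            select col1, col3, 0 as init from real_data  -- real rows rank first
            union all
            select 'none', 'none', 1                      -- default row ranks last
        ), Ranked as (
            select col1, col3, rank() over (order by init) as rnk from Defaulted
        )
        select col1, col3 from Ranked where rnk = 1 order by col1
    """).fetchall()
    con.close()
    return result

with_data = rows_or_default([("a", "x"), ("b", "y")])
without_data = rows_or_default([])
print(with_data)
print(without_data)
```

When any real row exists, every real row ties at rank 1 and the default row is pushed to a later rank; when none exist, the default row is the only row and so gets rank 1.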
Edit - trying with your actual queries:
create view CurrentRTBindingConstraints
as
with Defaulted as (
select CONSTRAINTNAME,MKTHOUR_EST,MARGINALVALUE,0 as init from
OPENQUERY(UDS9, 'select
CONSTRAINTNAME, MKTHOUR -(5/24) as MKTHOUR_EST,MARGINALVALUE
from UDS9.MKTPLANCONSTRAINT mpc
where MARGINALVALUE != 0.00 and mpc.caseid=(SELECT caseid FROM uds9.MktCase
WHERE casestartinterval=(SELECT MAX(casestartinterval) FROM uds9.MktCase WHERE casestate=5 AND studymodeid=5)
AND casestate=5 AND studymodeid=5)')
union all
select 'None',dbo.RoundTime(dbo.GetGMTtoEST(getutcdate())),'None',1
), Ranked as (
select CONSTRAINTNAME,MKTHOUR_EST,MARGINALVALUE,RANK() OVER (ORDER BY init) as rnk from Defaulted
)
select CONSTRAINTNAME,MKTHOUR_EST,MARGINALVALUE from Ranked where rnk = 1
I have a table with some duplicate entries. I have to discard all but one, and then update this latest one. I've tried with a temporary table and a while statement, in this way:
CREATE TABLE #tmp_ImportedData_GenericData
(
Id int identity(1,1),
tmpCode varchar(255) NULL,
tmpAlpha3Code varchar(50) NULL,
tmpRelatedYear int NOT NULL,
tmpPreviousValue varchar(255) NULL,
tmpGrowthRate varchar(255) NULL
)
INSERT INTO #tmp_ImportedData_GenericData
SELECT
MCS_ImportedData_GenericData.Code,
MCS_ImportedData_GenericData.Alpha3Code,
MCS_ImportedData_GenericData.RelatedYear,
MCS_ImportedData_GenericData.PreviousValue,
MCS_ImportedData_GenericData.GrowthRate
FROM MCS_ImportedData_GenericData
INNER JOIN
(
SELECT CODE, ALPHA3CODE, RELATEDYEAR, COUNT(*) AS NUMROWS
FROM MCS_ImportedData_GenericData AS M
GROUP BY M.CODE, M.ALPHA3CODE, M.RELATEDYEAR
HAVING count(*) > 1
) AS M2 ON MCS_ImportedData_GenericData.CODE = M2.CODE
AND MCS_ImportedData_GenericData.ALPHA3CODE = M2.ALPHA3CODE
AND MCS_ImportedData_GenericData.RELATEDYEAR = M2.RELATEDYEAR
WHERE
(MCS_ImportedData_GenericData.PreviousValue <> 'INDEFINITO')
-- SELECT * from #tmp_ImportedData_GenericData
-- DROP TABLE #tmp_ImportedData_GenericData
DECLARE @counter int
DECLARE @rowsCount int
SET @counter = 1
SELECT @rowsCount = count(*) from #tmp_ImportedData_GenericData
-- PRINT @rowsCount
WHILE @counter < @rowsCount
BEGIN
SELECT
@Code = tmpCode,
@Alpha3Code = tmpAlpha3Code,
@RelatedYear = tmpRelatedYear,
@OldValue = tmpPreviousValue,
@GrowthRate = tmpGrowthRate
FROM
#tmp_ImportedData_GenericData
WHERE
Id = @counter
DELETE FROM MCS_ImportedData_GenericData
WHERE
Code = @Code
AND Alpha3Code = @Alpha3Code
AND RelatedYear = @RelatedYear
AND PreviousValue <> 'INDEFINITO' OR PreviousValue IS NULL
UPDATE
MCS_ImportedData_GenericData
SET
PreviousValue = @OldValue, GrowthRate = @GrowthRate
WHERE
Code = @Code
AND Alpha3Code = @Alpha3Code
AND RelatedYear = @RelatedYear
AND MCS_ImportedData_GenericData.PreviousValue ='INDEFINITO'
SET @counter = @counter + 1
END
but it takes too long, even though there are only 20,000 - 30,000 rows to process.
Does anyone have any suggestions for improving performance?
Thanks in advance!
WITH q AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY CODE, ALPHA3CODE, RELATEDYEAR ORDER BY CASE WHEN PreviousValue = 'INDEFINITO' THEN 1 ELSE 0 END) AS rn
FROM MCS_ImportedData_GenericData m
WHERE PreviousValue <> 'INDEFINITO'
)
DELETE
FROM q
WHERE rn > 1
Quassnoi's answer uses SQL Server 2005+ syntax, so I thought I'd put in my tuppence worth using something more generic...
First, to delete all the duplicates, but not the "original", you need a way of differentiating the duplicate records from each other. (The ROW_NUMBER() part of Quassnoi's answer)
It would appear that in your case the source data has no identity column (you create one in the temp table). If that is the case, there are two choices that come to my mind:
1. Add the identity column to the data, then remove the duplicates
2. Create a "de-duped" set of data, delete everything from the original, and insert the de-duped data back into the original
Option 1 could be something like...
(With the newly created ID field)
DELETE
[data]
FROM
MCS_ImportedData_GenericData AS [data]
WHERE
id > (
SELECT
MIN(id)
FROM
MCS_ImportedData_GenericData
WHERE
CODE = [data].CODE
AND ALPHA3CODE = [data].ALPHA3CODE
AND RELATEDYEAR = [data].RELATEDYEAR
)
OR...
DELETE
[data]
FROM
MCS_ImportedData_GenericData AS [data]
INNER JOIN
(
SELECT
MIN(id) AS [id],
CODE,
ALPHA3CODE,
RELATEDYEAR
FROM
MCS_ImportedData_GenericData
GROUP BY
CODE,
ALPHA3CODE,
RELATEDYEAR
)
AS [original]
ON [original].CODE = [data].CODE
AND [original].ALPHA3CODE = [data].ALPHA3CODE
AND [original].RELATEDYEAR = [data].RELATEDYEAR
AND [original].id <> [data].id
I don't know the syntax used here well enough to post an exact answer, but here's an approach.
1. Identify the rows you want to preserve (e.g. select value, ... from ... where ...)
2. Do the update logic while identifying them (e.g. select value + 1 ... from ... where ...)
3. Do an insert-select into a new table.
4. Drop the original, rename the new table to the original name, and recreate all grants/synonyms/triggers/indexes/FKs/... (or truncate the original and insert-select from the new table)
Obviously this has a pretty big overhead, but if you want to update or clear millions of rows, it will be the fastest way.
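A rough sketch of that rebuild-and-swap approach, run on SQLite through Python's sqlite3 module (assuming SQLite 3.25 or later) rather than on the asker's database. The table imported, its columns and the sample rows are invented, and the rule of preferring a row whose value is not 'INDEFINITO' is an assumption borrowed from the ORDER BY in the earlier answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table imported (code text, alpha3 text, year integer, previous_value text);
    insert into imported values
        ('A', 'ITA', 2010, 'INDEFINITO'),
        ('A', 'ITA', 2010, '12.5'),
        ('B', 'FRA', 2011, '7.0');

    -- Identify the one row to keep per key and insert-select it into a new table.
    create table imported_new as
    select code, alpha3, year, previous_value
    from (select *,
                 row_number() over (
                     partition by code, alpha3, year
                     order by case when previous_value = 'INDEFINITO' then 1 else 0 end
                 ) as rn
          from imported)
    where rn = 1;

    -- Swap the new table in for the original.
    drop table imported;
    alter table imported_new rename to imported;
""")

rows = con.execute(
    "select code, alpha3, year, previous_value from imported order by code"
).fetchall()
print(rows)
```

In a real database the swap step is where the grants, indexes, triggers and foreign keys have to be recreated; the sketch has none of those.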