SQL Deduplication, populating the duplicates with its unique identifier - sql

I want to be able to populate any duplicate items in a table with its unique identifier. So for example, in the table below;
PROJ00002492 should get a GlobalFamilDupID of PROJ00002492 (itself, aka the ControlNumber column), and all other duplicates should get the same value PROJ00002492.
PROJ00005876 should get the value PROJ00005876 (aka itself, the ControlNumber column).
Code:
update mstr
SET
IsGlobalFamilyUnique = case when (rn > 1) then 0 else 1 end
from(
select
ControlNumber,
MD5hash,
IsglobalFamilyUnique,
GlobalFamilyDupID,
row_number() over (partition by [MD5Hash] order by ID asc) [RN]
from dbo.tblMaster
where NuixGuid = TopLvlGuid and IsGlobalFamilyUnique is null
)mstr
The above code works, but I can't think how to populate the GlobalFamilDupID column? Will I have to do it in a separate query?

You can add the value you need into your SELECT using
FIRST_VALUE(ControlNumber) over (partition by [MD5Hash] order by ID
Then use the value in your UPDATE

Related

How do I remove duplicates from this data in snowflake?

I have this data where I have to remove the duplicate data highlighted in yellow. I tried to use distinct and group by but it does not seem to work.
Even though the comment and value is different, anything having the value as CRIMSSTANDARD is considered a child of the parent in this case and hence considered as a duplicate as the first 6 fields are the same. How do I remove the duplicate records in this case?
CLIENT;CLIENT_ID;QTY_TYPE;QUANTITY;AMOUNT;TRANTYPE;COMMENT;VALUE;ORDER_ID;DEV_COMM;AMT_COMM;ID
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-AGNIndPlusFd;Q;2898;;Buy;INS;CRIMSSTANDARD;10902701;;;
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-DepVz;A;;82000;Sell;GEN,TransferOrder;10902568;Y;0;
TEST_IMPL8;P-DepVz;A;;82000;Sell;INS;CRIMSSTANDARD;10902568;;;
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;GEN;OHPEN_14613296BD001571;10902668;;;14613296BD001571
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;INS;CRIMSSTANDARD;10902668;;;
TEST_IMPL8;P-Dyn4Vz;A;;13000;Buy;INS;CRIMSSTANDARD;10877286;;;
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;GEN;OHPEN_14613296BD001565;10677375;;;14613296BD001565
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;INS;CRIMSSTANDARD;10677375;;;
TEST_IMPL8;P-VNAMijBel;Q;10236;;Sell;INS;CRIMSSTANDARD;11402183;;;
First pass, you'll need to get rid of rows that are entire duplicates of each other (this will fix the row 1 and row 3 issue in your data where the rows are exactly the same). Use the approach outlined here: How to delete Duplicate records in snowflake database table
Next, try this to eliminate duplicates of first 6 columns, and deleting the child record designated by CRIMSSTANDARD in the COMMENT column:
begin work;
delete from
my_table using (
select
CLIENT,
CLIENT_ID,
QTY_TYPE,
QUANTITY,
AMOUNT,
TRANTYPE
from
my_table
GROUP BY
CLIENT,
CLIENT_ID,
QTY_TYPE,
QUANTITY,
AMOUNT,
TRANTYPE
HAVING
COUNT(*) > 1
) as my_table_dupes
where
my_table.CLIENT = X.CLIENT
AND my_table.CLIENT_ID = my_table_dupes.CLIENT_ID
AND my_table.QTY_TYPE = my_table_dupes.QTY_TYP
AND my_table.QUANTITY = my_table_dupes.QUANTITY
AND my_table.AMOUNT = my_table_dupes.AMOUNT
AND my_table.TRANTYPE = my_table_dupes.TRANTYPE
AND my_table.COMMENT = 'CRIMSSTANDARD'
commit work;
There is no distinct on in Snowflake, but you can have a similar result using qualify:
SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value
ORDER BY client, client_id, qty_type, quantity, amount, trantype, value
) = 1;
See here to see more details
EDIT:
To keep the rows that have the dev_comm, amt_comm and ID non null in priority you can modify your order condition:
SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value
ORDER BY
IFF(dev_comm IS NOT NULL, 0, 1),
IFF(amt_comm IS NOT NULL, 0, 1),
IFF(ID IS NOT NULL, 0, 1)
) = 1;
You can also have different priorities by ordering the fields in the order by. For example in case you have multiple lines that have the dev_comm columns but you want in priority the line that have the ID, you need to put the iff(ID ... first in your order by. It order by the first element, then if there is several lines with the same first element it order by the second, etc...

Big query De-duplication query is not working properly

anyone please tell me the below query is not working properly, It suppose to delete the duplicate records only and keep the one of them (latest record) but it is deleting all the record instead of keeping one of the duplicate records, why is it so?
delete
from
dev_rahul.page_content_insights
where
(sha_id,
etl_start_utc_dttm) in (
select
(a.sha_id,
a.etl_start_utc_dttm)
from
(
select
sha_id,
etl_start_utc_dttm,
ROW_NUMBER() over (Partition by sha_id
order by
etl_start_utc_dttm desc) as rn
from
dev_rahul.page_content_insights
where
(snapshot_dt) >= '2021-03-25' ) a
where
a.rn <> 1)
Query looks ok, though I don't use that syntax for cleaning up duplicates.
Can I confirm the following:
sha_id, etl_start_utc_dttm is your primary key?
You wish to keep sha_id and the latest row based on etl_start_utc_dttm field descending?
If so, try this two query pattern:
create or replace table dev_rahul.rows_not_to_delete as
SELECT col.* FROM (SELECT ARRAY_AGG(pci ORDER BY etl_start_utc_dttm desc LIMIT 1
) OFFSET(0)] col
FROM dev_rahul.page_content_insights pci
where snapshot_dt >= '2021-03-25' )
GROUP BY sha_id
);
delete dev_rahul.page_content_insights p
where not exists (select 1 from DW_pmo.rows_not_to_delete d
where p.sha_id = d.sha_id and p.etl_start_utc_dttm = d.etl_start_utc_dttm
) and snapshot_dt >= '2021-03-25';
You could do this in a singe query by putting the first statement into a CTE.

Remove duplicate row based on select statement

I have two select statements which is returning duplicated data. What I'm trying to accomplish is to remove a duplicated leg. But I'm having hard times to get to the second row programmatically.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes,i.ABID from inv_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Minutes = i2.Minutes
and i.Uid <> i2.Uid
and i.abid = i2.abid
order by i.EndDate
This select statement returns the following data.
As you can see it returns duplicate rows where minutes are the same ABID is the same but InvID are different. What I need to do is to remove one of the InvID where the criteria matches. Doesn't matter which one.
The second select statement is returning different data.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes from InvoiceLines_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Uid = i2.Uid
and i.Abid <> i2.Abid
and i.Language <> i2.Language
order by i.startdate desc
In this select statement I want to remove an InvID where UID is the same then select the lowest Mintues. In This case, I would remove the following InvIDs: 2537676 , 2537210
My goal is to remove those rows...
I could accomplish this using cursor grab the InvID and remove it by simple delete statement, but I'm trying to stay away from cursors.
Any suggestions on how I can accomplish this?
You can use exists to delete all duplicates except the one with the highest InvID by deleting those rows where another row exists with the same values but with a higher InvID
delete from inv_v
where exists (
select 1 from inv_v i2
where i2.InvID > inv_v.InvID
and i2.minutes = inv_v.minutes
and i2.EndDate = inv_v.EndDate
and i2.abid = inv_v.abid
and i2.uid <> inv_v.uid -- not sure why <> is used here, copied from question
)
I have faced similar problems regarding duplicate data and some one told me to use partition by and other methods but those were causing performance issues
However , I had a primary key in my table through which I was able to select one row from the duplicate data and then delete it.
For example in the first select statement "minutes" and "ABID" are the criteria to consider duplicacy in data.But "Invid" can be used to distinguish between the duplicate rows.
So you can use below query to remove duplicacy.
delete from inv_i where inv_id in (select max(inv_id) from inv_i group by minutes,abid having count(*) > 1 );
This simple concept was helpful to me. It can be helpful in your case if "Inv_id" is unique.
;WITH CTE AS
(
SELECT InvID
,[UID]
,StartDate
,EndDate
,[Minutes]
,ROW_NUMBER() OVER (PARTITION BY InvID, [UID] ORDER BY [Minutes] ASC) rn
FROM InvoiceLines_v
)
SELECT *
FROM CTE
WHERE rn = 1
Replace the ORIGINAL_TABLE with your table name.
QUERY 1:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY minutes, ABID ORDER BY minutes, ABID) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
QUERY 2:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY UID ORDER BY minutes) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;

SQL UPDATE row Number

I have a table serviceClusters with a column identity(1590 values). Then I have another table serviceClustersNew with the columns ID, text and comment. In this table, I have some values for text and comment, the ID is always 1. Here an example for the table:
[1, dummy1, hello1;
1, dummy2, hello2;
1, dummy3, hello3;
etc.]
WhaI want now for the values in the column ID is the continuing index of the table serviceClusters plus the current Row number: In our case, this would be 1591, 1592 and 1593.
I tried to solve the problem like this: First I updated the column ID with the maximum value, then I tryed to add the row number, but this doesnt work:
-- Update ID to the maximum value 1590
UPDATE serviceClustersNew
SET ID = (SELECT MAX(ID) FROM serviceClusters);
-- This command returns the correct values 1591, 1592 and 1593
SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber
FROM serviceClustersNew
-- But I'm not able to update the table with this command
UPDATE serviceClustersNew
SET ID = (SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber FROM
serviceClustersNew)
By sending the last command, I get the error "Syntax error: Ordered Analytical Functions are not allowed in subqueries.". Do you have any suggestions, how I could solve the problem? I know I could do it with a volatile table or by adding a column, but is there a way without creating a new table / altering the current table?
You have to rewrite it using UPDATE FROM, the syntax is just a bit bulky:
UPDATE serviceClustersNew
FROM
(
SELECT text_id,
(SELECT MAX(ID) FROM serviceClusters) +
ROW_NUMBER() OVER (ORDER BY Text_ID) AS newID
FROM serviceClustersNew
) AS src
SET ID = newID
WHERE serviceClustersNew.Text_ID = src.Text_ID
You are not dealing with a lot of data, so a correlated subquery can serve the same purpose:
UPDATE serviceClustersNew
SET ID = (select max(ID) from serviceClustersNew) +
(select count(*)
from serviceClustersNew scn2
where scn2.Text_Id <= serviceClustersNew.TextId
)
This assumes that the text_id is unique along the rows.
Apparently you can update a base table through a CTE... had no idea. So, just change your last UPDATE statement to this, and you should be good. Just be sure to include any fields in the CTE that you desire to update.
;WITH cte_TEST AS
( SELECT
ID,
ID+ROW_NUMBER() OVER (ORDER BY TEXT_ID) AS RowNumber FROM serviceClustersNew)
UPDATE cte_TEST
SET cte_TEST.ID = cte_TEST.RowNumber
Source:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/ee06f451-c418-4bca-8288-010410e8cf14/update-table-using-rownumber-over

Getting row number for query

I have a query which will return one row. Is there any way I can find the row index of the row I'm querying when the table is sorted?
I've tried rowid but got #582 when I was expecting row #7.
Eg:
CategoryID Name
I9GDS720K4 CatA
LPQTOR25XR CatB
EOQ215FT5_ CatC
K2OCS31WTM CatD
JV5FIYY4XC CatE
--> C_L7761O2U CatF <-- I want this row (#5)
OU3XC6T19K CatG
L9YKCYAYMG CatH
XKWMQ7HREG CatI
I've tried rowid with unexpected results:
SELECT rowid FROM Categories WHERE CategoryID = 'C_L7761O2U ORDER BY Name
EDIT: I've also tried J Cooper's suggestion (below), but the row numbers just aren't right.
using (var cmd = conn.CreateCommand()) {
cmd.CommandText = string.Format(#"SELECT (SELECT COUNT(*) FROM Recipes AS t2 WHERE t2.RecipeID <= t1.RecipeID) AS row_Num
FROM Recipes AS t1
WHERE RecipeID = 'FB3XSAXRWD'
ORDER BY Name";
cmd.Parameters.AddWithValue("#recipeId", id);
idx = Convert.ToInt32(cmd.ExecuteScalar());
Here is a way to get the row number in Sqlite:
SELECT CategoryID,
Name,
(SELECT COUNT(*)
FROM mytable AS t2
WHERE t2.Name <= t1.Name) AS row_Num
FROM mytable AS t1
ORDER BY Name, CategoryID;
Here's a funny trick you can use in Spatialite to get the order of values. If you use the count() function with a WHERE clause limiting to only values >= the current value, then the count will actually give the order. So if I have a point layer called "mypoints" with columns "value" and "val_order" then:
SELECT value, (
SELECT count(*) FROM mypoints AS my
WHERE my.value>=mypoints.value) AS val_order
FROM mypoints
ORDER BY value DESC;
Gives the descending order of the values.
I can update the "val_order" column this way:
UPDATE mypoints SET val_order = (
SELECT count(*) FROM mypoints AS my
WHERE my.value>=mypoints.value
);
What you are asking can be explained in two different ways, but I'm assuming you want to sort the resulting table and then number those rows according to the sort.
declare #resultrow int
select
#resultrow = row_number() OVER (ORDER BY Name Asc) as 'Row Number'
from Categories WHERE CategoryID = 'C_L776102U'
select #resultrow