How do I remove duplicates from this data in snowflake? - sql

I have this data where I have to remove the duplicate data highlighted in yellow. I tried to use distinct and group by but it does not seem to work.
Even though the comment and value is different, anything having the value as CRIMSSTANDARD is considered a child of the parent in this case and hence considered as a duplicate as the first 6 fields are the same. How do I remove the duplicate records in this case?
CLIENT;CLIENT_ID;QTY_TYPE;QUANTITY;AMOUNT;TRANTYPE;COMMENT;VALUE;ORDER_ID;DEV_COMM;AMT_COMM;ID
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-AGNIndPlusFd;Q;2898;;Buy;INS;CRIMSSTANDARD;10902701;;;
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-DepVz;A;;82000;Sell;GEN,TransferOrder;10902568;Y;0;
TEST_IMPL8;P-DepVz;A;;82000;Sell;INS;CRIMSSTANDARD;10902568;;;
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;GEN;OHPEN_14613296BD001571;10902668;;;14613296BD001571
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;INS;CRIMSSTANDARD;10902668;;;
TEST_IMPL8;P-Dyn4Vz;A;;13000;Buy;INS;CRIMSSTANDARD;10877286;;;
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;GEN;OHPEN_14613296BD001565;10677375;;;14613296BD001565
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;INS;CRIMSSTANDARD;10677375;;;
TEST_IMPL8;P-VNAMijBel;Q;10236;;Sell;INS;CRIMSSTANDARD;11402183;;;

First pass, you'll need to get rid of rows that are entire duplicates of each other (this will fix the row 1 and row 3 issue in your data where the rows are exactly the same). Use the approach outlined here: How to delete Duplicate records in snowflake database table
Next, try this to eliminate duplicates of first 6 columns, and deleting the child record designated by CRIMSSTANDARD in the COMMENT column:
begin work;
delete from
my_table using (
select
CLIENT,
CLIENT_ID,
QTY_TYPE,
QUANTITY,
AMOUNT,
TRANTYPE
from
my_table
GROUP BY
CLIENT,
CLIENT_ID,
QTY_TYPE,
QUANTITY,
AMOUNT,
TRANTYPE
HAVING
COUNT(*) > 1
) as my_table_dupes
where
my_table.CLIENT = X.CLIENT
AND my_table.CLIENT_ID = my_table_dupes.CLIENT_ID
AND my_table.QTY_TYPE = my_table_dupes.QTY_TYP
AND my_table.QUANTITY = my_table_dupes.QUANTITY
AND my_table.AMOUNT = my_table_dupes.AMOUNT
AND my_table.TRANTYPE = my_table_dupes.TRANTYPE
AND my_table.COMMENT = 'CRIMSSTANDARD'
commit work;

There is no distinct on in Snowflake, but you can have a similar result using qualify:
SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value
ORDER BY client, client_id, qty_type, quantity, amount, trantype, value
) = 1;
See here to see more details
EDIT:
To keep the rows that have the dev_comm, amt_comm and ID non null in priority you can modify your order condition:
SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value
ORDER BY
IFF(dev_comm IS NOT NULL, 0, 1),
IFF(amt_comm IS NOT NULL, 0, 1),
IFF(ID IS NOT NULL, 0, 1)
) = 1;
You can also have different priorities by ordering the fields in the order by. For example in case you have multiple lines that have the dev_comm columns but you want in priority the line that have the ID, you need to put the iff(ID ... first in your order by. It order by the first element, then if there is several lines with the same first element it order by the second, etc...

Related

How to deselect duplicate entries in a query?

I've got a query like this:
SELECT *
FROM RecipeTable, RecipeIngredientTable, SyncRecipeIngredientTable
WHERE RecipeTable.recipe_id = SyncRecipeIngredientTable.recipe_id
AND RecipeIngredientTable.recipe_ingredient_id =
SyncRecipeIngredientTable.recipe_ingredient_id
AND RecipeIngredientTable.recipe_item_name in ("ayva", "pirinç", "su")
GROUP by RecipeTable.recipe_id
HAVING COUNT(*) >= 3;
and this query returns the result like this:
As you can see in the image there is 3 duplicate, unnecessary entries (no, i can't delete them because of the multiple foreign keys). How can I deselect these duplicate entries from the result query? In the end I want to return 6 entries not 9.
What you want to eliminate in the result set is not duplication of recipe_id values but recipe_name values.
You just need to group(partition) by recipe_name through use of ROW_NUMBER() analytic function :
SELECT recipe_id, author_name ...
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY recipe_name) AS rn,
sr.recipe_id, author_name ...
FROM SyncRecipeIngredientTable sr
JOIN RecipeIngredientTable ri
ON ri.recipe_ingredient_id = sr.recipe_ingredient_id
JOIN RecipeTable rt
ON rt.recipe_id = sr.recipe_id
WHERE ri.recipe_item_name in ("ayva", "pirinç", "su")
)
WHERE rn = 1
This way, you can pick only one of the records with rn=1 (ORDER BY Clause might be added to that analytic function after PARTITION BY clause if spesific record is needed to be picked)

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

SQL Deduplication, populating the duplicates with its unique identifier

I want to be able to populate any duplicate items in a table with its unique identifier. So for example, in the table below;
PROJ00002492 should get a GlobalFamilDupID of PROJ00002492 (itself, aka the ControlNumber column), and all other duplicates should get the same value PROJ00002492.
PROJ00005876 should get the value PROJ00005876 (aka itself, the ControlNumber column).
Code:
update mstr
SET
IsGlobalFamilyUnique = case when (rn > 1) then 0 else 1 end
from(
select
ControlNumber,
MD5hash,
IsglobalFamilyUnique,
GlobalFamilyDupID,
row_number() over (partition by [MD5Hash] order by ID asc) [RN]
from dbo.tblMaster
where NuixGuid = TopLvlGuid and IsGlobalFamilyUnique is null
)mstr
The above code works, but I can't think how to populate the GlobalFamilDupID column? Will I have to do it in a separate query?
You can add the value you need into your SELECT using
FIRST_VALUE(ControlNumber) over (partition by [MD5Hash] order by ID
Then use the value in your UPDATE

Remove duplicate row based on select statement

I have two select statements which is returning duplicated data. What I'm trying to accomplish is to remove a duplicated leg. But I'm having hard times to get to the second row programmatically.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes,i.ABID from inv_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Minutes = i2.Minutes
and i.Uid <> i2.Uid
and i.abid = i2.abid
order by i.EndDate
This select statement returns the following data.
As you can see it returns duplicate rows where minutes are the same ABID is the same but InvID are different. What I need to do is to remove one of the InvID where the criteria matches. Doesn't matter which one.
The second select statement is returning different data.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes from InvoiceLines_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Uid = i2.Uid
and i.Abid <> i2.Abid
and i.Language <> i2.Language
order by i.startdate desc
In this select statement I want to remove an InvID where UID is the same then select the lowest Mintues. In This case, I would remove the following InvIDs: 2537676 , 2537210
My goal is to remove those rows...
I could accomplish this using cursor grab the InvID and remove it by simple delete statement, but I'm trying to stay away from cursors.
Any suggestions on how I can accomplish this?
You can use exists to delete all duplicates except the one with the highest InvID by deleting those rows where another row exists with the same values but with a higher InvID
delete from inv_v
where exists (
select 1 from inv_v i2
where i2.InvID > inv_v.InvID
and i2.minutes = inv_v.minutes
and i2.EndDate = inv_v.EndDate
and i2.abid = inv_v.abid
and i2.uid <> inv_v.uid -- not sure why <> is used here, copied from question
)
I have faced similar problems regarding duplicate data and some one told me to use partition by and other methods but those were causing performance issues
However , I had a primary key in my table through which I was able to select one row from the duplicate data and then delete it.
For example in the first select statement "minutes" and "ABID" are the criteria to consider duplicacy in data.But "Invid" can be used to distinguish between the duplicate rows.
So you can use below query to remove duplicacy.
delete from inv_i where inv_id in (select max(inv_id) from inv_i group by minutes,abid having count(*) > 1 );
This simple concept was helpful to me. It can be helpful in your case if "Inv_id" is unique.
;WITH CTE AS
(
SELECT InvID
,[UID]
,StartDate
,EndDate
,[Minutes]
,ROW_NUMBER() OVER (PARTITION BY InvID, [UID] ORDER BY [Minutes] ASC) rn
FROM InvoiceLines_v
)
SELECT *
FROM CTE
WHERE rn = 1
Replace the ORIGINAL_TABLE with your table name.
QUERY 1:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY minutes, ABID ORDER BY minutes, ABID) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
QUERY 2:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY UID ORDER BY minutes) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;

SQL UPDATE row Number

I have a table serviceClusters with a column identity(1590 values). Then I have another table serviceClustersNew with the columns ID, text and comment. In this table, I have some values for text and comment, the ID is always 1. Here an example for the table:
[1, dummy1, hello1;
1, dummy2, hello2;
1, dummy3, hello3;
etc.]
WhaI want now for the values in the column ID is the continuing index of the table serviceClusters plus the current Row number: In our case, this would be 1591, 1592 and 1593.
I tried to solve the problem like this: First I updated the column ID with the maximum value, then I tryed to add the row number, but this doesnt work:
-- Update ID to the maximum value 1590
UPDATE serviceClustersNew
SET ID = (SELECT MAX(ID) FROM serviceClusters);
-- This command returns the correct values 1591, 1592 and 1593
SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber
FROM serviceClustersNew
-- But I'm not able to update the table with this command
UPDATE serviceClustersNew
SET ID = (SELECT ID+ROW_NUMBER() OVER (ORDER BY Text_ID) AS RowNumber FROM
serviceClustersNew)
By sending the last command, I get the error "Syntax error: Ordered Analytical Functions are not allowed in subqueries.". Do you have any suggestions, how I could solve the problem? I know I could do it with a volatile table or by adding a column, but is there a way without creating a new table / altering the current table?
You have to rewrite it using UPDATE FROM, the syntax is just a bit bulky:
UPDATE serviceClustersNew
FROM
(
SELECT text_id,
(SELECT MAX(ID) FROM serviceClusters) +
ROW_NUMBER() OVER (ORDER BY Text_ID) AS newID
FROM serviceClustersNew
) AS src
SET ID = newID
WHERE serviceClustersNew.Text_ID = src.Text_ID
You are not dealing with a lot of data, so a correlated subquery can serve the same purpose:
UPDATE serviceClustersNew
SET ID = (select max(ID) from serviceClustersNew) +
(select count(*)
from serviceClustersNew scn2
where scn2.Text_Id <= serviceClustersNew.TextId
)
This assumes that the text_id is unique along the rows.
Apparently you can update a base table through a CTE... had no idea. So, just change your last UPDATE statement to this, and you should be good. Just be sure to include any fields in the CTE that you desire to update.
;WITH cte_TEST AS
( SELECT
ID,
ID+ROW_NUMBER() OVER (ORDER BY TEXT_ID) AS RowNumber FROM serviceClustersNew)
UPDATE cte_TEST
SET cte_TEST.ID = cte_TEST.RowNumber
Source:
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/ee06f451-c418-4bca-8288-010410e8cf14/update-table-using-rownumber-over