Big query De-duplication query is not working properly

Big query De-duplication query is not working properly - google-bigquery

anyone please tell me the below query is not working properly, It suppose to delete the duplicate records only and keep the one of them (latest record) but it is deleting all the record instead of keeping one of the duplicate records, why is it so?
delete
from
dev_rahul.page_content_insights
where
(sha_id,
etl_start_utc_dttm) in (
select
(a.sha_id,
a.etl_start_utc_dttm)
from
(
select
sha_id,
etl_start_utc_dttm,
ROW_NUMBER() over (Partition by sha_id
order by
etl_start_utc_dttm desc) as rn
from
dev_rahul.page_content_insights
where
(snapshot_dt) >= '2021-03-25' ) a
where
a.rn <> 1)

Query looks ok, though I don't use that syntax for cleaning up duplicates.
Can I confirm the following:
sha_id, etl_start_utc_dttm is your primary key?
You wish to keep sha_id and the latest row based on etl_start_utc_dttm field descending?
If so, try this two query pattern:
create or replace table dev_rahul.rows_not_to_delete as
SELECT col.* FROM (SELECT ARRAY_AGG(pci ORDER BY etl_start_utc_dttm desc LIMIT 1
) OFFSET(0)] col
FROM dev_rahul.page_content_insights pci
where snapshot_dt >= '2021-03-25' )
GROUP BY sha_id
);
delete dev_rahul.page_content_insights p
where not exists (select 1 from DW_pmo.rows_not_to_delete d
where p.sha_id = d.sha_id and p.etl_start_utc_dttm = d.etl_start_utc_dttm
) and snapshot_dt >= '2021-03-25';
You could do this in a singe query by putting the first statement into a CTE.

Related

SQL Server: Each GROUP BY expression must contain at least one column that is not an outer reference

scenario 1:
I have two tables INFUSION_APP_APPOINTMENT,INFUSION_APP_NURSE_NOTES where
INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID=INFUSION_APP_APPOINTMENT.ID and i want to find out the INFUSION_APP_NURSE_NOTES.ID's where INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID is same.
for eg. if the INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID = 1 and INFUSION_APP_NURSE_NOTES.ID is 12,15,78, then i want to display all the
INFUSION_APP_NURSE_NOTES.ID's where INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID =1.
i use below script
SELECT INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID,INFUSION_APP_NURSE_NOTES.ID
FROM INFUSION_APP_NURSE_NOTES
GROUP BY INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID,INFUSION_APP_NURSE_NOTES.ID
HAVING COUNT(INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID)>1
but it does not gives me any records.
scenario 2:
I am running below script with the intention to get the duplicate records with different INFUSION_APP_NURSE_NOTES.ID's but same INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID.
SELECT INFUSION_APP_NURSE_NOTES.ID,INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID,INFUSION_APP_NURSE_NOTES.TYPE
FROM INFUSION_APP_NURSE_NOTES
WHERE
EXISTS (
SELECT 1 FROM INFUSION_APP_APPOINTMENT
WHERE
INFUSION_APP_NURSE_NOTES.ENABLE=1
AND INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID=INFUSION_APP_APPOINTMENT.ID
GROUP BY INFUSION_APP_NURSE_NOTES.ID
HAVING COUNT(INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID)>1
)
ORDER BY INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID;
but getting below error
SQL Error(164): Each GROUP BY expression must contain at least one
column that is not an outer reference
how to solve it?
i want the only row which has common APPOINTMENT_ID but different n

The question is unclear. Finding duplicates is typically performed using ranking functions like ROW_NUMBER(). This query :
SELECT *,ROW_NUMBER(PARTITION BY APPOINTMENT_ID ORDER BYID) as RN
FROM INFUSION_APP_NURSE_NOTES
WHERE
ENABLE=1
Will rank notes for the same appointment by ID and return 1, 2, 3 etc starting from the earliest note. ORDER BY ID DESC would return 1 for the latest note.
This can be used in a subquery or CTE to find the first, last or or duplicate records, eg :
with notes as (
SELECT *,ROW_NUMBER(PARTITION BY APPOINTMENT_ID ORDER BYID) as RN
FROM INFUSION_APP_NURSE_NOTES
WHERE
ENABLE=1
)
select *
from notes
where RN=1
Will return the first note per appointment while :
where RN>1
Will return only duplicates.
The question doesn't say what should be done with the duplicates though.
If the question is how to return all notes from appointments with multiple notes, a subquery can be used to return the APPOINTMENT_IDs that have more than one note. There's no need to include the INFUSION_APP_APPOINTMENT table though :
SELECT *
FROM INFUSION_APP_NURSE_NOTES
where
ENABLE=1 AND
APPOINTMENT_ID IN ( SELECT APPOINTMENT_ID
FROM INFUSION_APP_NURSE_NOTES
WHERE
ENABLE=1
group by APPOINTMENT_ID
having count(*)>1)

Try this
SELECT INFUSION_APP_NURSE_NOTES.ID,INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID,INFUSION_APP_NURSE_NOTES.TYPE
FROM INFUSION_APP_NURSE_NOTES
WHERE
EXISTS (
SELECT COUNT(B.APPOINTMENT_ID), B.ID
FROM INFUSION_APP_APPOINTMENT A
INNER JOIN INFUSION_APP_NURSE_NOTES B ON B.APPOINTMENT_ID = A.ID
WHERE
B.ENABLE=1
GROUP BY B.ID
HAVING COUNT(B.APPOINTMENT_ID)>1
)
ORDER BY INFUSION_APP_NURSE_NOTES.APPOINTMENT_ID;

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here

You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).

Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).

A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Remove duplicate row based on select statement

I have two select statements which is returning duplicated data. What I'm trying to accomplish is to remove a duplicated leg. But I'm having hard times to get to the second row programmatically.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes,i.ABID from inv_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Minutes = i2.Minutes
and i.Uid <> i2.Uid
and i.abid = i2.abid
order by i.EndDate
This select statement returns the following data.
As you can see it returns duplicate rows where minutes are the same ABID is the same but InvID are different. What I need to do is to remove one of the InvID where the criteria matches. Doesn't matter which one.
The second select statement is returning different data.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes from InvoiceLines_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Uid = i2.Uid
and i.Abid <> i2.Abid
and i.Language <> i2.Language
order by i.startdate desc
In this select statement I want to remove an InvID where UID is the same then select the lowest Mintues. In This case, I would remove the following InvIDs: 2537676 , 2537210
My goal is to remove those rows...
I could accomplish this using cursor grab the InvID and remove it by simple delete statement, but I'm trying to stay away from cursors.
Any suggestions on how I can accomplish this?

You can use exists to delete all duplicates except the one with the highest InvID by deleting those rows where another row exists with the same values but with a higher InvID
delete from inv_v
where exists (
select 1 from inv_v i2
where i2.InvID > inv_v.InvID
and i2.minutes = inv_v.minutes
and i2.EndDate = inv_v.EndDate
and i2.abid = inv_v.abid
and i2.uid <> inv_v.uid -- not sure why <> is used here, copied from question
)

I have faced similar problems regarding duplicate data and some one told me to use partition by and other methods but those were causing performance issues
However , I had a primary key in my table through which I was able to select one row from the duplicate data and then delete it.
For example in the first select statement "minutes" and "ABID" are the criteria to consider duplicacy in data.But "Invid" can be used to distinguish between the duplicate rows.
So you can use below query to remove duplicacy.
delete from inv_i where inv_id in (select max(inv_id) from inv_i group by minutes,abid having count(*) > 1 );
This simple concept was helpful to me. It can be helpful in your case if "Inv_id" is unique.

;WITH CTE AS
(
SELECT InvID
,[UID]
,StartDate
,EndDate
,[Minutes]
,ROW_NUMBER() OVER (PARTITION BY InvID, [UID] ORDER BY [Minutes] ASC) rn
FROM InvoiceLines_v
)
SELECT *
FROM CTE
WHERE rn = 1

Replace the ORIGINAL_TABLE with your table name.
QUERY 1:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY minutes, ABID ORDER BY minutes, ABID) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
QUERY 2:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY UID ORDER BY minutes) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;

Remove duplicates (1 to many) or write a subquery that solves my problem

Referring to the diagram below the records table has unique Records. Each record is updated, via comments through an Update Table. When I join the two I get lots of duplicates.
How to remove duplicates? Group By does not work for me as I have more than 10 fields in select query and some of them are functions.
Write a sub query which pulls the last updates in the Update table for each record that is updated in a particular month. Joining with this sub query will solve my problem.
Thanks!
Edit
Table structure that is of interest is
create table Records(
recordID int,
90more_fields various
)
create table Updates(
update_id int,
record_id int,
comment text,
byUser varchar(25),
datecreate datetime
)

Here's one way.
SELECT * /*But list columns explicitly*/
FROM Orange o
CROSS APPLY (SELECT TOP 1 *
FROM Blue b
WHERE b.datecreate >= '20110901'
AND b.datecreate < '20111001'
AND o.RecordID = b.Record_ID2
ORDER BY b.datecreate DESC) b

Based on the limited information available...
WITH cteLastUpdate AS (
SELECT Record_ID2, UpdateDateTime,
ROW_NUMBER() OVER(PARTITION BY Record_ID2 ORDER BY UpdateDateTime DESC) AS RowNUM
FROM BlueTable
/* Add WHERE clause if needed to restrict date range */
)
SELECT *
FROM cteLastUpdate lu
INNER JOIN OrangeTable o
ON lu.Record_ID2 = o.RecordID
WHERE lu.RowNum = 1

Last updates per record and month:
SELECT *
FROM UPDATES outerUpd
WHERE exists
(
-- Magic part
SELECT 1
FROM UPDATES innerUpd
WHERE innerUpd.RecordId = outerUpd.RecordId
GROUP BY RecordId
, date_part('year', innerUpd.datecolumn)
, date_part('month', innerUpd.datecolumn)
HAVING max(innerUpd.datecolumn) = outerUpd.datecolumn
)
(Works on PostgreSQL, date_part is different in other RDBMS)

Update based on subquery fails

I am trying to do the following update in Oracle 10gR2:
update
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
set t.port_seq = t.new_seq
Voyage_port_id is the primary key, voyage_id is a foreign key. I'm trying to assign a sequence number based on the dates within each voyage.
However, the above fails with ORA-01732: data manipulation operation not legal on this view
What is the problem and how can I avoid it ?

Since you can't update subqueries with row_number, you'll have to calculate the row number in the set part of the update. At first I tried this:
update voyage_port a
set a.port_seq = (
select
row_number() over (partition by voyage_id order by arrival_date)
from voyage_port b
where b.voyage_port_id = a.voyage_port_id
)
But that doesn't work, because the subquery only selects one row, and then the row_number() is always 1. Using another subquery allows a meaningful result:
update voyage_port a
set a.port_seq = (
select c.rn
from (
select
voyage_port_id
, row_number() over (partition by voyage_id
order by arrival_date) as rn
from voyage_port b
) c
where c.voyage_port_id = a.voyage_port_id
)
It works, but more complex than I'd expect for this task.

You can update some views, but there are restrictions and one is that the view must not contain analytic functions. See SQL Language Reference on UPDATE and search for first occurence of "analytic".
This will work, provided no voyage visits more than one port on the same day (or the dates include a time component that makes them unique):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and vp2.arrival_date <= vp.arrival_date
)
I think this handles the case where a voyage visits more than 1 port per day and there is no time component (though the sequence of ports visited on the same day is then arbitrary):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and (vp2.arrival_date <= vp.arrival_date)
or ( vp2.arrival_date = vp.arrival_date
and vp2.voyage_port_id <= vp.voyage_port_id
)
)

Don't think you can update a derived table, I'd rewrite as:
update voyage_port
set port_seq = t.new_seq
from
voyage_port p
inner join
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
on p.voyage_port_id = t.voyage_port_id

The first token after the UPDATE should be the name of the table to update, then your columns-to-update. I'm not sure what you are trying to achieve with the select statement where it is, but you can' update the result set from the select legally.
A version of the sql, guessing what you have in mind, might look like...
update voyage_port t
set t.port_seq = (<select statement that generates new value of port_seq>)
NOTE: to use a select statement to set a value like this you must make sure only 1 row will be returned from the select !
EDIT : modified statement above to reflect what I was trying to explain. The question has been answered very nicely by Andomar above

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Big query De-duplication query is not working properly - google-bigquery

Related

SQL Server: Each GROUP BY expression must contain at least one column that is not an outer reference

Modify my SQL Server query -- returns too many rows sometimes

Remove duplicate row based on select statement

Remove duplicates (1 to many) or write a subquery that solves my problem

Update based on subquery fails

Categories

Resources