Replacing IN clause by EXISTS causes unexpected results - sql

Given a simple table lieu_horaire with :
+ id_horaire (numeric)
+ id_lieu (numeric)
+ horaire (timestamp)
Following query works fine at the moment :
DELETE FROM lieu_horaire where id_horaire IN (
SELECT id_horaire
FROM (
SELECT id_horaire,
ROW_NUMBER() OVER (PARTITION BY id_lieu order by horaire desc) AS line_number
FROM lieu_horaire
) as sr
WHERE sr.line_number > 10);
For scalability reasons, however, I would like to replace the IN clause with EXISTS ('... WHERE EXISTS(... '). With EXISTS, instead of deleting only the intended rows, the statement removes every row, as if the '... WHERE sr.line_number...' condition were not present.

Make sure the subquery is correlated with the outer table on the key column when you use EXISTS, something like this:
DELETE FROM lieu_horaire where exists (
SELECT id_horaire
FROM (
SELECT id_horaire,
ROW_NUMBER() OVER (PARTITION BY id_lieu order by horaire desc) AS line_number
FROM lieu_horaire
) as sr
WHERE sr.line_number > 10 and lieu_horaire.id_horaire = sr.id_horaire);
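To see why the correlation matters, here is a minimal sketch that runs both predicates against an in-memory SQLite database through Python's sqlite3 module. SQLite, the sample rows, and the helper count_matching are assumptions made for illustration (the original question does not name its database), and window functions need SQLite 3.25 or later. It counts the rows each WHERE clause would select instead of deleting them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lieu_horaire ("
    "id_horaire INTEGER PRIMARY KEY, id_lieu INTEGER, horaire TEXT)"
)
# Hypothetical data: one place (id_lieu = 1) with 12 timestamps.
# Only the 2 oldest rows rank beyond the 10 most recent.
rows = [(i, 1, f"2024-01-{i:02d} 00:00:00") for i in range(1, 13)]
conn.executemany("INSERT INTO lieu_horaire VALUES (?, ?, ?)", rows)

RANKED = """
    SELECT id_horaire,
           ROW_NUMBER() OVER (PARTITION BY id_lieu ORDER BY horaire DESC) AS line_number
    FROM lieu_horaire
"""

def count_matching(where_clause):
    """Count the rows of lieu_horaire that satisfy the given WHERE clause."""
    sql = f"SELECT COUNT(*) FROM lieu_horaire WHERE {where_clause}"
    return conn.execute(sql).fetchone()[0]

# Uncorrelated: the subquery is non-empty, so EXISTS is true for every outer row.
uncorrelated = count_matching(
    f"EXISTS (SELECT 1 FROM ({RANKED}) AS sr WHERE sr.line_number > 10)"
)

# Correlated: the subquery is tied to the current outer row through id_horaire.
correlated = count_matching(
    f"EXISTS (SELECT 1 FROM ({RANKED}) AS sr "
    "WHERE sr.line_number > 10 AND sr.id_horaire = lieu_horaire.id_horaire)"
)

print(uncorrelated, correlated)
```

The uncorrelated form matches all 12 rows, while the correlated form matches only the 2 rows ranked beyond the tenth, which is the same set the IN version selects.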

Here is a simplification using NOT IN:
DELETE FROM lieu_horaire lh
WHERE lh.id_horaire NOT IN (SELECT lh2.id_horaire
FROM lieu_horaire lh2
WHERE lh2.id_lieu = lh.id_lieu
ORDER BY lh2.horaire DESC
LIMIT 10
);
However, I don't know that the performance will be much better. For either version, you want an index on lieu_horaire(id_lieu, horaire).
If a large number of rows were to be deleted, I might suggest truncate/insert instead.

Related

Big query De-duplication query is not working properly

Can anyone tell me why the query below is not working properly? It is supposed to delete only the duplicate records and keep one of them (the latest record), but it deletes all of the records instead of keeping one from each set of duplicates. Why is that?
delete
from
dev_rahul.page_content_insights
where
(sha_id,
etl_start_utc_dttm) in (
select
(a.sha_id,
a.etl_start_utc_dttm)
from
(
select
sha_id,
etl_start_utc_dttm,
ROW_NUMBER() over (Partition by sha_id
order by
etl_start_utc_dttm desc) as rn
from
dev_rahul.page_content_insights
where
(snapshot_dt) >= '2021-03-25' ) a
where
a.rn <> 1)
Query looks ok, though I don't use that syntax for cleaning up duplicates.
Can I confirm the following:
sha_id, etl_start_utc_dttm is your primary key?
You wish to keep sha_id and the latest row based on etl_start_utc_dttm field descending?
If so, try this two query pattern:
create or replace table dev_rahul.rows_not_to_delete as
SELECT col.* FROM (
SELECT ARRAY_AGG(pci ORDER BY etl_start_utc_dttm desc LIMIT 1)[OFFSET(0)] col
FROM dev_rahul.page_content_insights pci
where snapshot_dt >= '2021-03-25'
GROUP BY sha_id
);
delete dev_rahul.page_content_insights p
where not exists (select 1 from dev_rahul.rows_not_to_delete d
where p.sha_id = d.sha_id and p.etl_start_utc_dttm = d.etl_start_utc_dttm
) and snapshot_dt >= '2021-03-25';
You could do this in a single query by putting the first statement into a CTE.

How to deselect duplicate entries in a query?

I've got a query like this:
SELECT *
FROM RecipeTable, RecipeIngredientTable, SyncRecipeIngredientTable
WHERE RecipeTable.recipe_id = SyncRecipeIngredientTable.recipe_id
AND RecipeIngredientTable.recipe_ingredient_id =
SyncRecipeIngredientTable.recipe_ingredient_id
AND RecipeIngredientTable.recipe_item_name in ("ayva", "pirinç", "su")
GROUP by RecipeTable.recipe_id
HAVING COUNT(*) >= 3;
and this query returns the result like this:
As you can see in the image, there are 3 duplicate, unnecessary entries (no, I can't delete them because of the multiple foreign keys). How can I exclude these duplicate entries from the query result? In the end I want to return 6 entries, not 9.
What you want to eliminate from the result set is not the duplication of recipe_id values but of recipe_name values.
You just need to partition by recipe_name using the ROW_NUMBER() analytic function:
SELECT recipe_id, author_name ...
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY recipe_name) AS rn,
sr.recipe_id, author_name ...
FROM SyncRecipeIngredientTable sr
JOIN RecipeIngredientTable ri
ON ri.recipe_ingredient_id = sr.recipe_ingredient_id
JOIN RecipeTable rt
ON rt.recipe_id = sr.recipe_id
WHERE ri.recipe_item_name in ("ayva", "pirinç", "su")
)
WHERE rn = 1
This way, you pick only one record per recipe_name, the one with rn = 1 (an ORDER BY clause can be added to the analytic function after the PARTITION BY clause if a specific record needs to be picked).
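As a rough illustration of the rn = 1 pattern, the sketch below keeps one row per recipe_name in an in-memory SQLite database (3.25 or later for window functions), driven from Python. The single simplified RecipeTable, its sample rows, and the ORDER BY recipe_id tie-break are assumptions added here so that the picked row is deterministic; the original query joins three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RecipeTable (recipe_id INTEGER PRIMARY KEY, recipe_name TEXT)")
# Hypothetical rows: two names appear twice under different recipe_id values.
conn.executemany(
    "INSERT INTO RecipeTable VALUES (?, ?)",
    [(1, "pilav"), (2, "pilav"), (3, "komposto"), (4, "komposto"), (5, "çorba")],
)

query = """
    SELECT recipe_id, recipe_name
    FROM (
        SELECT recipe_id, recipe_name,
               -- Number the rows inside each recipe_name group, lowest id first.
               ROW_NUMBER() OVER (PARTITION BY recipe_name ORDER BY recipe_id) AS rn
        FROM RecipeTable
    )
    WHERE rn = 1
    ORDER BY recipe_id
"""
unique_recipes = conn.execute(query).fetchall()
print(unique_recipes)
```

Each recipe_name survives exactly once, represented by its lowest recipe_id.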

Remove duplicate row based on select statement

I have two select statements which return duplicated data. What I'm trying to accomplish is to remove a duplicated leg, but I'm having a hard time getting to the second row programmatically.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes,i.ABID from inv_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Minutes = i2.Minutes
and i.Uid <> i2.Uid
and i.abid = i2.abid
order by i.EndDate
This select statement returns the following data.
As you can see, it returns duplicate rows where the minutes and the ABID are the same but the InvIDs are different. What I need to do is remove one of the InvIDs where the criteria match. It doesn't matter which one.
The second select statement is returning different data.
select i.InvID, i.UID, i.StartDate, i.EndDate, i.Minutes from InvoiceLines_v i, InvoiceLines_v i2 where
i.Period = '2014/08'
and i.EndDate = i2.EndDate
and i.Uid = i2.Uid
and i.Abid <> i2.Abid
and i.Language <> i2.Language
order by i.startdate desc
In this select statement I want to remove an InvID where the UID is the same, then select the lowest Minutes. In this case, I would remove the following InvIDs: 2537676, 2537210
My goal is to remove those rows...
I could accomplish this by using a cursor to grab the InvID and removing it with a simple delete statement, but I'm trying to stay away from cursors.
Any suggestions on how I can accomplish this?
You can use exists to delete all duplicates except the one with the highest InvID, by deleting those rows for which another row exists with the same values but a higher InvID:
delete from inv_v
where exists (
select 1 from inv_v i2
where i2.InvID > inv_v.InvID
and i2.minutes = inv_v.minutes
and i2.EndDate = inv_v.EndDate
and i2.abid = inv_v.abid
and i2.uid <> inv_v.uid -- not sure why <> is used here, copied from question
)
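The following sketch tries the same idea on an in-memory SQLite database from Python. The table name inv, its columns, and the sample rows are hypothetical, and the uid condition from the statement above is left out so that duplicates are defined only by minutes, EndDate and abid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inv (InvID INTEGER PRIMARY KEY, Minutes INTEGER, EndDate TEXT, ABID INTEGER)"
)
# Hypothetical rows: InvID 1, 2 and 3 are duplicates of each other; 4 and 5 are unique.
conn.executemany(
    "INSERT INTO inv VALUES (?, ?, ?, ?)",
    [
        (1, 30, "2014-08-01", 7),
        (2, 30, "2014-08-01", 7),
        (3, 30, "2014-08-01", 7),
        (4, 45, "2014-08-02", 7),
        (5, 30, "2014-08-01", 8),
    ],
)

# Delete every row for which a duplicate with a higher InvID exists.
# The row with the highest InvID in each group never matches, so it survives.
conn.execute("""
    DELETE FROM inv
    WHERE EXISTS (
        SELECT 1 FROM inv i2
        WHERE i2.InvID > inv.InvID
          AND i2.Minutes = inv.Minutes
          AND i2.EndDate = inv.EndDate
          AND i2.ABID = inv.ABID
    )
""")
remaining = [row[0] for row in conn.execute("SELECT InvID FROM inv ORDER BY InvID")]
print(remaining)
```

A single statement clears the whole group of three duplicates and leaves the two unique rows untouched.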
I have faced similar problems with duplicate data. Someone told me to use PARTITION BY and other methods, but those were causing performance issues.
However, I had a primary key in my table, through which I was able to select one row from each set of duplicates and then delete it.
For example, in the first select statement, "minutes" and "ABID" are the criteria that define duplicate data, but "Invid" can be used to distinguish between the duplicate rows.
So you can use the query below to remove the duplicates.
delete from inv_i where inv_id in (select max(inv_id) from inv_i group by minutes,abid having count(*) > 1 );
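This simple concept was helpful to me. It can be helpful in your case if "Inv_id" is unique. Note that each run removes only one row per duplicate group, so it has to be repeated when a group contains more than two rows.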
;WITH CTE AS
(
SELECT InvID
,[UID]
,StartDate
,EndDate
,[Minutes]
,ROW_NUMBER() OVER (PARTITION BY InvID, [UID] ORDER BY [Minutes] ASC) rn
FROM InvoiceLines_v
)
SELECT *
FROM CTE
WHERE rn = 1
Replace the ORIGINAL_TABLE with your table name.
QUERY 1:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY minutes, ABID ORDER BY minutes, ABID) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;
QUERY 2:
WITH DUP_TABLE AS
(
SELECT ROW_NUMBER()
OVER (PARTITION BY UID ORDER BY minutes) As ROW_NO
FROM <ORIGINAL_TABLE>
)
DELETE FROM DUP_TABLE WHERE ROW_NO > 1;

I need a sql query to group by name but return other fields based on the most recent entry

I'm querying a single table called PhoneCallNotes. The caller FirstName, LastName and DOB are recorded for each call as well as many other fields including a unique ID for the call (PhoneNoteID) but no unique ID for the caller.
My requirement is to return a list of callers with duplicates removed along with the PhoneNoteID, etc from their most recent entry.
I can get the list of users I want using a Group By on name, DOB and Max(CreatedOn), but how do I include the unique ID (of the most recent entry) in the results?
select O.CallerFName,O.CallerLName,O.CallerDOB,Max(O.CreatedOn)
from [dbo].[PhoneCallNotes] as O
where O.CallerLName like 'Public'
group by O.CallerFName,O.CallerLName,O.CallerDOB order by Max(O.CreatedOn)
Results:
John Public 4/4/2001 4/6/12 16:42
Joe Public 4/12/1988 4/6/12 16:52
John Public 1/2/1950 4/6/12 17:01
Thanks
You can also write what Andrey wrote somewhat more compactly if you select TOP (1) WITH TIES and put the ROW_NUMBER() expression in the ORDER BY clause:
SELECT TOP (1) WITH TIES
CallerFName,
CallerLName,
CallerDOB,
CreatedOn,
PhoneNoteID
FROM [dbo].[PhoneCallNotes]
WHERE CallerLName = 'Public'
ORDER BY ROW_NUMBER() OVER(
PARTITION BY CallerFName, CallerLName, CallerDOB
ORDER BY CreatedOn DESC
)
(By the way, there's no reason to use LIKE for a simple string comparison.)
Try something like this:
;WITH CTE AS (
SELECT
O.CallerFName,
O.CallerLName,
O.CallerDOB,
O.CreatedOn,
PhoneNoteID,
ROW_NUMBER() OVER(PARTITION BY O.CallerFName, O.CallerLName, O.CallerDOB ORDER BY O.CreatedOn DESC) AS rn
FROM [dbo].[PhoneCallNotes] AS O
WHERE
O.CallerLName LIKE 'Public'
)
SELECT
CallerFName,
CallerLName,
CallerDOB,
CreatedOn,
PhoneNoteID
FROM CTE
WHERE rn = 1
ORDER BY
CreatedOn
Assuming that the set of [FirstName, LastName, DateOfBirth] are unique (#shudder#), I believe the following should work, on pretty much every major RDBMS:
SELECT a.callerFName, a.callerLName, a.callerDOB, a.createdOn, a.phoneNoteId
FROM phoneCallNotes as a
LEFT JOIN phoneCallNotes as b
ON b.callerFName = a.callerFName
AND b.callerLName = a.callerLName
AND b.callerDOB = a.callerDOB
AND b.createdOn > a.createdOn
WHERE a.callerLName LIKE 'Public'
AND b.phoneNoteId IS NULL
Basically, the query is looking for every phone-call-note for a particular name/dob combination, where there is not a more-recent row (b is null). If you have two rows with the same create time, you'll get duplicate rows, though.
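A small sketch of this anti-join, run against an in-memory SQLite database from Python, also shows the duplicate-on-tie behaviour. The sample rows are invented for illustration, and the second query's tie-break on phoneNoteId is an assumption added here (it is not part of the answer above) that relies on phoneNoteId being unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phoneCallNotes (phoneNoteId INTEGER PRIMARY KEY, callerFName TEXT, "
    "callerLName TEXT, callerDOB TEXT, createdOn TEXT)"
)
# Hypothetical rows; notes 3 and 6 share a caller and an identical createdOn.
conn.executemany(
    "INSERT INTO phoneCallNotes VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Public", "2001-04-04", "2012-04-06 16:00"),
        (2, "John", "Public", "2001-04-04", "2012-04-06 16:42"),
        (3, "Joe", "Public", "1988-04-12", "2012-04-06 16:52"),
        (4, "John", "Public", "1950-01-02", "2012-04-06 17:01"),
        (5, "John", "Public", "1950-01-02", "2012-04-06 15:00"),
        (6, "Joe", "Public", "1988-04-12", "2012-04-06 16:52"),
    ],
)

TEMPLATE = """
    SELECT a.phoneNoteId
    FROM phoneCallNotes AS a
    LEFT JOIN phoneCallNotes AS b
           ON b.callerFName = a.callerFName
          AND b.callerLName = a.callerLName
          AND b.callerDOB = a.callerDOB
          AND {newer}
    WHERE a.callerLName = 'Public'
      AND b.phoneNoteId IS NULL
    ORDER BY a.createdOn, a.phoneNoteId
"""

def latest_ids(newer_condition):
    """Return the ids of rows for which no 'newer' row exists for the same caller."""
    return [r[0] for r in conn.execute(TEMPLATE.format(newer=newer_condition))]

# As in the answer: a tie on createdOn keeps both rows.
plain = latest_ids("b.createdOn > a.createdOn")

# Assumed variant: break ties with the higher phoneNoteId.
tie_broken = latest_ids(
    "(b.createdOn > a.createdOn"
    " OR (b.createdOn = a.createdOn AND b.phoneNoteId > a.phoneNoteId))"
)
print(plain, tie_broken)
```

With the plain condition both tied notes for Joe come back; with the tie-break only one row per caller remains.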

Update based on subquery fails

I am trying to do the following update in Oracle 10gR2:
update
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
set t.port_seq = t.new_seq
Voyage_port_id is the primary key, voyage_id is a foreign key. I'm trying to assign a sequence number based on the dates within each voyage.
However, the above fails with ORA-01732: data manipulation operation not legal on this view
What is the problem and how can I avoid it ?
Since you can't update subqueries with row_number, you'll have to calculate the row number in the set part of the update. At first I tried this:
update voyage_port a
set a.port_seq = (
select
row_number() over (partition by voyage_id order by arrival_date)
from voyage_port b
where b.voyage_port_id = a.voyage_port_id
)
But that doesn't work, because the subquery only selects one row, and then the row_number() is always 1. Using another subquery allows a meaningful result:
update voyage_port a
set a.port_seq = (
select c.rn
from (
select
voyage_port_id
, row_number() over (partition by voyage_id
order by arrival_date) as rn
from voyage_port b
) c
where c.voyage_port_id = a.voyage_port_id
)
It works, but more complex than I'd expect for this task.
You can update some views, but there are restrictions, and one is that the view must not contain analytic functions. See SQL Language Reference on UPDATE and search for the first occurrence of "analytic".
This will work, provided no voyage visits more than one port on the same day (or the dates include a time component that makes them unique):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and vp2.arrival_date <= vp.arrival_date
)
I think this handles the case where a voyage visits more than 1 port per day and there is no time component (though the sequence of ports visited on the same day is then arbitrary):
update voyage_port vp
set vp.port_seq =
( select count(*)
from voyage_port vp2
where vp2.voyage_id = vp.voyage_id
and ( vp2.arrival_date < vp.arrival_date
or ( vp2.arrival_date = vp.arrival_date
and vp2.voyage_port_id <= vp.voyage_port_id
)
)
)
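As a sanity check on the counting approach with a tie-break, the sketch below applies it to an in-memory SQLite database from Python and compares the result with ROW_NUMBER(). SQLite (3.25 or later) stands in for Oracle here, and the sample rows are invented; the comparison groups the earlier-date test and the same-date tie-break inside one pair of parentheses, with a strict < on the date so that same-day rows are ordered only by voyage_port_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE voyage_port ("
    "voyage_port_id INTEGER PRIMARY KEY, voyage_id INTEGER, "
    "arrival_date TEXT, port_seq INTEGER)"
)
# Hypothetical rows: voyage 1 reaches ports 10 and 12 on the same day.
conn.executemany(
    "INSERT INTO voyage_port (voyage_port_id, voyage_id, arrival_date) VALUES (?, ?, ?)",
    [
        (10, 1, "2024-01-03"),
        (11, 1, "2024-01-01"),
        (12, 1, "2024-01-03"),
        (20, 2, "2024-02-01"),
        (21, 2, "2024-01-15"),
    ],
)

# port_seq = number of rows of the same voyage that sort at or before this row,
# ordering by arrival_date first and voyage_port_id second.
conn.execute("""
    UPDATE voyage_port
    SET port_seq = (
        SELECT COUNT(*)
        FROM voyage_port vp2
        WHERE vp2.voyage_id = voyage_port.voyage_id
          AND (vp2.arrival_date < voyage_port.arrival_date
               OR (vp2.arrival_date = voyage_port.arrival_date
                   AND vp2.voyage_port_id <= voyage_port.voyage_port_id))
    )
""")
port_seq = dict(conn.execute("SELECT voyage_port_id, port_seq FROM voyage_port"))

# Reference numbering computed with the analytic function.
expected = dict(conn.execute("""
    SELECT voyage_port_id,
           ROW_NUMBER() OVER (PARTITION BY voyage_id
                              ORDER BY arrival_date, voyage_port_id)
    FROM voyage_port
"""))
print(port_seq == expected)
```

The two same-day ports of voyage 1 receive distinct consecutive numbers, and the counts agree with the ROW_NUMBER() numbering for every row.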
Don't think you can update a derived table, I'd rewrite as:
update voyage_port
set port_seq = t.new_seq
from
voyage_port p
inner join
(select voyage_port_id, voyage_id, arrival_date, port_seq,
row_number() over (partition by voyage_id order by arrival_date) as new_seq
from voyage_port) t
on p.voyage_port_id = t.voyage_port_id
The first token after the UPDATE should be the name of the table to update, followed by your columns to update. I'm not sure what you are trying to achieve with the select statement where it is, but you can't legally update the result set of that select.
A version of the sql, guessing what you have in mind, might look like...
update voyage_port t
set t.port_seq = (<select statement that generates new value of port_seq>)
NOTE: to use a select statement to set a value like this you must make sure only 1 row will be returned from the select !
EDIT : modified statement above to reflect what I was trying to explain. The question has been answered very nicely by Andomar above