Understanding this SQL Query - sql

I'm new to oracle database, can some help me understand this query. This query eliminates duplicates from table.
DELETE FROM table_name A
WHERE ROWID > (SELECT min(rowid)
FROM table_name B
WHERE A.key_values = B.key_values);
Any suggestions for improving the query are welcome.
Edit: No this is not homework , what I didn't understand is, what is being done by subquery and what does ROWID > On subquery do ?
This is the Source of the query

Dissecting the actual mechanics:
DELETE FROM table_name A
This is a standard query to delete records from the table named "table_name". Here, it has been aliased as "A" to be referred to in the subquery.
WHERE ROWID >
This places a condition on the deletion, such that for each row encountered, the ROWID must meed a condition of being greater than..
(SELECT min(rowid)
FROM table_name B
WHERE A.key_values = B.key_values)
This is a subquery that is correlated to the main DELETE statement. It uses the value A.key_values from the outside query. So given a record from the DELETE statement, it will run this subquery to find the minimum rowid (internal record id) for all records in the same table (aliased as B now) that bear the same key_values value.
So, to put it together, say you had these rows
rowid | key_values
======= ============
1 A
2 B
3 B
4 C
5 A
6 B
The subquery works out that the min(rowid) for each record based on ALL records with the same key_values is:
rowid | key_values | min(rowid)
======= ============ ===========
1 A 1
2 B 2
3 B 2 **
4 C 4
5 A 1 **
6 B 2 **
For the records marked with **, the condition
WHERE ROWID > { subquery }
becomes true, and they are deleted.
EDIT - additional info
This answer previously stated that ROWID increased by insertion order. That is very untrue. The truth is that rowid is just a file.block.slot-on-block - a physical address.
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:53140678334596
Tom's Followup December 1, 2008 - 6am Central time zone:
it is quite possible that D will be "first" in the table - as it took over A's place.
If rowids always "grew", than space would never be reused (that would be an implication of rowids growing always - we would never be able to reuse old space as the rowid is just a file.block.slot-on-block - a physical address)

Rowid is a pseudo-column that uniquely identifies each row in a table; it is numeric.
This query finds all rows in A where A.key_values = B.key_values and delete all of them but one with the minimal rowid. It's just a way to arbitrarily choose one duplicate to preserve.

Quote AskTom:
A rowid is assigned to a row upon insert and is immutable (never changing)... unless the row
is deleted and re-inserted (meaning it is another row, not the same row!)
The query you provided is relying on that rowid, and deletes all the rows with a rowid value higher than the minimum one on a per key_values basis. Hence, any duplicates are removed.
The subquery you provided is a correlated subquery, because there's a relationship between the table reference in the subquery, and one outside of the subquery.

ROWID is a number that increments for each new row that is inserted. So if you have two ROWID numbers 16 & 24, you know 16 was inserted before 24. Your delete statement is deleting all duplicates, keeping only the first of those duplicates that was inserted. Make sense??

Related

Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise one of the queries which is taking more than 20 minutes to run against 3 Million data.
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate the duplicate, here duplicate means having same name_id and name.
DELETE
FROM records
WHERE id NOT IN
(SELECT DISTINCT
ON (name_id, name) id
FROM records);
I would write your delete using exists logic:
DELETE
FROM records r1
WHERE EXISTS (SELECT 1 FROM records r2
WHERE r2.name_id = r1.name_id AND r2.name = r2.name AND
r2.id < r1.id);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);
You probably already have a primary key on the identity column, then you can use it to exclude redundant rows by id in the following way:
WITH cte AS (
SELECT MIN(id) AS id FROM records GROUP BY name_id, name)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id=records.id)
Even without the index, this should work relatively fast, probably because of merge join strategy.

Get the "most" optimal row in a JOIN

Problem
I have a situation in which I have two tables in which I would like the entries from table 2 (lets call it table_2) to be matched up with the entries in table 1 (table_1) such that there are no duplicates rows of table_2 used in the match up.
Discussion
Specifically, in this case there are datetime stamps in each table (field is utcdatetime). For each row in table_1, I want to find the row in table_2 in which has the closed utcdatetime to the table 1 utcdatetime such that the table2.utcdatetime is older than the table_1 utcdatetime and within 30 minutes of the table 1 utcdatetime. Here is the catch, I do not want any repeats. If a row in table 2 gets gobbled up in a match on an earlier row in table 1, then I do not want it considered for a match later.
This has currently been implemented in a Python routine, but it is slow to iterate over all of the rows in table 1 as it is large. I thought I was there with a single SQL statement, but I found that my current SQL results in duplicate table 2 rows in the output data.
I would recommend using a nested select to get whatever results you're looking for.
For instance:
select *
from person p
where p.name_first = 'SCCJS'
and not exists (select 'x' from person p2 where p2.person_id != p.person_id
and p.name_first = 'SCCJS' and p.name_last = 'SC')

Order by or where clause: which is effecient when retrieving a record from versioned table

This might be a basic sql questions, however I was curious to know the answer to this.
I need to fetch top one record from the db. Which query would be more efficient, one with where clause or order by?
Example:
Table
Movie
id name isPlaying endDate isDeleted
Above is a versioned table for storing records for movie.
If the endDate is not null and isDeleted = 1 then the record is old and an updated one already exist in this table.
So to fetch the movie "Gladiator" which is currently playing, I can write a query in two ways:
1.
Select m.isPlaying
From Movie m
where m.name=:name (given)
and m.endDate is null and m.isDeleted=0
2. Select TOP 1 m.isPlaying
From Movie m
where m.name=:name (given)
order by m.id desc --- This will always give me the active record (one which is not deleted)
Which query is faster and the correct way to do it?
Update:
id is the only indexed column and id is the unique key. I am expecting the queries to return me only one result.
Update:
Examples:
Movie
id name isPlaying EndDate isDeleted
3 Gladiator 1 03/1/2017 1
4 Gladiator 1 03/1/2017 1
5 Gladiator 0 null 0
I would go with the where clause:
Select m.isPlaying
From Movie m
where m.id = :id and m.endDate is null and m.isDeleted = 0;
This can take advantage of an index on (id, isDeleted, endDate).
Also, the two are not equivalent. The second might return multiple rows when the first returns 1. Or the second might return one row when the first returns none.
The first option might return more than 1 row. Maybe you know it won't because you know what data you have stored but the SQL engine doesn't, and it will affect it's execution plan.
Considering that you only have 1 index and it's on the ID column, the 2nd query should be faster in theory, since it would do an index scan from the highest ID with a predicate for the given name, stopping at the first match.
The first query will do a full table scan while comparing column name, endDate and isDeleted, since it won't stop at the first result that matches.
Posting your execution plans for both queries might enlighten a few loose cables.

Do I have to make sure my record only match with one row when using JOIN with DELETE statement?

I have a query like this:
DELETE A
FROM A
LEFT JOIN B ON
A.ID=B.ID
WHERE B.Status='OK'
The records from these join might resulted in more than one rows, for example:
Table A
ID
1
2
Table B
ID Status
1 OK
1 OK
2 OK
Do I have to make sure my record only match with one row? Because in these example, ID 1 will have 2 rows.
Apologize for bad english.
In SQL Server you don't have to ensure a 1 row result per row to delete. The engine will delete all rows from the deleting table that matches your joining or where conditions, even if they are being selected for more than 1 row.
The important part is which table are your deleting, make sure not to delete the wrong one!

How to delete duplicate rows with SQL?

I have a table with some rows in. Every row has a date-field. Right now, it may be duplicates of a date. I need to delete all the duplicates and only store the row with the highest id. How is this possible using a SQL query?
Now:
date id
'07/07' 1
'07/07' 2
'07/07' 3
'07/05' 4
'07/05' 5
What I want:
date id
'07/07' 3
'07/05' 5
DELETE FROM table WHERE id NOT IN
(SELECT MAX(id) FROM table GROUP BY date);
I don't have comment rights, so here's my comment as an answer in case anyone comes across the same problem:
In SQLite3, there is an implicit numerical primary key called "rowid", so the same query would look like this:
DELETE FROM table WHERE rowid NOT IN
(SELECT MAX(rowid) FROM table GROUP BY date);
this will work with any table even if it does not contain a primary key column called "id".
For mysql,postgresql,oracle better way is SELF JOIN.
Postgresql:
DELETE FROM table t1 USING table t2 WHERE t1.date=t2.date AND t1.id<t2.id;
MySQL
DELETE FROM table
USING table, table as vtable
WHERE (table.id < vtable.id)
AND (table.date=vtable.date)
SQL aggregate (max,group by) functions almost always are very slow.