Removing duplicate records using another table's oid - sql

Table 1              Table 2
--------             --------
oid                  oid (J)
sequence             trip_id
stop
trip_update_id (J)
(J) = join
Table 1 and Table 2 are updated every 30 seconds from an API, simultaneously.
At the end of each day Table 1 is about 98% duplicate data, because each feed includes both the new data generated in the last 30 seconds and all data generated in previous feeds from the same day. As a result Table 1 is filled with mostly duplicate rows (the oid is generated automatically on insertion, so every oid is unique).
Table 2 has all unique records, so my question is: what is the SQL to reduce Table 1 to unique records for each trip_id in Table 2?

I'm not quite sure if I understand what the problem is, but here are a few suggestions.
To remove rows from table1 with trip_update_id values not found in table2:
delete from table1
where trip_update_id not in (select trip_id from table2 where trip_id is not null)
(The is not null part is very important if trip_id is allowed to have NULL values!!!)
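If trip_id can contain NULLs and you would rather not worry about it, a NOT EXISTS version avoids the issue entirely; a minimal sketch, assuming the table and column names shown above:

delete from table1
where not exists (select 1
                  from table2
                  where table2.trip_id = table1.trip_update_id)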
To remove duplicate trip_update_id rows from table1, keeping the one with the highest oid:
delete from table1
where oid not in (select max(oid)
                  from table1
                  group by trip_update_id)
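The two steps can also be combined into one statement that keeps only the highest-oid row of each trip that actually exists in table2; a sketch, untested against your actual schema:

delete from table1
where oid not in (select max(t1.oid)
                  from table1 t1
                  join table2 t2 on t2.trip_id = t1.trip_update_id
                  group by t1.trip_update_id)

Note that some DBMSs (MySQL, for example) do not allow a delete on table1 to reference table1 again in a subquery, so the two-step approach may be the safer route.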

Related

I don't understand how to do this task in SQL

There is a table with two fields: Id and Timestamp.
Id is an increasing sequence. Each insertion of a new record into the table leads to the generation of ID(n) = ID(n-1) + 1. Timestamp is a timestamp that, when a record is inserted retroactively, can take any value less than the maximum time of all previous records.
Retroactive insertion is the operation of inserting a record into the table for which
ID(n) > ID(n-1)
Timestamp(n) < max(Timestamp(1), ..., Timestamp(n-1))
Example of a table:
ID | Timestamp
---+-----------
 1 | 2016.09.11
 2 | 2016.09.12
 3 | 2016.09.13
 4 | 2016.09.14
 5 | 2016.09.09
 6 | 2016.09.12
 7 | 2016.09.15
IDs 5 and 6 were inserted retroactively (their timestamps are lower than the maximum timestamp of the earlier records).
I need a query that will return a list of all ids that fit the definition of retroactive insertion. How can I do this?
It can be rephrased as:
Find every entry for which, in the same table, there is an entry with a lesser id (an earlier entry) having a greater timestamp.
It can be achieved using a WHERE EXISTS clause:
SELECT t.id, t.timestamp
FROM tbl t
WHERE EXISTS (
    SELECT 1
    FROM tbl t2
    WHERE t.id > t2.id
      AND t.timestamp < t2.timestamp
);
Fiddle for MySQL. It should work with any DBMS, since it uses standard SQL syntax.
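An equivalent formulation, assuming a DBMS with window function support, compares each row's timestamp with the running maximum over all earlier ids (prev_max is just an illustrative alias):

SELECT id, timestamp
FROM (SELECT id,
             timestamp,
             MAX(timestamp) OVER (ORDER BY id
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING) AS prev_max
      FROM tbl) x
WHERE timestamp < prev_max;

For the first row prev_max is NULL, so the comparison is never true and that row is correctly excluded.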

Optimisation of sql query for deleting duplicate items from large table

Could anyone please help me optimise one of my queries, which takes more than 20 minutes to run against 3 million rows?
Table Structure
-----------------------------------------------------------------------------------------
|id [INT Auto Inc]| name_id (uuid) | name (varchar)| city (varchar) | name_type(varchar)|
-----------------------------------------------------------------------------------------
Query
The purpose of the query is to eliminate duplicates; here, duplicate means having the same name_id and name.
DELETE
FROM records
WHERE id NOT IN (SELECT DISTINCT ON (name_id, name) id
                 FROM records);
I would write your delete using exists logic:
DELETE
FROM records r1
WHERE EXISTS (SELECT 1
              FROM records r2
              WHERE r2.name_id = r1.name_id
                AND r2.name = r1.name
                AND r2.id < r1.id);
This delete query will spare the duplicate having the smallest id value. To speed this up, you may try adding the following index:
CREATE INDEX idx ON records (name_id, name, id);
You probably already have a primary key on the identity column; you can then use it to exclude redundant rows by id in the following way:
WITH cte AS (
    SELECT MIN(id) AS id
    FROM records
    GROUP BY name_id, name)
DELETE FROM records
WHERE NOT EXISTS (SELECT id FROM cte WHERE id = records.id)
Even without the index, this should work relatively fast, probably because of a merge join strategy.
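Another common pattern for this kind of cleanup, assuming PostgreSQL (the DISTINCT ON in the original query suggests it), is to number the rows within each (name_id, name) group and delete everything after the first:

DELETE FROM records
WHERE id IN (SELECT id
             FROM (SELECT id,
                          ROW_NUMBER() OVER (PARTITION BY name_id, name
                                             ORDER BY id) AS rn
                   FROM records) t
             WHERE rn > 1);

Like the EXISTS version, this keeps the row with the smallest id in each group.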

Get the "most" optimal row in a JOIN

Problem
I have a situation in which I have two tables, and I would like the entries from table 2 (let's call it table_2) to be matched up with the entries in table 1 (table_1) such that no duplicate rows of table_2 are used in the match-up.
Discussion
Specifically, in this case there are datetime stamps in each table (the field is utcdatetime). For each row in table_1, I want to find the row in table_2 that has the closest utcdatetime to the table_1 utcdatetime, such that the table_2 utcdatetime is older than the table_1 utcdatetime and within 30 minutes of it. Here is the catch: I do not want any repeats. If a row in table_2 gets gobbled up in a match on an earlier row in table_1, then I do not want it considered for a match later.
This has currently been implemented in a Python routine, but it is slow to iterate over all of the rows in table_1, as the table is large. I thought I was there with a single SQL statement, but I found that my current SQL results in duplicate table_2 rows in the output data.
I would recommend using a nested select to get whatever results you're looking for.
For instance:
select *
from person p
where p.name_first = 'SCCJS'
  and not exists (select 'x'
                  from person p2
                  where p2.person_id != p.person_id
                    and p2.name_first = 'SCCJS'
                    and p2.name_last = 'SC')
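Applied to the tables in the question, the same nested-select idea can pick, for each table_1 row, the closest older table_2 row within 30 minutes. This is only a sketch (PostgreSQL-style interval arithmetic assumed), and it does not by itself prevent one table_2 row from matching several table_1 rows, which was part of the original requirement:

SELECT t1.*, t2.*
FROM table_1 t1
JOIN table_2 t2
  ON t2.utcdatetime <= t1.utcdatetime
 AND t2.utcdatetime >= t1.utcdatetime - INTERVAL '30 minutes'
WHERE NOT EXISTS (SELECT 1
                  FROM table_2 t3
                  WHERE t3.utcdatetime <= t1.utcdatetime
                    AND t3.utcdatetime >= t1.utcdatetime - INTERVAL '30 minutes'
                    AND t3.utcdatetime > t2.utcdatetime);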

Insert data from one table to another using a select statement and avoid duplicate data

Database: Oracle
I want to insert data from table 1 into table 2, but the catch is that the primary key of table 2 is the combination of the first 4 letters and last 4 digits of the primary key of table 1.
For example:
Table 1 - primary key : abcd12349887/abcd22339887/abcder019987
In this case, even though the primary keys of table 1 are different, when I extract the first 4 and last 4 characters the output is the same: abcd9887.
So when I use a select to insert the data, I get a duplicate PK error on table 2.
What I want is: if that PK value is already present, then don't add that record.
Here's my complete stored procedure:
INSERT INTO CPIPRODUCTFAMILIE
(productfamilieid, rapport, mesh, mesh_uitbreiding, productlabelid)
(SELECT DISTINCT (CONCAT(SUBSTR(p.productnummer,1,4),SUBSTR(p.productnummer,8,4)))
productnummer,
ps.rapport, ps.mesh, ps.mesh_uitbreiding, ps.productlabelid
FROM productspecificatie ps, productgroep pg,
product p left join cpiproductfamilie cpf
on (CONCAT(SUBSTR(p.productnummer,1,4),SUBSTR(p.productnummer,8,4))) = cpf.productfamilieid
WHERE p.productnummer = ps.productnummer
AND p.productgroepid = pg.productgroepid
AND cpf.productfamilieid IS NULL
AND pg.productietype = 'P'
**AND p.ROWID IN (SELECT MAX(ROWID) FROM product
GROUP BY (CONCAT(SUBSTR(productnummer,1,4),SUBSTR(productnummer,8,4))))**
AND (CONCAT(SUBSTR(p.productnummer,1,2),SUBSTR(p.productnummer,8,4))) not in
(select productfamilieid from cpiproductfamilie));
The highlighted section (marked with **) seems to be wrong, and because of it the data is not being picked up.
Please help
Try using this.
p.productnummer IN (SELECT MAX(productnummer) FROM product
GROUP BY (CONCAT(SUBSTR(productnummer,1,4),SUBSTR(productnummer,8,4))))
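Another way to express "insert only when the key is not already there" in Oracle is a MERGE. The following is only a sketch built from the tables and columns in your query; the ROW_NUMBER pick of one row per family (ordered by productnummer) stands in for your MAX(ROWID) logic and may need adjusting:

MERGE INTO cpiproductfamilie cpf
USING (SELECT productfamilieid, rapport, mesh, mesh_uitbreiding, productlabelid
       FROM (SELECT CONCAT(SUBSTR(p.productnummer,1,4), SUBSTR(p.productnummer,8,4)) AS productfamilieid,
                    ps.rapport, ps.mesh, ps.mesh_uitbreiding, ps.productlabelid,
                    ROW_NUMBER() OVER (PARTITION BY CONCAT(SUBSTR(p.productnummer,1,4), SUBSTR(p.productnummer,8,4))
                                       ORDER BY p.productnummer DESC) AS rn
             FROM productspecificatie ps
             JOIN product p ON p.productnummer = ps.productnummer
             JOIN productgroep pg ON pg.productgroepid = p.productgroepid
             WHERE pg.productietype = 'P')
       WHERE rn = 1) src
ON (cpf.productfamilieid = src.productfamilieid)
WHEN NOT MATCHED THEN
  INSERT (productfamilieid, rapport, mesh, mesh_uitbreiding, productlabelid)
  VALUES (src.productfamilieid, src.rapport, src.mesh, src.mesh_uitbreiding, src.productlabelid);

The WHEN NOT MATCHED branch only fires for family ids that are not already in cpiproductfamilie, so existing records are left alone.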

Understanding this SQL Query

I'm new to Oracle databases; can someone help me understand this query? It eliminates duplicates from a table.
DELETE FROM table_name A
WHERE ROWID > (SELECT min(rowid)
FROM table_name B
WHERE A.key_values = B.key_values);
Any suggestions for improving the query are welcome.
Edit: No, this is not homework. What I didn't understand is what the subquery does, and what ROWID > applied to the subquery does.
This is the Source of the query
Dissecting the actual mechanics:
DELETE FROM table_name A
This is a standard query to delete records from the table named "table_name". Here, it has been aliased as "A" to be referred to in the subquery.
WHERE ROWID >
This places a condition on the deletion, such that for each row encountered, the ROWID must meet the condition of being greater than...
(SELECT min(rowid)
FROM table_name B
WHERE A.key_values = B.key_values)
This is a subquery that is correlated to the main DELETE statement. It uses the value A.key_values from the outside query. So given a record from the DELETE statement, it will run this subquery to find the minimum rowid (internal record id) for all records in the same table (aliased as B now) that bear the same key_values value.
So, to put it together, say you had these rows
rowid | key_values
======= ============
  1         A
  2         B
  3         B
  4         C
  5         A
  6         B
The subquery works out that the min(rowid) for each record based on ALL records with the same key_values is:
rowid | key_values | min(rowid)
======= ============ ===========
  1         A            1
  2         B            2
  3         B            2     **
  4         C            4
  5         A            1     **
  6         B            2     **
For the records marked with **, the condition
WHERE ROWID > { subquery }
becomes true, and they are deleted.
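To see this in action, here is a small self-contained sketch in Oracle syntax (demo_dupes is a made-up table name):

CREATE TABLE demo_dupes (key_values VARCHAR2(1));

INSERT INTO demo_dupes VALUES ('A');
INSERT INTO demo_dupes VALUES ('B');
INSERT INTO demo_dupes VALUES ('B');
INSERT INTO demo_dupes VALUES ('C');
INSERT INTO demo_dupes VALUES ('A');
INSERT INTO demo_dupes VALUES ('B');

DELETE FROM demo_dupes a
WHERE ROWID > (SELECT MIN(rowid)
               FROM demo_dupes b
               WHERE a.key_values = b.key_values);

SELECT * FROM demo_dupes;   -- one row left for each of A, B, C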
EDIT - additional info
This answer previously stated that ROWID increased by insertion order. That is very untrue. The truth is that rowid is just a file.block.slot-on-block - a physical address.
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:53140678334596
Tom's Followup December 1, 2008 - 6am Central time zone:
it is quite possible that D will be "first" in the table - as it took over A's place.
If rowids always "grew", then space would never be reused (that would be an implication of rowids growing always - we would never be able to reuse old space as the rowid is just a file.block.slot-on-block - a physical address)
Rowid is a pseudo-column that uniquely identifies each row in a table; it encodes the row's physical address rather than being a simple number.
This query finds all rows in A where A.key_values = B.key_values and deletes all of them except the one with the minimal rowid. It's just a way to arbitrarily choose one duplicate to preserve.
Quote AskTom:
A rowid is assigned to a row upon insert and is immutable (never changing)... unless the row
is deleted and re-inserted (meaning it is another row, not the same row!)
The query you provided is relying on that rowid, and deletes all the rows with a rowid value higher than the minimum one on a per key_values basis. Hence, any duplicates are removed.
The subquery you provided is a correlated subquery, because there's a relationship between the table reference in the subquery, and one outside of the subquery.
ROWID is often treated as if it increases with insertion order, though, as noted above, it is really a physical address and that is not guaranteed. What matters for this query is that, within each group of rows sharing the same key_values, only the row with the smallest rowid survives; every other duplicate is deleted. Make sense?
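If you would rather not rely on any notion of rowid order, the same cleanup can be written with an analytic function; a sketch in Oracle syntax that keeps one arbitrary row per key_values:

DELETE FROM table_name
WHERE rowid IN (SELECT rid
                FROM (SELECT rowid AS rid,
                             ROW_NUMBER() OVER (PARTITION BY key_values
                                                ORDER BY rowid) AS rn
                      FROM table_name)
                WHERE rn > 1);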