Delete rows that don't have a reference anymore - sql

I want to delete all rows of the second table that no longer have a reference to the first one, or whose reference is NULL. And that in a single query, if possible.
I thought of an outer join, but that didn't work since I got the missing rows of both tables. I can't do it in code either, because the 'IN' query would become too big for Oracle, and I can't just loop through because that would take too long. That's why I was hoping it could be done with a single query?

In most databases, not exists is a supported solution:
delete from table2
where ref is null or not exists (
    select 1
    from table1
    where table1.id = table2.ref
);
Note that this would be much more simply solved by setting up a proper foreign key between child column table2(ref) and parent table1(id): foreign keys come with useful features such as on delete cascade, which perform such operations for you under the hood.
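As a minimal sketch (assuming table1.id is the primary key; the constraint name fk_table2_ref is just illustrative), the foreign key could be declared like this:
alter table table2
    add constraint fk_table2_ref
    foreign key (ref) references table1 (id)
    on delete cascade;
With that in place, deleting a parent row in table1 automatically removes its child rows from table2, so this cleanup query is no longer needed going forward. (The existing orphaned rows must be cleaned up first, though, or the constraint cannot be created.)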

Delete Statement:
DELETE Table2 FROM Table2
LEFT JOIN Table1 ON Table2.ref = Table1.id
WHERE Table1.id IS NULL;
(Note that this join-style DELETE is MySQL syntax; Oracle does not support joins in a DELETE statement, so for the Oracle case in the question, the NOT EXISTS form above is the way to go.)

Related

Efficiently delete from one table where ID matches another table

I have two tables with a few million records in a PostgreSQL database.
I'm trying to delete rows from one table where ID matches ID of another table. I have used the following command:
delete from table1 where id in (select id from table2)
The above command has been taking a lot of time (a few hours), which got me wondering: is there a faster way to do this operation? Will creating indices help?
I have also tried the delete using a join, as suggested by a few people:
delete from table1 join table2 on table1.id = table2.id
But the above command returned a syntax error. Can this be modified to avoid the error?
Syntax
Your second attempt is not legal DELETE syntax in PostgreSQL. This is:
DELETE FROM table1 t1
USING table2 t2
WHERE t2.id = t1.id;
See the "Notes" section in the manual for the DELETE command:
PostgreSQL lets you reference columns of other tables in the WHERE condition by specifying the other tables in the USING clause. For example,
[...]
This syntax is not standard.
[...]
In some cases the join style is easier to write or faster to execute than the sub-select style.
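For comparison, here is a sketch of the sub-select style the manual refers to, written with EXISTS (semantically equivalent to the USING form above):
DELETE FROM table1 t1
WHERE EXISTS (
   SELECT 1
   FROM   table2 t2
   WHERE  t2.id = t1.id
   );
Which of the two executes faster depends on the plan Postgres chooses for your data.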
Index
Will creating indices help?
The usefulness of indexes always depends on the complete situation. If table1 is big, and much bigger than table2, an index on table1.id should typically help. Typically, id would be your PRIMARY KEY, which is indexed implicitly anyway ...
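If id is not already covered by a PRIMARY KEY or unique constraint, creating the index is a one-liner (a sketch; the index name is illustrative):
CREATE INDEX table1_id_idx ON table1 (id);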
Also typically, an index on table2 would not help (and would not be used even if it existed).
But like I said: it depends on the complete situation, and you have disclosed precious little.
Other details of your setup might make the deletes expensive: FK constraints, triggers, indexes, locks held by concurrent transactions, table and index bloat ...
Or non-unique rows in table2. (But I would assume id to be unique?) In that case you would first extract a unique set of IDs from table2. Depending on cardinalities, a simple DISTINCT or more sophisticated query techniques would be in order ...
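A minimal sketch of that last idea, deduplicating table2.id in a derived table and combining it with the USING syntax from above:
DELETE FROM table1 t1
USING (SELECT DISTINCT id FROM table2) t2
WHERE  t2.id = t1.id;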

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The query below does not return distinct values on a Redshift database, but it works as expected when tested against any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug, and the behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference here is that when you run a SELECT DISTINCT query against a column that doesn't have a primary key declared, Redshift scans the whole column and returns unique values; but when you run the same query against a column that has a primary key constraint, it simply returns the output without performing the unique-list filtering, trusting the (unenforced) constraint. That is how duplicate entries can reach your result if you inserted them.
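A small sketch of that behavior (table and column names here are made up):
-- Redshift accepts the constraint but does not enforce it:
CREATE TABLE demo (id INT PRIMARY KEY, val VARCHAR(10));
INSERT INTO demo VALUES (1, 'a');
INSERT INTO demo VALUES (1, 'b');  -- succeeds; the duplicate id is allowed
-- SELECT DISTINCT id FROM demo may now return 1 twice, because the
-- planner trusts the declared primary key and skips the unique filtering.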
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if you don't need to check constraint validity for every row you copy or insert. You can still declare a primary key constraint as part of your data model, but then you need to uphold it yourself, by removing duplicates or by designing your ETL so that none are introduced.
There is more information, with specific examples, in this Heap blog post: Redshift Pitfalls And How To Avoid Them
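If duplicates have already crept in, one common cleanup pattern is to rebuild the table, keeping a single row per key with a window function. A hedged sketch (the target table name and the column list are illustrative, based on the query in the question):
CREATE TABLE testsegment_dedup AS
SELECT uri, name, value_string
FROM (
    SELECT uri, name, value_string,
           ROW_NUMBER() OVER (PARTITION BY uri, name, value_string
                              ORDER BY uri) AS rn
    FROM test_organization__app__testsegment
) t
WHERE rn = 1;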
Perhaps you can solve this by using appropriate joins.
For example, say I have duplicate values in table1 and I want values of table1 by joining it to table2, where the join condition encodes whatever logic fits your situation. Then I can do something like this:
SELECT DISTINCT table1.col1
FROM table1
LEFT OUTER JOIN table2 ON table1.col1 = table2.col1
This worked very well for me: I got unique values from table1 and could remove the duplicates.

That was not the right table: Access wiped the wrong data

I... don't quite know if I have the right idea about Access here.
I wrote the following to grab some data that existed in two places:
Select TableOne.*
from TableOne inner join TableTwo
on TableOne.[LINK] = TableTwo.[LINK]
Now, my interpretation of this is:
Find the table "TableOne"
Match the LINK field to the corresponding field in the table "TableTwo"
Show only records from TableOne that have a matching record in TableTwo
Just to make sure, I ran the query with some sample tables in SSMS, and it worked as expected.
So why, when I deleted the rows from within that query, did it delete the rows from TableTwo, and NOT from TableOne as expected? I've just lost ~3 days of work.
Edit: For clarity, I manually selected the rows in the query window and deleted them. I did not use a delete query - I've been stung by that a couple of times lately.
Since you deleted the records manually, your query has to be updateable. This means that your query couldn't have been solely a Cartesian join or a join without referential integrity, since those queries are non-updateable in MS Access.
When I recreate your query based on two fields without indexes or primary keys, I am not even able to manually delete records. This leads me to believe a relationship was unknowingly established that deleted the records in TableTwo. Perhaps you should take a look at the design view of your queries and the Relationships window, since the query itself should indeed select only records from TableOne.
Not sure why it got deleted, but I suggest rewriting your query:
DELETE FROM TableOne
WHERE LINK IN (SELECT LINK FROM TableTwo)
This should work for you (SQL Server style; note that once the table is aliased, the DELETE target must be the alias, and in Access the equivalent form is DELETE a.*):
DELETE a
FROM TableOne a
INNER JOIN TableTwo b ON b.[LINK] = a.[LINK]
    AND [my filter condition]

Check for complete duplicate rows in a large table

My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records
I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.
I was told in my last question that I can do an EXISTS query to determine duplicates. How would I do that?
I am running PostgreSQL 8.1.22. (Got this info by running select version()).
To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:
SELECT EXISTS (
   SELECT 1
   FROM   tbl t
   NATURAL JOIN tbl t1
   WHERE  t.ctid <> t1.ctid
   );
NATURAL JOIN is a very convenient shorthand for this case because (quoting the manual here):
NATURAL is shorthand for a USING list that mentions all columns in the
two tables that have the same names.
EXISTS is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.
Be aware that NULL is never considered identical to another NULL. If you have NULL values and consider them identical, you'd have to do more.
ctid is a system column that can be (ab-)used as an ad-hoc primary key, but it cannot replace an actual user-defined primary key in the long run.
The outdated version 8.1 seems to have no <> operator defined for a ctid. Try casting to text:
SELECT EXISTS (
   SELECT 1
   FROM   tbl t
   NATURAL JOIN tbl t1
   WHERE  t.ctid::text <> t1.ctid::text
   );
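Regarding the NULL caveat above: if NULLs should count as identical, one option is to spell out the join condition with IS NOT DISTINCT FROM, which treats two NULLs as equal. A sketch with two illustrative columns (col1 and col2 stand in for your actual column list; verify support on your Postgres version, since 8.1 is very old):
SELECT EXISTS (
   SELECT 1
   FROM   tbl t
   JOIN   tbl t1 ON  t.col1 IS NOT DISTINCT FROM t1.col1
                 AND t.col2 IS NOT DISTINCT FROM t1.col2
   WHERE  t.ctid::text <> t1.ctid::text
   );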
Shouldn't something like this do the job?
SELECT ALL_COLUMNS [except unique ID],
       count(*) AS dupl
FROM   tbl
GROUP  BY ALL_COLUMNS [except unique ID]
HAVING count(*) > 1;
Note that the aggregate filter belongs in a HAVING clause after GROUP BY, not in WHERE. I'm not sure it's the most efficient way, but a count greater than 1 means you have at least two identical rows.

Access: DELETE query with WHERE <> clause: "At most one record can be returned by this subquery"

I want to run this DELETE query:
DELETE * FROM Table1 WHERE Table1.ID <> (SELECT Table1.ID FROM Table1 WHERE ....)
The query in brackets returns all the IDs I want to keep in Table1. (This query works on its own, I tested it.) But as soon as I add the DELETE part, I get the following error: "At most one record can be returned by this subquery". I tried the code
DELETE * FROM Table1 WHERE Table1.ID NOT IN (SELECT Table1.ID FROM Table1 WHERE ....)
But now my database hangs and doesn't do anything anymore...
Thank you for your help!
Actually, the * is not necessary in a DELETE statement, because DELETE removes the entire row that matches the WHERE condition.
Generally, <> (not equal to) is used to compare against a single static value, like below:
DELETE FROM Table1 WHERE Table1.ID <> 1
But the subquery returns multiple rows/records (with the selected columns), while <> expects exactly one value. Hence you got the error: At most one record can be returned by this subquery.
Coming to the NOT IN clause you are using: it is a valid way to do it, but NOT IN can become pretty expensive when the subquery returns many rows, as NOT IN effectively performs a join under the hood.
The best way in your case is to use NOT(condition), since you already know the condition that selects the IDs you want to keep (like below):
DELETE FROM Table1 WHERE NOT(condition)
This gets the job done considerably faster, as no join is involved as in the earlier cases.
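As a purely hypothetical illustration (the Status column and the 'Keep' value are made up, since the original keep-condition was not shown in the question), the inverted form would look like:
DELETE FROM Table1 WHERE NOT (Table1.Status = 'Keep')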