Remove rows NOT referenced by a foreign key - sql

This is somewhat related to this question:
I have a table with a primary key, and I have several tables that reference that primary key (using foreign keys). I need to remove rows from that table, where the primary key isn't being referenced in any of those other tables (as well as a few other constraints).
For example:
Group
groupid | groupname
1 | 'group 1'
2 | 'group 3'
3 | 'group 2'
... | '...'
Table1
tableid | groupid | data
1 | 3 | ...
... | ... | ...
Table2
tableid | groupid | data
1 | 2 | ...
... | ... | ...
and so on. Some of the rows in Group aren't referenced in any of the tables, and I need to remove those rows. In addition to this, I need to know how to find all of the tables/rows that reference a given row in Group.
I know that I can just query every table and check the groupid values, but since they are foreign keys, I imagine that there is a better way of doing it.
This is using PostgreSQL 8.3, by the way.

DELETE
FROM "group" g  -- group is a reserved word, so the name has to be quoted
WHERE NOT EXISTS
(
    SELECT NULL
    FROM table1 t1
    WHERE t1.groupid = g.groupid
    UNION ALL
    SELECT NULL
    FROM table2 t2
    WHERE t2.groupid = g.groupid
    UNION ALL
    …
)

At the heart of it, SQL servers don't maintain two-way information for constraints, so your only option is to do what the server would do internally if you were to delete the row: check every referencing table.
If (and be damn sure first) your constraints are simple checks and don't carry any "on delete cascade" type clauses, you can attempt to delete every row from your group table, one row per statement. A single DELETE over the whole table fails as a unit on the first violation, so each row needs its own statement (and its own transaction or savepoint). Any row that does delete had nothing referencing it. Otherwise, you're stuck with Quassnoi's answer.
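For what it's worth, the "check every other table" step can be driven from the catalog instead of being typed out by hand. Below is a minimal sketch of that idea. It uses Python's sqlite3 module and SQLite's PRAGMA foreign_key_list rather than PostgreSQL 8.3, purely so that it runs standalone; on PostgreSQL the same list would have to come from the system catalogs. The table is called groups here to sidestep the reserved word, the row 'group 4' is made-up sample data, and the helper names referencing_columns and delete_unreferenced are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE groups (groupid INTEGER PRIMARY KEY, groupname TEXT);
    CREATE TABLE table1 (tableid INTEGER PRIMARY KEY,
                         groupid INTEGER REFERENCES groups(groupid),
                         data TEXT);
    CREATE TABLE table2 (tableid INTEGER PRIMARY KEY,
                         groupid INTEGER REFERENCES groups(groupid),
                         data TEXT);
    -- sample data; 'group 4' is invented so that two rows are unreferenced
    INSERT INTO groups VALUES (1, 'group 1'), (2, 'group 3'),
                              (3, 'group 2'), (4, 'group 4');
    INSERT INTO table1 VALUES (1, 3, 'x');
    INSERT INTO table2 VALUES (1, 2, 'y');
""")


def referencing_columns(conn, parent):
    """Return sorted (table, column) pairs whose foreign keys point at parent."""
    refs = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Columns of each row: id, seq, table, from, to, on_update, on_delete, match
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            if fk[2].lower() == parent.lower():
                refs.append((table, fk[3]))
    return sorted(refs)


def delete_unreferenced(conn, parent, key):
    """Delete parent rows that no referencing table points at; return the count."""
    refs = referencing_columns(conn, parent)
    if not refs:
        raise ValueError(f"no foreign keys reference {parent}")
    # One NOT EXISTS per referencing column, all of which must hold.
    conditions = " AND ".join(
        f'NOT EXISTS (SELECT 1 FROM "{t}" WHERE "{t}"."{c}" = "{parent}"."{key}")'
        for t, c in refs)
    return conn.execute(f'DELETE FROM "{parent}" WHERE {conditions}').rowcount


deleted = delete_unreferenced(conn, "groups", "groupid")
remaining = [row[0] for row in conn.execute(
    "SELECT groupid FROM groups ORDER BY groupid")]
```

The same referencing_columns list also covers the second half of the question (finding which tables reference a given group): loop over the pairs and select the rows where the column equals the groupid you care about.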

Related

SQL Table Relations

We have lots of tables in MS SQL that were created without table relations many years ago. Now we are trying to create relationships between these tables. The problem is that in many of them the developers used fake IDs.
For example:
TABLE A, ID (primary key) -> TABLE B, AID needs to be a foreign key relationship. But the developers used fake IDs like -1 and -2 to solve some problems on their side. Now, when I try to create a relation between TABLE A, ID (primary key) -> TABLE B, AID, I get errors.
TABLE A
ID | NAME
1 | name01
2 | name02
TABLE B
ID | NAME | AID
1 | name01 | 1
2 | name02 | -1
3 | name03 | -2
Is there a way to solve this problem, and is what the developers did meaningful? They didn't use any relations in SQL; they control everything in the code-behind.
Thanks
You need to add those to your reference table. Something like this:
insert into a (id, name)
    select distinct aid, 'Automatically Generated'
    from b
    where not exists (select 1 from a where b.aid = a.id) and
          b.aid is not null;
Then you can add the foreign key relationship:
alter table b add constraint fk_b_aid foreign key (aid) references a(id);
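A quick way to convince yourself that the placeholder rows close the gap is to count the orphaned references before and after. The sketch below does that with Python's sqlite3 module instead of MS SQL, only because it runs standalone. SQLite cannot add a foreign key with ALTER TABLE, so the constraint is declared up front with enforcement switched off, and PRAGMA foreign_key_check stands in for the error the ALTER TABLE would raise. The tables mirror the example in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = OFF;  -- let the legacy "fake id" rows in
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT,
                    aid INTEGER REFERENCES a(id));
    INSERT INTO a VALUES (1, 'name01'), (2, 'name02');
    INSERT INTO b VALUES (1, 'name01', 1), (2, 'name02', -1), (3, 'name03', -2);
""")

# Each result row describes one row of b whose aid has no match in a.
orphans_before = conn.execute("PRAGMA foreign_key_check(b)").fetchall()

# Create a placeholder parent for every fake id that b refers to.
conn.execute("""
    INSERT INTO a (id, name)
    SELECT DISTINCT aid, 'Automatically Generated'
    FROM b
    WHERE aid IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM a WHERE a.id = b.aid)
""")

orphans_after = conn.execute("PRAGMA foreign_key_check(b)").fetchall()
parent_ids = [row[0] for row in conn.execute("SELECT id FROM a ORDER BY id")]
```

Once orphans_after is empty, adding the real constraint on the production database should no longer complain about existing data.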
The general idea of referential integrity is exactly that you can't have invalid references.
So the best course of action here would be to suck it up and manually clean it up. Create the missing entries in the other table, or delete the records.
You can also ignore checks on existing data. If you are using SQL Server Management Studio to create relations, there is an option to do that; in T-SQL the equivalent is adding the constraint WITH NOCHECK.
Hope it helps

Is chaining rows in the same table a bad pattern?

I want to create a tree structure of categories and need to find a proper way to store it in the database. Think of the following animal tree, which pretty accurately describes how it should look:
My question now is whether chaining those entries within the same table is a good idea or not. SQLite doesn't allow me to add a FOREIGN KEY constraint to a value in the same table, so I have to make sure manually that I don't create inconsistencies. This is what I currently plan to have:
id | parent | name
---+--------+--------
1 | null | Animal
2 | 1 | Reptile
3 | 2 | Lizard
4 | 1 | Mammal
5 | 4 | Equine
6 | 4 | Bovine
parent references an id in the same table, going up all the way until null is found, which is the root. Is this a bad pattern? And if so, what are common alternatives for putting a tree structure into a relational database?
If your version of SQLite supports recursive CTE, then this is one option:
WITH RECURSIVE cte (n) AS (
    SELECT id FROM yourTable WHERE parent IS NULL
    UNION ALL
    SELECT t1.id
    FROM yourTable t1
    INNER JOIN cte t2
        ON t1.parent = t2.n AND t1.name NOT LIKE '%Lizard%'
)
SELECT *
FROM yourTable
WHERE id IN (SELECT n FROM cte);
This is untested, but the check on t1.name in the recursive portion of the above CTE (hopefully) should stop the recursion as soon as it reaches a record whose name matches the LIKE expression. In the case of searching for Lizard, the recursion should stop one level above Lizard, meaning that every record reachable from the root is returned except Lizard and anything below it.
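As a runnable illustration of the parent-column layout from the question, here is a small sketch using Python's sqlite3 module. It relies on two things worth checking against your own build: SQLite does accept a foreign key that points back at the same table once PRAGMA foreign_keys is switched on, and recursive CTEs need SQLite 3.8.3 or later. The table name category and the helper ancestors are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
conn.executescript("""
    CREATE TABLE category (
        id     INTEGER PRIMARY KEY,
        parent INTEGER REFERENCES category(id),  -- NULL marks the root
        name   TEXT NOT NULL
    );
    INSERT INTO category VALUES
        (1, NULL, 'Animal'), (2, 1, 'Reptile'), (3, 2, 'Lizard'),
        (4, 1, 'Mammal'),    (5, 4, 'Equine'),  (6, 4, 'Bovine');
""")

# A parent id that does not exist is rejected by the self-referencing key.
try:
    conn.execute("INSERT INTO category VALUES (7, 99, 'Orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True


def ancestors(conn, name):
    """Return the chain of names from the given node up to the root."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, parent, name, depth) AS (
            SELECT id, parent, name, 0 FROM category WHERE name = ?
            UNION ALL
            SELECT c.id, c.parent, c.name, chain.depth + 1
            FROM category c
            JOIN chain ON c.id = chain.parent
        )
        SELECT name FROM chain ORDER BY depth
    """, (name,))
    return [row[0] for row in rows]


lizard_path = ancestors(conn, "Lizard")
```

Walking upward like this is where the adjacency list is most convenient; each step is one lookup on the primary key.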

How to get sum of values per id and update existing records in other table

I have two tables like:
ID | TRAFFIC
fd56756 | 4398
645effa | 567899
894fac6 | 611900
894fac6 | 567899
and
USER | ID | TRAFFIC
andrew | fd56756 | 0
peter | 645effa | 0
john | 894fac6 | 0
I need to get SUM("TRAFFIC") from the first table and write it to the TRAFFIC column of the second table where the first table's ID equals the second table's ID. IDs in the first table are not unique and can be duplicated.
How can I do this?
Table names from your later comment. Chances are, you are reporting table and column names incorrectly.
UPDATE users u
SET    "TRAFFIC" = sub.sum_traffic
FROM  (
    SELECT "ID", sum("TRAFFIC") AS sum_traffic
    FROM   stats.traffic
    GROUP  BY 1
    ) sub
WHERE  u."ID" = sub."ID";
Aside: It's unwise to use mixed-case identifiers in Postgres. Use legal, lower-case identifiers, which do not need to be double-quoted, to make your life easier. Start by reading the manual here.
Something like this?
UPDATE users t2 SET traffic = t1.sum_traffic FROM
    (SELECT id, sum(traffic) AS sum_traffic FROM stats.traffic GROUP BY id) t1
WHERE t1.id = t2.id;
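To check the arithmetic against the sample rows, here is a small sketch using Python's sqlite3 module. It uses a correlated subquery rather than UPDATE ... FROM, since older SQLite versions lack the latter; the result should match what the Postgres statements above aim for. The unqualified table name traffic and the lower-case column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traffic (id TEXT, traffic INTEGER);
    CREATE TABLE users (user TEXT, id TEXT, traffic INTEGER);
    INSERT INTO traffic VALUES
        ('fd56756', 4398), ('645effa', 567899),
        ('894fac6', 611900), ('894fac6', 567899);
    INSERT INTO users VALUES
        ('andrew', 'fd56756', 0), ('peter', '645effa', 0), ('john', '894fac6', 0);
""")

# Sum the traffic rows per id and copy the total onto the matching user.
# The EXISTS guard leaves users without any traffic rows untouched
# instead of overwriting their value with NULL.
conn.execute("""
    UPDATE users
    SET traffic = (SELECT SUM(t.traffic) FROM traffic t WHERE t.id = users.id)
    WHERE EXISTS (SELECT 1 FROM traffic t WHERE t.id = users.id)
""")

totals = dict(conn.execute("SELECT user, traffic FROM users"))
```

For john, the two rows with ID 894fac6 are added together (611900 + 567899).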

Recursively duplicating entries

I am attempting to duplicate an entry. That part isn't hard. The tricky part is: there are n entries connected with a foreign key. And for each of those entries, there are n entries connected to that. I did it manually using a lookup to duplicate and cross reference the foreign keys.
Is there some subroutine or method to duplicate an entry and search for and duplicate the entries that reference it? Perhaps there is a name for this type of replication that I haven't stumbled on yet. Is there a specific database-related term for this type of operation?
PostgreSQL 8.4.13
main entry (uid is serial)
uid | title
-----+-------
1 | stuff
department (departmentid is serial, uidref is foreign key for uid above)
departmentid | uidref | title
--------------+--------+-------
100 | 1 | Foo
101 | 1 | Bar
sub_category of department (textid is serial, departmentref is foreign for departmentid above)
textid | departmentref | title
-------+---------------+----------------
1000 | 100 | Text for Foo 1
1001 | 100 | Text for Foo 2
1002 | 101 | Text for Bar 1
You can do it all in a single statement using data-modifying CTEs (requires Postgres 9.1 or later).
Your primary keys being serial columns makes it easier:
WITH m AS (
INSERT INTO main (<all columns except pk>)
SELECT <all columns except pk>
FROM main
WHERE uid = 1
RETURNING uid AS uidref -- returns new uid
)
, d AS (
INSERT INTO department (<all columns except pk>)
SELECT <all columns except pk>
FROM m
JOIN department d USING (uidref)
RETURNING departmentid AS departmentref -- returns new departmentids
)
INSERT INTO sub_category (<all columns except pk>)
SELECT <all columns except pk>
FROM d
JOIN sub_category s USING (departmentref);
Replace <all columns except pk> with your actual columns. pk is for primary key, like main.uid.
The query returns nothing. You can return pretty much anything. You just didn't specify anything.
You wouldn't call that "replication". That term usually applies to keeping multiple database instances or objects in sync. You are just duplicating an entry and its dependent objects, recursively.
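If data-modifying CTEs are not available (the question mentions 8.4), the same duplication can be done step by step from client code by remembering which new key each old key maps to. The following is a rough sketch of that approach using Python's sqlite3 module so it runs standalone; the schema follows the question, and the function name duplicate_entry is hypothetical. A real version would list all non-key columns, not just title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (uid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE department (
        departmentid INTEGER PRIMARY KEY,
        uidref INTEGER REFERENCES main(uid),
        title TEXT);
    CREATE TABLE sub_category (
        textid INTEGER PRIMARY KEY,
        departmentref INTEGER REFERENCES department(departmentid),
        title TEXT);
    INSERT INTO main VALUES (1, 'stuff');
    INSERT INTO department VALUES (100, 1, 'Foo'), (101, 1, 'Bar');
    INSERT INTO sub_category VALUES
        (1000, 100, 'Text for Foo 1'),
        (1001, 100, 'Text for Foo 2'),
        (1002, 101, 'Text for Bar 1');
""")


def duplicate_entry(conn, uid):
    """Copy one main row plus its departments and their sub_category rows."""
    row = conn.execute("SELECT title FROM main WHERE uid = ?", (uid,)).fetchone()
    if row is None:
        raise KeyError(uid)
    new_uid = conn.execute("INSERT INTO main (title) VALUES (?)", row).lastrowid

    departments = conn.execute(
        "SELECT departmentid, title FROM department "
        "WHERE uidref = ? ORDER BY departmentid", (uid,)).fetchall()
    for old_dep, title in departments:
        # The new key is needed so the children can point at the copy.
        new_dep = conn.execute(
            "INSERT INTO department (uidref, title) VALUES (?, ?)",
            (new_uid, title)).lastrowid
        conn.execute(
            "INSERT INTO sub_category (departmentref, title) "
            "SELECT ?, title FROM sub_category WHERE departmentref = ?",
            (new_dep, old_dep))
    return new_uid


new_uid = duplicate_entry(conn, 1)
copied_texts = [row[0] for row in conn.execute("""
    SELECT s.title
    FROM department d
    JOIN sub_category s ON s.departmentref = d.departmentid
    WHERE d.uidref = ?
    ORDER BY s.title
""", (new_uid,))]
```

Wrapping the whole call in one transaction keeps a half-finished copy from being left behind if one of the inserts fails.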
Aside about naming conventions:
It would get even simpler with a naming convention that labels all columns signifying "ID of table foo" with the same (descriptive) name, like foo_id. There are other naming conventions floating around, but this is the best for writing queries, IMO.

Speeding up this big JOIN

EDIT: there was a mistake in the following question that explains the observations. I could delete the question but this might still be useful to someone. The mistake was that the actual query running on the server was SELECT * FROM t (which was silly) when I thought it was running SELECT t.* FROM t (which makes all the difference). See tobyobrian's answer and the comments to it.
I have a query that is too slow in a situation with the following schema. Table t has data rows indexed by t_id. t adjoins tables x and y via junction tables t_x and t_y, each of which contains only the foreign keys required for the JOINs:
CREATE TABLE t (
t_id INT NOT NULL PRIMARY KEY,
data columns...
);
CREATE TABLE t_x (
t_id INT NOT NULL,
x_id INT NOT NULL,
PRIMARY KEY (t_id, x_id),
KEY (x_id)
);
CREATE TABLE t_y (
t_id INT NOT NULL,
y_id INT NOT NULL,
PRIMARY KEY (t_id, y_id),
KEY (y_id)
);
I need to export the stray rows in t, i.e. those not referenced in either junction table.
SELECT t.* FROM t
LEFT JOIN t_x ON t_x.t_id=t.t_id
LEFT JOIN t_y ON t_y.t_id=t.t_id
WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL
INTO OUTFILE ...;
t has 21 M rows while t_x and t_y both have about 25 M rows. So this is naturally going to be a slow query.
I'm using MyISAM so I thought I'd try to speed it up by preloading the t_x and t_y indexes. The combined size of t_x.MYI and t_y.MYI was about 1.2 M bytes so I created a dedicated key buffer for them, assigned their PRIMARY keys to the dedicated buffer and LOAD INDEX INTO CACHE'ed them.
But as I watch the query in operation, mysqld is using about 1% CPU, the average system IO pending queue length is around 5, and mysqld's average seek size is in the 250 k range. Moreover, nearly all the IO is mysqld reading from t_x.MYI and t_x.MYD.
I don't understand:
Why is mysqld reading the .MYD files at all?
Why isn't mysqld using the preloaded t_x and t_y indexes?
Could it have something to do with the t_x and t_y PRIMARY keys being over two columns?
EDIT: The query explained:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 20980052 | |
| 1 | SIMPLE | t_x | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 235849 | Using index |
| 1 | SIMPLE | t_y | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 207947 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
Use NOT EXISTS. It is usually the fastest option, much better than joins or NOT IN in this situation.
SELECT a.* FROM t a
WHERE NOT EXISTS (SELECT 1 FROM t_x b
                  WHERE b.t_id = a.t_id)
   OR NOT EXISTS (SELECT 1 FROM t_y c
                  WHERE c.t_id = a.t_id);
I can answer part 1 of your question, and I may or may not be able to answer part two if you post the output of EXPLAIN:
In order to select t.*, it needs to look in the MYD file. Only the primary key is in the index; to fetch the data columns you requested, it needs the rest of the row.
That is, your query is quite probably filtering the results very quickly; it's just struggling to copy all the data you wanted.
Also note that you will probably have duplicates in your output: if one row has no refs in t_x but 3 in t_y, you will have the same t.* repeated 3 times. Given we think the WHERE clause is sufficiently efficient and much of the time is spent reading the actual data, this is quite possibly the source of your problems. Try changing to SELECT DISTINCT and see if that helps your efficiency.
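The duplicate effect is easy to reproduce on a toy data set. The sketch below uses Python's sqlite3 module rather than MySQL/MyISAM, so it says nothing about the performance side; it only compares which rows the LEFT JOIN form and the NOT EXISTS form return. The three sample rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (t_id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE t_x (t_id INTEGER NOT NULL, x_id INTEGER NOT NULL,
                      PRIMARY KEY (t_id, x_id));
    CREATE TABLE t_y (t_id INTEGER NOT NULL, y_id INTEGER NOT NULL,
                      PRIMARY KEY (t_id, y_id));
    INSERT INTO t VALUES (1, 'no refs'), (2, 'only in t_y'), (3, 'in both');
    INSERT INTO t_x VALUES (3, 20);
    INSERT INTO t_y VALUES (2, 10), (2, 11), (2, 12), (3, 10);
""")

# LEFT JOIN form from the question: row 2 matches three t_y rows,
# so it comes back three times.
left_join_ids = [row[0] for row in conn.execute("""
    SELECT t.t_id FROM t
    LEFT JOIN t_x ON t_x.t_id = t.t_id
    LEFT JOIN t_y ON t_y.t_id = t.t_id
    WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL
    ORDER BY t.t_id
""")]

# NOT EXISTS form: each qualifying row of t appears exactly once.
not_exists_ids = [row[0] for row in conn.execute("""
    SELECT a.t_id FROM t a
    WHERE NOT EXISTS (SELECT 1 FROM t_x b WHERE b.t_id = a.t_id)
       OR NOT EXISTS (SELECT 1 FROM t_y c WHERE c.t_id = a.t_id)
    ORDER BY a.t_id
""")]
```

With 25 M rows in each junction table, that kind of fan-out could multiply the amount of data written to the outfile.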
This may be a bit more efficient:
SELECT *
FROM t
WHERE t.t_id NOT IN (
    SELECT DISTINCT t_id
    FROM t_x
    UNION
    SELECT DISTINCT t_id
    FROM t_y
);