Making combinations of attributes unique in PostgreSQL - sql

PostgreSQL lets us declare a constraint that makes a combination of several attributes of one table unique:
UNIQUE (A, B, C)
Is it possible to take attributes from multiple tables and make their combination unique in some way?
Edit:
Table 1: List of Book
Attributes: ID, Title, Year, Publisher
Table 2: List of Author
Attributes: Name, ID
Table 3: Written By: Relation between Book and Author
Attributes: Book_ID, Author_ID
Now I have a situation where I don't want the combination (Title, Year, Publisher, Authors) to be repeated anywhere in my database.

There are 4 possible solutions to this problem:
1. Add a column "authorID" to the table "book" as a foreign key. You can then add the UNIQUE constraint to the table "book".
2. Add a foreign key on the 2 columns (bookID, authorID) which references the table bookAuthor.
3. Create a trigger on insert on the table "book" which checks whether the combination exists and does not insert the row if it does. You will find a working example of this option below.
Whilst working on this option I realised that the JOIN to WrittenBy must be done on Title and not on ID. Otherwise we can record the same book as many times as we like just by using a new ID. The problem with using the title is that the slightest change in spelling or punctuation means that it is treated as a new title.
In the example the 3rd insert fails because the book already exists. In the 4th I have left 2 spaces in "Tom  Sawyer" and it is accepted as a different title.
Also, as we use a join to find the author, the real effect of our rule is exactly the same as if we had a UNIQUE constraint on the table Book on the columns Title, Year and Publisher. This means that all that I have coded is a waste of time.
We thus decide, after coding it, that this option is not effective.
4. Create a fourth table with the 4 columns and a UNIQUE constraint on all 4. This seems a heavy solution compared to option 1.
CREATE TABLE Book (
ID int primary key,
Title varchar(25),
Year int,
Publisher varchar(10) );
CREATE TABLE Author (
ID int primary key,
Name varchar(10)
);
CREATE TABLE WrittenBy(
Book_ID int primary key,
Titlew varchar(25),
Author_ID int
);
CREATE FUNCTION book_insert_trigger_function()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS $$
DECLARE
    authID INTEGER;
    coun INTEGER;
BEGIN
    -- only act on direct inserts, not on inserts fired from other triggers
    IF pg_trigger_depth() <> 1 THEN
        RETURN NEW;
    END IF;
    -- author recorded for this title (NULL if the title is not in WrittenBy)
    SELECT MAX(Author_ID) INTO authID
    FROM WrittenBy w
    WHERE w.Titlew = NEW.Title;
    -- count existing books with the same title, year, publisher and author
    SELECT COUNT(*) INTO coun
    FROM Book b
    LEFT JOIN WrittenBy w ON b.Title = w.Titlew
    WHERE NEW.year = b.year
      AND NEW.title = b.title
      AND NEW.publisher = b.publisher
      AND authID = COALESCE(w.Author_ID, authID);
    IF coun > 0 THEN
        RETURN NULL; -- this means that we do not insert
    ELSE
        RETURN NEW;
    END IF;
END;
$$
;
CREATE TRIGGER book_insert_trigger
BEFORE INSERT
ON Book
FOR EACH ROW
EXECUTE PROCEDURE book_insert_trigger_function();
INSERT INTO WrittenBy VALUES
(1,'Tom Sawyer',1),
(2,'Huckleberry Finn',1);
INSERT INTO Book VALUES (1,'Tom Sawyer',1950,'Classics');
INSERT INTO Book VALUES (2,'Huckleberry Finn',1950,'Classics');
INSERT INTO Book VALUES (3,'Tom Sawyer',1950,'Classics');
INSERT INTO Book VALUES (3,'Tom  Sawyer',1950,'Classics'); -- 2 spaces in the title
SELECT *
FROM Book b
LEFT JOIN WrittenBy w on w.Titlew = b.Title
LEFT JOIN Author a on w.author_ID = a.ID;
>
> id | title | year | publisher | book_id | titlew | author_id | id | name
> -: | :--------------- | ---: | :-------- | ------: | :--------------- | --------: | ---: | :---
> 1 | Tom Sawyer | 1950 | Classics | 1 | Tom Sawyer | 1 | null | null
> 2 | Huckleberry Finn | 1950 | Classics | 2 | Huckleberry Finn | 1 | null | null
> 3 | Tom Sawyer | 1950 | Classics | null | null | null | null | null
>
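Option 4 could be sketched as follows. This is only an illustration under assumptions: the table name book_author_key is hypothetical, the column types are copied from the example tables, and ON CONFLICT needs PostgreSQL 9.5 or later. Note that it stores one row per author, so it rejects a repeated (title, year, publisher, author) row rather than a repeated set of authors.

```sql
-- Minimal stand-in for the Author table of the example
CREATE TABLE author (
    id   int PRIMARY KEY,
    name varchar(10)
);

-- Hypothetical fourth table: one row per (book description, author) pair
CREATE TABLE book_author_key (
    title     varchar(25) NOT NULL,
    year      int         NOT NULL,
    publisher varchar(10) NOT NULL,
    author_id int         NOT NULL REFERENCES author (id),
    UNIQUE (title, year, publisher, author_id)
);

INSERT INTO author VALUES (1, 'Twain');

INSERT INTO book_author_key VALUES ('Tom Sawyer', 1950, 'Classics', 1);

-- The duplicate violates the UNIQUE constraint; ON CONFLICT DO NOTHING
-- skips it silently instead of raising an error
INSERT INTO book_author_key VALUES ('Tom Sawyer', 1950, 'Classics', 1)
ON CONFLICT DO NOTHING;

-- Only the first row was stored
SELECT count(*) FROM book_author_key;
```

Without the ON CONFLICT clause the second insert would fail with a unique violation, which may be preferable if the application should notice the duplicate.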

Related

Audited table and foreign key

I have a database with multiple tables that must be audited.
As an example, I have a table of objects defined with a unique ID, a name and a description.
The name will always be the same. It is not possible to update it. "ObjectA" will always be "ObjectA".
As you can see, the name is not unique in the database, only in the logic.
The columns "from", "to" and "creator_id" are used to audit the changes. "from" is the date of the change; "to" is the date when a newer row was added, and is null for the latest row. "creator_id" is the ID of the user who made the change.
+----+----------+--------------+----------------------+----------------------+------------+
| id | name | description | from | to | creator_id |
+----+----------+--------------+----------------------+----------------------+------------+
| 1 | ObjectA | My object | 2021-05-30T00:05:00Z | 2021-05-31T05:04:36Z | 18 |
| 2 | ObjectB | My desc | 2021-05-30T02:07:25Z | null | 15 |
| 3 | ObjectA | Super object | 2021-05-31T05:04:36Z | null | 20 |
+----+----------+--------------+----------------------+----------------------+------------+
Now I have another table that must have a foreign key to this object table based on the "unique" object name.
+----+---------+-------------+
| id | foo | object_name |
+----+---------+-------------+
| 1 | blabla | ObjectA |
| 2 | wawawa | ObjectB |
+----+---------+-------------+
How can I create this link between those 2 tables?
I already tried to create another table with a uuid and to add a column "unique_identifier" to the object table. The foreign key is then linked to this uuid table and not to the object table. The issue is that I have multiple tables with this problem, so I would have to create twice as many tables.
It is also possible to use the object ID as the FK instead of the name, but that would mean updating every table that holds that FK with the new ID whenever an object is updated.
By the SQL standard, a foreign key must reference either the primary key or a unique key of the parent table. If the primary key has multiple columns, the foreign key must have the same number and order of columns. Therefore the foreign key references a unique row in the parent table; there can be no duplicates.
Another solution is to use a trigger: you can check the existence of the object in the objects table before you insert into the other table.
Update: Adding code
Prepare the tables and create the trigger. (I have only included 3 columns in the Objects table for simplicity. In the trigger, I am just printing the error in the else part; you could raise an error using the RAISERROR function to return the error to the client.)
Create table AuditObjects(id int identity (1,1),ObjectName varchar(20), ObjectDescription varchar(100) )
Insert into AuditObjects values('ObjectA','description ObjectA Test')
Insert into AuditObjects values('ObjectB','description ObjectB Test')
Insert into AuditObjects values('ObjectC','description ObjectC Test')
Insert into AuditObjects values('ObjectB','description ObjectB Test')
Insert into AuditObjects values('ObjectB','description ObjectB Test')
Insert into AuditObjects values('ObjectA','description ObjectA Test')
Create table ObjectTab2 (id int identity (1,1),foo varchar(200), ObjectName varchar(20))
go
CREATE TRIGGER t_CheckObject ON ObjectTab2 INSTEAD OF INSERT
AS BEGIN
    Declare @errormsg varchar(200), @ObjectName varchar(20)
    -- note: this reads a single value, so it only handles single-row inserts
    select @ObjectName = objectname from INSERTED
    if exists(select 1 from AuditObjects where objectname = @ObjectName)
    Begin
        INSERT INTO ObjectTab2 (foo, Objectname)
        Select foo, Objectname
        from INSERTED
    End
    Else
    Begin
        Select @errormsg = 'Object '+objectname+ ' does not exists in AuditObjects table'
        from Inserted
        print(@errormsg)
    End
END;
Now if you try to insert a row into ObjectTab2 with the object name "ObjectC", the insert is allowed because "ObjectC" is present in the audit table.
Insert into ObjectTab2 values('blabla', 'ObjectC')
Select * from ObjectTab2
id foo ObjectName
----------- ------ --------------------
1 blabla ObjectC
However, if you try to insert "ObjectD", no row is inserted and the error message is printed in the output.
Insert into ObjectTab2 values('Inserting ObjectD', 'ObjectD')
Object ObjectD does not exists in AuditObjects table
Well, it's not what you asked for, but it gives you the same functionality and results.
Can you not still go ahead with linking the two tables based on 'object name'? The only difference would be that, when you join the two tables, you would get multiple records from table1 (the first table you were referencing). You may then add a filter condition based on from and to, as per your requirements.
Post Edit -
What I meant is, you can still achieve the desired results without introducing Foreign Key in this scenario -
Let's call your tables - Table1 and Table2
-- Below will give you all records from Table1
-- ("from" and "to" are reserved words, so they are quoted here; the quoting syntax depends on your database)
SELECT T2.*, T1.description, T1.creator_id, T1."from", T1."to"
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T2.OBJECT_NAME = T1.NAME;
-- Below will give you ONLY those records from Table1 whose TO is null
SELECT T2.*, T1.description, T1.creator_id, T1."from", T1."to"
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T2.OBJECT_NAME = T1.NAME
WHERE T1."to" IS NULL;
I decided to go with an additional table to have this final design:
Table "Object"
+-------+--------------------------------------+---------+--------------+----------------------+----------------------+------------+
| id PK | identifier FK | name | description | from | to | creator_id |
+-------+--------------------------------------+---------+--------------+----------------------+----------------------+------------+
| 1 | 123e4567-e89b-12d3-a456-426614174000 | ObjectA | My object | 2021-05-30T00:05:00Z | 2021-05-31T05:04:36Z | 18 |
| 2 | 123e4567-e89b-12d3-a456-524887451057 | ObjectB | My desc | 2021-05-30T02:07:25Z | null | 15 |
| 3 | 123e4567-e89b-12d3-a456-426614174000 | ObjectA | Super object | 2021-05-31T05:04:36Z | null | 20 |
+-------+--------------------------------------+---------+--------------+----------------------+----------------------+------------+
Table "Object_identifier"
+--------------------------------------+
| identifier PK |
+--------------------------------------+
| 123e4567-e89b-12d3-a456-426614174000 |
| 123e4567-e89b-12d3-a456-524887451057 |
+--------------------------------------+
Table "foo"
+-------+--------+--------------------------------------+
| id PK | foo | object_identifier FK |
+-------+--------+--------------------------------------+
| 1 | blabla | 123e4567-e89b-12d3-a456-426614174000 |
| 2 | wawawa | 123e4567-e89b-12d3-a456-524887451057 |
+-------+--------+--------------------------------------+
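In DDL, the final design above might look like the following sketch. It assumes PostgreSQL, which the question does not specify. The table is named audited_object here instead of "Object", and the partial unique index is an optional addition; neither comes from the original design.

```sql
CREATE TABLE object_identifier (
    identifier uuid PRIMARY KEY
);

-- One row per version of an object; "from" and "to" are reserved words,
-- so they must be quoted
CREATE TABLE audited_object (
    id          integer PRIMARY KEY,
    identifier  uuid NOT NULL REFERENCES object_identifier (identifier),
    name        text NOT NULL,
    description text,
    "from"      timestamptz NOT NULL,
    "to"        timestamptz,
    creator_id  integer NOT NULL
);

-- Optional (assumption): at most one current version per object
CREATE UNIQUE INDEX audited_object_current
    ON audited_object (identifier)
    WHERE "to" IS NULL;

CREATE TABLE foo (
    id                integer PRIMARY KEY,
    foo               text,
    object_identifier uuid NOT NULL REFERENCES object_identifier (identifier)
);

-- Current version of the object each foo row points to
SELECT f.id, f.foo, o.name, o.description
FROM foo f
JOIN audited_object o
  ON o.identifier = f.object_identifier
 AND o."to" IS NULL;
```

Because foo references the stable identifier rather than a row id, adding a new version of an object does not require touching foo.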

Result of query as column value

I've got three tables:
Lessons:
CREATE TABLE lessons (
id SERIAL PRIMARY KEY,
title text NOT NULL,
description text NOT NULL,
vocab_count integer NOT NULL
);
+----+------------+------------------+-------------+
| id | title | description | vocab_count |
+----+------------+------------------+-------------+
| 1 | lesson_one | this is a lesson | 3 |
| 2 | lesson_two | another lesson | 2 |
+----+------------+------------------+-------------+
Lesson_vocabulary:
CREATE TABLE lesson_vocabulary (
lesson_id integer REFERENCES lessons(id),
vocabulary_id integer REFERENCES vocabulary(id)
);
+-----------+---------------+
| lesson_id | vocabulary_id |
+-----------+---------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 2 |
| 2 | 4 |
+-----------+---------------+
Vocabulary:
CREATE TABLE vocabulary (
id integer PRIMARY KEY,
hiragana text NOT NULL,
reading text NOT NULL,
meaning text[] NOT NULL
);
Each lesson contains multiple vocabulary items, and each vocabulary item can be included in multiple lessons.
How can I get the vocab_count column of the lessons table to be calculated and updated whenever I add more rows to the lesson_vocabulary table? Is this possible, and how would I go about doing this?
Thanks
You can use SQL triggers to serve your purpose. This would be similar to mysql after insert trigger which updates another table's column.
The trigger would look somewhat like this. I am using Oracle SQL, but there would just be minor tweaks for any other implementation.
CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
-- statement-level trigger (no FOR EACH ROW): in Oracle, a row-level trigger
-- cannot query the table it is defined on (ORA-04091, mutating table)
begin
    for lesson_cur in (select LESSON_ID, COUNT(VOCABULARY_ID) voc_cnt from LESSON_VOCABULARY group by LESSON_ID) LOOP
        update LESSONS
        set VOCAB_COUNT = LESSON_CUR.VOC_CNT
        where id = LESSON_CUR.LESSON_ID;
    end loop;
END;
It's better to create a view that calculates that (and get rid of the column in the lessons table):
select l.*, lv.vocab_count
from lessons l
left join (
select lesson_id, count(*)
from lesson_vocabulary
group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id
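As a sketch, the view suggested above might be defined as follows in PostgreSQL. The view name lessons_with_vocab_count is hypothetical, and the tables are trimmed copies of the question's (no stored vocab_count column, and the reference to vocabulary left out) so that the block runs on its own.

```sql
-- Tables as in the question, minus the stored vocab_count column
CREATE TABLE lessons (
    id          SERIAL PRIMARY KEY,
    title       text NOT NULL,
    description text NOT NULL
);
CREATE TABLE lesson_vocabulary (
    lesson_id     integer REFERENCES lessons(id),
    vocabulary_id integer
);

CREATE VIEW lessons_with_vocab_count AS
SELECT l.id, l.title, l.description,
       count(lv.vocabulary_id) AS vocab_count  -- 0 for lessons without vocabulary
FROM lessons l
LEFT JOIN lesson_vocabulary lv ON lv.lesson_id = l.id
GROUP BY l.id;  -- id is the primary key, so the other columns may be selected

INSERT INTO lessons (title, description) VALUES
    ('lesson_one', 'this is a lesson'),
    ('lesson_two', 'another lesson');
INSERT INTO lesson_vocabulary VALUES (1, 1), (1, 2), (1, 3), (2, 2), (2, 4);

SELECT id, title, vocab_count FROM lessons_with_vocab_count ORDER BY id;
```

Counting lv.vocabulary_id instead of count(*) makes a lesson with no vocabulary report 0 rather than 1 or NULL.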
If you really want to update the lessons table each time the lesson_vocabulary changes, you can run an UPDATE statement like this in a trigger:
update lessons l
set vocab_count = t.cnt
from (
select lesson_id, count(*) as cnt
from lesson_vocabulary
group by lesson_id
) t
where t.lesson_id = l.id;
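For completeness, a PostgreSQL trigger that keeps the stored column up to date could look like the sketch below. The function and trigger names are hypothetical, the DEFAULT 0 on vocab_count is an added assumption, and the reference to vocabulary is left out so that the block runs on its own. Unlike the UPDATE above, it recounts only the lessons affected by the changed row.

```sql
CREATE TABLE lessons (
    id          SERIAL PRIMARY KEY,
    title       text NOT NULL,
    description text NOT NULL,
    vocab_count integer NOT NULL DEFAULT 0
);
CREATE TABLE lesson_vocabulary (
    lesson_id     integer REFERENCES lessons(id),
    vocabulary_id integer
);

CREATE FUNCTION refresh_vocab_count() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- recount the lesson that gained a row
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE lessons
        SET vocab_count = (SELECT count(*) FROM lesson_vocabulary
                           WHERE lesson_id = NEW.lesson_id)
        WHERE id = NEW.lesson_id;
    END IF;
    -- recount the lesson that lost a row
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE lessons
        SET vocab_count = (SELECT count(*) FROM lesson_vocabulary
                           WHERE lesson_id = OLD.lesson_id)
        WHERE id = OLD.lesson_id;
    END IF;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$;

-- EXECUTE PROCEDURE works on old and new versions;
-- PostgreSQL 11+ also accepts EXECUTE FUNCTION
CREATE TRIGGER lesson_vocabulary_count
AFTER INSERT OR UPDATE OR DELETE ON lesson_vocabulary
FOR EACH ROW EXECUTE PROCEDURE refresh_vocab_count();
```

Handling DELETE and UPDATE as well as INSERT matters here, because otherwise the stored count drifts as soon as a row is removed or moved to another lesson.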
I would recommend using a query for this information:
select l.*,
       (select count(*)
        from lesson_vocabulary lv
        where lv.lesson_id = l.id
       ) as vocabulary_cnt
from lessons l;
With an index on lesson_vocabulary(lesson_id), this should be quite fast.
I recommend this over an update, because the data remains correct.
I recommend this over a trigger, because it is simpler.
I recommend this over a subquery with aggregation because it should be faster, particularly if you are filtering on the lessons table.

Prolog to SQL: Any way to improve SQL code for unit tests and fix an edge case elegantly?

Inspired by this StackOverflow question:
Find mutual element in different facts in swi-prolog
We have the following
Problem statement
Given a database of "actors starring in movies"
(starsin is the relation linking actor "bob" to movie "a" for example)
starsin(a,bob).
starsin(c,bob).
starsin(a,maria).
starsin(b,maria).
starsin(c,maria).
starsin(a,george).
starsin(b,george).
starsin(c,george).
starsin(d,george).
And given a set of movies M, find those actors that starred in all the movies of M.
The question was initially for Prolog.
Prolog solution
In Prolog, an elegant solution involves the predicate
setof/3,
which collects possible variable instantiations into a set (which is really a list without
duplicate values):
actors_appearing_in_movies(MovIn,ActOut) :-
   setof(
      Ax,
      MovAx^(setof(Mx,starsin(Mx,Ax),MovAx), subset(MovIn,MovAx)),
      ActOut
   ).
I won't go into details about this, but let's look at the test code, which is of interest here.
Here are five test cases:
actors_appearing_in_movies([],ActOut),permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a],ActOut),permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a,b],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c,d],ActOut),permutation([george],ActOut),!.
A test is a call to the predicate actors_appearing_in_movies/2, which is given
the input list of movies (e.g. [a,b]) and which captures the resulting list of
actors in ActOut.
Subsequently, we just need to test whether ActOut is a permutation of the expected
set of actors, hence for example:
permutation([george, maria],ActOut)
"Is ActOut a list that is a permutation of the list [george,maria]?"
If that call succeeds (i.e. doesn't return with false), the test passes.
The terminal ! is the cut operator and is used to tell the Prolog engine to not
reattempt to find more solutions, because we are good at that point.
Note that for the empty set of movies, we get all the actors. This is arguably correct:
every actor stars in all the movies of the empty set (Vacuous Truth).
Now in SQL.
This problem is squarely in the domain of relational algebra, and there is SQL, so let's have
a go at this. Here, I'm using MySQL.
First, set up the facts.
DROP TABLE IF EXISTS starsin;
CREATE TABLE starsin (movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL);
INSERT INTO starsin VALUES
( "a" , "bob" ),
( "c" , "bob" ),
( "a" , "maria" ),
( "b" , "maria" ),
( "c" , "maria" ),
( "a" , "george" ),
( "b" , "george" ),
( "c" , "george" ),
( "d", "george" );
Regarding the set of movies given as input, giving them in the form of a
(temporary) table sounds natural. In MySQL, "temporary tables" are local to the session. Good.
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");
Approach:
The results can now be obtained by getting, for each actor, the intersection of the set of
movies denoted by movies_in and the set of movies in which an actor ever appeared
(created for each actor via the inner join), then counting (for each actor) whether the
resulting set has at least as many entries as the set movies_in.
Wrap the query into a procedure for practical reasons.
A delimiter is useful here:
DELIMITER $$
DROP PROCEDURE IF EXISTS actors_appearing_in_movies;
CREATE PROCEDURE actors_appearing_in_movies()
BEGIN
   SELECT
      d.actor
   FROM
      starsin d, movies_in q
   WHERE
      d.movie = q.movie
   GROUP BY
      actor
   HAVING
      COUNT(*) >= (SELECT COUNT(*) FROM movies_in);
END$$
DELIMITER ;
Run it!
Problem A appears:
Is there a better way than edit + copy-paste table creation code,
issue a CALL and check the results "by hand"?
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
CALL actors_appearing_in_movies();
Empty set!
Problem B appears:
The above is not desired, I want "all actors", same as for the Prolog solution.
As I do not want to tack a weird edge case exception onto the code, my approach must
be wrong. Is there one which naturally covers this case but doesn't become too complex?
T-SQL and PostgreSQL one-liners are fine too!
The other test cases yield expected data:
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
| maria |
+--------+
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
| maria |
+--------+
DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c"), ("d");
CALL actors_appearing_in_movies();
+--------+
| actor |
+--------+
| george |
+--------+
And given set of movies M, find those actors that starred in all the movies of M.
I would use:
select si.actor
from starsin si
where si.movie in (<M>)
group by si.actor
having count(*) = <n>;
If you have to deal with an empty set, then you need a left join:
select a.actor
from actors a left join
starsin si
on a.actor = si.actor and si.movie in (<M>)
group by a.actor
having count(si.movie) = <n>;
<n> here is the number of movies in <M>.
Update: The second approach in extended form
create or replace temporary table
actor (actor char(20) primary key)
as select distinct actor from starsin;
select
a.actor,
si.actor,si.movie -- left in for docu
from
actor a left join starsin si
on a.actor = si.actor
and si.movie in (select * from movies_in)
group
by a.actor
having
count(si.movie) = (select count(*) from movies_in);
Then for empty movies_in:
+--------+-------+-------+
| actor | actor | movie |
+--------+-------+-------+
| bob | NULL | NULL |
| george | NULL | NULL |
| maria | NULL | NULL |
+--------+-------+-------+
and for this movies_in for example:
+-------+
| movie |
+-------+
| a |
| b |
+-------+
movie here is the top of the group:
+--------+--------+-------+
| actor | actor | movie |
+--------+--------+-------+
| george | george | a |
| maria | maria | a |
+--------+--------+-------+
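Another formulation that covers the empty set without a special case is the classic double NOT EXISTS form of relational division: keep an actor unless some movie of M exists in which that actor does not star. The sketch below is written for PostgreSQL (which the question allows) and makes some assumptions of its own: regular tables instead of temporary ones, text columns, and single-quoted string literals.

```sql
CREATE TABLE starsin (movie text NOT NULL, actor text NOT NULL);
INSERT INTO starsin VALUES
    ('a', 'bob'), ('c', 'bob'),
    ('a', 'maria'), ('b', 'maria'), ('c', 'maria'),
    ('a', 'george'), ('b', 'george'), ('c', 'george'), ('d', 'george');

-- Input set M; left empty here to exercise the edge case
CREATE TABLE movies_in (movie text NOT NULL);

-- Keep an actor unless some movie of M exists in which the actor does not star
SELECT DISTINCT s.actor
FROM starsin s
WHERE NOT EXISTS (
    SELECT 1
    FROM movies_in m
    WHERE NOT EXISTS (
        SELECT 1
        FROM starsin s2
        WHERE s2.actor = s.actor
          AND s2.movie = m.movie
    )
)
ORDER BY s.actor;
```

With movies_in empty, the outer NOT EXISTS is trivially true for every actor, so bob, george and maria are all returned. After inserting 'a' and 'b' into movies_in, the same query returns george and maria.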
The following solution involves counting and an UPDATE
Writeup here: A Simple Relational Database Operation
We are using MariaDB/MySQL SQL.
T-SQL or PL/SQL are more complete.
Note that SQL has no vector data types that can be passed to procedures. Gotta work without that.
Enter facts as table:
CREATE OR REPLACE TABLE starsin
(movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL,
PRIMARY KEY (movie, actor));
INSERT INTO starsin VALUES
( "a" , "bob" ),
( "c" , "bob" ),
( "a" , "maria" ),
( "b" , "maria" ),
( "c" , "maria" ),
( "a" , "george" ),
( "b" , "george" ),
( "c" , "george" ),
( "d", "george" );
Enter a procedure to compute solution and actually ... print it out.
DELIMITER $$
CREATE OR REPLACE PROCEDURE actors_appearing_in_movies()
BEGIN
-- collect all the actors
CREATE OR REPLACE TEMPORARY TABLE tmp_actor (actor CHAR(20) PRIMARY KEY)
AS SELECT DISTINCT actor from starsin;
-- table of "all actors x (input movies + '--' placeholder)"
-- (combinations that are needed for an actor to show up in the result)
-- and a flag indicating whether that combination shows up for real
CREATE OR REPLACE TEMPORARY TABLE tmp_needed
(actor CHAR(20),
movie CHAR(20),
actual TINYINT NOT NULL DEFAULT 0,
PRIMARY KEY (actor, movie))
AS
(SELECT ta.actor, mi.movie FROM tmp_actor ta, movies_in mi)
UNION
(SELECT ta.actor, "--" FROM tmp_actor ta);
-- SELECT * FROM tmp_needed;
-- Mark those (actor, movie) combinations which actually exist
-- with a numeric 1
UPDATE tmp_needed tn SET actual = 1 WHERE EXISTS
(SELECT * FROM starsin si WHERE
si.actor = tn.actor AND si.movie = tn.movie);
-- SELECT * FROM tmp_needed;
-- The result is the set of actors in "tmp_needed" which have as many
-- entries flagged "actual" as there are entries in "movies_in"
SELECT actor FROM tmp_needed GROUP BY actor
HAVING SUM(actual) = (SELECT COUNT(*) FROM movies_in);
END$$
DELIMITER ;
Testing
There is no ready-to-use unit testing framework for MariaDB, so we
"test by hand" and write a procedure, the output of which we check manually.
Variadic arguments don't exist, vector data types don't exist.
Let's accept up to 4 movies as input and check the result manually.
DELIMITER $$
CREATE OR REPLACE PROCEDURE
test_movies(IN m1 CHAR(20),IN m2 CHAR(20),IN m3 CHAR(20),IN m4 CHAR(20))
BEGIN
CREATE OR REPLACE TEMPORARY TABLE movies_in (movie CHAR(20) PRIMARY KEY);
CREATE OR REPLACE TEMPORARY TABLE args (movie CHAR(20));
INSERT INTO args VALUES (m1),(m2),(m3),(m4); -- contains duplicates and NULLs
INSERT INTO movies_in (SELECT DISTINCT movie FROM args WHERE movie IS NOT NULL); -- clean
DROP TABLE args;
CALL actors_appearing_in_movies();
END$$
DELIMITER ;
The above passes all the manual tests, in particular:
CALL test_movies(NULL,NULL,NULL,NULL);
+--------+
| actor |
+--------+
| bob |
| george |
| maria |
+--------+
3 rows in set (0.003 sec)
For example, for CALL test_movies("a","b",NULL,NULL);
First set up the table with all actors against all the movies in the input set, including the
"doesn't exist" movie represented by a placeholder --.
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 0 | bob | a |
| 0 | bob | b |
| 0 | george | -- |
| 0 | george | a |
| 0 | george | b |
| 0 | maria | -- |
| 0 | maria | a |
| 0 | maria | b |
+--------+--------+-------+
Then mark those rows with a 1 where the actor-movie combination actually exists in starsin.
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 1 | bob | a |
| 0 | bob | b |
| 0 | george | -- |
| 1 | george | a |
| 1 | george | b |
| 0 | maria | -- |
| 1 | maria | a |
| 1 | maria | b |
+--------+--------+-------+
Finally select an actor for inclusion in the solution if the SUM(actual) is equal to the
number of entries in the input movies table (it cannot be larger), as that means that the
actor indeed appears in all movies of the input movies table. In the special case where that
table is empty, the actor-movie combination table will only contain
+--------+--------+-------+
| actual | actor | movie |
+--------+--------+-------+
| 0 | bob | -- |
| 0 | george | -- |
| 0 | maria | -- |
+--------+--------+-------+
and thus all actors will be selected, which is what we want.

Migrating legacy table to normalized data structure with foreign keys in Oracle SQL

I am having some trouble wrapping my head around remaking databases. I have a book database that includes only one table, where all of the author's data is repeated after each book. I'm trying to remake this database in order to have an author table and a book table.
I made the author table using:
CREATE TABLE AUTHORS
AS SELECT AUTHOR_NAME, AUTHOR_SURNAME, AUTHOR_BIRTHDATE
If I now want to remake the book table, how do I add the foreign key so that the author of each book will be the correct one? That is, if the first entry on the original book table was:
ISBN1 Title1 Author_Name1 Author_Surname1 Author_Birthdate1
How do I import this data into the new table so that the new author field, a foreign key, references the correct entry in the author table? Sorry if it's confusing.
You are looking to split the existing table into two tables, one to store the authors and the other for books. For this to work properly, you need to create a unique id for each author. Here is a step by step approach.
Assuming the following legacy data structure:
create table old_books (
isbn NUMBER(13, 0),
title VARCHAR2(200),
author_name VARCHAR2(200),
author_surname VARCHAR2(200),
author_birthdate DATE
);
And this sample data:
ISBN | TITLE | AUTHOR_NAME | AUTHOR_SURNAME | AUTHOR_BIRTHDATE
------------: | :----- | :---------- | :------------- | :---------------
1000000000001 | book 1 | name 1 | surname 1 | 01-MAR-90
1000000000002 | book 2 | name 2 | surname 2 | 01-MAR-95
1000000000003 | book 3 | name 1 | surname 1 | 01-MAR-90
First, let's create and feed the new data structure for authors (note that you don't want to use CREATE TABLE AS SELECT ... because this does not let you add constraints or other useful options).
To generate a unique author id, we use the IDENTITY feature (available starting Oracle 12c - without this feature, we would need to create a sequence and a trigger).
In legacy data, we assume that each author is uniquely identified by its name, surname and birthdate:
CREATE TABLE authors (
id NUMBER GENERATED ALWAYS AS IDENTITY,
name VARCHAR2(200),
surname VARCHAR2(200),
birthdate DATE,
PRIMARY KEY (id)
);
INSERT INTO AUTHORS (name, surname, birthdate)
SELECT DISTINCT author_name, author_surname, author_birthdate FROM old_books;
2 rows affected
SELECT * FROM authors;
ID | NAME | SURNAME | BIRTHDATE
-: | :----- | :-------- | :--------
1 | name 1 | surname 1 | 01-MAR-90
2 | name 2 | surname 2 | 01-MAR-95
With this first table in place, we can now create the books table. It contains a foreign key that references the primary key of the authors table. To feed the table, we need to join the legacy table with the new authors table to recover the id of each author:
CREATE TABLE books (
isbn NUMBER(13, 0),
title VARCHAR2(200),
author_id NUMBER,
CONSTRAINT book_author FOREIGN KEY(author_id) REFERENCES authors(id),
PRIMARY KEY (isbn)
);
INSERT INTO books(isbn, title, author_id)
SELECT ob.isbn, ob.title, a.id
FROM old_books ob
INNER JOIN authors a
ON a.name = ob.author_name
AND a.surname = ob.author_surname
AND a.birthdate = ob.author_birthdate;
3 rows affected
SELECT * FROM books;
ISBN | TITLE | AUTHOR_ID
------------: | :----- | --------:
1000000000001 | book 1 | 1
1000000000002 | book 2 | 2
1000000000003 | book 3 | 1
All set! Data is properly spread between the two tables, with the proper constraints in place. We can join both tables with a query like:
SELECT b.isbn, b.title, a.name, a.surname, a.birthdate
FROM authors a
INNER JOIN books b ON a.id = b.author_id;
ISBN | TITLE | NAME | SURNAME | BIRTHDATE
------------: | :----- | :----- | :-------- | :--------
1000000000001 | book 1 | name 1 | surname 1 | 01-MAR-90
1000000000002 | book 2 | name 2 | surname 2 | 01-MAR-95
1000000000003 | book 3 | name 1 | surname 1 | 01-MAR-90
You say that an author's first name plus surname are your author table's primary key. This is a valid approach. In case of two authors with the same name you'd have to find a solution like 'John' + 'Smith' and 'John R.' + 'Smith' or 'John' + 'Smith (the fantasy author)'. This is called a natural composite key, albeit not a perfect one as we may have to deal with duplicate names as mentioned. On the other hand there exist authors with the same name, so we may face this problem right away ;-)
Books are identified by their ISBN, which makes for an even better natural key, because there can be no duplicates. (Only if you wanted to add very old books or self-marketed books that have no ISBN, you'd have to create a fake ISBN.)
In order to have your book referring to an author, you must include the whole key, which is first and surname here. This is no redundancy, as this is the key needed to identify an author in your database.
CREATE TABLE books AS SELECT isbn, title, author_name, author_surname FROM old_table;
ALTER TABLE books ADD CONSTRAINT fk_book_author FOREIGN KEY (author_name, author_surname)
REFERENCES authors (author_name, author_surname);
An alternative would be to introduce surrogate (i.e. technical) keys. You would generate an ID (a number) for each book and each author and work with them. (That means the book table would contain an author_id.) But for a good database you should still think about what identifies a row naturally. This makes it easier for people who write the queries later. (E.g. someone asks to select a list of authors and the number of books they've written. How to write that query? Does it suffice to show first and surname or could we end up with two rows "John Smith | 5" and "John Smith | 2" and the enquirer saying they cannot use this ambiguous result?) Even when providing surrogate keys you should still have a unique constraint on the natural key, if there is one. For books with optional ISBNs this may be title + author_id and for authors it could be first name + surname + date of birth.
By the way: There exist books with more than one author ;-)
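The advice above to keep a unique constraint on the natural key even when a surrogate key is used might be sketched like this. The constraint name authors_natural_key is hypothetical, the NOT NULL markers are an added assumption, and the types are chosen so that both Oracle 12c+ and PostgreSQL 10+ should accept the statements.

```sql
CREATE TABLE authors (
    id        INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    name      VARCHAR(200) NOT NULL,
    surname   VARCHAR(200) NOT NULL,
    birthdate DATE NOT NULL
);

-- The natural key stays unique even though rows are referenced by id
ALTER TABLE authors
    ADD CONSTRAINT authors_natural_key UNIQUE (name, surname, birthdate);

INSERT INTO authors (name, surname, birthdate)
VALUES ('name 1', 'surname 1', DATE '1990-03-01');

-- Running the same INSERT a second time is rejected by authors_natural_key
```

With this in place, a query that lists authors by name, surname and birthdate cannot return two indistinguishable rows for the same person.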

Creating a query to find matching objects in a "join" table

I am trying to find an efficient query to find all matching objects in a "join" table.
Given an object Adopter that has many Pets, and Pets that have many Adopters through an AdopterPets join table, how could I find all of the Adopters that have the same Pets?
The schema is fairly normalized and looks like this.
TABLE Adopter
INTEGER id
TABLE AdopterPets
INTEGER adopter_id
INTEGER pet_id
TABLE Pets
INTEGER id
Right now the solution I am using loops through all Adopters and asks for their pets; any time we have a match, we store it away so we can use it later. But I am sure there has to be a better way using SQL.
One SQL solution I looked at was GROUP BY but it did not seem to be the right trick for this problem.
EDIT
To explain a little more of what I am looking for I will try to give an example.
+---------+ +------------------+ +------+
| Adptors | | AdptorsPets | | Pets |
|---------| +----------+-------+ |------|
| 1 | |adptor_id | pet_id| | 1 |
| 2 | +------------------+ | 2 |
| 3 | |1 | 1 | | 3 |
+---------+ |2 | 1 | +------+
|1 | 2 |
|3 | 1 |
|3 | 2 |
|2 | 3 |
+------------------+
When you asked the Adopter with the id of 1 for any other Adopters that have the same Pets, you would be returned id 3.
If you asked the same question for the Adopter with the id of 3, you would get id 1.
If you asked the same question again of the Adopter with id 2, you would be returned nothing.
I hope this helps clear things up!
Thank you all for the help, I used a combination of a few things:
SELECT adopter_id
FROM (
    SELECT adopter_id, array_agg(pet_id ORDER BY pet_id) AS pets
    FROM adopters_pets
    GROUP BY adopter_id
) AS grouped_pets
WHERE pets = array[1,2,3] -- array must be ordered
  AND adopter_id <> current_adopter_id;
In the subquery I get pet_ids grouped by their adopter. The ordering of the pet_ids is key so that the results in the main query will not be order dependent.
In the main query I compare the results of the subquery to the pet ids of the adopter I am looking to match. For the purpose of this answer the pet_ids of the particular adopter are represented by [1,2,3]. I then make sure that the adopter I am comparing to is not included in the results.
Let me know if anyone sees any optimizations or if there is a way to compare arrays where order does not matter.
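One order-independent alternative is to skip arrays entirely and compare counts: another adopter has the same pets if their pet count, the target's pet count, and the number of pets they share are all equal. The sketch below is an assumption-laden illustration rather than the method used above: it runs the query through Python's sqlite3 module against an in-memory table so it can be tried without a server, the helper name same_pets is hypothetical, and it assumes each (adopter_id, pet_id) pair appears only once. The SQL itself is plain GROUP BY / HAVING and should carry over to PostgreSQL with its own parameter syntax.

```python
import sqlite3

# Count-based set equality: no arrays, so ordering is irrelevant.
# Assumes (adopter_id, pet_id) is unique, enforced here by the primary key.
QUERY = """
SELECT ap.adopter_id
FROM adopters_pets AS ap
LEFT JOIN adopters_pets AS mine
       ON mine.pet_id = ap.pet_id
      AND mine.adopter_id = :me
WHERE ap.adopter_id <> :me
GROUP BY ap.adopter_id
HAVING COUNT(*) = COUNT(mine.pet_id)          -- every pet of theirs is also mine
   AND COUNT(*) = (SELECT COUNT(*)            -- and they have as many as I do
                   FROM adopters_pets
                   WHERE adopter_id = :me)
ORDER BY ap.adopter_id
"""


def same_pets(conn, me):
    """Hypothetical helper: ids of adopters whose pet set equals that of `me`."""
    return [row[0] for row in conn.execute(QUERY, {"me": me})]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adopters_pets ("
    " adopter_id INTEGER, pet_id INTEGER,"
    " PRIMARY KEY (adopter_id, pet_id))"
)
# Rows taken from the example in the question.
conn.executemany(
    "INSERT INTO adopters_pets VALUES (?, ?)",
    [(1, 1), (2, 1), (1, 2), (3, 1), (3, 2), (2, 3)],
)

print(same_pets(conn, 1))  # [3]
print(same_pets(conn, 2))  # []
```

COUNT(mine.pet_id) only counts rows where the LEFT JOIN found a match, which is what makes it the "shared pets" count.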
I'm not sure if this is exactly what you're looking for but this might give you some ideas.
First I created some sample data:
create table adopter (id serial not null primary key, name varchar );
insert into adopter (name) values ('Bob'), ('Sally'), ('John');
create table pets (id serial not null primary key, kind varchar);
insert into pets (kind) values ('Dog'), ('Cat'), ('Rabbit'), ('Snake');
create table adopterpets (adopter_id integer, pet_id integer);
insert into adopterpets values (1, 1), (1, 2), (2, 1), (2,3), (2,4), (3, 1), (3,3);
Next I ran this query:
SELECT p.kind, array_agg(a.name) AS adopters
FROM pets p
JOIN adopterpets ap ON ap.pet_id = p.id
JOIN adopter a ON a.id = ap.adopter_id
GROUP BY p.kind
HAVING count(*) > 1
ORDER BY kind;
kind | adopters
--------+------------------
Dog | {Bob,Sally,John}
Rabbit | {Sally,John}
(2 rows)
In this example, for each pet I'm creating an array of all owners. The HAVING count(*) > 1 clause ensures we only show pets with shared owners (more than 1). If we leave this out we'll include pets that don't share owners.
UPDATE
@scommette: Glad you've got it working! I've refactored your working example a little bit below to:
use the @> operator. This checks whether one array contains the other, which avoids the need to explicitly set the order
move the grouped_pets subquery to a CTE. This isn't the only solution, but it neatly allows you to both filter out the current_adopter_id and get the pets for that id
You might find it helpful to wrap this in a function.
WITH grouped_pets AS (
    SELECT adopter_id, array_agg(pet_id ORDER BY pet_id) AS pets
    FROM adopters_pets
    GROUP BY adopter_id
)
SELECT * FROM grouped_pets
WHERE adopter_id <> 3
AND pets @> (
    SELECT pets FROM grouped_pets WHERE adopter_id = 3
);
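One caveat worth spelling out: array containment is a superset test, not an equality test, so an adopter who owns all of adopter 3's pets plus some extra ones would also be returned. Requiring containment in both directions (in PostgreSQL, also applying the <@ "is contained by" operator) restores exact matching. The sketch below models this with Python sets rather than PostgreSQL arrays; adopter 4 and both function names are hypothetical additions for illustration.

```python
# Pet sets per adopter, from the question's example, plus a hypothetical
# adopter 4 who owns every pet (added here only to show the difference).
PETS = {1: {1, 2}, 2: {1, 3}, 3: {1, 2}, 4: {1, 2, 3}}


def containing(pets, me):
    """Adopters whose pets are a superset of mine (models pets @> mine)."""
    return sorted(a for a, s in pets.items() if a != me and s >= pets[me])


def exactly_matching(pets, me):
    """Adopters whose pets contain mine and are contained by mine."""
    return sorted(
        a for a, s in pets.items()
        if a != me and s >= pets[me] and s <= pets[me]
    )


print(containing(PETS, 3))        # [1, 4]
print(exactly_matching(PETS, 3))  # [1]
```

Whether the superset behaviour is a bug or a feature depends on whether "has the same pets" should mean "at least the same" or "exactly the same".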
If you're using Oracle then wm_concat could be useful here
select pet_id, wm_concat(adopter_id) adopters
from AdopterPets
group by pet_id ;
--
-- Relational division 1.0
-- Show all people who own *exactly* the same (non-empty) set
-- of animals as I do.
--
-- Test data
CREATE TABLE adopter (id INTEGER NOT NULL primary key, fname varchar );
INSERT INTO adopter (id,fname) VALUES (1,'Bob'), (2,'Alice'), (3,'Chris');
CREATE TABLE pets (id INTEGER NOT NULL primary key, kind varchar);
INSERT INTO pets (id,kind) VALUES (1,'Dog'), (2,'Cat'), (3,'Pig');
CREATE TABLE adopterpets (adopter_id integer REFERENCES adopter(id)
, pet_id integer REFERENCES pets(id)
);
INSERT INTO adopterpets (adopter_id,pet_id) VALUES (1, 1), (1, 2), (2, 1), (2,3), (3,1), (3,2);
-- Show it to the world
SELECT ap.adopter_id, ap.pet_id
, a.fname, p.kind
FROM adopterpets ap
JOIN adopter a ON a.id = ap.adopter_id
JOIN pets p ON p.id = ap.pet_id
ORDER BY ap.adopter_id,ap.pet_id;
SELECT DISTINCT other.fname AS same_as_me
FROM adopter other
-- moi has *at least* one same kind of animal as toi
WHERE EXISTS (
    SELECT * FROM adopterpets moi
    JOIN adopterpets toi ON moi.pet_id = toi.pet_id
    WHERE toi.adopter_id = other.id
    AND moi.adopter_id <> toi.adopter_id
    -- C'est moi!
    AND moi.adopter_id = 1 -- 'Bob'
    -- But moi should not own an animal that toi doesn't have
    AND NOT EXISTS (
        SELECT * FROM adopterpets lnx
        WHERE lnx.adopter_id = moi.adopter_id
        AND NOT EXISTS (
            SELECT *
            FROM adopterpets lnx2
            WHERE lnx2.adopter_id = toi.adopter_id
            AND lnx2.pet_id = lnx.pet_id
            )
        )
    -- ... And toi should not own an animal that moi doesn't have
    AND NOT EXISTS (
        SELECT * FROM adopterpets rnx
        WHERE rnx.adopter_id = toi.adopter_id
        AND NOT EXISTS (
            SELECT *
            FROM adopterpets rnx2
            WHERE rnx2.adopter_id = moi.adopter_id
            AND rnx2.pet_id = rnx.pet_id
            )
        )
    )
;
Result:
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "adopter_pkey" for table "adopter"
CREATE TABLE
INSERT 0 3
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "pets_pkey" for table "pets"
CREATE TABLE
INSERT 0 3
CREATE TABLE
INSERT 0 6
adopter_id | pet_id | fname | kind
------------+--------+-------+------
1 | 1 | Bob | Dog
1 | 2 | Bob | Cat
2 | 1 | Alice | Dog
2 | 3 | Alice | Pig
3 | 1 | Chris | Dog
3 | 2 | Chris | Cat
(6 rows)
same_as_me
------------
Chris
(1 row)
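The same double NOT EXISTS idea can be generalised from "same as adopter 1" to "every pair of adopters with identical pets". The following is a sketch under assumptions, not a drop-in replacement: it reuses the test data above, runs on SQLite through Python's sqlite3 module so it can be tried without a PostgreSQL server, and the helper name identical_pet_pairs is hypothetical. The a.id < b.id join condition lists each pair once.

```python
import sqlite3

# Relational division over pairs: a and b match when neither owns
# a pet the other lacks, and a owns at least one pet.
PAIRS_QUERY = """
SELECT a.fname, b.fname
FROM adopter AS a
JOIN adopter AS b ON a.id < b.id
WHERE EXISTS (SELECT 1 FROM adopterpets AS x WHERE x.adopter_id = a.id)
  AND NOT EXISTS (              -- a owns nothing that b lacks
        SELECT 1 FROM adopterpets AS x
        WHERE x.adopter_id = a.id
          AND NOT EXISTS (
                SELECT 1 FROM adopterpets AS y
                WHERE y.adopter_id = b.id AND y.pet_id = x.pet_id))
  AND NOT EXISTS (              -- b owns nothing that a lacks
        SELECT 1 FROM adopterpets AS x
        WHERE x.adopter_id = b.id
          AND NOT EXISTS (
                SELECT 1 FROM adopterpets AS y
                WHERE y.adopter_id = a.id AND y.pet_id = x.pet_id))
ORDER BY a.id, b.id
"""


def identical_pet_pairs(conn):
    """Hypothetical helper: (name, name) pairs of adopters with equal pet sets."""
    return conn.execute(PAIRS_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adopter (id INTEGER NOT NULL PRIMARY KEY, fname VARCHAR);
CREATE TABLE adopterpets (adopter_id INTEGER, pet_id INTEGER);
INSERT INTO adopter (id, fname) VALUES (1,'Bob'), (2,'Alice'), (3,'Chris');
INSERT INTO adopterpets (adopter_id, pet_id)
VALUES (1,1), (1,2), (2,1), (2,3), (3,1), (3,2);
""")

print(identical_pet_pairs(conn))  # [('Bob', 'Chris')]
```

On large tables the nested correlated subqueries may be slow, so an index on adopterpets (adopter_id, pet_id) is probably worth testing.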