Result of query as column value - sql

I've got three tables:
Lessons:
CREATE TABLE lessons (
    id SERIAL PRIMARY KEY,
    title text NOT NULL,
    description text NOT NULL,
    vocab_count integer NOT NULL
);
+----+------------+------------------+-------------+
| id | title      | description      | vocab_count |
+----+------------+------------------+-------------+
|  1 | lesson_one | this is a lesson |           3 |
|  2 | lesson_two | another lesson   |           2 |
+----+------------+------------------+-------------+
Lesson_vocabulary:
CREATE TABLE lesson_vocabulary (
    lesson_id integer REFERENCES lessons(id),
    vocabulary_id integer REFERENCES vocabulary(id)
);
+-----------+---------------+
| lesson_id | vocabulary_id |
+-----------+---------------+
|         1 |             1 |
|         1 |             2 |
|         1 |             3 |
|         2 |             2 |
|         2 |             4 |
+-----------+---------------+
Vocabulary:
CREATE TABLE vocabulary (
    id integer PRIMARY KEY,
    hiragana text NOT NULL,
    reading text NOT NULL,
    meaning text[] NOT NULL
);
Each lesson contains multiple vocabulary items, and each vocabulary item can be included in multiple lessons.
How can I get the vocab_count column of the lessons table to be calculated and updated whenever I add rows to the lesson_vocabulary table? Is this possible, and how would I go about doing it?
Thanks

You can use a trigger for this, similar to a MySQL AFTER INSERT trigger that updates another table's column.
The trigger would look somewhat like this. I am using Oracle SQL, but only minor tweaks should be needed for any other implementation.
CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
-- statement-level trigger (no FOR EACH ROW): a row-level trigger could not
-- query lesson_vocabulary here without risking a mutating-table error
BEGIN
    FOR lesson_cur IN (SELECT lesson_id, COUNT(vocabulary_id) voc_cnt
                       FROM lesson_vocabulary
                       GROUP BY lesson_id) LOOP
        UPDATE lessons
        SET vocab_count = lesson_cur.voc_cnt
        WHERE id = lesson_cur.lesson_id;
    END LOOP;
END;

It's better to create a view that calculates that (and get rid of the column in the lessons table):
select l.*, lv.vocab_count
from lessons l
left join (
    select lesson_id, count(*)
    from lesson_vocabulary
    group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id
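Wrapped up as a view it could look like this (a sketch; the view name is mine, and it assumes the vocab_count column has been dropped from lessons as suggested, since a view cannot have two columns with the same name):
create view lessons_with_vocab_count as
select l.*, lv.vocab_count
from lessons l
left join (
    select lesson_id, count(*)
    from lesson_vocabulary
    group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id;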
If you really want to update the lessons table each time the lesson_vocabulary changes, you can run an UPDATE statement like this in a trigger:
update lessons l
set vocab_count = t.cnt
from (
    select lesson_id, count(*) as cnt
    from lesson_vocabulary
    group by lesson_id
) t
where t.lesson_id = l.id;
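Since your schema is PostgreSQL (SERIAL, text[]), a minimal sketch of that trigger in PL/pgSQL could look like the following; the function and trigger names are mine:
CREATE OR REPLACE FUNCTION refresh_vocab_count()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE lessons l
    SET vocab_count = t.cnt
    FROM (
        SELECT lesson_id, count(*) AS cnt
        FROM lesson_vocabulary
        GROUP BY lesson_id
    ) t
    WHERE t.lesson_id = l.id;
    RETURN NULL; -- the return value is ignored for AFTER statement-level triggers
END;
$$;
CREATE TRIGGER lesson_vocabulary_refresh
AFTER INSERT OR UPDATE OR DELETE ON lesson_vocabulary
FOR EACH STATEMENT
EXECUTE PROCEDURE refresh_vocab_count();
A statement-level trigger is enough here, because the UPDATE recomputes the counts for all lessons anyway.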

I would recommend using a query for this information:
select l.*,
       (select count(*)
        from lesson_vocabulary lv
        where lv.lesson_id = l.id
       ) as vocabulary_cnt
from lessons l;
With an index on lesson_vocabulary(lesson_id), this should be quite fast.
I recommend this over an update, because the data remains correct.
I recommend this over a trigger, because it is simpler.
I recommend this over a subquery with aggregation because it should be faster, particularly if you are filtering on the lessons table.
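For completeness, that index would be created like this (the index name is mine):
create index lesson_vocabulary_lesson_id_idx on lesson_vocabulary (lesson_id);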


Making combinations of attributes unique in PostgreSQL

In PostgreSQL we can constrain multiple attributes of a single table to be unique together:
UNIQUE (A, B, C)
Is it possible to take attributes from multiple tables and make their combination unique in some way?
Edit:
Table 1: List of Book
Attributes: ID, Title, Year, Publisher
Table 2: List of Author
Attributes: Name, ID
Table 3: Written By: Relation between Book and Author
Attributes: Book_ID, Author_ID
Now I have a situation where I don't want (Title, Year, Publisher, Authors) to get repeated anywhere in my database.
There are several ways to approach this problem:
1. You add a column "Author_ID" to the table "Book", as a foreign key, and then add a UNIQUE constraint over (Title, Year, Publisher, Author_ID) to the table "Book". See the sketch after this list.
2. You put a foreign key on the two columns (Book_ID, Author_ID) referencing the table WrittenBy.
3. You create a trigger on insert on the table "Book" which checks whether the combination already exists and skips the insert if it does. You will find a working example of this option below.
Whilst working on this option I realised that the JOIN to WrittenBy must be done on Title and not on ID; otherwise we could record the same book as many times as we like just by using a new ID. The problem with using the title is that the slightest change in spelling or punctuation means it is treated as a new title.
In the example below, the 3rd insert fails because the combination already exists. In the 4th I have left 2 spaces in "Tom  Sawyer" and it is accepted as a different title.
Also, since we use a join to find the author, the real effect of our rule is exactly the same as a UNIQUE constraint on the table Book over the columns Title, Year and Publisher, so all the trigger code is effectively wasted effort. We thus conclude, after coding it, that this option is not effective.
4. We could create a fourth table with the 4 columns and a UNIQUE constraint on all 4. This seems a heavy solution compared to option 1.
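A rough sketch of option 1 (my illustration, assuming a single author per book; it is not part of the working example below):
CREATE TABLE Book (
    ID int primary key,
    Title varchar(25),
    Year int,
    Publisher varchar(10),
    Author_ID int REFERENCES Author(ID),
    UNIQUE (Title, Year, Publisher, Author_ID)
);
The working example for option 3 follows.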
CREATE TABLE Book (
    ID int primary key,
    Title varchar(25),
    Year int,
    Publisher varchar(10)
);
CREATE TABLE Author (
    ID int primary key,
    Name varchar(10)
);
CREATE TABLE WrittenBy (
    Book_ID int primary key,
    Titlew varchar(25),
    Author_ID int
);
CREATE FUNCTION book_insert_trigger_function()
RETURNS TRIGGER
LANGUAGE PLPGSQL
AS $$
DECLARE
    authID INTEGER;
    coun INTEGER;
BEGIN
    IF pg_trigger_depth() <> 1 THEN
        RETURN NEW;
    END IF;
    -- find the author of an existing book with the same title
    SELECT MAX(Author_ID) INTO authID
    FROM WrittenBy w
    WHERE w.Titlew = NEW.Title;
    -- count existing books with the same title, year, publisher and author
    SELECT COUNT(*) INTO coun
    FROM Book b
    LEFT JOIN WrittenBy w ON b.Title = w.Titlew
    WHERE NEW.year = b.year
    AND NEW.title = b.title
    AND NEW.publisher = b.publisher
    AND authID = COALESCE(w.Author_ID, authID);
    IF coun > 0 THEN
        RETURN NULL; -- this means that we do not insert
    ELSE
        RETURN NEW;
    END IF;
END;
$$;
CREATE TRIGGER book_insert_trigger
BEFORE INSERT ON Book
FOR EACH ROW
EXECUTE PROCEDURE book_insert_trigger_function();
INSERT INTO WrittenBy VALUES
(1,'Tom Sawyer',1),
(2,'Huckleberry Finn',1);
INSERT INTO Book VALUES (1,'Tom Sawyer',1950,'Classics');
INSERT INTO Book VALUES (2,'Huckleberry Finn',1950,'Classics');
INSERT INTO Book VALUES (3,'Tom Sawyer',1950,'Classics');  -- rejected by the trigger (duplicate)
INSERT INTO Book VALUES (3,'Tom  Sawyer',1950,'Classics'); -- note the 2 spaces: accepted as a new title
SELECT *
FROM Book b
LEFT JOIN WrittenBy w on w.Titlew = b.Title
LEFT JOIN Author a on w.author_ID = a.ID;
 id | title            | year | publisher | book_id | titlew           | author_id | id   | name
----+------------------+------+-----------+---------+------------------+-----------+------+------
  1 | Tom Sawyer       | 1950 | Classics  |       1 | Tom Sawyer       |         1 | null | null
  2 | Huckleberry Finn | 1950 | Classics  |       2 | Huckleberry Finn |         1 | null | null
  3 | Tom  Sawyer      | 1950 | Classics  |    null | null             |      null | null | null

Query to count the frequency of many-to-many associations

I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, which may have zero or more reasons:
CREATE TABLE activity (
    id integer NOT NULL
    -- other fields removed for readability
);
CREATE TABLE reason (
    id varchar(1) NOT NULL
    -- other fields here
);
For performing the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
    activity_id integer NOT NULL, -- refers to activity.id
    reason_id varchar(1) NOT NULL, -- refers to reason.id
    CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
    CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:
+-------------+-----------+
| activity_id | reason_id |
+-------------+-----------+
|           1 |         A |
|           1 |         B |
|           2 |         A |
|           2 |         B |
|           3 |         A |
|           4 |         C |
|           4 |         D |
|           4 |         E |
+-------------+-----------+
I should have something like:
+-------+---+------+------+
| count |   |      |      |
+-------+---+------+------+
| 2     | A | B    | NULL |
| 1     | A | NULL | NULL |
| 1     | C | D    | E    |
+-------+---+------+------+
Or, eventually, something like :
+-------+-------+
| count |       |
+-------+-------+
| 2     | A,B   |
| 1     | A     |
| 1     | C,D,E |
+-------+-------+
I can't find the SQL query to do this.
I think you can get what you want using this query:
SELECT count(*) AS count, reasons
FROM (
    SELECT activity_id, array_agg(reason_id) AS reasons
    FROM (
        SELECT A.id AS activity_id, AR.reason_id
        FROM activity A
        LEFT JOIN activity_reason AR ON AR.activity_id = A.id
        ORDER BY activity_id, reason_id
    ) AS ordered_reasons
    GROUP BY activity_id
) reason_arrays
GROUP BY reasons
First you aggregate all the reasons for an activity into an array, one array per activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] would be considered different sets and would get separate counts. You also need the left join, or any activity that doesn't have any reasons won't show up in the result set; I'm not sure whether that is desirable, so I can take it back out if you want activities without a reason excluded. Then you count the number of activities that share the same set of reasons.
As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.
We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
    SELECT array_agg(reason_id) AS reason_list
    FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
    GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM (
    SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
    FROM activity_reason
    GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it's typically slower for processing all or most of the table. Quoting the manual:
Alternatively, supplying the input values from a sorted subquery will usually work.
We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though, because the aggregated value can be ambiguous: with a ',' delimiter, for instance, the pairs ('a,b', 'c') and ('a', 'b,c') both aggregate to 'a,b,c'.
If reason_id were an integer (as it typically is), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
    SELECT sort(array_agg(reason_id)) AS reason_list
    FROM activity_reason2 -- like activity_reason, but with an integer reason_id
    GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
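Note that sort() only becomes available once the module is installed:
CREATE EXTENSION intarray;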
Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations
You can do this using string_agg():
select reasons, count(*)
from (select activity_id,
             string_agg(reason_id, ',' order by reason_id) as reasons
      from activity_reason
      group by activity_id
     ) a
group by reasons
order by count(*) desc;

SQL Query 2 tables null results

I was asked this question in an interview:
From the 2 tables below, write a query to pull customers with no sales orders.
How many ways are there to write this query, and which would have the best performance?
Table 1: Customer - CustomerID
Table 2: SalesOrder - OrderID, CustomerID, OrderDate
Query:
SELECT *
FROM Customer C
RIGHT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.OrderID = NULL
Is my query correct and are there other ways to write the query and get the same results?
Answering for MySQL instead of SQL Server, because you only added the SQL Server tag later, and since this was an interview question I assumed it wouldn't matter which DBMS it is for. Note, though, that the queries I wrote are standard SQL; they should run in every RDBMS out there. How each RDBMS handles those queries is another issue.
I wrote this little procedure for you, to have a test case. It creates the tables customers and orders like you specified, and I added primary keys and foreign keys, as one usually would. No other indexes, as every column worth indexing here is already a primary key. 250 customers are created, and 100 of them made an order (though for convenience none of them ordered twice or more). A dump of the data follows; I posted the script just in case you want to play around a little by increasing the numbers.
delimiter $$
create procedure fill_table()
begin
    create table customers(customerId int primary key) engine=innodb;
    set @x = 1;
    while (@x <= 250) do
        insert into customers values(@x);
        set @x := @x + 1;
    end while;
    create table orders(orderId int auto_increment primary key,
        customerId int,
        orderDate timestamp,
        foreign key fk_customer (customerId) references customers(customerId)
    ) engine=innodb;
    insert into orders(customerId, orderDate)
    select
        customerId,
        now() - interval customerId day
    from
        customers
    order by rand()
    limit 100;
end $$
delimiter ;
call fill_table();
For me, this resulted in this:
CREATE TABLE `customers` (
`customerId` int(11) NOT NULL,
PRIMARY KEY (`customerId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `customers` VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250);
CREATE TABLE `orders` (
`orderId` int(11) NOT NULL AUTO_INCREMENT,
`customerId` int(11) DEFAULT NULL,
`orderDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`orderId`),
KEY `fk_customer` (`customerId`),
CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`customerId`) REFERENCES `customers` (`customerId`)
) ENGINE=InnoDB AUTO_INCREMENT=128 DEFAULT CHARSET=utf8;
INSERT INTO `orders` VALUES (1,247,'2013-06-24 19:50:07'),(2,217,'2013-07-24 19:50:07'),(3,8,'2014-02-18 20:50:07'),(4,40,'2014-01-17 20:50:07'),(5,52,'2014-01-05 20:50:07'),(6,80,'2013-12-08 20:50:07'),(7,169,'2013-09-10 19:50:07'),(8,135,'2013-10-14 19:50:07'),(9,115,'2013-11-03 20:50:07'),(10,225,'2013-07-16 19:50:07'),(11,112,'2013-11-06 20:50:07'),(12,243,'2013-06-28 19:50:07'),(13,158,'2013-09-21 19:50:07'),(14,24,'2014-02-02 20:50:07'),(15,214,'2013-07-27 19:50:07'),(16,25,'2014-02-01 20:50:07'),(17,245,'2013-06-26 19:50:07'),(18,182,'2013-08-28 19:50:07'),(19,166,'2013-09-13 19:50:07'),(20,69,'2013-12-19 20:50:07'),(21,85,'2013-12-03 20:50:07'),(22,44,'2014-01-13 20:50:07'),(23,103,'2013-11-15 20:50:07'),(24,19,'2014-02-07 20:50:07'),(25,33,'2014-01-24 20:50:07'),(26,102,'2013-11-16 20:50:07'),(27,41,'2014-01-16 20:50:07'),(28,94,'2013-11-24 20:50:07'),(29,43,'2014-01-14 20:50:07'),(30,150,'2013-09-29 19:50:07'),(31,218,'2013-07-23 19:50:07'),(32,131,'2013-10-18 19:50:07'),(33,77,'2013-12-11 20:50:07'),(34,2,'2014-02-24 20:50:07'),(35,45,'2014-01-12 20:50:07'),(36,230,'2013-07-11 19:50:07'),(37,101,'2013-11-17 20:50:07'),(38,31,'2014-01-26 20:50:07'),(39,56,'2014-01-01 20:50:07'),(40,176,'2013-09-03 19:50:07'),(41,223,'2013-07-18 19:50:07'),(42,145,'2013-10-04 19:50:07'),(43,26,'2014-01-31 20:50:07'),(44,62,'2013-12-26 20:50:07'),(45,195,'2013-08-15 19:50:07'),(46,153,'2013-09-26 19:50:07'),(47,179,'2013-08-31 19:50:07'),(48,104,'2013-11-14 20:50:07'),(49,7,'2014-02-19 20:50:07'),(50,209,'2013-08-01 19:50:07'),(51,86,'2013-12-02 20:50:07'),(52,110,'2013-11-08 20:50:07'),(53,204,'2013-08-06 19:50:07'),(54,187,'2013-08-23 19:50:07'),(55,114,'2013-11-04 20:50:07'),(56,38,'2014-01-19 20:50:07'),(57,236,'2013-07-05 19:50:07'),(58,79,'2013-12-09 20:50:07'),(59,96,'2013-11-22 20:50:07'),(60,37,'2014-01-20 20:50:07'),(61,207,'2013-08-03 19:50:07'),(62,22,'2014-02-04 20:50:07'),(63,120,'2013-10-29 20:50:07'),(64,200,'2013-08-10 19:50:07'),(65,51,'2014-01-06 20:50:07'),(66,181,'2013-08-29 19:50:07'),(67,4,'2014-02-22 20:50:07'),(68,123,'2013-10-26 19:50:07'),(69,108,'2013-11-10 20:50:07'),(70,55,'2014-01-02 20:50:07'),(71,76,'2013-12-12 20:50:07'),(72,6,'2014-02-20 20:50:07'),(73,18,'2014-02-08 20:50:07'),(74,211,'2013-07-30 19:50:07'),(75,53,'2014-01-04 20:50:07'),(76,216,'2013-07-25 19:50:07'),(77,32,'2014-01-25 20:50:07'),(78,74,'2013-12-14 20:50:07'),(79,138,'2013-10-11 19:50:07'),(80,197,'2013-08-13 19:50:07'),(81,221,'2013-07-20 19:50:07'),(82,118,'2013-10-31 20:50:07'),(83,61,'2013-12-27 20:50:07'),(84,28,'2014-01-29 20:50:07'),(85,16,'2014-02-10 20:50:07'),(86,39,'2014-01-18 20:50:07'),(87,3,'2014-02-23 20:50:07'),(88,46,'2014-01-11 20:50:07'),(89,189,'2013-08-21 19:50:07'),(90,59,'2013-12-29 20:50:07'),(91,249,'2013-06-22 19:50:07'),(92,127,'2013-10-22 19:50:07'),(93,47,'2014-01-10 20:50:07'),(94,178,'2013-09-01 19:50:07'),(95,141,'2013-10-08 19:50:07'),(96,188,'2013-08-22 19:50:07'),(97,220,'2013-07-21 19:50:07'),(98,15,'2014-02-11 20:50:07'),(99,175,'2013-09-04 19:50:07'),(100,206,'2013-08-04 19:50:07');
Okay, now to the queries. Three ways came to my mind; I omitted the right join that MDiesel did, because it's really just another way of writing a left join. It was invented for lazy SQL developers who don't want to switch table names, but instead just rewrite one word.
Anyway, first query:
select
    c.*
from
    customers c
    left join orders o on c.customerId = o.customerId
where
    o.customerId is null;
Results in an execution plan like this:
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key         | key_len | ref              | rows | Extra                    |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
|  1 | SIMPLE      | c     | index | NULL          | PRIMARY     | 4       | NULL             |  250 | Using index              |
|  1 | SIMPLE      | o     | ref   | fk_customer   | fk_customer | 5       | wtf.c.customerId |    1 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
Second query:
select
    c.*
from
    customers c
where
    c.customerId not in (select distinct customerId from orders);
Results in an execution plan like this:
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
| id | select_type        | table  | type           | possible_keys | key         | key_len | ref  | rows | Extra                    |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
|  1 | PRIMARY            | c      | index          | NULL          | PRIMARY     | 4       | NULL |  250 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | orders | index_subquery | fk_customer   | fk_customer | 5       | func |    2 | Using index              |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
Third query:
select
    c.*
from
    customers c
where
    not exists (select 1 from orders o where o.customerId = c.customerId);
Results in an execution plan like this:
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type        | table | type  | possible_keys | key         | key_len | ref              | rows | Extra                    |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
|  1 | PRIMARY            | c     | index | NULL          | PRIMARY     | 4       | NULL             |  250 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | o     | ref   | fk_customer   | fk_customer | 5       | wtf.c.customerId |    1 | Using where; Using index |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
We can see in all execution plans that the customers table is read as a whole, but from the index (the implicit one, as the only column is the primary key). This may change when you select other columns from the table that are not in an index.
The first one seems to be the best. For each row in customers only one row in orders is read. The id column suggests that MySQL can do this in one step, as only indexes are involved.
The second query seems to be the worst (though all 3 queries shouldn't perform too badly). For each row in customers the subquery is executed (the select_type column tells us this).
The third query is not much different in that it uses a dependent subquery, but it should perform better than the second query. Explaining the small differences would lead too far here. If you're interested, here's the manual page that explains what each column and its values mean: EXPLAIN output.
Finally: I'd say that the first query will perform best, but as always, in the end one has to measure, measure and measure again.
Your query as written won't work, by the way: a comparison with = NULL is never true (you need IS NULL), and the RIGHT JOIN keeps all sales orders rather than all customers. I can think of two other ways to write this query:
SELECT C.*
FROM Customer C
LEFT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.CustomerID IS NULL
SELECT C.*
FROM Customer C
WHERE NOT C.CustomerID IN(SELECT CustomerID FROM SalesOrder)
The solutions involving outer joins will often perform better than a solution using NOT IN, but check the execution plans on your own data.

Creating a query to find matching objects in a "join" table

I am trying to find an efficient query to find all matching objects in a "join" table.
Given an object Adopter that has many Pets, and Pets that have many Adopters through an AdopterPets join table, how could I find all of the Adopters that have the same Pets?
The schema is fairly normalized and looks like this.
TABLE Adopter
    INTEGER id
TABLE AdopterPets
    INTEGER adopter_id
    INTEGER pet_id
TABLE Pets
    INTEGER id
Right now the solution I am using loops through all Adopters and asks for their pets; anytime we have a match I store it away so it can be used later. But I am sure there has to be a better way using SQL.
One SQL solution I looked at was GROUP BY but it did not seem to be the right trick for this problem.
EDIT
To explain a little more of what I am looking for I will try to give an example.
+----------+   +----------------------+   +------+
| Adopters |   | AdoptersPets         |   | Pets |
|----------|   +------------+---------+   |------|
| 1        |   | adopter_id | pet_id  |   | 1    |
| 2        |   +------------+---------+   | 2    |
| 3        |   | 1          | 1       |   | 3    |
+----------+   | 2          | 1       |   +------+
               | 1          | 2       |
               | 3          | 1       |
               | 3          | 2       |
               | 2          | 3       |
               +------------+---------+
When you ask the Adopter with the id of 1 for any other Adopters that have the same Pets, you would be returned id 3.
If you asked the same question for the Adopter with the id of 3, you would get id 1.
If you asked the same question for the Adopter with id 2, you would be returned nothing.
I hope this helps clear things up!
Thank you all for the help, I used a combination of a few things:
SELECT adopter_id
FROM (
    SELECT adopter_id, array_agg(pet_id ORDER BY pet_id) AS pets
    FROM adopters_pets
    GROUP BY adopter_id
) AS grouped_pets
WHERE pets = array[1,2,3] -- the array must be ordered
AND adopter_id <> current_adopter_id;
In the subquery I get pet_ids grouped by their adopter. The ordering of the pet_ids is key so that the results in the main query will not be order dependent.
In the main query I compare the results of the subquery to the pet ids of the adopter I am looking to match. For the purpose of this answer the pet_ids of the particular adopter are represented by [1,2,3]. I then make sure that the adopter I am comparing to is not included in the results.
Let me know if anyone sees any optimizations or if there is a way to compare arrays where order does not matter.
I'm not sure if this is exactly what you're looking for but this might give you some ideas.
First I created some sample data:
create table adopter (id serial not null primary key, name varchar );
insert into adopter (name) values ('Bob'), ('Sally'), ('John');
create table pets (id serial not null primary key, kind varchar);
insert into pets (kind) values ('Dog'), ('Cat'), ('Rabbit'), ('Snake');
create table adopterpets (adopter_id integer, pet_id integer);
insert into adopterpets values (1, 1), (1, 2), (2, 1), (2,3), (2,4), (3, 1), (3,3);
Next I ran this query:
SELECT p.kind, array_agg(a.name) AS adopters
FROM pets p
JOIN adopterpets ap ON ap.pet_id = p.id
JOIN adopter a ON a.id = ap.adopter_id
GROUP BY p.kind
HAVING count(*) > 1
ORDER BY kind;
  kind  |     adopters
--------+------------------
 Dog    | {Bob,Sally,John}
 Rabbit | {Sally,John}
(2 rows)
In this example, for each pet I'm creating an array of all owners. The HAVING count(*) > 1 clause ensures we only show pets with shared owners (more than 1). If we leave this out we'll include pets that don't share owners.
UPDATE
@scommette: Glad you've got it working! I've refactored your working example a little bit below to:
use the @> operator. This checks whether one array contains the other and avoids the need to explicitly set the order
move the grouped_pets subquery to a CTE. This isn't the only solution, but it neatly allows you to both filter out the current_adopter_id and get the pets for that id
You might find it helpful to wrap this in a function (see the sketch after the query).
WITH grouped_pets AS (
    SELECT adopter_id, array_agg(pet_id ORDER BY pet_id) AS pets
    FROM adopters_pets
    GROUP BY adopter_id
)
SELECT * FROM grouped_pets
WHERE adopter_id <> 3
AND pets @> (
    SELECT pets FROM grouped_pets WHERE adopter_id = 3
);
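As for wrapping it in a function, a possible sketch (the function name and signature are mine, not from the original answer):
CREATE OR REPLACE FUNCTION adopters_with_same_pets(target_adopter integer)
RETURNS TABLE (matched_adopter integer, matched_pets integer[])
LANGUAGE plpgsql STABLE
AS $$
BEGIN
    RETURN QUERY
    WITH grouped_pets AS (
        SELECT ap.adopter_id, array_agg(ap.pet_id ORDER BY ap.pet_id) AS pets
        FROM adopters_pets ap
        GROUP BY ap.adopter_id
    )
    SELECT g.adopter_id, g.pets
    FROM grouped_pets g
    WHERE g.adopter_id <> target_adopter
    AND g.pets @> (SELECT pets FROM grouped_pets WHERE adopter_id = target_adopter);
END;
$$;
You could then call it as SELECT * FROM adopters_with_same_pets(3);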
If you're using Oracle then wm_concat could be useful here
select pet_id, wm_concat(adopter_id) adopters
from AdopterPets
group by pet_id;
--
-- Relational division 1.0
-- Show all people who own *exactly* the same (non-empty) set
-- of animals as I do.
--
-- Test data
CREATE TABLE adopter (id INTEGER NOT NULL primary key, fname varchar );
INSERT INTO adopter (id,fname) VALUES (1,'Bob'), (2,'Alice'), (3,'Chris');
CREATE TABLE pets (id INTEGER NOT NULL primary key, kind varchar);
INSERT INTO pets (id,kind) VALUES (1,'Dog'), (2,'Cat'), (3,'Pig');
CREATE TABLE adopterpets (adopter_id integer REFERENCES adopter(id)
, pet_id integer REFERENCES pets(id)
);
INSERT INTO adopterpets (adopter_id,pet_id) VALUES (1, 1), (1, 2), (2, 1), (2,3), (3,1), (3,2);
-- Show it to the world
SELECT ap.adopter_id, ap.pet_id
, a.fname, p.kind
FROM adopterpets ap
JOIN adopter a ON a.id = ap.adopter_id
JOIN pets p ON p.id = ap.pet_id
ORDER BY ap.adopter_id,ap.pet_id;
SELECT DISTINCT other.fname AS same_as_me
FROM adopter other
-- moi has *at least* one same kind of animal as toi
WHERE EXISTS (
    SELECT * FROM adopterpets moi
    JOIN adopterpets toi ON moi.pet_id = toi.pet_id
    WHERE toi.adopter_id = other.id
    AND moi.adopter_id <> toi.adopter_id
    -- C'est moi!
    AND moi.adopter_id = 1 -- 'Bob'
    -- But moi should not own an animal that toi doesn't have
    AND NOT EXISTS (
        SELECT * FROM adopterpets lnx
        WHERE lnx.adopter_id = moi.adopter_id
        AND NOT EXISTS (
            SELECT *
            FROM adopterpets lnx2
            WHERE lnx2.adopter_id = toi.adopter_id
            AND lnx2.pet_id = lnx.pet_id
        )
    )
    -- ... And toi should not own an animal that moi doesn't have
    AND NOT EXISTS (
        SELECT * FROM adopterpets rnx
        WHERE rnx.adopter_id = toi.adopter_id
        AND NOT EXISTS (
            SELECT *
            FROM adopterpets rnx2
            WHERE rnx2.adopter_id = moi.adopter_id
            AND rnx2.pet_id = rnx.pet_id
        )
    )
);
Result:
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "adopter_pkey" for table "adopter"
CREATE TABLE
INSERT 0 3
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "pets_pkey" for table "pets"
CREATE TABLE
INSERT 0 3
CREATE TABLE
INSERT 0 6
 adopter_id | pet_id | fname | kind
------------+--------+-------+------
          1 |      1 | Bob   | Dog
          1 |      2 | Bob   | Cat
          2 |      1 | Alice | Dog
          2 |      3 | Alice | Pig
          3 |      1 | Chris | Dog
          3 |      2 | Chris | Cat
(6 rows)
same_as_me
------------
Chris
(1 row)

Removing duplicate SQL records to permit a unique key

I have a table ('sales') in a MySQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. First removing the dupes so that the constraint can be set is proving a bit tricky.
Table structure (simplified):
id (unique, autoinc)
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, i.e. the one with the highest id.
Or to put it another way, I would like to delete only the duplicate records, excluding the ids matched by the following query, whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
                   max(id) as maxId
            from sales
            group by product_id
            having count(product_id) > 1) groupedByProdId
        on s.product_id = groupedByProdId.product_id
       and s.id = groupedByProdId.maxId
I've struggled with this on two fronts: writing the query to select the correct records to delete, and also the restriction in MySQL that a subselect in the FROM clause of a DELETE cannot reference the same table from which data is being removed.
I checked out this answer and it seemed to deal with the subject, but seems specific to SQL Server, though I wouldn't rule this question out from duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
    on YourTable.product_id = yt2.product_id
    and YourTable.id < yt2.id
This only removes duplicate rows: the join matches a row only when a newer row with the same product_id exists, so the latest row for each product, and any product with just a single row, is left alone.
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
    on yt.product_id = yt2.product_id
    and yt.id < yt2.id;
Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY (note that ALTER IGNORE TABLE was removed in MySQL 5.7, so this applies to older versions only).
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field      | Type    | Null | Key | Default | Extra          |
+------------+---------+------+-----+---------+----------------+
| id         | int(11) | NO   | PRI | NULL    | auto_increment |
| product_id | int(11) | NO   |     | NULL    |                |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  1 |          1 |
|  2 |          1 |
|  3 |          2 |
|  4 |          3 |
|  5 |          3 |
|  6 |          2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  6 |          2 |
|  5 |          3 |
|  2 |          1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since the order of the ids should not matter in a database (as far as I know!). If this displeases you, however, the post linked above shows a way to solve that too. It involves creating a temporary table, which requires more disk space than the in-place method I posted above.
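For reference, a rough reconstruction of that temporary-table route (my sketch, not taken from the linked post):
-- copy the structure, add the unique key, then keep only the newest
-- row per product_id; INSERT IGNORE silently skips the duplicates
CREATE TABLE sales_dedup LIKE sales;
ALTER TABLE sales_dedup ADD UNIQUE KEY idx1 (product_id);
INSERT IGNORE INTO sales_dedup SELECT * FROM sales ORDER BY id DESC;
RENAME TABLE sales TO sales_old, sales_dedup TO sales;
DROP TABLE sales_old;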
I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for mysql might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and ranking functions (see the sketch below); however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists (
    Select 1
    From Sales As S2
    Where S2.product_id = Sales.product_id
    And S2.id > Sales.Id
    Having Count(*) > 0
)
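For reference, a sketch of that CTE approach with a ranking function, assuming SQL Server (MySQL cannot delete through a CTE, so this is not portable):
With Ranked As (
    Select id,
           Row_Number() Over (Partition By product_id
                              Order By id Desc) As rn
    From Sales
)
Delete From Ranked -- deleting from the CTE deletes the underlying Sales rows
Where rn > 1;      -- keeps only the newest row per product_id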