I am testing against the Sakila database, see http://www.postgresqltutorial.com/postgresql-sample-database/. This database holds three relations:
film: film_id, title
actor: actor_id, first_name
film_actor: film_id, actor_id
I want to list all films and for each film, I want to to list all actors playing in that particular film. I ended with the following query:
select film_id, title, array
(
select first_name
from actor
inner join film_actor
on actor.actor_id = film_actor.actor_id
where film_actor.film_id = film.film_id
) as actors
from film
order by title;
Conceptually, this is a 1 + n query:
one query: get films
n queries: for each film f
f.actors = array(get actors playing in f)
I always understood that 1 + n queries should be avoided at all cost, as this does not scale well.
So this made me wondering: how does PostgreSQL implement this internally? Let's say we have 1000 movies, does it internally execute 1000 select actor.first_name from actor inner join ... queries? Or is PostgreSQL smarter about this and does it something like the following?
1. one query: get films
2. one query: get actors related to these films while keeping reference to film_id
3. internally: for each film f
f.actors = array(subset of (2) according to film_id)
This does 1 + 1 queries.
You are thinking in nested loops. This is something you should overcome when working with relational database (unless you are using MySQL).
What you describe as "1 + n" is a nested loop: you scan one table and for each row found, you scan the other table.
The way your SQL query is written, PostgreSQL has no choice but to execute a nested loop.
This is good as long as the outer table (film in your example) has few rows. Performance deteriorates rapidly once the outer table gets bigger.
In addition to nested loops, PostgreSQL has two other join strategies:
Hash join: The inner relation is scanned and a hash structure is created where the hash key is the join key. Then the outer relation is scanned, and the hash is probed for each row found.
Think of it as a kind of hash join, but on the inner side you have an efficient in-memory data structure.
Merge join: Both tables are sorted on the join key and merged by scanning the results simultaneously.
You are advised to write your query without “correlated subqueries” so that PostgreSQL can choose the optimal join strategy:
SELECT film_id, f.title, array_agg(a.first_name)
FROM film f
LEFT JOIN film_actor fa USING (film_id)
LEFT JOIN actor a USING (actor_id)
GROUP BY f.title
ORDER BY f.title;
The left outer join is used so that you get a result even if a film has no actors.
This is perhaps more appropriate for a comment, but it is too long.
Although I follow the logic of your query, I much prefer expressing it as:
select f.film_id, f.title,
(select array_agg(a.first_name)
from actor a inner join
film_actor fa
on a.actor_id = fa.actor_id
where fa.film_id = f.film_id
) as actors
from film f
order by f.title;
The explicit array_agg() clarifies the logic. You are aggregating the subquery, bringing the results together as an array, and then including that as a column in the outer query.
Related
I was doing a practice question for SQL which asks to create a list of album titles and unit prices for the artist "Audioslave" and find out how many records are returned.
Here is the relational database picture given in the question:
Initially, I used an inner join to retrieve the list and actually got the correct answer (40 records returned). The code is shown below:
select a.Title, t.UnitPrice
from albums a
inner join tracks t on t.AlbumId = a.AlbumId
inner join artists ar on ar.ArtistId = a.ArtistId
where ar.Name = 'Audioslave';
Although I finished the question, I was curious to try to solve this problem using nested subqueries instead and tried to first retrieve the AlbumId and UnitPrice from tracks. I got the correct answer but not the correct list (the question asked for album title and not AlbumId). Here is the code:
select AlbumId, UnitPrice
from tracks
where AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
In order to solve the problem with the list, I tried combining the previous codes. However, I get a completely different amount of records being returned (10509).
select a.Title, t.UnitPrice
from albums a
inner join tracks t
where a.AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
I don't understand what I'm doing wrong with the last code...Any help would be appreciated! Also, sorry if I wrote too much, I just wanted to convey my thinking process clearly.
Some databases (SQLite, MySQL, Maria, maybe others) allow you to write an INNER JOIN without specifying ON, and they just cross every record on the left with every record on the right in that case. If there were 2 albums and 3 tracks, 6 rows would result. If the albums were A and B, and the tracks were 1, 2 and 3, the rows would be the combination of all: A1, A2, A3, B1, B2, B3
Other databases (Postgres, SQLServer, Oracle, maybe others) refuse to do it unless you specify ON. To get an "every row on the left combined with every row on the right" you have to write CROSS JOIN (or write an inner join with an ON that is always true)
It might help your mental model of what happens during a join to consider that the db takes all the rows on the left and connects them to all the rows on the right, then for each combination of rows, assesses the truth of the ON clause, and the WHERE clause, before deciding to return the row
For example, this will return 10509 rows:
SELECT * FROM albums INNER JOIN tracks ON 1=1
The on clause is always true
This will return 10509 tracks, but only if the query is run on Monday
SELECT * FROM albums INNER JOIN tracks ON strftime('%w', 'now') = 1
What goes in the ON or WHERE doesn't have to have anything to do with the data in the table.. it just has to be something that resolves to a Boolean
I'm trying to simplify this SQL query (I replaced real table names with metaphorical), primarily get rid of the inner query, but I'm brain frozen can't think of any way to do it.
My major concern (aside from aesthetics) is performance under heavy loads
The purpose of the query is to count all books grouping by genre found on any particular shelve where the book is kept (hence the inner query which is effectively telling which shelve to count books on).
SELECT g.name, count(s.book_id) occurances FROM genre g
LEFT JOIN shelve s ON g.shelve_id=s.id
WHERE s.id=(SELECT genre_id FROM book WHERE id=111)
GROUP BY s.genre_id, g.name
It seems like you want to know many books that are on a shelf are in the same genre as book 111: if you liked book "X", we have this many similar books in stock.
One thing I noticed is the WHERE clause in the original required a value for the shelve table, effectively converting it to an INNER JOIN. And speaking of JOINs, you can JOIN instead of the nested select.
SELECT g.name, count(s.book_id) occurances
FROM genre g
INNER JOIN shelve s ON s.id = b.shelve_id
INNER JOIN book b on b.genre_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Thinking about it more, I might also start with book rather than genre. In the end, the only reason you need the genre table at all is to find the name, and therefore matching to it by id may be more effective.
SELECT g.name, count(s.book_id) occurances
FROM book b
INNER JOIN shelve s ON s.id = b.genre_id
INNER JOIN genre g on g.shelve_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Not sure they meet your idea of "simpler" or not, but they are alternatives.
... unless matching shelve.id with book.genre_id is a typo in the question. It seems very odd the two tables would share the same id values, in which case these will both be wrong.
I had to write an SQL-Query for a given Database (it's huge, I won't be able to post it here, but its about artists with albums and release dates, genres etc.).
The Task was to find all artists involved in albums which contains the word "drop". I had to write an correlated and an uncorrelated query. I got the correlated:
SELECT artist
FROM CDDB.ARTISTS ar
WHERE EXISTS
(SELECT album
FROM CDDB.ALBUMS al
INNER JOIN CDDB.ARTIST2ALBUM aa ON al.albumid = aa.albumid
WHERE ar.artistid = aa.artistid
AND album LIKE '\%drop\%');
Now I have to make that uncorrelated, but I don't know how. Is it possible that one can help me without the given tables etc.?
Uncorrelated subqueries are subqueries that can be run independently from the outer query.
Generally speaking, EXISTS is correlated, IN is uncorrelated.
If you change your query to something like:
SELECT artist
FROM CDDB.ARTISTS ar
INNER JOIN CDDB.ARTIST2ALBUM aa ON ar.artistid = aa.artistid
WHERE album in
(SELECT album
FROM CDDB.ALBUMS
WHERE album LIKE '%drop%');
It is now uncorrelated.
I'm currently working on a database project and one of the problems calls for the following:
The Genre table contains twenty-five entries. The MediaType table contains 5
entries. Write a single SQL query to generate a table with three columns and 125
rows. One column should contain the list of MediaType names; one column
should contain the list of Genre names; the third column should contain a count of
the number of tracks that have each combination of media type and genre. For
example, one row will be: “Rock MPEG Audio File xxx” where xxx is the
number of MPEG Rock tracks, even if the value is 0.
Recognizing this, I believe I'll need to use a FULL OUTER JOIN, which Sqlite3 doesn't support. The part that is confusing me is generating the column with the combination. Below, I've attached the two methods I've tried.
create view T as
select MediaTypeId, M.Name as MName, GenreId, G.Name as GName
from MediaType M, Genre G
SELECT DISTINCT GName, MName, COUNT(*) FROM (
SELECT *
FROM T
OUTER LEFT JOIN MediaType
ON MName = GName
UNION ALL
SELECT *
FROM Genre
OUTER LEFT JOIN T
) GROUP BY GName, MName;
However, that returned nearly 250 rows and the GROUP BY or JOIN(s) is totally wrong.
I've also tried:
SELECT Genre.Name as GenreName, MediaTypeName, COUNT(*)
FROM Genre LEFT OUTER JOIN (
SELECT MediaType.Name as MediaTypeName, Track.Name as TrackName
FROM MediaType LEFT OUTER JOIN Track) GROUP BY GenreName, MediaTypeName;
Which returned 125 rows but they all had the same count of 3503 which leads me to believe the GROUP BY is wrong.
Also, here is a schema of the database:
https://www.dropbox.com/s/onnbwqfrfc82r1t/IMG_2429.png?dl=0
You don't use full outer join to solve this problem.
Because it looks like a homework problem, I'll describe the solution.
First, you want to generate all combinations of genres and media types. Hint: This uses a cross join.
Second, you want to count all the combinations that you have. Hint: this uses an aggregation.
Third, you want to combine these together. Hint: left join.
Let's say we have a database that records all the Movies a User has not rated yet. Each rating is recorded in a MovieRating table.
When we are looking for movies user #1234 hasn't seen yet:
SELECT *
FROM Movies
WHERE id NOT IN
(SELECT DISTINCT movie_id FROM MovieRating WHERE user_id = 1234);
Querying NOT IN can be very expensive as the size of MovieRating grows. Assume MovieRatings can have 100,000+ rows.
My question is what are some more efficient alternatives to the NOT IN query? I've heard of the LEFT OUTER JOIN and NOT EXIST queries, but are there anything else? Is there any way I can design this database differently?
A correlated sub-query using WHERE NOT EXISTS() is potential your most efficient if you have to do this, but you should test performance against your data.
You may also want to consider limiting your results both in terms of the select list (don't use *) and only getting TOP n rows. That is, you may not need 100k+ movies if the user hasn't seen them. You may want to page the results.
SELECT *
FROM Movies m
WHERE NOT EXISTS (SELECT 1
FROM MovieRating r
WHERE user_id = 1234
AND r.movie_id= m.movie_id)
This is a mock query, because I don't have a db to test this, but something along the lines of the following should work.
select m.* from Movies m
left join MovieRating mr on mr.user_id = 1234
where mr.id is null
That should join the movies table to the movie rating table based on a user id. The where clause is then going to find null entries, which would be movies a user hasn't rated.
You can try this :
SELECT M.*
FROM Movies as M
LEFT OUTER JOIN
MovieRating as MR on M.id = MR.movie_id
and MR.user_id = 1234
WHERE M.id IS NULL