(SQL) Creating an uncorrelated query

I had to write an SQL query for a given database (it's huge, so I can't post it here, but it's about artists with albums, release dates, genres, etc.).
The task was to find all artists involved in albums whose title contains the word "drop". I had to write a correlated and an uncorrelated query. I got the correlated one:
SELECT artist
FROM CDDB.ARTISTS ar
WHERE EXISTS
(SELECT album
FROM CDDB.ALBUMS al
INNER JOIN CDDB.ARTIST2ALBUM aa ON al.albumid = aa.albumid
WHERE ar.artistid = aa.artistid
AND album LIKE '%drop%');
Now I have to make that uncorrelated, but I don't know how. Can someone help me without the given tables etc.?

Uncorrelated subqueries are subqueries that can be run independently from the outer query.
Generally speaking, an EXISTS subquery is correlated (it refers to columns of the outer query), while an IN subquery is usually uncorrelated.
If you change your query to something like:
SELECT artist
FROM CDDB.ARTISTS ar
INNER JOIN CDDB.ARTIST2ALBUM aa ON ar.artistid = aa.artistid
WHERE aa.albumid IN
(SELECT albumid
FROM CDDB.ALBUMS
WHERE album LIKE '%drop%');
It is now uncorrelated.
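To see the difference in practice, here is a minimal sketch using Python's built-in sqlite3 module. The table names mirror the question without the CDDB prefix, the sample rows are invented, and the IN form shown is one possible uncorrelated formulation rather than the only one. The point is that the inner query runs on its own, which a correlated subquery cannot do.

```python
import sqlite3

# Invented sample data; names mirror the question without the CDDB prefix.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (artistid INTEGER PRIMARY KEY, artist TEXT);
    CREATE TABLE albums (albumid INTEGER PRIMARY KEY, album TEXT);
    CREATE TABLE artist2album (artistid INTEGER, albumid INTEGER);
    INSERT INTO artists VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO albums VALUES (10, 'raindrops'), (11, 'sunshine'), (12, 'the drop');
    INSERT INTO artist2album VALUES (1, 10), (1, 12), (2, 11), (3, 12);
""")

# Uncorrelated: this subquery references no outer column, so it runs standalone.
inner_sql = """
    SELECT aa.artistid
    FROM albums al
    INNER JOIN artist2album aa ON al.albumid = aa.albumid
    WHERE al.album LIKE '%drop%'
"""
inner_ids = sorted({row[0] for row in conn.execute(inner_sql)})

uncorrelated_rows = conn.execute(
    f"SELECT artist FROM artists WHERE artistid IN ({inner_sql}) ORDER BY artist"
).fetchall()

# Correlated: the subquery depends on ar.artistid from the outer query.
correlated_rows = conn.execute("""
    SELECT artist
    FROM artists ar
    WHERE EXISTS (
        SELECT 1
        FROM albums al
        INNER JOIN artist2album aa ON al.albumid = aa.albumid
        WHERE aa.artistid = ar.artistid
          AND al.album LIKE '%drop%')
    ORDER BY artist
""").fetchall()

print(inner_ids)
print(uncorrelated_rows)
print(correlated_rows)
```

Both forms return the same artists on this sample data. Filtering with IN on the artists table, instead of joining first, also avoids duplicate rows when an artist appears on several matching albums.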

Related

SQLite Subqueries and Inner Joins

I was doing a practice question for SQL which asks to create a list of album titles and unit prices for the artist "Audioslave" and find out how many records are returned.
Here is the relational database picture given in the question:
Initially, I used an inner join to retrieve the list and actually got the correct answer (40 records returned). The code is shown below:
select a.Title, t.UnitPrice
from albums a
inner join tracks t on t.AlbumId = a.AlbumId
inner join artists ar on ar.ArtistId = a.ArtistId
where ar.Name = 'Audioslave';
Although I finished the question, I was curious to try to solve this problem using nested subqueries instead and tried to first retrieve the AlbumId and UnitPrice from tracks. I got the correct answer but not the correct list (the question asked for album title and not AlbumId). Here is the code:
select AlbumId, UnitPrice
from tracks
where AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
To fix the problem with the list, I tried combining the two previous queries. However, I get a completely different number of records returned (10509).
select a.Title, t.UnitPrice
from albums a
inner join tracks t
where a.AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
I don't understand what I'm doing wrong with the last code...Any help would be appreciated! Also, sorry if I wrote too much, I just wanted to convey my thinking process clearly.
Some databases (SQLite, MySQL, MariaDB, maybe others) allow you to write an INNER JOIN without specifying ON, and in that case they just cross every record on the left with every record on the right. If there were 2 albums and 3 tracks, 6 rows would result. If the albums were A and B, and the tracks were 1, 2 and 3, the rows would be the combination of all: A1, A2, A3, B1, B2, B3.
Other databases (Postgres, SQL Server, Oracle, maybe others) refuse to do it unless you specify ON. To get "every row on the left combined with every row on the right" you have to write CROSS JOIN (or write an inner join with an ON that is always true).
It might help your mental model of what happens during a join to consider that the db takes all the rows on the left and connects them to all the rows on the right, then for each combination of rows, assesses the truth of the ON clause, and the WHERE clause, before deciding to return the row
For example, this will return 10509 rows:
SELECT * FROM albums INNER JOIN tracks ON 1=1
The ON clause is always true.
This will return 10509 rows, but only if the query is run on a Monday (strftime returns text, so the comparison is against the string '1'):
SELECT * FROM albums INNER JOIN tracks ON strftime('%w', 'now') = '1'
What goes in the ON or WHERE doesn't have to have anything to do with the data in the table; it just has to be something that resolves to a Boolean.
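The A1..B3 example can be reproduced with a small sketch using Python's built-in sqlite3 module. The tables are cut-down, invented stand-ins for the albums and tracks tables in the question, with 2 albums and 3 tracks.

```python
import sqlite3

# Invented miniature of the albums/tracks tables: 2 albums, 3 tracks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, AlbumId INTEGER, UnitPrice REAL);
    INSERT INTO albums VALUES (1, 'A'), (2, 'B');
    INSERT INTO tracks VALUES (1, 1, 0.99), (2, 1, 0.99), (3, 2, 0.99);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

# No ON clause: SQLite pairs every album with every track (2 * 3 rows).
no_on = count("SELECT COUNT(*) FROM albums a INNER JOIN tracks t")

# With ON: each track is paired only with its own album.
with_on = count(
    "SELECT COUNT(*) FROM albums a INNER JOIN tracks t ON t.AlbumId = a.AlbumId"
)

# Filtering on the album alone does not fix a missing ON:
# album 1 is still combined with all 3 tracks.
filtered_no_on = count(
    "SELECT COUNT(*) FROM albums a INNER JOIN tracks t WHERE a.AlbumId = 1"
)
filtered_with_on = count(
    "SELECT COUNT(*) FROM albums a INNER JOIN tracks t "
    "ON t.AlbumId = a.AlbumId WHERE a.AlbumId = 1"
)

print(no_on, with_on, filtered_no_on, filtered_with_on)
```

The third count mirrors the situation in the question: the WHERE clause narrows the albums, but without an ON condition every remaining album is still paired with every track.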

Improve SQL query by replacing inner query

I'm trying to simplify this SQL query (I replaced the real table names with metaphorical ones), primarily to get rid of the inner query, but my brain is frozen and I can't think of a way to do it.
My major concern (aside from aesthetics) is performance under heavy load.
The purpose of the query is to count all books grouping by genre found on any particular shelve where the book is kept (hence the inner query which is effectively telling which shelve to count books on).
SELECT g.name, count(s.book_id) occurances FROM genre g
LEFT JOIN shelve s ON g.shelve_id=s.id
WHERE s.id=(SELECT genre_id FROM book WHERE id=111)
GROUP BY s.genre_id, g.name
It seems like you want to know how many books on a shelf are in the same genre as book 111: if you liked book "X", we have this many similar books in stock.
One thing I noticed is the WHERE clause in the original required a value for the shelve table, effectively converting it to an INNER JOIN. And speaking of JOINs, you can JOIN instead of the nested select.
SELECT g.name, count(s.book_id) occurances
FROM genre g
INNER JOIN shelve s ON s.id = g.shelve_id
INNER JOIN book b on b.genre_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Thinking about it more, I might also start with book rather than genre. In the end, the only reason you need the genre table at all is to find the name, and therefore matching to it by id may be more effective.
SELECT g.name, count(s.book_id) occurances
FROM book b
INNER JOIN shelve s ON s.id = b.genre_id
INNER JOIN genre g on g.shelve_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Not sure they meet your idea of "simpler" or not, but they are alternatives.
... unless matching shelve.id with book.genre_id is a typo in the question. It seems very odd the two tables would share the same id values, in which case these will both be wrong.

How does PostgreSQL approach a 1 + n query?

I am testing against the Sakila database, see http://www.postgresqltutorial.com/postgresql-sample-database/. This database holds three relations:
film: film_id, title
actor: actor_id, first_name
film_actor: film_id, actor_id
I want to list all films and, for each film, list all actors playing in that particular film. I ended up with the following query:
select film_id, title, array
(
select first_name
from actor
inner join film_actor
on actor.actor_id = film_actor.actor_id
where film_actor.film_id = film.film_id
) as actors
from film
order by title;
Conceptually, this is a 1 + n query:
one query: get films
n queries: for each film f
f.actors = array(get actors playing in f)
I always understood that 1 + n queries should be avoided at all costs, as they do not scale well.
So this made me wonder: how does PostgreSQL implement this internally? Let's say we have 1000 movies: does it internally execute 1000 select actor.first_name from actor inner join ... queries? Or is PostgreSQL smarter about this, doing something like the following?
1. one query: get films
2. one query: get actors related to these films while keeping reference to film_id
3. internally: for each film f
f.actors = array(subset of (2) according to film_id)
This does 1 + 1 queries.
You are thinking in nested loops. This is something you should overcome when working with relational databases (unless you are using MySQL).
What you describe as "1 + n" is a nested loop: you scan one table and for each row found, you scan the other table.
The way your SQL query is written, PostgreSQL has no choice but to execute a nested loop.
This is good as long as the outer table (film in your example) has few rows. Performance deteriorates rapidly once the outer table gets bigger.
In addition to nested loops, PostgreSQL has two other join strategies:
Hash join: The inner relation is scanned and a hash structure is created where the hash key is the join key. Then the outer relation is scanned, and the hash is probed for each row found.
Think of it as a kind of nested loop join, but on the inner side you have an efficient in-memory data structure.
Merge join: Both tables are sorted on the join key and merged by scanning the results simultaneously.
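As a rough illustration of the three strategies, here is a toy sketch in Python over in-memory lists of tuples. It is not how PostgreSQL is implemented; the function names and sample rows are invented, and all three functions perform an inner join on the first element of each tuple.

```python
from collections import defaultdict

# Invented sample rows: (film_id, title) and (film_id, first_name).
films = [(1, "Film A"), (2, "Film B"), (3, "Film C")]
film_actors = [(1, "Ann"), (2, "Cy"), (1, "Bob")]

def nested_loop_join(outer, inner):
    """For each outer row, scan the whole inner relation."""
    return [(o[1], i[1]) for o in outer for i in inner if o[0] == i[0]]

def hash_join(outer, inner):
    """Build a hash table on the inner relation once, then probe it per outer row."""
    table = defaultdict(list)
    for key, value in inner:
        table[key].append(value)
    return [(o[1], value) for o in outer for value in table.get(o[0], [])]

def merge_join(left, right):
    """Sort both inputs on the join key and walk them in step."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((left[i][1], right[k][1]))
                k += 1
            i += 1
    return out

nested = sorted(nested_loop_join(films, film_actors))
hashed = sorted(hash_join(films, film_actors))
merged = sorted(merge_join(films, film_actors))
print(nested)
```

All three produce the same pairs; they differ only in how much work they do. The nested loop compares every outer row with every inner row, while the hash and merge variants touch each input a bounded number of times after building the hash table or sorting.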
You are advised to write your query without “correlated subqueries” so that PostgreSQL can choose the optimal join strategy:
SELECT f.film_id, f.title, array_agg(a.first_name) AS actors
FROM film f
LEFT JOIN film_actor fa USING (film_id)
LEFT JOIN actor a USING (actor_id)
GROUP BY f.film_id, f.title
ORDER BY f.title;
The left outer join is used so that you get a result even if a film has no actors.
This is perhaps more appropriate for a comment, but it is too long.
Although I follow the logic of your query, I much prefer expressing it as:
select f.film_id, f.title,
(select array_agg(a.first_name)
from actor a inner join
film_actor fa
on a.actor_id = fa.actor_id
where fa.film_id = f.film_id
) as actors
from film f
order by f.title;
The explicit array_agg() clarifies the logic. You are aggregating the subquery, bringing the results together as an array, and then including that as a column in the outer query.

SQL EXISTS Query

I am trying to write a query to select an album from a table to which at least one artist has been assigned using the EXISTS query.
Albums and Artists are contained in separate tables and it is possible to have albums to which no artists have been assigned, where the value returns as NULL.
Can someone provide an example of how to go about creating this query?
EDIT: Adding the non-working example below
SELECT artist_name FROM artist
JOIN album ON artist.artist_id = album.artist_id
WHERE EXISTS (SELECT album_id FROM album)
The query is returning the correct result, but I don't think the last line is correct because it isn't using the operation for where at least one exists, so I'm thinking there needs to be an operator in the sub-query, or something to do with a NULL value.
If the tables are like :
artist_id, artist_name, album_id
and
album_id, album_name
Then the query will be
select *
from album alb
left join artist art on(alb.album_id = art.album_id)
where art.artist_id is null
Using exists:
select *
from album alb
where not exists
(select * from artist art where art.album_id = alb.album_id)
Your query works, but for an arcane reason:
The join is doing the work.
The exists is always returning true (assuming album has at least one row).
Actually, it sort-of works, because the question is about albums and you are returning artists.
Don't use the join in the outer query. Instead, you want a correlated subquery:
SELECT al.*
FROM album al
WHERE EXISTS (SELECT 1 FROM artist a WHERE a.artist_id = al.artist_id)
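The difference between an EXISTS that is always true and a correlated EXISTS can be checked with a small sketch using Python's built-in sqlite3 module. The schema is an assumption based on the join in the question (album carries a nullable artist_id), and the album_name column and sample rows are invented.

```python
import sqlite3

# Assumed schema: album.artist_id is NULL when no artist has been assigned.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, artist_name TEXT);
    CREATE TABLE album (album_id INTEGER PRIMARY KEY, album_name TEXT, artist_id INTEGER);
    INSERT INTO artist VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO album VALUES (1, 'First', 1), (2, 'Second', NULL), (3, 'Third', 2);
""")

# Uncorrelated EXISTS: true for every outer row as long as album is not empty,
# so it filters nothing.
always_true = conn.execute("""
    SELECT album_name
    FROM album
    WHERE EXISTS (SELECT album_id FROM album)
    ORDER BY album_id
""").fetchall()

# Correlated EXISTS: evaluated per album, true only if a matching artist exists.
with_artist = conn.execute("""
    SELECT al.album_name
    FROM album al
    WHERE EXISTS (SELECT 1 FROM artist a WHERE a.artist_id = al.artist_id)
    ORDER BY al.album_id
""").fetchall()

print(always_true)
print(with_artist)
```

Only the correlated form drops the album with no assigned artist.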

What are some alternatives to a NOT IN query?

Let's say we have a database that records all the Movies a User has not rated yet. Each rating is recorded in a MovieRating table.
When we are looking for movies user #1234 hasn't seen yet:
SELECT *
FROM Movies
WHERE id NOT IN
(SELECT DISTINCT movie_id FROM MovieRating WHERE user_id = 1234);
Querying NOT IN can be very expensive as the size of MovieRating grows. Assume MovieRatings can have 100,000+ rows.
My question is: what are some more efficient alternatives to the NOT IN query? I've heard of LEFT OUTER JOIN and NOT EXISTS queries, but is there anything else? Is there any way I can design this database differently?
A correlated subquery using WHERE NOT EXISTS() is potentially your most efficient option if you have to do this, but you should test performance against your data.
You may also want to consider limiting your results both in terms of the select list (don't use *) and only getting TOP n rows. That is, you may not need 100k+ movies if the user hasn't seen them. You may want to page the results.
SELECT *
FROM Movies m
WHERE NOT EXISTS (SELECT 1
FROM MovieRating r
WHERE r.user_id = 1234
AND r.movie_id = m.id)
This is a mock query, because I don't have a db to test this, but something along the lines of the following should work.
select m.* from Movies m
left join MovieRating mr on mr.movie_id = m.id and mr.user_id = 1234
where mr.movie_id is null
That should join the Movies table to the MovieRating table on the movie id, restricted to the given user id. The where clause then keeps only the rows with no matching rating, which are the movies the user hasn't rated.
You can try this :
SELECT M.*
FROM Movies as M
LEFT OUTER JOIN
MovieRating as MR on M.id = MR.movie_id
and MR.user_id = 1234
WHERE MR.movie_id IS NULL
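To compare the three approaches side by side, here is a minimal sketch using Python's built-in sqlite3 module. The column names (Movies.id, MovieRating.movie_id, MovieRating.user_id) follow the question, while the title and rating columns and the sample rows are invented. It only checks that the queries agree; it says nothing about their relative speed, which depends on the database and its indexes.

```python
import sqlite3

# Invented sample data: user 1234 has rated movies 1 and 3, but not 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE MovieRating (movie_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO Movies VALUES (1, 'One'), (2, 'Two'), (3, 'Three');
    INSERT INTO MovieRating VALUES (1, 1234, 5), (3, 1234, 4), (2, 99, 3);
""")

not_in = conn.execute("""
    SELECT m.id, m.title
    FROM Movies m
    WHERE m.id NOT IN (SELECT movie_id FROM MovieRating WHERE user_id = 1234)
    ORDER BY m.id
""").fetchall()

not_exists = conn.execute("""
    SELECT m.id, m.title
    FROM Movies m
    WHERE NOT EXISTS (SELECT 1
                      FROM MovieRating r
                      WHERE r.user_id = 1234
                        AND r.movie_id = m.id)
    ORDER BY m.id
""").fetchall()

# Anti-join: keep the Movies rows for which the LEFT JOIN found no rating.
left_join = conn.execute("""
    SELECT m.id, m.title
    FROM Movies m
    LEFT OUTER JOIN MovieRating mr
        ON mr.movie_id = m.id AND mr.user_id = 1234
    WHERE mr.movie_id IS NULL
    ORDER BY m.id
""").fetchall()

print(not_in, not_exists, left_join)
```

In the join version the user filter has to live in the ON clause. Moving it to WHERE would discard the unmatched rows that the IS NULL test is looking for.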