Improve SQL query by replacing inner query - sql

I'm trying to simplify this SQL query (I replaced the real table names with metaphorical ones), primarily to get rid of the inner query, but my brain is frozen and I can't think of any way to do it.
My major concern (aside from aesthetics) is performance under heavy load.
The purpose of the query is to count all books, grouped by genre, found on any particular shelf where the book is kept (hence the inner query, which effectively says which shelf to count books on).
SELECT g.name, count(s.book_id) occurances FROM genre g
LEFT JOIN shelve s ON g.shelve_id=s.id
WHERE s.id=(SELECT genre_id FROM book WHERE id=111)
GROUP BY s.genre_id, g.name

It seems like you want to know how many books on a shelf are in the same genre as book 111: if you liked book "X", we have this many similar books in stock.
One thing I noticed is the WHERE clause in the original required a value for the shelve table, effectively converting it to an INNER JOIN. And speaking of JOINs, you can JOIN instead of the nested select.
SELECT g.name, count(s.book_id) occurances
FROM genre g
INNER JOIN shelve s ON s.id = g.shelve_id
INNER JOIN book b on b.genre_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Thinking about it more, I might also start with book rather than genre. In the end, the only reason you need the genre table at all is to find the name, and therefore matching to it by id may be more effective.
SELECT g.name, count(s.book_id) occurances
FROM book b
INNER JOIN shelve s ON s.id = b.genre_id
INNER JOIN genre g on g.shelve_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Not sure whether they meet your idea of "simpler", but they are alternatives.
... unless matching shelve.id with book.genre_id is a typo in the question. It seems very odd the two tables would share the same id values, in which case these will both be wrong.
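To check that the JOIN rewrite really matches the subquery version, here is a minimal sketch using Python's built-in sqlite3 module. The column layout and the sample rows are assumptions inferred from the queries above (the real tables were renamed in the question), so treat it as an illustration rather than a test against the actual schema.

```python
import sqlite3

# Hypothetical toy data; the column layout is guessed from the queries above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book   (id INTEGER, genre_id INTEGER);
CREATE TABLE shelve (id INTEGER, book_id INTEGER, genre_id INTEGER);
CREATE TABLE genre  (id INTEGER, name TEXT, shelve_id INTEGER);
INSERT INTO book   VALUES (111, 5), (112, 5), (113, 6);
INSERT INTO shelve VALUES (5, 111, 1), (5, 112, 1), (6, 113, 2);
INSERT INTO genre  VALUES (1, 'SciFi', 5), (2, 'Crime', 6), (3, 'Poetry', 7);
""")

# Original form: LEFT JOIN plus a scalar subquery in the WHERE clause.
subquery_version = conn.execute("""
    SELECT g.name, count(s.book_id) occurances
    FROM genre g
    LEFT JOIN shelve s ON g.shelve_id = s.id
    WHERE s.id = (SELECT genre_id FROM book WHERE id = 111)
    GROUP BY s.genre_id, g.name
""").fetchall()

# Rewritten form: start from book and use INNER JOINs only.
join_version = conn.execute("""
    SELECT g.name, count(s.book_id) occurances
    FROM book b
    INNER JOIN shelve s ON s.id = b.genre_id
    INNER JOIN genre g ON g.shelve_id = s.id
    WHERE b.id = 111
    GROUP BY g.id, g.name
""").fetchall()

print(subquery_version)
print(join_version)
```

The 'Poetry' genre has no shelve rows, so the LEFT JOIN produces a NULL s.id for it and the WHERE clause then discards that row, which is why the LEFT JOIN behaves like an INNER JOIN here.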

Related

Need assistance with SQL statement having trouble with the JOINS and WHERE clause

The company is performing an analysis of their inventory. They are considering purging books that are not popular with their customers. To do this they need a list of books that have never been purchased. Write a query using a join that provides this information. Your results should include all the book details and the order number column. Sort your results by the book title.
SELECT o.order_nbr, b.*
FROM orders o JOIN books
WHERE
ORDER BY book_title
This is all I could come up with, I'm still learning Joins and struggling to figure out what the correct statement should be. Wasn't sure what to put in the WHERE clause and don't really know how to properly join these tables.
You need an ON clause to specify what you are joining on. Also, your WHERE clause is empty, and you are not specifying the type of JOIN you are using. Looking at the way the tables are set up, the expectation is you are going to join the BOOKS table on ORDER_ITEMS, which also contains ORDER_NBR.
In the question, it's asking to find books with no orders, so the correct join would be a LEFT JOIN between BOOKS and ORDER_ITEMS, as that will include every book, even those without orders, which will have an ORDER_NBR of NULL.
The SQL would look like:
SELECT o.order_nbr, b.*
FROM books b
LEFT JOIN order_items o on b.book_id = o.book_id
WHERE o.order_nbr is null
ORDER BY book_title
This would return only the books with no orders.
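A minimal sketch of this LEFT JOIN ... IS NULL pattern, run against an in-memory SQLite database from Python. The table layout is reduced to the columns the query needs, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, cut-down versions of the BOOKS and ORDER_ITEMS tables.
CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_title TEXT);
CREATE TABLE order_items (order_nbr INTEGER, book_id INTEGER);
INSERT INTO books VALUES (1, 'Zebra Tales'), (2, 'Ant Farm'), (3, 'Moon Base');
INSERT INTO order_items VALUES (100, 1), (101, 1);
""")

# Every book is kept by the LEFT JOIN; books without a matching order item
# get NULL in o.order_nbr, and the WHERE clause keeps only those rows.
never_purchased = conn.execute("""
    SELECT o.order_nbr, b.*
    FROM books b
    LEFT JOIN order_items o ON b.book_id = o.book_id
    WHERE o.order_nbr IS NULL
    ORDER BY b.book_title
""").fetchall()

print(never_purchased)
```

Only 'Zebra Tales' has order items in this sample data, so it is the one book left out of the result.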

SQLite Subqueries and Inner Joins

I was doing a practice question for SQL which asks to create a list of album titles and unit prices for the artist "Audioslave" and find out how many records are returned.
Here is the relational database picture given in the question:
Initially, I used an inner join to retrieve the list and actually got the correct answer (40 records returned). The code is shown below:
select a.Title, t.UnitPrice
from albums a
inner join tracks t on t.AlbumId = a.AlbumId
inner join artists ar on ar.ArtistId = a.ArtistId
where ar.Name = 'Audioslave';
Although I finished the question, I was curious to try to solve this problem using nested subqueries instead and tried to first retrieve the AlbumId and UnitPrice from tracks. I got the correct answer but not the correct list (the question asked for album title and not AlbumId). Here is the code:
select AlbumId, UnitPrice
from tracks
where AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
In order to solve the problem with the list, I tried combining the previous codes. However, I get a completely different amount of records being returned (10509).
select a.Title, t.UnitPrice
from albums a
inner join tracks t
where a.AlbumId in (
select AlbumId
from albums
where ArtistId in (
select ArtistId
from artists
where Name = 'Audioslave'));
I don't understand what I'm doing wrong with the last code...Any help would be appreciated! Also, sorry if I wrote too much, I just wanted to convey my thinking process clearly.
Some databases (SQLite, MySQL, Maria, maybe others) allow you to write an INNER JOIN without specifying ON, and they just cross every record on the left with every record on the right in that case. If there were 2 albums and 3 tracks, 6 rows would result. If the albums were A and B, and the tracks were 1, 2 and 3, the rows would be the combination of all: A1, A2, A3, B1, B2, B3
Other databases (Postgres, SQL Server, Oracle, maybe others) refuse to do it unless you specify ON. To get "every row on the left combined with every row on the right" you have to write CROSS JOIN (or write an inner join with an ON that is always true).
It might help your mental model of what happens during a join to consider that the db takes all the rows on the left and connects them to all the rows on the right, then for each combination of rows assesses the truth of the ON clause and the WHERE clause before deciding to return the row.
For example, this pairs every album with every track; with your WHERE clause added, it is what produced your 10509 rows:
SELECT * FROM albums INNER JOIN tracks ON 1=1
The ON clause is always true.
This will return the same rows, but only if the query is run on a Monday:
SELECT * FROM albums INNER JOIN tracks ON strftime('%w', 'now') = '1'
What goes in the ON or WHERE doesn't have to have anything to do with the data in the table; it just has to be something that resolves to a Boolean.
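The two-albums, three-tracks example above can be reproduced with a small sketch in Python's sqlite3 module. The tables are cut-down stand-ins for the real albums and tracks tables, with invented rows; the second query adds the join condition that the 10509-row query in the question was missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical miniature of the albums/tracks tables: 2 albums, 3 tracks.
CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, AlbumId INTEGER, UnitPrice REAL);
INSERT INTO albums VALUES (1, 'A'), (2, 'B');
INSERT INTO tracks VALUES (1, 1, 0.99), (2, 1, 0.99), (3, 2, 0.99);
""")

# No ON clause: SQLite pairs every album with every track.
no_on = conn.execute("""
    SELECT a.Title, t.TrackId
    FROM albums a
    INNER JOIN tracks t
""").fetchall()

# With the ON clause, each track is paired only with its own album.
with_on = conn.execute("""
    SELECT a.Title, t.TrackId
    FROM albums a
    INNER JOIN tracks t ON t.AlbumId = a.AlbumId
""").fetchall()

print(len(no_on), sorted(no_on))
print(len(with_on), sorted(with_on))
```

Applied to the query in the question, the same change would be writing inner join tracks t on t.AlbumId = a.AlbumId before the WHERE clause.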

SQLZOO #12 -- confused about multiple select & join statements

I am attempting to answer question #12 on sqlzoo.net
(http://sqlzoo.net/wiki/More_JOIN_operations). I couldn't figure out the answer on my own but I did manage to find the answer online.
12: Which were the busiest years for 'John Travolta', show the year and the number of movies he made each year for any year in which he made more than 2 movies.
Answer:
SELECT yr,COUNT(title) FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title)=(SELECT MAX(c) FROM
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
One of the parts that I do not fully understand is the multiple joins:
FROM movie
JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
Is Actor being joined only with Movie, or is actor being joined with Movie JOIN Casting?
I am trying to find a website that explains complex join statements as my attempted answer was far from correct (missing many sections). I think subselect statements with multiple complex join statements is a bit confusing at the moment. But, I could not find a good website that breaks the information up to help me form my own queries.
The other part I don't fully understand is this:
(SELECT yr,COUNT(title) AS c FROM
movie JOIN casting ON movie.id=movieid
JOIN actor ON actorid=actor.id
WHERE name='John Travolta'
GROUP BY yr) AS t)
What is the above code trying to find?
Ok, glad you are not afraid to ask, and I'll do my best to help clarify what is going on... Please excuse my re-formatting of the query to my mindset of writing queries. It better shows the relationships of where things are coming from (my perspective), and may help you too.
A few other things about my rewrite. I also like to use alias references to the tables so every column is qualified with the table (or alias) it originates from. It prevents ambiguity, especially for someone who does not know your table structures and relationships between tables. (m = alias to movie, c = alias for casting, a = alias for actor tables). For the sub query, and to keep alias confusion clear, I suffixed them with 2, such as m2, c2, a2.
SELECT
      m.yr,
      COUNT(m.title)
   FROM
      movie m
         JOIN casting c
            ON m.id = c.movieid
            JOIN actor a
               ON c.actorid = a.id
   WHERE
      a.name = 'John Travolta'
   GROUP BY
      m.yr
   HAVING
      COUNT(m.title) = ( SELECT MAX(t.movieCount)
                            FROM
                               ( SELECT m2.yr,
                                        COUNT(m2.title) AS movieCount
                                    FROM
                                       movie m2
                                          JOIN casting c2
                                             ON m2.id = c2.movieid
                                             JOIN actor a2
                                                ON c2.actorid = a2.id
                                    WHERE
                                       a2.name='John Travolta'
                                    GROUP BY
                                       m2.yr ) AS t
                       )
First, notice that the outermost query (aliases m, c, a) and the innermost query (aliases m2, c2, a2) are virtually identical.
The query has to run from the deepest query first, in this case the m2, c2, a2 query. Look at it and see what IT is going to deliver. If you ran that, you would get every year he had a movie and the number of movies; the result from their sample data goes from 1976 all the way to 2010 (about 20 rows). So far, nothing complex in itself. Now, just as each table may have an alias, each subquery used as a table (such as this one) MUST have an alias, which is why the "AS t" is there. There is no true table named t; the alias wraps the entire subquery's result set.
So now, go one level up in the query also wrapped in parens...
SELECT MAX(t.movieCount)
FROM (EntireSubquery as t)
Although abbreviated, this is what the engine is doing: looking at the subquery result given the alias "t" and finding the maximum "movieCount" value, which is the highest count of movies done in a single year. In this case, the actual number is 3, and we are almost done.
Now, to the outermost query. Again, this is virtually identical to the innermost query; the only difference is the HAVING clause, which is applied after all the grouping per year is performed. It compares each year's count to the value 3 returned by the SELECT MAX( t.movieCount )...
So, all the years that had only 1 or 2 movies are excluded from the result, and only the one year that had 3 movies is included.
Now, to clarify the JOINs. Each table should have a relationship with one or more other tables (sometimes through a linking table, such as the casting table, which holds both a movie and its actors/actresses). So, think of the join as: how do I put the tables in order so that each one can touch a piece of the next until I have them all chained together? In this case:
Movie -> Casting linked by the movie ID, then Casting -> Actor by the actor ID, so that is how I lay it out visually and hierarchically. I start FROM the Movie table, JOINing to the Casting table based ON Movie ID = Casting Movie ID. Then the Casting table is joined to the Actor table based on the common Actor ID field.
FROM
   movie m
      JOIN casting c
         ON m.id = c.movieid
         JOIN actor a
            ON c.actorid = a.id
Now, this is a simple relationship, but you COULD have one primary table with multiple child-level tables. You could join multiple tables based on the respective data. Very simple sample to clarify the point. You have a student table going to a school. A student has a degree major, an ethnicity, an address state (assuming an online school and students can be from any state). If you had lookup tables for degrees, ethnicity and states, you might come up with something like...
select
      s.firstname,
      s.lastname,
      d.DegreeDescription,
      e.ethnicityDescription,
      st.stateName
   from
      students s
         join degrees d
            on s.degreemajor = d.degreeID
         join ethnicity e
            on s.ethnicityID = e.id
         join states st
            on s.homeState = st.stateID
Notice the hierarchical representation: each table is directly associated under that of the student. Not all tables need to be one level deeper than the last.
So, there are many sites out there, such as the w3schools as offered by Mark, but learn to dissect small pieces at a time... what are the bare minimum tables to get from point-A to point-Z, and draw the relationships. THEN, pare down based on the requirement criteria you are looking for.
The correct answer would be:
SELECT yr, COUNT(title)
FROM movie m
JOIN casting c ON m.id=c.movieid JOIN actor a ON c.actorid=a.id
WHERE name='John Travolta'
GROUP BY yr
HAVING COUNT(title) > 2;
The answer you found (which seems to be a mistake on the sqlzoo site) is looking for any year that has a count equal to the year with the highest count.
I used table aliases in the query above to clear up how the tables are joined. Movie is joined to casting and casting is joined to actor.
The subquery that confuses you is listing each year and a count of movies for that year that star John Travolta. It's not needed if you're answering the question as written.
As for learning resources, make sure you have the basics down. Understand everything at http://w3schools.com/sql. Try searching for "sql joining multiple tables" in your favorite search engine when you're ready for more.
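To make the difference between the two HAVING clauses concrete, here is a small sketch using Python's sqlite3 module. The table and column names follow the sqlzoo queries above, but the rows (one invented actor with 1, 3 and 4 movies in three consecutive years) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data: 1 movie in 1990, 3 in 1991, 4 in 1992.
CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
INSERT INTO actor VALUES (1, 'Example Actor');
INSERT INTO movie VALUES
    (1, 'M1', 1990),
    (2, 'M2', 1991), (3, 'M3', 1991), (4, 'M4', 1991),
    (5, 'M5', 1992), (6, 'M6', 1992), (7, 'M7', 1992), (8, 'M8', 1992);
INSERT INTO casting SELECT id, 1 FROM movie;
""")

# Movies per year for the actor; movie -> casting -> actor is one chain.
per_year = """
    SELECT m.yr, COUNT(m.title) AS movie_count
    FROM movie m
    JOIN casting c ON m.id = c.movieid
    JOIN actor a ON c.actorid = a.id
    WHERE a.name = 'Example Actor'
    GROUP BY m.yr
"""

# The question as written: any year with more than 2 movies.
more_than_two = conn.execute(
    per_year + " HAVING COUNT(m.title) > 2 ORDER BY m.yr"
).fetchall()

# The answer found online: only the year(s) whose count equals the maximum.
busiest_only = conn.execute(
    per_year
    + f" HAVING COUNT(m.title) = (SELECT MAX(movie_count) FROM ({per_year}) AS t)"
).fetchall()

print(more_than_two)
print(busiest_only)
```

With this data the first form keeps both 1991 and 1992, while the MAX-based form keeps only 1992, which is the distinction the second answer points out.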

SQL query - Joining a many-to-many relationship, filtering/joining selectively

I find myself in a bit of an unworkable situation with a SQL query and I'm hoping that I'm missing something or might learn something new. The structure of the DB2 database I'm working with isn't exactly built for this sort of query, but I'm tasked with this...
Let's say we have Table People and Table Groups. Groups can contain multiple people, and one person can be part of multiple groups. Yeah, it's already messy. In any case, there are a couple of intermediary tables linking the two. The problem is that I need to start with a list of groups, get all of the people in those groups, and then get all of the groups with which the people are affiliated, which would be a superset of the initial group set. This would mean starting with groups, joining down to the people, and then going BACK and joining to the groups again. I need information from both tables in the result set, too, so that rules out a number of techniques.
I have to join this with a number of other tables for additional information and the query is getting enormous, cumbersome, and slow. I'm wondering if there's some way that I could start with People, join it to Groups, and then specify that if a person has one group that is in the supplied set of groups (which is done via a subquery), then ALL groups for that person should be returned. I don't know of a way to make this happen, but I'm thinking (hoping) that there's a relatively clean way to make this happen in SQL.
A quick and dirty example:
SELECT ...
FROM GROUPS g
JOIN LINKING_A a
ON g.GROUPID = a.GROUPID
AND g.GROUPID IN (subquery)
JOIN LINKING_B b
ON a.GROUPLIST = b.GROUPLIST
JOIN PEOPLE p
ON b.PERSONID = p.PERSONID
--This gets me all people affiliated with groups,
-- but now I need all groups affiliated with those people...
JOIN LINKING_B b2
ON p.PERSONID = b2.PERSONID
JOIN LINKING_A a2
ON b2.GROUPLIST = a2.GROUPLIST
JOIN GROUPS g2
ON a2.GROUPID = g2.GROUPID
And then I can return information from p and g2 in the result set. You can see where I'm having trouble. That's a lot of joining on some large tables, not to mention a number of other joins that are performed in this query as well. I need to be able to query by joining PEOPLE to GROUPS, then specify that if any person has an associated group that is in the subquery, it should return ALL groups affiliated with that entry in PEOPLE. I'm thinking that GROUP BY might be just the thing, but I haven't used that one enough to really know. So if Bill is part of group A, B, and C, and our subquery returns a set containing Group A, the result set should include Bill along with groups A, B, and C.
The following is a shorter way to get all the groups that people in the supplied group list are in. Does this help?
Select g.*
From Linking_B b
Join Linking_B b2
On b2.PersonId = b.PersonId
Join Groups g
On g.GroupId = b2.GroupId
Where b.Groupid in (SubQuery)
I'm not clear why you have both Linking_A and Linking_B. Generally all you should need to represent a many-to-many relationship between two master tables is a single association table with GroupID and PersonId.
I often recommend using "common table expressions" [CTE's] in order to help you break a problem up into chunks that can be easier to understand. CTE's are specified using a WITH clause, which can contain several CTE's before starting the main SELECT query.
I'm going to assume that the list of groups you want to start with is specified by your subquery, so that will be the 1st CTE. The next one selects people who belong to those groups. The final part of the query then selects groups those people belong to, and returns the columns from both master tables.
WITH g1 as
(subquery)
, p1 as
(SELECT p.*
from g1
join Linking a1 on g1.groupID=a1.groupID
join People p on p.personID=a1.personID )
SELECT p1.*, g2.*
from p1
join Linking a2 on p1.personID=a2.personID
join Groups g2 on g2.groupID=a2.groupID
I think I'd build the list of people you want to pull records for first, then use that to query out all the groups for those people. This will work across any number of link tables with the appropriate joins added:
with persons_wanted as
(
--figure out which people are in a group you want to include
select p.person_key
from person p
join link l1
on p.person_key = l1.person_key
join groups g
on l1.group_key = g.group_key
where g.name in ('GROUP_I_WANT_PEOPLE_FROM', 'THIS_ONE_TOO')
group by p.person_key --we only want each person_key once
)
--now pull all the groups for the list of people in at least one group we want
select p.name as person_name, g.name as group_name, ...
from person p
join link l1
on p.person_key = l1.person_key
join groups g
on l1.group_key = g.group_key
where p.person_key in (select person_key from persons_wanted);
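The "Bill is in groups A, B and C" case from the question can be sketched end to end with Python's sqlite3 module. This assumes a single association table (here called link) instead of the two linking tables in the question; the names people, grp and link and all rows are hypothetical, and a literal 'A' stands in for the subquery that supplies the starting groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema with one association table between people and groups.
CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE grp    (group_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE link   (person_id INTEGER, group_id INTEGER);
INSERT INTO people VALUES (1, 'Bill'), (2, 'Ann');
INSERT INTO grp    VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO link   VALUES (1, 1), (1, 2), (1, 3), (2, 4);
""")

rows = conn.execute("""
    WITH persons_wanted AS (
        -- people who belong to at least one of the starting groups
        SELECT DISTINCT l.person_id
        FROM link l
        JOIN grp g ON g.group_id = l.group_id
        WHERE g.name IN ('A')
    )
    -- every group those people belong to, not just the starting ones
    SELECT p.name, g.name
    FROM people p
    JOIN link l ON l.person_id = p.person_id
    JOIN grp g ON g.group_id = l.group_id
    WHERE p.person_id IN (SELECT person_id FROM persons_wanted)
    ORDER BY p.name, g.name
""").fetchall()

print(rows)
```

Ann is only in group D, so she never enters persons_wanted and none of her rows appear, while Bill comes back with all three of his groups.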

How does a SQL statement containing multiple joins work?

I'm learning joins in my class, but I'm not fully grasping some of the concepts. Can somebody explain how a statement with multiple joins works?
SELECT B.TITLE, O.ORDER#, C.STATE FROM BOOKS B
LEFT OUTER JOIN ORDERITEMS OI ON B.ISBN = OI.ISBN
LEFT OUTER JOIN ORDERS O ON O.ORDER# = OI.ORDER#
LEFT OUTER JOIN CUSTOMERS C ON C.CUSTOMER# = O.CUSTOMER#;
I believe I understand that the BOOKS table is the left table in the first outer join connecting BOOKS and ORDERITEMS. All BOOKS will be shown, even if there is not an ORDERITEM for a book. After the first join, I'm not sure what is really happening.
When ORDERS is joined, which is the left table and which is the right table? The same for Customers. This is where I get lost.
The first thing the executor does is take the first pair of tables that are eligible to be joined and perform the join. In the following steps, the result of the previous join is treated as a virtual relation, so you again have a construct similar to ... FROM virt_tab LEFT JOIN real_tab .... This behavior is based on the closure concept used in relational algebra, which means that any operation on a relation produces a relation, i.e. operations can be nested. And RDBMS stands for Relational DBMS; take a look at the linked Wikipedia article.
So far I find PostgreSQL's docs the most definitive on this matter; take a look at them. The linked article gives a generic overview of how databases perform joins, along with some PostgreSQL-specific material, which is to be expected.
One of my favorite online resources is : http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
As to your question.
All books will be displayed, and only those orderitems which match a book.
Only those orders which have a related record in orderitems which relates to a book will be displayed.
Only customers who have orders with items in the books table will be listed.
So customers who don't have orders will not be listed.
Customers who have orders, but for items that are not books, will NOT be listed.
Fun stuff. Hope you enjoy it.
As to your second question: right/left only matter because of the ORDER of the tables in your FROM clause. You could make every join a left one if you rearrange the table order. All right/left do is specify the table from which you want ALL records.
Consider: you could just as easily write your select statement as:
SELECT B.TITLE, O.ORDER#, C.STATE
FROM CUSTOMERS C
RIGHT OUTER JOIN ORDERS O ON C.CUSTOMER# = O.CUSTOMER#
RIGHT OUTER JOIN ORDERITEMS OI ON O.ORDER# = OI.ORDER#
RIGHT OUTER JOIN BOOKS B ON B.ISBN = OI.ISBN
In this case, RIGHT is saying that I want all the records from the table on the right. Since books is last in the list, you'll get all books, and only those orderitems related to a book, only those orders for which the ordered item was a book, and only those customers with orders for ordered items which were books. Thus the left and right versions are the same except for the order of the tables. I avoid right joins for readability; I find it easier to go top down when thinking about what's included and what is not.
Where a book has no matching record in one of the other tables, the columns from that table will be NULL in these types of joins.
Hope this helps.
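As a rough illustration of how the chained LEFT OUTER JOINs behave, here is a sketch using Python's sqlite3 module. The column names order_no and customer_no are stand-ins for ORDER# and CUSTOMER# (chosen to avoid the # character), and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, cut-down versions of the four tables in the question.
CREATE TABLE books      (isbn TEXT PRIMARY KEY, title TEXT);
CREATE TABLE orderitems (order_no INTEGER, isbn TEXT);
CREATE TABLE orders     (order_no INTEGER PRIMARY KEY, customer_no INTEGER);
CREATE TABLE customers  (customer_no INTEGER PRIMARY KEY, state TEXT);
INSERT INTO books      VALUES ('111', 'Sold Book'), ('222', 'Unsold Book');
INSERT INTO orderitems VALUES (1, '111');
INSERT INTO orders     VALUES (1, 10), (2, 20);
INSERT INTO customers  VALUES (10, 'TX'), (20, 'FL'), (30, 'NY');
""")

# Each LEFT OUTER JOIN takes the result built so far as its left side,
# so every book survives all three joins.
rows = conn.execute("""
    SELECT b.title, o.order_no, c.state
    FROM books b
    LEFT OUTER JOIN orderitems oi ON b.isbn = oi.isbn
    LEFT OUTER JOIN orders o ON o.order_no = oi.order_no
    LEFT OUTER JOIN customers c ON c.customer_no = o.customer_no
    ORDER BY b.title
""").fetchall()

print(rows)
```

The unsold book is still listed, with NULL (None in Python) for the order and customer columns, while order 2 and the customers in FL and NY never appear because nothing on the left side of the chain links to them.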