SQL primer: Why and how to use join statements? - sql

I have a MySQL database that looks like this:
users
( id , name )
groups
( id , name )
group_users
( id , group_id , user_id )
Now to look up all the groups that a user belongs to, I would do something like this:
select * from 'group_users' where 'user_id' = 47;
This will probably return something like:
( 1 , 3 , 47 ),
( 2 , 4 , 47 ),
But when I want to display this to the user, I'm going to want to display the name of the groups that they belong to instead of the group_id. In a loop, I could fetch each group with the group_id that was returned, but that seems like the wrong way to do it. Is this a case where a join statement should be used? Which type of join should I use and how would I use it?

In general, you want to reduce the total number of queries to the database, by making each query do more. There are many reasons why this is a good thing, but the main one is that relational database management systems are specifically designed to be able to join tables quickly and are better at it than the equivalent code in some other language. Another reason is that it's usually more expensive to open many little queries than it is to run one large query that has everything you'll end up needing.
You want to take advantage of your RDBMS's strengths, so you should try to push data access into it in a few big queries rather than lots of little queries.
Now, that's just a general rule of thumb. There are cases when it's better to do some things outside of the database. It's important that you determine which is the right case for your situation by looking into bottlenecks if and only if they occur. Don't spend time worrying about performance until you find a performance problem.
But, in general, it's better to handle joins, lookups and all other query-related tasks in the database itself than it is to try to handle it in a general-purpose language.
That said, the kind of join you want is an inner join. You'd structure your join query like this:
SELECT groups.name, group_users.user_id
FROM group_users
INNER JOIN groups
ON group_users.group_id = groups.group_id
WHERE groups.user_id = 47;

I always find these charts to be very useful when doing joins:
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

SELECT gu.id, gu.group_id, g.name, gu.user_id
FROM group_users gu
INNER JOIN Group g
ON gu.group_id = g.group_id
WHERE user_id = 47

That's a typical case for an inner join, such as:
select users.name, group.name from groups
inner join group_users on groups.id = group_users.group_id
inner join users on group_users.user_id = users.id
where user_id = 47

Related

Using COUNT (DISTINCT..) when also using INNER JOIN to join 3 tables but Postgres keeps erroring

I need to use INNER JOINs to get a series of information and then I need to COUNT this info. I need to be able to "View all courses and the instructor taking them, the capacity of the course, and the number of members currently booked on the course."
To get all the info I have done the following query:
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo, membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID;
but no matter where I put the COUNT it always tells me that whatever is outside the COUNT should be in the GROUP BY
I have worked out how to COUNT/GROUP BY the necessary info e.g.:
SELECT courseID, COUNT (DISTINCT MC.memno)
FROM Membercourse AS MC
GROUP BY MC.courseID;
but I don't know how to combine the two!
I think what you're looking for is a subquery. I'm a SQL-Server guy (not postgresql) but the concept looks to be almost identical after some crash-course postgresql googling.
Anyway, basically, when you write a SELECT statement, you can use a subquery instead of an actual table. So your SQL would look something like:
select count(*)
from
(
select stuff from table
inner join someOtherTable
)
... hopefully that makes sense. Instead of trying to write one big query where you're doing both the inner join and count, you're writing two: an inner one that gets your inner-join'ed data, and then an outer one to actually count the rows.
EDIT: To help explain a bit more on the thought process behind subqueries.
Subqueries are a way of logically breaking down the steps/processes on the data. Instead of trying to do everything in one big step, you do it in steps.
In this case, what's step one? It's to get a combined data source for your combined, inner-join'ed data.
Step 1: Write the Inner Join query
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo,
membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID;
Okay, now, what next?
Well, let's say we want to get a count of how many entries there are for each 'memno' in that result above.
Instead of trying to figure out how to modify that query above, we instead use it as a data source, like it was a table itself.
Step 2 - Make it A Subquery
select * from
(
SELECT
C.coursename, Instructors.fname, Instructors.lname,C.maxNo,
membercourse.memno
FROM Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo
INNER JOIN Membercourse ON C.courseID = Membercourse.courseID
) mySubQuery
Step 3 - Modify your outer query to get the data you want.
Well, we wanted to group by 'memno', and get the count, right? So...
select memno, count(*)
from
(
-- all that same subquery stuff
) mySubQuery
group by memno
... make sense? Once you've got your subquery written out, you don't need to worry about it any more - you just treat it like a table you're working with.
This is actually incredibly important, and makes it much easier to read more intricate queries - especially since you can name your subqueries in a way that explains what the subquery represents data-wise.
There are many ways to solve this, such using Window Functions and so on. But you can also achieve it using a simple subquery:
SELECT
C.coursename,
Instructors.fname,
Instructors.lname,
C.maxNo,
(SELECT
COUNT(*)
FROM
membercourse
WHERE
C.courseID = Membercourse.courseID) AS members
FROM
Courses AS C
INNER JOIN Instructors ON C.instructorNo = Instructors.instructorNo;

SQL query involving group by and joins

I couldn't be more specific in the title part but I want to do something a little bit complex for me. I thought I did it but it turned out that it is buggy.
I have three tables as following:
ProjectTable
idProject
title
idOwner
OfferTable
idOffer
idProject
idAccount
AccountTable
idAccount
Username
Now in one query I aim to list all the projects with most offers made, and in the query I also want to get details like the username of the owner, username of the offerer* etc. So I don't have to query again for each project.
Here is my broken query, it's my first experiment with GROUP BY and I probably didn't quite get it.
SELECT Project.addDate,Project.idOwner ,Account.Username,Project.idProject,
Project.Price,COUNT(Project.idProject) as offercount
FROM Project
INNER JOIN Offer
ON Project.idProject= Offer.idProject
INNER JOIN Account
ON Account.idAccount = Project.idOwner
GROUP BY Project.addDate,Project.idOwner,
Account.Username,Project.idProject,Project.Price
ORDER BY addDate DESC
*:I wrote that without thinking I was just trying to come up with example extra information, that is meaningless thanks to Hosam Aly.
Try this (modified for projects with no offers):
SELECT
Project.addDate,
Project.idOwner,
Account.Username,
Project.idProject,
Project.Price,
ISNULL(q.offercount, 0) AS offercount
FROM
(
SELECT
o.idProject,
COUNT(o.idProject) as offercount
FROM Offer o
GROUP BY o.idProject
) AS q
RIGHT JOIN Project ON Project.idProject = q.idProject
INNER JOIN Account ON Account.idAccount = Project.idOwner
ORDER BY addDate DESC
I might switch the query slightly to this:
select p.addDate,
p.idOwner,
a.Username,
p.idProject,
p.price,
o.OfferCount
from project p
left join
(
select count(*) OfferCount, idproject
from offer
group by idproject
) o
on p.idproject = o.idproject
left join account a
on p.idowner = a.idaccount
This way, you are getting the count by the projectid and not based on all of the other fields you are grouping by. I am also using a LEFT JOIN in the event the projectid or other id doesn't exist in the other tables, you will still return data.
Your question is a bit vague, but here are some pointers:
To list the projects "with most offers made", ORDER BY offercount.
You're essentially querying for projects, so you should GROUP BY Project.idProject first before the other fields.
You're querying for the number of offers made on each project, yet you ask about offer details. It doesn't really make sense (syntax-wise) to ask for the two pieces of information together. If you want to get the total number of offers, repeated in every record of the result, along with offer information, you'll have to use an inner query for that.
An inner query can be made either in the FROM clause, as suggested by other answers, or directly in the SELECT clause, like so:
SELECT Project.idProject,
(SELECT COUNT(Offer.idOffer)
FROM Offer
WHERE Offer.idProject = Project.idProject
) AS OfferCount
FROM Project

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have query in query? Is it "inner query"? "Sub-query"? Does it counts as three queries (my example)? If its bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially if the columns compared are
indexed or not
nullable or not
..., but generally if your query is not using columns from the table joined to -- you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id )
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong in using them, and can make your life easier. However, performance can often be improved by rewriting queries w/ subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (eg, a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use an normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
Explain analyze and time query to see if there is a problem.
Aso you could try expressing the query as a union
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
To find out if there is a performance problem measure it. PostgreSQL can explain it for you.
I don't think that the query needs simplifying, if you identify a performance problem then you may need to revise your indexes.

Select Users with more Items

Each user HAS MANY photos and HAS MANY comments. I would like to order users by SUM(number_of_photos, number_of_comments)
Can you suggest me the SQL query?
GROUP BY with JOINs works more efficiently than dependent subqueries (in all relational DBs I know):
Select * From Users
Left Join Photos On (Photos.user_id = Users.id)
Left Join Comments On (Comments.user_id = Users.id)
Group By UserId
Order By (Count(Photos.id) + Count(Comments.id))
with some assumptions on the tables (e.g. an id primary key in each of them).
Select * From Users U
Order By (Select Count(*) From Photos
Where userId = U.UserId) +
(Select Count(*) From Comments
Where userId = U.UserId)
EDIT: although every query using subqueries can also be done using Joins, which will be faster ,
is not a simple question,
and is irrelevant unless the system is
experiencing performance problems.
1) Both constructions must be translated by the query optimizer into a query plan which includes some type of correlated join, be it a nested loop join, hash-join, merge join, or whatever. And it's entirely possible, (even likely), that they will both result in the same query plan.
NOTE: This is because the entire SQL Statement is translated into a single query plan. The subqueries do NOT get their own, individual query plans as though they were being executed in isolation.
What query plan and what type of joins are used will depend on the data structure and the data in each specific situation. The only way to tell which is faster is to try both, in controlled environments, and measure the performance... but,
2) Unless the system is experiencing an issue with performance, (unacceptable poor performance). clarity is more important. And for problems like the one described above, (where none of the data attributes in the "other" tables are required in the output of the SQL Statement, a Subquery is much clearer in describing the function and purpose of the SQL that a join with Group Bys would be.
I think that the accepted solutions would be problematic from a performance standpoint, assuming you have many users, photos, and comments. Your query runs two separate select statements for every row in the user table.
What you want to do is synthesize a query using ActiveRecord that looks like this:
SELECT user.*, COUNT(c.id) + COUNT(p.id) AS total_count
FROM users u LEFT JOIN photos p ON u.id = p.user_id
LEFT JOIN comments c ON u.id = c.user_id
GROUP BY user.id
ORDER BY total_count DESC
The join will be much, much more efficient. Using left joins insures that even if a user has no comments or photos they will still be included in the results.
If I were to assume that you had a count of comments and a count of photos (user.number_of_photos, user.number_of_comments; as seen above), it would be simple (not stupid):
Select user_id from user order by number_of_photos DESC, number_of_comments DESC
In Ruby On Rails:
User.find(:all, :order => '((SELECT COUNT(*) FROM photos WHERE user_id=users.id) + (SELECT COUNT(*) FROM classifications WHERE user_id=users.id)) DESC')