Each user HAS MANY photos and HAS MANY comments. I would like to order users by SUM(number_of_photos, number_of_comments)
Can you suggest me the SQL query?
GROUP BY with JOINs works more efficiently than dependent subqueries (in all relational DBs I know):
Select * From Users
Left Join Photos On (Photos.user_id = Users.id)
Left Join Comments On (Comments.user_id = Users.id)
Group By UserId
Order By (Count(Photos.id) + Count(Comments.id))
with some assumptions on the tables (e.g. an id primary key in each of them).
Select * From Users U
Order By (Select Count(*) From Photos
Where userId = U.UserId) +
(Select Count(*) From Comments
Where userId = U.UserId)
EDIT: although every query using subqueries can also be done using Joins, which will be faster ,
is not a simple question,
and is irrelevant unless the system is
experiencing performance problems.
1) Both constructions must be translated by the query optimizer into a query plan which includes some type of correlated join, be it a nested loop join, hash-join, merge join, or whatever. And it's entirely possible, (even likely), that they will both result in the same query plan.
NOTE: This is because the entire SQL Statement is translated into a single query plan. The subqueries do NOT get their own, individual query plans as though they were being executed in isolation.
What query plan and what type of joins are used will depend on the data structure and the data in each specific situation. The only way to tell which is faster is to try both, in controlled environments, and measure the performance... but,
2) Unless the system is experiencing an issue with performance, (unacceptable poor performance). clarity is more important. And for problems like the one described above, (where none of the data attributes in the "other" tables are required in the output of the SQL Statement, a Subquery is much clearer in describing the function and purpose of the SQL that a join with Group Bys would be.
I think that the accepted solutions would be problematic from a performance standpoint, assuming you have many users, photos, and comments. Your query runs two separate select statements for every row in the user table.
What you want to do is synthesize a query using ActiveRecord that looks like this:
SELECT user.*, COUNT(c.id) + COUNT(p.id) AS total_count
FROM users u LEFT JOIN photos p ON u.id = p.user_id
LEFT JOIN comments c ON u.id = c.user_id
GROUP BY user.id
ORDER BY total_count DESC
The join will be much, much more efficient. Using left joins insures that even if a user has no comments or photos they will still be included in the results.
If I were to assume that you had a count of comments and a count of photos (user.number_of_photos, user.number_of_comments; as seen above), it would be simple (not stupid):
Select user_id from user order by number_of_photos DESC, number_of_comments DESC
In Ruby On Rails:
User.find(:all, :order => '((SELECT COUNT(*) FROM photos WHERE user_id=users.id) + (SELECT COUNT(*) FROM classifications WHERE user_id=users.id)) DESC')
Related
I have 3 tables: user, recommendation (post_id, user_id), post
When a user votes on a post, a new recommendation record gets created with the post_id, user_id, and vote value.
I want to have a query that shows a random post that a user hasn't seen/voted on yet.
My thought on this is that it needs to join all recommendations of a user to the post table... and then select the records that don't have a joined recommendation. Not sure how to do this though...
What I have so far that definitely doesn't work:
SELECT "posts".*
FROM "posts"
INNER JOIN "recommendations" ON "recommendations"."post_id" = "posts"."id"
ORDER BY RANDOM()
LIMIT 1
You can do this with a left outer join:
SELECT p.*
FROM posts p LEFT OUTER JOIN
recommendations r
ON r.post_id = p.id and r.userid = YOURUSERID
WHERE r.post_id IS NULL
ORDER BY RANDOM()
LIMIT 1;
Note that I simplified the query by removing the double quotes (not needed for your identifier names) and adding table aliases. These changes make the query easier to write and to read.
There are several good ways to exclude rows that already have a recommendation from a given user:
Select rows which are not present in other table
The important question is: arbitrary or random?
For an arbitrary pick (any qualifying row is good enough), this should be cheapest:
SELECT *
FROM posts p
WHERE NOT EXISTS (
SELECT 1
FROM recommendations
WHERE post_id = p.id
AND user_id = $my_user_id
)
LIMIT 1;
The sort step might be expensive (and unnecessary) for lots of posts. In such a use case most of the posts will typically have no recommendation from the user at hand, yet. You'd have to order all those rows by random() every time.
If any post without recommendation is good enough, dropping ORDER BY will make it considerably faster. Postgres can just return the first qualifying post it finds.
so you need a set of all posts EXCEPT posts already recommended.
SELECT p.id FROM posts p
EXCEPT
SELECT r.post_id FROM recommendations r WHERE r.user_id = X
...
I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added a likes and dislikes column to the COMMENTS table and created a trigger to automatically increment / decrement these columns as likes get inserted / deleted / updated to the LIKER table then the SELECT statement would be more simple and more efficient than it is now. I am asking, is it more efficient to have this complex query with the COUNTS or to have the extra columns and triggers?
And to generalise, is it more efficient to COUNT or to have an extra column for counting when being queried on a regular basis?
Your query is very inefficient. You can easily eliminate those sub queries, which will dramatically increase performance:
Your two sub queries can be replaced by simply:
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
sum(liker) likes,
sum(abs(liker - 1)) dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers ON comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have query in query? Is it "inner query"? "Sub-query"? Does it counts as three queries (my example)? If its bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially if the columns compared are
indexed or not
nullable or not
..., but generally if your query is not using columns from the table joined to -- you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id )
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong in using them, and can make your life easier. However, performance can often be improved by rewriting queries w/ subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.
I am using SQLite and will port to MySQL (5) later.
I wanted to know if I am doing something I shouldn't be doing. I tried purposely to design so I'll compare to 0 instead of 1 (I changed hasApproved to NotApproved to do this, not a big deal and I haven't written any code). I was told I never need to write a subquery but I do here. My Votes table is just id, ip, postid (I don't think I can write that subquery as a join instead?) and that's pretty much all that is on my mind.
Naming conventions I don't really care about since the tables are created via reflection and is all over the place.
select
id,
name,
body,
upvotes,
downvotes,
(select 1 from UpVotes where IPAddr=? AND post=Post.id) as myup,
(select 1 from DownVotes where IPAddr=#0 AND post=Post.id) as mydown
from Post
where
flag = '0'
limit ?, ?"
Since you're asking about good practices... the "upvotes" and "downvotes" appearing in your Posts table looks like you're duplicating data in your database. That's a problem, because now you always have to worry whether or not the data is in sync and correct. If you want to know the number of upvotes then count them, don't also store them in the Post table. I'm not positive that is what you're doing, but it's a guess.
Onto your query... You will probably get better performance using a JOINed subquery instead of how you have it. With the scalar subqueries as columns they have to be run once for every row that is returned. That could be a pretty big performance hit if you're returning a bunch of rows. Instead, try:
SELECT
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes,
COALESCE(UV.cnt, 0) AS upvotes2,
COALESCE(DV.cnt, 0) AS downvotes2
FROM
dbo.Posts P
LEFT OUTER JOIN (SELECT post_id, COUNT(*) cnt FROM dbo.UpVotes GROUP BY post_id) AS UV ON UV.post_id = P.id
LEFT OUTER JOIN (SELECT post_id, COUNT(*) cnt FROM dbo.DownVotes GROUP BY post_id) AS DV ON DV.post_id = P.id
Compare it to your own query and see if it gives you better performance.
EDIT: A couple of other posters have advocated a single table for up/down votes. They are absolutely correct. That makes the query even easier and also probably much faster:
SELECT
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes,
SUM(CASE WHEN V.vote_type = 'UP' THEN 1 ELSE 0 END) AS upvotes2,
SUM(CASE WHEN V.vote_type = 'DOWN' THEN 1 ELSE 0 END) AS downvotes2,
FROM
dbo.Posts P
LEFT OUTER JOIN Votes V ON
V.post_id = P.id
GROUP BY
P.id,
P.name,
P.body,
P.upvotes,
P.downvotes
I'm guessing that you're trying to ensure that a user only votes once on each post here.
I wouldn't - I don't - use separate tables for up votes and down votes. Add vote type to your votes table and you won't need correlated subqueries.
Here is my opinions:
It seems that table "UpVotes" and "DownVotes" have same structure and can be merged into one table.
The relation between table "Post" and "Up/DownVotes" can be constrained by foreign key.
Although I am not sure about the performance difference, but I think it would be better to use "join" mechanism rather than nesting two select statement in a select statement.
You can use joins to achieve the same thing, and I would expect joins to work a lot more efficiently than embeded selects.
I have a MySQL database that looks like this:
users
( id , name )
groups
( id , name )
group_users
( id , group_id , user_id )
Now to look up all the groups that a user belongs to, I would do something like this:
select * from 'group_users' where 'user_id' = 47;
This will probably return something like:
( 1 , 3 , 47 ),
( 2 , 4 , 47 ),
But when I want to display this to the user, I'm going to want to display the name of the groups that they belong to instead of the group_id. In a loop, I could fetch each group with the group_id that was returned, but that seems like the wrong way to do it. Is this a case where a join statement should be used? Which type of join should I use and how would I use it?
In general, you want to reduce the total number of queries to the database, by making each query do more. There are many reasons why this is a good thing, but the main one is that relational database management systems are specifically designed to be able to join tables quickly and are better at it than the equivalent code in some other language. Another reason is that it's usually more expensive to open many little queries than it is to run one large query that has everything you'll end up needing.
You want to take advantage of your RDBMS's strengths, so you should try to push data access into it in a few big queries rather than lots of little queries.
Now, that's just a general rule of thumb. There are cases when it's better to do some things outside of the database. It's important that you determine which is the right case for your situation by looking into bottlenecks if and only if they occur. Don't spend time worrying about performance until you find a performance problem.
But, in general, it's better to handle joins, lookups and all other query-related tasks in the database itself than it is to try to handle it in a general-purpose language.
That said, the kind of join you want is an inner join. You'd structure your join query like this:
SELECT groups.name, group_users.user_id
FROM group_users
INNER JOIN groups
ON group_users.group_id = groups.group_id
WHERE groups.user_id = 47;
I always find these charts to be very useful when doing joins:
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
SELECT gu.id, gu.group_id, g.name, gu.user_id
FROM group_users gu
INNER JOIN Group g
ON gu.group_id = g.group_id
WHERE user_id = 47
That's a typical case for an inner join, such as:
select users.name, group.name from groups
inner join group_users on groups.id = group_users.group_id
inner join users on group_users.user_id = users.id
where user_id = 47