Optimizing simple query that takes 1 minute to execute - sql

I have this rather simple SQL query, but it takes almost a minute to execute:
SELECT
i.id,
...,
a.id AS albums_id,
...,
u.id AS users_id,
...
FROM
images i
LEFT JOIN albums a ON i.albums_id = a.id
LEFT JOIN users u ON a.users_id = u.id
WHERE
a.access = 'public'
AND i.num_of_views > 0
ORDER BY
i.num_of_views DESC
LIMIT
0, 60
Result of EXPLAIN for the above query:
Tables involved:
images (~4,822,000 rows), albums (~149,000 rows), users (~43,000 rows)
Relevant indexes:
albums: access(access,num_of_images,album_time), access_2(access,num_of_images,num_of_all_comments,album_time), users_id(users_id,album_time)
images: browser_2(num_of_views), albums_id(albums_id,image_order)
All tables are InnoDB, running on MySQL v5.1.47.
So how do I bring this down to under a second?
Please leave a comment if you need any additional info.
edit: The users table can be joined with either albums or images; it does not matter to me.
edit2: Moving a.access = 'public' from WHERE to the JOIN does indeed solve the speed problem, but the results returned are not correct (I get images from albums that are not public). Putting a.access ... in both WHERE and JOIN slows the query down even more than before.

Add an index on albums.users_id. I also agree with the comments regarding a.access = 'public'. But the index should help either way.
UPDATE
Since the key above already exists, try adjusting the order of your JOINs, i.e. move users above albums or make a different table the primary. In rare cases this can help. Also, to better join albums, try:
LEFT JOIN albums a ON (i.albums_id = a.id AND a.access = 'public')
UPDATE
Based on the comments, I would remove as many of the LEFT JOINs as possible. As I am not sure what you require in your results, I will only show it for albums. This will not only shrink the result set, but also solve the problem of applying the filter.
JOIN albums a ON (i.albums_id = a.id AND a.access = 'public')

I believe there's a little confusion going on here w/r/t the impact that a filter can have on a LEFT JOIN vs an INNER JOIN.
Jan, if what you are trying to ask in your query is "Get all images for all albums that are public, and get the users of those albums as well" then you do not want a left join, you want an inner join. A left join will return all images for all albums, but it will also return all images that have no matching album. You can add "and a.id IS NOT NULL" but that's the same as an INNER JOIN.
I believe what you want is the following:
SELECT
i.id,
...,
a.id AS albums_id,
...,
u.id AS users_id,
...
FROM images i
INNER JOIN albums a ON i.albums_id = a.id AND a.access = 'public'
INNER JOIN users u ON a.users_id = u.id
WHERE i.num_of_views > 0
ORDER BY i.num_of_views DESC
LIMIT 0, 60
If you left join albums to users, you would also return albums that don't have users. Not sure which one you want.
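The difference between filtering in the ON clause of a LEFT JOIN, in the ON clause of an INNER JOIN, and in WHERE can be checked on a toy data set. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL; the three sample images and two sample albums are made up for illustration, and it says nothing about speed, only about which rows come back.

```python
import sqlite3

# Hypothetical toy data: album 1 is public, album 2 is private.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (id INTEGER PRIMARY KEY, access TEXT);
    CREATE TABLE images (id INTEGER PRIMARY KEY, albums_id INTEGER,
                         num_of_views INTEGER);
    INSERT INTO albums VALUES (1, 'public'), (2, 'private');
    INSERT INTO images VALUES (10, 1, 5), (11, 2, 9), (12, 1, 0);
""")

def image_ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# Filter in the ON clause of a LEFT JOIN: image 11 is kept with a NULL album.
left_on = image_ids("""
    SELECT i.id FROM images i
    LEFT JOIN albums a ON i.albums_id = a.id AND a.access = 'public'
    WHERE i.num_of_views > 0
    ORDER BY i.num_of_views DESC
""")

# Same filter in the ON clause of an INNER JOIN: image 11 is dropped.
inner_on = image_ids("""
    SELECT i.id FROM images i
    INNER JOIN albums a ON i.albums_id = a.id AND a.access = 'public'
    WHERE i.num_of_views > 0
    ORDER BY i.num_of_views DESC
""")

# LEFT JOIN with the filter in WHERE: same rows as the INNER JOIN.
left_where = image_ids("""
    SELECT i.id FROM images i
    LEFT JOIN albums a ON i.albums_id = a.id
    WHERE a.access = 'public' AND i.num_of_views > 0
    ORDER BY i.num_of_views DESC
""")

print(left_on, inner_on, left_where)
```

The first query reproduces the symptom from edit2 (an image from a non-public album shows up), while the other two return only the public image.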

Based on your most recent comments, you should use an INNER JOIN to albums instead of a LEFT JOIN.
SELECT
i.id,
...,
a.id AS albums_id,
...,
u.id AS users_id,
...
FROM images i
INNER JOIN albums a ON i.albums_id = a.id
LEFT JOIN users u ON a.users_id = u.id
WHERE a.access = 'public'
AND i.num_of_views > 0
ORDER BY i.num_of_views DESC
LIMIT 0, 60

Related

SQL issue with RIGHT JOINS

The following SQL query does exactly what it should do, but I don't know how to change it so it does what I need it to do.
SELECT
a.id AS comment_id, a.parent_comment_id, a.user_id, a.pid, a.comment, a.blocked_at, a.blocked_by, a.created_at, a.updated_at,
b.name, b.is_moderator, COUNT(IF(d.type = 1, 1, NULL)) AS a_count, COUNT(IF(d.type = 2, 1, NULL)) AS b_count
FROM
comments AS a
RIGHT JOIN
users AS b ON b.id = a.user_id
RIGHT JOIN
node_user AS c ON c.user = b.id
RIGHT JOIN
nodes AS d ON d.id = c.node
WHERE
a.pid = 999
GROUP BY
comment_id
ORDER BY
a.created_at ASC
It gets all comments belonging to a specific pid, it then RIGHT JOINS additional user data like name and is_moderator, then RIGHT JOINS any (so called) nodes including additional data based on the user id and node id. As seen in the SELECT, I count the nodes based on their type.
This works great for users that have any nodes attached to their accounts. But users who don't have any (whose id doesn't exist in the node_user and nodes tables) aren't shown in the query results.
So my question:
How can I make it so that even users who don't have any nodes are still shown in the query results, but with an a_count and b_count of 0 or NULL?
I'm pretty sure you want left joins not right joins. You also want to fix your table aliases so they are meaningful:
SELECT . . .
FROM comments c LEFT JOIN
users u
ON u.id = c.user_id LEFT JOIN
node_user nu
ON nu.user = u.id LEFT JOIN
nodes n
ON n.id = nu.node
WHERE c.pid = 999
GROUP BY c.id
ORDER BY c.created_at ASC;
This keeps everything in the first table, regardless of whether or not rows match in the subsequent tables. That appears to be what you want.
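To see the effect on the counts, here is a small sketch using Python's sqlite3 module rather than MySQL. The sample rows (two users, one of them without nodes) are invented, and COUNT(CASE ...) is used in place of the question's COUNT(IF(...)) because SQLite is the engine here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER, pid INTEGER);
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, type INTEGER);
    CREATE TABLE node_user (user INTEGER, node INTEGER);
    -- Hypothetical data: alice owns three nodes, bob owns none.
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO comments VALUES (100, 1, 999), (101, 2, 999);
    INSERT INTO nodes VALUES (7, 1), (8, 2), (9, 2);
    INSERT INTO node_user VALUES (1, 7), (1, 8), (1, 9);
""")

rows = conn.execute("""
    SELECT c.id, u.name,
           COUNT(CASE WHEN n.type = 1 THEN 1 END) AS a_count,
           COUNT(CASE WHEN n.type = 2 THEN 1 END) AS b_count
    FROM comments c
    LEFT JOIN users u ON u.id = c.user_id
    LEFT JOIN node_user nu ON nu.user = u.id
    LEFT JOIN nodes n ON n.id = nu.node
    WHERE c.pid = 999
    GROUP BY c.id, u.name
    ORDER BY c.id
""").fetchall()

for row in rows:
    print(row)
```

Bob's comment stays in the result with both counts at 0, because COUNT ignores the NULLs produced by the unmatched LEFT JOINs.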

Left join pre-filtering on first table in postgres

I want to run a query to get user photo album ids, names, and picture count in the album. This query works:
SELECT album.id, album.name, count(pictures.*)
FROM album
LEFT JOIN pictures
ON (pictures.album_id=album.id)
WHERE album.owner = ?
GROUP BY album.id;
I have tons of pictures, and lots of albums, but the join is running before filtering for the user I'm interested in.
I have seen other answers that filter inside the join based on the 2nd table's values, but I want to filter on album.owner which is not included in the 2nd table. How can I filter before the join? (and is it efficient? will I break indexes?)
For this query:
SELECT a.id, a.name, count(p.album_id)
FROM album a LEFT JOIN
pictures p
ON p.album_id = a.id
WHERE a.owner = ?
GROUP BY a.id;
You want an index on album(owner, id, name). This should speed your query.
However, it will probably be faster if phrased like this:
SELECT a.id, a.name,
(SELECT count(*)
FROM pictures p
WHERE p.album_id = a.id
)
FROM album a
WHERE a.owner = ?
Here you want the above index and an index on pictures(album_id).
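As a sanity check that the two phrasings return the same rows, here is a minimal sketch run against SQLite through Python's sqlite3 module (not PostgreSQL). The index names and sample rows are assumptions made up for the example, and it does not measure which form is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE album (id INTEGER PRIMARY KEY, name TEXT, owner INTEGER);
    CREATE TABLE pictures (id INTEGER PRIMARY KEY, album_id INTEGER);
    -- The two indexes suggested above (index names are hypothetical).
    CREATE INDEX album_owner_idx ON album (owner, id, name);
    CREATE INDEX pictures_album_idx ON pictures (album_id);
    INSERT INTO album VALUES (1, 'holiday', 42), (2, 'empty', 42), (3, 'other', 7);
    INSERT INTO pictures (album_id) VALUES (1), (1), (3);
""")

# LEFT JOIN + GROUP BY version.
join_rows = conn.execute("""
    SELECT a.id, a.name, COUNT(p.album_id)
    FROM album a
    LEFT JOIN pictures p ON p.album_id = a.id
    WHERE a.owner = ?
    GROUP BY a.id, a.name
    ORDER BY a.id
""", (42,)).fetchall()

# Correlated subquery version: no GROUP BY needed.
subquery_rows = conn.execute("""
    SELECT a.id, a.name,
           (SELECT COUNT(*) FROM pictures p WHERE p.album_id = a.id)
    FROM album a
    WHERE a.owner = ?
    ORDER BY a.id
""", (42,)).fetchall()

print(join_rows)
print(subquery_rows)
```

Both forms report the empty album with a count of 0 and leave out the album that belongs to another owner.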

Counting Related Records : Query Taking Over 2 Minutes To Run

Considering the diagram above I am trying to select bulletins along with related info.
A bulletin can have only one associated user (the creator)
A bulletin can have only one state (the creator's home state)
A bulletin can have only one bulletin type (e.g. Announcement, for sale, etc.)
A bulletin can have 0 or 1 event tied to it
A bulletin can have many likes
A bulletin can have many comments
As far as the states go a region can have many states
Using the query below causes it to run for 2 minutes before I hit the cancel button. I have not tried to run it for more than that.
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet, Bulletins.CreationDateTime
, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId,
Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName,
Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City,
States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name,
COUNT(BulletinLikes.Id) AS Likes, COUNT(BulletinComments.Id) AS Comments
FROM Bulletins
INNER JOIN Users ON Bulletins.UserId = Users.Id
INNER JOIN States ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events] ON Bulletins.EventId = [Events].Id
LEFT JOIN BulletinLikes ON Bulletins.Id = BulletinLikes.BulletinId
LEFT JOIN BulletinComments ON Bulletins.Id = BulletinComments.BulletinId
GROUP BY Bulletins.Id, Bulletins.Body, Bulletins.CreationDateTime
, Bulletins.UserId, Bulletins.StateId, Bulletins.EventId,
Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName,
Users.Zipcode, Users.StateId, Users.City,
States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name
Deleting the line that does the counting of Likes and Comments makes the query return instantaneously. In my tables I have lots of dummy data. Some of these bulletins have hundreds or a couple thousand likes or comments. That still does not seem like enough to make the query run for more than 2 minutes. I am no expert when it comes to T-SQL, so I suspect it boils down to how I'm counting or how I'm grouping.
What would be the proper way to return the counted related records in my specific scenario?
EDIT 1
My ER is completely off on one part. I closed out of the website I was using to create it and lost it. Here are some corrections:
Bulletins is tied to BulletinTypes with a BulletinTypeFK inside of the Bulletins table (the reason being that we use BulletinTypes for a drop-down)
EDIT 2
I just found out you can do some profiling on SQL Azure and came up with these two screenshots of information; however, I'm not 100% sure what to gain from these.
It looks as if the first sort operation is taking up 54.2% of resources. The first index seek looks pretty high too, at 32.2%.
The first thing I'd try is to check the performance of a much simpler query that touches only the tables with the most effect (you mentioned BulletinLikes and BulletinComments are the biggest performance offenders):
SELECT TOP 10 b.id, COUNT(bl.Id) AS likes, COUNT(bc.Id) AS Comments
FROM Bulletins b
LEFT JOIN BulletinLikes bl ON b.Id = bl.BulletinId
LEFT JOIN BulletinComments bc ON b.Id = bc.BulletinId
GROUP BY b.id
If that gives decent performance, I'd make it a subquery or CTE, whichever syntax you prefer, and join the rest to the result of the subquery.
The general idea is to get rid of the huge GROUP BY.
Side note: TOP without ORDER BY is not guaranteed to give consistent results.
Without the counts, those left joins don't even need to be performed, and the query optimizer probably figures that out.
And you don't even use Events with the count, so drop it.
Make sure you have indexes on all those join columns (BulletinId) and that they are not fragmented.
When these two queries run fast, your query will run fast:
select count(distinct(BulletinId)) from BulletinLikes
select count(distinct(BulletinId)) from BulletinComments
(and you may need an index on regionId)
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet
, Bulletins.CreationDateTime
, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId
, Bulletins.BulletinTypeId, Bulletins.[Views]
, Users.UserName, Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City
, States.Name, States.UnitedStatesRegionId
, RegionsOfTheUnitedStates.Name
, COUNT(BulletinLikes.Id) AS Likes
, COUNT(BulletinComments.Id) AS Comments
FROM Bulletins
INNER JOIN Users
ON Bulletins.UserId = Users.Id
INNER JOIN States
ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates
ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes
ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events]
ON Bulletins.EventId = [Events].Id
LEFT JOIN BulletinLikes
ON Bulletins.Id = BulletinLikes.BulletinId
LEFT JOIN BulletinComments
ON Bulletins.Id = BulletinComments.BulletinId
GROUP BY Bulletins.Id, Bulletins.Body, Bulletins.CreationDateTime
, Bulletins.UserId, Bulletins.StateId, Bulletins.EventId
, Bulletins.BulletinTypeId, Bulletins.[Views]
, Users.UserName, Users.Zipcode, Users.StateId, Users.City
, States.Name, States.UnitedStatesRegionId
, RegionsOfTheUnitedStates.Name
There is nothing wrong with the form of your query (although you may want to consider if you need to select so many columns, but that is beside the point).
You may want to focus on the indexes that exist on all of the columns in your join conditions. Most of the time, we join on columns that are in a foreign key relationship to a primary key, and thus there is likely a (default) clustered index on that column, but you'll want to check to be sure: each of these columns should be the first column in some index on each of the tables in question (at least for the tables with more than a trivial number of rows).
I would try pulling the COUNT fields out into sub-queries, and avoid the whole GROUP BY statement:
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet, Bulletins.CreationDateTime, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId, Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName, Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City, States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name,
(SELECT COUNT(*) FROM BulletinLikes bl WHERE bl.BulletinId = Bulletins.Id) AS Likes,
(SELECT COUNT(*) FROM BulletinComments bc WHERE bc.BulletinId = Bulletins.Id) AS Comments
FROM Bulletins
INNER JOIN Users ON Bulletins.UserId = Users.Id
INNER JOIN States ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events] ON Bulletins.EventId = [Events].Id
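One more reason to count in subqueries, beyond the GROUP BY cost: when likes and comments are both LEFT JOINed to the same bulletin, every like is paired with every comment, so the joined row count (and both COUNTs) is the product of the two. The sketch below illustrates this with Python's sqlite3 module instead of SQL Server; the single bulletin with three likes and two comments is made-up data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Bulletins (Id INTEGER PRIMARY KEY);
    CREATE TABLE BulletinLikes (Id INTEGER PRIMARY KEY, BulletinId INTEGER);
    CREATE TABLE BulletinComments (Id INTEGER PRIMARY KEY, BulletinId INTEGER);
    -- Hypothetical data: one bulletin with 3 likes and 2 comments.
    INSERT INTO Bulletins VALUES (1);
    INSERT INTO BulletinLikes (BulletinId) VALUES (1), (1), (1);
    INSERT INTO BulletinComments (BulletinId) VALUES (1), (1);
""")

# Joining both child tables yields 3 x 2 = 6 rows for the bulletin,
# so both counts come out as 6.
joined = conn.execute("""
    SELECT b.Id, COUNT(bl.Id) AS Likes, COUNT(bc.Id) AS Comments
    FROM Bulletins b
    LEFT JOIN BulletinLikes bl ON bl.BulletinId = b.Id
    LEFT JOIN BulletinComments bc ON bc.BulletinId = b.Id
    GROUP BY b.Id
""").fetchall()

# Correlated subqueries count each child table independently.
subqueries = conn.execute("""
    SELECT b.Id,
           (SELECT COUNT(*) FROM BulletinLikes bl WHERE bl.BulletinId = b.Id),
           (SELECT COUNT(*) FROM BulletinComments bc WHERE bc.BulletinId = b.Id)
    FROM Bulletins b
""").fetchall()

print(joined)
print(subqueries)
```

With hundreds or thousands of likes and comments per bulletin, that multiplication is a plausible contributor to the expensive sort seen in the profile, although the sketch does not measure it.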

Order by join column but use distinct on another

I'm building a system in which there are the following tables:
Song
Broadcast
Station
Follow
User
A user follows stations, which have songs on them through broadcasts.
I'm building a "feed" of songs for a user based on the stations they follow.
Here's the query:
SELECT DISTINCT ON ("broadcasts"."created_at", "songs"."id") songs.*
FROM "songs"
INNER JOIN "broadcasts" ON "songs"."shared_id" = "broadcasts"."song_id"
INNER JOIN "stations" ON "broadcasts"."station_id" = "stations"."id"
INNER JOIN "follows" ON "stations"."id" = "follows"."station_id"
WHERE "follows"."user_id" = 2
ORDER BY broadcasts.created_at desc
LIMIT 18
Note: shared_id is the same as id.
As you can see I'm getting duplicate results, which I don't want. I found out from a previous question that this was due to selecting distinct on broadcasts.created_at.
My question is: How do I modify this query so it will return only unique songs based on their id but still order by broadcasts.created_at?
Try this solution:
SELECT a.maxcreated, b.*
FROM
(
SELECT bb.song_id, MAX(bb.created_at) AS maxcreated
FROM follows aa
INNER JOIN broadcasts bb ON aa.station_id = bb.station_id
WHERE aa.user_id = 2
GROUP BY bb.song_id
) a
INNER JOIN songs b ON a.song_id = b.id
ORDER BY a.maxcreated DESC
LIMIT 18
The FROM subselect retrieves distinct song_ids that are broadcasted by all stations the user follows; it also gets the latest broadcast date associated with each song. We have to encase this in a subquery because we have to GROUP BY on the columns we're selecting from, and we only want the unique song_id and the maxdate regardless of the station.
We then join that result in the outer query to the songs table to get the song information associated with each unique song_id
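The subquery approach above can be tried on a few rows. This is a sketch under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than PostgreSQL, the table layouts are reduced to the columns the query touches, and the sample songs, stations and dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE broadcasts (song_id INTEGER, station_id INTEGER, created_at TEXT);
    CREATE TABLE follows (user_id INTEGER, station_id INTEGER);
    -- Hypothetical data: user 2 follows stations 10 and 11;
    -- song 1 was broadcast on both, song 2 on station 10 only.
    INSERT INTO songs VALUES (1, 'first'), (2, 'second');
    INSERT INTO follows VALUES (2, 10), (2, 11);
    INSERT INTO broadcasts VALUES
        (1, 10, '2012-01-01'),
        (1, 11, '2012-01-03'),
        (2, 10, '2012-01-02');
""")

feed = conn.execute("""
    SELECT a.maxcreated, b.id, b.title
    FROM (
        SELECT bb.song_id, MAX(bb.created_at) AS maxcreated
        FROM follows aa
        INNER JOIN broadcasts bb ON aa.station_id = bb.station_id
        WHERE aa.user_id = 2
        GROUP BY bb.song_id
    ) a
    INNER JOIN songs b ON a.song_id = b.id
    ORDER BY a.maxcreated DESC
    LIMIT 18
""").fetchall()

for row in feed:
    print(row)
```

Song 1 appears once, ordered by its most recent broadcast, even though two followed stations played it.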
You can use Common Table Expressions (CTE) if you want a cleaner query (nested queries make things harder to read)
It would look like this:
WITH a as (
SELECT bb.song_id, MAX(bb.created_at) AS maxcreated
FROM follows aa
INNER JOIN broadcasts bb ON aa.station_id = bb.station_id
INNER JOIN songs cc ON bb.song_id = cc.shared_id
WHERE aa.user_id = 2
GROUP BY bb.song_id
)
SELECT
a.maxcreated,
b.*
FROM a INNER JOIN
songs b ON a.song_id = b.id
ORDER BY
a.maxcreated DESC
LIMIT 18
Using a CTE offers the advantages of improved readability and ease in maintenance of complex queries. The query can be divided into separate, simple, logical building blocks. These simple blocks can then be used to build more complex, interim CTEs until the final result set is generated.
Try adding GROUP BY songs.id.
I had a very similar query I was doing between listens, tracks and albums and it took me a long while to figure it out (hours).
If you use GROUP BY songs.id, you can get it to work by ordering by MAX(broadcasts.created_at) DESC.
Here's what the full SQL looks like:
SELECT songs.* FROM "songs"
INNER JOIN "broadcasts" ON "songs"."shared_id" = "broadcasts"."song_id"
INNER JOIN "stations" ON "broadcasts"."station_id" = "stations"."id"
INNER JOIN "follows" ON "stations"."id" = "follows"."station_id"
WHERE "follows"."user_id" = 2
GROUP BY songs.id
ORDER BY MAX(broadcasts.created_at) desc
LIMIT 18;

Help with SQL Join on two tables

I have two tables, one is a table of forum threads. It has a last post date column.
Another table has PostID, UserId, and DateViewed.
I want to join these tables so I can compare DateViewed and LastPostDate for the current user. However, if they have never viewed the thread, there will not be a row in the 2nd table.
This seems easy but I can't wrap my head around it. Advice please.
Thanks in advance.
What is it that you're trying to do specifically - determine if there are unread posts?
You just need to use an outer join:
SELECT p.PostID, p.LastPostDate, ...,
CASE
WHEN v.DateViewed IS NULL OR v.DateViewed < p.LastPostDate THEN 1
ELSE 0
END AS Unread
FROM Posts p
LEFT JOIN PostViews v
ON v.PostID = p.PostID
AND v.UserID = @UserID
Note that I've placed the UserID test in the JOIN condition; if you put it in the WHERE predicate instead, posts the user has never viewed will drop out of the results, because they have no matching row in the PostViews table (v.UserID is NULL for them).
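The effect of where the UserID test goes can be checked on a small example. This sketch uses Python's sqlite3 module instead of SQL Server, binds the user with a ? placeholder, and relies on invented sample rows: user 7 has viewed posts 1 and 2 but never post 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (PostID INTEGER PRIMARY KEY, LastPostDate TEXT);
    CREATE TABLE PostViews (PostID INTEGER, UserID INTEGER, DateViewed TEXT);
    INSERT INTO Posts VALUES
        (1, '2010-05-02'), (2, '2010-05-02'), (3, '2010-05-02');
    -- Hypothetical views: post 3 was only ever viewed by user 8.
    INSERT INTO PostViews VALUES
        (1, 7, '2010-05-03'), (2, 7, '2010-05-01'), (3, 8, '2010-05-03');
""")

unread_case = """
    CASE WHEN v.DateViewed IS NULL OR v.DateViewed < p.LastPostDate
         THEN 1 ELSE 0 END
"""

# UserID test in the JOIN condition: every post is returned.
in_join = conn.execute(f"""
    SELECT p.PostID, {unread_case} AS Unread
    FROM Posts p
    LEFT JOIN PostViews v ON v.PostID = p.PostID AND v.UserID = ?
    ORDER BY p.PostID
""", (7,)).fetchall()

# UserID test in WHERE: the never-viewed post 3 disappears.
in_where = conn.execute(f"""
    SELECT p.PostID, {unread_case} AS Unread
    FROM Posts p
    LEFT JOIN PostViews v ON v.PostID = p.PostID
    WHERE v.UserID = ?
    ORDER BY p.PostID
""", (7,)).fetchall()

print(in_join)
print(in_where)
```

Only the first form flags post 3 as unread for user 7; the second silently omits it.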
So you're thinking something like:
SELECT t.UserID, t.PostID, t.LastPostDate, v.DateViewed
FROM dbo.Threads t
LEFT JOIN dbo.Views v ON v.PostID = t.PostID
AND v.UserID = t.UserID
WHERE t.UserID = @user;
v.DateViewed will be NULL if there's no corresponding row in Views.
If you have lots of rows in Views, you may prefer to do something like:
SELECT t.UserID, t.PostID, t.LastPostDate, v.DateViewed
FROM dbo.Threads t
CROSS APPLY (SELECT MAX(vw.DateViewed) as DateViewed
FROM dbo.Views vw
WHERE vw.PostID = t.PostID
AND vw.UserID = t.UserID
) v
WHERE t.UserID = @user;
The key is to use a LEFT JOIN, which will cause non-existent rows on the right side to come up as all NULL:
SELECT threads.lastpostdate, posts.dateviewed
FROM threads
LEFT JOIN posts
ON threads.id=posts.postid