Having the classic "posts table, and comments table with foreign key to posts table" scenario, what's the most efficient way to get the IDs of the last 20 posts ordered by the time of their last comment, and the actual comment itself?
Here is a query that works but can probably be done much more efficiently:
SELECT * FROM (
SELECT * FROM comments ORDER BY time DESC
) AS foo GROUP BY post_id ORDER BY time DESC LIMIT 20
A nested query with an ORDER BY is necessary to make sure that the latest comment gets selected into the post_id group.
As mentioned in the comments: practically the same question as Retrieveing the most recent records within a query.
See the greatest-n-per-group tag for more similar questions.
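One common greatest-n-per-group formulation, sketched here against an in-memory SQLite database rather than the asker's actual schema, joins each comment to the per-post maximum time. The sample rows and the body column are assumptions made up for illustration; only post_id and time come from the question.

```python
import sqlite3

# Hypothetical sample data; only post_id and time appear in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, time TEXT, body TEXT);
    INSERT INTO comments (post_id, time, body) VALUES
        (1, '2024-01-01 10:00', 'first on post 1'),
        (1, '2024-01-03 09:00', 'latest on post 1'),
        (2, '2024-01-02 12:00', 'only one on post 2'),
        (3, '2024-01-01 08:00', 'first on post 3'),
        (3, '2024-01-04 18:00', 'latest on post 3');
""")

# Find each post's latest comment time, then join back to fetch that comment.
latest = conn.execute("""
    SELECT c.post_id, c.time, c.body
    FROM comments AS c
    JOIN (
        SELECT post_id, MAX(time) AS last_time
        FROM comments
        GROUP BY post_id
    ) AS m ON m.post_id = c.post_id AND m.last_time = c.time
    ORDER BY c.time DESC
    LIMIT 20
""").fetchall()

for row in latest:
    print(row)
```

Unlike the nested ORDER BY / GROUP BY query, this does not rely on which row the database happens to pick for a group. If two comments on the same post share the maximum time, both are returned.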
Related
Okay, so I have a list of posts, and some posts are replies to other posts. I'd like to get a list of parent posts ordered by their most recent reply, newest first.
I've tried GROUP BY, but it always returns the wrong order. DISTINCT is the only way I've managed to get it to work, but then it only lists the post ID and not the rest of the data.
example of database here
The order I want to pull the posts out in is 1, 3, 5, 4, 2. These are the non-reply posts in the order of their latest reply.
SELECT DISTINCT `thread`
FROM
(
SELECT COALESCE(NULLIF(`parent_post`, 0), `postID`) AS `thread`
FROM `posts`
ORDER BY `postID` DESC
LIMIT 100
) `sub`
This pulls them out in the correct order but only returns the postID and not the rest of the fields. I've tried GROUP BY, but it loses the correct order.
A straightforward translation of your requirements to SQL would be:
select *
from posts p1
where parent_post = 0
order by (
select max(`datetime`)
from posts p2
where p2.parent_post = p1.postID
) desc
I.e. select all rows from posts that are thread starters (not replies) and order them by the latest timestamp from any of their replies in descending order.
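As a rough check of that query, here is a sketch that runs it in an in-memory SQLite database. The sample rows are invented so that the expected order is the 1, 3, 5, 4, 2 from the question, and the "datetime" column name is taken from the answer, not from the asker's (unseen) schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE posts (postID INTEGER PRIMARY KEY, parent_post INTEGER, "datetime" TEXT)'
)

# Hypothetical data: posts 1-5 start threads, posts 6-10 are replies.
rows = [
    (1, 0, '2024-01-01'), (2, 0, '2024-01-02'), (3, 0, '2024-01-03'),
    (4, 0, '2024-01-04'), (5, 0, '2024-01-05'),
    (6, 2, '2024-01-06'), (7, 4, '2024-01-07'), (8, 5, '2024-01-08'),
    (9, 3, '2024-01-09'), (10, 1, '2024-01-10'),
]
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)

# Thread starters ordered by the timestamp of their newest reply.
thread_ids = [row[0] for row in conn.execute("""
    SELECT p1.postID
    FROM posts AS p1
    WHERE p1.parent_post = 0
    ORDER BY (
        SELECT MAX(p2."datetime")
        FROM posts AS p2
        WHERE p2.parent_post = p1.postID
    ) DESC
""")]

print(thread_ids)  # [1, 3, 5, 4, 2]
```

A thread with no replies gets NULL from the subquery; where it sorts relative to the others depends on the database's NULL ordering.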
I have two tables, one containing posts and the other containing comments. There are millions of posts and hundreds of millions of comments; the comments table also contains the post IDs. Comments are deactivated after some time, and I want to know which posts had the most comments in the last 30 days before being deactivated.
What I have to do is find max(comment_date) for each post in the comments table and count all the comments going back 1 month from that date for each post.
So essentially I want to group by post_id, find max(comment_date), and get the count of all comments since max(comment_date) - 1 month for each post. I am struggling to write the query that gets this data.
The database is PostgreSQL 9.4.1.
On that amount of data, the query will take time. One method is to use window functions:
select post_id, count(*)
from (select c.*, max(comment_date) over (partition by post_id) as maxcd
from comments c
) c
where comment_date >= maxcd - interval '1 month'
group by post_id;
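The same window-function idea can be tried on a small scale in SQLite (3.25 or later). This is only a sketch: the sample dates are made up, dates are stored as ISO text, and SQLite's date(x, '-1 month') stands in for PostgreSQL's interval arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (post_id INTEGER, comment_date TEXT)")

# Hypothetical comments: post 1 was last active in March 2015, post 2 in June 2014.
conn.executemany("INSERT INTO comments VALUES (?, ?)", [
    (1, '2015-01-01'), (1, '2015-03-01'), (1, '2015-03-15'),
    (2, '2014-01-01'), (2, '2014-06-10'), (2, '2014-06-20'), (2, '2014-06-25'),
])

# Attach each post's latest comment date to every row, then keep only the
# comments from the month leading up to that date and count them per post.
counts = conn.execute("""
    SELECT post_id, COUNT(*)
    FROM (
        SELECT c.*, MAX(comment_date) OVER (PARTITION BY post_id) AS maxcd
        FROM comments AS c
    ) AS c
    WHERE comment_date >= date(maxcd, '-1 month')
    GROUP BY post_id
    ORDER BY post_id
""").fetchall()

print(counts)  # [(1, 2), (2, 3)]
```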
I would try a LATERAL join here. If you create an index on the comments table on (post_id, comment_date DESC), it may be more efficient than the variant suggested by Gordon Linoff; it really depends on your data distribution. It may not be necessary to specify DESC in the index: the optimizer may be smart enough to use an index on (post_id, comment_date ASC). The order of the columns, however, is important.
Here is SQL Fiddle.
The query literally follows the steps you outlined in the question.
We scan the posts table once. For each post we find the comment with the latest comment_date and subtract 30 days from it; this should take a single index seek. Then we count all comments of this post on or after that cutoff date; this should be a range scan of the index.
SELECT
posts.id
,c_count.post_count
FROM
posts
INNER JOIN LATERAL
(
SELECT comments.comment_date - interval '30 days' AS max_date
FROM comments
WHERE comments.post_id = posts.id
ORDER BY comments.comment_date DESC
LIMIT 1
) AS c_max_date ON true
INNER JOIN LATERAL
(
SELECT COUNT(*) AS post_count
FROM comments
WHERE
comments.post_id = posts.id
AND comments.comment_date >= c_max_date.max_date
) AS c_count ON true
;
It may be possible to do the two steps in one go (find max date, then count rows within 30 days interval) using window functions.
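SQLite has no LATERAL, so the following sketch mimics the same per-post steps (find the latest comment date, then count the comments within 30 days before it) with correlated subqueries instead. It is a stand-in for the technique, not the PostgreSQL query itself; the index name and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE comments (post_id INTEGER, comment_date TEXT);
    -- Column order matters: post_id first, then comment_date.
    CREATE INDEX comments_post_date ON comments (post_id, comment_date DESC);
    INSERT INTO posts (id) VALUES (1), (2), (3);
    INSERT INTO comments VALUES
        (1, '2015-01-01'), (1, '2015-03-01'), (1, '2015-03-15'),
        (2, '2014-06-25');
""")

# For every post that has comments: compute the cutoff (latest date minus
# 30 days) and count the comments on or after it.
per_post = conn.execute("""
    SELECT p.id,
           (SELECT COUNT(*)
            FROM comments AS c
            WHERE c.post_id = p.id
              AND c.comment_date >= date(
                    (SELECT MAX(c2.comment_date)
                     FROM comments AS c2
                     WHERE c2.post_id = p.id),
                    '-30 days')) AS post_count
    FROM posts AS p
    WHERE EXISTS (SELECT 1 FROM comments AS c3 WHERE c3.post_id = p.id)
    ORDER BY p.id
""").fetchall()

print(per_post)  # [(1, 2), (2, 1)]
```

The EXISTS filter reproduces the effect of the INNER JOIN LATERAL: post 3, which has no comments, is left out of the result.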
I have an SQLite table blog_posts. Every blog post has an id and blog_id.
If I want to know how many blog posts every blog has:
SELECT blog_id, count(1) posts FROM blog_posts group by blog_id
What do I do if I want to know how many posts the blog with the most posts has? (I don't need the blog_id.) Apparently this is illegal:
SELECT max(count(1)) posts FROM blog_posts group by blog_id
I'm pretty sure I'm missing something, but I don't see it...
Other solution:
select count(*) as Result from blog_posts
group by blog_id
order by Result desc
limit 1
I'm not sure which solution would run faster, this one or the one with the subquery.
You can use a subquery. Here's how you do it:
1. Get the number of posts for each blog.
2. Select the maximum number of posts.
Example:
select max(num_posts) as max_posts
from (
select blog_id, count(*) as num_posts
from blog_posts
group by blog_id
) a
(The subquery is in the (...)).
NB: I'm not a SQLite power user and so I don't know if this works, but the SQLite docs indicate that subqueries are supported.
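A quick way to check both suggestions is to run them through Python's built-in sqlite3 module. This is a small sketch with made-up rows; only the blog_posts table and its id and blog_id columns come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_posts (id INTEGER PRIMARY KEY, blog_id INTEGER)")

# Hypothetical data: blog 1 has 2 posts, blog 2 has 3, blog 3 has 1.
conn.executemany(
    "INSERT INTO blog_posts (blog_id) VALUES (?)",
    [(1,), (1,), (2,), (2,), (2,), (3,)],
)

# Variant 1: aggregate in a subquery, then take the maximum.
max_via_subquery = conn.execute("""
    SELECT MAX(num_posts)
    FROM (
        SELECT blog_id, COUNT(*) AS num_posts
        FROM blog_posts
        GROUP BY blog_id
    ) AS a
""").fetchone()[0]

# Variant 2: sort the per-blog counts and keep the first row.
max_via_order_by = conn.execute("""
    SELECT COUNT(*) AS num_posts
    FROM blog_posts
    GROUP BY blog_id
    ORDER BY num_posts DESC
    LIMIT 1
""").fetchone()[0]

print(max_via_subquery, max_via_order_by)  # 3 3
```

Both variants run in SQLite and agree on this data; which one is faster on a large table is not something this sketch measures.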
If I GROUP BY on a unique key, and apply a LIMIT clause to the query, will all the groups be calculated before the limit is applied?
If I have a hundred records in the table (each with a unique key), will I have 100 records in the temporary table created for the GROUP BY before the LIMIT is applied?
A case study why I need this:
Take Stack Overflow for example.
Each query you run to show a list of questions also shows the user who asked the question and the number of badges he has.
So, while user<->question is one-to-one, user<->badges is one-to-many.
The only way to do it in one query (rather than one query on questions and another on users, then combining the results) is to group the query by the primary key (question_id) and join + GROUP_CONCAT against the user_badges table.
The same goes for the question's tags.
Code example:
Table questions:
question_id (int)(pk) | question_body (varchar)
Table tag_question:
question_id (int) | tag_id (int)
SELECT:
SELECT questions.question_id,
questions.question_body,
GROUP_CONCAT(tag_id, ' ') AS 'tags_ids'
FROM
questions
JOIN
tag_question
ON
questions.question_id = tag_question.question_id
GROUP BY
questions.question_id
LIMIT 15
Yes. The logical order in which the query is evaluated is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
LIMIT
LIMIT is the last thing calculated, so your grouping will be just fine.
Now, looking at your rephrased question: you don't have just one row per group, but many. In the case of Stack Overflow, you'll have just one user per row, but many badges, i.e.
(uid, badge_id, etc.)
(1, 2, ...)
(1, 3, ...)
(1, 12, ...)
all those would be grouped together.
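To illustrate that LIMIT counts groups rather than input rows, here is a minimal sketch in SQLite. The user_badges table name comes from the question; its columns follow the (uid, badge_id) rows above, and the extra rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_badges (uid INTEGER, badge_id INTEGER)")

# Six input rows spread over three users.
conn.executemany(
    "INSERT INTO user_badges VALUES (?, ?)",
    [(1, 2), (1, 3), (1, 12), (2, 5), (3, 7), (3, 8)],
)

# LIMIT 2 keeps two *groups*; each group still aggregates all of its rows.
grouped = conn.execute("""
    SELECT uid, COUNT(*) AS badges
    FROM user_badges
    GROUP BY uid
    ORDER BY uid
    LIMIT 2
""").fetchall()

print(grouped)  # [(1, 3), (2, 1)]
```

User 1 still reports all three badges even though the limit is 2, because the limit is applied to the grouped result.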
To avoid a full table scan, all you need are indexes. That said, if you need to aggregate over every row (a SUM, for example), you cannot avoid a full scan.
EDIT:
You'll need something like this (look at the WHERE clause):
SELECT
q1.question_id,
q1.question_body,
GROUP_CONCAT(tq.tag_id, ' ') AS 'tags_ids'
FROM
questions q1
JOIN tag_question tq
ON q1.question_id = tq.question_id
WHERE
q1.question_id IN (
SELECT
tq2.question_id
FROM
tag_question tq2
JOIN tag t
ON tq2.tag_id = t.tag_id
WHERE
t.name = 'the-misterious-tag'
)
GROUP BY
q1.question_id
LIMIT 15
LIMIT does get applied after GROUP BY.
Whether the temporary table is created depends on how your indexes are built.
If you have an index on the grouping field and don't order by the aggregate results, then an INDEX SCAN FOR GROUP BY is applied, and each aggregate is counted on the fly.
That means that if you don't select an aggregate due to the LIMIT, it won't ever be calculated.
But if you order by an aggregate, then, of course, all of them need to be calculated before they can be sorted.
That's why they are calculated first and then the filesort is applied.
Update:
As for your query, see what EXPLAIN EXTENDED says for it.
Most probably, question_id is a PRIMARY KEY for your table, and most probably, it will be used in a scan.
That means no filesort will be applied, and the join will not be evaluated past the 15th row.
To make sure, rewrite your query as follows:
SELECT question_id,
question_body,
(
SELECT GROUP_CONCAT(tag_id, ' ')
FROM tag_question t
WHERE t.question_id = q.question_id
)
FROM questions q
ORDER BY
question_id
LIMIT 15
First, it is more readable,
Second, it is more efficient, and
Third, it will return even untagged questions (which your current query doesn't).
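The third point can be seen in a small sketch run against SQLite, with invented rows. One assumption to flag: in SQLite the second argument of GROUP_CONCAT is the separator, whereas in MySQL GROUP_CONCAT(tag_id, ' ') concatenates the two expressions and separates entries with commas, so the exact tag strings differ between the two databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, question_body TEXT);
    CREATE TABLE tag_question (question_id INTEGER, tag_id INTEGER);
    INSERT INTO questions VALUES (1, 'q one'), (2, 'q two'), (3, 'q three, untagged');
    INSERT INTO tag_question VALUES (1, 10), (1, 11), (2, 10);
""")

# JOIN + GROUP BY: the untagged question has no join partner and disappears.
joined = conn.execute("""
    SELECT q.question_id, GROUP_CONCAT(tq.tag_id, ' ')
    FROM questions AS q
    JOIN tag_question AS tq ON q.question_id = tq.question_id
    GROUP BY q.question_id
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

# Correlated subquery: every question is kept; untagged ones get NULL tags.
subqueried = conn.execute("""
    SELECT q.question_id,
           (SELECT GROUP_CONCAT(t.tag_id, ' ')
            FROM tag_question AS t
            WHERE t.question_id = q.question_id)
    FROM questions AS q
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

print([row[0] for row in joined])      # [1, 2]
print([row[0] for row in subqueried])  # [1, 2, 3]
```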
If the field you're grouping on is indexed, it shouldn't do a full table scan.
I have asked a similar question before, and while the answers I got were spectacular, I might need to clarify.
Just like in this question, I want to return N rows depending on a value in a column.
My example: I have a blog where I want to show my posts along with a preview of the comments. The last three comments, to be exact.
I have what I need for my posts, but I am racking my brain to get the comments right. The comments table has a foreign key of post_id, so multiple comments can be attached to one post; if a post has 20 comments, I just want to return the last three. What makes this somewhat tricky is that I want to do it in one query and not a "limit 3" query per blog post, which makes rendering a page with a lot of posts very query heavy.
SELECT *
FROM replies
GROUP BY post_id
HAVING COUNT( post_id ) <=3
This query does what I want but only returns one comment per post, not three.
SELECT l.*
FROM (
SELECT post_id,
COALESCE(
(
SELECT id
FROM replies li
WHERE li.post_id = dlo.post_id
ORDER BY
li.post_id, li.id
LIMIT 2, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM (
SELECT DISTINCT post_id
FROM replies dl
) dlo
) lo, replies l
WHERE l.post_id >= lo.post_id
AND l.post_id <= lo.post_id
AND l.id <= lo.mid
Having an index on replies (post_id, id) (in this order) will greatly improve this query.
Note the usage of l.post_id >= lo.post_id AND l.post_id <= lo.post_id: this is to make the index usable.
See the article in my blog for details:
Advanced row sampling (how to select N rows from a table for each GROUP)
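The threshold idea behind that query can be sketched in SQLite as follows. This is an adaptation, not the answer's exact query: it sorts descending so that it keeps the last three replies per post (as the question asks), uses a plain equality join instead of the range predicate, and assumes reply ids are positive so that 0 works as the "keep everything" threshold. The body column and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE replies (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)")
conn.execute("CREATE INDEX replies_post_id ON replies (post_id, id)")

# Post 1 gets reply ids 1-5, post 2 gets reply ids 6-7.
conn.executemany(
    "INSERT INTO replies (post_id, body) VALUES (?, ?)",
    [(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (1, 'e'), (2, 'f'), (2, 'g')],
)

# For each post, find the id of its third-newest reply (the threshold),
# falling back to 0 when the post has fewer than three replies; then keep
# every reply at or above that threshold.
last_three = conn.execute("""
    SELECT l.post_id, l.id
    FROM (
        SELECT dlo.post_id,
               COALESCE((
                   SELECT li.id
                   FROM replies AS li
                   WHERE li.post_id = dlo.post_id
                   ORDER BY li.id DESC
                   LIMIT 1 OFFSET 2
               ), 0) AS mid
        FROM (SELECT DISTINCT post_id FROM replies) AS dlo
    ) AS lo
    JOIN replies AS l ON l.post_id = lo.post_id AND l.id >= lo.mid
    ORDER BY l.post_id, l.id
""").fetchall()

print(last_three)  # [(1, 3), (1, 4), (1, 5), (2, 6), (2, 7)]
```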
Do you track comment date? You can sort those results to grab only the 3 most recent ones.
Following Ian Jacobs's idea:
declare @PostID int
select top 3 post_id, comment
from replies
where post_id = @PostID
order by createdate desc