SQL query to get most comments since last month - sql

I have two tables, one containing posts and the other containing comments. There are millions of posts and hundreds of millions of comments; the comments table also contains the post's id. Comments are deactivated after some time, and I want to know which posts had the most comments in the last 30 days before being deactivated.
What I have to do is find max(comment_date) for each post from the comments table and count all the comments going back 1 month from that date for each post.
So essentially I want to group by post_id, find max(comment_date), and get the count of all comments from max(comment_date) - 1 month for each post. I am struggling to create the query to get this data.
The database is postgres 9.4.1.

On that amount of data, the query will take time. One method is to use window functions:
select post_id, count(*)
from (select c.*, max(comment_date) over (partition by post_id) as maxcd
from comments c
) c
where comment_date >= maxcd - interval '1 month'
group by post_id;

I would try a LATERAL join here. If you create an index on the comments table on (post_id, comment_date DESC), it may be more efficient than the variant suggested by Gordon Linoff; it really depends on your data distribution. It may not be necessary to specify DESC in the index: the optimizer may be smart enough to use it even if it is (post_id, comment_date ASC). The order of the columns, however, is important.
Here is SQL Fiddle.
The query literally follows the steps you outlined in the question.
We scan the posts table once. For each post we find the comment with the latest comment_date and subtract 30 days from it; this should take one seek of the index. Then we count all comments of this post dated on or after that cutoff; this should be a range scan of the same index.
SELECT
posts.id
,c_count.post_count
FROM
posts
INNER JOIN LATERAL
(
SELECT comments.comment_date - interval '30 days' AS max_date
FROM comments
WHERE comments.post_id = posts.id
ORDER BY comments.comment_date DESC
LIMIT 1
) AS c_max_date ON true
INNER JOIN LATERAL
(
SELECT COUNT(*) AS post_count
FROM comments
WHERE
comments.post_id = posts.id
AND comments.comment_date >= c_max_date.max_date
) AS c_count ON true
;
It may be possible to do the two steps in one go (find max date, then count rows within 30 days interval) using window functions.
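The logic shared by both answers (find each post's latest comment date, then count that post's comments in the 30 days up to it) can be checked on a handful of rows. The sketch below is only an illustration: it assumes an in-memory SQLite database rather than PostgreSQL, stores dates as ISO text, and, because SQLite has no LATERAL, swaps in a join against a per-post MAX subquery. The function name recent_comment_counts and the sample rows are made up for the example.

```python
import sqlite3

def recent_comment_counts(rows):
    """rows: iterable of (post_id, comment_date) with ISO 'YYYY-MM-DD' dates."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE comments (post_id INTEGER, comment_date TEXT)")
    # Same column order as the index suggested in the answer above.
    con.execute("CREATE INDEX idx_post_date ON comments (post_id, comment_date)")
    con.executemany("INSERT INTO comments VALUES (?, ?)", rows)
    result = con.execute("""
        SELECT c.post_id, COUNT(*) AS recent_comments
        FROM comments c
        JOIN (SELECT post_id, MAX(comment_date) AS maxcd
              FROM comments
              GROUP BY post_id) m
          ON m.post_id = c.post_id
        -- keep only comments within 30 days of the post's latest comment
        WHERE c.comment_date >= date(m.maxcd, '-30 days')
        GROUP BY c.post_id
        ORDER BY recent_comments DESC, c.post_id
    """).fetchall()
    con.close()
    return result

sample = [
    (1, "2015-01-01"), (1, "2015-03-01"), (1, "2015-03-20"),
    (2, "2015-02-10"), (2, "2015-02-11"), (2, "2015-02-12"),
]
print(recent_comment_counts(sample))
```

Post 1's oldest comment falls outside its 30-day window and is not counted, which is the behaviour the question asks for.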

Related

Performance in SQL sentence containing ORDER BY, LIMIT and COUNT

I've been looking for a way to improve this dangerous combination of functions in one SQL statement...
To give you some context, I have a table with several pieces of information about articles (article_id, author, ...) and another one containing the article_id with one tag_id. As an article can have several tags, that second table can have 2 rows with the same article_id and different tag_id.
In order to get a list of the 8 articles that have the most tags in common with the one that I want (in this case 1354), I have written the following query:
SELECT articles.article_id, articles.author, count(articles_tags.article_id) as times
FROM articles
INNER JOIN articles_tags ON (articles.article_id=articles_tags.article_id)
WHERE id_tag IN
(SELECT article_id FROM articles_tags WHERE article_id=1354)
AND article_id <> 1354
GROUP BY article_id
ORDER BY times DESC
LIMIT 8
It is EXTREMELY slow... like 90 seconds for half million articles.
By deleting the "order by times" sentence, it works almost instantly, but if i do so, i won't get the most similar articles.
What can i do?
Thanks!!
A query filtering on an IN sub-select is often a time-killer... Also, as the query didn't really appear to be accurate, or something is missing, I am assuming that your articles_tags table has two columns: one for the actual article ID, and another for the tag ID associated with it.
That said, I would pre-query just the tag IDs for article 1354 (the one you are interested in). Use that as the driving set and join it to the article tags again on the tag IDs being the same. From that, you grab the SECOND article tags alias and get ITS article ID, and then the count of rows that MATCH (via a join, not a left join). Apply the group by on the article ID as you had, and for grins, join to the articles table to get the author.
Now, note. Some SQL engines require you to group by all non-aggregate fields, so you MAY have to either add the author to the group by (which will always be the same per article ID anyway), or change it to MAX( A.author ) as Author which would give the same results.
I would have an index on the (tag_id, article_id) so the tags are found from the "common" tags you are looking to find in common. You could have one article with 10 tags, and another article with 10 completely different tags resulting in 0 in common. This will prevent the other article from even appearing in the result set.
You STILL have the time associated with blowing through half-million articles as you described, which could be millions of actual tag entries.
select
AT2.article_id,
A.Author,
count(*) as Times
from
( select ATG.id_tag
from articles_tags ATG
where ATG.Article_ID = 1354
order by ATG.id_tag ) CommonTags
JOIN articles_tags AT2
on CommonTags.ID_Tag = AT2.ID_Tag
AND AT2.Article_ID <> 1354
JOIN articles A
on AT2.Article_ID = A.Article_ID
group by
AT2.article_id
order by
Times DESC
limit 8
It seems that it should be possible to do this without any subqueries, and then a quicker query may result.
Here the article of interest is joined to its tags, and then further to other articles having these tags. Then the number of tags for each article is counted and ordered:
SELECT a2.article_id, a2.author, COUNT(t2.tag_id) AS times
FROM articles a1
INNER JOIN articles_tags t1
ON t1.article_id = a1.article_id -- find tags for starting article
INNER JOIN articles_tags t2
ON t2.tag_id = t1.tag_id -- find other instances of those tags
AND t2.article_id <> t1.article_id
INNER JOIN articles a2
ON a2.article_id = t2.article_id -- and the articles where they are used
WHERE a1.article_id = 1354
GROUP BY a2.article_id, a2.author -- count common tags by articles
ORDER BY times DESC
LIMIT 8
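If you know a lower bound on the number of tags in common (e.g. 3), inserting HAVING times > 2 before ORDER BY times DESC could give a further speed improvement, since fewer groups reach the sort. Engines that do not accept a select alias in HAVING (PostgreSQL, for example) need HAVING COUNT(t2.tag_id) > 2 instead.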
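The self-join idea from the answers above can be tried on a few rows. This is a minimal sketch, assuming an in-memory SQLite database, a two-column articles_tags (article_id, tag_id) table as the first answer guessed, and the (tag_id, article_id) index it recommends. The function name similar_articles and the sample data are hypothetical, and the join to articles for the author is left out to keep it short.

```python
import sqlite3

def similar_articles(pairs, article_id, limit=8):
    """pairs: iterable of (article_id, tag_id). Returns (article_id, shared_tags)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE articles_tags (article_id INTEGER, tag_id INTEGER)")
    con.execute("CREATE INDEX idx_tag_article ON articles_tags (tag_id, article_id)")
    con.executemany("INSERT INTO articles_tags VALUES (?, ?)", pairs)
    rows = con.execute("""
        SELECT t2.article_id, COUNT(*) AS times
        FROM articles_tags t1
        JOIN articles_tags t2
          ON t2.tag_id = t1.tag_id              -- same tag...
         AND t2.article_id <> t1.article_id     -- ...on a different article
        WHERE t1.article_id = ?
        GROUP BY t2.article_id
        ORDER BY times DESC, t2.article_id
        LIMIT ?
    """, (article_id, limit)).fetchall()
    con.close()
    return rows

pairs = [
    (1354, 1), (1354, 2), (1354, 3),
    (10, 1), (10, 2), (10, 3),   # shares three tags with 1354
    (11, 1), (11, 9),            # shares one tag
    (12, 8),                     # shares none, so it never enters the join
]
print(similar_articles(pairs, 1354))
```

Because the join is an inner join on the tag, articles with no tag in common are never materialised, which is the point made about article pairs with zero overlap.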

SQL count distinct values for records but filter some dups

I have an MS SQL 2008 table of survey responses and I need to produce some reports. The table is fairly basic: it has an autonumber key, a user ID for the person responding, a date, and then a bunch of fields, one for each individual question. Most of the questions are multiple choice, and the data value in the response field is a short varchar text representation of that choice.
What I need to do is count the number of distinct responses for each choice option (ie. for question 1, 10 people answered A, 20 answered B, and so forth). That is not overly complex. However, the twist is that some people have taken the survey multiple times (so they would have the same User ID field). For these responses, I am only supposed to include the latest data in my report (based on the survey date field). What would be the best way to exclude the older survey records for those users that have multiple records?
Since you didn't give us your DB schema I've had to make some assumptions but you should be able to use row_number to identify the latest survey taken by a user.
with cte as
(
SELECT
Row_number() over (partition by userID, surveyID order by id desc) rn,
surveyID
FROM
User_survey
)
SELECT
a.answer_type,
Count(a.answer) answercount
FROM
cte
INNER JOIN Answers a
ON cte.surveyID = a.surveyID
WHERE
cte.rn = 1
GROUP BY
a.answer_type
Maybe not the most efficient query, but what about:
select userid, max(survey_date) from my_table group by userid
then you can inner join on the same table to get additional data.
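The "max date per user, then join back" suggestion can be sketched as follows. It is only an illustration under assumptions: an in-memory SQLite database instead of SQL Server, and a hypothetical responses table with one question column q1, since the real schema was not given. The function name latest_answer_counts is likewise invented.

```python
import sqlite3

def latest_answer_counts(rows):
    """rows: iterable of (user_id, survey_date, q1). Counts q1 answers,
    using only each user's most recent response."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE responses (
                       id INTEGER PRIMARY KEY,
                       user_id INTEGER,
                       survey_date TEXT,
                       q1 TEXT)""")
    con.executemany(
        "INSERT INTO responses (user_id, survey_date, q1) VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT r.q1, COUNT(*) AS answer_count
        FROM responses r
        JOIN (SELECT user_id, MAX(survey_date) AS last_date
              FROM responses
              GROUP BY user_id) latest
          ON latest.user_id = r.user_id
         AND latest.last_date = r.survey_date   -- drop the older attempts
        GROUP BY r.q1
        ORDER BY r.q1
    """).fetchall()
    con.close()
    return result

rows = [
    (1, "2010-01-01", "A"),
    (1, "2010-02-01", "B"),   # user 1 retook the survey; only this row counts
    (2, "2010-01-15", "A"),
    (3, "2010-01-20", "B"),
]
print(latest_answer_counts(rows))
```

If a user can submit twice with the same survey_date, both rows survive the join; joining on the maximum autonumber key instead would avoid that.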

MySQL: latest comments on each post

Having the classic "posts table, and comments table with foreign key to posts table" scenario, what's the most efficient way to get the IDs of the last 20 posts ordered by the time of their last comment, and the actual comment itself?
Here is a query that works but can probably be done much more efficiently:
SELECT * FROM (
SELECT * FROM comments ORDER BY time DESC
) AS foo GROUP BY post_id ORDER BY time DESC LIMIT 20
A nested query with an ORDER BY is necessary to make sure that the latest comment gets selected into the post_id group.
As mentioned in the comments: practically the same question as Retrieveing the most recent records within a query.
See the greatest-n-per-group tag for more similar questions.
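One common greatest-n-per-group formulation that does not depend on how GROUP BY picks a row is an anti-join: keep a comment only if no newer comment exists for the same post. The sketch below assumes an in-memory SQLite database rather than MySQL, and the column names posted_at and body as well as the function name are hypothetical.

```python
import sqlite3

def latest_comment_per_post(rows, limit=20):
    """rows: iterable of (post_id, posted_at, body).
    Returns (post_id, body) for the most recently commented posts."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE comments (
                       id INTEGER PRIMARY KEY,
                       post_id INTEGER,
                       posted_at TEXT,
                       body TEXT)""")
    con.execute("CREATE INDEX idx_post_time ON comments (post_id, posted_at)")
    con.executemany(
        "INSERT INTO comments (post_id, posted_at, body) VALUES (?, ?, ?)", rows)
    result = con.execute("""
        SELECT c.post_id, c.body
        FROM comments c
        WHERE NOT EXISTS (            -- no newer comment on the same post
            SELECT 1
            FROM comments newer
            WHERE newer.post_id = c.post_id
              AND newer.posted_at > c.posted_at)
        ORDER BY c.posted_at DESC
        LIMIT ?
    """, (limit,)).fetchall()
    con.close()
    return result

rows = [
    (1, "2020-01-01 10:00", "first on post 1"),
    (1, "2020-01-03 10:00", "latest on post 1"),
    (2, "2020-01-02 10:00", "only one on post 2"),
]
print(latest_comment_per_post(rows))
```

Two comments on one post with an identical timestamp would both be returned; comparing on the id as a tie-breaker is one way around that.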

Need help with Join

So I'm trying to build a simple forum. It'll be a list of topics in descending order by the date of either the topic (if no replies) or latest reply. Here's the DB structure:
Topics
id, subject, date, poster
Posts
id, topic_id, message, date, poster
The forum itself will consist of an HTML table with the following headers:
Topic | Last Post | Replies
What would the query or queries look like to produce such a structure? I was thinking it would involve a cross join, but not sure... Thanks in advance.
Of course you can write a query for this, but I advise you to add 'replies' and 'last post' fields to the Topics table and update them on every new post. That could really improve your database speed; not now, but once you have thousands of topics.
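SELECT `Topics`.*, `TopicPosts`.`replies`, `TopicPosts`.`last_post`
FROM
`Topics`,
(
-- one row per topic: reply count and date of the newest post
SELECT `topic_id`, COUNT(*) AS `replies`, MAX(`date`) AS `last_post`
FROM `Posts`
GROUP BY `topic_id`
) AS `TopicPosts`
WHERE `Topics`.`id` = `TopicPosts`.`topic_id`
ORDER BY `TopicPosts`.`last_post` DESC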
This 'should' work, or almost work in the case it doesn't, but I agree with the other poster, it's probably better to store this data in the topics table for all sorts of reasons, even if it is duplication of data.
"The forum itself will consist of an HTML table with the following headers: Topic | Last Post | Replies"
If "Last Post" is meant to be a date, it's simple.
SELECT
t.id,
t.subject,
MAX(p.date) AS last_post,
COUNT(p.id) AS count_replies
FROM
Topics t
INNER JOIN Posts p ON p.topic_id = t.id
GROUP BY
t.id,
t.subject
If you want other things to display along with the last post date, like its id or the poster, it gets a little more complex.
SELECT
t.id,
t.subject,
aggregated.reply_count,
aggregated.distinct_posters,
last_post.id,
last_post.date,
last_post.poster
FROM
Topics t
INNER JOIN (
SELECT topic_id,
MAX(date) AS last_date,
COUNT(id) AS reply_count,
COUNT(DISTINCT poster) AS distinct_posters
FROM Posts
GROUP BY topic_id
) AS aggregated ON aggregated.topic_id = t.id
INNER JOIN Posts AS last_post ON last_post.topic_id = t.id
AND last_post.date = aggregated.last_date
As an example, I've added the count of distinct posters for a topic to show you where this approach can be extended.
The query relies on the assumption that no two posts within one topic can ever have the same date. If you expect this to happen, the query must be changed to account for it.
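The question also asks for topics without replies to be listed by their own date, which an INNER JOIN to Posts would drop. One possible variation, sketched here under assumptions, uses a LEFT JOIN and falls back to the topic date with COALESCE. It runs against in-memory SQLite rather than MySQL, with only the columns needed for the listing; the function name forum_index and the sample rows are invented for the example.

```python
import sqlite3

def forum_index(topics, posts):
    """topics: (id, subject, date); posts: (topic_id, date).
    Returns (id, subject, last_activity, replies), newest activity first."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Topics (id INTEGER PRIMARY KEY, subject TEXT, date TEXT)")
    con.execute("CREATE TABLE Posts (id INTEGER PRIMARY KEY, topic_id INTEGER, date TEXT)")
    con.executemany("INSERT INTO Topics VALUES (?, ?, ?)", topics)
    con.executemany("INSERT INTO Posts (topic_id, date) VALUES (?, ?)", posts)
    rows = con.execute("""
        SELECT t.id,
               t.subject,
               COALESCE(MAX(p.date), t.date) AS last_activity, -- topic date if no replies
               COUNT(p.id) AS replies                          -- counts only matched posts
        FROM Topics t
        LEFT JOIN Posts p ON p.topic_id = t.id
        GROUP BY t.id, t.subject, t.date
        ORDER BY last_activity DESC
    """).fetchall()
    con.close()
    return rows

topics = [(1, "Hello", "2020-01-01"), (2, "Empty", "2020-01-05")]
posts = [(1, "2020-01-03"), (1, "2020-01-04")]
print(forum_index(topics, posts))
```

COUNT(p.id) rather than COUNT(*) matters here: for a topic with no posts the LEFT JOIN still yields one row, and only the former reports zero replies.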

Select N rows from a table with a non-unique foreign key

I have asked a similar question before, and while the answers I got were spectacular, I might need to clarify.
Just like This question I want to return N number of rows depending on a value in a column.
My example: I have a blog where I want to show my posts along with a preview of the comments. The last three comments, to be exact.
I have what I need for my posts, but I am racking my brain to get the comments right. The comments table has a foreign key of post_id, so multiple comments can be attached to one post; if a post has 20 comments, I just want to return the last three. What makes this somewhat tricky is that I want to do it in one query and not with a "limit 3" query per blog post, which makes rendering a page with a lot of posts very query heavy.
SELECT *
FROM replies
GROUP BY post_id
HAVING COUNT( post_id ) <=3
This query does what I want but only returns one of each comment and not three.
SELECT l.*
FROM (
SELECT post_id,
COALESCE(
(
SELECT id
FROM replies li
WHERE li.post_id = dlo.post_id
ORDER BY
li.post_id, li.id
LIMIT 2, 1
), CAST(0xFFFFFFFF AS DECIMAL)) AS mid
FROM (
SELECT DISTINCT post_id
FROM replies dl
) dlo
) lo, replies l
WHERE l.post_id >= lo.post_id
AND l.post_id <= lo.post_id
AND l.id <= lo.mid
Having an index on replies (post_id, id) (in this order) will greatly improve this query.
Note the usage of l.post_id >= lo.post_id AND l.post_id <= lo.post_id instead of a plain equality: this is to make the index usable.
See the article in my blog for details:
Advanced row sampling (how to select N rows from a table for each GROUP)
Do you track comment date? You can sort those results to grab only the 3 most recent ones.
Following Ian Jacobs's idea:
declare @PostID int
select top 3 post_id, comment
from replies
where post_id = @PostID
order by createdate desc
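For the single-query requirement in the question, one simple (not necessarily the fastest) formulation keeps a reply only if fewer than N newer replies exist for the same post. This is a sketch under assumptions: in-memory SQLite, a replies table with just id and post_id, and "newer" taken to mean a larger id. The function name last_n_replies is hypothetical.

```python
import sqlite3

def last_n_replies(rows, n=3):
    """rows: iterable of (id, post_id). Returns (post_id, id) for the
    newest n replies of every post, newest first within each post."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE replies (id INTEGER PRIMARY KEY, post_id INTEGER)")
    con.execute("CREATE INDEX idx_post_id ON replies (post_id, id)")
    con.executemany("INSERT INTO replies (id, post_id) VALUES (?, ?)", rows)
    result = con.execute("""
        SELECT r.post_id, r.id
        FROM replies r
        WHERE (SELECT COUNT(*)              -- how many replies are newer than r
               FROM replies newer
               WHERE newer.post_id = r.post_id
                 AND newer.id > r.id) < ?
        ORDER BY r.post_id, r.id DESC
    """, (n,)).fetchall()
    con.close()
    return result

rows = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 2)]
print(last_n_replies(rows))
```

The correlated count is evaluated per reply, so on large tables the (post_id, id) index is what keeps it tolerable; posts with fewer than n replies simply return all of them.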