Improve performance postgresql query - sql

I have 3 tables: users, posts and likes. A post is called a hot post if it has more than 5 likes within the first hour after it was created. The following query is used to list hot posts. Can anyone help me improve this query (by adding indexes or rewriting it)?
SELECT post.id,
post.content,
u.username,
COUNT(likes.id)
FROM posts AS post
LEFT OUTER JOIN users AS u
ON post.user_id = u.id
INNER JOIN likes AS likes
ON post.id = likes.post_id
AND likes.created_at - INTERVAL '1 hour' < post.created_at
GROUP BY post.id, u.username
HAVING COUNT(likes.id) >= 5
ORDER BY post.created_at DESC;

First, unless there really can be a post that does not belong to a user, use an inner join there.
Assuming that there is a good number of posts and likes, the best join strategy would be a merge join or a hash join, which PostgreSQL should choose automatically.
For a merge join, the following indexes might be helpful:
CREATE INDEX ON posts (id);
CREATE INDEX ON likes (post_id);
No index could help with a hash join in this case.
If the planner chooses a nested loop join after all, it might be useful to rewrite the query to:
... AND likes.created_at < post.created_at + INTERVAL '1 hour'
and create an index like
CREATE INDEX ON likes (post_id, created_at);
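As a rough, runnable illustration of the rewritten join condition and the composite index, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for PostgreSQL here, so `datetime(post.created_at, '+1 hour')` replaces `post.created_at + INTERVAL '1 hour'`; the sample rows, the index name and the `like_count` alias are invented for the example, and the sketch says nothing about which join strategy PostgreSQL would actually pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT, created_at TEXT);
CREATE TABLE likes (id INTEGER PRIMARY KEY, post_id INTEGER, created_at TEXT);
-- Composite index from the answer (the index name is made up for this sketch).
CREATE INDEX likes_post_id_created_at ON likes (post_id, created_at);
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.executemany("INSERT INTO posts VALUES (?, 1, ?, ?)", [
    (1, "hot", "2024-01-01 10:00:00"),
    (2, "slow", "2024-01-01 11:00:00"),
])
# Post 1 gets five likes within its first hour; post 2 gets five likes, only two of them in time.
likes = [(1, f"2024-01-01 10:{m:02d}:00") for m in (5, 10, 15, 20, 25)]
likes += [(2, "2024-01-01 11:10:00"), (2, "2024-01-01 11:20:00"),
          (2, "2024-01-01 13:00:00"), (2, "2024-01-01 14:00:00"),
          (2, "2024-01-01 15:00:00")]
conn.executemany("INSERT INTO likes (post_id, created_at) VALUES (?, ?)", likes)

# The time comparison keeps likes.created_at bare on one side, so an index
# on (post_id, created_at) can be used for the lookup.
hot_posts = conn.execute("""
SELECT post.id, post.content, u.username, COUNT(likes.id) AS like_count
FROM posts AS post
JOIN users AS u ON post.user_id = u.id
JOIN likes ON likes.post_id = post.id
          AND likes.created_at < datetime(post.created_at, '+1 hour')
GROUP BY post.id, post.content, u.username
HAVING COUNT(likes.id) >= 5
ORDER BY post.created_at DESC
""").fetchall()
print(hot_posts)
```

Only post 1 qualifies in this sample data, because post 2 has just two likes inside its first hour.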

Related

SQL query to get most comments since last month

I have two tables, one containing posts and the other containing comments. There are millions of posts and hundreds of millions of comments; the comments table also contains the post IDs. Comments are deactivated after some time, and I want to know which posts had the most comments in the last 30 days before being deactivated.
What I have to do is find max(comment_date) for each post in the comments table and count all the comments going back 1 month from that date for each post.
So essentially I want to group by post_id, find max(comment_date) and get the count of all comments since max(comment_date) - 1 month for each post. I am struggling to write the query to get this data.
The database is postgres 9.4.1.
On that amount of data, the query will take time. One method is to use window functions:
select post_id, count(*)
from (select c.*, max(comment_date) over (partition by post_id) as maxcd
from comments c
) c
where comment_date >= maxcd - interval '1 month'
group by post_id;
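To see what the window-function query returns on a small data set, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build with window function support (3.25 or later) and uses `date(maxcd, '-1 month')` in place of PostgreSQL's `maxcd - interval '1 month'`; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, comment_date TEXT)"
)
conn.executemany("INSERT INTO comments (post_id, comment_date) VALUES (?, ?)", [
    (1, "2015-01-01"), (1, "2015-03-10"), (1, "2015-03-25"), (1, "2015-04-01"),
    (2, "2015-02-01"), (2, "2015-02-20"),
])

# MAX(...) OVER (PARTITION BY post_id) attaches each post's latest comment
# date to every one of its rows; the outer query keeps the last month only.
counts = conn.execute("""
SELECT post_id, COUNT(*)
FROM (SELECT c.*, MAX(comment_date) OVER (PARTITION BY post_id) AS maxcd
      FROM comments c) AS c
WHERE comment_date >= date(maxcd, '-1 month')
GROUP BY post_id
ORDER BY post_id
""").fetchall()
print(counts)
```

Post 1 keeps the three comments dated on or after 2015-03-01, and post 2 keeps both of its comments.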
I would try a LATERAL join here. If you create an index on the comments table on (post_id, comment_date DESC), it may be more efficient than the variant suggested by Gordon Linoff. It really depends on your data distribution. It may not be necessary to specify DESC in the index; the optimizer may be smart enough to use an index on (post_id, comment_date ASC) as well. The order of the columns is important, though.
The query literally follows the steps you outlined in the question.
We scan the posts table once. For each post we find the comment with the latest comment_date and subtract 30 days from it. This should take a single index seek. Then we count all comments of this post on or after that date. This should be a range scan of the same index.
SELECT
posts.id
,c_count.post_count
FROM
posts
INNER JOIN LATERAL
(
SELECT comments.comment_date - interval '30 days' AS max_date
FROM comments
WHERE comments.post_id = posts.id
ORDER BY comments.comment_date DESC
LIMIT 1
) AS c_max_date ON true
INNER JOIN LATERAL
(
SELECT COUNT(*) AS post_count
FROM comments
WHERE
comments.post_id = posts.id
AND comments.comment_date >= c_max_date.max_date
) AS c_count ON true
;
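The same two steps (find the latest comment date per post, then count the comments in the 30 days before it) can be tried out on a toy data set. The sketch below uses Python's built-in sqlite3 module; SQLite has no LATERAL, so correlated subqueries stand in for the two lateral joins, and the index name and sample rows are invented. One difference to be aware of: the INNER JOIN LATERAL version drops posts that have no comments, while this stand-in reports them with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, comment_date TEXT);
-- Index with post_id first, as recommended in the answer.
CREATE INDEX comments_post_id_comment_date ON comments (post_id, comment_date DESC);
INSERT INTO posts VALUES (1), (2), (3);
""")
conn.executemany("INSERT INTO comments (post_id, comment_date) VALUES (?, ?)", [
    (1, "2015-01-01"), (1, "2015-03-10"), (1, "2015-03-25"), (1, "2015-04-01"),
    (2, "2015-02-01"), (2, "2015-02-20"),
])

# Step 1 (inner subquery): latest comment date of the post, minus 30 days.
# Step 2 (outer subquery): count that post's comments on or after that date.
per_post = conn.execute("""
SELECT p.id,
       (SELECT COUNT(*)
        FROM comments c
        WHERE c.post_id = p.id
          AND c.comment_date >= (SELECT date(MAX(c2.comment_date), '-30 days')
                                 FROM comments c2
                                 WHERE c2.post_id = p.id)) AS comment_count
FROM posts p
ORDER BY p.id
""").fetchall()
print(per_post)
```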
It may be possible to do the two steps in one go (find max date, then count rows within 30 days interval) using window functions.

Perform SQL query and then join

Let's say I have two tables:
ticket with columns [id, date, userid]; userid is a foreign key that references user.id
user with columns [id,name]
Owing to really large tables I would like to first filter the tickets table by date
SELECT id FROM ticket WHERE date >= 'some date'
Then I would like to do a left join with the user table. Is there a way to do it? I tried the following but it doesn't work.
select ticket.id, user.name from ticket where ticket.date >= '2015-05-18' left join user on ticket.userid=user.id;
Apologies if it's a stupid question. I have searched on Google, but most answers involve subqueries after the join instead of what I want, which is to perform the query first and then do the join for the items returned.
To make things a little clearer, the problem I am facing is that I have large tables and the join takes time. I am joining 3 tables and the query takes almost 3 seconds. What's the best way to reduce the time? Instead of joining and then applying the where clause, I figured I should first select a small subset and then join.
Simply put everything in the right order:
select - from - where - group by - having - order by
select ticket.id, user.name
from ticket left join user on ticket.userid=user.id
where ticket.date >= '2015-05-18'
Or put it in a Derived Table:
select ticket.id, user.name
from
(
select * from ticket
where ticket.date >= '2015-05-18'
) as ticket
left join user on ticket.userid=user.id
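Both forms can be checked against each other on a tiny data set. The following is a minimal sketch with Python's built-in sqlite3 module; the sample rows are invented, and the index on ticket (date) is an assumption added here as the usual way to make the date filter cheap, not something the answer states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ticket (id INTEGER PRIMARY KEY, date TEXT,
                     userid INTEGER REFERENCES "user"(id));
-- Assumption: an index on the filter column (the index name is made up).
CREATE INDEX ticket_date ON ticket (date);
INSERT INTO "user" VALUES (1, 'ann'), (2, 'bob');
INSERT INTO ticket VALUES (10, '2015-05-01', 1),
                          (11, '2015-05-18', 2),
                          (12, '2015-05-20', NULL);
""")

# Form 1: join, with the filter in the WHERE clause.
join_then_filter = conn.execute("""
SELECT ticket.id, "user".name
FROM ticket LEFT JOIN "user" ON ticket.userid = "user".id
WHERE ticket.date >= '2015-05-18'
ORDER BY ticket.id
""").fetchall()

# Form 2: filter inside a derived table, then join.
filter_then_join = conn.execute("""
SELECT t.id, "user".name
FROM (SELECT * FROM ticket WHERE date >= '2015-05-18') AS t
LEFT JOIN "user" ON t.userid = "user".id
ORDER BY t.id
""").fetchall()

print(join_then_filter == filter_then_join, join_then_filter)
```

The two queries describe the same result, so a query planner is free to apply the date filter before the join in either case; the written order of the clauses does not dictate the execution order.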

SQL SELECT complex expression in column - additional boolean

I seem to have reached a mental block on this and hope someone can give me a kick in the right direction.
I have a web application similar to a newsreader client. It's written in Python and uses SQLAlchemy but that's not important here as I'm trying to get my head around the SQL, also I'm using SQLite as a backend.
There is a Users table and an Articles table, the Users table is obvious enough and the Articles table stores individual articles (like posts on a news server). I track which user has read which article through a many-many relationship employing another table, Users_Articles, to do this.
The (cut down) schema is something like this:
Users:
user_id
user_name
Articles:
article_id
article_body
Users_Articles:
user_id
article_id
What I'm trying to do is SELECT a list of articles but to also display which article has already been read by the current user thus I'd like to add a boolean column to the set of columns in the SELECT statement which indicates if there is a row in Users_Articles which refers to the article for the current user.
You can go with a left outer join:
select
a.article_id, a.article_body,
ua.article_id as been_read --will be not null for read articles
from Articles a
left outer join Users_Articles ua
on (ua.article_id = a.article_id and ua.user_id = $current_user_id)
or with a subselect:
select
a.article_id, a.article_body,
(select 1 from Users_Articles ua
where ua.article_id = a.article_id
and ua.user_id = $current_user_id) as been_read --will be not null for read articles
from Articles a
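Since the question mentions Python with SQLite, here is a minimal sketch of a variant that returns a real 0/1 flag instead of a value that is merely NULL or not NULL. It wraps the subselect in EXISTS and binds the user id as a parameter; the helper name `articles_with_read_flag` and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (user_id INTEGER PRIMARY KEY, user_name TEXT);
CREATE TABLE Articles (article_id INTEGER PRIMARY KEY, article_body TEXT);
CREATE TABLE Users_Articles (user_id INTEGER, article_id INTEGER,
                             PRIMARY KEY (user_id, article_id));
INSERT INTO Users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO Articles VALUES (100, 'first'), (101, 'second');
INSERT INTO Users_Articles VALUES (1, 100), (2, 101);
""")

def articles_with_read_flag(conn, current_user_id):
    """Return (article_id, article_body, been_read) rows; been_read is 1 or 0."""
    return conn.execute("""
        SELECT a.article_id, a.article_body,
               EXISTS (SELECT 1 FROM Users_Articles ua
                       WHERE ua.article_id = a.article_id
                         AND ua.user_id = ?) AS been_read
        FROM Articles a
        ORDER BY a.article_id
    """, (current_user_id,)).fetchall()

rows = articles_with_read_flag(conn, 1)
print(rows)
```

Binding the id with `?` rather than interpolating `$current_user_id` into the SQL string also avoids SQL injection.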

How to combine data from 2 tables under circumstances?

I have 2 tables. One table contains posts and the other contains votes for the posts. Each member can vote (+ or -) for each post.
(Structure example:)
Posts table: pid, belongs, userp, text.
Votes table: vid, userv, postid, vote.
Also one table which contains the info for the users.
What I want is this: suppose I am a logged-in member. I want to show all the posts and, for those I've already voted on, not let me vote again (and show me whether I voted + or -).
What I have done so far is very bad, as it runs a lot of queries:
SELECT `posts`.*, `users`.`username`
FROM `posts`,`users`
WHERE `posts`.belongs=$taken_from_url AND `users`.`usernumber`=`posts`.`userp`
ORDER BY `posts`.`pid` DESC;
and then:
foreach ($query as $result) {if (logged_in) {select vote from votes....etc} }
So, this means that if I am logged in and it shows 30 posts, then it will do 30 queries to check if at each post I have voted and what I've voted. My question is, can I do it shorter with a JOIN (I guess) and how? (I already tried something, but didn't succeed)
Firstly I'll say that if you're going to have significantly different output for users logged in versus those that aren't, just have two queries rather than trying to create something really complicated.
Secondly, this should do something like what you want:
SELECT p.*, u.username,
(SELECT SUM(vote) FROM votes WHERE postid = p.pid) total_votes,
(SELECT vote FROM votes WHERE postid = p.pid AND userv = $logged_in_user_id) my_vote
FROM posts p
JOIN users u ON p.userp = u.usernumber
WHERE p.belongs = $taken_from_url
ORDER BY p.pid DESC
Note: You don't say what the values of the votes table are. I'm assuming it's either +1 (up) or -1 (down) so you can easily find the total votes by adding them up. If you're not doing it this way I suggest you do to make your life easier.
The first correlated subquery can be eliminated by doing a JOIN and GROUP BY but I tend to find the above form much more readable.
So what this does is it joins users to posts, much like you were doing except that it uses JOIN syntax (which again comes down to readability). Then it has two subqueries: the first finds the total votes for that particular post and the second finds out what a particular user's vote was:
+1: up vote;
-1: down vote;
NULL: no vote.
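The JOIN and GROUP BY form mentioned above could look roughly like the sketch below, written with Python's built-in sqlite3 module so it can be run as is. It assumes votes are stored as +1/-1 and that a user has at most one vote per post; the sample rows are invented. Unlike the subquery version, it wraps the sum in COALESCE so posts without votes show 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (usernumber INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (pid INTEGER PRIMARY KEY, belongs INTEGER, userp INTEGER, text TEXT);
CREATE TABLE votes (vid INTEGER PRIMARY KEY, userv INTEGER, postid INTEGER, vote INTEGER);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO posts VALUES (1, 7, 1, 'a'), (2, 7, 2, 'b'), (3, 7, 1, 'c'), (4, 8, 1, 'd');
INSERT INTO votes (userv, postid, vote) VALUES (1, 1, 1), (2, 1, 1), (1, 2, -1), (2, 2, -1);
""")

# One pass over votes per post: SUM gives the total, and the CASE picks out
# the logged-in user's own vote (NULL when they have not voted).
posts_with_votes = conn.execute("""
SELECT p.pid, u.username,
       COALESCE(SUM(v.vote), 0) AS total_votes,
       MAX(CASE WHEN v.userv = :user_id THEN v.vote END) AS my_vote
FROM posts p
JOIN users u ON p.userp = u.usernumber
LEFT JOIN votes v ON v.postid = p.pid
WHERE p.belongs = :belongs
GROUP BY p.pid, u.username
ORDER BY p.pid DESC
""", {"user_id": 1, "belongs": 7}).fetchall()
print(posts_with_votes)
```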

selecting and displaying ranked items and a user's votes, a la reddit, digg, et al

when selecting ranked objects from a database (eg, articles users have voted on), what is the best way to show:
the current page of items
the user's rating, per item (if they've voted)
rough schema:
articles: id, title, content, ...
user: id, username, ...
votes: id, user_id, article_id, vote_value
is it better/ideal to:
select the current page of items
select the user's vote, limiting them to the page of items with an 'IN' clause
or
select the current page of items and just 'JOIN' vote data from the table of user votes
or, something entirely different?
this is theoretically in a high-traffic environment, and using an rdbms like mysql. fwiw, i see this on the side of "thinking it out before doing" and not "premature optimization."
thanks!
The JOIN would be faster; it would save a round trip to the database.
However, I wouldn't worry at all about this until you actually get some traffic. Many people have spoken out against premature optimization, I'll quote a random one:
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity.
If you need to order on votes, use this:
SELECT *
FROM (
SELECT a.*, (
SELECT SUM(vote_value)
FROM votes v
WHERE v.article_id = a.id
) AS votes
FROM articles a
) AS ranked
ORDER BY
votes DESC
LIMIT 100, 10 -- MySQL: skip 100 rows, return the next 10
This will count the votes and paginate in a single query.
If you want to show only the user's own votes, use LEFT JOIN:
SELECT a.*, v.vote_value
FROM articles a
LEFT JOIN
votes v
ON v.user_id = :current_user -- placeholder for the logged-in user's id
AND v.article_id = a.id
ORDER BY
a.timestamp DESC
LIMIT 100, 10 -- MySQL: skip 100 rows, return the next 10
Having an index on votes (user_id, article_id) will greatly improve this query.
Note that you can make (user_id, article_id) the PRIMARY KEY for votes, which will improve this query even more.
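A minimal sketch of the LEFT JOIN approach with a composite primary key on votes, using Python's built-in sqlite3 module so it runs without a MySQL server. SQLite's `LIMIT ... OFFSET ...` stands in for MySQL's `LIMIT offset, count`; the helper name `page_with_votes`, the `timestamp` column and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, timestamp TEXT);
-- The composite primary key doubles as the lookup index for the join.
CREATE TABLE votes (
    user_id INTEGER,
    article_id INTEGER,
    vote_value INTEGER,
    PRIMARY KEY (user_id, article_id)
);
""")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(i, f"article {i}", f"2024-01-{i:02d}") for i in range(1, 6)],
)
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", [(1, 5, 1), (1, 3, -1), (2, 4, 1)])

def page_with_votes(conn, user_id, page, page_size):
    """Return one page of articles, newest first, with this user's vote or None."""
    return conn.execute("""
        SELECT a.id, a.title, v.vote_value
        FROM articles a
        LEFT JOIN votes v ON v.user_id = :user_id AND v.article_id = a.id
        ORDER BY a.timestamp DESC
        LIMIT :page_size OFFSET :offset
    """, {"user_id": user_id, "page_size": page_size,
          "offset": page * page_size}).fetchall()

first_page = page_with_votes(conn, 1, 0, 2)
second_page = page_with_votes(conn, 1, 1, 2)
print(first_page, second_page)
```

The page of items and the user's votes come back in a single round trip, which is the advantage the answer claims for the JOIN over a separate IN query.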