Get latest child row with parent table row - sql

I have two tables, posts and comments. posts has many comments (comments has a post_id foreign key to the posts id primary key).
posts
id | content
------------
comments
id | post_id | text | created_at
-------------------------------
I need all posts with their content, plus the latest comment for each post (based on max(created_at)) and its text.
I can get up to created_at using this:
with comment_latest as (select
post_id,
max(created_at) as latest_commented_at
from comments
group by 1)
select
posts.id,
posts.content,
comment_latest.latest_commented_at
from posts
left join comment_latest on comment_latest.post_id = posts.id
order by posts.id desc
limit 10
But I want the text of the comment as well.

You can use the Postgres extension distinct on:
select distinct on (p.id) p.*, c.*
from posts p left join
comments c
on p.id = c.post_id
order by p.id desc, c.created_at desc
limit 10;
This sorts the data by the order by clause and, for each distinct p.id, returns only the first row in that order, which is the row with the latest created_at.
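For engines without distinct on, the same "first row per key" result can be sketched with a correlated subquery. This is only an illustration, not the answer's method: it runs on an in-memory SQLite database through Python's sqlite3 module, and the sample rows and the helper name latest_comment_per_post are made up for the demonstration.

```python
import sqlite3

def latest_comment_per_post(conn):
    """Return (post id, content, latest comment text, its created_at) per post."""
    sql = """
        SELECT p.id, p.content, c.text, c.created_at
        FROM posts p
        LEFT JOIN comments c
          ON c.id = (SELECT c2.id
                     FROM comments c2
                     WHERE c2.post_id = p.id
                     ORDER BY c2.created_at DESC, c2.id DESC
                     LIMIT 1)          -- newest comment of this post, if any
        ORDER BY p.id DESC
        LIMIT 10
    """
    return conn.execute(sql).fetchall()

# Hypothetical sample data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id),
        text TEXT,
        created_at TEXT
    );
    INSERT INTO posts VALUES (1, 'first'), (2, 'second'), (3, 'no comments');
    INSERT INTO comments VALUES
        (1, 1, 'old', '2020-01-01'),
        (2, 1, 'new', '2020-02-01'),
        (3, 2, 'only', '2020-01-15');
""")
rows = latest_comment_per_post(conn)
for row in rows:
    print(row)
```

Posts without comments still show up, with NULL in the comment columns, because of the left join.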

Related

Two table join Postgres count child rows

I have two tables: posts and replies.
The post table contains these columns
postid | forumName | title | content
The replies table contains these columns
replyid | content | postid
I would like to have a SQL query that joins these two tables and returns for each forum
forumName | Total Number of Posts | Total Number of Replies
This is hard as the two tables are linked using postId.
select forumName, count(postid) as postsNum
from posts
group by forumName
order by postsNum desc
Are you looking for this?
select p.forumName
,count(distinct p.postid) as posts
,count(r.replyid) as replies
from posts p
left join replies r -- left join keeps posts that have no replies
on p.postid = r.postid
group by p.forumName
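A per-forum count like this is easy to check on sample data. The sketch below is a rough test harness, assuming an in-memory SQLite database and invented rows; the helper name forum_totals is hypothetical. It runs the same aggregation with an inner join and with a left join to show how posts without replies affect the totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, forumName TEXT, title TEXT, content TEXT);
    CREATE TABLE replies (replyid INTEGER PRIMARY KEY, content TEXT, postid INTEGER);
    -- Hypothetical sample data: post 2 has no replies.
    INSERT INTO posts VALUES
        (1, 'sql', 't1', 'c1'),
        (2, 'sql', 't2', 'c2'),
        (3, 'python', 't3', 'c3');
    INSERT INTO replies VALUES (1, 'r1', 1), (2, 'r2', 1), (3, 'r3', 3);
""")

def forum_totals(conn, join_type):
    """Count posts and replies per forum using the given join type."""
    if join_type not in ("INNER", "LEFT"):
        raise ValueError("join_type must be INNER or LEFT")
    sql = f"""
        SELECT p.forumName,
               COUNT(DISTINCT p.postid) AS posts,   -- a post appears once per reply
               COUNT(r.replyid) AS replies          -- NULL replyid is not counted
        FROM posts p
        {join_type} JOIN replies r ON r.postid = p.postid
        GROUP BY p.forumName
        ORDER BY p.forumName
    """
    return conn.execute(sql).fetchall()

inner_totals = forum_totals(conn, "INNER")
left_totals = forum_totals(conn, "LEFT")
print(inner_totals)
print(left_totals)
```

With the inner join, the 'sql' forum reports only one post, because the post without replies is dropped before counting.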

SQL to calculate author with most books

I have a table of books, a table of authors, and a "linker" table (many to many links between authors/books).
How do I find the authors with the highest number of books?
This is my schema:
books : rowid, name
authors : rowid, name
book_authors : rowid, book_id, author_id
This is what I came up with: (but it doesn't work)
SELECT count(*) IN book_authors
WHERE (SELECT count(*) IN book_authors
WHERE author_id = author_id)
And ideally I would like a report of the top 100 authors, something like:
author_name book_count
-----------------------------------
Johnny 25
Kelly 12
Ramboz 10
Do I need some kind of join? What is the fastest approach?
I'd join the three tables (via the book_authors table), group by the author, count occurrences and limit it to the top 100 rows:
SELECT a.name, COUNT(*)
FROM authors a
JOIN book_authors ba ON a.rowid = ba.author_id
JOIN books b ON ba.book_id = b.rowid
GROUP BY a.name
ORDER BY 2 DESC
LIMIT 100
EDIT:
Actually, we aren't using any data from books, just the fact that the book exists, which can be inferred from book_authors, so this query can be improved by dropping the second join:
SELECT a.name, COUNT(*)
FROM authors a
JOIN book_authors ba ON a.rowid = ba.author_id
GROUP BY a.name
ORDER BY 2 DESC
LIMIT 100
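Here is a small runnable sketch of the two-table version, assuming SQLite (where rowid is implicit) and invented sample rows; the function name top_authors is hypothetical. One deliberate difference from the query above: it groups by a.rowid as well as a.name, so two different authors who share a name are not merged.

```python
import sqlite3

def top_authors(conn, limit=100):
    """Return (author name, book count) pairs, most prolific first."""
    sql = """
        SELECT a.name, COUNT(*) AS book_count
        FROM authors a
        JOIN book_authors ba ON ba.author_id = a.rowid
        GROUP BY a.rowid, a.name
        ORDER BY book_count DESC, a.name
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()

# Hypothetical sample data; the books table is not needed for the count.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (name TEXT);
    CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
    INSERT INTO authors (rowid, name) VALUES (1, 'Johnny'), (2, 'Kelly'), (3, 'Ramboz');
    INSERT INTO book_authors VALUES (1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3);
""")
top = top_authors(conn, limit=2)
print(top)
```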
Couldn't you just
select count(1) , Author_ID from Book_Authors group by Author_ID order by count(1) desc limit 100
The authors with the most books would be at the top (or at least their author_ID would be).
As for limiting to the top 100, add a limit clause (see: Sqlite LIMIT / OFFSET query).
SELECT TOP 3 authors.author_name, authors.book_name, books.sold_copies,
(SELECT SUM(books.sold_copies) FROM books WHERE authors.book_name = books.book_name ) AS Total
FROM authors
INNER JOIN books
ON authors.book_name = books.book_name
ORDER BY sold_copies desc

Is there a better way to find the most popular title in a 'self-linked' table of user posts?

I have this (simplified for space) table schema with user posts and related comments:
create table
tbl_post (
id integer primary key,
title text not null,
content text not null,
post_id integer null
);
where tbl_post.post_id is the id of the main post that a comment row belongs to,
or null if the row is itself a main, authored title (that is, not a comment).
I'm using this sqlite query to figure out the most popular title in the posts table (the criterion is how many comments relate to it):
select
title
from
tbl_post
where
id = (
select
post_id
from (
select
post_id, count(post_id) as tot
from
tbl_post
where
ifnull(post_id, '') != ''
group by
post_id
order by
tot desc
limit 1
)
);
which looks quite bulky to me with those two nested select statements. I would like to make the query simpler (shorter, potentially faster) somehow. Thanks.
How about a self-join?
SELECT p.Id, p.title, p.content, COUNT(c.Id) AS nbOfComments
FROM tbl_post p
LEFT JOIN tbl_post c ON p.Id = c.post_id
WHERE p.post_id IS NULL
GROUP BY p.Id, p.title, p.content
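The self-join lists every main post with its comment count. To get just the single most popular title, one option is to sort by that count and keep the first row. A minimal sketch, assuming SQLite and made-up sample rows (ties are broken arbitrarily here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_post (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        content TEXT NOT NULL,
        post_id INTEGER NULL
    );
    -- Hypothetical sample data: post 1 has one comment, post 2 has two.
    INSERT INTO tbl_post VALUES
        (1, 'A', 'main a', NULL),
        (2, 'B', 'main b', NULL),
        (3, 're', 'comment on a', 1),
        (4, 're', 'comment on b', 2),
        (5, 're', 'another comment on b', 2);
""")

sql = """
    SELECT p.title, COUNT(c.id) AS nbOfComments
    FROM tbl_post p
    LEFT JOIN tbl_post c ON c.post_id = p.id   -- c = comments attached to p
    WHERE p.post_id IS NULL                    -- keep only main posts
    GROUP BY p.id, p.title
    ORDER BY nbOfComments DESC
    LIMIT 1
"""
most_popular = conn.execute(sql).fetchone()
print(most_popular)
```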

Query: Find most recent 5 posts from distinct users

Let's say that we have a users table, and a user can have many posts (posts have the user_id column).
I want to retrieve posts for the first 5 users, but only one post per user. So, at the end I want to have 5 posts, where each post belongs to a different user. How can I do that in SQL?
You should have two tables
Table : users ; Columns : users_id , user_name
Table2: posts ; Columns : post_id , post_description , users_id
And now, to retrieve users with one post each:
SELECT * FROM users AS u
LEFT JOIN posts AS p ON p.post_id = (
SELECT p2.post_id FROM posts AS p2 WHERE p2.users_id = u.users_id LIMIT 1
) -- any one post of this user
ORDER BY u.users_id ASC
LIMIT 5
If you want to get the oldest post for each user:
SELECT * FROM users AS u
LEFT JOIN (
SELECT users_id, MIN(post_id) AS post_id -- oldest post id per user
FROM posts
GROUP BY users_id
) AS m ON m.users_id = u.users_id
LEFT JOIN posts AS p ON p.post_id = m.post_id
ORDER BY u.users_id ASC
LIMIT 5
And for latest post use MAX(post_id) instead of MIN(post_id)
Maybe it will help you:
select *
from users u
join posts p on p.idUser = u.id
and p.id = ( select max(id) from posts where idUser = u.id )
ORDER BY u.id LIMIT 5
This will give you the latest five postings (assuming a higher ID post is "newer"), forcing each result to be from a different user.
SELECT p.user_id, MAX(p.post_id) AS post_id
FROM posts AS p
GROUP BY p.user_id -- get unique users
ORDER BY MAX(p.post_id) DESC -- sort results by each user's newest post_id, descending
LIMIT 5
With this method, if you need to load other user or post data, it would be best to load it separately after you get the desired user ids and post ids.
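If you would rather fetch the user and post data in the same statement, the grouped result can be joined back to both tables. The sketch below assumes the users / posts columns listed earlier in this thread, an in-memory SQLite database, and invented sample rows; latest_posts is a hypothetical helper name. Like the query above, it treats a higher post_id as "newer".

```python
import sqlite3

def latest_posts(conn, limit=5):
    """Newest post of each user, most recent first, at most `limit` rows."""
    sql = """
        SELECT u.user_name, p.post_id, p.post_description
        FROM (
            SELECT users_id, MAX(post_id) AS post_id   -- newest post per user
            FROM posts
            GROUP BY users_id
        ) AS latest
        JOIN posts p ON p.post_id = latest.post_id
        JOIN users u ON u.users_id = p.users_id
        ORDER BY p.post_id DESC
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (users_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, post_description TEXT, users_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts VALUES
        (1, 'a1', 1), (2, 'b1', 2), (3, 'a2', 1), (4, 'c1', 3), (5, 'b2', 2);
""")
newest_two = latest_posts(conn, limit=2)
print(newest_two)
```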

GROUP BY and COUNT in PostgreSQL

The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
Returns n records in Postgresql:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I just want to retrieve the number of records returned: 6.
I used a subquery to achieve what I want, but this doesn't seem optimum:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How would I get the number of records in this context right in PostgreSQL?
I think you just need COUNT(DISTINCT post_id) FROM votes.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
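To see that the EXISTS form and the count(DISTINCT ...) form agree on the result, here is a small check on invented data. It is only a sketch: it uses an in-memory SQLite database rather than Postgres, so it says nothing about relative speed, and it writes SELECT 1 inside EXISTS because the empty select list is a Postgres feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id)
    );
    -- Hypothetical sample data: posts 1, 2 and 4 have votes, 3 and 5 have none.
    INSERT INTO posts VALUES (1), (2), (3), (4), (5);
    INSERT INTO votes (post_id) VALUES (1), (1), (1), (2), (4), (4);
""")

exists_sql = """
    SELECT count(*)
    FROM posts p
    WHERE EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id)
"""
distinct_sql = """
    SELECT count(DISTINCT p.id)
    FROM posts p
    JOIN votes v ON v.post_id = p.id
"""

exists_ct = conn.execute(exists_sql).fetchone()[0]
distinct_ct = conn.execute(distinct_sql).fetchone()[0]
print(exists_ct, distinct_ct)
```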
If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
This may actually be faster than the EXISTS query with no or few entries per post.
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 with each of the three queries and appended the resulting total runtimes.
1. As is.
2. After ..
ANALYZE posts;
ANALYZE votes;
3. After ..
CREATE INDEX foo on votes(post_id);
4. After ..
VACUUM FULL ANALYZE posts;
CLUSTER votes using foo;
count(*) ... WHERE EXISTS
1. 253 ms
2. 220 ms
3. 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
4. 85 ms
count(DISTINCT x) - long form with join
1. 354 ms
2. 358 ms
3. 373 ms -- (index scan on posts, index scan on votes, merge join)
4. 330 ms
count(DISTINCT x) - short form without join
1. 164 ms
2. 164 ms
3. 164 ms -- (always seq scan)
4. 142 ms
Best time for original query in question:
353 ms
For simplified version:
348 ms
@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
Select first row in each GROUP BY group?
Using OVER() and LIMIT 1:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;
For followers, I like the OP's inner query method:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
Since then you can use HAVING in there as well:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id HAVING count(*) > 1
) as x;
Or the equivalent CTE
with posts_coalesced as (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id )
select count(*) from posts_coalesced;
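For anyone who wants to try the subquery-plus-HAVING variant quickly, here is a rough sketch on an in-memory SQLite database with made-up rows. The helper count_groups and its min_votes parameter are hypothetical, added only to show how the HAVING threshold changes the number of groups counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id)
    );
    -- Hypothetical sample data: post 1 has three votes, post 2 one, post 3 none.
    INSERT INTO posts VALUES (1), (2), (3);
    INSERT INTO votes (post_id) VALUES (1), (1), (1), (2);
""")

def count_groups(conn, min_votes=1):
    """Count posts that have at least `min_votes` votes."""
    sql = """
        SELECT COUNT(*) FROM (
            SELECT posts.id
            FROM posts
            INNER JOIN votes ON votes.post_id = posts.id
            GROUP BY posts.id
            HAVING COUNT(*) >= ?
        ) AS x
    """
    return conn.execute(sql, (min_votes,)).fetchone()[0]

all_voted = count_groups(conn)        # posts with any vote
multi_voted = count_groups(conn, 2)   # posts with two or more votes
print(all_voted, multi_voted)
```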