GROUP BY and COUNT in PostgreSQL

The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
Returns the following rows in PostgreSQL:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I just want to retrieve the number of records returned: 6.
I used a subquery to achieve what I want, but this doesn't seem optimal:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How do I get the number of records directly in PostgreSQL?

I think you just need COUNT(DISTINCT post_id) FROM votes.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
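Spelled out against the votes table from the question, that is simply:
SELECT COUNT(DISTINCT post_id) FROM votes;  -- returns 6 for the sample data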

There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres, with multiple entries on the n-side (as you probably have), it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
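To compare the variants yourself, prefix either query with EXPLAIN ANALYZE, e.g.:
EXPLAIN ANALYZE
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);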
If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
It may actually be faster than the EXISTS query when there are no or few entries per post.
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 with each of the three queries, and appended the resulting total runtimes.
1. As is.
2. After ANALYZE posts; ANALYZE votes;
3. After CREATE INDEX foo ON votes(post_id);
4. After VACUUM FULL ANALYZE posts; CLUSTER votes USING foo;
count(*) ... WHERE EXISTS
1: 253 ms
2: 220 ms
3: 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
4: 85 ms
count(DISTINCT x) - long form with join
1: 354 ms
2: 358 ms
3: 373 ms -- (index scan on posts, index scan on votes, merge join)
4: 330 ms
count(DISTINCT x) - short form without join
1: 164 ms
2: 164 ms
3: 164 ms -- (always seq scan)
4: 142 ms
Best time for original query in question:
353 ms
For simplified version:
348 ms
wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
Select first row in each GROUP BY group?

Using COUNT(1) OVER() and LIMIT 1 - the window function counts across the entire grouped result, so every row carries the total and a single row is enough:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;

WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;

For those following along, I like the OP's inner-query method:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
You can then use HAVING in there as well:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id HAVING count(*) > 1
) as x;
Or the equivalent CTE:
with posts_coalesced as (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id )
select count(*) from posts_coalesced;

Related

Filter on foreign key with LATERAL JOIN yields strange results

Thanks for your time!
Basically, I'm trying to filter an N:M table using foreign keys, with 0, 1, or N different tags. The problem is that LEFT JOIN LATERAL yields bizarre results.
Please don't mind the strange casting; I'm doing it that way because I'm using Spring Boot.
Here is a fiddle showing a fake relationship:
https://www.db-fiddle.com/f/6bDu33keWACHssLqznk88n/0
Schema (PostgreSQL v13)
CREATE TABLE posts (id int primary key);
CREATE TABLE tags (id int primary key);
CREATE TABLE post_tags (post_id int references posts(id),
tags_id int references tags(id),
primary key (post_id, tags_id));
INSERT INTO posts VALUES (1), (2), (3), (4);
INSERT INTO tags VALUES (8), (9);
INSERT INTO post_tags VALUES (1,8), (1,9), (2,8);
Query #1
select * from posts p
left join lateral (select * from post_tags pt where pt.post_id = p.id) pt on 1=1
where (1 is null or pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9' AS TEXT), ',') AS INT[])));
id | post_id | tags_id
---+---------+---------
 1 |       1 |       9
Query #2
select * from posts p
left join lateral (select * from post_tags pt where pt.post_id = p.id limit 1) pt on 1=1
where (1 is null or pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9' AS TEXT), ',') AS INT[])));
(no rows)
Query #3
select * from posts p
left join lateral (select * from post_tags pt where pt.post_id = p.id) pt on 1=1
where (1 is null or pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9,8' AS TEXT), ',') AS INT[])));
id | post_id | tags_id
---+---------+---------
 1 |       1 |       9
 1 |       1 |       8
 2 |       2 |       8
Query #4
select * from posts p
left join lateral (select * from post_tags pt where pt.post_id = p.id limit 1) pt on 1=1
where (1 is null or pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9,8' AS TEXT), ',') AS INT[])));
id | post_id | tags_id
---+---------+---------
 1 |       1 |       8
 2 |       2 |       8
If you notice, query #2 yields no results, although it should. I suspect the LIMIT 1 is not allowing it to function properly. But if I remove it, I get duplicate results (as seen in query #3).
My question is: how can I filter on foreign keys without getting duplicate results?
EDIT ---
I expect the query to return at most 1 result per category that matches the where clause;
Query #2 should return:
id | post_id | tags_id
---+---------+---------
 1 |       1 |       9
Or, in case multiple tags are passed, it should return results like query #4: both matches (posts 1 and 2), but no duplicated posts (post id = 1).
Thanks
Obviously there is a problem with the WHERE clause. It is filtering on pt.tags_id, but that column comes from the left join, so it may be NULL; when a post has no matching tags, it is always filtered out.
It also occurs to me that you don't really need a join (which may cause the cardinality issues you are seeing); if you just want to filter the posts per tag, EXISTS seems more appropriate:
select p.*
from posts p
where exists (
    select 1
    from post_tags pt
    where pt.post_id = p.id
    and pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9,8' AS TEXT), ',') AS INT[]))
);
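If the matching tags_id should also appear in the output (as in the expected result for query #2), one sketch that keeps the LATERAL approach is to move the tag filter inside the lateral subquery, so LIMIT 1 only ever picks from rows that already match:
select p.id, pt.tags_id
from posts p
cross join lateral (
    select pt.tags_id
    from post_tags pt
    where pt.post_id = p.id
    and pt.tags_id = any(cast(STRING_TO_ARRAY(CAST('9,8' AS TEXT), ',') AS INT[]))
    limit 1  -- picks one arbitrary matching tag per post
) pt;
Posts with no matching tag produce no row from the subquery and drop out of the CROSS JOIN LATERAL, so each matching post appears exactly once.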

Get latest child row with parent table row

Two tables, posts and comments. posts has many comments (comments has a post_id foreign key referencing the posts primary key id):
posts
id | content
------------
comments
id | post_id | text | created_at
-------------------------------
I need all posts, their content, and the latest comment (based on max(created_at)) along with its text.
I can get as far as created_at using this:
with comment_latest as (select
post_id,
max(created_at) as latest_commented_at
from comments
group by 1)
select
posts.id,
posts.content,
comment_latest.latest_commented_at
from posts
left join comment_latest on comment_latest.post_id = posts.id
order by posts.id desc
limit 10
But I want the text of the comment as well.
You can use the Postgres extension distinct on:
select distinct on (p.id) p.*, c.*
from posts p left join
comments c
on p.id = c.post_id
order by p.id desc, c.created_at desc
limit 10;
This sorts the data by the ORDER BY clause and returns the first row for each distinct value of the DISTINCT ON keys.

SQL to calculate author with most books

I have a table of books, a table of authors, and a "linker" table (many to many links between authors/books).
How do I find the authors with the highest number of books?
This is my schema:
books : rowid, name
authors : rowid, name
book_authors : rowid, book_id, author_id
This is what I came up with (but it doesn't work):
SELECT count(*) IN book_authors
WHERE (SELECT count(*) IN book_authors
WHERE author_id = author_id)
And ideally I would like a report of the top 100 authors, something like:
author_name book_count
-----------------------------------
Johnny 25
Kelly 12
Ramboz 10
Do I need some kind of join? What is the fastest approach?
I'd join the three tables (via the book_authors table), group by the author, count occurrences and limit it to the top 100 rows:
SELECT a.name, COUNT(*)
FROM authors a
JOIN book_authors ba ON a.rowid = ba.author_id
JOIN books b ON ba.book_id = b.rowid
GROUP BY a.name
ORDER BY 2 DESC
LIMIT 100
EDIT:
Actually, we aren't using any data from books, just the fact that the book exists, which can be inferred from book_authors, so this query can be improved by dropping the second join:
SELECT a.name, COUNT(*)
FROM authors a
JOIN book_authors ba ON a.rowid = ba.author_id
GROUP BY a.name
ORDER BY 2 DESC
LIMIT 100
Couldn't you just
select count(1), Author_ID from Book_Authors group by Author_ID order by count(1) desc limit 100
The authors with the most books would be at the top (or at least their author_ID would be).
As for limiting to the top 100: add a LIMIT clause (see Sqlite LIMIT / OFFSET query).
SELECT authors.author_name, authors.book_name, books.sold_copies,
(SELECT SUM(books.sold_copies) FROM books WHERE authors.book_name = books.book_name) AS Total
FROM authors
INNER JOIN books
ON authors.book_name = books.book_name
ORDER BY sold_copies DESC
LIMIT 3

Retrieve rows that match all the values listed

Hi, I need to get the rows which match all the group_ids listed as an array:
SELECT user_id,group_id
FROM group_privilege_details g
WHERE g.group_id in (102,101)
This returns a row if any one of the group_ids matches. But I need the user_ids which have all the group_ids mentioned in the list.
Assuming that you cannot have duplicate user_id/group_id combinations:
SELECT user_id,count(group_id)
FROM group_privilege_details g
WHERE g.group_id in (102,101)
GROUP BY user_id
HAVING count(group_id) = 2
Here is a variant of Steven's query for generic arrays:
SELECT user_id
FROM group_privilege_details
WHERE group_id = ANY(my_array)
GROUP BY 1
HAVING count(*) = array_length(my_array, 1)
This works as long as these requirements are met (not mentioned in the question):
(user_id, group_id) is unique in group_privilege_details.
The array has only one dimension.
The base array elements are unique.
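For instance, with the group ids from the question substituted for the my_array placeholder (a sketch; my_array stands for whatever array value you pass in):
SELECT user_id
FROM group_privilege_details
WHERE group_id = ANY('{101,102}'::int[])
GROUP BY 1
HAVING count(*) = array_length('{101,102}'::int[], 1);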
A generic solution that works regardless of these preconditions:
WITH ids AS (SELECT DISTINCT unnest(my_array) group_id)
SELECT g.user_id
FROM (SELECT user_id, group_id FROM group_privilege_details GROUP BY 1,2) g
JOIN ids USING (group_id)
GROUP BY 1
HAVING count(*) = (SELECT count(*) FROM ids)
unnest() produces one row per base-element. DISTINCT removes possible dupes. The subselect does the same for the table.
An extensive list of options for this kind of query: How to filter SQL results in a has-many-through relation
Here is the query I ended up with:
select user_id, login_name
from user_info
where user_id in (
    select user_id
    from group_privilege_details g
    where g.group_id in (
        select group_id
        from group_privilege_details g, user_info u
        where u.user_id = g.user_id and login_name = '123')
    group by user_id
    having count(group_id) = (
        select count(group_id)
        from group_privilege_details g, user_info u
        where u.user_id = g.user_id and login_name = '123'))
and login_name != '123'

Optimizing Oracle Query

Running EXPLAIN PLAN on this query, I get a full table access.
The two tables used are:
user_role: 803507 rows
cmp_role: 27 rows
Query:
SELECT
r.user_id, r.role_id, r.participant_code, MAX(status_id)
FROM
user_role r,
cmp_role c
WHERE
r.role_id = c.role_id
AND r.participant_code IS NOT NULL
AND c.group_id = 3
GROUP BY
r.user_id, r.role_id, r.participant_code
HAVING MAX(status_id) IN (SELECT b.status_id FROM USER_ROLE b
WHERE (b.ACTIVE = 1 OR ( b.ACTIVE IN ( 0,3 )
AND SYSDATE BETWEEN b.effective_from_date AND b.effective_to_date
))
)
How can I write this query better so that it returns results in a decent time? These are the indexes:
idx 1 = role_id
idx 2 = last_updt_user_id
idx 3 = actv_id, participant_code, effective_from_Date, effective_to_date
idx 4 = user_id, role_id, effective_from_Date, effective_to_date
idx 5 = participant_code, user_id, role_id, actv_cd
Explain plan:
Q_PLAN
--------------------------------------------------------------------------------
SELECT STATEMENT
FILTER
HASH GROUP BY
HASH JOIN
TABLE ACCESS BY INDEX ROWID ROLE
INDEX RANGE SCAN N_ROLE_IDX2
TABLE ACCESS FULL USER_ROLE
TABLE ACCESS BY INDEX ROWID USER_ROLE
INDEX UNIQUE SCAN U_USER_ROLE_IDX1
FILTER
HASH GROUP BY
HASH JOIN
TABLE ACCESS BY INDEX ROWID ROLE
INDEX RANGE SCAN N_ROLE_IDX2
TABLE ACCESS FULL USER_ROLE
TABLE ACCESS BY INDEX ROWID USER_ROLE
INDEX UNIQUE SCAN U_USER_ROLE_IDX1
I do not have enough privileges to run stats on the table.
I tried the following changes, but they shave off only 1 or 2 seconds:
WITH CTE AS (SELECT b.status_id FROM USER_ROLE b
WHERE (b.ACTIVE = 1 OR ( b.ACTIVE IN ( 0,3 )
AND SYSDATE BETWEEN b.effective_from_date AND b.effective_to_date
))
)
SELECT
r.user_id, r.role_id, r.participant_code, MAX(status_id)
FROM
user_role r,
cmp_role c
WHERE
r.role_id = c.role_id
AND r.participant_code IS NOT NULL
AND c.group_id = 3
GROUP BY
r.user_id, r.role_id, r.participant_code
HAVING MAX(status_id) IN (select * from CTE)
Firstly, you have the subquery:
SELECT b.status_id FROM USER_ROLE b
WHERE (b.ACTIVE = 1
OR ( b.ACTIVE IN ( 0,3 )
AND SYSDATE BETWEEN b.effective_from_date AND b.effective_to_date )
)
There is no way that you can do anything other than a full table scan to get that result.
You may be missing a join, but not knowing what you expect your query to do, there's no way for us to tell.
Secondly, depending on the proportion of cmp_role records with a group_id of 3, and the proportion of user_role than match those roles, it may be better off doing the full scan there. If, say, 3 out of the 27 cmp_role records are in group 3, and 100,000 of the user_role records match those cmp_role records, then it can be more efficient doing a single scan of the table than doing 100,000 index lookups.
Collect statistics for the tables, then EXPLAIN PLAN the query and show the results.
I think the following approach will work. I would have thought the subquery would be evaluated only once since it is not correlated, but this doesn't seem to be the case. I tried a similar (simple) query against the sales table in the sh demo schema. I modified it to use a materialized CTE approach and it ran in 1 second as opposed to 18 seconds - 10 times faster. See below for the approach.
with cte as (
select /*+materialize*/ max(amount_sold) from sales)
select prod_id,sum(amount_sold) from
sales
group by prod_id
having max(amount_sold) in(
select * from cte)
/
So in your case, you materialize the subquery as:
with CTE as (
SELECT /*+ materialize */ b.status_id FROM USER_ROLE b
WHERE (b.ACTIVE = 1 OR ( b.ACTIVE IN ( 0,3 )
AND SYSDATE BETWEEN b.effective_from_date AND b.effective_to_date
))
)
and use SELECT * FROM CTE in the main query.
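Put together, the rewritten query might look like this (a sketch combining the original query with the hinted CTE; the /*+ materialize */ hint asks Oracle to evaluate the CTE once):
WITH cte AS (
    SELECT /*+ materialize */ b.status_id
    FROM user_role b
    WHERE (b.active = 1 OR (b.active IN (0, 3)
           AND SYSDATE BETWEEN b.effective_from_date AND b.effective_to_date))
)
SELECT r.user_id, r.role_id, r.participant_code, MAX(status_id)
FROM user_role r, cmp_role c
WHERE r.role_id = c.role_id
AND r.participant_code IS NOT NULL
AND c.group_id = 3
GROUP BY r.user_id, r.role_id, r.participant_code
HAVING MAX(status_id) IN (SELECT status_id FROM cte);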
So you have a query that currently takes 16.5 seconds and you want it to run faster. To do that, you need to know where those 16.5 seconds are spent. The Oracle database is extremely well instrumented, so you can see in great detail what it is doing. You can check this thread that I wrote on the OTN Forums:
http://forums.oracle.com/forums/thread.jspa?messageID=1812597
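As a minimal example of that instrumentation (a sketch; the linked thread covers far more precise methods), you can trace your own session and format the resulting trace file with tkprof:
ALTER SESSION SET sql_trace = TRUE;
-- run the slow query here, then disable tracing
ALTER SESSION SET sql_trace = FALSE;
-- then, on the server: tkprof <trace_file> report.txt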
Without knowing where your time is being spent, all efforts are just guesses ...
Regards,
Rob.