PostgreSQL - GROUP BY clause - sql

I want to search by tags, then list all articles with those tags, along with how many of the given tags each article matches. So for example I might have:
Page1 - 2 (has css and php tag)
Page2 - 1 (has only css tag)
Query:
SELECT COUNT(t.tag)
FROM a_tags t
JOIN w_articles2tag a2t ON a2t.tag = t.id
JOIN w_article a ON a.id = a2t.article
WHERE t.tag = 'css' OR t.tag = 'php'
GROUP BY t.tag
LIMIT 9
When I select only COUNT(t.tag), the query works and I get reasonable results. But if I add another column, e.g. the title of my article, I get the following error:
ERROR: column "a.title" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT COUNT(t.tag), a.title FROM a_tags t
How to add said columns to this query?

This works directly in Postgres 9.1 or later. Quoting the release notes of 9.1:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
The SQL standard allows this behavior, and because of the primary key,
the result is unambiguous.
Related:
Return a grouped list with occurrences using Rails and PostgreSQL
The queries in the question and in @Michael's answer have the logic backwards. We want to count how many tags match per article, not how many articles have a certain tag. So we need to GROUP BY w_article.id, not by a_tags.id.
list all articles with that tag, and also how many of given tags they match
To fix this:
SELECT count(t.tag) AS ct, a.* -- any column from table a allowed ...
FROM a_tags t
JOIN w_articles2tag a2t ON a2t.tag = t.id
JOIN w_article a ON a.id = a2t.article
WHERE t.tag IN ('css', 'php')
GROUP BY a.id -- ... since PK is in GROUP BY
LIMIT 9;
Assuming id is the primary key of w_article.
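The counting logic can be tried out on toy data. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres; the sample rows and the extra ORDER BY are assumptions added for illustration. Note that SQLite accepts ungrouped columns regardless of the primary key, so this only demonstrates grouping by article, not the Postgres 9.1 rule itself.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE w_article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE a_tags (id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE w_articles2tag (article INTEGER, tag INTEGER);
    -- Hypothetical sample data matching the example in the question
    INSERT INTO w_article VALUES (1, 'Page1'), (2, 'Page2'), (3, 'Page3');
    INSERT INTO a_tags VALUES (1, 'css'), (2, 'php'), (3, 'sql');
    INSERT INTO w_articles2tag VALUES (1, 1), (1, 2), (2, 1), (3, 3);
""")

# Group by the article, so the count is "matching tags per article".
rows = conn.execute("""
    SELECT a.title, count(t.tag) AS ct
    FROM a_tags t
    JOIN w_articles2tag a2t ON a2t.tag = t.id
    JOIN w_article a ON a.id = a2t.article
    WHERE t.tag IN ('css', 'php')
    GROUP BY a.id
    ORDER BY ct DESC, a.id
    LIMIT 9
""").fetchall()

for title, ct in rows:
    print(f"{title} - {ct}")
```

Page3 carries only the sql tag, so the WHERE clause removes it before grouping.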
However, this form will be faster while doing the same:
SELECT a.*, ct
FROM (
SELECT a2t.article AS id, count(*) AS ct
FROM a_tags t
JOIN w_articles2tag a2t ON a2t.tag = t.id
WHERE t.tag IN ('css', 'php') -- same filter as above
GROUP BY 1
LIMIT 9 -- LIMIT early - cheaper
) sub
JOIN w_article a USING (id); -- attached alias to article in the sub
Closely related answer from just yesterday:
Why does the following join increase the query time significantly?

When you use a "GROUP BY" clause, you need to enclose all columns that are not grouped in an aggregate function. Try adding title to the GROUP BY list, or selecting "min(a.title)" instead.
SELECT COUNT(t.tag), a.title FROM a_tags t
JOIN w_articles2tag a2t ON a2t.tag = t.id
JOIN w_article a ON a.id = a2t.article
WHERE t.tag = 'css' OR t.tag = 'php' GROUP BY t.tag, a.title LIMIT 9

Related

How to convert [NULL] to NULL in Postgres SQL statement?

In Postgres, I set out to write a SQL statement that would return various fields from one table, along with a column containing an array of tag strings that come from another table. I've made quite good progress with this code:
SELECT p.photo_id, p.name, p.path, array_agg(t.tag) as tags FROM photos p
JOIN users u USING (user_id)
LEFT JOIN photo_tags pt USING (photo_id)
LEFT JOIN tags t USING (tag_id)
WHERE u.user_id = 'some_uuid'
GROUP BY p.photo_id, p.name, p.path
ORDER BY date(p.date_created) DESC, p.date_created ASC
Everything is working exactly like I intended except for one thing: If a given photo has no tags attached to it then this is being returned: [NULL]
I would prefer to return just NULL rather than null in an array. I've tried several things, including using coalesce and ifnull but couldn't fix things precisely the way I want.
Not the end of the world if an array with NULL is returned by the endpoint but if you know a way to return just NULL instead, I would appreciate learning how to do this.
You can filter out NULLs during aggregation with a FILTER clause.
If no non-null tag remains for a photo, array_agg returns NULL instead of [NULL]:
SELECT array_agg(t.tag) filter (where t.tag is not null) as tags
FROM ...
I would go with a subquery in your case:
SELECT p.photo_id, p.name, p.path, agg_tags as tags
FROM photos p
JOIN users u USING (user_id)
LEFT JOIN (
SELECT pt.photo_id, array_agg(t.tag) AS agg_tags -- one array per photo
FROM photo_tags pt
JOIN tags t USING (tag_id)
GROUP BY pt.photo_id
) tg USING (photo_id) -- photos without tags get NULL
WHERE u.user_id = 'some_uuid'
ORDER BY date(p.date_created) DESC, p.date_created ASC
You did not post much information about your schema, table sizes and so on, but a LATERAL join could be an alternative to the syntax above.

SQL for a query with several input IDs, how to get the first 5 results for each ID

I have a query that accepts several IDs as filters in a WHERE clause.
It's formatted something like this:
SELECT a.ID, a.VOLUMETRY, b.ANNOY_DISTANCE
FROM PRODUCT a
JOIN RECOMMENDATIONS b on a.ID = b.ID
WHERE a.ID in ('0001','0002', ...., '0099')
ORDER BY b.ANNOY_DISTANCE
Now this query can return several thousand results for each ID, but I only need the first 5 for each ID after ordering them by the ANNOY_DISTANCE column. The rest aren't needed and would only slow post-processing of the data.
How can I change this so that the query result only gives the first 5 rows for each ID?
Use window functions, which you can filter using a QUALIFY clause:
SELECT p.ID, p.VOLUMETRY, r.ANNOY_DISTANCE
FROM PRODUCT p JOIN
RECOMMENDATIONS r
ON p.ID = r.ID
WHERE p.ID in ('0001','0002', ...., '0099')
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.ID ORDER BY r.ANNOY_DISTANCE) <= 5
ORDER BY r.ANNOY_DISTANCE;
Notice that I changed your table aliases to be meaningful abbreviations for the table names. That is a best practice.
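QUALIFY is only available in some engines (Postgres, for one, does not support it). A portable variant computes ROW_NUMBER() in a subquery and filters on it in the outer query. Here is a minimal sketch of that variant, run through Python's sqlite3 module (window functions need SQLite 3.25 or later); the sample rows and the final ORDER BY are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY, volumetry INTEGER);
    CREATE TABLE recommendations (id TEXT, annoy_distance REAL);
    INSERT INTO product VALUES ('0001', 10), ('0002', 20);
""")
# Hypothetical data: seven recommendations per product, distances 0.1 .. 0.7
conn.executemany(
    "INSERT INTO recommendations VALUES (?, ?)",
    [(pid, d / 10) for pid in ("0001", "0002") for d in range(1, 8)],
)

# Number the rows per product by distance, then keep the first five of each.
rows = conn.execute("""
    SELECT id, volumetry, annoy_distance
    FROM (
        SELECT p.id, p.volumetry, r.annoy_distance,
               ROW_NUMBER() OVER (PARTITION BY p.id
                                  ORDER BY r.annoy_distance) AS rn
        FROM product p
        JOIN recommendations r ON p.id = r.id
        WHERE p.id IN ('0001', '0002')
    ) ranked
    WHERE rn <= 5
    ORDER BY id, annoy_distance
""").fetchall()

for row in rows:
    print(row)
```

The filter on rn has to live in the outer query because window functions are evaluated after WHERE, which is the restriction QUALIFY exists to work around.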

Adding a string for the join condition on SQL

I have two tables, articles and log. I'm trying to join the two tables and look only for the articles that appear in the log. The only relation between the two tables is articles.slug, which shows the title of the article, and log.path, which shows the same text as articles.slug but with '/article/' at the beginning. Example:
This is the log.path: '/article/bad-things-gone'
This is the articles.slug: 'bad-things-gone'
I'm trying to do this:
SELECT articles.title, count
FROM articles join
(SELECT path, COUNT(*) as count
FROM log
GROUP BY path
ORDER BY count desc
) as a
ON ('/article/' + articles.slug) = a.path
but it is not working; it says I cannot add the string '/article/' to articles.slug.
Is there a way to do this? Thanks.
I suspect you want:
SELECT a.title, l.count
FROM articles a join
(SELECT l.path, COUNT(*) as count
FROM log l
GROUP BY l.path
) l
ON ('/article/' || a.slug) = l.path
ORDER BY l.count desc;
Notes:
Postgres uses the standard operator || for string concatenation.
Ordering in a subquery has nothing to do with ordering in the outer query.
Use table aliases that relate to the table name or subquery.
Qualify all column names.
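The notes above can be checked on toy data. This is a minimal sketch using Python's sqlite3 module, which also supports the standard || operator; the sample rows are made up, and the count column is renamed to views (an assumption) to avoid clashing with the function name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (title TEXT, slug TEXT);
    CREATE TABLE log (path TEXT);
    -- Hypothetical sample rows
    INSERT INTO articles VALUES ('Bad things gone', 'bad-things-gone'),
                                ('Never viewed', 'never-viewed');
    INSERT INTO log VALUES ('/article/bad-things-gone'),
                           ('/article/bad-things-gone'),
                           ('/');
""")

# Build the path from the slug with ||, then join on the result.
rows = conn.execute("""
    SELECT a.title, l.views
    FROM articles a
    JOIN (SELECT path, COUNT(*) AS views
          FROM log
          GROUP BY path) l
      ON ('/article/' || a.slug) = l.path
    ORDER BY l.views DESC
""").fetchall()

print(rows)
```

Because this is an inner join, the article that never appears in the log is left out of the result.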
Use the CONCAT function:
SELECT CONCAT('/article/', slug) FROM articles;

SQL Query for getting all elements matching another element where there are at least X elements

I want a query that will give me all the items with an ID matching another item, where the group of matching items is larger than X.
Say I have two tables, submissions and submission_items. Each submission has a submission_id and each submission_item has a foreign key that is the parent submission's submission_id. Submissions can have multiple submission_items.
I want to get all the submissions that have more than X submission_items in them.
I tried this:
select submissions.*, submission_items.*
from submissions
join submission_items ON submissions.submission_id = submission_items.submission_id
group by submissions.submission_id
having count(*) > 1
It errored out saying it wanted other fields in the GROUP BY function. How can I do this better?
When you specify a GROUP BY, fields in the SELECT statement must be either part of the group clause or part of an aggregate.
You need to find the IDs that you are interested in and then rejoin it back to the table.
SELECT * FROM
submissions s INNER JOIN
(select submissions.submission_id
from submissions
join submission_items ON submissions.submission_id = submission_items.submission_id
group by submissions.submission_id
having count(*) > 1) c
ON s.submission_id = c.submission_id
or you can group all the fields individually
select submissions.submission_id, submissions.x, submissions.y
from submissions
join submission_items ON submissions.submission_id = submission_items.submission_id
group by submissions.submission_id, submissions.x, submissions.y
having count(*) > 1
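The "find the IDs, then rejoin" approach might look like the following when the item rows are wanted as well. This is a minimal sketch using Python's sqlite3 module; the name, item_id and label columns, the sample rows, and the threshold variable X are all assumptions for illustration.

```python
import sqlite3

X = 1  # keep submissions with more than X items (hypothetical threshold)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submissions (submission_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submission_items (
        item_id INTEGER PRIMARY KEY,
        submission_id INTEGER,
        label TEXT
    );
    -- Hypothetical sample rows: submission 1 has two items, submission 2 has one
    INSERT INTO submissions VALUES (1, 'first'), (2, 'second');
    INSERT INTO submission_items VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

# The subquery finds the qualifying IDs; the outer joins fetch the full rows.
rows = conn.execute("""
    SELECT s.submission_id, s.name, i.item_id, i.label
    FROM submissions s
    JOIN submission_items i ON i.submission_id = s.submission_id
    JOIN (
        SELECT submission_id
        FROM submission_items
        GROUP BY submission_id
        HAVING COUNT(*) > ?
    ) big ON big.submission_id = s.submission_id
    ORDER BY i.item_id
""", (X,)).fetchall()

for row in rows:
    print(row)
```

Keeping the aggregation inside the subquery means the outer SELECT needs no GROUP BY, so any column from either table can be listed.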

Using UNNEST with a JOIN

I want to be able to use unnest() function in PostgreSQL in a complicated SQL query that has many JOINs. Here's the example query:
SELECT 9 as keyword_id, COUNT(DISTINCT mentions.id) as total, tags.parent_id as tag_id
FROM mentions
INNER JOIN taggings ON taggings.mention_id = mentions.id
INNER JOIN tags ON tags.id = taggings.tag_id
WHERE mentions.taglist && ARRAY[9] AND mentions.search_id = 3
GROUP BY tags.parent_id
I want to eliminate the taggings table here, because my mentions table has an integer array field named taglist that consists of all linked tag ids of mentions.
I tried the following:
SELECT 9 as keyword_id, COUNT(DISTINCT mentions.id) as total, tags.parent_id as tag_id
FROM mentions
INNER JOIN tags ON tags.id IN (SELECT unnest(taglist))
WHERE mentions.taglist && ARRAY[9] AND mentions.search_id = 3
GROUP BY tags.parent_id
This works but brings different results than the first query.
So what I want to do is to use the result of the SELECT unnest(taglist) in a JOIN query to compensate for the taggings table.
How can I do that?
UPDATE: taglist is the same set as the respective list of tag ids of mention.
Technically, your query might work like this (not entirely sure about the objective of this query):
SELECT 9 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id AS tag_id
FROM (
SELECT m.id, unnest(m.taglist) AS tag_id -- m.id is needed for the outer count
FROM mentions m
WHERE m.search_id = 3
AND 9 = ANY (m.taglist)
) m
JOIN tags t USING (tag_id) -- assumes tag.tag_id!
GROUP BY t.parent_id;
However, it seems to me you are going in the wrong direction here. Normally one would remove the redundant array taglist and keep the normalized database schema. Then your original query should serve well, with only the syntax shortened by aliases:
SELECT 9 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id AS tag_id
FROM mentions m
JOIN taggings mt ON mt.mention_id = m.id
JOIN tags t ON t.id = mt.tag_id
WHERE 9 = ANY (m.taglist)
AND m.search_id = 3
GROUP BY t.parent_id;
Unravel the mystery
<rant>
The root cause for your "different results" is the unfortunate naming convention that some intellectually challenged ORMs impose on people.
I am speaking of id as column name. Never use this anti-pattern in a database with more than one table. Right, that means basically any database. As soon as you join a bunch of tables (that's what you do in a database) you end up with a bunch of columns named id. Utterly pointless.
The ID column of a table named tag should be tag_id (unless there is another descriptive name). Never id.
</rant>
Your query inadvertently counts tags instead of mentions:
SELECT 25 AS keyword_id, count(m.id) AS total, t.parent_id AS tag_id
FROM (
SELECT unnest(m.taglist) AS id
FROM mentions m
WHERE m.search_id = 4
AND 25 = ANY (m.taglist)
) m
JOIN tags t USING (id)
GROUP BY t.parent_id;
It should work this way:
SELECT 25 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id
FROM (
SELECT m.id, unnest(m.taglist) AS tag_id
FROM mentions m
WHERE m.search_id = 4
AND 25 = ANY (m.taglist)
) m
JOIN tags t ON t.id = m.tag_id
GROUP BY t.parent_id;
I also added back the DISTINCT to your count() that got lost along the way in your query.
Something like this should work:
...
tags t INNER JOIN
(SELECT UNNEST(taglist) as idd) a ON t.id = a.idd
...