sum greater than in subquery - sql

I'm building a PostgreSQL database that revolves around democracy. All displayed data is controlled by the users, weighted by their percentage of power.
I'm struggling to write an SQL query in which a tag on a post should only be shown once the sum of all the percentages for that tag reaches a certain threshold.
The relations between the tables (relevant to this question) look like this:
The post_tags table is used for deciding what tag stays on what post, decided by the users based on their percentage.
It may look something like this:
approved_by   post_id   tag_id   percentage
1             1         1         0.33
5             1         3         0.45
7             1         3         0.25
6             1         3         0.15
4             1         1         0.90
1             1         2         0.45
1             1         6        -0.60
6             1         2        -0.15
How do you write an SQL query that selects a post and its tags if the percentage sum is above a certain threshold?
In the case of SUM(post_tags.percentage) > 0.75, only the tags with tag_id 1 and 3 should show.
So far, I have written this SQL query, but it contains duplicates in the array_agg (might be a separate issue), and the HAVING only seems to depend on the total sum of all the tags in the array_agg.
SELECT
posts.post_id, array_agg(tags.name) AS tags
FROM
posts, tags, post_tags
WHERE
post_tags.post_id = posts.post_id AND
post_tags.tag_id = tags.tag_id
GROUP BY
posts.post_id
HAVING
SUM(post_tags.percentage) > 0.75
LIMIT 10;
I assume I might need to do a subquery within the query, but you can't do SUM inside the WHERE clause. I'm a bit lost at this issue.
Any help is appreciated
UPDATE
Because I think at least two queries need to come into play, I think this should be one of them:
SELECT
tags.name
FROM
post_tags, posts, tags
WHERE
post_tags.tag_id = tags.tag_id AND
post_tags.post_id = posts.post_id AND
posts.post_id = 1
GROUP BY
tags.tag_id
HAVING
SUM(post_tags.percentage) > 0.75
In this case, it's only for post 1, and I don't know how to continue this query for all posts

It's easy to get confused, so start small and expand the SQL query as you go.
Build the inner (parenthesized) query first and get it working on its own, then wrap the outer query around it.
In this case, the tags relevant for a single post can be found like so:
SELECT
t.name
FROM
post_tags
INNER JOIN tags t ON t.tag_id = post_tags.tag_id
INNER JOIN posts p2 ON p2.post_id = post_tags.post_id AND p2.post_id = 1
GROUP BY
t.tag_id
HAVING
SUM(post_tags.percentage) > 0.75
To apply the query to every post, replace the literal 1 with a reference to the post of the outer query (p.post_id), which turns the inner query into a correlated subquery. The complete SQL query becomes this:
SELECT p.post_id, ARRAY(
SELECT
t.name
FROM
post_tags
INNER JOIN tags t ON t.tag_id = post_tags.tag_id
INNER JOIN posts p2 ON p2.post_id = post_tags.post_id AND p2.post_id = p.post_id
GROUP BY
t.tag_id
HAVING
SUM(post_tags.percentage) > 0.75
) AS tags
FROM posts p
Big thanks to Tim Biegeleisen, who helped replace the comma-separated FROM lists with INNER JOINs (though performance-wise, both tested equally fast in this case).
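As a quick sanity check of the per-tag grouping, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in chosen so the example runs anywhere; it is an assumption, not the asker's PostgreSQL setup. Grouping by post and tag puts only tags 1 and 3 over 0.75, while grouping by post alone sums every tag together, which is why the original HAVING let all tags through.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post_tags (
    approved_by INTEGER, post_id INTEGER, tag_id INTEGER, percentage REAL
);
INSERT INTO post_tags VALUES
    (1, 1, 1, 0.33), (5, 1, 3, 0.45), (7, 1, 3, 0.25), (6, 1, 3, 0.15),
    (4, 1, 1, 0.90), (1, 1, 2, 0.45), (1, 1, 6, -0.60), (6, 1, 2, -0.15);
""")

# One group per (post, tag) pair, so the threshold applies to each tag separately.
passing_tags = conn.execute("""
    SELECT post_id, tag_id
    FROM post_tags
    GROUP BY post_id, tag_id
    HAVING SUM(percentage) > 0.75
    ORDER BY post_id, tag_id
""").fetchall()

# One group per post: the sum mixes all tags, which is what the original query tested.
post_totals = conn.execute("""
    SELECT post_id, ROUND(SUM(percentage), 2)
    FROM post_tags
    GROUP BY post_id
""").fetchall()

print(passing_tags)  # [(1, 1), (1, 3)]
print(post_totals)   # [(1, 1.78)]
```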

One idea would be to first aggregate the total percentage for each post/tag pair in a subquery. The subquery gives you a new join table unique_post_tags (one entry per post and tag, including the total_percentage for each post/tag pair). You can then select from unique_post_tags in the outer query, filtering out irrelevant tags in the WHERE clause:
SELECT unique_post_tags.post_id, unique_post_tags.tag_id FROM
(
SELECT post_id, tag_id, sum(percentage) as total_percentage
FROM post_tags
GROUP BY post_id, tag_id
) AS unique_post_tags
WHERE unique_post_tags.total_percentage > 0.75
To actually select the tag names per post and group it into an array as you requested, the above query can be extended like this:
SELECT unique_post_tags.post_id, array_agg(t.name) AS tags FROM
(
SELECT post_id, tag_id, sum(percentage) as total_percentage
FROM post_tags
GROUP BY post_id, tag_id
) AS unique_post_tags
LEFT JOIN tags t ON t.tag_id = unique_post_tags.tag_id
WHERE unique_post_tags.total_percentage > 0.75
GROUP BY unique_post_tags.post_id
Update
After looking at your answer more closely, I now realize that my idea of reducing the join table to the relevant entries first, can be implemented entirely in the subquery using the GROUP BY/HAVING approach you initially suggested:
SELECT relevant_post_tags.post_id , array_agg(t.name) AS tags FROM
(
SELECT post_id, tag_id
FROM post_tags
GROUP BY post_id, tag_id
HAVING SUM(percentage) > 0.75
) AS relevant_post_tags
LEFT JOIN tags t ON t.tag_id = relevant_post_tags.tag_id
GROUP BY relevant_post_tags.post_id;
Or written as CTE (for readability):
WITH relevant_post_tags AS (
SELECT post_id, tag_id
FROM post_tags
GROUP BY post_id, tag_id
HAVING SUM(percentage) > 0.75
)
SELECT relevant_post_tags.post_id, array_agg(t.name) AS tags
FROM relevant_post_tags
LEFT JOIN tags t ON t.tag_id = relevant_post_tags.tag_id
GROUP BY relevant_post_tags.post_id;
If the 0.75 limit is static, you could also create a relevant_post_tags view in the DB and select from there directly. I did not measure the performance of the above. My guess is that the query optimizer takes care of it, but note that before PostgreSQL 12, CTEs were always materialized and acted as an optimization fence.
The approach I came up with is a bit different from what you initially asked for: the result set for the queries above will only contain posts that actually have tags.
If you need to select all posts, you can expand like this:
WITH relevant_post_tags AS (
SELECT post_id, tag_id
FROM post_tags
GROUP BY post_id, tag_id
HAVING SUM(percentage) > 0.75
)
SELECT p.post_id, array_remove(array_agg(t.name), NULL) AS tags
FROM posts p
LEFT JOIN relevant_post_tags pt ON pt.post_id = p.post_id
LEFT JOIN tags t ON t.tag_id = pt.tag_id
GROUP BY p.post_id;
Or closer to your solution:
WITH relevant_post_tags AS (
SELECT post_id, tag_id
FROM post_tags
GROUP BY post_id, tag_id
HAVING SUM(percentage) > 0.75
)
SELECT p.post_id, ARRAY(
SELECT t.name
FROM relevant_post_tags pt
JOIN tags t ON t.tag_id = pt.tag_id
WHERE pt.post_id = p.post_id
) AS tags
FROM posts p;
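To see how the "all posts" variant treats a post with no qualifying tags, here is a small sketch run against in-memory SQLite via Python's sqlite3 (a stand-in, not PostgreSQL). The tag names, the second post, and its rows are made up for illustration, and the column names follow the question's schema (post_id, tag_id). SQLite has no array_agg, so the rows are grouped in Python instead, skipping NULL names the way array_remove(..., NULL) would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE post_tags (
    approved_by INTEGER, post_id INTEGER, tag_id INTEGER, percentage REAL
);
INSERT INTO posts VALUES (1), (2);
INSERT INTO tags VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');  -- hypothetical names
INSERT INTO post_tags VALUES
    (1, 1, 1, 0.33), (4, 1, 1, 0.90),   -- tag 1 on post 1 sums to 1.23
    (5, 1, 3, 0.45), (7, 1, 3, 0.40),   -- tag 3 on post 1 sums to 0.85
    (1, 1, 2, 0.45),                    -- tag 2 on post 1 stays below 0.75
    (1, 2, 2, 0.50);                    -- post 2 has no tag above 0.75
""")

rows = conn.execute("""
    WITH relevant_post_tags AS (
        SELECT post_id, tag_id
        FROM post_tags
        GROUP BY post_id, tag_id
        HAVING SUM(percentage) > 0.75
    )
    SELECT p.post_id, t.name
    FROM posts p
    LEFT JOIN relevant_post_tags pt ON pt.post_id = p.post_id
    LEFT JOIN tags t ON t.tag_id = pt.tag_id
    ORDER BY p.post_id, t.name
""").fetchall()

# Group names per post; a NULL name means the LEFT JOIN found no qualifying tag.
tags_by_post = {}
for post_id, name in rows:
    names = tags_by_post.setdefault(post_id, [])
    if name is not None:
        names.append(name)

print(tags_by_post)  # {1: ['alpha', 'gamma'], 2: []}
```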

Related

SQL Query: How to make a query for identifying a post without a specific 'Tag'?

posts has tags
table poststags with postid, tagid as fields
How do I create a SQL statement/query so I can get a list of posts without a specific tag.
SELECT posts.postid FROM posts
LEFT JOIN tags ON tags.postid = posts.postid
WHERE tags.tagid != 5
The statement above won't work.
post 1 with tags 1, 2, 3
post 2 with tags 1, 5
post 3 with tags 2, 3, 5
I want the result to be post 1 because it doesn't have tagid 5
An alternative...
SELECT
*
FROM
posts
WHERE
NOT EXISTS (
SELECT *
FROM tags
WHERE tags.postid = posts.postid
AND tags.tagid = 5
)
The exact anti-semi-join (rows where something does not exist in another table) with the best performance depends on the database you're using.
(Although never use NOT IN() for this type of problem. NULLs will mess you around and performance will often suck.)
An anti-join will do the job. For example:
select p.*
from posts p
left join tags t on p.postid = t.postid and t.tagid = 5
where t.tagid is null
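The difference between the three approaches can be checked on the three sample posts from the question. This is a minimal sketch using Python's sqlite3 with an in-memory database (an assumption made for portability; the table layout posts(postid) and tags(postid, tagid) is inferred from the queries above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (postid INTEGER PRIMARY KEY);
CREATE TABLE tags (postid INTEGER, tagid INTEGER);
INSERT INTO posts VALUES (1), (2), (3);
INSERT INTO tags VALUES
    (1, 1), (1, 2), (1, 3),   -- post 1 with tags 1, 2, 3
    (2, 1), (2, 5),           -- post 2 with tags 1, 5
    (3, 2), (3, 3), (3, 5);   -- post 3 with tags 2, 3, 5
""")

# The filter only drops the rows for tag 5; posts 2 and 3 survive via their other tags.
naive = sorted({row[0] for row in conn.execute("""
    SELECT posts.postid FROM posts
    LEFT JOIN tags ON tags.postid = posts.postid
    WHERE tags.tagid != 5
""")})

# NOT EXISTS: keep a post only if no tag-5 row exists for it.
not_exists = [row[0] for row in conn.execute("""
    SELECT postid FROM posts
    WHERE NOT EXISTS (
        SELECT 1 FROM tags
        WHERE tags.postid = posts.postid AND tags.tagid = 5
    )
    ORDER BY postid
""")]

# LEFT JOIN ... IS NULL: join only tag-5 rows, keep posts where nothing matched.
anti_join = [row[0] for row in conn.execute("""
    SELECT p.postid FROM posts p
    LEFT JOIN tags t ON p.postid = t.postid AND t.tagid = 5
    WHERE t.tagid IS NULL
    ORDER BY p.postid
""")]

print(naive)       # [1, 2, 3]
print(not_exists)  # [1]
print(anti_join)   # [1]
```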

PostgreSQL NOT IN does not work correctly with JOIN

I have trivial tables post, tag and post_tags in a trivial Many-To-Many relationship. I want to select some posts by including and excluding some tags. I tried many variations of SQL queries, but none of them works for excluding tags.
I started from a query like this:
SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id
INNER JOIN tag ON post_tags.tag_id = tag.id
WHERE tag.name IN ('Science','Culture')
AND tag.name NOT IN ('War', 'Crime')
GROUP BY post.id
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;
But, unfortunately, this does not work. I see posts with the tag "War" in the result set. Then I tried to move the NOT IN condition to a separate subquery on tag and join to it:
SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id
INNER JOIN (SELECT * FROM tag WHERE name NOT IN ('War', 'Crime')) AS tags
ON post_tags.tag_id = tags.id
WHERE tags.name IN ('Science','Culture')
GROUP BY post.id
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;
Even tried to exclude some posts in first JOIN like this:
SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id
AND post_tags.tag_id NOT IN (SELECT id FROM tag WHERE name IN ('War', 'Crime'))
INNER JOIN tag ON post_tags.tag_id = tag.id
WHERE tag.name IN ('Science','Culture')
GROUP BY post.id
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;
But none of this works. I am especially confused about the second query (joining with a filtered result set instead of the table).
Using PostgreSQL version 9.3, OS Ubuntu 14.04.
Any thoughts?
It is working fine. It is your logic that is off. You are filtering out the very tags that you want to check for. So, they are not part of the check.
Instead, move the conditions to the having clause:
SELECT p.*
FROM post p INNER JOIN
post_tags pt
ON p.id = pt.post_id INNER JOIN
tag t
ON pt.tag_id = t.id
WHERE t.name IN ('Science', 'Culture', 'War', 'Crime')
GROUP BY p.id
HAVING SUM(CASE WHEN t.name IN ('Science', 'Culture') THEN 1 ELSE 0 END) > 1 AND
SUM(CASE WHEN t.name IN ('War', 'Crime') THEN 1 ELSE 0 END) = 0
ORDER BY p.rating DESC;
There is a difference between ignoring a value (in the where clause) versus checking that it is not there (in the having clause).
This is an application of relational-division. Check out the tag description.
You have to define what you want exactly. Posts with one of the "good" tags and none of the "bad" tags? Or all of the good tags?
The best query technique depends on the table layout. Typically we'd assume referential integrity and that (post_id, tag_id) is defined unique in post_tags, but that's not defined.
Assuming that, and describing your problem as:
Return the 50 posts with the highest rating with at least one of the tags ('Science','Culture') and none of the tags ('War', 'Crime').
We can translate this plain English sentence into SQL directly:
SELECT p.*
FROM post p
WHERE EXISTS ( -- at least one of the tags ('Science','Culture')
SELECT 1
FROM tag t
JOIN post_tags pt ON pt.tag_id = t.id
WHERE pt.post_id = p.id
AND t.name IN ('Science', 'Culture')
)
AND NOT EXISTS ( -- none of the tags ('War', 'Crime')
SELECT 1
FROM tag t
JOIN post_tags pt ON pt.tag_id = t.id
WHERE pt.post_id = p.id
AND t.name IN ('War', 'Crime')
)
ORDER BY p.rating DESC -- with the highest rating
LIMIT 50; -- 50 posts
This is typically faster than grouping rows and counting - and also works if (post_id, tag_id) is not unique.
More techniques for relational division:
How to filter SQL results in a has-many-through relation
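To make the difference between filtering in WHERE and checking in HAVING or EXISTS concrete, here is a small sketch on made-up data, run against in-memory SQLite through Python's sqlite3 (a stand-in for PostgreSQL; the four posts and their ratings are hypothetical). Post 2 carries Science, Culture and War, so the WHERE-only version still returns it, while both corrected versions exclude it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, rating REAL);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE post_tags (post_id INTEGER, tag_id INTEGER, UNIQUE (post_id, tag_id));
INSERT INTO post VALUES (1, 5.0), (2, 9.0), (3, 7.0), (4, 1.0);
INSERT INTO tag VALUES (1, 'Science'), (2, 'Culture'), (3, 'War'), (4, 'Crime');
INSERT INTO post_tags VALUES
    (1, 1), (1, 2),           -- post 1: Science, Culture
    (2, 1), (2, 2), (2, 3),   -- post 2: Science, Culture, War
    (3, 2),                   -- post 3: Culture
    (4, 4);                   -- post 4: Crime
""")

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# WHERE drops the 'War' row before grouping, so post 2 is never rejected.
where_only = ids("""
    SELECT p.id FROM post p
    JOIN post_tags pt ON p.id = pt.post_id
    JOIN tag t ON pt.tag_id = t.id
    WHERE t.name IN ('Science', 'Culture') AND t.name NOT IN ('War', 'Crime')
    GROUP BY p.id HAVING COUNT(*) > 1
    ORDER BY p.rating DESC
""")

# Conditional sums in HAVING: both good tags present and no bad tag present.
having_version = ids("""
    SELECT p.id FROM post p
    JOIN post_tags pt ON p.id = pt.post_id
    JOIN tag t ON pt.tag_id = t.id
    WHERE t.name IN ('Science', 'Culture', 'War', 'Crime')
    GROUP BY p.id
    HAVING SUM(CASE WHEN t.name IN ('Science', 'Culture') THEN 1 ELSE 0 END) > 1
       AND SUM(CASE WHEN t.name IN ('War', 'Crime') THEN 1 ELSE 0 END) = 0
    ORDER BY p.rating DESC
""")

# EXISTS / NOT EXISTS: at least one good tag and none of the bad tags.
exists_version = ids("""
    SELECT p.id FROM post p
    WHERE EXISTS (
        SELECT 1 FROM tag t JOIN post_tags pt ON pt.tag_id = t.id
        WHERE pt.post_id = p.id AND t.name IN ('Science', 'Culture'))
    AND NOT EXISTS (
        SELECT 1 FROM tag t JOIN post_tags pt ON pt.tag_id = t.id
        WHERE pt.post_id = p.id AND t.name IN ('War', 'Crime'))
    ORDER BY p.rating DESC
""")

print(where_only)      # [2, 1]
print(having_version)  # [1]
print(exists_version)  # [3, 1]
```

Note that the two corrected queries answer slightly different questions: the HAVING version requires both good tags, the EXISTS version at least one.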

Using UNNEST with a JOIN

I want to be able to use unnest() function in PostgreSQL in a complicated SQL query that has many JOINs. Here's the example query:
SELECT 9 as keyword_id, COUNT(DISTINCT mentions.id) as total, tags.parent_id as tag_id
FROM mentions
INNER JOIN taggings ON taggings.mention_id = mentions.id
INNER JOIN tags ON tags.id = taggings.tag_id
WHERE mentions.taglist && ARRAY[9] AND mentions.search_id = 3
GROUP BY tags.parent_id
I want to eliminate the taggings table here, because my mentions table has an integer array field named taglist that consists of all linked tag ids of mentions.
I tried following:
SELECT 9 as keyword_id, COUNT(DISTINCT mentions.id) as total, tags.parent_id as tag_id
FROM mentions
INNER JOIN tags ON tags.id IN (SELECT unnest(taglist))
WHERE mentions.taglist && ARRAY[9] AND mentions.search_id = 3
GROUP BY tags.parent_id
This works but brings different results than the first query.
So what I want to do is to use the result of the SELECT unnest(taglist) in a JOIN query to compensate for the taggings table.
How can I do that?
UPDATE: taglist is the same set as the respective list of tag ids of mention.
Technically, your query might work like this (not entirely sure about the objective of this query):
SELECT 9 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id AS tag_id
FROM (
SELECT m.id, unnest(m.taglist) AS tag_id -- keep m.id so the outer count() can see it
FROM mentions m
WHERE m.search_id = 3
AND 9 = ANY (m.taglist)
) m
JOIN tags t USING (tag_id) -- assumes tags.tag_id!
GROUP BY t.parent_id;
However, it seems to me you are going in the wrong direction here. Normally one would remove the redundant array taglist and keep the normalized database schema. Then your original query should serve well; I only shortened the syntax with aliases:
SELECT 9 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id AS tag_id
FROM mentions m
JOIN taggings mt ON mt.mention_id = m.id
JOIN tags t ON t.id = mt.tag_id
WHERE 9 = ANY (m.taglist)
AND m.search_id = 3
GROUP BY t.parent_id;
Unravel the mystery
<rant>
The root cause for your "different results" is the unfortunate naming convention that some intellectually challenged ORMs impose on people.
I am speaking of id as column name. Never use this anti-pattern in a database with more than one table. Right, that means basically any database. As soon as you join a bunch of tables (that's what you do in a database) you end up with a bunch of columns named id. Utterly pointless.
The ID column of a table named tag should be tag_id (unless there is another descriptive name). Never id.
</rant>
Your query inadvertently counts tags instead of mentions:
SELECT 25 AS keyword_id, count(m.id) AS total, t.parent_id AS tag_id
FROM (
SELECT unnest(m.taglist) AS id
FROM mentions m
WHERE m.search_id = 4
AND 25 = ANY (m.taglist)
) m
JOIN tags t USING (id)
GROUP BY t.parent_id;
It should work this way:
SELECT 25 AS keyword_id, count(DISTINCT m.id) AS total, t.parent_id
FROM (
SELECT m.id, unnest(m.taglist) AS tag_id
FROM mentions m
WHERE m.search_id = 4
AND 25 = ANY (m.taglist)
) m
JOIN tags t ON t.id = m.tag_id
GROUP BY t.parent_id;
I also added back the DISTINCT to your count() that got lost along the way in your query.
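The effect of counting joined rows versus distinct mentions can be modelled outside the database. The following is only a plain-Python emulation of the unnest-and-join step, not SQL, and the mentions, tag ids, and parent ids are hypothetical sample values.

```python
from collections import defaultdict

# Hypothetical sample data: three mentions and three tags under two parents.
mentions = [
    {"id": 1, "search_id": 4, "taglist": [25, 30, 31]},
    {"id": 2, "search_id": 4, "taglist": [25, 30]},
    {"id": 3, "search_id": 9, "taglist": [25]},  # filtered out by search_id
]
tag_parent = {25: 100, 30: 200, 31: 200}  # tags.id -> tags.parent_id

# Emulates: SELECT m.id, unnest(m.taglist) AS tag_id
#           FROM mentions m WHERE m.search_id = 4 AND 25 = ANY (m.taglist)
unnested = [
    (m["id"], tag_id)
    for m in mentions
    if m["search_id"] == 4 and 25 in m["taglist"]
    for tag_id in m["taglist"]
]

rows_per_parent = defaultdict(int)      # counts every joined row (one per tag)
mentions_per_parent = defaultdict(set)  # behaves like count(DISTINCT m.id)
for mention_id, tag_id in unnested:
    parent_id = tag_parent[tag_id]      # the JOIN tags t ON t.id = m.tag_id
    rows_per_parent[parent_id] += 1
    mentions_per_parent[parent_id].add(mention_id)

row_counts = dict(rows_per_parent)
distinct_counts = {p: len(found) for p, found in mentions_per_parent.items()}

print(row_counts)       # {100: 2, 200: 3}
print(distinct_counts)  # {100: 2, 200: 2}
```

Mention 1 has two tags under parent 200, so it is counted twice unless the count is taken over distinct mention ids.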
Something like this should work:
...
tags t INNER JOIN
(SELECT UNNEST(taglist) as idd) a ON t.id = a.idd
...

Query on three tables with 1 condition

I have the following tables:
tags
id tag_name
examples
id category heading
examples_tags
id tag_id example_id
How can I retrieve the number of examples under each tag? (a bit like stackoverflow actually :))
I also want an additional condition of the type:
examples.category = "english examples"
This is how I started ...
SELECT tags.id, tags.tag_name, COUNT( examples_tags.tag_id ) AS 'no_tags'
WHERE tags.id = examples_tags.tag_id
&&
examples.category = 'english examples'
GROUP BY tags.id
Thanks .
Without explicit joins, the correct query would be:
SELECT tags.id, tags.tag_name, COUNT(*) AS num_tags
FROM tags, examples_tags, examples
WHERE tags.id = examples_tags.tag_id
and examples_tags.example_id=examples.id
and examples.category = 'english examples'
GROUP BY tags.id, tags.tag_name
You need to group by all non-aggregated fields.
Alternatively, you could use inner joins, which makes the query more readable:
SELECT tags.id, tags.tag_name, COUNT(*) AS num_tags
FROM tags
inner join examples_tags on examples_tags.tag_id=tags.id
inner join examples on examples_tags.example_id=examples.id
WHERE examples.category = 'english examples'
GROUP BY tags.id, tags.tag_name
This is a 3 table join, using the many-to-many table examples_tags in the middle between the tags and examples tables. You also have to group by every column that is not an aggregate in the select list.
SELECT t.id, t.tag_name, COUNT( *) AS 'no_tags'
FROM tags t
JOIN examples_tags et
ON t.id = et.tag_id
JOIN examples e
ON et.example_id = e.id
WHERE
e.category = 'english examples'
GROUP BY t.id, t.tag_name
ORDER BY t.tag_name
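As a rough check of the three-table join, the sketch below runs it on a few made-up rows in an in-memory SQLite database via Python's sqlite3 (the tag names, categories, and headings are hypothetical; the table and column names are the ones from the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (id INTEGER PRIMARY KEY, tag_name TEXT);
CREATE TABLE examples (id INTEGER PRIMARY KEY, category TEXT, heading TEXT);
CREATE TABLE examples_tags (id INTEGER PRIMARY KEY, tag_id INTEGER, example_id INTEGER);
INSERT INTO tags VALUES (1, 'grammar'), (2, 'vocabulary');
INSERT INTO examples VALUES
    (1, 'english examples', 'first'),
    (2, 'english examples', 'second'),
    (3, 'french examples', 'third');
INSERT INTO examples_tags VALUES
    (1, 1, 1), (2, 1, 2), (3, 1, 3),   -- grammar on examples 1, 2, 3
    (4, 2, 1), (5, 2, 3);              -- vocabulary on examples 1, 3
""")

# The link table sits in the middle; the category filter applies to examples.
counts = conn.execute("""
    SELECT t.tag_name, COUNT(*) AS num_examples
    FROM tags t
    JOIN examples_tags et ON et.tag_id = t.id
    JOIN examples e ON e.id = et.example_id
    WHERE e.category = 'english examples'
    GROUP BY t.id, t.tag_name
    ORDER BY t.tag_name
""").fetchall()

print(counts)  # [('grammar', 2), ('vocabulary', 1)]
```

Tags with no matching example do not appear at all with inner joins; listing them with a count of zero would need a LEFT JOIN with the category condition moved into the join.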

Fetch fields from a table that has the same relation to another table

I'll try to explain my case as well as I can.
I'm making a website where you can find topics by browsing their tags. Nothing strange there. I'm having a tricky time with some of the queries though. They might be easy for you, my mind is pretty messed up from doing a lot of work :P.
I have the tables "topics" and "tags". They are joined using the table tags_topics, which contains topic_id and tag_id. When the user wants to find a topic, they might first select one tag to filter by, and then add another one to the filter. Then I make a query for fetching all topics that have both of the selected tags. They might also have other tags, but they MUST have the tags chosen to filter by. The number of tags to filter by differs, but we always have a list of user-selected tags to filter by.
This was mostly answered in Filtering from join-table and I went for the multiple-joins solution.
Now I need to fetch the tags that the user can filter by. So if we already have a defined filter of 2 tags, I need to fetch all tags, except those in the filter, that are associated with topics that include all the tags in the filter. This might sound weird, so I'll give a practical example :P
Let's say we have three topics: tennis, gym and golf.
tennis has tags: sport, ball, court and racket
gym has tags: sport, training and muscles
golf has tags: sport, ball, stick and outside
User selects tag sport, so we show all three tennis, gym and golf, and we show ball, court, racket, training, muscles, stick and outside as other possible filters.
User now adds ball to the filter. Filter is now sport and ball, so we show the topics tennis and golf, with court, racket, stick and outside as additional possible filters.
User now adds court to the filter, so we show tennis and racket as an additional possible filter.
I hope I'm making some sense. By the way, I'm using MySQL.
SELECT DISTINCT `tags`.`tag`
FROM `tags`
LEFT JOIN `tags_topics` ON `tags`.`id` = `tags_topics`.`tag_id`
LEFT JOIN `topics` ON `tags_topics`.`topic_id` = `topics`.`id`
LEFT JOIN `tags_topics` AS `tt1` ON `tt1`.`topic_id` = `topics`.`id`
LEFT JOIN `tags` AS `t1` ON `t1`.`id` = `tt1`.`tag_id`
LEFT JOIN `tags_topics` AS `tt2` ON `tt2`.`topic_id` = `topics`.`id`
LEFT JOIN `tags` AS `t2` ON `t2`.`id` = `tt2`.`tag_id`
LEFT JOIN `tags_topics` AS `tt3` ON `tt3`.`topic_id` = `topics`.`id`
LEFT JOIN `tags` AS `t3` ON `t3`.`id` = `tt3`.`tag_id`
WHERE `t1`.`tag` = 'tag1'
AND `t2`.`tag` = 'tag2'
AND `t3`.`tag` = 'tag3'
AND `tags`.`tag` NOT IN ('tag1', 'tag2', 'tag3')
SELECT topic_id
FROM topic_tag
WHERE tag_id = 1
OR tag_id = 2
OR tag_id = 3
GROUP BY topic_id
HAVING COUNT(topic_id) = 3;
The above query will get all topic_ids that have all three tag_ids of 1, 2 and 3. Then use this as a subquery:
SELECT DISTINCT tag_name
FROM tag
INNER JOIN topic_tag
ON tag.tag_id = topic_tag.tag_id
WHERE topic_id IN
( SELECT topic_id
FROM topic_tag
WHERE tag_id = 1
OR tag_id = 2
OR tag_id = 3
GROUP BY topic_id
HAVING COUNT(topic_id) = 3
)
AND tag.tag_id NOT IN (1, 2, 3) -- OR-ing the <> tests would be true for every tag
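The tennis, gym and golf example from the question can be replayed against this approach. Below is a minimal sketch using Python's sqlite3 with an in-memory database; the helper name remaining_filters and the numeric ids are made up for illustration. It assumes each (topic_id, tag_id) pair occurs only once, and it excludes the already selected tags with NOT IN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
CREATE TABLE topic_tag (topic_id INTEGER, tag_id INTEGER, UNIQUE (topic_id, tag_id));
INSERT INTO tag VALUES
    (1, 'sport'), (2, 'ball'), (3, 'court'), (4, 'racket'),
    (5, 'training'), (6, 'muscles'), (7, 'stick'), (8, 'outside');
INSERT INTO topic_tag VALUES
    (1, 1), (1, 2), (1, 3), (1, 4),   -- tennis: sport, ball, court, racket
    (2, 1), (2, 5), (2, 6),           -- gym: sport, training, muscles
    (3, 1), (3, 2), (3, 7), (3, 8);   -- golf: sport, ball, stick, outside
""")

def remaining_filters(conn, selected_tag_ids):
    """Tags (other than the selected ones) on topics that carry every selected tag."""
    marks = ",".join("?" * len(selected_tag_ids))
    sql = f"""
        SELECT DISTINCT tag.tag_name
        FROM tag
        JOIN topic_tag ON tag.tag_id = topic_tag.tag_id
        WHERE topic_tag.topic_id IN (
            SELECT topic_id
            FROM topic_tag
            WHERE tag_id IN ({marks})
            GROUP BY topic_id
            HAVING COUNT(*) = ?          -- topic matched every selected tag
        )
        AND tag.tag_id NOT IN ({marks})  -- do not offer tags already in the filter
        ORDER BY tag.tag_name
    """
    params = [*selected_tag_ids, len(selected_tag_ids), *selected_tag_ids]
    return [row[0] for row in conn.execute(sql, params)]

print(remaining_filters(conn, [1]))        # sport
print(remaining_filters(conn, [1, 2]))     # sport + ball
print(remaining_filters(conn, [1, 2, 3]))  # sport + ball + court
```

With sport and ball selected this returns court, outside, racket and stick, matching the walkthrough in the question.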
I think this is what you are looking for.
Select a.topic_id
from join_table a
where exists( select *
from join_table b
where a.tag_id = b.tag_id
and b.topic_id = selected_topic )
group by a.topic_id
having count(*) = ( select count(*)
from join_table c
where c.topic_id = selected_topic )
Should give you a list of topics which have all of the tags for selected_topic.
A generic solution off the top of my head, so it may contain typos:
CREATE VIEW shared_tags_count AS
SELECT topic_to_tag1.topic_id AS topic_id1, topic_to_tag2.topic_id AS topic_id2, COUNT(*) as number
FROM topic_to_tag as topic_to_tag1
JOIN topic_to_tag as topic_to_tag2
ON topic_to_tag1.topic_id <> topic_to_tag2.topic_id
AND topic_to_tag1.tag_id = topic_to_tag2.tag_id
GROUP BY topic_to_tag1.topic_id, topic_to_tag2.topic_id;
CREATE VIEW tags_count AS
SELECT topic_id, COUNT(*) as number
FROM topic_to_tag
GROUP BY topic_id;
CREATE VIEW related_topics AS
SELECT shared_tags_count.topic_id1, shared_tags_count.topic_id2
FROM shared_tags_count
JOIN tags_count
ON topic_id=topic_id1
AND shared_tags_count.number = tags_count.number;
CREATE VIEW related_tags AS
SELECT related_topics.topic_id1 as topic_id, topic_to_tag.tag_id
FROM related_topics
JOIN topic_to_tag
ON related_topics.topic_id2 = topic_to_tag.topic_id;
You just have to query the related_tags view.
Interesting challenge btw.