Count all existing combinations of groupings of records - sql

I have these db tables
questions: id, text
answers: id, text, question_id
answer_tags: id, answer_id, tag_id
tags: id, text
question has many answers
answer has many tags through answer_tags, belongs to question
tag has many answers through answer_tags
An answer has an unlimited number of tags
I would like to show all combinations of groupings of tags that exist ordered by count
Example data
Question 1, Answer 1, tag1, tag2, tag3, tag4
Question 2, Answer 2, tag2, tag3, tag4
Question 3, Answer 3, tag3, tag4
Question 4, Answer 4, tag4
Question 5, Answer 5, tag3, tag4, tag5
Question 1, Answer 6, <no tags>
How can I solve this using SQL?
I'm not sure if this is possible with SQL, but if it is, I think it would need a RECURSIVE query.
Expected results:
tag3, tag4 occur 4 times
tag2, tag3, tag4 occur 2 times
tag2, tag3 occur 2 times
We would only return results with groupings greater than 1. No single tag is ever returned, it must be at least 2 tags together to be counted.

Building on @filiprem's answer and using a slightly modified function from the answer here you get:
--test data
create table questions (id int, text varchar(100));
create table answers (id int, text varchar(100), question_id int);
create table answer_tags (id int, answer_id int, tag_id int);
create table tags (id int, text varchar(100));
insert into questions values (1, 'question1'), (2, 'question2'), (3, 'question3'), (4, 'question4'), (5, 'question5');
insert into answers values (1, 'answer1', 1), (2, 'answer2', 2), (3, 'answer3', 3), (4, 'answer4', 4), (5, 'answer5', 5), (6, 'answer6', 1);
insert into tags values (1, 'tag1'), (2, 'tag2'), (3, 'tag3'), (4, 'tag4'), (5, 'tag5');
insert into answer_tags values
(1,1,1), (2,1,2), (3,1,3), (4,1,4),
(5,2,2), (6,2,3), (7,2,4),
(8,3,3), (9,3,4),
(10,4,4),
(11,5,3), (12,5,4), (13,5,5);
--end test data
--function to get all possible combinations from an array with at least 2 elements
create or replace function get_combinations(source anyarray) returns setof anyarray as $$
with recursive combinations(combination, indices) as (
select source[i:i], array[i] from generate_subscripts(source, 1) i
union all
select c.combination || source[j], c.indices || j
from combinations c, generate_subscripts(source, 1) j
where j > all(c.indices) and
array_length(c.combination, 1) <= 2
)
select combination from combinations
where array_length(combination, 1) >= 2
$$ language sql;
--expected results
SELECT tags, count(*) FROM (
SELECT q.id, get_combinations(array_agg(DISTINCT t.text)) AS tags
FROM questions q
JOIN answers a ON a.question_id = q.id
JOIN answer_tags at ON at.answer_id = a.id
JOIN tags t ON t.id = at.tag_id
GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;
Note: this gives tag2,tag4 occurs 2 times which was missed in the expected results (from questions 1 and 2)
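The counts above can be cross-checked outside the database. The following is a minimal Python sketch, not part of the original answer: it hard-codes the distinct tags per question from the test data and counts every combination of two or more tags. The names `tags_by_question` and `count_tag_combinations` are illustrative. Unlike `get_combinations`, it does not cap the combination length, which makes no difference for this data.

```python
from collections import Counter
from itertools import combinations

# Distinct tags per question, copied from the test data
# (question 1 also owns the untagged answer 6).
tags_by_question = {
    1: {"tag1", "tag2", "tag3", "tag4"},
    2: {"tag2", "tag3", "tag4"},
    3: {"tag3", "tag4"},
    4: {"tag4"},
    5: {"tag3", "tag4", "tag5"},
}


def count_tag_combinations(tag_sets, min_size=2, min_count=2):
    """Count each combination of at least min_size tags that co-occur."""
    counts = Counter()
    for tags in tag_sets:
        ordered = sorted(tags)  # one canonical form per combination
        for size in range(min_size, len(ordered) + 1):
            counts.update(combinations(ordered, size))
    # Keep only combinations seen at least min_count times.
    return {combo: n for combo, n in counts.items() if n >= min_count}


result = count_tag_combinations(tags_by_question.values())
for combo, n in sorted(result.items(), key=lambda item: -item[1]):
    print(", ".join(combo), "occur", n, "times")
```

This also reports tag2, tag4 with a count of 2, in line with the note above.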

You can indeed use a recursive CTE to produce the possible combinations. First select each tag ID as a one-element array. Then UNION ALL a join of the CTE with the tags, appending a tag ID to the array only if it is larger than the largest ID already in the array, so each combination is generated exactly once.
Join to the CTE an aggregation that collects the tag IDs of every answer into an array. In the ON clause, check that the answer's array contains the array from the CTE, using the array "contains" operator @>.
Exclude the single-tag combinations from the CTE in a WHERE clause, as you're not interested in those.
Now GROUP BY the combination of tags and exclude all combinations that occur fewer than twice in a HAVING clause, as you're not interested in those either. If you want, you can also "translate" the IDs to the tag names in the SELECT list.
WITH RECURSIVE "cte"
AS
(
SELECT ARRAY["t"."id"] "id"
FROM "tags" "t"
UNION ALL
SELECT "c"."id" || "t"."id" "id"
FROM "cte" "c"
INNER JOIN "tags" "t"
ON "t"."id" > (SELECT max("un"."e")
FROM unnest("c"."id") "un" ("e"))
)
SELECT "c"."id" "id",
(SELECT array_agg("t"."text")
FROM unnest("c"."id") "un" ("e")
INNER JOIN "tags" "t"
ON "t"."id" = "un"."e") "text",
count(*) "count"
FROM "cte" "c"
INNER JOIN (SELECT array_agg("at"."tag_id" ORDER BY "at"."tag_id") "id"
FROM "answer_tags" "at"
GROUP BY "at"."answer_id") "x"
ON "x"."id" @> "c"."id"
WHERE array_length("c"."id", 1) > 1
GROUP BY "c"."id"
HAVING count(*) > 1;
Result:
id | text | count
---------+------------------+-------
{2,3} | {tag2,tag3} | 2
{3,4} | {tag3,tag4} | 4
{2,4} | {tag2,tag4} | 2
{2,3,4} | {tag2,tag3,tag4} | 2
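The same idea can be sketched in Python to make the containment step concrete. This is an illustration under assumptions, not the query itself: `tags_by_answer` is hard-coded from the question's example data, candidates are built from the full tag list as the CTE does, and `set.issuperset` stands in for the array containment check. Because every subset of the tag list is a candidate, the amount of work doubles with each additional tag.

```python
from itertools import combinations

all_tag_ids = [1, 2, 3, 4, 5]

# Tag IDs per answer, from the example data (answer 6 has no tags).
tags_by_answer = {
    1: {1, 2, 3, 4},
    2: {2, 3, 4},
    3: {3, 4},
    4: {4},
    5: {3, 4, 5},
    6: set(),
}


def containing_counts(tag_ids, answer_tags, min_size=2, min_count=2):
    """For each candidate combination, count the answers that contain it."""
    result = {}
    for size in range(min_size, len(tag_ids) + 1):
        for combo in combinations(sorted(tag_ids), size):
            # issuperset plays the role of the containment test in the join.
            n = sum(1 for tags in answer_tags.values() if tags.issuperset(combo))
            if n >= min_count:
                result[combo] = n
    return result


counts = containing_counts(all_tag_ids, tags_by_answer)
for combo, n in counts.items():
    print(combo, n)
```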

Try this:
SELECT tags, count(*) FROM (
SELECT q.id, array_agg(DISTINCT t.text) AS tags
FROM questions q
JOIN answers a ON a.question_id = q.id
JOIN answer_tags at ON at.answer_id = a.id
JOIN tags t ON t.id = at.tag_id
GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;

Related

High performance SQL tag search query with logical operations

How can I implement a boolean tag search in SQL?
This question is about as close as I can find, but there's a few.
The only real solution I know is to generate a query like this in backend code and send it to the database, but I imagine it is slow, and I'm also wondering whether there is another way to do it (such as one main query instead of multiple).
There are also solutions that probably use IN or something like:
How to query data based on multiple 'tags' in SQL?
I cannot use the typical GROUP BY HAVING COUNT Solution as it cannot operate on the context of having a list of tags, as this user points out: Implementing a tag search with operands
I should specify that most of the existing solutions do not work for me, as I'm looking for something capable of more complex queries, such as parenthesized grouping and nested operands.
Schema is "Toxi" http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
SELECT id AS post_id
FROM posts
WHERE EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS 'random')
AND NOT (
EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS 'query') AND
EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS '1')
)
AND EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS '2')
AND EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS '3')
AND EXISTS (SELECT name FROM tags WHERE post IS post_id AND name IS 'racecar')
A GROUP BY HAVING COUNT will work — and it will be fast, and versatile. Some examples:
CREATE TABLE tags(
post_id INT,
name VARCHAR(50),
UNIQUE KEY (post_id, name)
);
INSERT INTO tags(post_id, name) VALUES
(1, 'foo'),
(1, 'bar'),
(2, 'foo'),
(3, 'bar'),
(4, 'baz'),
(5, 'foo'),
(5, 'bar'),
(5, 'meh');
-- posts tagged foo AND bar
-- returns 1, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2;
-- posts tagged foo OR bar
-- returns 1, 2, 3, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) > 0;
-- posts tagged (foo AND bar) OR (baz)
-- returns 1, 4, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2
OR COUNT(CASE WHEN name IN ('baz') THEN 1 END) = 1;
-- posts tagged (foo AND bar) AND (no other tags)
-- returns 1
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2
AND COUNT(*) = 2;
-- posts tagged (foo OR bar) AND NOT (meh)
-- returns 1, 2, 3
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) > 0
AND COUNT(CASE WHEN name IN ('meh') THEN 1 END) = 0;
Converting an expression such as tag1 AND tag2 OR tag3 to the corresponding HAVING COUNT is not covered in my answer but the five examples should be sufficient.
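One possible way to do that conversion is sketched below in Python. It is an assumption-laden illustration rather than part of the answer: the expression is assumed to be already parsed into nested tuples such as `("and", "foo", ("not", "meh"))`, the helper names `to_having` and `search` are made up, and SQLite stands in for the database (hence plain `UNIQUE` instead of `UNIQUE KEY`). Each tag becomes one `COUNT(CASE ...) > 0` term, and the boolean operators are applied to those terms directly.

```python
import sqlite3


def to_having(expr, params):
    """Translate a nested tag expression into a HAVING condition.

    expr is a tag name (str) or a tuple: ("and", a, b, ...),
    ("or", a, b, ...) or ("not", a). Tag names are appended to
    params so they are bound as parameters, not inlined.
    """
    if isinstance(expr, str):
        params.append(expr)
        return "COUNT(CASE WHEN name = ? THEN 1 END) > 0"
    op, *args = expr
    if op == "not":
        return "NOT (" + to_having(args[0], params) + ")"
    if op in ("and", "or"):
        joiner = " AND " if op == "and" else " OR "
        return "(" + joiner.join(to_having(a, params) for a in args) + ")"
    raise ValueError(f"unknown operator: {op!r}")


def search(conn, expr):
    """Return the post IDs whose tags satisfy expr, in ascending order."""
    params = []
    having = to_having(expr, params)
    sql = ("SELECT post_id FROM tags GROUP BY post_id "
           f"HAVING {having} ORDER BY post_id")
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (post_id INT, name VARCHAR(50), UNIQUE (post_id, name))"
)
conn.executemany(
    "INSERT INTO tags (post_id, name) VALUES (?, ?)",
    [(1, "foo"), (1, "bar"), (2, "foo"), (3, "bar"),
     (4, "baz"), (5, "foo"), (5, "bar"), (5, "meh")],
)

# (foo OR bar) AND NOT meh
print(search(conn, ("and", ("or", "foo", "bar"), ("not", "meh"))))
# (foo AND bar) OR baz
print(search(conn, ("or", ("and", "foo", "bar"), "baz")))
```

Because the query groups rows of the tags table, a post with no tags at all never appears in the result, even for a pure NOT expression.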
Schema prep
CREATE TABLE posts (
ID INT PRIMARY KEY IDENTITY(1,1),
subj nvarchar(50)
)
GO
CREATE TABLE tags (
post INT,
name nvarchar(50)
)
GO
Data Prep
INSERT INTO posts (subj) VALUES ('post1')
INSERT INTO posts (subj) VALUES ('post2')
INSERT INTO posts (subj) VALUES ('post3')
INSERT INTO tags VALUES (1, 'food')
INSERT INTO tags VALUES (1, 'spicy')
INSERT INTO tags VALUES (2, 'spicy')
INSERT INTO tags VALUES (2, 'recipe')
INSERT INTO tags VALUES (3, 'food')
INSERT INTO tags VALUES (3, 'spicy')
INSERT INTO tags VALUES (3, 'sweet')
Query
;WITH Aggregated_Tags AS (
SELECT
post,
STRING_AGG(name, ',') AS name
FROM tags
GROUP BY post
)
SELECT post
FROM Aggregated_Tags
WHERE
(name LIKE '%food%' AND name LIKE '%spicy%' AND name NOT LIKE '%sweet%')
OR (name LIKE '%recipe%')
GROUP BY post
If I understood you correctly, you are looking for something like this. The key is to aggregate the tags per post so that you do not have to generate multiple SELECT queries. This solution is not complete, but I believe it is a good start.

find rows in database with 2 or more values of a column in common

I have a table called articletag for a blog database that says which article has which tag:
Art_Id Tag_id
1 3
2 3
3 3
4 3
1 1
3 1
4 1
2 2
5 5
another way to see this data is:
1, "blog", "first"
2, "blog", "second"
3, "blog", "first"
4, "blog", "first"
5, "seaside"
Tag_id 3 = 'blog' Tag_id 1 = 'first' Tag_id 5 = 'seaside' Tag_id 2 = 'second'
I am specifically looking for any articles with 2 or more words in common among EVERY article in the database and EVERY word tag (these tags are unique, btw)
Looking at the denormalized example above the answer should be 1,3,4, as articles with 2 or more words in common. Those 3 articles clearly share "blog" and "first."
The output should be
art_id
1
3
4
I have been trying for hours to get this right. The best I came up with was finding which tag_id shows up 2 or more times using:
Select a.*
from articletag a
join (
select t.tag_id
from articletag t
group by t.tag_id
having count(*) >=2
) b on b.tag_id = a.tag_id
But what I really want is which Article_id's have 2 or more words in common
Can anyone please help?
We can try doing a self join here:
SELECT t1.Art_id, t2.Art_id
FROM articletag t1
INNER JOIN articletag t2
ON t2.Art_id > t1.Art_id AND
t1.Tag_id = t2.Tag_id
GROUP BY
t1.Art_id, t2.Art_id
HAVING
COUNT(DISTINCT t1.Tag_id) >= 2;
Note that I am seeing 1-3, 1-4, and 3-4 as being the articles which have two or more tags in common.
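To get from those pairs to the single art_id column the question asks for, the pairs can be flattened into a distinct list. Below is a small Python sketch that runs the self join against an in-memory SQLite database loaded with the question's data and then collapses the pairs. The use of SQLite and the variable names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articletag (Art_Id INT, Tag_id INT)")
conn.executemany(
    "INSERT INTO articletag VALUES (?, ?)",
    [(1, 3), (2, 3), (3, 3), (4, 3), (1, 1), (3, 1), (4, 1), (2, 2), (5, 5)],
)

# Pairs of articles that share at least two tags (same self join as above).
pairs = conn.execute(
    """
    SELECT t1.Art_Id, t2.Art_Id
    FROM articletag t1
    INNER JOIN articletag t2
        ON t2.Art_Id > t1.Art_Id AND t1.Tag_id = t2.Tag_id
    GROUP BY t1.Art_Id, t2.Art_Id
    HAVING COUNT(DISTINCT t1.Tag_id) >= 2
    ORDER BY t1.Art_Id, t2.Art_Id
    """
).fetchall()

# Flatten the pairs into the distinct article IDs.
article_ids = sorted({art_id for pair in pairs for art_id in pair})

print(pairs)
print(article_ids)
```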
Try this:
declare @x table (art_id int, tag_id int)
insert into @x values
(1, 3),
(2, 3),
(3, 3),
(4, 3),
(1, 1),
(3, 1),
(4, 1),
(2, 2),
(5, 5)
select distinct art_id from (
select [x1].art_id,
COUNT(*) over (partition by [x1].art_id,[x2].art_id) [cnt]
from @x [x1] join @x [x2]
on [x1].tag_id = [x2].tag_id and [x1].art_id <> [x2].art_id
) a where cnt > 1
You could also use a CTE to find the Art_Ids that have the same combination:
;with cte as
(
select Tag_id
from articletag
group by Tag_id
having count(*) >= 2
)
select t.Art_Id
from cte c inner join articletag t
on t.Tag_id = c.Tag_id
group by t.Art_Id
having count(*) = (select count(1) from cte)

Dealing with a poorly designed "variable column" table in PostgreSQL

I am dealing with a poorly designed table, somewhat like this
create table (
entity_key integer,
tag1 varchar(10),
tag2 varchar(10),
tag3 varchar(10),
...
tag25 varchar(10)
);
An entity can have 0 or more tags, indicated by the number of non-null columns. Tags are all the same type, and there should be a separate "tags" table to which we can join the primary entities.
However, I'm stuck with this (quite large) table.
I want to run a query that gives me the distinct tags and a count of each.
If we had the normed "tags" table we could simply write
select tag, count(tag) from tags group by tag;
However, I haven't yet come up with a good approach for this query given the current table structure.
You can do this by using an array and unnest:
select x.tag, count(*)
from tags
cross join lateral unnest(array[tag1, tag2, tag3, tag4, tag5, tag6, tag7, ...]) as x(tag)
where x.tag is not null --<< get rid of any empty tags
group by x.tag;
This will group by the contents of the tag columns unlike Prdp's answer which groups by the "position" in the column list.
For this sample data:
insert into tags (entity_key, tag1, tag2, tag3, tag4, tag5)
values
(1, 'sql', 'dbms', null, null, null),
(2, 'sql', 'dbms', null, null, 'dml'),
(3, 'sql', null, null, 'ddl', null);
This will return this:
tag | count
-----+------
dml | 1
ddl | 1
sql | 3
dbms | 2
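The effect of unnesting the columns and dropping the nulls can be mimicked in a few lines of Python, which may help when checking the counts by hand. This is only a sketch over the three sample rows (shown with five tag columns, as in the sample insert), and the variable names are illustrative.

```python
from collections import Counter

# (entity_key, tag1, tag2, tag3, tag4, tag5) from the sample data.
rows = [
    (1, "sql", "dbms", None, None, None),
    (2, "sql", "dbms", None, None, "dml"),
    (3, "sql", None, None, "ddl", None),
]

# Flatten every tag column into one stream of values and skip the NULLs,
# which is what unnest(array[...]) plus "is not null" does per row.
tag_counts = Counter(
    tag
    for _entity_key, *tags in rows
    for tag in tags
    if tag is not None
)

for tag, n in tag_counts.most_common():
    print(tag, n)
```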
You can unpivot the data and do the count
select tag,count(data)
from
(
select tag1 as data,'tag1' as tag
from yourtable
Union All
select tag2,'tag2' as tag
from yourtable
Union All
..
select tag25,'tag25' as tag
from yourtable
) A
Group by tag
If postgresql supports Unpivot operator then you can use that

How to SELECT by an array in postgresql?

Say I have a table like this:
DROP TABLE tmp;
CREATE TABLE tmp (id SERIAL, name TEXT);
INSERT INTO tmp VALUES (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five');
SELECT id, name FROM tmp;
It's like this:
id | name
----+-------
1 | one
2 | two
3 | three
4 | four
5 | five
(5 rows)
Then I have an array, ARRAY[3,1,2]. I want to query the table by this array so that I get back ARRAY['three', 'one', 'two']. I think this should be very easy, but I just can't figure it out.
Thanks in advance.
To preserve the array order, it needs to be unnested with the index order (using row_number()), then joined to the tmp table:
SELECT array_agg(name ORDER BY f.ord)
FROM (
select row_number() over() as ord, a
FROM unnest(ARRAY[3, 1, 2]) AS a
) AS f
JOIN tmp ON tmp.id = f.a;
array_agg
-----------------
{three,one,two}
(1 row)
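The reason for carrying the position along is easier to see outside SQL. The Python sketch below mirrors the three steps (number the array elements, join on id, aggregate in position order). The dictionary `names_by_id` stands in for the tmp table and is an assumption for illustration.

```python
# Stand-in for the tmp table: id -> name.
names_by_id = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

wanted = [3, 1, 2]

# Step 1: attach a position to every element, like row_number() over ().
numbered = list(enumerate(wanted, start=1))

# Step 2: "join" on id; rows without a match are dropped, as in an inner join.
joined = [(position, names_by_id[key])
          for position, key in numbered
          if key in names_by_id]

# Step 3: aggregate the names ordered by the original position.
ordered_names = [name for _position, name in sorted(joined)]

print(ordered_names)
```

A plain membership filter (the equivalent of IN or = ANY) selects the same rows but says nothing about their order, which is why the position has to be kept explicitly.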
Use unnest function:
SELECT id, name FROM tmp
WHERE id IN (SELECT unnest(your_array));
There is a different technique as suggested by Eelke:
You can also use the any operator
SELECT id, name FROM tmp WHERE id = ANY (ARRAY[3, 1, 2]);
If you want to return the array as output then try this:
SELECT array_agg(name) FROM tmp WHERE id = ANY (ARRAY[3, 1, 2]);

Am I using GROUP_CONCAT properly?

I'm selecting properties and joining them to mapping tables where they get mapped to filters such as location, destination, and property type.
My goal is to grab all the properties and then LEFT JOIN them to the tables, and then basically get data that shows all the locations, destinations a property is attached to and the property type itself.
Here's my query:
SELECT p.slug AS property_slug,
p.name AS property_name,
p.founder AS founder,
IF (p.display_city != '', display_city, city) AS city,
d.name AS state,
type
GROUP_CONCAT( CONVERT(subcategories_id, CHAR(8)) ) AS foo,
GROUP_CONCAT( CONVERT(categories_id, CHAR(8)) ) AS bah
FROM properties AS p
LEFT JOIN destinations AS d ON d.id = p.state
LEFT JOIN regions AS r ON d.region_id = r.id
LEFT JOIN properties_subcategories AS sc ON p.id = sc.properties_id
LEFT JOIN categories_subcategories AS c ON c.subcategory_id = sc.subcategories_id
WHERE 1 = 1
AND p.is_active = 1
GROUP BY p.id
Before I do the GROUP BY and GROUP_CONCAT my data looks like this:
id name type category_id subcategory_id state
--------------------------------------------------------------------------
1 The Hilton Hotel 1 1 2 7
1 The Hilton Hotel 1 1 3 7
1 The BlaBla Resort 2 2 5 7
After the GROUP BY and GROUP_CONCAT it becomes...
id name type category_id subcategory_id state
--------------------------------------------------------------------------
1 The Hilton Hotel 1 1, 1 2, 3 7
1 The BlaBla Resort 2 1 3 7
Is this the preferred way of grabbing all the possible mappings for the property in one go, with GROUP_CONCAT into a CSV like this?
Using this data, I can render something like...
<div class="property" categories="1" subcategories="2,3">
<h2>{property_name}</h2>
<span>{property_location}</span>
</div>
Then use Javascript to show/hide based on if the user clicks on an anchor which has say, a subcategory="2" attribute it would hide each .property that doesn't have 2 inside of its subcategories attribute value.
I believe you want something like this:
CREATE TABLE property (id INT NOT NULL PRIMARY KEY, name TEXT);
INSERT
INTO property
VALUES
(1, 'Hilton'),
(2, 'Astoria');
CREATE TABLE category (id INT NOT NULL PRIMARY KEY, property INT NOT NULL);
INSERT
INTO category
VALUES
(1, 1),
(2, 1),
(3, 2);
CREATE TABLE subcategory (id INT NOT NULL PRIMARY KEY, category INT NOT NULL);
INSERT
INTO subcategory
VALUES
(1, 1),
(2, 1),
(3, 2),
(5, 3),
(6, 3),
(7, 3);
SELECT id, name,
CONCAT(
'{',
(
SELECT GROUP_CONCAT(
'"', c.id, '": '
'[',
(
SELECT GROUP_CONCAT(sc.id ORDER BY sc.id SEPARATOR ', ' )
FROM subcategory sc
WHERE sc.category = c.id
),
']' ORDER BY c.id SEPARATOR ', ')
FROM category c
WHERE c.property = p.id
), '}')
FROM property p;
which would output this:
1 Hilton {"1": [1, 2], "2": [3]}
2 Astoria {"3": [5, 6, 7]}
The last field is a properly formed JSON which maps category id's to the arrays of subcategory id's.
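If the nesting is easier to build in application code than with nested GROUP_CONCAT calls, the same mapping can be assembled from plain rows. The following Python sketch uses the sample rows above. The function name `category_map` is hypothetical, and `json.dumps` is used so that quoting and separators are produced by a JSON encoder rather than by string concatenation.

```python
import json

properties = {1: "Hilton", 2: "Astoria"}
# (id, property) rows of the category table.
categories = [(1, 1), (2, 1), (3, 2)]
# (id, category) rows of the subcategory table.
subcategories = [(1, 1), (2, 1), (3, 2), (5, 3), (6, 3), (7, 3)]


def category_map(property_id):
    """Map each category id of a property to its sorted subcategory ids."""
    mapping = {}
    for category_id, owner in sorted(categories):
        if owner == property_id:
            mapping[str(category_id)] = sorted(
                sub_id for sub_id, parent in subcategories
                if parent == category_id
            )
    return mapping


for property_id, name in properties.items():
    print(property_id, name, json.dumps(category_map(property_id)))
```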
You should add DISTINCT, and possibly ORDER BY:
GROUP_CONCAT(DISTINCT CONVERT(subcategories_id, CHAR(8))
ORDER BY subcategories_id) AS foo,
GROUP_CONCAT(DISTINCT CONVERT(categories_id, CHAR(8))
ORDER BY categories_id) AS bah
It's "de-normalized", if you want to call it that. Whether that's the best representation for rendering is another question, but I think it's fine. Some may say it's a hack, but I guess it's not too bad.
By the way, a comma seems to be missing after the "type".