Dealing with a poorly designed "variable column" table in PostgreSQL - sql

I am dealing with a poorly designed table, somewhat like this
create table (
entity_key integer,
tag1 varchar(10),
tag2 varchar(10),
tag3 varchar(10),
...
tag25 varchar(10)
);
An entity can have 0 or more tags, indicated by the number of non-null columns. Tags are all the same type, and there should be a separate "tags" table to which we can join the primary entities.
However, I'm stuck with this (quite large) table.
I want to run a query that gives me the distinct tags and a count of each.
If we had the normalized "tags" table we could simply write
select tag, count(tag) from tags group by tag;
However, I haven't yet come up with a good approach for this query given the current table structure.

You can do this by using an array and unnest:
select x.tag, count(*)
from tags
cross join lateral unnest(array[tag1, tag2, tag3, tag4, tag5, tag6, tag7, ...]) as x(tag)
where x.tag is not null --<< get rid of any empty tags
group by x.tag;
This will group by the contents of the tag columns unlike Prdp's answer which groups by the "position" in the column list.
For this sample data:
insert into tags (entity_key, tag1, tag2, tag3, tag4, tag5)
values
(1, 'sql', 'dbms', null, null, null),
(2, 'sql', 'dbms', null, null, 'dml'),
(3, 'sql', null, null, 'ddl', null);
This will return this:
tag | count
-----+------
dml | 1
ddl | 1
sql | 3
dbms | 2

You can unpivot the data and do the count
select tag,count(data)
from
(
select tag1 as data,'tag1' as tag
from yourtable
Union All
select tag2,'tag2' as tag
from yourtable
Union All
..
select tag25,'tag25' as tag
from yourtable
) A
Group by tag
PostgreSQL does not have an UNPIVOT operator, so the UNION ALL form above is the way to unpivot here.
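The UNION ALL unpivot can also be grouped by the tag value instead of the column label, which is what the question asks for. The following is a minimal sketch, not a definitive implementation. It assumes SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, and it uses only the five tag columns and the sample rows from the other answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (
        entity_key INTEGER,
        tag1 TEXT, tag2 TEXT, tag3 TEXT, tag4 TEXT, tag5 TEXT
    );
    INSERT INTO tags (entity_key, tag1, tag2, tag3, tag4, tag5) VALUES
        (1, 'sql', 'dbms', NULL, NULL, NULL),
        (2, 'sql', 'dbms', NULL, NULL, 'dml'),
        (3, 'sql', NULL, NULL, 'ddl', NULL);
""")

# Build one "SELECT tagN AS tag FROM tags" branch per tag column.
columns = [f"tag{i}" for i in range(1, 6)]
unpivot = " UNION ALL ".join(f"SELECT {col} AS tag FROM tags" for col in columns)

# Group by the tag value (not by the column it came from) and skip NULLs.
query = f"""
    SELECT tag, COUNT(*) AS n
    FROM ({unpivot}) AS unpivoted
    WHERE tag IS NOT NULL
    GROUP BY tag
    ORDER BY tag
"""
tag_counts = dict(conn.execute(query).fetchall())
print(tag_counts)  # {'dbms': 2, 'ddl': 1, 'dml': 1, 'sql': 3}
```

Generating the branches in a loop keeps the statement manageable for all 25 columns. Each branch scans the table once, so on a large table the unnest approach may be cheaper.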

Related

High performance SQL tag search query with logical operations

How can I implement a boolean tag search in SQL?
This question is about as close as I can find, but there are a few others.
The only real solution I know is to generate a query like the one below in backend code and send it to the database, but I imagine it is slow. I'm also wondering whether there is another way to do it (such as having one main query instead of multiple subqueries).
There are also solutions that probably use IN or something similar, like:
How to query data based on multiple 'tags' in SQL?
I cannot use the typical GROUP BY ... HAVING COUNT solution because it cannot operate in the context of a list of tags, as this user points out: Implementing a tag search with operands
I should specify that most of the existing solutions do not work for me, as I'm looking for something capable of more complex queries such as parenthesized grouping and nested operands.
Schema is "Toxi" http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
SELECT id AS post_id
FROM posts
WHERE EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = 'random')
AND NOT (
EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = 'query') AND
EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = '1')
)
AND EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = '2')
AND EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = '3')
AND EXISTS (SELECT name FROM tags WHERE post = posts.id AND name = 'racecar')
A GROUP BY HAVING COUNT will work — and it will be fast, and versatile. Some examples:
CREATE TABLE tags(
post_id INT,
name VARCHAR(50),
UNIQUE KEY (post_id, name)
);
INSERT INTO tags(post_id, name) VALUES
(1, 'foo'),
(1, 'bar'),
(2, 'foo'),
(3, 'bar'),
(4, 'baz'),
(5, 'foo'),
(5, 'bar'),
(5, 'meh');
-- posts tagged foo AND bar
-- returns 1, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2;
-- posts tagged foo OR bar
-- returns 1, 2, 3, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) > 0;
-- posts tagged (foo AND bar) OR (baz)
-- returns 1, 4, 5
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2
OR COUNT(CASE WHEN name IN ('baz') THEN 1 END) = 1;
-- posts tagged (foo AND bar) AND (no other tags)
-- returns 1
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) = 2
AND COUNT(*) = 2;
-- posts tagged (foo OR bar) AND NOT (meh)
-- returns 1, 2, 3
SELECT post_id
FROM tags
GROUP BY post_id
HAVING COUNT(CASE WHEN name IN ('foo', 'bar') THEN 1 END) > 0
AND COUNT(CASE WHEN name IN ('meh') THEN 1 END) = 0;
Demo on DB<>Fiddle
Converting an expression such as tag1 AND tag2 OR tag3 to the corresponding HAVING COUNT is not covered in my answer but the five examples should be sufficient.
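One way the conversion could be done in backend code is sketched below. It is an illustration only, not part of the answer above. The nested-tuple expression format and the helper names to_having and search are hypothetical, and SQLite (through Python's sqlite3) stands in for the database. Each tag becomes one COUNT(CASE ...) > 0 test, and AND, OR and NOT combine those tests recursively.

```python
import sqlite3

def to_having(expr, params):
    """Translate a nested boolean tag expression into a HAVING condition.

    expr is a tag name (str) or a tuple: ("and", a, b, ...),
    ("or", a, b, ...) or ("not", a). Tag names are appended to params
    in the same order as their ? placeholders appear in the SQL text.
    """
    if isinstance(expr, str):
        params.append(expr)
        return "COUNT(CASE WHEN name = ? THEN 1 END) > 0"
    op, *args = expr
    if op == "not":
        return "NOT (" + to_having(args[0], params) + ")"
    if op not in ("and", "or"):
        raise ValueError(f"unknown operator: {op}")
    joiner = " AND " if op == "and" else " OR "
    return "(" + joiner.join(to_having(arg, params) for arg in args) + ")"

def search(conn, expr):
    """Return the ids of posts whose tag set satisfies expr."""
    params = []
    having = to_having(expr, params)
    sql = ("SELECT post_id FROM tags GROUP BY post_id "
           f"HAVING {having} ORDER BY post_id")
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (post_id INT, name VARCHAR(50), UNIQUE (post_id, name));
    INSERT INTO tags (post_id, name) VALUES
        (1, 'foo'), (1, 'bar'), (2, 'foo'), (3, 'bar'),
        (4, 'baz'), (5, 'foo'), (5, 'bar'), (5, 'meh');
""")

print(search(conn, ("and", "foo", "bar")))                         # [1, 5]
print(search(conn, ("or", ("and", "foo", "bar"), "baz")))          # [1, 4, 5]
print(search(conn, ("and", ("or", "foo", "bar"), ("not", "meh")))) # [1, 2, 3]
```

Because the query only reads the tags table, a post with no tags at all never appears, even for a pure NOT expression. Covering that case would need a join from the posts table.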
Schema prep
CREATE TABLE posts (
ID INT PRIMARY KEY IDENTITY(1,1),
subj nvarchar(50)
)
GO
CREATE TABLE tags (
post INT,
name nvarchar(50)
)
GO
Data Prep
INSERT INTO posts (subj) VALUES ('post1')
INSERT INTO posts (subj) VALUES ('post2')
INSERT INTO posts (subj) VALUES ('post3')
INSERT INTO tags VALUES (1, 'food')
INSERT INTO tags VALUES (1, 'spicy')
INSERT INTO tags VALUES (2, 'spicy')
INSERT INTO tags VALUES (2, 'recipe')
INSERT INTO tags VALUES (3, 'food')
INSERT INTO tags VALUES (3, 'spicy')
INSERT INTO tags VALUES (3, 'sweet')
Query
;WITH Aggregated_Tags AS (
SELECT
post,
STRING_AGG(name, ',') AS name
FROM tags
GROUP BY post
)
SELECT post
FROM Aggregated_Tags
WHERE
(name LIKE '%food%' AND name LIKE '%spicy%' AND name NOT LIKE '%sweet%')
OR (name LIKE '%recipe%')
GROUP BY post
If I understood you correctly, you are looking for something like this. The key is to aggregate the tags per post, which avoids generating multiple select queries. This solution is not complete, but I believe it is a good start.
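One gap in the LIKE approach is that '%food%' also matches any tag that merely contains "food". A possible refinement, sketched here under assumptions, is to wrap the aggregated string and each pattern in the delimiter so only whole tags match. The sketch uses SQLite's group_concat through Python's sqlite3 instead of SQL Server's STRING_AGG, and post 4 with the tag 'seafood' is a hypothetical extra row added to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (post INTEGER, name TEXT);
    INSERT INTO tags VALUES
        (1, 'food'), (1, 'spicy'),
        (2, 'spicy'), (2, 'recipe'),
        (3, 'food'), (3, 'spicy'), (3, 'sweet'),
        (4, 'seafood'), (4, 'spicy');  -- hypothetical row, not in the answer
""")

def find_posts(pad):
    # pad = ""  gives the plain substring match used in the answer;
    # pad = "," wraps the list and the patterns so only whole tags match.
    query = f"""
        WITH aggregated_tags AS (
            SELECT post, '{pad}' || group_concat(name, ',') || '{pad}' AS names
            FROM tags
            GROUP BY post
        )
        SELECT post
        FROM aggregated_tags
        WHERE (names LIKE '%{pad}food{pad}%'
               AND names LIKE '%{pad}spicy{pad}%'
               AND names NOT LIKE '%{pad}sweet{pad}%')
           OR names LIKE '%{pad}recipe{pad}%'
        ORDER BY post
    """
    return [row[0] for row in conn.execute(query)]

print(find_posts(""))   # [1, 2, 4]  'seafood' is matched by '%food%'
print(find_posts(","))  # [1, 2]
```

This still assumes that no tag contains the delimiter itself.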

Adding a LEFT JOIN on a INSERT INTO....RETURNING

My query inserts a value and returns the newly inserted row:
INSERT INTO
event_comments(date_posted, e_id, created_by, parent_id, body, num_likes, thread_id)
VALUES(1575770277, 1, '9e028aaa-d265-4e27-9528-30858ed8c13d', 9, 'December 7th', 0, 'zRfs2I')
RETURNING comment_id, date_posted, e_id, created_by, parent_id, body, num_likes, thread_id
I want to join the created_by with the user_id from my users table.
SELECT * from users WHERE user_id = created_by
Is it possible to join that new returning row with another table row?
Consider using a WITH structure to pass the data from the insert to a query that can then be joined.
Example:
-- Setup some initial tables
create table colors (
id SERIAL primary key,
color VARCHAR UNIQUE
);
create table animals (
id SERIAL primary key,
a_id INTEGER references colors(id),
animal VARCHAR UNIQUE
);
-- provide some initial data in colors
insert into colors (color) values ('red'), ('green'), ('blue');
-- Store returned data in inserted_animal for use in next query
with inserted_animal as (
-- Insert a new record into animals
insert into animals (a_id, animal) values (3, 'fish') returning *
) select * from inserted_animal
left join colors on inserted_animal.a_id = colors.id;
-- Output
-- id | a_id | animal | id | color
-- 1 | 3 | fish | 3 | blue
Explanation:
A WITH query (common table expression) gives a name to the rows returned by an initial statement, including rows produced by a RETURNING clause. The statement that follows can then reference that name like a temporary table, for example in a JOIN.
You were right, I misunderstood
This should do it:
DECLARE mycreated_by event_comments.created_by%TYPE;
INSERT INTO
event_comments(date_posted, e_id, created_by, parent_id, body, num_likes, thread_id)
VALUES(1575770277, 1, '9e028aaa-d265-4e27-9528-30858ed8c13d', 9, 'December 7th', 0, 'zRfs2I')
RETURNING created_by INTO mycreated_by;
SELECT * from users WHERE user_id = mycreated_by;

Count all existing combinations of groupings of records

I have these db tables
questions: id, text
answers: id, text, question_id
answer_tags: id, answer_id, tag_id
tags: id, text
question has many answers
answer has many tags through answer_tags, belongs to question
tag has many answers through answer_tags
An answer has an unlimited number of tags
I would like to show all combinations of groupings of tags that exist ordered by count
Example data
Question 1, Answer 1, tag1, tag2, tag3, tag4
Question 2, Answer 2, tag2, tag3, tag4
Question 3, Answer 3, tag3, tag4
Question 4, Answer 4, tag4
Question 5, Answer 5, tag3, tag4, tag5
Question 1, Answer 6, <no tags>
How can I solve this using SQL?
I'm not sure if this is possible with SQL, but if it is, I think it would need a RECURSIVE query.
Expected results:
tag3, tag4 occur 4 times
tag2, tag3, tag4 occur 2 times
tag2, tag3 occur 2 times
We would only return results with groupings greater than 1. No single tag is ever returned, it must be at least 2 tags together to be counted.
Building on @filiprem's answer and using a slightly modified function from the answer here you get:
--test data
create table questions (id int, text varchar(100));
create table answers (id int, text varchar(100), question_id int);
create table answer_tags (id int, answer_id int, tag_id int);
create table tags (id int, text varchar(100));
insert into questions values (1, 'question1'), (2, 'question2'), (3, 'question3'), (4, 'question4'), (5, 'question5');
insert into answers values (1, 'answer1', 1), (2, 'answer2', 2), (3, 'answer3', 3), (4, 'answer4', 4), (5, 'answer5', 5), (6, 'answer6', 1);
insert into tags values (1, 'tag1'), (2, 'tag2'), (3, 'tag3'), (4, 'tag4'), (5, 'tag5');
insert into answer_tags values
(1,1,1), (2,1,2), (3,1,3), (4,1,4),
(5,2,2), (6,2,3), (7,2,4),
(8,3,3), (9,3,4),
(10,4,4),
(11,5,3), (12,5,4), (13,5,5);
--end test data
--function to get all possible combinations from an array with at least 2 elements
create or replace function get_combinations(source anyarray) returns setof anyarray as $$
with recursive combinations(combination, indices) as (
select source[i:i], array[i] from generate_subscripts(source, 1) i
union all
select c.combination || source[j], c.indices || j
from combinations c, generate_subscripts(source, 1) j
where j > all(c.indices) and
array_length(c.combination, 1) <= 2
)
select combination from combinations
where array_length(combination, 1) >= 2
$$ language sql;
--expected results
SELECT tags, count(*) FROM (
SELECT q.id, get_combinations(array_agg(DISTINCT t.text)) AS tags
FROM questions q
JOIN answers a ON a.question_id = q.id
JOIN answer_tags at ON at.answer_id = a.id
JOIN tags t ON t.id = at.tag_id
GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;
Note: this gives tag2,tag4 occurs 2 times which was missed in the expected results (from questions 1 and 2)
You can indeed use a recursive CTE to produce the possible combinations. First select every tag ID as an array of one element. Then UNION ALL a JOIN of the CTE and the tag IDs, appending a tag ID to the array if it is larger than the largest ID already in the array.
To the CTE, join an aggregation that collects the tag IDs of every answer as an array. In the ON clause, check that the answer's array contains the array from the CTE, using the array "contains" operator @>.
Exclude the combinations from the CTE with only one tag in a WHERE clause, as you're not interested in those.
Now GROUP BY the combination of tags and exclude all combinations that occur fewer than twice in a HAVING clause, since you're not interested in them either. If you want, you can also "translate" the IDs to the names of the tags in the SELECT list.
WITH RECURSIVE "cte"
AS
(
SELECT ARRAY["t"."id"] "id"
FROM "tags" "t"
UNION ALL
SELECT "c"."id" || "t"."id" "id"
FROM "cte" "c"
INNER JOIN "tags" "t"
ON "t"."id" > (SELECT max("un"."e")
FROM unnest("c"."id") "un" ("e"))
)
SELECT "c"."id" "id",
(SELECT array_agg("t"."text")
FROM unnest("c"."id") "un" ("e")
INNER JOIN "tags" "t"
ON "t"."id" = "un"."e") "text",
count(*) "count"
FROM "cte" "c"
INNER JOIN (SELECT array_agg("at"."tag_id" ORDER BY "at"."tag_id") "id"
FROM "answer_tags" "at"
GROUP BY "at"."answer_id") "x"
ON "x"."id" @> "c"."id"
WHERE array_length("c"."id", 1) > 1
GROUP BY "c"."id"
HAVING count(*) > 1;
Result:
id | text | count
---------+------------------+-------
{2,3} | {tag2,tag3} | 2
{3,4} | {tag3,tag4} | 4
{2,4} | {tag2,tag4} | 2
{2,3,4} | {tag2,tag3,tag4} | 2
db<>fiddle
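As a cross-check of the result above, the same counting can be sketched outside the database. This is plain Python rather than SQL and only illustrates the counting logic. The dictionary answer_tags is a hand-copied version of the example data from the question, grouped per answer.

```python
from collections import Counter
from itertools import combinations

# Example data from the question: answer id -> tags on that answer.
answer_tags = {
    1: ["tag1", "tag2", "tag3", "tag4"],
    2: ["tag2", "tag3", "tag4"],
    3: ["tag3", "tag4"],
    4: ["tag4"],
    5: ["tag3", "tag4", "tag5"],
    6: [],
}

counts = Counter()
for tags in answer_tags.values():
    unique = sorted(set(tags))
    # Count every combination of at least two tags once per answer.
    for size in range(2, len(unique) + 1):
        counts.update(combinations(unique, size))

# Keep only combinations that occur on more than one answer.
repeated = {combo: n for combo, n in counts.items() if n > 1}
for combo, n in sorted(repeated.items(), key=lambda item: (-item[1], item[0])):
    print(", ".join(combo), "occur", n, "times")
```

The number of combinations grows exponentially with the number of tags per answer, in SQL as well as here, so a cap on the combination size is worth considering for heavily tagged answers.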
Try this:
SELECT tags, count(*) FROM (
SELECT q.id, array_agg(DISTINCT t.text) AS tags
FROM questions q
JOIN answers a ON a.question_id = q.id
JOIN answer_tags at ON at.answer_id = a.id
JOIN tags t ON t.id = at.tag_id
GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;

Inserting data to multiple tables Postgres

I currently have a MongoDB database with the following schema:
Image: { name: String, src: String, category: String, tags: [String] }
I'd like to migrate this to Postgres and for that I'd have 4 tables
image (id, src, name, category_id)
tag (id, name)
image_tag (image_id, tag_id)
category (id, name)
There might be new tags on every image insert, so when using a CTE I need to select all the tags (and only insert new tags if they don't exist). I was thinking about using a cache (Redis) to store the already inserted tags (so I don't need to select them from the db).
So my question is: should I go with a CTE using insert into tags ... where not exists statements, or a CTE plus Redis, only inserting tags that could not be found in the cache?
Here is a single statement that inserts an image with a category and multiple tags into multiple tables of a Postgres database. It assumes that the name column in the tables category and tag has a unique constraint. For completeness I also created a statement that works without that constraint (see the Example section).
Postgres statement
WITH image_values(image_name, src, category) AS (
VALUES
('Goldkraut', 'goldkraut.jpg', 'logo')
),
tag_values(tag_name) AS (
VALUES
('music'), ('band')
),
category_select AS (
SELECT id, name FROM category
WHERE name IN (SELECT category FROM image_values)
),
category_insert AS (
INSERT INTO category(name)
SELECT category FROM image_values
ON CONFLICT (name) DO NOTHING
RETURNING id, name
),
category_created AS (
SELECT id, name FROM category_select
UNION ALL
SELECT id, name FROM category_insert
),
tag_select AS (
SELECT id, name FROM tag
WHERE name IN (SELECT tag_name FROM tag_values)
),
tag_insert AS (
INSERT INTO tag(name)
SELECT tag_name FROM tag_values
ON CONFLICT (name) DO NOTHING
RETURNING id, name
),
tag_created AS (
SELECT id, name FROM tag_select
UNION ALL
SELECT id, name FROM tag_insert
),
image_insert AS (
INSERT INTO image(src, name, category_id)
SELECT src, image_name, category_created.id
FROM image_values
LEFT JOIN category_created ON(image_values.category=category_created.name)
RETURNING id, src, name, category_id
),
image_tag_insert AS (
INSERT INTO image_tag(image_id, tag_id)
SELECT image_insert.id, tag_created.id FROM image_insert
CROSS JOIN tag_created
RETURNING image_id, tag_id
)
SELECT image_insert.*, category_created.name as category_name, image_tag_insert.*, tag_created.name as "tag.name"
FROM image_tag_insert
LEFT JOIN image_insert ON (image_id = image_insert.id)
LEFT JOIN category_created ON (category_created.id = image_insert.category_id)
LEFT JOIN tag_created ON (tag_created.id = tag_id)
Explanation to the statement
The first common table expression (CTE), image_values, defines all values that belong to the image in a 1:1 relation. The next expression, tag_values, defines all tag names for that image.
Now let's start with the categories. To find out whether a category with that name already exists, category_select queries for a matching category entry. The expression category_insert creates a new entry for the category if it does not already exist; ON CONFLICT (name) DO NOTHING skips a name that is already present, and RETURNING then yields only the newly created row. To store the category id in the image table we need the category entry, whether it is the existing one (from category_select) or the inserted one (from category_insert), so we union these two expressions in category_created.
We use the same pattern for the tags: query for existing tags in tag_select, insert missing tags in tag_insert, and union these entries in tag_created.
Next we insert the image in image_insert. We select the values from image_values and join category_created to get the id of the category. To insert the image-to-tag relations we need the id of the inserted image, so we return this value. The other returned values are not strictly necessary, but we use them to get a nicer result set in the final query.
Now that we have the primary key of the inserted image, we can store the associations between the image and its tags. In image_tag_insert we select the id of the inserted image and cross join it with every tag id we selected or inserted.
For the final statement it would be enough to run SELECT * FROM image_tag_insert to execute all the expressions. To give an overview of what was stored in the database, I joined all the relations, so the result looks like this:
Joined result
| id | src | name | category_id | category_name | image_id | tag_id | tag.name |
|----|---------------|-----------|-------------|---------------|----------|--------|----------|
| 1 | goldkraut.jpg | Goldkraut | 2 | logo | 1 | 3 | band |
| 1 | goldkraut.jpg | Goldkraut | 2 | logo | 1 | 1 | music |
Example
On this sqlfiddle you can see the query in action. In another sqlfiddle I added some extras to the last statement to format all inserted tags as a list. If you have not added a unique constraint to the name column in the tables tag and category, you can use this example.
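The same "insert only what is missing, then link" idea can be sketched as separate statements inside one transaction. This is not the single-CTE statement above. It assumes SQLite through Python's sqlite3 as a stand-in for Postgres, uses INSERT OR IGNORE as SQLite's rough counterpart to ON CONFLICT DO NOTHING, and the helper name insert_image is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE image (id INTEGER PRIMARY KEY, src TEXT, name TEXT,
                        category_id INTEGER REFERENCES category(id));
    CREATE TABLE image_tag (image_id INTEGER REFERENCES image(id),
                            tag_id INTEGER REFERENCES tag(id));
    INSERT INTO tag (name) VALUES ('music');  -- one tag already exists
""")

def insert_image(conn, name, src, category, tags):
    """Insert an image, creating its category and any tags that are missing."""
    with conn:  # one transaction: commit on success, roll back on error
        # The unique constraint on name turns these into "insert if missing".
        conn.execute("INSERT OR IGNORE INTO category (name) VALUES (?)",
                     (category,))
        conn.executemany("INSERT OR IGNORE INTO tag (name) VALUES (?)",
                         [(t,) for t in tags])
        cur = conn.execute(
            "INSERT INTO image (src, name, category_id) "
            "VALUES (?, ?, (SELECT id FROM category WHERE name = ?))",
            (src, name, category))
        image_id = cur.lastrowid
        # Link the new image to every requested tag, old or new.
        placeholders = ", ".join("?" for _ in tags)
        conn.execute(
            "INSERT INTO image_tag (image_id, tag_id) "
            f"SELECT ?, id FROM tag WHERE name IN ({placeholders})",
            (image_id, *tags))
    return image_id

image_id = insert_image(conn, "Goldkraut", "goldkraut.jpg", "logo",
                        ["music", "band"])
linked = [row[0] for row in conn.execute(
    "SELECT tag.name FROM image_tag JOIN tag ON tag.id = image_tag.tag_id "
    "WHERE image_id = ? ORDER BY tag.name", (image_id,))]
tag_total = conn.execute("SELECT COUNT(*) FROM tag").fetchone()[0]
print(image_id, linked, tag_total)  # 1 ['band', 'music'] 2
```

In this sketch the unique constraint makes the database itself the lookup for existing tags, so no external cache is involved.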

SQL query that gives distinct results that match multiple columns

Sorry, I couldn't provide a better title for my problem as I am quite new to SQL.
I am looking for a SQL query string that solves the below problem.
Let's assume the following table:
DOCUMENT_ID | TAG
----------------------------
1 | tag1
1 | tag2
1 | tag3
2 | tag2
3 | tag1
3 | tag2
4 | tag1
5 | tag3
Now I want to select all distinct document IDs that have every tag in a given set of one or more tags.
For example:
Select all document_id's with tag1 and tag2 would return 1 and 3 (but not 4 for example as it doesn't have tag2).
What would be the best way to do that?
Regards,
Kai
SELECT document_id
FROM table
WHERE tag = 'tag1' OR tag = 'tag2'
GROUP BY document_id
HAVING COUNT(DISTINCT tag) = 2
Edit:
Updated for lack of constraints...
This assumes DocumentID and Tag are the Primary Key.
Edit: Changed HAVING clause to count DISTINCT tags. That way it doesn't matter what the primary key is.
Test Data
-- Populate Test Data
CREATE TABLE #table (
DocumentID varchar(8) NOT NULL,
Tag varchar(8) NOT NULL
)
INSERT INTO #table VALUES ('1','tag1')
INSERT INTO #table VALUES ('1','tag2')
INSERT INTO #table VALUES ('1','tag3')
INSERT INTO #table VALUES ('2','tag2')
INSERT INTO #table VALUES ('3','tag1')
INSERT INTO #table VALUES ('3','tag2')
INSERT INTO #table VALUES ('4','tag1')
INSERT INTO #table VALUES ('5','tag3')
INSERT INTO #table VALUES ('3','tag2') -- Edit: test duplicate tags
Query
-- Return Results
SELECT DocumentID FROM #table
WHERE Tag IN ('tag1','tag2')
GROUP BY DocumentID
HAVING COUNT(DISTINCT Tag) = 2
Results
DocumentID
----------
1
3
select DOCUMENT_ID
from {TABLE}
where TAG in ('tag1', 'tag2', ... 'tagN')
group by DOCUMENT_ID
having count(distinct TAG) = N
Adjust N and the tag list as needed.
Select document_id
from {TABLE}
where tag in ('tag1','tag2')
group by document_id
having count(distinct tag) >= 2
How you generate the list of tags in the where clause depends on your application structure. If you are dynamically generating the query as part of your code then you might simply construct the query as a big dynamically generated string.
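A minimal sketch of the dynamically generated variant follows, assuming Python with sqlite3 as the application layer. The table name document_tags and the helper documents_with_all_tags are hypothetical. Only the placeholder list is generated as text, while the tag values themselves are passed as bound parameters.

```python
import sqlite3

def documents_with_all_tags(conn, wanted):
    """Return the ids of documents that carry every tag in wanted."""
    wanted = sorted(set(wanted))  # duplicates would make the count unreachable
    if not wanted:
        return []
    # One ? per tag; the values are bound, never pasted into the SQL text.
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT document_id FROM document_tags "
        f"WHERE tag IN ({placeholders}) "
        "GROUP BY document_id "
        "HAVING COUNT(DISTINCT tag) = ? "
        "ORDER BY document_id"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document_tags (document_id INTEGER, tag TEXT);
    INSERT INTO document_tags VALUES
        (1, 'tag1'), (1, 'tag2'), (1, 'tag3'), (2, 'tag2'),
        (3, 'tag1'), (3, 'tag2'), (4, 'tag1'), (5, 'tag3');
""")

print(documents_with_all_tags(conn, ["tag1", "tag2"]))          # [1, 3]
print(documents_with_all_tags(conn, ["tag1", "tag2", "tag3"]))  # [1]
```

Binding the values keeps the generated string safe even when the tag names come from user input.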
We always used stored procedures to query the data. In that case, we pass in the list of tags as an XML document. A procedure like that might look something like one of the following, where the input argument would be
<tags>
<tag>tag1</tag>
<tag>tag2</tag>
</tags>
CREATE PROCEDURE [dbo].[GetDocumentIdsByTag]
@tagList xml
AS
BEGIN
declare @tagCount int
-- number of distinct tags requested
select @tagCount = count(distinct tags.value('.','varchar(20)')) from @tagList.nodes('tags/tag') R(tags)
SELECT documentid
FROM {TABLE}
JOIN @tagList.nodes('tags/tag') R(tags) ON {TABLE}.tag = tags.value('.','varchar(20)')
group by documentid
having count(distinct tag) >= @tagCount
END
OR
CREATE PROCEDURE [dbo].[GetDocumentIdsByTag]
@tagList xml
AS
BEGIN
declare @tagCount int
select @tagCount = count(*) from @tagList.nodes('tags/tag') R(tags)
SELECT documentid
FROM {TABLE}
WHERE tag in
(
SELECT tags.value('.','varchar(20)')
FROM @tagList.nodes('tags/tag') R(tags)
)
group by documentid
having count(distinct tag) >= @tagCount
END