Multiple STRING_AGG on multiple join columns causes bloated aggregation - sql

I've got a table in my MSSQL server, let's call it blogPost. I've also got two identically structured tag tables, let's call them fooTag and barTag, which are used to tag rows in blogPost.
blogPost
| postId | title | body |
+--------+---------------------+-------------+
| 1 | The life on a query | lorem ipsum |
+--------+---------------------+-------------+
fooTag and barTag
| postId | tagName |
+--------+--------------+
| 1 | sql |
| 1 | query |
| 1 | select-query |
+--------+--------------+
I want to get a single blog post along with all its tags in a single row, so STRING_AGG() seems suitable for a query like this:
SELECT blogPost.*, STRING_AGG(fooTag.tagName, ';') as [fooTags], STRING_AGG(barTag.tagName, ';') as [barTags]
FROM blogPost
LEFT JOIN fooTag ON blogPost.postId = fooTag.postId
LEFT JOIN barTag ON blogPost.postId = barTag.postId
WHERE blogPost.postId = 1
GROUP BY blogPost.postId, title, body
When making this query I'd expect to get the result
| postId | title | body | fooTags | barTags |
+--------+---------------------+-------------+-------------------------+-------------------------+
| 1 | The life on a query | lorem ipsum | sql;query;select-query | sql;query;select-query |
+--------+---------------------+-------------+-------------------------+-------------------------+
But I'm getting this result instead, where the bar tags (i.e. the last STRING_AGG selected) are duplicated:
| postId | title | body | fooTags | barTags |
+--------+---------------------+-------------+-------------------------+-----------------------------------------------+
| 1 | The life on a query | lorem ipsum | sql;query;select-query; | sql;sql;sql;query;query;query;select-query;select-query;select-query |
+--------+---------------------+-------------+-------------------------+-----------------------------------------------+
Whichever tag column comes last in the SELECT list gets the duplicates: putting fooTags last moves them to fooTags instead of barTags. The number of duplicates seems to be tied to the number of rows aggregated in the first STRING_AGG column, so if fooTags has 5 rows to aggregate, each barTag appears 5 times in the barTags column of the result.
How would I get the result I want without duplicates?

Your problem is caused by the two joins: each fooTag row for a post is paired with every barTag row for that post, so 3 fooTag rows and 3 barTag rows yield 9 joined rows, hence the duplication. You can work around this by performing the STRING_AGG on the fooTag and barTag tables before JOINing them:
SELECT blogPost.*, f.tags as [fooTags], b.tags as [barTags]
FROM blogPost
LEFT JOIN (SELECT postId, STRING_AGG(tagName, ';') AS tags
FROM fooTag
GROUP BY postId) f ON blogPost.postId = f.postId
LEFT JOIN (SELECT postId, STRING_AGG(tagName, ';') AS tags
FROM barTag
GROUP BY postId) b ON blogPost.postId = b.postId
WHERE blogPost.postId = 1
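The fan-out is easy to reproduce. The sketch below is only an illustration, not the SQL Server query itself: it assumes Python's built-in sqlite3 module, and SQLite's group_concat() stands in for STRING_AGG() so that it runs without a SQL Server instance. The table contents are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blogPost (postId INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE fooTag (postId INTEGER, tagName TEXT);
CREATE TABLE barTag (postId INTEGER, tagName TEXT);
INSERT INTO blogPost VALUES (1, 'The life on a query', 'lorem ipsum');
INSERT INTO fooTag VALUES (1, 'sql'), (1, 'query'), (1, 'select-query');
INSERT INTO barTag VALUES (1, 'sql'), (1, 'query'), (1, 'select-query');
""")

# Joining both tag tables pairs every fooTag row with every barTag row.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM blogPost p
    LEFT JOIN fooTag f ON p.postId = f.postId
    LEFT JOIN barTag b ON p.postId = b.postId
    WHERE p.postId = 1
""").fetchone()[0]

# Aggregating each tag table first leaves one row per post to join.
# group_concat() is SQLite's stand-in for STRING_AGG() (assumption).
foo_tags, bar_tags = conn.execute("""
    SELECT f.tags, b.tags
    FROM blogPost p
    LEFT JOIN (SELECT postId, group_concat(tagName, ';') AS tags
               FROM fooTag GROUP BY postId) f ON p.postId = f.postId
    LEFT JOIN (SELECT postId, group_concat(tagName, ';') AS tags
               FROM barTag GROUP BY postId) b ON p.postId = b.postId
    WHERE p.postId = 1
""").fetchone()

print(joined_rows)  # 3 fooTag rows x 3 barTag rows
print(len(foo_tags.split(";")), len(bar_tags.split(";")))
```

The row count of the plain double join is the product of the two tag counts, while the pre-aggregated version yields each tag exactly once per column.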

You can simplify the query like so:
SELECT blogPost.*, ca1.*, ca2.*
FROM blogPost
OUTER APPLY (
SELECT STRING_AGG(tagName, ';')
FROM fooTag
WHERE blogPost.postId = fooTag.postId
) AS ca1(fooTags)
OUTER APPLY (
SELECT STRING_AGG(tagName, ';')
FROM barTag
WHERE blogPost.postId = barTag.postId
) AS ca2(barTags)
WHERE postId = 1
No GROUP BY on blogPost is required, which in your case would be an expensive operation.

Related

Aggregate rows according to JSON array content

I have a PostgreSQL table with JSON tags, which are always strings stored in a JSON array:
id | tags (json)
--------------------------
1 | ["tag1", "tag2"]
2 | ["tag12", "tag2"]
122 | []
I would like to query the number of entries in the table that contain each tag.
For instance, I'd like to get:
tag | count
--------------------------
tag1 | 1
tag2 | 2
tag12 | 1
I tried
SELECT tags::text AS tag, COUNT(id) AS cnt FROM my_table GROUP BY tag;
but it does not work, since it gives
tag | cnt
--------------------------
["tag1", "tag2"] | 1
["tag12", "tag2"] | 1
I guess I need to get the list of all tags in an inner query, and then for each tag count the rows that contain it, but I can't find how to do that. Can you help me with that?
Use json[b]_array_elements_text() and a lateral join to unnest the array:
select x.tag, count(*) cnt
from my_table t
cross join lateral json_array_elements_text(t.tags) as x(tag)
group by x.tag
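To see what the lateral unnest computes, here is a plain-Python model of the same logic (an illustration only, not PostgreSQL code). The rows list is a hypothetical in-memory copy of the sample table, with each tags value kept as JSON text:

```python
import json
from collections import Counter

# Hypothetical in-memory copy of the sample table: (id, tags as JSON text).
rows = [
    (1, '["tag1", "tag2"]'),
    (2, '["tag12", "tag2"]'),
    (122, '[]'),
]

# Expand every array into one item per element (the "unnest" step),
# then count how often each tag occurs (the GROUP BY / count(*) step).
tag_counts = Counter(
    tag
    for _id, tags_json in rows
    for tag in json.loads(tags_json)
)

for tag, cnt in tag_counts.items():
    print(tag, cnt)
```

Row 122 contributes nothing because its array is empty, which mirrors how the cross join drops rows whose array has no elements.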

One-to-Many SQL SELECT concatenated into single row

I'm using Postgres and I have the following tables.
Orders
| id | status |
|----|-------------|
| 1 | delivered |
| 2 | recollected |
Comments
| id | text | user | order |
|----|---------|------|-------|
| 1 | text 1 | 10 | 1 |
| 2 | text 2 | 20 | 1 |
So, in this case, an order can have many comments.
I need to iterate over the orders and get something like this:
| id | status | comments |
|----|-------------|----------------|
| 1 | delivered | text 1, text 2 |
| 2 | recollected | |
I tried to use LEFT JOIN but it didn't work
SELECT
"Order".id,
"Order".status,
"Comment".text
FROM "Order"
LEFT JOIN "Comment" ON "Order".id = "Comment"."order"
it returns this:
| id | status | text |
|----|-------------|--------|
| 1 | delivered | text 1 |
| 1 | delivered | text 2 |
| 2 | recollected| |
You are almost there - you just need aggregation:
SELECT
o.id,
o.status,
STRING_AGG(c.text, ',') comments
FROM "Order" o
LEFT JOIN "Comment" c ON o.id = c."order"
GROUP BY o.id, o.status
I would strongly recommend against having a table (and/or a column) called order, because it conflicts with a language keyword. I would also recommend avoiding quoted identifiers as much as possible: they make the queries longer to write, for no benefit.
Note that you can also use a correlated subquery:
SELECT
o.id,
o.status,
(SELECT STRING_AGG(c.text, ',') FROM "Comment" c WHERE c."order" = o.id) comments
FROM "Order" o
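Both forms can be tried out quickly. The sketch below is an illustration under a few assumptions: it uses Python's built-in sqlite3 module with group_concat() standing in for STRING_AGG(), the tables are renamed to the hypothetical orders and comments (with an order_id column) to avoid the keyword clash, and both sample comments are assumed to belong to order 1, as the expected output implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, text TEXT, order_id INTEGER);
INSERT INTO orders VALUES (1, 'delivered'), (2, 'recollected');
INSERT INTO comments VALUES (1, 'text 1', 1), (2, 'text 2', 1);
""")

# Variant 1: LEFT JOIN, then aggregate per order.
join_rows = conn.execute("""
    SELECT o.id, o.status, group_concat(c.text, ', ') AS comments
    FROM orders o
    LEFT JOIN comments c ON o.id = c.order_id
    GROUP BY o.id, o.status
    ORDER BY o.id
""").fetchall()

# Variant 2: correlated subquery, no GROUP BY on the outer query.
subquery_rows = conn.execute("""
    SELECT o.id, o.status,
           (SELECT group_concat(c.text, ', ')
            FROM comments c
            WHERE c.order_id = o.id) AS comments
    FROM orders o
    ORDER BY o.id
""").fetchall()

for row in join_rows:
    print(row)
```

In both variants the order without comments keeps its row and gets NULL (None in Python) in the comments column. The order of the concatenated items is not guaranteed unless the aggregate is given an explicit ordering.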
You can make it work with LEFT JOIN and aggregate after the join. But it's typically more efficient to aggregate first and join later.
If most or all rows in "Comment" are involved:
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN (
SELECT "order" AS id, string_agg(text, ', ') AS comments
FROM "Comment"
GROUP BY 1
) c USING (id);
Indexes won't matter, since most rows have to be read anyway.
For only a small percentage of rows (like, if you have a selective filter on "Order"):
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN LATERAL (
SELECT string_agg(text, ', ') AS comments
FROM "Comment"
WHERE "order" = o.id
) c ON true
WHERE <some_selective_filter>;
In this case, be sure to have an index on ("Comment"."order"), or more specialized, a covering index including text:
CREATE INDEX foo ON "Comment" ("order") INCLUDE (text);
Related:
Concatenate multiple result rows of one column into one, group by another column
Multiple array_agg() calls in a single query
Does a query with a primary key and foreign keys run faster than a query with just primary keys?
Aside: Consider legal, lower-case, unquoted identifiers in Postgres. In particular, don't (ab-)use completely reserved SQL keywords like ORDER as identifier. Much clearer and less potential for sneaky errors. See:
Are PostgreSQL column names case-sensitive?

Select all rows that have at least a list of features with wildcard support

Given the following table definitions:
Objects:
obj_id | obj_name
-------|--------------
1 | object1
2 | object2
3 | object3
Tags:
tag_id | tag_name
-------|--------------
1 | code:python
2 | code:cpp
3 | color:green
4 | colorful
5 | image
objects_tags:
obj_id | tag_id
-------|---------
1 | 1
1 | 2
2 | 1
2 | 3
3 | 1
3 | 2
3 | 3
I'd like to select objects that have all tags from a given list, with wildcard support. Similar questions have been asked several times, and the answer to the simpler variant looks more or less like this:
SELECT obj_id,count(*) c FROM objects_tags
INNER JOIN objects USING(obj_id)
INNER JOIN tags USING(tag_id)
WHERE (tag_name GLOB 'code*' OR tag_name GLOB 'color*')
GROUP BY obj_id
HAVING (c==2)
However, this solution doesn't work with wildcards. Is it possible to create a similar query that returns the objects for which each given wildcard matches at least 1 tag? Checking c>=2 doesn't work, because one wildcard can match multiple tags while another matches 0, and the object would still pass even though it shouldn't.
I considered building a dynamic query in the client software, consisting of N INTERSECTs (one per tag), because there probably won't be many of them. But that sounds like a really dirty solution, and if there's a more SQL-like way I'd prefer to use it.
SQLite supports the WITH clause, so I would use it to determine all matching tags first, and then use these tags to find the objects as shown below.
The example (demo) is written for PostgreSQL because I could not run SQLite on any online tester, but I believe you can convert it easily to SQLite.
This query retrieves all matching tags:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT * FROM tagss;
| tag_id | tag_name |
|--------|-------------|
| 1 | code:python |
| 2 | code:cpp |
| 3 | color:green |
and the final query uses the above subquery in this way:
WITH tagss AS (
SELECT * FROM Tags
WHERE tag_name LIKE 'code:%' OR tag_name LIKE 'color:%'
)
SELECT obj_id,count(*) c
FROM objects_tags
INNER JOIN tagss USING(tag_id)
WHERE tag_name IN ( SELECT tag_name FROM tagss)
GROUP BY obj_id
HAVING count(*) >= (
SELECT count(*) FROM tagss
)
| obj_id | c |
|--------|---|
| 3 | 3 |
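Note that the query above requires an object to carry every tag matched by any of the patterns, which is why object 2 (one code: tag and one color: tag) is not returned. If the goal is at least one tag per wildcard, one possible variant counts the distinct patterns matched per object instead. This is a minimal sketch, assuming Python's built-in sqlite3 module and SQLite's GLOB operator; the patterns CTE and the code:* / color:* patterns are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
CREATE TABLE objects_tags (obj_id INTEGER, tag_id INTEGER);
INSERT INTO tags VALUES
    (1, 'code:python'), (2, 'code:cpp'), (3, 'color:green'),
    (4, 'colorful'), (5, 'image');
INSERT INTO objects_tags VALUES
    (1, 1), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3);
""")

# Hypothetical wildcard list; each pattern must match at least one tag.
# The list is assumed to contain no duplicate patterns.
patterns = ["code:*", "color:*"]
placeholders = ", ".join("(?)" for _ in patterns)

query = f"""
WITH patterns(pat) AS (VALUES {placeholders})
SELECT ot.obj_id
FROM objects_tags AS ot
JOIN tags AS t ON t.tag_id = ot.tag_id
JOIN patterns AS p ON t.tag_name GLOB p.pat
GROUP BY ot.obj_id
-- an object qualifies only if every pattern matched at least once
HAVING COUNT(DISTINCT p.pat) = (SELECT COUNT(*) FROM patterns)
ORDER BY ot.obj_id
"""

matching = [row[0] for row in conn.execute(query, patterns)]
print(matching)
```

Counting distinct patterns rather than rows means a wildcard that matches several tags of the same object is only counted once, and a wildcard that matches nothing makes the object fail the HAVING check.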

Select and count in the same query on two tables

I've got these two tables:
___Subscriptions
|--------|--------------------|--------------|
| SUB_Id | SUB_HotelId | SUB_PlanName |
|--------|--------------------|--------------|
| 1 | cus_AjGG401e9a840D | Free |
|--------|--------------------|--------------|
___Rooms
|--------|-------------------|
| ROO_Id | ROO_HotelId |
|--------|-------------------|
| 1 |cus_AjGG401e9a840D |
| 2 |cus_AjGG401e9a840D |
| 3 |cus_AjGG401e9a840D |
| 4 |cus_AjGG401e9a840D |
|--------|-------------------|
I'd like to select the SUB_PlanName and count the rooms with the same HotelId.
So I tried:
SELECT COUNT(*) as 'ROO_Count', SUB_PlanName
FROM ___Rooms
JOIN ___Subscriptions
ON ___Subscriptions.SUB_HotelId = ___Rooms.ROO_HotelId
WHERE ROO_HotelId = 'cus_AjGG401e9a840D'
and
SELECT
SUB_PlanName,
(
SELECT Count(ROO_Id)
FROM ___Rooms
Where ___Rooms.ROO_HotelId = ___Subscriptions.SUB_HotelId
) as ROO_Count
FROM ___Subscriptions
WHERE SUB_HotelId = 'cus_AjGG401e9a840D'
But I get empty results.
Could you please help?
Thanks.
You need GROUP BY whenever you select a non-aggregated column (here SUB_PlanName) alongside an aggregate (here COUNT()). The query below gives the number of rooms only for SUB_HotelId = 'cus_AjGG401e9a840D' because of the condition in the WHERE clause. If you want the counts for all hotels, remove the WHERE filter and add s.SUB_HotelId to both the SELECT list and the GROUP BY.
SELECT s.SUB_PlanName, COUNT(*) as 'ROO_Count'
FROM ___Rooms r
JOIN ___Subscriptions s
ON s.SUB_HotelId = r.ROO_HotelId
WHERE r.ROO_HotelId = 'cus_AjGG401e9a840D'
GROUP BY s.SUB_PlanName;
To be safe, you can also use COUNT(DISTINCT r.ROO_Id) if you don't want to double-count a repeated ROO_Id. But your table structure seems to have unique (non-repeating) ROO_Ids, so COUNT(*) should work as well.
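As a quick check of the grouped query, here is a small sketch that loads the sample data into an in-memory SQLite database (an assumption made for illustration; the question itself appears to target a different engine) using Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ___Subscriptions (
    SUB_Id INTEGER PRIMARY KEY, SUB_HotelId TEXT, SUB_PlanName TEXT);
CREATE TABLE ___Rooms (ROO_Id INTEGER PRIMARY KEY, ROO_HotelId TEXT);
INSERT INTO ___Subscriptions VALUES (1, 'cus_AjGG401e9a840D', 'Free');
INSERT INTO ___Rooms VALUES
    (1, 'cus_AjGG401e9a840D'), (2, 'cus_AjGG401e9a840D'),
    (3, 'cus_AjGG401e9a840D'), (4, 'cus_AjGG401e9a840D');
""")

# One output row per plan name, with the number of joined room rows.
plan_counts = conn.execute("""
    SELECT s.SUB_PlanName, COUNT(*) AS ROO_Count
    FROM ___Rooms r
    JOIN ___Subscriptions s ON s.SUB_HotelId = r.ROO_HotelId
    WHERE r.ROO_HotelId = ?
    GROUP BY s.SUB_PlanName
""", ("cus_AjGG401e9a840D",)).fetchall()

print(plan_counts)
```

The hotel id is passed as a bound parameter here rather than inlined into the SQL text, which is the usual choice when the value comes from application code.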

postgres - pivot query with array values

Suppose I have this table:
Content
+----+-------+
| id | title |
+----+-------+
| 1  | lorem |
+----+-------+
And this one:
Fields
+----+------------+----------+-------+
| id | id_content | name     | value |
+----+------------+----------+-------+
| 1  | 1          | subtitle | ipsum |
| 2  | 1          | tags     | tag1  |
| 3  | 1          | tags     | tag2  |
| 4  | 1          | tags     | tag3  |
+----+------------+----------+-------+
The thing is: I want to query the content, transforming all the rows from "Fields" into columns, to get something like:
+----+-------+----------+------------------+
| id | title | subtitle | tags             |
+----+-------+----------+------------------+
| 1  | lorem | ipsum    | [tag1,tag2,tag3] |
+----+-------+----------+------------------+
Also, subtitle and tags are just examples. I can have as many fields as I want, whether they are arrays or not.
But I haven't found a way to convert the repeated "name" values into an array, let alone without turning "subtitle" into an array as well. If that's not possible, "subtitle" could also become an array and I could convert it later in the code, but I need to at least group everything somehow. Any ideas?
You can use array_agg, e.g.
SELECT id_content, array_agg(value)
FROM fields
WHERE name = 'tags'
GROUP BY id_content
If you need the subtitle, too, use a self-join. I have a subselect to cope with subtitles that don't have any tags without returning arrays filled with NULLs, i.e. {NULL}.
SELECT f1.id_content, f1.value, f2.value
FROM fields f1
LEFT JOIN (
SELECT id_content, array_agg(value) AS value
FROM fields
WHERE name = 'tags'
GROUP BY id_content
) f2 ON (f1.id_content = f2.id_content)
WHERE f1.name = 'subtitle';
See http://www.postgresql.org/docs/9.3/static/functions-aggregate.html for details.
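A different technique that avoids the self-join is conditional aggregation: group once per content row and let each output column pick the field rows it cares about. The sketch below is an illustration only, assuming Python's built-in sqlite3 module. SQLite has no array type, so group_concat() produces a comma-separated string that is split in Python; in PostgreSQL the analogous aggregate would be array_agg(value) FILTER (WHERE name = 'tags').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fields (
    id INTEGER PRIMARY KEY, id_content INTEGER, name TEXT, value TEXT);
INSERT INTO content VALUES (1, 'lorem');
INSERT INTO fields VALUES
    (1, 1, 'subtitle', 'ipsum'),
    (2, 1, 'tags', 'tag1'),
    (3, 1, 'tags', 'tag2'),
    (4, 1, 'tags', 'tag3');
""")

# Each CASE yields NULL for rows of other fields; aggregates skip NULLs.
row = conn.execute("""
    SELECT c.id, c.title,
           MAX(CASE WHEN f.name = 'subtitle' THEN f.value END) AS subtitle,
           group_concat(CASE WHEN f.name = 'tags' THEN f.value END) AS tags
    FROM content c
    LEFT JOIN fields f ON f.id_content = c.id
    GROUP BY c.id, c.title
""").fetchone()

# Turn the comma-separated string into a list (empty if there are no tags).
tags = row[3].split(",") if row[3] else []
print(row[:3], tags)
```

Each additional field needs its own output expression, so this only fits when the set of field names is known up front, which is the same limitation the crosstab column list has.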
If you have access to the tablefunc module, another option is to use crosstab as pointed out by Houari. You can make it return arrays and non-arrays with something like this:
SELECT id_content, unnest(subtitle), tags
FROM crosstab('
SELECT id_content, name, array_agg(value)
FROM fields
GROUP BY id_content, name
ORDER BY 1, 2
') AS ct(id_content integer, subtitle text[], tags text[]);
However, crosstab requires that the values always appear in the same order. For instance, if the first group (with the same id_content) doesn't have a subtitle and only has tags, the tags will be unnested and will appear in the same column with the subtitles.
See also http://www.postgresql.org/docs/9.3/static/tablefunc.html
If the subtitle value is the only "constant" that you want to separate, you can do:
SELECT * FROM crosstab
(
'SELECT content.id,name,array_to_string(array_agg(value),'','')::character varying FROM content inner join
(
select * from fields where fields.name = ''subtitle''
union all
select * from fields where fields.name <> ''subtitle''
) fields_ordered
on fields_ordered.id_content = content.id group by content.id,name'
)
AS
(
id integer,
content_name character varying,
tags character varying
);