Aggregate and count after left join - sql

I am aggregating columns of a table to find the count of each unique value. For example, aggregating the status might show that out of 5 alerts there are 2 in open status and 3 that are closed. The simplified table looks like this:
create table alerts (
    id text, -- column types assumed for illustration; the question omits them
    status text,
    owner_id text
);
The query below uses grouping sets to aggregate multiple columns at once. This approach works well.
with aggs as (
    select status
    from alerts
    where alerts.owner_id = 'x'
)
select status, count(*)
from aggs
group by grouping sets (
    (),
    (status)
);
The output at its simplest could look like this:
 status | count
--------+-------
        |     1
 new    |     1
However, now I need to aggregate additional columns from another table. This table (shown below) can have zero or more rows associated with each alert (alerts:users is 1:N).
create table users (
    id text,
    alert_id text,
    name text -- types again assumed for illustration
);
I have tried updating the query to use a left join, but this approach incorrectly inflates the counts for the alert columns.
with aggs as (
    select alerts.status, users.name
    from alerts
    left join users on alerts.id = users.alert_id
    where alerts.owner_id = 'x'
    -- and additional filtering by columns in the users table
)
select status, name, count(*)
from aggs
group by grouping sets (
    (),
    (status),
    (name)
);
Below is an example of the incorrect results. Since there are 3 rows in the users table, the count for the status column is now 3 but should be 1.
 status | name  | count
--------+-------+-------
        |       |     3
        | user1 |     1
        | user2 |     1
        | user3 |     1
 new    |       |     3
How can I perform this aggregation to include the columns from the table with a many-to-one relationship without inflating the counts? In the future I will likely need to aggregate more columns from other tables with a many-to-one relationship and need a solution that will still work with several left joins. All help is much appreciated.
edit: link to db-fiddle https://www.db-fiddle.com/f/buGD2DuJiqf9LGF9rw5EgT/2

Do you just want to count the number of alerts? If so, use count(distinct):
count(distinct alert_id)
Of course, you need this in aggs, so the select would include:
alerts.id as alert_id
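Applied to the query from the question, the whole thing might look like this (a sketch; it assumes that counting distinct alerts is the desired semantics for every grouping):
with aggs as (
    select alerts.id as alert_id, alerts.status, users.name
    from alerts
    left join users on alerts.id = users.alert_id
    where alerts.owner_id = 'x'
)
select status, name, count(distinct alert_id)
from aggs
group by grouping sets (
    (),
    (status),
    (name)
);
Each grouping now counts distinct alerts rather than joined rows, so adding further left joins to other 1:N tables should not inflate the counts either.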

Related

How to aggregate json fields when using GROUP BY clause in postgres?

I have the following table structure in my Postgres DB (v12.0)
 id | pieces | item_id | material_detail
----|--------|---------|-----------------
 1  | 10     | 2       | [{"material_id":1,"pieces":10},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
 2  | 20     | 2       | [{"material_id":1,"pieces":40}]
 3  | 30     | 3       | [{"material_id":1,"pieces":20},{"material_id":3,"pieces":30}]
I am using a GROUP BY query for these records, like below:
SELECT SUM(PIECES) FROM detail_table GROUP BY item_id HAVING item_id = 2
With this I get the total pieces as 30. But how can I get the total pieces from material_detail, grouped by material_id?
I want a result something like this:
pieces | material_detail
-------| ------------------
30 | [{"material_id":1,"pieces":50},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]
As I am from MySQL background, I don't know how to achieve this with JSON fields in Postgres.
Note: material_detail column is of JSONB type.
You are aggregating on two different levels. I can't think of a solution that wouldn't need two separate aggregation steps. Additionally, to aggregate the material information, all arrays for an item_id have to be unnested first before the actual pieces value can be aggregated for each material_id. Then this has to be aggregated back into a JSON array.
with pieces as (
    -- the basic aggregation for the "detail pieces"
    select dt.item_id, sum(dt.pieces) as pieces
    from detail_table dt
    where dt.item_id = 2
    group by dt.item_id
), details as (
    -- normalize the material information and aggregate the pieces per material_id
    select dt.item_id,
           (m.detail -> 'material_id')::int as material_id,
           sum((m.detail -> 'pieces')::int) as pieces
    from detail_table dt
    cross join lateral jsonb_array_elements(dt.material_detail) as m(detail)
    where dt.item_id in (select item_id from pieces) --<< don't aggregate too much
    group by dt.item_id, material_id
), material as (
    -- now de-normalize the material aggregation back into a single JSON array
    -- for each item_id
    select item_id, jsonb_agg(to_jsonb(d) - 'item_id') as material_detail
    from details d
    group by item_id
)
-- join both results together
select p.item_id, p.pieces, m.material_detail
from pieces p
join material m on m.item_id = p.item_id;
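For the sample data this should return the result the question asks for, with the item_id added (the order of the elements inside the aggregated array may vary):
 item_id | pieces | material_detail
---------+--------+-----------------
 2       | 30     | [{"material_id":1,"pieces":50},{"material_id":2,"pieces":20},{"material_id":3,"pieces":30}]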
Online example

SQL: Get count of rows for values in another column even when those values do not exist

I have a table named 'products' with two columns: name and status.
I would like to get the count of rows in the table with statuses as Draft, Published and Rejected.
This is the query I tried:
select count(*), status from products where status in ('Draft', 'Published') group by status;
At the moment the table does not have any row with the status as Published or Rejected.
So the above query just returns one row, for the Draft status, along with its count:
count | status
-------+--------
24 | Draft
However, I would like the query result to include the other statuses with a count of zero:
count | status
-------+--------
24 | Draft
0 | Published
0 | Rejected
How should I write the query so that I get the results as above?
You need a list of the statuses and a left join:
select v.status, count(p.status)
from (values ('Draft'), ('Published'), ('Rejected')) v(status)
left join products p on p.status = v.status
group by v.status;
Note that this counts p.status rather than *: for a status with no matching products, the left join still produces one row, but its p.status is NULL, and count() ignores NULLs, which is what yields the zero.

SQL simplifying an except query

I have a database with around 50 million entries showing the status of a device for a given day, simplified to the form:
id | status
-------------
1 | Off
1 | Off
1 | On
2 | Off
2 | Off
3 | Off
3 | Off
3 | On
...
such that each id is guaranteed to have at least 2 rows with an 'Off' status, but doesn't have to have an 'On' status. I'm trying to get a list of only the ids that do not have an 'On' status. For example, for the above data set I'd want a query that returns only '2'.
The current query is:
SELECT DISTINCT id FROM table
EXCEPT
SELECT DISTINCT id FROM table WHERE status <> 'Off'
This seems to work, but it has to scan the entire table twice, which ends up taking ~10-12 minutes per query. Is there a simpler way to do this with a single query?
You can use WHERE NOT EXISTS instead:
Select Distinct Id
From Table A
Where Not Exists
(
Select *
From Table B
Where A.Id = B.Id
And B.Status = 'On'
)
I would also recommend looking at the indexes on the Status column. 10-12 minutes to run is excessively long. Even with 50m records, with proper indexing, a query like this shouldn't take longer than a second.
To add an index to the column, you can run this (I'm assuming SQL Server, your syntax may vary):
Create NonClustered Index Ix_YourTable_Status On YourTable (Status Asc);
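In Postgres there is no NonClustered keyword, but a plain index serves the same purpose (a sketch, using the same placeholder table name as above):
Create Index Ix_YourTable_Status On YourTable (Status Asc);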
You can use conditional aggregation:
select id
from table
group by id
having count(case when status = 'On' then 1 end) = 0
The case expression yields 1 for 'On' rows and NULL otherwise, and count() ignores NULLs, so the having clause keeps only the ids with no 'On' row at all.
You can use the help of a SELF JOIN. Note that the status filter has to go in the ON clause rather than the WHERE clause; otherwise the IS NULL test can never succeed:
SELECT DISTINCT A.Id
FROM Table A
LEFT JOIN Table B
    ON A.Id = B.Id
    AND B.Status = 'On'
WHERE B.Id IS NULL

Trouble performing Postgres group by non-ID column to get ID containing max value

I'm attempting to perform a GROUP BY on a join table. The join table essentially looks like:
CREATE TABLE user_foos (
    id SERIAL PRIMARY KEY,
    user_id INT NOT NULL,
    foo_id INT NOT NULL,
    effective_at TIMESTAMP NOT NULL -- Postgres has no DATETIME type
);
ALTER TABLE user_foos
    ADD CONSTRAINT user_foos_uniqueness
    UNIQUE (user_id, foo_id, effective_at);
I'd like to query this table to find all records where the effective_at is the max value for any pair of user_id, foo_id given. I've tried the following:
SELECT "user_foos"."id",
"user_foos"."user_id",
"user_foos"."foo_id",
max("user_foos"."effective_at")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id";
Unfortunately, this results in the error:
column "user_foos.id" must appear in the GROUP BY clause or be used in an aggregate function
I understand that the problem relates to id not being used in an aggregate function, and that the DB doesn't know what to do if it finds multiple records with differing ids, but I know this could never happen due to my three-column unique constraint across those columns (user_id, foo_id, and effective_at).
To work around this, I also tried a number of other variants such as using the first_value window function on the id:
SELECT first_value("user_foos"."id"),
       "user_foos"."user_id",
       "user_foos"."foo_id",
       max("user_foos"."effective_at")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id";
and:
SELECT first_value("user_foos"."id")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id"
HAVING "user_foos"."effective_at" = max("user_foos"."effective_at")
Unfortunately, these both result in a different error:
window function call requires an OVER clause
Ideally, my goal is to fetch ALL matching id's so that I can use it in a subquery to fetch the legitimate full row data from this table for matching records. Can anyone provide insight on how I can get this working?
Postgres has a very nice feature called distinct on, which can be used in this case:
SELECT DISTINCT ON (uf."user_id", uf."foo_id") uf.*
FROM "user_foos" uf
ORDER BY uf."user_id", uf."foo_id", uf."effective_at" DESC;
It returns the first row in a group, based on the values in parentheses. The order by clause needs to include these values as well as a third column for determining which is the first row in the group.
Try:
SELECT *
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY user_id, foo_id ORDER BY effective_at DESC) x
    FROM user_foos t
) sub -- the derived table needs an alias in Postgres
WHERE x = 1
If you don't want to use a subquery based on a composite of all three keys, you can create a "dense rank" window function field that orders the rows within each user_id, foo_id subset by effective date, then subquery that and take the records where rank_order = 1 (see the sketch after the example data below). Since the ranking was by effective date, you get all fields of the record with the highest effective date for each foo and user.
DATASET (id, user_id, foo_id, effective_at)
1   1   1   01/01/2001
2   1   1   01/01/2002
3   1   1   01/01/2003
4   1   2   01/01/2001
5   2   1   01/01/2001
DATASET WITH RANK ORDER (id, rank_order, user_id, foo_id, effective_at), PARTITIONED BY FOO_ID, USER_ID, ORDERED BY DATE DESC
1   3   1   1   01/01/2001
2   2   1   1   01/01/2002
3   1   1   1   01/01/2003
4   1   1   2   01/01/2001
5   1   2   1   01/01/2001
SELECT * FROM QUERY ABOVE WHERE RANK_ORDER = 1
3   1   1   1   01/01/2003
4   1   1   2   01/01/2001
5   1   2   1   01/01/2001
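A sketch of that approach in SQL, assuming the user_foos table from the question (the unique constraint on (user_id, foo_id, effective_at) rules out ties, so dense_rank and row_number behave the same here):
select id, user_id, foo_id, effective_at
from (
    select uf.*,
           dense_rank() over (partition by user_id, foo_id
                              order by effective_at desc) as rank_order
    from user_foos uf
) ranked
where rank_order = 1;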

How do I use a join to query two tables and get all rows from one, and related rows from the other?

Simplified for example, I have two tables, groups and items.
items (
id,
groupId,
title
)
groups (
id,
groupTitle,
externalURL
)
The regular query I'm using goes something like this:
SELECT
    i.`id`,
    i.`title`,
    g.`id` as 'groupId',
    g.`groupTitle`,
    g.`externalURL`
FROM
    items i
    INNER JOIN groups g ON (i.`groupId` = g.`id`)
However I need to modify this now, because all the groups which specify an externalURL will not have any corresponding records in the items table (since they're stored externally). Is it possible to do some sort of join so that the output looks kinda like this:
items:
 id   title    groupId
----------------------
 1    Item 1   1
 2    Item 2   1
groups:
 id   groupTitle   externalURL
-------------------------------
 1    Group 1      NULL
 2    Group 2      something
 3    Group 3      NULL
Query output:
 id     title    groupId   groupTitle   externalURL
---------------------------------------------------
 1      Item 1   1         Group 1      NULL
 2      Item 2   1         Group 1      NULL
 NULL   NULL     2         Group 2      something
-- note that group 3 didn't show up because it had no items OR externalURL
Is that possible in one SQL query?
This is exactly what an outer join is for: return all the rows from one table, whether or not there is a matching row in the other table. In those cases, return NULL for all the columns of the other table.
The other condition you can take care of in the WHERE clause.
SELECT
    i.`id`,
    i.`title`,
    g.`id` as 'groupId',
    g.`groupTitle`,
    g.`externalURL`
FROM
    items i
    RIGHT OUTER JOIN groups g ON (i.`groupId` = g.`id`)
WHERE i.`id` IS NOT NULL OR g.`externalURL` IS NOT NULL;
Only if both i.id and g.externalURL are NULL is the whole row excluded from the joined result set.
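If the RIGHT OUTER JOIN reads awkwardly, an equivalent formulation swaps the table order and uses a LEFT JOIN, keeping groups as the preserved side (a sketch, logically identical to the query above):
SELECT
    i.`id`,
    i.`title`,
    g.`id` as 'groupId',
    g.`groupTitle`,
    g.`externalURL`
FROM
    groups g
    LEFT OUTER JOIN items i ON (i.`groupId` = g.`id`)
WHERE i.`id` IS NOT NULL OR g.`externalURL` IS NOT NULL;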