Filtering within Postrgres aggregations - sql

I have a table in Postgres called tasks. It records Mechanical Turk-style tasks. It has the following columns:
entity_name, text (the thing being reviewed)
reviewer_email, text (the email address of the person doing the reviewing)
result, boolean (the entry provided by the reviewer)
Each entity that needs to be reviewed leads to the generation of two task rows, each assigned to a different reviewer. When both reviewers disagree (e.g. their values for result are not equal), the application kicks off a third task, assigned to a moderator. The moderators always have the same email domain.
I'm trying to get the counts for each time reviewer a reviewer has been overruled by a moderator, or affirmed by a moderator. I think I'm fairly close, but the last bit is proving tricky:
SELECT
reviewer_email,
COUNT(*) FILTER(
WHERE entity_name IN (
SELECT entity_name
FROM tasks
GROUP BY entity_name
HAVING
COUNT(*) FILTER (WHERE result IS NOT NULL) = 3 -- find the entities that have exactly three reviews
AND
-- this is the tricky part:
-- need something like:
-- WHERE current_review.result = moderator_review.result
)
) AS overruled_count
FROM
tasks
WHERE
result IS NOT NULL
GROUP BY
reviewer_email
HAVING
reviewer_email NOT LIKE '%#moderators-domain.net'
Sample data:
id | entity_name | reviewer_email | result
1 | apple | bob#email.net | true
2 | apple | alice#email.net | false
3 | apple | mod##moderators-domain.net | true
4 | pair | bob#email.net | true
5 | pair | alice#email.net | false
6 | pair | mod##moderators-domain.net | false
7 | kiwi | bob#email.net | true
8 | kiwi | alice#email.net | true
Desired results:
reviewer_email | overruled_count | affirmed_count
bob#email.net | 1 | 1
alice#email.net | 1 | 1
Bob and Alice each have done three reviews. On one review, they agreed, therefore there was no moderation. They disagreed on the other two reviews and were overruled once, and affirmed once by the moderator.
I believe the code above has me on the right track, but I'm definitely interested in other approaches to this.

I think this is a harder problem than you might realize. The following appends the moderator review to each non-moderator review:
select t.*, tm.result as moderator_result
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%#moderators-domain.net' and
tm.reviewer_email LIKE '%#moderators-domain.net';
From this, we can aggregate the results that you want:
select reviewer_email,
sum( (result = moderator_result)::int ) as moderator_agrees,
sum( (result <> moderator_result)::int ) as moderator_disagrees
from (select t.*, tm.result as moderator_result
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%#moderators-domain.net' and
tm.reviewer_email LIKE '%#moderators-domain.net'
) t
group by reviewer_email;
There may be a way to do this using filter and even window functions. This method seems the most natural to me.
I should note that the subquery is not necessary, of course:
select t.reviewer_email,
sum( (t.result = tm.result)::int ) as moderator_agrees,
sum( (t.result <> tm.result)::int ) as moderator_disagrees
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%#moderators-domain.net' and
tm.reviewer_email LIKE '%#moderators-domain.net'
group by t.reviewer_email;

Just adding some changes to make the query a bit easier to understand in my opinion.
I'm guessing we also need to consider the case where we have users who have never been either affirmed or overruled (so the counts for them would be 0)
SELECT
tasks.reviewer_email,
COUNT(*) FILTER (WHERE tasks.result = modtasks.result) AS affirmed_count,
COUNT(*) FILTER (WHERE tasks.result <> modtasks.result) AS overruled_count
FROM tasks
LEFT JOIN tasks modtasks
ON modtasks.entity_name = tasks.entity_name
AND modtasks.reviewer_email LIKE '%#moderators-domain.net'
WHERE tasks.reviewer_email NOT LIKE '%#moderators-domain.net'
GROUP BY tasks.reviewer_email

Sample data
CREATE TABLE tasks
("id" int, "entity_name" text, "reviewer_email" text, "result" boolean)
;
INSERT INTO tasks
("id", "entity_name", "reviewer_email", "result")
VALUES
(1, 'apple', 'bob#email.net', 'true'),
(2, 'apple', 'alice#email.net', 'false'),
(3, 'apple', 'mod##moderators-domain.net', 'true'),
(4, 'pair', 'bob#email.net', 'true'),
(5, 'pair', 'alice#email.net', 'false'),
(6, 'pair', 'mod##moderators-domain.net', 'false'),
(7, 'kiwi', 'bob#email.net', 'true'),
(8, 'kiwi', 'alice#email.net', 'true')
;
Query 1
WITH
CTE_moderated_tasks
AS
(
SELECT
id AS mod_id
,entity_name AS mod_entity_name
,result AS mod_result
FROM tasks
WHERE reviewer_email LIKE '%#moderators-domain.net'
)
SELECT
tasks.reviewer_email
,SUM(CASE WHEN tasks.result <> mod_result THEN 1 ELSE 0 END) AS overruled_count
,SUM(CASE WHEN tasks.result = mod_result THEN 1 ELSE 0 END) AS affirmed_count
FROM
CTE_moderated_tasks
INNER JOIN tasks ON
tasks.entity_name = CTE_moderated_tasks.mod_entity_name
AND tasks.id <> CTE_moderated_tasks.mod_id
GROUP BY
tasks.reviewer_email
I split the query into two parts.
At first I want to find all tasks where moderator was involved (CTE_moderated_tasks). It assumes that moderator can't be involved more than once in the same task.
This result is inner joined to original tasks table thus naturally filtering out all tasks where moderator was not involved. This also gives us moderator opinion next to the reviewer opinion. This assumes that there are only two reviewers for the same task.
All that is left now is simple grouping by reviewers and counting how many times reviewer's and moderator's opinions matched. I used a classic SUM(CASE ...) for this conditional aggregate.
You don't have to use CTE, I used it primarily for readability.
I'd also like to highlight that this query uses LIKE only during one scan of the table. If there is an index on entity_name the join may be rather efficient.
Result
| reviewer_email | overruled_count | affirmed_count |
|-----------------|-----------------|----------------|
| alice#email.net | 1 | 1 |
| bob#email.net | 1 | 1 |
.
Variant without self-join
Here is another variant without self-join, which may be more efficient. You need to test with your real data, indexes and hardware.
This query uses window function with partitioning by entity_name to bring moderator result for each row without explicit self-join. You can use any aggregate function here (SUM or MIN or MAX), because there will be at most one row from moderator for each entity_name.
Then simple grouping with conditional aggregate give us the count.
Here conditional aggregate uses the fact that NULL compared to any value never returns true. mod_result for entities that don't have a moderator would have nulls and both result <> mod_result and result = mod_result would yield false, so such rows don't contribute to either count.
Final HAVING reviewer_email NOT LIKE '%#moderators-domain.net' removes the count of moderator results themselves.
Again, you don't have to use CTE here and I used it primarily for readability. I'd recommend to run just the CTE first and examine intermediate results to understand how the query works.
Query 2
WITH
CTE
AS
(
SELECT
id
,entity_name
,reviewer_email
,result::int
,SUM(result::int)
FILTER (WHERE reviewer_email LIKE '%#moderators-domain.net')
OVER (PARTITION BY entity_name) AS mod_result
FROM tasks
)
SELECT
reviewer_email
,SUM(CASE WHEN result <> mod_result THEN 1 ELSE 0 END) AS overruled_count
,SUM(CASE WHEN result = mod_result THEN 1 ELSE 0 END) AS affirmed_count
FROM CTE
GROUP BY reviewer_email
HAVING reviewer_email NOT LIKE '%#moderators-domain.net'

Related

postgresql total column sum

SELECT
SELECT pp.id, TO_CHAR(pp.created_dt::date, 'dd.mm.yyyy') AS "Date", CAST(pp.created_dt AS time(0)) AS "Time",
au.username AS "User", ss.name AS "Service", pp.amount, REPLACE(pp.status, 'SUCCESS', ' ') AS "Status",
pp.account AS "Props", pp.external_id AS "External", COALESCE(pp.external_status, null, 'indefined') AS "External status"
FROM payment AS pp
INNER JOIN auth_user AS au ON au.id = pp.creator_id
INNER JOIN services_service AS ss ON ss.id = pp.service_id
WHERE pp.created_dt::date = (CURRENT_DATE - INTERVAL '1' day)::date
AND ss.name = 'Some Name' AND pp.status = 'SUCCESS'
id | Date | Time | Service |amount | Status |
------+-----------+-----------+------------+-------+--------+---
9 | 2021.11.1 | 12:20:01 | some serv | 100 | stat |
10 | 2021.12.1 | 12:20:01 | some serv | 89 | stat |
------+-----------+-----------+------------+-------+--------+-----
Total | | | | 189 | |
I have a SELECT like this. I need to get something like the one shown above. That is, I need to get the total of one column. I've tried a lot of things already, but nothing works out for me.
If I understand correctly you want a result where extra row with aggregated value is appended after result of original query. You can achieve it multiple ways:
1. (recommended) the simplest way is probably to union your original query with helper query:
with t(id,other_column1,other_column2,amount) as (values
(9,'some serv','stat',100),
(10,'some serv','stat',89)
)
select t.id::text, t.other_column1, t.other_column2, t.amount from t
union all
select 'Total', null, null, sum(amount) from t
2. you can also use group by rollup clause whose purpose is exactly this. Your case makes it harder since your query contains many columns uninvolved in aggregation. Hence it is better to compute aggregation aside and join unimportant data later:
with t(id,other_column1,other_column2,amount) as (values
(9,'some serv','stat',100),
(10,'some serv','stat',89)
)
select case when t.id is null then 'Total' else t.id::text end as id
, t.other_column1
, t.other_column2
, case when t.id is null then ext.sum else t.amount end as amount
from (
select t.id, sum(amount) as sum
from t
group by rollup(t.id)
) ext
left join t on ext.id = t.id
order by ext.id
3. For completeness I just show you what should be done to avoid join. In that case group by clause would have to use all columns except amount (to preserve original rows) plus the aggregation (to get the sum row) hence the grouping sets clause with 2 sets is handy. (The rollup clause is special case of grouping sets after all.) The obvious drawback is repeating case grouping... expression for each column uninvolved in aggregation.
with t(id,other_column1,other_column2,amount) as (values
(9,'some serv','stat',100),
(10,'some serv2','stat',89)
)
select case grouping(t.id) when 0 then t.id::text else 'Total' end as id
, case grouping(t.id) when 0 then t.other_column1 end as other_column1
, case grouping(t.id) when 0 then t.other_column2 end as other_column2
, sum(t.amount) as amount
from t
group by grouping sets((t.id, t.other_column1, t.other_column2), ())
order by t.id
See example (db fiddle):
(To be frank, I can hardly imagine any purpose other than plain reporting where a column mixes id of number type with label Total of text type.)

Aggregate and count after left join

I am aggregating columns of a table to find the count of unique values. For example, aggregating
the status shows that out of 5 alerts there are 2 in open status and 3 that are closed. The simplified table looks like this:
create table alerts (
id,
status,
owner_id
);
The query below uses grouping sets to aggregate multiple columns at once. This approach works well.
with aggs as (
select status
from alerts
where alerts.owner_id = 'x'
)
select status, count(*)
from aggs
group by grouping sets(
(),
(status)
);
the output at its simplest could look like this:
status | count
--------+-------
| 1
new | 1
However, now I need to aggregate additional columns from another table. This table (shown below) can have zero or more rows associated to the first table (alerts:users 1:N).
create table users (
id,
alert_id,
name
);
I have tried updating the query to use a left join but this approach incorrectly inflates the counts of the alert columns.
with aggs as (
select alerts.status, users.name
from alerts
left join users on alerts.id = users.alert_id
where alerts.owner_id = 'x'
-- and additional filtering by columns in the users table
)
select status, name, count(*)
from aggs
group by grouping sets(
(),
(status),
(name)
);
Below is an example of the incorrect results. Since there are 3 rows in the user table the count
for the status column is now 3 but should be 1.
status | name | count
--------+-------------------------+-------
| | 3
| user1 | 1
| user2 | 1
| user3 | 1
new | | 3
How can I perform this aggregation to include the columns from the table with a many-to-one relationship without inflating the counts? In the future I will likely need to aggregate more columns from other tables with a many-to-one relationship and need a solution that will still work with several left joins. All help is much appreciated.
edit: link to db-fiddle https://www.db-fiddle.com/f/buGD2DuJiqf9LGF9rw5EgT/2
Do you just want to count the number of alerts? If so, use count(distinct):
count(distinct alert_id)
Of course, you need this in aggs, so the select would include:
alerts.id as alert_id

Return count of total group membership when providers are part of a group

TABLE A: Pre-joined table - Holds a list of providers who belong to a group and the group the provider belongs to. Columns are something like this:
ProviderID (PK, FK) | ProviderName | GroupID | GroupName
1234 | LocalDoctor | 987 | LocalDoctorsUnited
5678 | Physican82 | 987 | LocalDoctorsUnited
9012 | Dentist13 | 153 | DentistryToday
0506 | EyeSpecial | 759 | OphtaSpecialist
TABLE B: Another pre-joined table, holds a list of providers and their demographic information. Columns as such:
ProviderID (PK,FK) | ProviderName | G_or_I | OtherColumnsThatArentInUse
1234 | LocalDoctor | G | Etc.
5678 | Physican82 | G | Etc.
9012 | Dentist13 | I | Etc.
0506 | EyeSpecial | I | Etc.
The expected result is something like this:
ProviderID | ProviderName | ProviderStatus | GroupCount
1234 | LocalDoctor | Group | 2
5678 | Physican82 | Group | 2
9012 | Dentist13 | Individual | N/A
0506 | EyeSpecial | Individual | N/A
The goal is to determine whether or not a provider belongs to a group or operates individually, by the G_or_I column. If the provider belongs to a group, I need to include an additional column that provides the count of total providers in that group.
The Group/Individual portion is relatively easy - I've done something like this:
SELECT DISTINCT
A.ProviderID,
A.ProviderName,
CASE
WHEN B.G_or_I = 'G'
THEN 'Group'
WHEN B.G_or_I = 'I'
THEN 'Individual' END AS ProviderStatus
FROM
TableA A
LEFT OUTER JOIN TableB B
ON A.ProviderID = B.ProviderID;
So far so good, this returns the expected results based on the G_or_I flag.
However, I can't seem to wrap my head around how to complete the COUNT portion. I feel like I may be overthinking it, and stuck in a loop of errors. Some things I've tried:
Add a second CASE STATEMENT:
CASE
WHEN B.G_or_I = 'G'
THEN (
SELECT CountedGroups
FROM (
SELECT ProviderID, count(GroupID) AS CountedGroups
FROM TableA
WHERE A.ProviderID = B.ProviderID
GROUP BY ProviderID --originally had this as ORDER BY, but that was a mis-type on my part
)
)
ELSE 'N/A' END
This returns an error stating that a single row sub-query is returning more than one row. If I limit the number of rows returned to 1, the CountedGroups column returns 1 for every row. This makes me think that its not performing the count function as I expect it to.
I've also tried including a direct count of TableA as a factored sub-query:
WITH CountedGroups AS
( SELECT Provider ID, count(GroupID) As GroupSum
FROM TableA
GROUP BY ProviderID --originally had this as ORDER BY, but that was a mis-type on my part
) --This as a standalone query works just fine
SELECT DISTINCT
A.ProviderID,
A.ProviderName,
CASE
WHEN B.G_or_I = 'G'
THEN 'Group'
WHEN B.G_or_I = 'I'
THEN 'Individual' END AS ProviderStatus,
CASE
WHEN B.G_or_I = 'G'
THEN GroupSum
ELSE 'N/A' END
FROM
CountedGroups CG
JOIN TableA A
ON CG.ProviderID = A.ProviderID
LEFT OUTER JOIN TableB
ON A.ProviderID = B.ProviderID
This returns either null or completely incorrect column values
Other attempts have been a number of variations of this, with a mix of bad results or Oracle errors. As I mentioned above, I'm probably way overthinking it and the solution could be rather simple. Apologies if the information is confusing or I've not provided enough detail. The real tables have a lot of private medical information, and I tried to translate the essence of the issue as best I could.
Thank you.
You can use the CASE..WHEN and analytical function COUNT as follows:
SELECT
A.PROVIDERID,
A.PROVIDERNAME,
CASE
WHEN B.G_OR_I = 'G' THEN 'Group'
ELSE 'Individual'
END AS PROVIDERSTATUS,
CASE
WHEN B.G_OR_I = 'G' THEN TO_CHAR(COUNT(1) OVER(
PARTITION BY A.GROUPID
))
ELSE 'N/A'
END AS GROUPCOUNT
FROM
TABLE_A A
JOIN TABLE_B B ON A.PROVIDERID = B.PROVIDERID;
TO_CHAR is needed on COUNT as output expression must be of the same data type in CASE..WHEN
Your problem seems to be that you are missing a column. You need to add group name, otherwise you won't be able to differentiate rows for the same practitioner who works under multiple business entities (groups). This is probably why you have a DISTINCT on your query. Things looked like duplicates which weren't. Once you've done that, just use an analytic function to figure out the rest:
SELECT ta.providerid,
ta.providername,
DECODE(tb.g_or_i, 'G', 'Group', 'I', 'Individual') AS ProviderStatus,
ta.group_name,
CASE
WHEN tb.g_or_i = 'G' THEN COUNT(DISTINCT ta.provider_id) OVER (PARTITION BY ta.group_id)
ELSE 'N/A'
END AS GROUP_COUNT
FROM table_a ta
INNER JOIN table_b tb ON ta.providerid = tb.providerid
Is it possible that your LEFT JOIN was going the wrong direction? It makes more sense that your base demographic table would have all practitioners in it and then the Group table might be missing some records. For instance if the solo prac was operating under their own SSN and Type I NPI without applying for a separate Type II NPI or TIN.

Get specific row from each group

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
|ID|CONFIG_ID|STATE |MAJOR_REV|MODIFIED_ON|ELEMENT_ID|
+--+---------+----------+---------+-----------+----------+
| 1|1234 |Published | 2 |2019-04-03 | 98762 |
| 2|1234 |Draft | 1 |2019-01-02 | 98762 |
| 3|5678 |Draft | 3 |2019-01-02 | 24244 |
| 4|5678 |Published | 2 |2017-10-04 | 24244 |
| 5|5678 |Draft | 1 |2015-05-04 | 24244 |
It's actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
allDocs.ID,
d.CONFIG_ID,
d.[STATE],
d.MAJOR_REV,
d.MODIFIED_ON,
d.ELEMENT_ID,
f.ID FILE_ID,
f.[FILENAME],
et.COLUMN1,
e.COLUMN2
FROM DOCUMENT -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
SELECT TOP 1
ID,
MODIFIED_ON,
CONFIG_ID,
MAJOR_REV,
ELEMENT_ID,
[STATE]
FROM DOCUMENT
WHERE CONFIG_ID=allDocs.CONFIG_ID
ORDER BY MAJOR_REV desc
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID=et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
SELECT TOP 1
FILE_ID
FROM DOCUMENT_FILE_RELATION
WHERE DOCUMENT_ID=d.ID
ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f on f.ID=df.FILE_ID
WHERE
allDocs.CONFIG_ID = '5678' -- Just for testing purposes
and d.state ='Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.
This would be a classic row_number & partition by question I think.
;with rows as
(
select <your-columns>,
row_number() over (partion by config_id order by <whatever you want>) as rn
from document
join <anything else>
where <whatever>
)
select * from rows where rn=1

Select a row used for GROUP BY

I have this table:
id | owner | asset | rate
-------------------------
1 | 1 | 3 | 1
2 | 1 | 4 | 2
3 | 2 | 3 | 3
4 | 2 | 5 | 4
And i'm using
SELECT asset, max(rate)
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
ORDER BY max(rate) DESC
to get intersection of assets for specified owners with best rate.
I also need id of row used for max(rate), but i can't find a way to include it to SELECT. Any ideas?
Edit:
I need
Find all assets that belongs to both owners (1 and 2)
From the same asset i need only one with the best rate (3)
I also need other columns (owner) that belongs to the specific asset with best rate
I expect the following output:
id | asset | rate
-------------------------
3 | 3 | 3
Oops, all 3s, but basically i need id of 3rd row to query the same table again, so resulting output (after second query) will be:
id | owner | asset | rate
-------------------------
3 | 2 | 3 | 3
Let's say it's Postgres, but i'd prefer reasonably cross-DBMS solution.
Edit 2:
Guys, i know how to do this with JOINs. Sorry for misleading question, but i need to know how to get extra from existing query. I already have needed assets and rates selected, i just need one extra field among with max(rate) and given conditions if it's possible.
Another solution that might or might not be faster than a self join (depending on the DBMS' optimizer)
SELECT id,
asset,
rate,
asset_count
FROM (
SELECT id,
asset,
rate,
rank() over (partition by asset order by rate desc) as rank_rate,
count(asset) over (partition by null) as asset_count
FROM test
WHERE owner IN (1, 2)
) t
WHERE rank_rate = 1
ORDER BY rate DESC
You are dealing with two questions and trying to solve them as if they are one. With a subquery, you can better refine by filtering the list in the proper order first (max(rate)), but as soon as you group, you lose this. As such, i would set up two queries (same procedure, if you are using procedures, but two queries) and ask the questions separately. Unless ... you need some of the information in a single grid when output.
I guess the better direction to head is to have you show how you want the output to look. Once you bake the input and the output, the middle of the oreo is easier to fill.
SELECT b.id, b.asset, b.rate
from
(
SELECT asset, max(rate) maxrate
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
) a, test b
WHERE a.asset = b.asset
AND a.maxrate = b.rate
ORDER BY b.rate DESC
You don't specify what type of database you're running on, but if you have analytical functions available you can do this:
select id, asset, max_rate
from (
select ID, asset, max(rate) over (partition by asset) max_rate,
row_number() over (partition by asset order by rate desc) row_num
from test
where owner in (1,2)
) q
where row_num = 1
I'm not sure how to add in the "having count(asset) > 1" in this way though.
This first searches for rows with the maximum rate per asset. Then it takes the highest id per asset, and selects that:
select *
from test
inner join
(
select max(id) as MaxIdWithMaxRate
from test
inner join
(
select asset
, max(rate) as MaxRate
from test
group by
asset
) filter
on filter.asset = test.asset
and filter.MaxRate = test.rate
group by
asset
) filter2
on filter.MaxIdWithMaxRate = test.id
If multiple assets share the maximum rate, this will display the one with the highest id.