How do I count occurrences with conditions in PostgreSQL?

I'm working in PostgreSQL. Suppose I have this Person table:
| id | time | name | log_type |
----------------------------------------------------
| 1 | 2022-04-25 07:49:58.0 | Brian | Rejection 1 |
| 2 | 2022-04-25 07:49:58.0 | Brian | Rejection 2 |
| 3 | 2022-04-27 13:05:51.0 | Fredd | Rejection 1 |
| 4 | 2022-05-01 02:13:44.0 | Janet | Rejection 1 |
| 5 | 2022-05-01 03:45:06.0 | Janet | Rejection 2 |
| 6 | 2022-05-01 08:01:34.0 | Peter | Approval |
| 7 | 2022-05-01 12:12:53.0 | Frank | Rejection 2 |
| 8 | 2022-05-02 01:26:38.0 | Frank | Approval |
Note: there are 2 rejection types, Rejection 1 and Rejection 2.
I would like to write a query that counts the number of rejections and the number of approvals for each name. However, if there are 2 rejections at the same time for the same name, like the first two rows in the example, they should only count as one.
Note that there can be one rejection of each type at the same time for the same name, but never two rejections of the same type at the same time for the same name.
So this is what I'm expecting it to return:
| name | approvals | rejections |
----------------------------------
| Brian | 0 | 1 |
| Fredd | 0 | 1 |
| Janet | 0 | 2 |
| Peter | 1 | 0 |
| Frank | 1 | 1 |
The closest I could get to this is the following:
SELECT
name,
COALESCE(SUM(CASE WHEN log_type = 'Approval' THEN 1 ELSE 0 END), 0) approvals,
COALESCE(SUM(CASE WHEN log_type = 'Rejection 1' OR log_type = 'Rejection 2' THEN 1 ELSE 0 END), 0) rejections
FROM
person
GROUP BY
name
The problem with this is that it counts two rejections with same time and name as 2 instead of 1.

You can use DISTINCT inside COUNT() to count the distinct times of the rows whose log_type is one of the rejection types:
SELECT name,
COUNT(CASE WHEN log_type = 'Approval' THEN 1 END) approvals,
COUNT(DISTINCT CASE WHEN log_type IN ('Rejection 1', 'Rejection 2') THEN time END) rejections
FROM person
GROUP BY name;
See the demo.
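To make the behaviour concrete, here is a minimal sketch that runs the same query against the sample rows from the question. It assumes an in-memory SQLite database as a stand-in for PostgreSQL and stores the timestamps as text; both choices are only for runnability and are not part of the original answer.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (an assumption made for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, time TEXT, name TEXT, log_type TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "2022-04-25 07:49:58", "Brian", "Rejection 1"),
        (2, "2022-04-25 07:49:58", "Brian", "Rejection 2"),
        (3, "2022-04-27 13:05:51", "Fredd", "Rejection 1"),
        (4, "2022-05-01 02:13:44", "Janet", "Rejection 1"),
        (5, "2022-05-01 03:45:06", "Janet", "Rejection 2"),
        (6, "2022-05-01 08:01:34", "Peter", "Approval"),
        (7, "2022-05-01 12:12:53", "Frank", "Rejection 2"),
        (8, "2022-05-02 01:26:38", "Frank", "Approval"),
    ],
)

# The CASE yields the timestamp for rejections and NULL otherwise;
# COUNT(DISTINCT ...) skips NULLs and collapses equal timestamps.
query = """
SELECT name,
       COUNT(CASE WHEN log_type = 'Approval' THEN 1 END) AS approvals,
       COUNT(DISTINCT CASE WHEN log_type IN ('Rejection 1', 'Rejection 2')
                           THEN time END) AS rejections
FROM person
GROUP BY name
"""
counts = {name: (approvals, rejections)
          for name, approvals, rejections in conn.execute(query)}
for name in sorted(counts):
    print(name, *counts[name])
```

Brian's two simultaneous rejections collapse into one, while Janet's two rejections at different times stay separate.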

Use ROW_NUMBER to remove duplicates, then use a simple count query to find the counts:
SELECT
name,
COUNT(*) FILTER (WHERE log_type = 'Approval') approvals,
COUNT(*) FILTER (WHERE log_type LIKE 'Rejection%') rejections
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY time, name, SUBSTRING(log_type FROM '\w+')) rn
FROM person
) t
WHERE rn = 1
GROUP BY name;
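The deduplicate-then-count idea can be tried out with the sketch below. It assumes SQLite (3.25 or later, for window functions) in place of PostgreSQL, so two things differ from the query above: the regex-based SUBSTRING is swapped for a plain `LIKE 'Rejection%'` test as the partition key, and `COUNT(*) FILTER` is written as `SUM(CASE ...)`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (an assumption made for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, time TEXT, name TEXT, log_type TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        (1, "2022-04-25 07:49:58", "Brian", "Rejection 1"),
        (2, "2022-04-25 07:49:58", "Brian", "Rejection 2"),
        (3, "2022-04-27 13:05:51", "Fredd", "Rejection 1"),
        (4, "2022-05-01 02:13:44", "Janet", "Rejection 1"),
        (5, "2022-05-01 03:45:06", "Janet", "Rejection 2"),
        (6, "2022-05-01 08:01:34", "Peter", "Approval"),
        (7, "2022-05-01 12:12:53", "Frank", "Rejection 2"),
        (8, "2022-05-02 01:26:38", "Frank", "Approval"),
    ],
)

# Rows sharing time, name and "is a rejection" land in one partition;
# keeping only rn = 1 leaves a single row per simultaneous rejection pair.
query = """
SELECT name,
       SUM(CASE WHEN log_type = 'Approval' THEN 1 ELSE 0 END) AS approvals,
       SUM(CASE WHEN log_type LIKE 'Rejection%' THEN 1 ELSE 0 END) AS rejections
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY time, name, (log_type LIKE 'Rejection%')
               ORDER BY id
           ) AS rn
    FROM person
) AS t
WHERE rn = 1
GROUP BY name
"""
dedup_counts = {name: (approvals, rejections)
                for name, approvals, rejections in conn.execute(query)}
for name in sorted(dedup_counts):
    print(name, *dedup_counts[name])
```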

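We can return the time from a CASE expression and then use COUNT(DISTINCT ...), which ignores nulls.
The first query below shows the intermediate results, and the second shows the counts with and without DISTINCT to make clear what it is doing. I have used the test LEFT(log_type,6) = 'Reject' to group the 2 rejection types.
I suggest rounding the time so that 2 rejections close together are treated as repetitions. With an unrounded timestamp, events even 1 second apart are treated as different rejections. Note that the test table below declares time as date, which rounds every value to the day; that is why Janet's two rejections on 2022-05-01 count as one here, whereas the question expects 2. Declare the column as timestamp to keep the full precision.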
create table person(
id int,
time date,
name varchar(20),
log_type varchar(20));
insert into person values
( 1,'2022-04-25 07:49:58.0','Brian','Rejection 1'),
( 2,'2022-04-25 07:49:58.0','Brian','Rejection 2'),
( 3,'2022-04-27 13:05:51.0','Fredd','Rejection 1'),
( 4,'2022-05-01 02:13:44.0','Janet','Rejection 1'),
( 5,'2022-05-01 03:45:06.0','Janet','Rejection 2'),
( 6,'2022-05-01 08:01:34.0','Peter','Approval'),
( 7,'2022-05-01 12:12:53.0','Frank','Rejection 2'),
( 8,'2022-05-02 01:26:38.0','Frank','Approval');
SELECT
name,
CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END R,
CASE WHEN log_type = 'Approval' THEN time END A
FROM person;
name | r | a
:---- | :--------- | :---------
Brian | 2022-04-25 | null
Brian | 2022-04-25 | null
Fredd | 2022-04-27 | null
Janet | 2022-05-01 | null
Janet | 2022-05-01 | null
Peter | null | 2022-05-01
Frank | 2022-05-01 | null
Frank | null | 2022-05-02
SELECT
name,
COUNT(CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END) all_rejections,
COUNT(CASE WHEN log_type = 'Approval' THEN time END) all_approvals,
COUNT(DISTINCT CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END) distinct_rejections,
COUNT(DISTINCT CASE WHEN log_type = 'Approval' THEN time END) distinct_approvals
FROM person
GROUP BY name;
name | all_rejections | all_approvals | distinct_rejections | distinct_approvals
:---- | -------------: | ------------: | ------------------: | -----------------:
Brian | 2 | 0 | 1 | 0
Frank | 1 | 1 | 1 | 1
Fredd | 1 | 0 | 1 | 0
Janet | 2 | 0 | 1 | 0
Peter | 0 | 1 | 0 | 1
db<>fiddle here
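The rounding suggestion can be illustrated with a small sketch. It assumes SQLite with text timestamps, truncates to the minute with `strftime` (in PostgreSQL, `date_trunc('minute', time)` would play the same role), and uses hypothetical rows in which Brian's two rejections are 1 second apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, time TEXT, log_type TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        # Hypothetical rows: Brian's two rejections are 1 second apart.
        ("Brian", "2022-04-25 07:49:58", "Rejection 1"),
        ("Brian", "2022-04-25 07:49:59", "Rejection 2"),
        ("Janet", "2022-05-01 02:13:44", "Rejection 1"),
        ("Janet", "2022-05-01 03:45:06", "Rejection 2"),
    ],
)

# Counting distinct raw timestamps versus timestamps truncated to the minute.
query = """
SELECT name,
       COUNT(DISTINCT time) AS exact_rejections,
       COUNT(DISTINCT strftime('%Y-%m-%d %H:%M', time)) AS per_minute_rejections
FROM person
WHERE log_type LIKE 'Rejection%'
GROUP BY name
"""
rounded = {name: (exact, per_minute)
           for name, exact, per_minute in conn.execute(query)}
for name in sorted(rounded):
    print(name, *rounded[name])
```

Truncation buckets by calendar minute, so two events at 07:49:59 and 07:50:00 would still fall into different buckets; the bucket size is a judgment call for the data at hand.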

Related

How to get a balance sheet (debit, credit, balance) from a transactions table in SQL?

If I have a transactions table like this:
+----+--------+------------+-------------+--------+
| id | userID | debitAccID | creditAccID | amount |
+----+--------+------------+-------------+--------+
| 1 | 1 | 1 | 2 | 500 |
| 2 | 1 | 1 | 3 | 600 |
| 3 | 1 | 3 | 1 | 200 |
+----+--------+------------+-------------+--------+
Which query can I use to get a table like this for the account with id 1?
+-------+--------+---------+
| debit | credit | balance |
+-------+--------+---------+
| 500 | | 500 |
| 600 | | 1100 |
| | 200 | 900 |
+-------+--------+---------+
900
Assuming the id column reflects the correct order of transactions, you can use CASE expressions and a window function. With an ORDER BY in the window and no explicit frame, the default frame is range between unbounded preceding and current row, which yields a running total:
select id, user_id,
-- debit and credit are taken relative to account 1, the account asked about
case when debit_acc_id = 1 then amount else 0 end as debit,
case when credit_acc_id = 1 then amount else 0 end as credit,
-- running balance: cumulative debits minus cumulative credits
sum(case when debit_acc_id = 1 then amount else 0 end) over w
- sum(case when credit_acc_id = 1 then amount else 0 end) over w as balance
from transactions
where debit_acc_id = 1 or credit_acc_id = 1
window w as (order by id)
order by id;
db<>fiddle here
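As a quick check of the running-balance idea, the sketch below loads the three sample transactions into SQLite (an assumption; 3.25 or later for window functions) and filters on the account id, passed as a named parameter `:acc`, rather than on the user id. The snake_case column names follow the answer's query, not the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE transactions (
           id INTEGER, user_id INTEGER,
           debit_acc_id INTEGER, credit_acc_id INTEGER, amount INTEGER)"""
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 2, 500), (2, 1, 1, 3, 600), (3, 1, 3, 1, 200)],
)

# Debits add to the balance and credits subtract from it; the window
# ORDER BY id turns the SUM into a running total.
query = """
SELECT id,
       CASE WHEN debit_acc_id  = :acc THEN amount ELSE 0 END AS debit,
       CASE WHEN credit_acc_id = :acc THEN amount ELSE 0 END AS credit,
       SUM(CASE WHEN debit_acc_id = :acc THEN amount ELSE -amount END)
           OVER (ORDER BY id) AS balance
FROM transactions
WHERE :acc IN (debit_acc_id, credit_acc_id)
ORDER BY id
"""
ledger = conn.execute(query, {"acc": 1}).fetchall()
for row in ledger:
    print(row)
```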

How to use variable lag window functions?

I have a table with the following schema:
CREATE TABLE example (
userID,
status, --'SUCCESS' or 'FAIL'
date -- self explanatory
);
INSERT INTO example
Values(123, 'SUCCESS', 20211010),
(123, 'SUCCESS', 20211011),
(123, 'SUCCESS', 20211028),
(123, 'FAIL', 20211029),
(123, 'SUCCESS', 20211105),
(123, 'SUCCESS', 20211110)
I am trying to use a lag or lead function to assess whether the current row is within a 2-week window of the previous 'SUCCESS'. Given the current data, I would expect isWithin2WeeksofSuccessFlag to be as follows:
123, 'SUCCESS', 20211010,0 --since it is the first instance
123, 'SUCCESS', 20211011,1
123, 'SUCCESS', 20211028,1
123, 'FAIL', 20211029, 1 --failed, but criteria is that it is within 2 weeks of last success, which it is
123, 'SUCCESS', 20211105, 1 --last success is 2 rows back, but it is within 2 weeks
123, 'SUCCESS', 20211128, 0 --outside of 2 weeks
I would initially think to do something like this:
Select userID, status, date,
case when lag(status,1) over (partition by userid order by date asc) = 'SUCCESS'
and date_add('day',-14, date) <= lag(date,1) over (partition by userid order by date asc)
then 1 end as isWithin2WeeksofSuccessFlag
from example
This would work if I didn't have the 'FAIL' line in there. To handle it, I could modify the lag to 2 (instead of 1), but what about if I have 2,3,4,n 'FAIL's in a row? I would need to lag by 3,4,5,n+1. The specific number of FAILs in between is variable. How do I handle this variability?
NOTE: I am querying billions of rows. Efficiency isn't really a concern (since this is for analysis), but running into memory allocation errors is. Endlessly adding more window functions would likely cause the query to be terminated automatically because its memory requirement exceeds the node limit.
How should I handle this?
Here's an approach, also using window functions, with each "common table expression" handling one step at a time.
Note: The expected result in the question does not match the data in the question. '20211128' doesn't exist in the actual data. I used the example INSERT statement.
In the test case, I changed the column name to xdate to avoid any potential SQL reserved word issues.
The SQL:
WITH cte1 AS (
SELECT *
, SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) OVER (PARTITION BY userID ORDER BY xdate) AS grp
FROM example
)
, cte2 AS (
SELECT *
, MAX(CASE WHEN status = 'SUCCESS' THEN xdate END) OVER (PARTITION BY userID, grp) AS lastdate
FROM cte1
)
, cte3 AS (
SELECT *
, CASE WHEN LAG(lastdate) OVER (PARTITION BY userID ORDER BY xdate) > (xdate - INTERVAL '2' WEEK) THEN 1 ELSE 0 END AS isNear
FROM cte2
)
SELECT * FROM cte3
ORDER BY userID, xdate
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-10 | 5 | 2021-11-10 | 1 |
+--------+---------+------------+------+------------+--------+
With the data adjusted to match your expected result, plus a new user introduced, the result is this:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
| 323 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 323 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 323 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 323 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 323 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 323 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
+--------+---------+------------+------+------------+--------+
Here's an extra test case, which might expose problems in some solutions:
INSERT INTO example VALUES
(123, 'SUCCESS', '2021-10-11')
, (123, 'FAIL' , '2021-10-12')
, (123, 'FAIL' , '2021-10-13')
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-11 | 1 | 2021-10-11 | 0 |
| 123 | FAIL | 2021-10-12 | 1 | 2021-10-11 | 1 |
| 123 | FAIL | 2021-10-13 | 1 | 2021-10-11 | 1 |
+--------+---------+------------+------+------------+--------+
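The three-step CTE approach above can be reproduced with the following sketch. It assumes SQLite (3.25 or later) with ISO-formatted text dates, so the two-week offset is written as `date(xdate, '-14 days')` instead of `INTERVAL '2' WEEK`; the data is the question's INSERT statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (userID INTEGER, status TEXT, xdate TEXT)")
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?)",
    [
        (123, "SUCCESS", "2021-10-10"),
        (123, "SUCCESS", "2021-10-11"),
        (123, "SUCCESS", "2021-10-28"),
        (123, "FAIL", "2021-10-29"),
        (123, "SUCCESS", "2021-11-05"),
        (123, "SUCCESS", "2021-11-10"),
    ],
)

# cte1: a running count of SUCCESS rows assigns each row to a group.
# cte2: every row in a group carries that group's SUCCESS date.
# cte3: compare the previous row's last SUCCESS date with xdate - 14 days.
query = """
WITH cte1 AS (
    SELECT *, SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END)
                  OVER (PARTITION BY userID ORDER BY xdate) AS grp
    FROM example
), cte2 AS (
    SELECT *, MAX(CASE WHEN status = 'SUCCESS' THEN xdate END)
                  OVER (PARTITION BY userID, grp) AS lastdate
    FROM cte1
), cte3 AS (
    SELECT *, CASE WHEN LAG(lastdate) OVER (PARTITION BY userID ORDER BY xdate)
                        > date(xdate, '-14 days')
                   THEN 1 ELSE 0 END AS isNear
    FROM cte2
)
SELECT xdate, status, isNear FROM cte3 ORDER BY userID, xdate
"""
flags = conn.execute(query).fetchall()
for row in flags:
    print(row)
```

Only three window functions are needed no matter how many FAIL rows sit between two SUCCESS rows, which is the point of grouping first.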
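If your DBMS doesn't support window function filters, you can order by status desc so that 'SUCCESS' rows sort before 'FAIL' rows. Each 'SUCCESS' row is then compared with the previous 'SUCCESS' row. Note that a 'FAIL' row is compared with whichever row precedes it in that ordering (the latest 'SUCCESS' overall, or the previous 'FAIL'), not necessarily with the most recent 'SUCCESS' before its own date: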
select userID, status, date,
case when lag(status,1) over (partition by userid order by status desc , date asc) = 'SUCCESS'
and dateadd(d, -14, date) <= lag(date,1) over (partition by userid order by status desc , date asc)
then 1 end as isWithin2WeeksofSuccessFlag
from example
order by date
Sql Server fiddle

SQL Server : get Count() of a related table column where some condition

Given tables CollegeMajors
| Id | Major |
|----|-------------|
| 1 | Accounting |
| 2 | Math |
| 3 | Engineering |
and EnrolledStudents
| Id | CollegeMajorId | Name | HasGraduated |
|----|----------------|-----------------|--------------|
| 1 | 1 | Grace Smith | 1 |
| 2 | 1 | Tony Fabio | 0 |
| 3 | 1 | Michael Ross | 1 |
| 4 | 3 | Fletcher Thomas | 1 |
| 5 | 2 | Dwayne Johnson | 0 |
I want to do a query like
Select
CollegeMajors.Major,
Count(select number of students who have graduated) AS TotalGraduated,
Count(select number of students who have not graduated) AS TotalNotGraduated
From
CollegeMajors
Inner Join
EnrolledStudents On EnrolledStudents.CollegeMajorId = CollegeMajors.Id
and I'm expecting this kind of result:
| Major | TotalGraduated | TotalNotGraduated |
|-------------|----------------|-------------------|
| Accounting | 2 | 1 |
| Math | 0 | 1 |
| Engineering | 1 | 0 |
So the question is, what kind of query goes inside the COUNT to achieve the above?
You can use a CASE expression inside COUNT to achieve the desired result. COUNT ignores NULLs, so only the rows matching each condition are counted. Please try the updated query below.
Select CollegeMajors.Major,
COUNT(CASE WHEN EnrolledStudents.HasGraduated = 0 THEN 1 ELSE NULL END) AS "TotalNotGraduated",
COUNT(CASE WHEN EnrolledStudents.HasGraduated = 1 THEN 1 ELSE NULL END) AS "TotalGraduated"
From CollegeMajors
Inner Join EnrolledStudents On EnrolledStudents.CollegeMajorId = CollegeMajors.Id
GROUP BY CollegeMajors.Major
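A runnable sketch of the COUNT(CASE ...) approach on the sample tables follows. It assumes SQLite in place of SQL Server; this particular query uses no vendor-specific syntax, so it should run the same way on both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CollegeMajors (Id INTEGER, Major TEXT)")
conn.execute(
    "CREATE TABLE EnrolledStudents "
    "(Id INTEGER, CollegeMajorId INTEGER, Name TEXT, HasGraduated INTEGER)"
)
conn.executemany(
    "INSERT INTO CollegeMajors VALUES (?, ?)",
    [(1, "Accounting"), (2, "Math"), (3, "Engineering")],
)
conn.executemany(
    "INSERT INTO EnrolledStudents VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Grace Smith", 1),
        (2, 1, "Tony Fabio", 0),
        (3, 1, "Michael Ross", 1),
        (4, 3, "Fletcher Thomas", 1),
        (5, 2, "Dwayne Johnson", 0),
    ],
)

# Each CASE returns 1 for matching rows and NULL otherwise; COUNT skips NULLs.
query = """
SELECT m.Major,
       COUNT(CASE WHEN s.HasGraduated = 1 THEN 1 END) AS TotalGraduated,
       COUNT(CASE WHEN s.HasGraduated = 0 THEN 1 END) AS TotalNotGraduated
FROM CollegeMajors AS m
INNER JOIN EnrolledStudents AS s ON s.CollegeMajorId = m.Id
GROUP BY m.Major
"""
totals = {major: (graduated, not_graduated)
          for major, graduated, not_graduated in conn.execute(query)}
for major in sorted(totals):
    print(major, *totals[major])
```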
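You can try this for the graduated count per major. The filter on HasGraduated belongs in a WHERE clause, because HAVING can only reference grouped columns or aggregates:
Select CollegeMajorId, Count(*) From EnrolledStudents Where HasGraduated = 1 Group By CollegeMajorId
And change 1 to 0 for the students who have not graduated:
Select CollegeMajorId, Count(*) From EnrolledStudents Where HasGraduated = 0 Group By CollegeMajorId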

SQL ratio between rows

I have a SQL table with the following format:
+------------------------------------+
| function_id | event_type | counter |
+-------------+------------+---------+
| 1 | fail | 1000 |
| 1 | started | 5000 |
| 2 | fail | 800 |
| 2 | started | 4500 |
| ... | ... | ... |
+-------------+------------+---------+
I want to run a query over this that will group the results by function_id, by giving a ratio of the number of 'fail' events vs the number of 'started' events, as well as maintaining the number of failures. I.e. I want to run a query that will give something that looks like the following:
+-------------------------------------+
| function_id | fail_ratio | failures |
+-------------+------------+----------+
| 1 | 20% | 1000 |
| 2 | 17.78% | 800 |
| ... | ... | |
+-------------+------------+----------+
I've tried a few approaches but have been unsuccessful so far. I'm using Apache Drill SQL at the moment, as this data is being pulled from flat files.
Any help would be greatly appreciated! :)
This is all conditional aggregation:
select function_id,
sum(case when event_type = 'fail' then counter*1.0 end) / sum(case when event_type = 'started' then counter end) as fail_start_ratio,
sum(case when event_type = 'fail' then counter end) as failures
from t
group by function_id
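The conditional aggregation above can be checked with a short sketch. It assumes SQLite rather than Apache Drill and a table named `t` as in the answer; the conversion of the ratio to a percentage is done in Python and is an addition for display only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (function_id INTEGER, event_type TEXT, counter INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "fail", 1000), (1, "started", 5000), (2, "fail", 800), (2, "started", 4500)],
)

# Multiplying by 1.0 forces floating-point division instead of integer division.
query = """
SELECT function_id,
       SUM(CASE WHEN event_type = 'fail' THEN counter * 1.0 END)
         / SUM(CASE WHEN event_type = 'started' THEN counter END) AS fail_start_ratio,
       SUM(CASE WHEN event_type = 'fail' THEN counter END) AS failures
FROM t
GROUP BY function_id
"""
report = {fid: (round(100 * ratio, 2), failures)
          for fid, ratio, failures in conn.execute(query)}
for fid in sorted(report):
    percent, failures = report[fid]
    print(f"{fid}: {percent}% ({failures} failures)")
```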

Horizontal Count SQL

I apologize if this is a duplicate question but I could not find my answer.
I am trying to take data that is horizontal, and get a count of how many times a specific number appears.
Example table
+-------+-------+-------+-------+
| Empid | KPI_A | KPI_B | KPI_C |
+-------+-------+-------+-------+
| 232 | 1 | 3 | 3 |
| 112 | 2 | 3 | 2 |
| 143 | 3 | 1 | 1 |
+-------+-------+-------+-------+
I need to see the following:
+-------+--------------+--------------+--------------+
| EmpID | (1's Scored) | (2's Scored) | (3's Scored) |
+-------+--------------+--------------+--------------+
| 232 | 1 | 0 | 2 |
| 112 | 0 | 2 | 1 |
| 143 | 2 | 0 | 1 |
+-------+--------------+--------------+--------------+
I hope that makes sense. Any help would be appreciated.
Since you are counting data across multiple columns, it might be easier to unpivot your KPI columns first, then count the scores.
You could use either the UNPIVOT function or CROSS APPLY to convert your KPI columns into multiple rows. The syntax would be similar to:
select EmpId, KPI, Val
from yourtable
cross apply
(
select 'A', KPI_A union all
select 'B', KPI_B union all
select 'C', KPI_C
) c (KPI, Val)
See SQL Fiddle with Demo. This gets your multiple columns into multiple rows, which is then easier to work with:
| EMPID | KPI | VAL |
|-------|-----|-----|
| 232 | A | 1 |
| 232 | B | 3 |
| 232 | C | 3 |
| 112 | A | 2 |
Now you can easily count the number of 1's, 2's, and 3's that you have using an aggregate function with a CASE expression:
select EmpId,
sum(case when val = 1 then 1 else 0 end) Score_1,
sum(case when val = 2 then 1 else 0 end) Score_2,
sum(case when val = 3 then 1 else 0 end) Score_3
from
(
select EmpId, KPI, Val
from yourtable
cross apply
(
select 'A', KPI_A union all
select 'B', KPI_B union all
select 'C', KPI_C
) c (KPI, Val)
) d
group by EmpId;
See SQL Fiddle with Demo. This gives a final result of:
| EMPID | SCORE_1 | SCORE_2 | SCORE_3 |
|-------|---------|---------|---------|
| 112 | 0 | 2 | 1 |
| 143 | 2 | 0 | 1 |
| 232 | 1 | 0 | 2 |
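For engines without CROSS APPLY or UNPIVOT, the same unpivot-then-count idea can be sketched with UNION ALL. The example below assumes SQLite and the table name `yourtable` from the answer; it is a swapped-in technique for the unpivot step, not the CROSS APPLY syntax shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (EmpId INTEGER, KPI_A INTEGER, KPI_B INTEGER, KPI_C INTEGER)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?)",
    [(232, 1, 3, 3), (112, 2, 3, 2), (143, 3, 1, 1)],
)

# The inner UNION ALL turns each KPI column into its own row (the unpivot);
# the outer query counts how often each score occurs per employee.
query = """
SELECT EmpId,
       SUM(CASE WHEN Val = 1 THEN 1 ELSE 0 END) AS Score_1,
       SUM(CASE WHEN Val = 2 THEN 1 ELSE 0 END) AS Score_2,
       SUM(CASE WHEN Val = 3 THEN 1 ELSE 0 END) AS Score_3
FROM (
    SELECT EmpId, 'A' AS KPI, KPI_A AS Val FROM yourtable
    UNION ALL
    SELECT EmpId, 'B', KPI_B FROM yourtable
    UNION ALL
    SELECT EmpId, 'C', KPI_C FROM yourtable
) AS d
GROUP BY EmpId
"""
scores = {emp: (s1, s2, s3) for emp, s1, s2, s3 in conn.execute(query)}
for emp in sorted(scores):
    print(emp, *scores[emp])
```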