SQL - How to compute a cumulative sum when group-by values are missing - sql

As a follow-up to my previous question, whose example I now believe was too simple, I prepared an example of a scenario where I want to aggregate the column cus_sum, grouped by the date_col column and by the cus column, which holds the unique customer number.
I wish to generate a series of dates (for instance with generate_series) from the 1st of January 2018 to the 10th of January 2018 and then compute a cumulative sum of cus_sum for each customer. As in the case below, there are days with no records at all and days where only some customers have records. Regardless, I want to show every customer's cumulative sum on every day of that period.
CREATE TABLE test2 (date_col date, cus int, cus_sum int);
insert into test2 values ('2018-01-01', 1, 5);
insert into test2 values ('2018-01-02', 1, 12);
insert into test2 values ('2018-01-02', 2, 14);
insert into test2 values ('2018-01-03', 2, 8);
insert into test2 values ('2018-01-03', 2, 10);
insert into test2 values ('2018-01-04', 1, 22);
insert into test2 values ('2018-01-06', 2, 20);
insert into test2 values ('2018-01-06', 1, 5);
insert into test2 values ('2018-01-07', 1, 45);
insert into test2 values ('2018-01-08', 2, 32);
The output should look like:
date_col cus cum_sum
"2018-01-01" 1 5
"2018-01-01" 2 0
"2018-01-02" 1 17
"2018-01-02" 2 14
"2018-01-03" 1 17
"2018-01-03" 2 32
"2018-01-04" 1 39
"2018-01-04" 2 32
"2018-01-05" 1 39
"2018-01-05" 2 32
"2018-01-06" 1 89
"2018-01-06" 2 52
"2018-01-07" 1 134
"2018-01-07" 2 52
"2018-01-08" 1 134
"2018-01-08" 1 84
Perhaps I should add that I assume one table will be a virtual table that generates a list of dates in a given timeframe. The second table holds the customers [1,3,4,5..10] and their product purchases (product volume), which is what I wish to cumulatively sum for every customer and every day of the series.

Assuming that you have a separate table for customers (with a cus column), you can use a recursive CTE to generate the date range, cross join it with the customer table to get every combination of customer and date, and then take the sum from the test2 table. The query will look like this:
WITH DateRange AS (
SELECT
[MyDate] = CONVERT(DATETIME, '20180101')
UNION ALL
SELECT
[MyDate] = DATEADD(DAY, 1, [MyDate])
FROM
DateRange
WHERE
[MyDate] < '20180110'
) SELECT
d.[MyDate],
c.cus,
(
-- sum every row up to and including this date, so the total is cumulative
select isnull(sum(t.cus_sum), 0)
from test2 t
where t.date_col <= d.MyDate
and t.cus = c.cus
) as cus_sum
FROM
DateRange d
cross join customer c
order by d.MyDate, c.cus

The cross join of generate_series() with the distinct customer list creates a virtual table of all possible (date, customer) combinations; the left join then attaches the actual rows:
select distinct
date_col::date,
cus,
coalesce(sum(cus_sum) over (partition by cus order by date_col), 0) as cum_sum
from generate_series('2018-01-01'::date, '2018-01-08', '1d') as date_col
cross join (select distinct cus from test2) c
left join test2 using (date_col, cus)
order by date_col, cus
date_col | cus | cum_sum
------------+-----+---------
2018-01-01 | 1 | 5
2018-01-01 | 2 | 0
2018-01-02 | 1 | 17
2018-01-02 | 2 | 14
2018-01-03 | 1 | 17
2018-01-03 | 2 | 32
2018-01-04 | 1 | 39
2018-01-04 | 2 | 32
2018-01-05 | 1 | 39
2018-01-05 | 2 | 32
2018-01-06 | 1 | 44
2018-01-06 | 2 | 52
2018-01-07 | 1 | 89
2018-01-07 | 2 | 52
2018-01-08 | 1 | 89
2018-01-08 | 2 | 84
(16 rows)
It looks like there are mistakes in the OP's expected results.
DbFiddle.
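For readers without a Postgres instance at hand, here is a minimal sketch of the same gap-filling idea, run through Python's built-in sqlite3 module. It is not the answer's exact query: SQLite has no generate_series() by default, so a recursive CTE stands in for it, and the daily CTE pre-aggregates duplicate (date, customer) rows instead of relying on SELECT DISTINCT. The function name cumulative_sums is made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows copied from the question: (date_col, cus, cus_sum).
ROWS = [
    ("2018-01-01", 1, 5), ("2018-01-02", 1, 12), ("2018-01-02", 2, 14),
    ("2018-01-03", 2, 8), ("2018-01-03", 2, 10), ("2018-01-04", 1, 22),
    ("2018-01-06", 2, 20), ("2018-01-06", 1, 5), ("2018-01-07", 1, 45),
    ("2018-01-08", 2, 32),
]

QUERY = """
WITH RECURSIVE dates(date_col) AS (      -- stand-in for generate_series()
    SELECT '2018-01-01'
    UNION ALL
    SELECT date(date_col, '+1 day') FROM dates WHERE date_col < '2018-01-08'
),
daily AS (                               -- one row per (day, customer)
    SELECT date_col, cus, SUM(cus_sum) AS day_sum
    FROM test2
    GROUP BY date_col, cus
)
SELECT d.date_col,
       c.cus,
       SUM(COALESCE(x.day_sum, 0))
           OVER (PARTITION BY c.cus ORDER BY d.date_col) AS cum_sum
FROM dates AS d
CROSS JOIN (SELECT DISTINCT cus FROM test2) AS c
LEFT JOIN daily AS x ON x.date_col = d.date_col AND x.cus = c.cus
ORDER BY d.date_col, c.cus
"""


def cumulative_sums():
    """Return (date_col, cus, cum_sum) for every day and every customer."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE test2 (date_col TEXT, cus INTEGER, cus_sum INTEGER)")
    con.executemany("INSERT INTO test2 VALUES (?, ?, ?)", ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in cumulative_sums():
        print(row)
```

Coalescing inside the SUM means a customer with no rows yet starts at 0 rather than NULL, which is the same effect the answer gets by wrapping the window sum in coalesce().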

Related

Calculate cumulative percentages by date in SQL

How might I calculate cumulative percentages in SQL (Postgres/Vertica)?
For instance, the question is "As of each date, of all patients who had been diagnosed by that date, what percent had been treated by that date?"
This table shows dates of diagnosis and treatment, with binary values that might be summed:
ID | diagnosed | date_diag | treated | date_treat
---|-----------|-----------|---------|-----------
1 | 1 | Jan 1 | 0 | null
2 | 1 | Jan 15 | 1 | Feb 20
3 | 1 | Jan 29 | 1 | Feb 1
4 | 1 | Feb 08 | 1 | Mar 4
5 | 1 | Feb 12 | 0 | null
6 | 1 | Feb 18 | 1 | Feb 24
7 | 1 | Mar 15 | 1 | May 5
8 | 1 | Apr 14 | 1 | Apr 20
I'd like to get a table of cumulative treated-vs-diagnosed ratio that might look like this.
date | ytd_diag | ytd_treat | ytd_percent
-------|----------|-----------|------------
Jan 01 | 1 | 0 | 0.00
Jan 15 | 2 | 0 | 0.00
Jan 29 | 3 | 0 | 0.00
Feb 08 | 4 | 1 | 0.25
Feb 12 | 5 | 1 | 0.20
Feb 18 | 6 | 1 | 0.17
Mar 15 | 7 | 4 | 0.57
Apr 14 | 8 | 4 | 0.50
I can calculate cumulative counts of diagnosed or treated (e.g. below), using window functions but I can't figure out a SQL query to get the number of people who'd already been treated as of each diagnosis date.
SELECT
date_diag ,
SUM(COUNT(*)) OVER ( ORDER BY date_diag ) as freq
FROM patients
WHERE diagnosed = 1
GROUP BY date_diag
ORDER BY date_diag;
You can use conditional aggregation with SUM() window function:
WITH cte AS (
SELECT kind,
date,
SUM((kind = 1)::int) OVER (ORDER BY date) ytd_diag,
SUM((kind = 2)::int) OVER (ORDER BY date) ytd_treat
FROM (
SELECT 1 kind, date_diag date, diagnosed status FROM patients
UNION ALL
SELECT 2, date_treat, treated FROM patients WHERE date_treat IS NOT NULL
) t
)
SELECT date, ytd_diag, ytd_treat,
ROUND(1.0 * ytd_treat / ytd_diag, 2) ytd_percent
FROM cte
WHERE kind = 1;
See the demo.
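As a rough check of the approach above, the following sketch stacks diagnosis and treatment dates into one event list and runs the two running sums with Python's built-in sqlite3 module. The year 2021 is an assumption (the question gives only month and day), the names events, running and ytd_rows are made up for illustration, and SQLite's SUM(kind = 1) replaces the Postgres cast (kind = 1)::int.

```python
import sqlite3

# (id, date_diag, date_treat) from the question; the year 2021 is assumed.
PATIENTS = [
    (1, "2021-01-01", None), (2, "2021-01-15", "2021-02-20"),
    (3, "2021-01-29", "2021-02-01"), (4, "2021-02-08", "2021-03-04"),
    (5, "2021-02-12", None), (6, "2021-02-18", "2021-02-24"),
    (7, "2021-03-15", "2021-05-05"), (8, "2021-04-14", "2021-04-20"),
]

QUERY = """
WITH events AS (                 -- kind 1 = diagnosis, kind 2 = treatment
    SELECT 1 AS kind, date_diag AS event_date FROM patients
    UNION ALL
    SELECT 2, date_treat FROM patients WHERE date_treat IS NOT NULL
),
running AS (                     -- running counts over the merged timeline
    SELECT kind, event_date,
           SUM(kind = 1) OVER (ORDER BY event_date) AS ytd_diag,
           SUM(kind = 2) OVER (ORDER BY event_date) AS ytd_treat
    FROM events
)
SELECT event_date, ytd_diag, ytd_treat,
       ROUND(1.0 * ytd_treat / ytd_diag, 2) AS ytd_percent
FROM running
WHERE kind = 1                   -- report only on diagnosis dates
ORDER BY event_date
"""


def ytd_rows():
    """Return (date, ytd_diag, ytd_treat, ytd_percent) per diagnosis date."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE patients (id INTEGER, date_diag TEXT, date_treat TEXT)")
    con.executemany("INSERT INTO patients VALUES (?, ?, ?)", PATIENTS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in ytd_rows():
        print(row)
```

The key design choice is that treatments are counted on the treatment timeline, not the diagnosis timeline, which is why both kinds of date have to go through the same ORDER BY before the diagnosis rows are filtered back out.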
You can solve this with window functions. The first thing you want to do is to derive a table from your patients table that has a running tally of both the diagnosed and treated columns. The rows should be tallied in ascending order of the diagnosis date.
Here's how you do that. First I'll create a sample patients table and data (I'll only include the columns necessary):
create temporary table patients (
date_diag date,
diagnosed int default 0,
treated int default 0
);
insert into patients (date_diag, diagnosed, treated) values
('2021-01-01', 1, 0),
('2021-01-11', 1, 1),
('2021-01-16', 1, 0),
('2021-01-30', 1, 1),
('2021-02-04', 1, 1),
('2021-01-14', 1, 1);
Then here's how to create the derived table of all the tallied results.
select
date_diag,
diagnosed,
treated,
sum(treated) over(order by date_diag ASC ) as treated_cmtv,
count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
from patients
/*
date_diag | diagnosed | treated | treated_cmtv | diagnosed_cmtv
------------+-----------+---------+--------------+----------------
2021-01-01 | 1 | 0 | 0 | 1
2021-01-11 | 1 | 1 | 1 | 2
2021-01-14 | 1 | 1 | 2 | 3
2021-01-16 | 1 | 0 | 2 | 4
2021-01-30 | 1 | 1 | 3 | 5
2021-02-04 | 1 | 1 | 4 | 6
*/
Now that you have this table, you can easily calculate the percentage by defining this derived table as a subquery and then selecting the necessary columns for the calculation. Like so:
select
p.date_diag,
p.diagnosed,
p.diagnosed_cmtv,
p.treated_cmtv,
p.treated,
TRUNC(p.treated_cmtv::numeric / p.diagnosed_cmtv * 1.0, 2) as percent
from (
-- same table as above
select
date_diag,
diagnosed,
treated,
sum(treated) over(order by date_diag ASC ) as treated_cmtv,
count(diagnosed) over(order by date_diag ASC) as diagnosed_cmtv
from patients
) as p;
/*
date_diag | diagnosed | diagnosed_cmtv | treated_cmtv | treated | percent
------------+-----------+----------------+--------------+---------+---------
2021-01-01 | 1 | 1 | 0 | 0 | 0.00
2021-01-11 | 1 | 2 | 1 | 1 | 0.50
2021-01-14 | 1 | 3 | 2 | 1 | 0.66
2021-01-16 | 1 | 4 | 2 | 0 | 0.50
2021-01-30 | 1 | 5 | 3 | 1 | 0.60
2021-02-04 | 1 | 6 | 4 | 1 | 0.66
*/
I think that gives you what you are asking for.
An alternative approach to the other answers is to use a correlated subquery in the select:
SELECT
p.date_diag,
(SELECT COUNT(*)
FROM patients p2
WHERE p2.date_treat <= p.date_diag) ytd_treated
FROM
patients p
WHERE diagnosed = 1
GROUP BY p.date_diag
ORDER BY p.date_diag
This will give you that column of 0,0,0,1,1,1,4,4. You can divide it by the cumulative diagnosed count to get your percentage:
SELECT
(select ...) / SUM(COUNT(*)) OVER(...)
Note that you might need some more clauses in your inner where, such as requiring a treated date greater than or equal to Jan 1st of the year of the diag date, if you're running it against a dataset with more than one year's data.
Also bear in mind that treated, as an integer, will (should) nearly always be less than diagnosed, so an integer divide will give you zero. Cast one of the operands to float, or if you want your percentage out of a hundred, multiply by 100.0.

Selecting only top parent table row with all of its children table rows

So I have two tables:
#ProjectHealthReports
Id | From | SubmittedOn
1 | 2020-01-01 |
2 | 2020-02-01 | 2020-10-23
3 | 2020-03-01 |
4 | 2020-04-01 | 2020-10-23
5 | 2020-05-01 | 2020-10-23
#ProjectHealthReportItems
Id | Note | ProjectHealthReportId
1 | First for 2020-01-01 | 1
2 | Second for 2020-01-01 | 1
3 | First for 2020-02-01 | 2
4 | Second for 2020-02-01 | 2
5 | First for 2020-03-01 | 3
6 | Second for 2020-03-01 | 3
7 | First for 2020-04-01 | 4
8 | Second for 2020-04-01 | 4
9 | (We want this one) First for 2020-05-01 | 5
10 | (We want this one) Second for 2020-05-01 | 5
How can I get all the #ProjectHealthReportItems and #ProjectHealthReports details for the latest From date that has a value for SubmittedOn (in this case that would be ProjectHealthReport 5 and ProjectHealthReportItems 9 and 10)?
Basically, I need something like the query below, but obviously without top 1, as that returns only one row and in this case I need two rows :)
select top 1 phr.Id, phr.[From], phr.SubmittedOn, phri.Note from #ProjectHealthReports phr
inner join #ProjectHealthReportItems phri on phr.Id = phri.ProjectHealthReportId
where phr.SubmittedOn is not null
order by phr.[From] desc
Here is the SQL for creating and seeding the tables
create table #ProjectHealthReports(
Id int primary key,
[From] date not null ,
SubmittedOn date null
)
go
create table #ProjectHealthReportItems(
Id int primary key,
Note nvarchar(max),
ProjectHealthReportId int constraint FK_PHR references #ProjectHealthReports
)
go
insert into #ProjectHealthReports(Id, [From], SubmittedOn)
values (1, '2020-01-01', null),
(2, '2020-02-01', getutcdate()),
(3, '2020-03-01', null),
(4, '2020-04-01', getutcdate()),
(5, '2020-05-01', getutcdate())
go
insert into #ProjectHealthReportItems(Id, Note, ProjectHealthReportId)
values (1, 'First for 2020-01-01', 1),
(2, 'Second for 2020-01-01', 1),
(3, 'First for 2020-02-01', 2),
(4, 'Second for 2020-02-01', 2),
(5, 'First for 2020-03-01', 3),
(6, 'Second for 2020-03-01', 3),
(7, 'First for 2020-04-01', 4),
(8, 'Second for 2020-04-01', 4),
(9, '(We want this one) First for 2020-05-01', 5),
(10, '(We want this one) Second for 2020-05-01', 5)
go
First select the top report, then join:
select t.*, phri.Note
from (select top(1) phr.Id phrid, phr.[From], phr.SubmittedOn
from #ProjectHealthReports phr
where phr.SubmittedOn is not null
order by phr.[From] desc) t
inner join #ProjectHealthReportItems phri on t.phrId = phri.ProjectHealthReportId
I would suggest window functions:
select phr.*, phri.*
from #ProjectHealthReports phr left join
(select phri.*,
row_number() over (partition by ProjectHealthReportId order by id desc) as seqnum
from #ProjectHealthReportItems phri
) phri
on phr.Id = phri.ProjectHealthReportId and seqnum = 1
order by phr.[From] desc;
You can also do this by filtering in the where clause, for example with a correlated subquery:
select phr.*, phri.*
from #ProjectHealthReports phr join
#ProjectHealthReportItems phri
on phr.Id = phri.ProjectHealthReportId
where phri.id = (select max(phri2.id)
from #ProjectHealthReportItems phri2
where phri2.ProjectHealthReportId = phri.ProjectHealthReportId
)
order by phr.[From] desc
An efficient way to do this without a LEFT JOIN would be to assign a row number to the submitted rows of the #ProjectHealthReports table using the ROW_NUMBER() window function. Something like this:
with lv_cte as (
select *, row_number() over (order by [From] desc) rn
from #ProjectHealthReports
where SubmittedOn is not null)
select l.*, phri.*
from lv_cte l
join #ProjectHealthReportItems phri on l.id=phri.ProjectHealthReportId
where l.rn=1;
Output
Id From SubmittedOn rn Id Note ProjectHealthReportId
5 2020-05-01 2020-10-23 1 9 (We want this one) First for 2020-05-01 5
5 2020-05-01 2020-10-23 1 10 (We want this one) Second for 2020-05-01 5
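The "pick the parent first, then join the children" idea can be tried outside SQL Server with the sketch below, which uses Python's built-in sqlite3 module. It is an adaptation, not the T-SQL from the answers: SQLite has no TOP, so LIMIT 1 is swapped in, the temp tables become ordinary tables with the made-up names reports and items, and getutcdate() is replaced by the fixed date shown in the question.

```python
import sqlite3

SETUP = """
CREATE TABLE reports (Id INTEGER PRIMARY KEY, "From" TEXT NOT NULL, SubmittedOn TEXT);
CREATE TABLE items (Id INTEGER PRIMARY KEY, Note TEXT,
                    ReportId INTEGER REFERENCES reports(Id));
INSERT INTO reports VALUES
    (1, '2020-01-01', NULL), (2, '2020-02-01', '2020-10-23'),
    (3, '2020-03-01', NULL), (4, '2020-04-01', '2020-10-23'),
    (5, '2020-05-01', '2020-10-23');
INSERT INTO items VALUES
    (1, 'First for 2020-01-01', 1), (2, 'Second for 2020-01-01', 1),
    (3, 'First for 2020-02-01', 2), (4, 'Second for 2020-02-01', 2),
    (5, 'First for 2020-03-01', 3), (6, 'Second for 2020-03-01', 3),
    (7, 'First for 2020-04-01', 4), (8, 'Second for 2020-04-01', 4),
    (9, '(We want this one) First for 2020-05-01', 5),
    (10, '(We want this one) Second for 2020-05-01', 5);
"""

QUERY = """
SELECT t.Id, t."From", t.SubmittedOn, i.Id, i.Note
FROM (SELECT Id, "From", SubmittedOn       -- latest submitted report only
      FROM reports
      WHERE SubmittedOn IS NOT NULL
      ORDER BY "From" DESC
      LIMIT 1) AS t
JOIN items AS i ON i.ReportId = t.Id       -- then fan out to all its items
ORDER BY i.Id
"""


def latest_submitted_report_items():
    """Return one row per item of the most recent submitted report."""
    con = sqlite3.connect(":memory:")
    con.executescript(SETUP)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in latest_submitted_report_items():
        print(row)
```

Limiting the parent inside the derived table, before the join, is what lets the result contain several child rows while still being restricted to a single report.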

case when statement in oracle across tables

Hi, apologies for the formatting, but I'm stumped and frustrated and I just need some help.
I've got two tables. I have made a good-faith attempt to follow community standards, but in case it doesn't work: Table A has 3 columns, 'ID' to identify a sales rep, 'Start' to indicate the company term in which they started, and 'Sales' to indicate their sales in that first term. Table B is just an expansion of Table A that lists every term (I marked them as quarters) a salesperson was there, with their sales.
Table A
+----+---------+-------+
| ID | Quarter | Sales |
+----+---------+-------+
| 1 | 141 | 30 |
| 2 | 151 | 50 |
| 3 | 151 | 80 |
+----+---------+-------+
Table B
+----+---------+-------+
| ID | Quarter | Sales |
+----+---------+-------+
| 1 | 141 | 30 |
| 1 | 142 | 25 |
| 1 | 143 | 45 |
| 2 | 151 | 50 |
| 2 | 152 | 60 |
| 2 | 153 | 75 |
| 3 | 151 | 80 |
| 3 | 152 | 50 |
| 3 | 153 | 70 |
+----+---------+-------+
My desired output is a table with ID, start term, sales from that term, second term, sales from that term, etc. for the first 6 terms an employee is there
my code is this
select a.id, start, a.sales,
case when a.start+1 = b.quarter then sales end as secondquartersales,
case when a.start+2 = b.quarter then sales end as thridquartersales,.....
from tablea a
left join tableb b
on a.id = b.id;
it gives nulls for all case when statements. please help
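Maybe try GROUP BY. The join returns one row per quarter in table B, so each CASE is non-null on at most one of those rows. Wrapping each CASE in MAX() and grouping by the rep collapses them into one row per rep: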
create table a ( id number, strt number, sales number);
create table b (id number, quarter number , sales number);
insert into a values (1,141,30);
insert into a values (2,151,50);
insert into a values (3,151,80);
insert into b values ( 1,141,30);
insert into b values ( 1,142,25);
insert into b values ( 1,143,45);
insert into b values ( 2,151,50);
insert into b values ( 2,152,60);
insert into b values ( 2,153,75);
insert into b values ( 3,151,80);
insert into b values ( 3,152,50);
insert into b values ( 3,153,70);
select a.id, a.strt, a.sales,
max(case when a.strt+1 = b.quarter then b.sales end ) as secondquartersales,
max(case when a.strt+2 = b.quarter then b.sales end ) as thridquartersales
from a, b
where a.id = b.id
group by a.id, a.strt, a.sales;
Or PIVOT:
select * from (
select a.id,
case when a.strt+1 = b.quarter then 'Q2'
when a.strt+2 = b.quarter then 'Q3'
when a.strt+3 = b.quarter then 'Q4'
when a.strt = b.quarter then 'Q1'end q,
b.sales sales
from a, b
where a.id = b.id)
pivot ( max(nvl(sales,0)) for Q in ('Q1', 'Q2', 'Q3', 'Q4'));
The comma-separated FROM list with the join condition in the WHERE clause is still valid ANSI SQL-92, and it is an inner join. The explicit JOIN syntax is just syntactic sugar for it.
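The GROUP BY variant above can be checked without an Oracle instance using the sketch below, which runs the same MAX(CASE ...) pattern through Python's built-in sqlite3 module. Table and column names follow the answer; the function name quarter_pivot and the snake_case output aliases are made up for illustration, and an explicit LEFT JOIN is used in place of the comma join.

```python
import sqlite3

A_ROWS = [(1, 141, 30), (2, 151, 50), (3, 151, 80)]
B_ROWS = [
    (1, 141, 30), (1, 142, 25), (1, 143, 45),
    (2, 151, 50), (2, 152, 60), (2, 153, 75),
    (3, 151, 80), (3, 152, 50), (3, 153, 70),
]

QUERY = """
SELECT a.id, a.strt, a.sales,
       -- each CASE is non-NULL on one joined row only; MAX() picks it up
       MAX(CASE WHEN b.quarter = a.strt + 1 THEN b.sales END) AS second_quarter_sales,
       MAX(CASE WHEN b.quarter = a.strt + 2 THEN b.sales END) AS third_quarter_sales
FROM a
LEFT JOIN b ON b.id = a.id
GROUP BY a.id, a.strt, a.sales
ORDER BY a.id
"""


def quarter_pivot():
    """Return one row per rep with sales for the 2nd and 3rd quarters."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (id INTEGER, strt INTEGER, sales INTEGER)")
    con.execute("CREATE TABLE b (id INTEGER, quarter INTEGER, sales INTEGER)")
    con.executemany("INSERT INTO a VALUES (?, ?, ?)", A_ROWS)
    con.executemany("INSERT INTO b VALUES (?, ?, ?)", B_ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in quarter_pivot():
        print(row)
```

MAX() works here because aggregate functions skip NULLs, so the single matching quarter per rep is the only candidate in each column.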

SQL - For each ID, values in other columns should be repeated

The table I am trying to create should look like this
ID Timeframe Value
1 60 15
1 60 30
1 90 45
2 60 15
2 60 30
2 90 45
3 60 15
3 60 30
3 90 45
So for each ID the values of 60,60,90 and 15,30,45 should be repeated.
Could anyone help me with the code? :)
You are looking for a cross join. The basic idea is something like this:
select i.id, tv.timeframe, tv.value
from (values (1), (2), (3)) i(id) cross join
(values (60, 15), (60, 30), (90, 45)) tv(timeframe, value)
order by i.id, tv.value;
Not all databases support the values() table constructor. In those databases, you would need to use the appropriate syntax.
So you have this table: ...
id
1
2
3
and you have this table: ...
timeframe value
60 15
60 30
90 45
Then try this:
WITH
-- the ID table...
id(id) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
)
,
-- the values table:
vals(timeframe,value) AS (
SELECT 60,15
UNION ALL SELECT 60,30
UNION ALL SELECT 90,45
)
SELECT
id
, timeframe
, value
FROM id CROSS JOIN vals
ORDER BY id, timeframe;
-- out id | timeframe | value
-- out ----+-----------+-------
-- out 1 | 60 | 30
-- out 1 | 60 | 15
-- out 1 | 90 | 45
-- out 2 | 60 | 30
-- out 2 | 60 | 15
-- out 2 | 90 | 45
-- out 3 | 60 | 30
-- out 3 | 60 | 15
-- out 3 | 90 | 45
-- out (9 rows)

SQL performing day difference by matching value

My goal is to get the duration from the 1st OLD or 1st NEW status to the 1st END, for each ID. For example, given Table1:
ID Day STATUS
111 1 NEW
111 2 NEW
111 3 OLD
111 4 END
111 5 END
112 1 OLD
112 2 OLD
112 3 NEW
112 4 NEW
112 5 END
113 1 NEW
113 2 NEW
The desired outcome would be:
STATUS Count
NEW 2 (1 for ID 111-New on day 1 to End on day 4,and 1 for 112-new on day 3 to End on day 5)
OLD 2 (1 for ID 111-Old on day 3 to End on day 4, and 1 for 112-OLD on day 1 to End on day 5)
The following is T-SQL (SQL Server) and will NOT run as-is in MySQL. The choice of dbms is vital in a question because there are so many dbms-specific choices to make. The query below requires a "window function", row_number() over(), and a common table expression, both of which MySQL only supports from version 8.0 onward. This solution also uses outer apply, which is SQL Server specific, but there are alternatives in Postgres and Oracle 12 using lateral joins.
SQL Fiddle
MS SQL Server 2014 Schema Setup:
CREATE TABLE Table1
(id int, day int, status varchar(3))
;
INSERT INTO Table1
(id, day, status)
VALUES
(111, 1, 'NEW'),
(111, 2, 'NEW'),
(111, 3, 'OLD'),
(111, 4, 'END'),
(111, 5, 'END'),
(112, 1, 'OLD'),
(112, 2, 'OLD'),
(112, 3, 'NEW'),
(112, 4, 'NEW'),
(112, 5, 'END'),
(113, 1, 'NEW'),
(113, 2, 'NEW')
;
Query 1:
with cte as (
select
*
from (
select t.*
, row_number() over(partition by id, status order by day) rn
from table1 t
) d
where rn = 1
)
select
t.id, t.day, ca.nxtDay, t.Status, ca.nxtStatus
from cte t
outer apply (
select top(1) Status, day
from cte nxt
where t.id = nxt.id
and t.status = 'NEW' and nxt.status = 'END'
order by day
) ca (nxtStatus, nxtDay)
where nxtStatus IS NOT NULL or Status = 'OLD'
order by id, day
Results:
| id | day | nxtDay | Status | nxtStatus |
|-----|-----|--------|--------|-----------|
| 111 | 1 | 4 | NEW | END |
| 111 | 3 | (null) | OLD | (null) |
| 112 | 1 | (null) | OLD | (null) |
| 112 | 3 | 5 | NEW | END |
As you can see, counting that Status column would result in NEW = 2 and OLD = 2
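For engines without apply or lateral joins, a portable variant can be sketched with a plain GROUP BY and a self-join, shown here through Python's built-in sqlite3 module. This is a swapped-in technique, not the answer's query: MIN(day) per (id, status) replaces row_number(), and an inner join to the first END row replaces outer apply, so both NEW and OLD get a duration. It assumes the first END always comes after the first NEW or OLD of the same id, and the names first_day, durations and status_counts are made up for illustration.

```python
import sqlite3
from collections import Counter

ROWS = [
    (111, 1, "NEW"), (111, 2, "NEW"), (111, 3, "OLD"), (111, 4, "END"),
    (111, 5, "END"), (112, 1, "OLD"), (112, 2, "OLD"), (112, 3, "NEW"),
    (112, 4, "NEW"), (112, 5, "END"), (113, 1, "NEW"), (113, 2, "NEW"),
]

QUERY = """
WITH first_day AS (              -- first day each status appears per id
    SELECT id, status, MIN(day) AS day
    FROM table1
    GROUP BY id, status
)
SELECT s.id, s.status, s.day AS start_day, e.day AS end_day,
       e.day - s.day AS duration
FROM first_day AS s
JOIN first_day AS e              -- ids with no END row drop out here
  ON e.id = s.id AND e.status = 'END'
WHERE s.status IN ('NEW', 'OLD')
ORDER BY s.id, s.day
"""


def durations():
    """Return (id, status, start_day, end_day, duration) rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (id INTEGER, day INTEGER, status TEXT)")
    con.executemany("INSERT INTO table1 VALUES (?, ?, ?)", ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


def status_counts():
    """Count how many ids reached END from each starting status."""
    return Counter(status for _, status, *_ in durations())


if __name__ == "__main__":
    for row in durations():
        print(row)
    print(dict(status_counts()))
```

ID 113 never reaches END, so the inner join leaves it out of both the durations and the counts.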