Fill Missing Dates for Running Total - sql

I have this table
UserID Date Sale
A 2021-05-01 3
A 2021-05-03 1
A 2021-05-03 2
A 2021-05-05 5
B 2021-05-02 4
B 2021-05-03 10
What I need is something that looks like this.
UserID Date DailySale RunningSale
A 2021-05-01 3 3
A 2021-05-02 NULL 3
A 2021-05-03 3 6
A 2021-05-04 NULL 6
A 2021-05-05 5 11
B 2021-05-01 NULL 0
B 2021-05-02 4 4
B 2021-05-03 10 14
B 2021-05-04 NULL 14
B 2021-05-05 NULL 14
I need to join the table to all the dates in a certain time period so that I can create a running sales total by date.
I figured out how to do each part separately: I know how to do a running sum using OVER (PARTITION BY ...), and I know I can join a calendar table to my sales table to get the time period. But I want to try the self-join method on distinct(datetime), and I'm not certain how to go about that. I've tried this, but it doesn't work for me. I have over 1 million rows, so it takes over 2 minutes to finish processing, and the running-sum column looks exactly like the daily-sum column.
What's the best way to go about this?
Edit: Corrected Table Sums

You need a calendar table here containing all dates. Consider the following approach:
WITH dates AS (
SELECT '2021-05-01' AS Date UNION ALL
SELECT '2021-05-02' UNION ALL
SELECT '2021-05-03' UNION ALL
SELECT '2021-05-04' UNION ALL
SELECT '2021-05-05'
)
SELECT
u.UserID,
d.Date,
SUM(t.Sale) AS DailySale,
SUM(COALESCE(SUM(t.Sale), 0)) OVER (PARTITION BY u.UserID ORDER BY d.Date) AS RunningSale
FROM (SELECT DISTINCT UserID FROM yourTable) u
CROSS JOIN dates d
LEFT JOIN yourTable t
ON t.UserID = u.UserID AND t.Date = d.Date
GROUP BY
u.UserID,
d.Date
ORDER BY
u.UserID,
d.Date
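For anyone who wants to try the idea locally, here is a minimal sketch, assuming SQLite 3.25 or later (for window functions) driven through Python's sqlite3 module rather than the asker's database. The table name sales, the column SaleDate, and the recursive dates CTE that generates the calendar are my own stand-ins, not part of the original answer. It aggregates per day first and then applies the window, which is just a rearrangement of the same calendar-join approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (UserID TEXT, SaleDate TEXT, Sale INTEGER);
INSERT INTO sales VALUES
  ('A', '2021-05-01', 3), ('A', '2021-05-03', 1), ('A', '2021-05-03', 2),
  ('A', '2021-05-05', 5), ('B', '2021-05-02', 4), ('B', '2021-05-03', 10);
""")

query = """
WITH RECURSIVE dates(dt) AS (          -- generated calendar, one row per day
  SELECT '2021-05-01'
  UNION ALL
  SELECT date(dt, '+1 day') FROM dates WHERE dt < '2021-05-05'
),
daily AS (                             -- one row per user per day, NULL if no sales
  SELECT u.UserID, d.dt, SUM(t.Sale) AS DailySale
  FROM (SELECT DISTINCT UserID FROM sales) u
  CROSS JOIN dates d
  LEFT JOIN sales t ON t.UserID = u.UserID AND t.SaleDate = d.dt
  GROUP BY u.UserID, d.dt
)
SELECT UserID, dt, DailySale,
       SUM(COALESCE(DailySale, 0)) OVER (PARTITION BY UserID ORDER BY dt) AS RunningSale
FROM daily
ORDER BY UserID, dt
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The COALESCE is what makes user B start at 0 on 2021-05-01 instead of NULL.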

Related

SQL : create intermediate data from date range

I have a table as shown here:
USER ROI DATE
1 5 2021-11-24
1 4 2021-11-26
1 6 2021-11-29
I want to fill in the ROI for the dates that fall between the existing dates. The expected result, from 2021-11-24 to 2021-11-30, is below:
USER ROI DATE
1 5 2021-11-24
1 5 2021-11-25
1 4 2021-11-26
1 4 2021-11-27
1 4 2021-11-28
1 6 2021-11-29
1 6 2021-11-30
You may use a calendar table approach here. Create a table containing all dates and then join with it. Sans an actual table, you may use an inline CTE:
WITH dates AS (
SELECT '2021-11-24' AS dt UNION ALL
SELECT '2021-11-25' UNION ALL
SELECT '2021-11-26' UNION ALL
SELECT '2021-11-27' UNION ALL
SELECT '2021-11-28' UNION ALL
SELECT '2021-11-29' UNION ALL
SELECT '2021-11-30'
),
cte AS (
SELECT USER, ROI, DATE, LEAD(DATE) OVER (PARTITION BY USER ORDER BY DATE) AS NEXT_DATE
FROM yourTable
)
SELECT t.USER, t.ROI, d.dt
FROM dates d
INNER JOIN cte t
ON d.dt >= t.DATE AND (d.dt < t.NEXT_DATE OR t.NEXT_DATE IS NULL)
ORDER BY d.dt;
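Here is a small runnable sketch of the same range join, assuming SQLite 3.25+ via Python's sqlite3 module. The names roi_log, user_id, start_dt and next_dt are hypothetical, and the second user is made-up data added only to show that partitioning LEAD by user keeps each user's ranges separate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roi_log (user_id INTEGER, roi INTEGER, dt TEXT);
INSERT INTO roi_log VALUES
  (1, 5, '2021-11-24'), (1, 4, '2021-11-26'), (1, 6, '2021-11-29'),
  (2, 9, '2021-11-25');  -- hypothetical second user, not in the question
""")

query = """
WITH RECURSIVE dates(dt) AS (          -- calendar for the requested period
  SELECT '2021-11-24'
  UNION ALL
  SELECT date(dt, '+1 day') FROM dates WHERE dt < '2021-11-30'
),
ranges AS (                            -- each row is valid until the user's next row
  SELECT user_id, roi, dt AS start_dt,
         LEAD(dt) OVER (PARTITION BY user_id ORDER BY dt) AS next_dt
  FROM roi_log
)
SELECT r.user_id, r.roi, d.dt
FROM dates d
JOIN ranges r
  ON d.dt >= r.start_dt AND (d.dt < r.next_dt OR r.next_dt IS NULL)
ORDER BY r.user_id, d.dt
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

ISO-formatted date strings compare correctly as text, which is why plain >= and < work here.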

SQL query joining on existing date records and max date for missing records

I have an items table with dates and values. As soon as the value gets to 1, there are no more records for that Itemid.
Item Table
Itemid ItemDate Value
1 2020-04-30 0.5
1 2020-05-31 0.75
1 2020-06-30 1.0
2 2020-05-31 0.6
2 2020-06-30 1.0
I want to join this with a simple date table
dateId EOMDate
1 2020-04-30
2 2020-05-31
3 2020-06-30
4 2020-07-31
5 2020-08-31
The result should produce one record for each date in the date table and for each item where the date is >= the Item date. Where there is an exact date match with the Item table, it will use that record from the item table. Where there is no matching record in the item table, then it uses the record with the Max(ItemDate) value, that exists in the item table.
So it should produce this:
Result EOMDate ItemDate Value
1 2020-04-30 2020-04-30 0.5
1 2020-05-31 2020-05-31 0.75
1 2020-06-30 2020-06-30 1.0
1 2020-07-31 2020-06-30 1.0
1 2020-08-31 2020-06-30 1.0
2 2020-05-31 2020-05-31 0.6
2 2020-06-30 2020-06-30 1.0
2 2020-07-31 2020-06-30 1.0
2 2020-08-31 2020-06-30 1.0
The item table has several hundred million rows, and the date table has 120 records (each month end for 10 years), so I need a solution that performs well. This has completely stumped me for some reason!
EDIT
My initial, non-working solution uses an OUTER APPLY:
select p.ItemId, p.ItemDate, d.EOMDate, p.Value
from (select ItemId, ItemDate, Value from Items) p
OUTER APPLY
(
SELECT EOMDate from dates
) d
order by p.ItemDate,d.EOMDate
However, it returns a table that has one record for each combination of Item date and EOM date. So in the above example, 20 records for ItemId 1 and 16 records for ItemId 2.
Here is the SQL to create the above example tables:
CREATE TABLE #Items (ItemId int, ItemDate date, [Value] float)
Insert into #Items (ItemId,ItemDate,[Value])
Values (1,'2020-04-30',0.5),(1,'2020-05-31',0.75),(1,'2020-06-30',1),(2,'2020-05-31',0.6),(2,'2020-06-30',1)
Create Table #dates (dateId int, EOMDate date)
Insert into #dates (dateId,EOMDate) Values (1,'2020-04-30'),(2,'2020-05-31'),(3,'2020-06-30'),(4,'2020-07-31'),(5,'2020-08-31')
One method uses apply:
select i.*, d.*
from (select ItemId, max(ItemDate) as max_date
from Items
group by ItemId
) i outer apply
(select top (1) d.*
from dates d
where d.EOMDate >= i.max_date
order by d.EOMDate asc
) d
You can use cross join and analytical function as follows:
Select * from
(Select a.ItemId, d.EOMDate, i.ItemDate, i.Value,
-- order descending so that rn = 1 is the latest item record on or before each EOMDate
Row_number() over (partition by a.ItemId, d.EOMDate order by i.ItemDate desc) as rn
From
(Select distinct ItemId from Items) a
Cross join Dates d
join Items i on i.ItemId = a.ItemId and d.EOMDate >= i.ItemDate) t
Where rn = 1
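To check the "latest record on or before each month end" logic against the sample data, here is a minimal sketch assuming SQLite 3.25+ through Python's sqlite3 module (not SQL Server, so no APPLY or temp tables). The table names items and dates are stand-ins for #Items and #dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (ItemId INTEGER, ItemDate TEXT, Value REAL);
INSERT INTO items VALUES
  (1, '2020-04-30', 0.5), (1, '2020-05-31', 0.75), (1, '2020-06-30', 1.0),
  (2, '2020-05-31', 0.6), (2, '2020-06-30', 1.0);
CREATE TABLE dates (dateId INTEGER, EOMDate TEXT);
INSERT INTO dates VALUES
  (1, '2020-04-30'), (2, '2020-05-31'), (3, '2020-06-30'),
  (4, '2020-07-31'), (5, '2020-08-31');
""")

query = """
SELECT ItemId, EOMDate, ItemDate, Value
FROM (
  SELECT i.ItemId, d.EOMDate, i.ItemDate, i.Value,
         -- rank candidates newest first within each (item, month end) pair
         ROW_NUMBER() OVER (PARTITION BY i.ItemId, d.EOMDate
                            ORDER BY i.ItemDate DESC) AS rn
  FROM dates d
  JOIN items i ON i.ItemDate <= d.EOMDate
)
WHERE rn = 1
ORDER BY ItemId, EOMDate
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On a table with hundreds of millions of rows the join fan-out matters, so treat this as a correctness check on the logic rather than a performance recommendation.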

From Change Log Table to Status on a Given Day

I am trying to convert a change log table into a historical status table using BigQuery's Standard SQL.
The part giving me a hang up is how to select the most recent change log that is before the date to join on.
I had not encountered window functions or indexing during my college years, so I would appreciate guidance on how to apply those functions if they're part of the ideal solution.
Change_Logs table
Update Key Tostring
1 2019-01-30 17:57:51.910 PS-5864 To Do
2 2019-02-11 20:59:08.582 PS-5864 In Progress
3 2019-02-12 19:52:18.733 PS-5864 Done
4 2019-01-31 16:52:12.832 PS-4672 To Do
5 2019-02-11 14:11:13.442 PS-4672 In Progress
6 2019-02-12 04:22:33.111 PS-4672 Done
Dates table
Date
1 2019-02-10
2 2019-02-11
3 2019-02-12
4 2019-02-13
Desired Result:
Date Key Status
1 2019-02-10 00:00:00.000 PS-5864 To Do
2 2019-02-10 00:00:00.000 PS-4672 To Do
3 2019-02-11 00:00:00.000 PS-5864 To Do
4 2019-02-11 00:00:00.000 PS-4672 To Do
5 2019-02-12 00:00:00.000 PS-5864 In Progress
6 2019-02-12 00:00:00.000 PS-4672 In Progress
7 2019-02-13 00:00:00.000 PS-5864 Done
8 2019-02-13 00:00:00.000 PS-4672 Done
Below is for BigQuery Standard SQL
#standardSQL
SELECT d.date, key,
ARRAY_AGG(status ORDER BY l.update DESC LIMIT 1)[OFFSET(0)] status
FROM `project.dataset.dates` d
JOIN `project.dataset.change_logs` l
ON DATE_DIFF(d.date, DATE(l.update), DAY) > 0
GROUP BY d.date, key
You can test, play with above using sample data from your question as in example below
#standardSQL
WITH `project.dataset.change_logs` AS (
SELECT DATETIME '2019-01-30 17:57:51.910' `update`, 'PS-5864' key, 'To Do' status UNION ALL
SELECT '2019-02-11 20:59:08.582', 'PS-5864', 'In Progress' UNION ALL
SELECT '2019-02-12 19:52:18.733', 'PS-5864', 'Done' UNION ALL
SELECT '2019-01-31 16:52:12.832', 'PS-4672', 'To Do' UNION ALL
SELECT '2019-02-11 14:11:13.442', 'PS-4672', 'In Progress' UNION ALL
SELECT '2019-02-12 04:22:33.111', 'PS-4672', 'Done'
), `project.dataset.dates` AS (
SELECT DATE '2019-02-10' `date` UNION ALL
SELECT '2019-02-11' UNION ALL
SELECT '2019-02-12' UNION ALL
SELECT '2019-02-13'
)
SELECT d.date, key,
ARRAY_AGG(status ORDER BY l.update DESC LIMIT 1)[OFFSET(0)] status
FROM `project.dataset.dates` d
JOIN `project.dataset.change_logs` l
ON DATE_DIFF(d.date, DATE(l.update), DAY) > 0
GROUP BY d.date, key
-- ORDER BY d.date, key
with result
Row date key status
1 2019-02-10 PS-4672 To Do
2 2019-02-10 PS-5864 To Do
3 2019-02-11 PS-4672 To Do
4 2019-02-11 PS-5864 To Do
5 2019-02-12 PS-4672 In Progress
6 2019-02-12 PS-5864 In Progress
7 2019-02-13 PS-4672 Done
8 2019-02-13 PS-5864 Done
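If you don't have a BigQuery project handy, the same "most recent change strictly before the date" logic can be sketched elsewhere. The following assumes SQLite 3.25+ via Python's sqlite3 module and swaps ARRAY_AGG(... LIMIT 1) for ROW_NUMBER(), so it is an equivalent formulation rather than the BigQuery query itself. The column names updated, issue_key and dt are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change_logs (updated TEXT, issue_key TEXT, status TEXT);
INSERT INTO change_logs VALUES
  ('2019-01-30 17:57:51.910', 'PS-5864', 'To Do'),
  ('2019-02-11 20:59:08.582', 'PS-5864', 'In Progress'),
  ('2019-02-12 19:52:18.733', 'PS-5864', 'Done'),
  ('2019-01-31 16:52:12.832', 'PS-4672', 'To Do'),
  ('2019-02-11 14:11:13.442', 'PS-4672', 'In Progress'),
  ('2019-02-12 04:22:33.111', 'PS-4672', 'Done');
CREATE TABLE dates (dt TEXT);
INSERT INTO dates VALUES ('2019-02-10'), ('2019-02-11'), ('2019-02-12'), ('2019-02-13');
""")

query = """
SELECT dt, issue_key, status
FROM (
  SELECT d.dt, l.issue_key, l.status,
         -- newest change first within each (date, key) pair
         ROW_NUMBER() OVER (PARTITION BY d.dt, l.issue_key
                            ORDER BY l.updated DESC) AS rn
  FROM dates d
  JOIN change_logs l ON date(l.updated) < d.dt   -- only changes from earlier days
)
WHERE rn = 1
ORDER BY dt, issue_key
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```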
The key idea is to generate the rows with a cross join. Then what you really want is lag(... ignore nulls), but that is not supported in BigQuery.
Instead, you can do some array manipulation:
select d.date, cl.key,
array_agg(cl.status ignore nulls order by d.date desc limit 2)[ordinal(2)]
from dates d cross join
(select distinct key from change_logs cl) k left join
change_logs cl
on date(cl.update) = d.date and cl.key = k.key;
EDIT:
The above is not quite correct, because we are missing dates that occur before the specified period. I think the simplest method is to add them and then remove them:
select *
from (select d.date, cl.key,
array_agg(cl.status ignore nulls order by d.date desc limit 2)[ordinal(2)]
from (select d.date
from dates d
union
select distinct date(cl.update)
from change_logs cl
) d cross join
(select distinct key from change_logs cl) k left join
change_logs cl
on date(cl.update) = d.date and cl.key = k.key
)
where date in (select d.date from dates);

Full History Join

Currently I am trying to figure out a join between two historized tables, where I want to synchronize both timelines.
As an example, I have the following two tables:
A
ID Value FROM TO
1 5 01.01.2018 31.03.2018
1 6 31.03.2018 08.04.2018
B
ID A_FK Value FROM TO
1 1 50 01.02.2018 01.04.2018
2 1 51 04.04.2018 10.04.2018
As a baseline, I want to take the timeline of table A and join table B, including NULL values, so that I know for which periods there is no matching value.
The desired result should look like this:
C
Value_A Value_B FROM TO
5 NULL 01.01.2018 01.02.2018
5 50 01.02.2018 31.03.2018
6 50 31.03.2018 01.04.2018
6 NULL 01.04.2018 04.04.2018
6 51 04.04.2018 08.04.2018
Can you help me with this? I made a start, but I fail to align the two histories correctly. Here is my attempt:
with a as (SELECT *
FROM (VALUES (1,5,'01.01.2018','31.03.2018')
, (1,6,'31.03.2018','08.04.2018')
) A (ID, VALUE, FROM, TO)),
b as (
SELECT *
FROM (VALUES (1,1,50,'01.02.2018','01.04.2018')
, (2,1,51,'04.04.2018','10.04.2018')
) A (ID,A_FK, VALUE, FROM, TO)
)
select
a.value as value_a,
b.value as value_b,
max(a.from,b.from) as from,
min(a.to,b.to) as to
from a
left outer join b on
a.id = b.a_fk and
a.from < b.to and
a.to > b.from;
As you can see, it aligns, but not the way I expected it to.
Thank you for your help.
As I suggested in the comments, you can solve your problem with the technique from my own answer to another question.
Here is one solution.
The test data:
create table a (
id integer,
value integer,
dtfrom date,
dtto date
);
create table b(
id integer,
a_fk integer,
value integer,
dtfrom date,
dtto date
);
insert into a values
(1, 5, '2018-01-01', '2018-03-31'),
(1, 6, '2018-03-31', '2018-04-08');
insert into b values
(1, 1, 50, '2018-02-01', '2018-04-01'),
(2, 1, 51, '2018-04-04', '2018-04-10');
The tricky part of this solution is generating the date intervals that aren't in either of your tables, such as 01.01.2018-01.02.2018 and 01.02.2018-31.03.2018. To do that, you need all the available dates in one table, so I created a VIEW called timmings to make it easier:
create or replace view timmings as
select a.dtfrom dt from a inner join b on a.id=b.a_fk
union
select a.dtto from a inner join b on a.id=b.a_fk
union
select b.dtfrom from a inner join b on a.id=b.a_fk
union
select b.dtto from a inner join b on a.id=b.a_fk;
After that you need a query to generate all available periods (starts and ends) so it will be:
select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1
order by start;
This will result in (with your sample data):
start dend
01/01/2018 01/02/2018
01/02/2018 31/03/2018
31/03/2018 01/04/2018
01/04/2018 04/04/2018
04/04/2018 08/04/2018
08/04/2018 10/04/2018
10/04/2018 null
With that, you can get all the values from table a that intersect with the periods:
select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
order by tm.start;
That results in:
id value start dend
1 5 01/01/2018 01/02/2018
1 5 01/02/2018 31/03/2018
1 6 31/03/2018 01/04/2018
1 6 01/04/2018 04/04/2018
1 6 04/04/2018 08/04/2018
And finally you LEFT JOIN it with b table:
select x.value as valueA,
b.value as valueB,
x.start as "from",
x.dend as "to"
from (select a.id, a.value, tm.start, tm.dend
from (select t1.dt as start,
(select min(t2.dt)
from timmings t2
where t2.dt>t1.dt) as dend
from timmings t1) tm
left join a on tm.start >= a.dtfrom and tm.dend <= a.dtto
where a.id is not null
) x
left join b on b.a_fk = x.id
and b.dtfrom <= x.start
and b.dtto >= x.dend
order by x.start;
Which will give you the result you want:
valueA valueB from to
5 null 01/01/2018 01/02/2018
5 50 01/02/2018 31/03/2018
6 50 31/03/2018 01/04/2018
6 null 01/04/2018 04/04/2018
6 51 04/04/2018 08/04/2018
See the final solution working here: http://sqlfiddle.com/#!9/36418e/1 It is MySQL, but since it is all ANSI SQL it will work just fine in DB2.
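As a compact variation on the steps above, here is a sketch assuming SQLite 3.25+ through Python's sqlite3 module. It collects the boundary dates in a CTE instead of a view and uses LEAD in place of the correlated min() subquery to find each period's end. The names bounds, periods, p_start, p_end and val are mine, and it assumes a single id as in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, val INTEGER, dtfrom TEXT, dtto TEXT);
CREATE TABLE b (id INTEGER, a_fk INTEGER, val INTEGER, dtfrom TEXT, dtto TEXT);
INSERT INTO a VALUES (1, 5, '2018-01-01', '2018-03-31'),
                     (1, 6, '2018-03-31', '2018-04-08');
INSERT INTO b VALUES (1, 1, 50, '2018-02-01', '2018-04-01'),
                     (2, 1, 51, '2018-04-04', '2018-04-10');
""")

query = """
WITH bounds(dt) AS (                   -- every boundary date from both tables
  SELECT dtfrom FROM a UNION SELECT dtto FROM a
  UNION SELECT dtfrom FROM b UNION SELECT dtto FROM b
),
periods AS (                           -- consecutive boundaries form the periods
  SELECT dt AS p_start, LEAD(dt) OVER (ORDER BY dt) AS p_end
  FROM bounds
)
SELECT a.val, b.val, p.p_start, p.p_end
FROM periods p
JOIN a                                 -- keep only periods covered by table a
  ON p.p_start >= a.dtfrom AND p.p_end <= a.dtto
LEFT JOIN b                            -- attach b where it covers the whole period
  ON b.a_fk = a.id AND b.dtfrom <= p.p_start AND b.dtto >= p.p_end
ORDER BY p.p_start
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The last boundary has no successor, so its p_end is NULL and the join to a drops it automatically.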
There is an excellent blog article about this: "Fun with Date Ranges" by John Maenpaa.
Secondly, if you have a chance to influence the DDL, I would recommend taking a closer look at Db2 Temporal Tables. They come with full SQL support (Time Travel SQL).
This is actually really simple if you have what's known as a Calendar table - a table with every date in it - although you can construct one on-the-fly if necessary. You can use it to turn this more obviously into a gaps-and-islands problem.
(You want one anyways, since they're one of the most useful analysis dimension tables):
SELECT valueA, valueB,
MIN(calendarDate) AS startDate,
MAX(calendarDate) + 1 DAY AS endDate
FROM (SELECT A.val AS valueA, B.val AS valueB, Calendar.calendarDate,
ROW_NUMBER() OVER(ORDER BY Calendar.calendarDate) -
ROW_NUMBER() OVER(PARTITION BY A.val, B.val ORDER BY Calendar.calendarDate) AS grouping
FROM Calendar
LEFT JOIN A
ON A.startDate <= Calendar.calendarDate
AND A.endDate > Calendar.calendarDate
LEFT JOIN B
ON B.startDate <= Calendar.calendarDate
AND B.endDate > Calendar.calendarDate
WHERE A.val IS NOT NULL
OR B.val IS NOT NULL) Groups
GROUP BY valueA, valueB, grouping
ORDER BY grouping
SQL Fiddle Example (Minor tweaks for SQL Server usage in example)
...which yields the following results. Note that there are a few extra days from the date range in table B that aren't present in table A!
valueA valueB startDate endDate
5 (null) 2018-01-01 2018-02-01
5 50 2018-02-01 2018-03-31
6 50 2018-03-31 2018-04-01
6 (null) 2018-04-01 2018-04-04
6 51 2018-04-04 2018-04-08
(null) 51 2018-04-08 2018-04-10
(This of course is trivially changeable by switching the join to A to a regular INNER JOIN, but I figured this and other cases would be important.)
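A runnable sketch of the calendar-based gaps-and-islands query, assuming SQLite 3.25+ via Python's sqlite3 module instead of Db2. The calendar is generated on the fly with a recursive CTE, and the names calendar, days, grp, dtfrom, dtto and val are stand-ins I chose. date(MAX(d), '+1 day') plays the role of MAX(calendarDate) + 1 DAY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, val INTEGER, dtfrom TEXT, dtto TEXT);
CREATE TABLE b (id INTEGER, a_fk INTEGER, val INTEGER, dtfrom TEXT, dtto TEXT);
INSERT INTO a VALUES (1, 5, '2018-01-01', '2018-03-31'),
                     (1, 6, '2018-03-31', '2018-04-08');
INSERT INTO b VALUES (1, 1, 50, '2018-02-01', '2018-04-01'),
                     (2, 1, 51, '2018-04-04', '2018-04-10');
""")

query = """
WITH RECURSIVE calendar(d) AS (        -- one row per day covering both tables
  SELECT '2018-01-01'
  UNION ALL
  SELECT date(d, '+1 day') FROM calendar WHERE d < '2018-04-10'
),
days AS (
  SELECT a.val AS value_a, b.val AS value_b, c.d,
         -- the difference of the two row numbers is constant within an island
         ROW_NUMBER() OVER (ORDER BY c.d)
         - ROW_NUMBER() OVER (PARTITION BY a.val, b.val ORDER BY c.d) AS grp
  FROM calendar c
  LEFT JOIN a ON a.dtfrom <= c.d AND a.dtto > c.d
  LEFT JOIN b ON b.dtfrom <= c.d AND b.dtto > c.d
  WHERE a.val IS NOT NULL OR b.val IS NOT NULL
)
SELECT value_a, value_b,
       MIN(d) AS start_date,
       date(MAX(d), '+1 day') AS end_date   -- end is exclusive
FROM days
GROUP BY value_a, value_b, grp
ORDER BY start_date
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```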

get median in overlap time range

I am using Vertica. For example, I have a table called revenue:
date revenue
2016-07-12 1
2016-07-12 10
2016-07-12 5
2016-07-12 3
2016-07-13 7
2016-07-13 120
2016-07-13 22
2016-07-14 5
2016-07-14 17
The tricky thing is that I don't want the median for each date; I want the median revenue over the time range on or after each given day. For example, the result would look like:
daterange median_revenue
>= 2016-07-12 7
>= 2016-07-13 17
>= 2016-07-14 11
to be clear:
7 = median(1,10,5,3,7,120,22,5,17)
17 = median(7,120,22,5,17)
11 = median(5,17)
How could I write a SQL script for these date ranges? Is there an easy way to query this? I don't want to calculate each date range separately and then union the results, because there are many days.
Would this help?
SELECT
date_table.[date],
MEDIAN (r.revenue) AS median_revenue
FROM
(SELECT DISTINCT [date] FROM revenue) date_table
LEFT JOIN revenue r ON r.[date] >= date_table.[date]
GROUP BY
date_table.[date]
I just figured it out:
select distinct date, median(revenue) over (partition by date) as rev_median
from (select a.date,b.revenue
from (select distinct date from revenue) a
left outer join revenue b
on a.date<=b.date order by a.date,b.date) a;
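The self-join idea can be checked against the sample numbers with a short sketch. This assumes SQLite through Python's sqlite3 module, not Vertica. SQLite's core has no median function, so the sketch registers a hypothetical Python aggregate named py_median using sqlite3's create_aggregate; the column name dt is also mine.

```python
import sqlite3
import statistics


class Median:
    """User-defined aggregate: collects non-NULL values, returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.executescript("""
CREATE TABLE revenue (dt TEXT, revenue INTEGER);
INSERT INTO revenue VALUES
  ('2016-07-12', 1), ('2016-07-12', 10), ('2016-07-12', 5), ('2016-07-12', 3),
  ('2016-07-13', 7), ('2016-07-13', 120), ('2016-07-13', 22),
  ('2016-07-14', 5), ('2016-07-14', 17);
""")

query = """
SELECT a.dt, py_median(b.revenue) AS median_revenue
FROM (SELECT DISTINCT dt FROM revenue) a
JOIN revenue b ON b.dt >= a.dt        -- every row on or after the anchor date
GROUP BY a.dt
ORDER BY a.dt
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each anchor date is paired with all rows on or after it, so the join grows roughly quadratically with the number of days; that is inherent to this approach.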