From Change Log Table to Status on a Given Day - sql

I am trying to convert a change log table into a historical status table using BigQuery's Standard SQL.
The part I'm stuck on is how to select the most recent change-log entry before each date to join on.
I had not encountered window functions or indexing during my college years, so I would appreciate guidance on how to apply those functions if they're part of the ideal solution.
Change_Logs table
Update Key Tostring
1 2019-01-30 17:57:51.910 PS-5864 To Do
2 2019-02-11 20:59:08.582 PS-5864 In Progress
3 2019-02-12 19:52:18.733 PS-5864 Done
4 2019-01-31 16:52:12.832 PS-4672 To Do
5 2019-02-11 14:11:13.442 PS-4672 In Progress
6 2019-02-12 04:22:33.111 PS-4672 Done
Dates table
Date
1 2019-02-10
2 2019-02-11
3 2019-02-12
4 2019-02-13
Desired Result:
Date Key Status
1 2019-02-10 00:00:00.000 PS-5864 To Do
2 2019-02-10 00:00:00.000 PS-4672 To Do
3 2019-02-11 00:00:00.000 PS-5864 To Do
4 2019-02-11 00:00:00.000 PS-4672 To Do
5 2019-02-12 00:00:00.000 PS-5864 In Progress
6 2019-02-12 00:00:00.000 PS-4672 In Progress
7 2019-02-13 00:00:00.000 PS-5864 Done
8 2019-02-13 00:00:00.000 PS-4672 Done

Below is for BigQuery Standard SQL
#standardSQL
SELECT d.date, key,
ARRAY_AGG(status ORDER BY l.update DESC LIMIT 1)[OFFSET(0)] status
FROM `project.dataset.dates` d
JOIN `project.dataset.change_logs` l
ON DATE_DIFF(d.date, DATE(l.update), DAY) > 0
GROUP BY d.date, key
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.change_logs` AS (
SELECT DATETIME '2019-01-30 17:57:51.910' `update`, 'PS-5864' key, 'To Do' status UNION ALL
SELECT '2019-02-11 20:59:08.582', 'PS-5864', 'In Progress' UNION ALL
SELECT '2019-02-12 19:52:18.733', 'PS-5864', 'Done' UNION ALL
SELECT '2019-01-31 16:52:12.832', 'PS-4672', 'To Do' UNION ALL
SELECT '2019-02-11 14:11:13.442', 'PS-4672', 'In Progress' UNION ALL
SELECT '2019-02-12 04:22:33.111', 'PS-4672', 'Done'
), `project.dataset.dates` AS (
SELECT DATE '2019-02-10' `date` UNION ALL
SELECT '2019-02-11' UNION ALL
SELECT '2019-02-12' UNION ALL
SELECT '2019-02-13'
)
SELECT d.date, key,
ARRAY_AGG(status ORDER BY l.update DESC LIMIT 1)[OFFSET(0)] status
FROM `project.dataset.dates` d
JOIN `project.dataset.change_logs` l
ON DATE_DIFF(d.date, DATE(l.update), DAY) > 0
GROUP BY d.date, key
-- ORDER BY d.date, key
with result
Row date key status
1 2019-02-10 PS-4672 To Do
2 2019-02-10 PS-5864 To Do
3 2019-02-11 PS-4672 To Do
4 2019-02-11 PS-5864 To Do
5 2019-02-12 PS-4672 In Progress
6 2019-02-12 PS-5864 In Progress
7 2019-02-13 PS-4672 Done
8 2019-02-13 PS-5864 Done
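The same "latest change strictly before each date" lookup can be tried outside BigQuery. Below is a minimal sketch using Python's built-in sqlite3 module; it swaps ARRAY_AGG for a correlated subquery with ORDER BY ... LIMIT 1, which is a different technique from the answer above. The column names day, issue_key and updated are renamed for illustration and are not from the question.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE change_logs (updated TEXT, issue_key TEXT, status TEXT);
INSERT INTO change_logs VALUES
  ('2019-01-30 17:57:51.910', 'PS-5864', 'To Do'),
  ('2019-02-11 20:59:08.582', 'PS-5864', 'In Progress'),
  ('2019-02-12 19:52:18.733', 'PS-5864', 'Done'),
  ('2019-01-31 16:52:12.832', 'PS-4672', 'To Do'),
  ('2019-02-11 14:11:13.442', 'PS-4672', 'In Progress'),
  ('2019-02-12 04:22:33.111', 'PS-4672', 'Done');
CREATE TABLE dates (day TEXT);
INSERT INTO dates VALUES
  ('2019-02-10'), ('2019-02-11'), ('2019-02-12'), ('2019-02-13');
""")

# One row per (day, key); the subquery picks the newest change logged
# strictly before midnight of that day. ISO-8601 strings sort
# chronologically, so plain text comparison is enough here.
rows = conn.execute("""
SELECT d.day, k.issue_key,
       (SELECT l.status
        FROM change_logs l
        WHERE l.issue_key = k.issue_key
          AND l.updated < d.day
        ORDER BY l.updated DESC
        LIMIT 1) AS status
FROM dates d
CROSS JOIN (SELECT DISTINCT issue_key FROM change_logs) k
ORDER BY d.day, k.issue_key
""").fetchall()

for row in rows:
    print(row)
```

Unlike the inner join above, this form keeps a (day, key) pair even when no earlier change exists; the status is simply NULL in that case.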

The key idea is to generate the rows with a cross join. What you really want is lag(... ignore nulls), but that is not supported in BigQuery.
Instead, you can do some array manipulation:
select d.date, cl.key,
array_agg(cl.status ignore nulls order by d.date desc limit 2)[ordinal(2)]
from dates d cross join
(select distinct key from change_logs cl) k left join
change_logs cl
on date(cl.update) = d.date and cl.key = k.key;
EDIT:
The above is not quite correct, because it misses changes that occur before the specified period. I think the simplest method is to add those dates and then remove them:
select *
from (select d.date, k.key,
array_agg(cl.status ignore nulls order by d.date desc limit 2)[ordinal(2)] as status
from (select d.date
from dates d
union distinct
select distinct date(cl.update)
from change_logs cl
) d cross join
(select distinct key from change_logs) k left join
change_logs cl
on date(cl.update) = d.date and cl.key = k.key
)
where date in (select d.date from dates d);

Related

SQL : create intermediate data from date range

I have a table as shown here:
USER ROI DATE
1 5 2021-11-24
1 4 2021-11-26
1 6 2021-11-29
I want to get the ROI for the dates in between the other dates, expected result will be as below
From 2021-11-24 to 2021-11-30
USER ROI DATE
1 5 2021-11-24
1 5 2021-11-25
1 4 2021-11-26
1 4 2021-11-27
1 4 2021-11-28
1 6 2021-11-29
1 6 2021-11-30
You may use a calendar table approach here. Create a table containing all dates and then join with it. Sans an actual table, you may use an inline CTE:
WITH dates AS (
SELECT '2021-11-24' AS dt UNION ALL
SELECT '2021-11-25' UNION ALL
SELECT '2021-11-26' UNION ALL
SELECT '2021-11-27' UNION ALL
SELECT '2021-11-28' UNION ALL
SELECT '2021-11-29' UNION ALL
SELECT '2021-11-30'
),
cte AS (
SELECT USER, ROI, DATE, LEAD(DATE) OVER (PARTITION BY USER ORDER BY DATE) AS NEXT_DATE
FROM yourTable
)
SELECT t.USER, t.ROI, d.dt
FROM dates d
INNER JOIN cte t
ON d.dt >= t.DATE AND (d.dt < t.NEXT_DATE OR t.NEXT_DATE IS NULL)
ORDER BY d.dt;
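The calendar-table idea can be checked end to end with a small sketch in Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer for LEAD, generates the calendar with a recursive CTE instead of a hand-written list, and uses the hypothetical names roi_log, user_id and day in place of the question's table and columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roi_log (user_id INTEGER, roi INTEGER, day TEXT);
INSERT INTO roi_log VALUES
  (1, 5, '2021-11-24'), (1, 4, '2021-11-26'), (1, 6, '2021-11-29');
""")

rows = conn.execute("""
WITH RECURSIVE dates(dt) AS (
    -- Calendar covering the requested range, one row per day.
    SELECT '2021-11-24'
    UNION ALL
    SELECT date(dt, '+1 day') FROM dates WHERE dt < '2021-11-30'
),
ranges AS (
    -- Each reading is valid from its own day until the user's next reading.
    SELECT user_id, roi, day,
           LEAD(day) OVER (PARTITION BY user_id ORDER BY day) AS next_day
    FROM roi_log
)
SELECT r.user_id, r.roi, d.dt
FROM dates d
JOIN ranges r
  ON d.dt >= r.day AND (d.dt < r.next_day OR r.next_day IS NULL)
ORDER BY r.user_id, d.dt
""").fetchall()

for row in rows:
    print(row)
```

The last reading has no next_day, so the IS NULL branch carries it forward to the end of the calendar.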

Fill Missing Dates for Running Total

I have this table
UserID Date Sale
A 2021-05-01 3
A 2021-05-03 1
A 2021-05-03 2
A 2021-05-05 5
B 2021-05-02 4
B 2021-05-03 10
What I need is something that looks like this.
UserID Date DailySale RunningSale
A 2021-05-01 3 3
A 2021-05-02 NULL 3
A 2021-05-03 3 6
A 2021-05-04 NULL 6
A 2021-05-05 5 11
B 2021-05-01 NULL 0
B 2021-05-02 4 4
B 2021-05-03 10 14
B 2021-05-04 NULL 14
B 2021-05-05 NULL 14
I need to join on itself with all the dates in a certain time period so I can create a running sum sales total by date.
I figured out how to do each part separately: I know how to do a running sum using OVER (PARTITION BY ...), and I know I can join a calendar table to my sales table to get the time period. But I want to try the self-join method on DISTINCT(datetime), and I'm not certain how to go about that. I've tried this, but it doesn't work for me. I have over 1 million rows, so it takes over 2 minutes to finish processing, and the running-sum column looks exactly like the daily-sum column.
What's the best way to go about this?
Edit: Corrected Table Sums
You need a calendar table here containing all dates. Consider the following approach:
WITH dates AS (
SELECT '2021-05-01' AS Date UNION ALL
SELECT '2021-05-02' UNION ALL
SELECT '2021-05-03' UNION ALL
SELECT '2021-05-04' UNION ALL
SELECT '2021-05-05'
)
SELECT
u.UserID,
d.Date,
SUM(t.Sale) AS DailySale,
SUM(COALESCE(SUM(t.Sale), 0)) OVER (PARTITION BY u.UserID ORDER BY d.Date) AS RunningSale
FROM (SELECT DISTINCT UserID FROM yourTable) u
CROSS JOIN dates d
LEFT JOIN yourTable t
ON t.UserID = u.UserID AND t.Date = d.Date
GROUP BY
u.UserID,
d.Date
ORDER BY
u.UserID,
d.Date
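For anyone who wants to try the cross join plus running sum on the sample data, here is a minimal sketch with Python's built-in sqlite3 module (SQLite 3.25 or newer is assumed for the window function). It splits the work into a daily CTE followed by the window sum rather than nesting SUM inside SUM, and the names sales, user_id, day and sale are illustrative stand-ins for the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (user_id TEXT, day TEXT, sale INTEGER);
INSERT INTO sales VALUES
  ('A', '2021-05-01', 3), ('A', '2021-05-03', 1), ('A', '2021-05-03', 2),
  ('A', '2021-05-05', 5), ('B', '2021-05-02', 4), ('B', '2021-05-03', 10);
""")

rows = conn.execute("""
WITH RECURSIVE dates(day) AS (
    SELECT '2021-05-01'
    UNION ALL
    SELECT date(day, '+1 day') FROM dates WHERE day < '2021-05-05'
),
daily AS (
    -- Every user paired with every day; days without sales sum to NULL.
    SELECT u.user_id, d.day, SUM(s.sale) AS daily_sale
    FROM (SELECT DISTINCT user_id FROM sales) u
    CROSS JOIN dates d
    LEFT JOIN sales s ON s.user_id = u.user_id AND s.day = d.day
    GROUP BY u.user_id, d.day
)
SELECT user_id, day, daily_sale,
       SUM(COALESCE(daily_sale, 0))
           OVER (PARTITION BY user_id ORDER BY day) AS running_sale
FROM daily
ORDER BY user_id, day
""").fetchall()

for row in rows:
    print(row)
```

Aggregating to one row per user and day before the window sum keeps the running total from double counting days that have several sales.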

Get max date for each from either of 2 columns

I have a table like below
AID BID CDate
-----------------------------------------------------
1 2 2018-11-01 00:00:00.000
8 1 2018-11-08 00:00:00.000
1 3 2018-11-09 00:00:00.000
7 1 2018-11-15 00:00:00.000
6 1 2018-12-24 00:00:00.000
2 5 2018-11-02 00:00:00.000
2 7 2018-12-15 00:00:00.000
And I am trying to get a result set as follows
ID MaxDate
-------------------
1 2018-12-24 00:00:00.000
2 2018-12-15 00:00:00.000
Each value in the ID columns (AID, BID) should return its max CDate.
For example, for 1 the max CDate is 2018-12-24 00:00:00.000 (here 1 appears under BID);
for 2 the max date is 2018-12-15 00:00:00.000 (here 2 is under AID).
I tried the following.
1.
select
g.AID,g.BID,
max(g.CDate) as 'LastDate'
from dbo.TT g
inner join
(select AID,BID,max(CDate) as maxdate
from dbo.TT
group by AID,BID)a
on (a.AID=g.AID or a.BID=g.BID)
and a.maxdate=g.CDate
group by g.AID,g.BID
and 2.
SELECT
AID,
CDate
FROM (
SELECT
*,
max_date = MAX(CDate) OVER (PARTITION BY [AID])
FROM dbo.TT
) AS s
WHERE CDate= max_date
Please suggest a 3rd solution.
You can assemble the data in a table expression first; then computing the max for each value is simple. For example:
select
id, max(cdate)
from (
select aid as id, cdate from t
union all
select bid, cdate from t
) x
group by id
You seem to only care about values that are in both columns. If this interpretation is correct, then:
select id, max(cdate)
from ((select aid as id, cdate, 1 as is_a, 0 as is_b
from t
) union all
(select bid as id, cdate, 0 as is_a, 1 as is_b
from t
)
) ab
group by id
having max(is_a) = 1 and max(is_b) = 1;
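Both variants can be run against the question's rows with a short sketch in Python's built-in sqlite3 module. The table name t follows the answers above; the dates are stored as plain ISO text without the time part, which is a simplification. Note that the flags in the second query must differ between the two branches (is_a = 1 for the AID rows, is_b = 1 for the BID rows) for the HAVING filter to mean "appears in both columns".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (aid INTEGER, bid INTEGER, cdate TEXT);
INSERT INTO t VALUES
  (1, 2, '2018-11-01'), (8, 1, '2018-11-08'), (1, 3, '2018-11-09'),
  (7, 1, '2018-11-15'), (6, 1, '2018-12-24'), (2, 5, '2018-11-02'),
  (2, 7, '2018-12-15');
""")

# Variant 1: stack both ID columns into one, then take the max per ID.
all_ids = conn.execute("""
SELECT id, MAX(cdate)
FROM (SELECT aid AS id, cdate FROM t
      UNION ALL
      SELECT bid, cdate FROM t) x
GROUP BY id
ORDER BY id
""").fetchall()

# Variant 2: same stacking, but keep only IDs seen in both columns.
in_both = conn.execute("""
SELECT id, MAX(cdate)
FROM (SELECT aid AS id, cdate, 1 AS is_a, 0 AS is_b FROM t
      UNION ALL
      SELECT bid, cdate, 0, 1 FROM t) ab
GROUP BY id
HAVING MAX(is_a) = 1 AND MAX(is_b) = 1
ORDER BY id
""").fetchall()

print(all_ids)
print(in_both)
```

On this sample, ID 7 also appears in both columns, so the second variant returns it alongside 1 and 2.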

How to write a query that would leave 1 row out

I have a set of data that looks like the table below. For each debnr, I want to remove one of the rows that has P in the type column; I don't care which one. The two rows with P in the type are identical except for the date. How would I select just one row with P in the type?
debnr docno date type num amount
4 NULL 2013-08-29 07:26:25.000 P 1761 -12
4 NULL 2013-09-12 00:00:00.000 P 1761 -12
4 168371 2013-08-29 00:00:00.000 I 168371 12
5 NULL 2013-10-11 09:24:58.000 P 7287 -24
5 NULL 2013-10-14 00:00:00.000 P 7287 -24
5 170366 2013-10-11 00:00:00.000 I 170366 24
6 NULL 2013-10-24 00:00:00.000 P 4023 -465
6 NULL 2013-10-24 09:42:18.000 P 4023 -465
6 171095 2013-10-24 00:00:00.000 I 171095 465
7 NULL 2013-12-16 00:00:00.000 P 171502 -394.2
7 NULL 2013-12-16 00:00:00.000 P 6601 -394.2
7 171502 2013-10-30 00:00:00.000 I 171502 394.2
How would I get it to look like this?
4 NULL 2013-09-12 00:00:00.000 P 1761 -12
4 168371 2013-08-29 00:00:00.000 I 168371 12
5 NULL 2013-10-14 00:00:00.000 P 7287 -24
5 170366 2013-10-11 00:00:00.000 I 170366 24
6 NULL 2013-10-24 09:42:18.000 P 4023 -465
6 171095 2013-10-24 00:00:00.000 I 171095 465
7 NULL 2013-12-16 00:00:00.000 P 6601 -394.2
7 171502 2013-10-30 00:00:00.000 I 171502 394.2
Shot in the dark:
select
debnr,
docno,
max(date),
type,
num,
amount
from magical_table
group by
debnr,
docno,
type,
num,
amount
For your sample data you could GROUP BY and use an aggregate. If the amount field weren't identical, however, you could use the ROW_NUMBER() function instead and avoid the aggregate. Partitioning by type as well as debnr numbers the P rows on their own, so exactly one of them gets RN = 1:
;WITH cte AS (SELECT *
,CASE WHEN type = 'P' THEN ROW_NUMBER() OVER(PARTITION BY debnr, type ORDER BY (SELECT 1))
ELSE 0
END AS RN
FROM Table1)
SELECT *
FROM cte
WHERE RN <= 1
The ORDER BY (SELECT 1) could be changed to any field, that's just one way to get an arbitrary result if you don't want a min/max.
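A minimal sketch of the ROW_NUMBER() idea, run through Python's built-in sqlite3 module (SQLite 3.25 or newer assumed). It is a variation rather than the exact query above: it partitions by debnr and type so the P rows are numbered separately from the I row, and it orders by date descending so the kept row is deterministic. The table name payments is hypothetical, and only debnr 4 and 7 from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (debnr INTEGER, docno INTEGER, date TEXT,
                       type TEXT, num INTEGER, amount REAL);
INSERT INTO payments VALUES
  (4, NULL,   '2013-08-29 07:26:25', 'P', 1761,   -12),
  (4, NULL,   '2013-09-12 00:00:00', 'P', 1761,   -12),
  (4, 168371, '2013-08-29 00:00:00', 'I', 168371,  12),
  (7, NULL,   '2013-12-16 00:00:00', 'P', 171502, -394.2),
  (7, NULL,   '2013-12-16 00:00:00', 'P', 6601,   -394.2),
  (7, 171502, '2013-10-30 00:00:00', 'I', 171502,  394.2);
""")

rows = conn.execute("""
WITH ranked AS (
    -- Number rows within each (debnr, type) group, newest first;
    -- num breaks ties when two P rows share a date.
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY debnr, type
                              ORDER BY date DESC, num) AS rn
    FROM payments
)
SELECT debnr, docno, date, type, num
FROM ranked
WHERE type <> 'P' OR rn = 1   -- every non-P row, plus one P row per debnr
ORDER BY debnr, type DESC
""").fetchall()

for row in rows:
    print(row)
```

Numbering within (debnr, type) matters: if the I row shared a partition with the P rows, an arbitrary ordering could hand it row number 1 and leave no P row to keep.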
Do you want the rows with type "I" left ungrouped?
select debnr, docno, max(date), type, num, amount
from magical_table
where type = 'P'
group by debnr, docno, type, num, amount
UNION
select debnr, docno, date, type, num, amount
from magical_table
where type = 'I'

SQL - Select next date query

I have a table with many IDs and many dates associated with each ID, and even a few IDs with no date. For each ID and date combination, I want to select the ID, date, and the next largest date also associated with that same ID, or null as next date if none exists.
Sample Table:
ID Date
1 5/1/10
1 6/1/10
1 7/1/10
2 6/15/10
3 8/15/10
3 8/15/10
4 4/1/10
4 4/15/10
4
Desired Output:
ID Date Next_Date
1 5/1/10 6/1/10
1 6/1/10 7/1/10
1 7/1/10
2 6/15/10
3 8/15/10
3 8/15/10
4 4/1/10 4/15/10
4 4/15/10
SELECT
mytable.id,
mytable.date,
(
SELECT
MIN(mytablemin.date)
FROM mytable AS mytablemin
WHERE mytablemin.date > mytable.date
AND mytable.id = mytablemin.id
) AS NextDate
FROM mytable
This has been tested on SQL Server 2008 R2 (but it should work on other DBMSs) and produces the following output:
id date NextDate
----------- ----------------------- -----------------------
1 2010-05-01 00:00:00.000 2010-06-01 00:00:00.000
1 2010-06-01 00:00:00.000 2010-07-01 00:00:00.000
1 2010-07-01 00:00:00.000 NULL
2 2010-06-15 00:00:00.000 NULL
3 2010-08-15 00:00:00.000 NULL
3 2010-08-15 00:00:00.000 NULL
4 2010-04-01 00:00:00.000 2010-04-15 00:00:00.000
4 2010-04-15 00:00:00.000 NULL
4 NULL NULL
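The correlated MIN subquery can be checked against the question's sample rows with a small sketch in Python's built-in sqlite3 module. The table and column names follow the query above; dates are stored as ISO text, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER, date TEXT);
INSERT INTO mytable VALUES
  (1, '2010-05-01'), (1, '2010-06-01'), (1, '2010-07-01'),
  (2, '2010-06-15'),
  (3, '2010-08-15'), (3, '2010-08-15'),
  (4, '2010-04-01'), (4, '2010-04-15'), (4, NULL);
""")

rows = conn.execute("""
SELECT mytable.id,
       mytable.date,
       (SELECT MIN(mytablemin.date)
        FROM mytable AS mytablemin
        WHERE mytablemin.date > mytable.date   -- strictly later date...
          AND mytablemin.id = mytable.id       -- ...for the same id
       ) AS NextDate
FROM mytable
ORDER BY mytable.id, mytable.date
""").fetchall()

for row in rows:
    print(row)
```

Because the comparison is strictly greater than, the two identical dates for id 3 do not count as each other's next date, and the row with no date gets NULL.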
Update 1:
For those that are interested, I've compared the performance of the two variants in SQL Server 2008 R2 (one uses MIN aggregate and the other uses TOP 1 with an ORDER BY):
Without an index on the date column, the MIN version had a cost of 0.0187916 and the TOP/ORDER BY version had a cost of 0.115073 so the MIN version was "better".
With an index on the date column, they performed identically.
Note that this was testing with just these 9 records so the results could be (very) spurious...
Update 2:
The results hold for 10,000 uniformly distributed random records. The TOP/ORDER BY query takes so long to run at 100,000 records I had to cancel it and give up.
If your database is Oracle, you can use the LEAD() and LAG() functions.
SELECT id, date,
LEAD(date) OVER (PARTITION BY id ORDER BY date) AS next_date
FROM Your_table
ORDER BY id;
SELECT
id,
date,
( SELECT t1.date
FROM mytable t1
WHERE t1.id = t2.id
AND t1.date > t2.date
ORDER BY t1.date LIMIT 1 ) AS next_date
FROM mytable t2
I think a self JOIN would be faster than a subselect.
WITH dates AS (
SELECT 1 AS ID, '2010-05-01' AS Date
UNION ALL SELECT 1, '2010-06-01'
UNION ALL SELECT 1, '2010-07-01'
UNION ALL SELECT 2, '2010-06-15'
UNION ALL SELECT 3, '2010-08-15'
UNION ALL SELECT 3, '2010-08-15'
UNION ALL SELECT 4, '2010-04-01'
UNION ALL SELECT 4, '2010-04-15'
UNION ALL SELECT 4, ''
)
SELECT
dates.ID,
dates.Date,
nextDates.Date AS Next_Date
FROM
dates
LEFT JOIN
dates nextDates
ON nextDates.ID = dates.ID
AND nextDates.Date > dates.Date
LEFT JOIN
dates noLower
ON noLower.ID = nextDates.ID
AND noLower.Date < nextDates.Date
AND noLower.Date > dates.Date
WHERE
dates.Date > 0
AND noLower.ID IS NULL
https://www.db-fiddle.com/f/4sWRLt2hxjik5HqiJ21ez8/1