Duplicating records to fill gap between dates in Google BigQuery - google-bigquery

So I've found similar resources that address how to do this in SQL, like this:
Duplicating records to fill gap between dates
I understand that BigQuery may not be the best place to do this, so I'm trying to see if it's at all possible. When trying to run some of the methods in the link above, I'm hitting a wall because some of the functions aren't supported in BigQuery.
If a table exists with data structured like so:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/02/2016 00:00:00 1120010 21 100
08/03/2016 00:00:00 1120010 21 100
08/04/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/06/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
I know I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?

How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between
See example below
SELECT
MODIFY_DATE,
MAX(SKU_TEMP) OVER(PARTITION BY grp) AS SKU,
MAX(STORE_TEMP) OVER(PARTITION BY grp) AS STORE,
MAX(STOCK_ON_HAND_TEMP) OVER(PARTITION BY grp) AS STOCK_ON_HAND,
FROM (
SELECT
DAY AS MODIFY_DATE, SKU AS SKU_TEMP, STORE AS STORE_TEMP, STOCK_ON_HAND AS STOCK_ON_HAND_TEMP,
COUNT(SKU) OVER(ORDER BY DAY ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp,
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS DATES
LEFT JOIN (
SELECT DATE(MODIFY_DATE) AS MODIFY_DATE, SKU, STORE, STOCK_ON_HAND
FROM
(SELECT "2016-08-01" AS MODIFY_DATE, "1120010" AS SKU, 21 AS STORE, 100 AS STOCK_ON_HAND),
(SELECT "2016-08-05" AS MODIFY_DATE, "1120010" AS SKU, 21 AS STORE, 75 AS STOCK_ON_HAND),
(SELECT "2016-08-07" AS MODIFY_DATE, "1120010" AS SKU, 21 AS STORE, 40 AS STOCK_ON_HAND),
) AS TABLE_WITH_GAPS
ON TABLE_WITH_GAPS.MODIFY_DATE = DATES.DAY
)
ORDER BY MODIFY_DATE

I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
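The date-scaffold idea ports to other engines too. Below is a quick runnable sketch of the same technique (a calendar CTE left-joined against the sparse table, then forward-filled) using SQLite from Python; the table and column names are illustrative, not the asker's actual schema:

```python
import sqlite3

# Illustrative sketch: generate a calendar, then forward-fill the sparse
# stock table onto it. Not BigQuery syntax; SQLite stands in here.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stock (modify_date TEXT, sku TEXT, store INT, stock_on_hand INT);
INSERT INTO stock VALUES
  ('2016-08-01', '1120010', 21, 100),
  ('2016-08-05', '1120010', 21, 75),
  ('2016-08-07', '1120010', 21, 40);
""")

rows = con.execute("""
WITH RECURSIVE dates(day) AS (           -- scaffold: every day in the range
  SELECT '2016-08-01'
  UNION ALL
  SELECT date(day, '+1 day') FROM dates WHERE day < '2016-08-07'
)
SELECT d.day,
       -- forward fill: the latest stock row at or before each scaffold day
       (SELECT stock_on_hand FROM stock s
        WHERE s.modify_date <= d.day
        ORDER BY s.modify_date DESC LIMIT 1) AS stock_on_hand
FROM dates d
ORDER BY d.day
""").fetchall()
```

The correlated subquery does the forward fill directly; the `grp`/`MAX` trick in the answer above achieves the same result in engines where correlated subqueries are awkward.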

Related

How to merge rows startdate enddate based on column values using Lag Lead or window functions?

I have a table with 4 columns: ID, STARTDATE, ENDDATE and BADGE. I want to merge rows based on ID and BADGE values but make sure that only consecutive rows will get merged.
For example, If input is:
Output will be:
I have tried LAG/LEAD with bounded and unbounded preceding windows, but I am unable to achieve the output:
SELECT ID,
STARTDATE,
MAX(ENDDATE),
NAME
FROM (SELECT USERID,
IFF(LAG(NAME) over(Partition by USERID Order by STARTDATE) = NAME,
LAG(STARTDATE) over(Partition by USERID Order by STARTDATE),
STARTDATE) AS STARTDATE,
ENDDATE,
NAME
from myTable )
GROUP BY USERID,
STARTDATE,
NAME
We have to make sure that we merge only consecutive rows having the same ID and Badge.
Help will be appreciated, Thanks.
You can split the problem into two steps:
creating the right partitions
aggregating on the partitions with direct aggregation functions (MIN and MAX)
You can approach the first step using a boolean field that is 1 when there's no consecutive date match (row1.ENDDATE = row2.STARTDATE + 1 day). This value will indicate when a new partition should be created. Hence if you compute a running sum, you should have your correctly numbered partitions.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 DAY' = STARTDATE, 0, 1) AS boolval
FROM tab
)
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
Then the second step can be summarized in the direct aggregation of "STARTDATE" and "ENDDATE" using the MIN and MAX function respectively, grouping on your ranking value. For syntax correctness, you need to add "ID" and "Badge" too in the GROUP BY clause, even though their range of action is already captured by the computed ranking value.
WITH cte AS (
SELECT *,
IFF(LAG(ENDDATE) OVER(PARTITION BY ID, Badge ORDER BY STARTDATE) + INTERVAL '1 DAY' = STARTDATE, 0, 1) AS boolval
FROM tab
), cte2 AS (
SELECT *,
SUM(COALESCE(boolval, 0)) OVER(ORDER BY ID DESC, STARTDATE) AS rn
FROM cte
)
SELECT ID,
MIN(STARTDATE) AS STARTDATE,
MAX(ENDDATE) AS ENDDATE,
Badge
FROM cte2
GROUP BY ID,
Badge,
rn
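For reference, the same two-step approach (change flag, running sum, then MIN/MAX per group) can be sketched end-to-end in portable SQL. This sketch uses SQLite via Python with made-up sample rows, not Snowflake, so IFF becomes CASE WHEN:

```python
import sqlite3

# Sketch of the change-flag + running-sum islands technique.
# Sample rows are illustrative, not the asker's data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tab (id INT, startdate TEXT, enddate TEXT, badge INT);
INSERT INTO tab VALUES
  (51, '2019-04-29', '2019-09-15', 2),
  (51, '2019-09-16', '2019-11-16', 2),
  (51, '2019-11-17', '2020-08-16', 2),
  (51, '2020-08-17', '2021-04-03', 3);
""")

rows = con.execute("""
WITH cte AS (
  SELECT *,
         -- 1 starts a new island, 0 continues the previous one
         CASE WHEN date(LAG(enddate) OVER (PARTITION BY id, badge
                                           ORDER BY startdate), '+1 day')
                   = startdate
              THEN 0 ELSE 1 END AS boolval
  FROM tab
), cte2 AS (
  SELECT *, SUM(boolval) OVER (ORDER BY id, startdate) AS rn
  FROM cte
)
SELECT id, MIN(startdate), MAX(enddate), badge
FROM cte2
GROUP BY id, badge, rn
ORDER BY MIN(startdate)
""").fetchall()
```

The three consecutive badge-2 rows collapse into one range, while the badge-3 row stays separate.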
In Snowflake, this kind of gaps-and-islands problem can also be solved with the function CONDITIONAL_TRUE_EVENT, as in the query below.
The first CTE creates a column that flags a change event (true or false) whenever the value of the badge column changes.
The next CTE (cte_1) feeds this flag into CONDITIONAL_TRUE_EVENT, which produces a counter that increments whenever the flag is TRUE; that counter is used for grouping in the final query.
The final query is then just a MIN/MAX GROUP BY.
with cte as (
select
m.*,
case when badge <> lag(badge) over (partition by id order by null)
then true
else false end flag
from merge_tab m
), cte_1 as (
select c.*,
conditional_true_event(flag) over (partition by id order by null) cn
from cte c
)
select id,min(startdate) ms, max(enddate) me, badge
from cte_1
group by id,badge,cn
order by id desc, ms asc, me asc, badge asc;
Final output -
ID  MS          ME          BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  9999-12-31  3
With data -
select * from merge_tab;
ID  STARTDATE   ENDDATE     BADGE
51  1985-02-01  2019-04-28  1
51  2019-04-29  2019-04-28  2
51  2019-09-16  2019-11-16  2
51  2019-11-17  2020-08-16  2
51  2020-08-17  2021-04-03  3
51  2021-04-04  2021-04-05  1
51  2021-04-06  2022-05-05  2
51  2022-05-06  2022-08-20  2
51  2022-08-21  9999-12-31  3
10  2020-02-06  2019-04-28  3
10  2021-03-21  9999-12-31  3
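CONDITIONAL_TRUE_EVENT is Snowflake-specific; where it isn't available, the same counter is just a running sum of the change flag. A minimal Python sketch of its semantics (illustrative, not part of the answer above):

```python
from itertools import accumulate

def conditional_true_event(flags):
    """Emulate Snowflake's CONDITIONAL_TRUE_EVENT within one partition:
    a counter that increments each time the flag is true."""
    return list(accumulate(1 if f else 0 for f in flags))

# Made-up badge sequence for one id, in STARTDATE order.
badges = [1, 1, 2, 2, 1, 3, 3]

# Change-event flag: true whenever the badge differs from the previous row.
flags = [i > 0 and badges[i] != badges[i - 1] for i in range(len(badges))]

groups = conditional_true_event(flags)
```

Rows sharing a group number form one island of consecutive equal badges, ready for a MIN/MAX GROUP BY.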

How to create a start and end date with no gaps from one date column and to sum a value within the dates

I am new to SQL coding, using SQL Developer.
I have a table that has 4 columns: Patient ID (ptid), service date (dt), insurance payment amount (insr_amt), out of pocket payment amount (op_amt). (see table 1 below)
What I would like to do is: (1) create two columns, "start_dt" and "end_dt", from the "dt" column, so that if there are no date gaps for a patient ID, the start and end dates are that patient's first and last dates, but if there is a gap in service dates within a patient ID, a separate start/end date row is created per patient ID; and (2) sum the two payment amounts by patient ID within each start/end date range (see table 2 below).
What would be the way to run this using SQL code in SQL developer?
Thank you!
Table 1:
Ptid  dt        insr_amt  op_amt
A     1/1/2021  30        20
A     1/2/2021  30        10
A     1/3/2021  30        10
A     1/4/2021  30        30
B     1/6/2021  10        10
B     1/7/2021  20        10
C     2/1/2021  15        30
C     2/2/2021  15        30
C     2/6/2021  60        30
Table 2:
Ptid  start_dt  end_dt    total_insr_amt  total_op_amt
A     1/1/2021  1/4/2021  120             70
B     1/6/2021  1/7/2021  30              20
C     2/1/2021  2/2/2021  30              60
C     2/6/2021  2/6/2021  60              30
You didn't mention the specific database, so this solution is written for PostgreSQL. You can do:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select *,
sum(inc) over(partition by ptid order by dt) as grp
from (
select *,
case when dt - interval '1 day' = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
Result:
ptid start_dt end_dt total_insr_amt total_op_amt
----- ---------- ---------- -------------- -----------
A 2021-01-01 2021-01-04 120 70
B 2021-01-06 2021-01-07 30 20
C 2021-02-01 2021-02-02 30 60
C 2021-02-06 2021-02-06 60 30
See running example at DB Fiddle 1.
EDIT for Oracle
As requested, the modified query that works in Oracle is:
select
ptid,
min(dt) as start_dt,
max(dt) as end_dt,
sum(insr_amt) as total_insr_amt,
sum(op_amt) as total_op_amt
from (
select x.*,
sum(inc) over(partition by ptid order by dt) as grp
from (
select t.*,
case when dt - 1 = lag(dt) over(partition by ptid order by dt)
then 0 else 1 end as inc
from t
) x
) y
group by ptid, grp
order by ptid, grp
See running example at db<>fiddle 2.
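The inc/grp technique above is portable. Here is a runnable sketch that reproduces Table 2 using SQLite from Python (same logic, just SQLite date functions in place of PostgreSQL intervals):

```python
import sqlite3

# Rebuild Table 1, then apply the change-flag + running-sum grouping.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (ptid TEXT, dt TEXT, insr_amt INT, op_amt INT);
INSERT INTO t VALUES
  ('A','2021-01-01',30,20),('A','2021-01-02',30,10),
  ('A','2021-01-03',30,10),('A','2021-01-04',30,30),
  ('B','2021-01-06',10,10),('B','2021-01-07',20,10),
  ('C','2021-02-01',15,30),('C','2021-02-02',15,30),
  ('C','2021-02-06',60,30);
""")

rows = con.execute("""
SELECT ptid, MIN(dt) AS start_dt, MAX(dt) AS end_dt,
       SUM(insr_amt) AS total_insr_amt, SUM(op_amt) AS total_op_amt
FROM (
  SELECT *, SUM(inc) OVER (PARTITION BY ptid ORDER BY dt) AS grp
  FROM (
    SELECT *,
           -- inc = 1 whenever this visit does NOT directly follow the
           -- previous one, starting a new island
           CASE WHEN date(dt, '-1 day') =
                     LAG(dt) OVER (PARTITION BY ptid ORDER BY dt)
                THEN 0 ELSE 1 END AS inc
    FROM t
  )
)
GROUP BY ptid, grp
ORDER BY ptid, grp
""").fetchall()
```

The output matches Table 2: one row per gap-free run of service dates, with the payments summed within each run.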

SQL Count distinct per 30 days

Can SQL compute a distinct count over the trailing 30 days, i.e. MAU (monthly active users)? For example, if I have data like this:
date user
1/1/2020 A
1/2/2020 B
1/2/2020 C
...
1/30/2020 Z
And I transform it into this using DISTINCT COUNT:
date distinct_user
1/1/2020 1
1/2/2020 2
...
1/30/2020 30
To make it easier, assume that distinct_user is the number of distinct users active per day and there is no overlap between days (in reality there is overlap). So the MAU result will be like this:
date distinct_user MAU
1/1/2020 1 1
1/2/2020 2 3
...
1/30/2020 30 465
465 is the result of counting distinct users over 30 days (assuming no user overlap between days). So if there are 5 new users active on 1/31/2020, the result will be like this:
date distinct_user MAU
1/1/2020 1 1
1/2/2020 2 3
...
1/30/2020 30 465
1/31/2020 5 469
469 comes from (last MAU) + (new distinct users) - (distinct users from 1/1/2020, which falls out of the 30-day range), so the result is 465 + 5 - 1, with the assumption that the 5 users active on 1/31/2020 were not active from 1/2/2020 to 1/30/2020.
There are different approaches to answering this question; the better one in terms of performance may be the following:
SELECT mt1.`date`, SUM(mt2.distinct_user) AS MAU
FROM (
SELECT `date`
FROM myTable
GROUP BY `date`
) mt1 INNER JOIN (
SELECT `date`, SUM(distinct_user) AS distinct_user
FROM myTable
GROUP BY `date`
) mt2
WHERE mt2.`date` BETWEEN mt1.`date` - INTERVAL 29 DAY AND mt1.`date`
GROUP BY mt1.`date`
ORDER BY mt1.`date`;
Perhaps the simplest method is to "unpivot" the data and reaggregate:
with t1 as (
select date, user, 1 as inc
from t
union all
select date + interval 30 day, user, -1 as inc
from t
)
select date,
sum(case when sum_inc > 0 then 1 else 0 end) as running_30day_users
from (select t1.*,
sum(inc) over (partition by user order by date) as sum_inc
from t1
) t1
group by date;
I should note that this can also be expressed in SQL as:
select distinct date, running_30
from (select t.*,
count(distinct user) over (order by date range between interval '29' day preceding and current row) as running_30
from t
) t;
However, I'm not sure if Athena supports that syntax.
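Outside the database, the definition itself ("distinct users in the 30 days ending on each date") is easy to sketch. A small illustrative Python version, with made-up events:

```python
from datetime import date, timedelta

# (day, user) event pairs; duplicates per day are allowed.
events = [
    (date(2020, 1, 1), "A"),
    (date(2020, 1, 2), "B"),
    (date(2020, 1, 2), "C"),
    (date(2020, 1, 31), "A"),  # A returns, so still only 3 distinct users
    (date(2020, 2, 1), "D"),
]

def mau(events, day, window=30):
    """Distinct users active in the `window` days ending on `day`, inclusive."""
    lo = day - timedelta(days=window - 1)
    return len({u for d, u in events if lo <= d <= day})
```

On 2/1/2020 the window has slid past 1/1/2020, so user A's 1/1 event no longer counts; only A's 1/31 event and D remain in range.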

Select weekly data from date table

I have a table with a date column and other columns. The dates are all weekdays, excluding holidays and weekends. I need to select weekly data from the table (i.e. every Monday's data, and if Monday is a holiday, Tuesday's; the next row will be the following Monday's data, and so on).
Example table columns and data:
Date Rate StockQty
2018/08/31 22 25
2018/09/04 24 25
2018/09/05 23 24
2018/09/06 19 21
2018/09/07 25 22
2018/09/10 21 21
I need to select data such that the result will be:
Date Rate StockQty
2018/08/31 22 25
2018/09/04 24 25
2018/09/10 21 21
It selects one row per week: 9/3 is a Monday and a holiday, so Tuesday's date is selected, then next week's Monday date.
I tried to partition by DATEPART, but it lumps all weeks together.
create table #Date_rate
(
date smalldatetime,rate int,stockQty int
)
Insert into #Date_rate
select '2018/08/31', 22 , 25 union
select '2018/09/04', 24 , 25 union
select '2018/09/05', 23 , 24 union
select '2018/09/06', 19 , 21 union
select '2018/09/07', 25 , 22 union
select '2018/09/10', 21 , 21
select
a.date
,a.rate
,a.stockQty
from(
select
*
,dense_rank() over(partition by datepart(WEEK,date) order by datepart(WEEKDAY,date) asc) as SelectedDay
from #Date_rate
) a where SelectedDay=1
You can follow logic like this:
select t.*
from (select t.*,
row_number() over (partition by extract(year from date), extract(week from date) order by date asc) as seqnum
from t
) t
where seqnum = 1;
Date functions can vary by database. This uses ANSI/ISO standard functions.
This should work in SQL Server:
SELECT date,Rate,StockQty FROM
(SELECT
date,
Rate,
StockQty,
ROW_NUMBER() OVER(PARTITION BY YEAR(date), DATENAME(WK, Date) ORDER BY date) AS cnt
FROM
#temp
)m
WHERE
cnt = 1
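The "one row per week, earliest available day" logic can also be checked outside SQL. A small Python sketch using ISO weeks (note: DATEPART(WEEK, ...) in SQL Server uses US-style week numbering, so boundaries may differ slightly):

```python
from datetime import date

# (date, rate, stock_qty) rows from the question.
rows = [
    (date(2018, 8, 31), 22, 25),
    (date(2018, 9, 4), 24, 25),   # 9/3 (Monday) was a holiday
    (date(2018, 9, 5), 23, 24),
    (date(2018, 9, 6), 19, 21),
    (date(2018, 9, 7), 25, 22),
    (date(2018, 9, 10), 21, 21),
]

# Walk the rows in date order and keep the first row seen per ISO week,
# mirroring ROW_NUMBER() OVER (PARTITION BY week ORDER BY date) = 1.
picked, seen = [], set()
for row in sorted(rows):
    week = row[0].isocalendar()[:2]   # (ISO year, ISO week)
    if week not in seen:
        seen.add(week)
        picked.append(row)
```

Partitioning by (year, week) rather than week alone avoids merging week N of one year with week N of the next, the same reason the SQL answers include YEAR(date) in the partition.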

Get the last 30 unique days that had data

I am trying to run a query that will retrieve the most recent 30 days that have data (not the last 30 calendar days).
There can be several rows for the same date (so I can't just use LIMIT 30).
My data has the following formatting:
date count
2017-05-05 111
2017-05-05 78
2017-04-28 54
2017-01-11 124
Is there a way for me to add a WHERE clause to get the most recent 30 days with data?
Not sure if I correctly understand, though...
(this is for the most recent 2 days):
with t(date, count) as(
select '2017-05-05', 111 union all
select '2017-05-05', 78 union all
select '2017-04-28', 54 union all
select '2017-01-11', 124
)
select date from t group by date order by date desc limit 2
If you want all rows whose date is among the last 30 distinct dates in your table, you can use the dense_rank() window function:
select (t).*
from (select t, dense_rank() over (order by date desc)
from t) s
where dense_rank <= 30
or IN, with a sub-select:
select *
from t
where date in (select distinct date
from t
order by date desc
limit 30)
http://rextester.com/ESDLIM64772
select * from tablename
where datecolumn in (select TOP 30 datecolumn from tablename group by datecolumn order by datecolumn desc)
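The underlying idea in all of these answers ("keep only rows whose date is among the N most recent distinct dates") can be sketched in plain Python for clarity; this is illustrative, using n = 2 to match the sample data:

```python
from datetime import date

# (date, count) rows from the question; note the duplicate date.
rows = [
    (date(2017, 5, 5), 111),
    (date(2017, 5, 5), 78),
    (date(2017, 4, 28), 54),
    (date(2017, 1, 11), 124),
]

n = 2  # the question asks for 30; 2 keeps the example small
# Deduplicate dates, take the n most recent, then keep every row on them.
recent = set(sorted({d for d, _ in rows}, reverse=True)[:n])
kept = [r for r in rows if r[0] in recent]
```

Both 2017-05-05 rows survive because the filter is on distinct dates, not on row count, which is exactly why a plain LIMIT 30 doesn't work here.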