Select last non-null value and append it to another column (BigQuery/Python)

I have a table in BQ that looks like this:
Row Field DateTime
1 one 10:00 AM
2 null 10:05 AM
3 null 10:10 AM
4 one 10:30 AM
5 null 11:00 AM
6 two 11:15 AM
7 two 11:30 AM
8 null 11:35 AM
9 null 11:40 AM
10 null 11:50 AM
11 null 12:00 AM
12 null 12:15 AM
13 two 12:30 AM
14 null 12:15 AM
15 null 12:25 AM
16 null 12:35 AM
17 three 12:55 AM
I want to create another column called prevField and fill it with the last non-null Field value, but only when the non-null entries on either side of the run of nulls are the same. When the entries on either side differ, prevField should remain null. The result would look like the following:
Row Field DateTime prevField
1 one 10:00 AM null
2 null 10:05 AM one
3 null 10:10 AM one
4 one 10:30 AM one
5 null 11:00 AM null
6 two 11:15 AM two
7 two 11:30 AM two
8 null 11:35 AM two
9 null 11:40 AM two
10 null 11:50 AM two
11 null 12:00 AM two
12 null 12:15 AM two
13 two 12:30 AM two
14 null 12:15 AM null
15 null 12:25 AM null
16 null 12:35 AM null
17 three 12:55 AM three
So far I have tried the following variations for the first part of the question (filling prevField with the last non-null Field value when the entries around the nulls match), but without success:
select Field, DateTime,
-- (1) case when Field is null then LAG(Field) over (order by DateTime) else Field end as prevField
-- (2) LAST_VALUE(Field IGNORE NULLS) OVER (ORDER BY DateTime
-- (3)     ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS prevField
-- (4) first_value(Field) over (order by DateTime) as prevField
from table
EDIT: I added rows to the data and changed the row numbers.

You can use the following logic to achieve your goal.
Sample Data creation:
WITH Base AS (
  SELECT 123 Row, 'one' Field, '10:00 AM' DateTime
  UNION ALL SELECT 123, null, '10:05 AM'
  UNION ALL SELECT 123, null, '10:10 AM'
  UNION ALL SELECT 123, 'one', '10:30 AM'
  UNION ALL SELECT 456, null, '11:00 AM'
  UNION ALL SELECT 456, 'two', '11:15 AM'
  UNION ALL SELECT 789, 'two', '11:30 AM'
)
Logic: the query grabs the max and min DateTime for each Field value, as well as the lead and lag values for each row; based on those, it determines the prevField value.
SELECT a.Field, DateTime,
  CASE WHEN a.DateTime = a.min_date THEN ''
       WHEN a.lag_field IS NOT NULL AND a.lead_field IS NULL THEN a.lag_field
       WHEN a.lag_field IS NULL AND a.lead_field IS NOT NULL THEN a.lead_field
       WHEN a.lag_field != a.lead_field THEN a.lag_field
       WHEN a.Field IS NOT NULL AND a.lag_field IS NULL AND a.lead_field IS NULL
            AND a.DateTime = a.Max_date THEN a.Field
       ELSE ''
  END AS prevField
FROM (
  SELECT Base.Field, DateTime,
         LAG(Base.Field)  OVER (ORDER BY DateTime) AS lag_field,
         LEAD(Base.Field) OVER (ORDER BY DateTime) AS lead_field,
         min_date, Max_date
  FROM Base
  LEFT JOIN (SELECT Field, MIN(DateTime) min_date, MAX(DateTime) Max_date
             FROM Base GROUP BY Field) b
    ON Base.Field = b.Field
) a

This query partly solves my problem:
CREATE TEMP FUNCTION ToHex(x INT64) AS (
(SELECT STRING_AGG(FORMAT('%02x', x >> (byte * 8) & 0xff), '' ORDER BY byte DESC)
FROM UNNEST(GENERATE_ARRAY(0, 7)) AS byte)
);
SELECT
  DateTime
  , Field
  , SUBSTR(MAX(ToHex(row_n) || Field) OVER (ORDER BY row_n ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 17) AS previous
FROM (
  SELECT *, ROW_NUMBER() OVER (ORDER BY DateTime) AS row_n
  FROM `xx.yy.zz`
);
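For completeness, here is a sketch of the full "only fill when both sides agree" rule, using BigQuery's IGNORE NULLS navigation functions in both directions. The table name `xx.yy.zz` is taken from the query above and the column names from the question; treat this as an untested sketch rather than a verified solution:
SELECT
  Field,
  DateTime,
  CASE
    -- non-null rows keep their own value, except the very first one,
    -- which has no preceding non-null value (row 1 in the expected output)
    WHEN Field IS NOT NULL THEN IF(prev_val IS NULL, NULL, Field)
    -- null rows are filled only when the surrounding non-null values match
    WHEN prev_val = next_val THEN prev_val
  END AS prevField
FROM (
  SELECT
    Field,
    DateTime,
    LAST_VALUE(Field IGNORE NULLS) OVER (
      ORDER BY DateTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_val,
    FIRST_VALUE(Field IGNORE NULLS) OVER (
      ORDER BY DateTime ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS next_val
  FROM `xx.yy.zz`
);
The asymmetric frames (everything before the current row, everything after it) are what make prev_val and next_val the nearest non-null values on each side.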

Related

SSMS 2018 - Find Gaps in Dates and Flag the Gaps

I have reviewed many posts about how to find gaps in dates and believe I am close to figuring it out, but I need just a little extra help. Per my query, I am pulling distinct days with a record count for each distinct day. I have added a "Gap_Days" column, which should return zero if there is no gap from the previous date, or else the number of days since the previous date. As you can see, all of my Gap_Days are zero when in fact I am missing 10/24 and 10/25; therefore 10/26 should show a gap of 2, since the previous date is 10/23.
Thanks in advance for pointing out what I am probably looking right at.
SELECT DISTINCT Run_Date, COUNT(Run_Date) AS Daily_Count,
Gap_Days = Coalesce(DateDiff(Day,Lag(Run_Date) Over (partition by Run_Date order by Run_Date DESC), Run_Date)-1,0)
FROM tblUnitsOfWork
WHERE (Run_Date >= '2022-10-01')
GROUP BY Run_Date
ORDER BY Run_Date DESC;
Run_Date Daily_Count Gap_Days
2022-10-29 00:00:00.000 8431 0
2022-10-28 00:00:00.000 8204 0
2022-10-27 00:00:00.000 8705 0
2022-10-26 00:00:00.000 7885 0
2022-10-23 00:00:00.000 7485 0
2022-10-22 00:00:00.000 8699 0
2022-10-21 00:00:00.000 9212 0
2022-10-20 00:00:00.000 9220 0
First let's set up some demo data:
DECLARE @table TABLE (ID INT IDENTITY, date DATE)
DECLARE @dt DATE
WHILE (SELECT COUNT(*) FROM @table) < 30
BEGIN
    SET @dt = DATEADD(DAY, (ROUND(((50 - 1 - 1) * RAND() + 1), 0) - 1) - 25, CURRENT_TIMESTAMP)
    IF NOT EXISTS (SELECT 1 FROM @table WHERE date = @dt) INSERT INTO @table (date) SELECT @dt
END
ID date
--------
1 2022-11-10
2 2022-11-15
3 2022-10-20
...
28 2022-10-14
29 2022-11-13
30 2022-11-21
This gives us a table variable with 30 random dates in a 50 day window. Now let's look for missing dates:
SELECT *,
    CASE WHEN ROW_NUMBER() OVER (ORDER BY date) > 1
              AND LAG(date, 1) OVER (ORDER BY date) <> DATEADD(DAY, -1, date)
         THEN 'GAP! ' + CAST(DATEDIFF(DAY, LAG(date, 1) OVER (ORDER BY date), date) - 1 AS NVARCHAR) + ' DAYS MISSING!'
    END AS MissingDatesFlag
FROM @table
ORDER BY date
All we're doing here is ignoring the first date (since it's expected there wouldn't be one before it) and, from then on, comparing the previous date (using LAG ordered by date) to the current date. If it is not exactly one day before, the CASE expression produces a message with how many days are missing.
ID date MissingDatesFlag
----------------------------
1 2022-10-08 NULL
4 2022-10-09 NULL
25 2022-10-10 NULL
28 2022-10-11 NULL
22 2022-10-15 GAP! 4 DAYS MISSING!
2 2022-10-18 GAP! 3 DAYS MISSING!
12 2022-10-19 NULL
24 2022-10-20 NULL
....
15 2022-11-18 GAP! 3 DAYS MISSING!
29 2022-11-21 GAP! 3 DAYS MISSING!
20 2022-11-22 NULL
Since the demo data is randomly selected your results may vary, but they should be similar.
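As an aside, the reason the original query returns all zeros is the PARTITION BY Run_Date inside the LAG() call: window functions are evaluated after GROUP BY, so each partition contains exactly one row, LAG() always returns NULL, and the COALESCE turns that into 0. A minimal corrected sketch (same table and columns as the question, untested):
SELECT Run_Date,
       COUNT(*) AS Daily_Count,
       COALESCE(DATEDIFF(DAY, LAG(Run_Date) OVER (ORDER BY Run_Date), Run_Date) - 1, 0) AS Gap_Days
FROM tblUnitsOfWork
WHERE Run_Date >= '2022-10-01'
GROUP BY Run_Date
ORDER BY Run_Date DESC;
With the partition removed and the LAG() ordered ascending, "previous" means the previous calendar date present in the data, so 10/26 would report a gap of 2.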

Calculating slots with double bookings and null values

Example dataset.
CLINIC  APPTDATETIME         PATIENT_ID  NEW_FOLLOWUP_FLAG
TGYN    20/07/2022 09:00:00  1           N
TGYN    20/07/2022 09:45:00  2           F
TGYN    20/07/2022 10:05:00  NULL        NULL
TGYN    20/07/2022 10:05:00  4           F
TGYN    20/07/2022 10:25:00  5           F
TGYN    20/07/2022 10:30:00  NULL        NULL
TGYN    20/07/2022 10:35:00  NULL        NULL
TGYN    20/07/2022 10:40:00  NULL        NULL
TGYN    20/07/2022 10:45:00  NULL        NULL
TGYN    20/07/2022 11:10:00  6           F
TGYN    20/07/2022 11:10:00  7           F
As you can see, there are times with multiple patients, times with empty slots, and times with both (generally data-quality errors).
I'm trying to calculate how many slots were filled and how many of those were new (N) or follow-up (F). If there is a slot with a patient and also a NULL row, then I only want to count the row with the patient. If there are only NULL rows for a timeslot, then I want to count it as 'unfilled'.
From this dataset I would like to calculate the following for each group of clinic and appointment date:
CLINIC  APPTDATE    N Capacity  F Capacity  Unfilled Capacity
TGYN    20/07/2022  1           5           4
What's the best way to go about this?
I've considered taking a list of distinct values for each clinic and date and then joining to that, but wanted to know if there is a more elegant way.
First I set up some demo data in a table from what you provided:
DECLARE @table TABLE (CLINIC NVARCHAR(4), APPTDATETIME DATETIME, PATIENT_ID INT, NEW_FOLLOWUP_FLAG NVARCHAR(1))
INSERT INTO @table (CLINIC, APPTDATETIME, PATIENT_ID, NEW_FOLLOWUP_FLAG) VALUES
('TGYN','07/20/2022 09:00:00', 1 ,'N'),
('TGYN','07/20/2022 09:45:00', 2 ,'F'),
('TGYN','07/20/2022 10:05:00', NULL ,NULL),
('TGYN','07/20/2022 10:05:00', 4 ,'F'),
('TGYN','07/20/2022 10:25:00', 5 ,'F'),
('TGYN','07/20/2022 10:30:00', NULL ,NULL),
('TGYN','07/20/2022 10:35:00', NULL ,NULL),
('TGYN','07/20/2022 10:40:00', NULL ,NULL),
('TGYN','07/20/2022 10:45:00', NULL ,NULL),
('TGYN','07/20/2022 11:10:00', 6 ,'F'),
('TGYN','07/20/2022 11:10:00', 7 ,'F')
Reading through your description it looks like you'd need a couple of case statements and a group by:
SELECT CLINIC, CAST(APPTDATETIME AS DATE) AS APPTDATE,
SUM(CASE WHEN NEW_FOLLOWUP_FLAG = 'N' THEN 1 ELSE 0 END) AS NCapacity,
SUM(CASE WHEN NEW_FOLLOWUP_FLAG = 'F' THEN 1 ELSE 0 END) AS FCapacity,
SUM(CASE WHEN NEW_FOLLOWUP_FLAG IS NULL THEN 1 ELSE 0 END) AS UnfilledCapacity
FROM @table
GROUP BY CLINIC, CAST(APPTDATETIME AS DATE)
Which returns a result set like this:
CLINIC APPTDATE NCapacity FCapacity UnfilledCapacity
------------------------------------------------------------
TGYN 2022-07-20 1 5 5
Note that I cast the datetime column to a date and grouped by that.
The CASE expressions just test for a condition (is the column NULL, F, or N) and return a 1, which is then summed.
Your title also asked about finding duplicates in the data set. You should likely have a constraint on this table making CLINIC and APPTDATETIME forcibly unique; this would prevent duplicate rows from being inserted at all.
If you want to find them in the table try something like this:
SELECT CLINIC, APPTDATETIME, COUNT(*) AS Cnt
FROM @table
GROUP BY CLINIC, APPTDATETIME
HAVING COUNT(*) > 1
Which from the test data returned:
CLINIC APPTDATETIME Cnt
-----------------------------------
TGYN 2022-07-20 10:05:00.000 2
TGYN 2022-07-20 11:10:00.000 2
Indicating there are dupes for those clinic/datetime combinations.
HAVING is the magic here: we count the rows per clinic/datetime combination and keep only the groups whose count is greater than 1.
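If you do decide to enforce uniqueness at the database level, here is a sketch of such a constraint (the table name dbo.Appointments is hypothetical, since the question only shows sample data):
-- Rejects a second row for the same clinic and time slot.
ALTER TABLE dbo.Appointments
    ADD CONSTRAINT UQ_Appointments_Clinic_Slot UNIQUE (CLINIC, APPTDATETIME);
Be aware that this would also reject the deliberate double bookings in the sample data (two patients at 11:10), so it only fits if those are themselves errors.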
This is basically straightforward conditional aggregation with GROUP BY, with the slight complication of excluding NULL rows where a corresponding appointment also exists.
For this you can include an anti-semi-join using NOT EXISTS, so that a NULL row is not counted as unfilled capacity when there is also valid data for the same time slot:
select CLINIC, Convert(date, APPTDATETIME) AppDate,
Sum(case when NEW_FOLLOWUP_FLAG = 'N' then 1 end) N_Capacity,
Sum(case when NEW_FOLLOWUP_FLAG = 'F' then 1 end) F_Capacity,
Sum(case when NEW_FOLLOWUP_FLAG is null then 1 end) U_Capacity
from t
where not exists (
select * from t t2
where t.PATIENT_ID is null
and t2.PATIENT_ID is not null
and t.APPTDATETIME = t2.APPTDATETIME
)
group by CLINIC, Convert(date, APPTDATETIME);

Get cumulative distinct count of active IDs (IDs where deleted date is null as of/before the modified date)

I am facing a problem getting the cumulative distinct count of resource IDs as of different modified dates in Vertica. In the table below I have resource id, modified date and deleted date, and I want to calculate the count of distinct active resources as of each unique modified date. A resource is considered active when its deleted date is null as of/before that modified date.
I was able to get the count when, for a particular resource (say resource id 1), the active rows (deleted date null) or inactive rows (deleted date not null) don't occur consecutively.
But when they occur consecutively, I have to count the resource as 1 until it becomes inactive, then as 0 while it stays inactive (including all consecutive inactive rows) until it becomes active again; likewise for all the distinct resource ids, and then the cumulative sum of those.
sa_resource_id  modified_date            deleted_Date
1               2022-01-22 15:46:06.758  NULL
2               2022-01-22 15:46:06.758  NULL
16              2022-04-22 15:46:06.758  NULL
17              2022-04-22 15:46:06.758  NULL
18              2022-04-22 15:46:06.758  NULL
16              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
17              2022-04-29 15:46:06.758  2022-04-29 15:46:06.758
1               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
2               2022-05-22 15:46:06.758  2022-05-22 15:46:06.758
1               2022-05-23 22:16:06.758  NULL
1               2022-05-24 22:16:06.758  2022-05-24 22:16:06.758
1               2022-05-25 22:16:06.758  NULL
1               2022-05-27 22:16:06.758  NULL
This is the partition-and-sum query I have tried, where I partition the table by resource id and sum over the different modified dates.
SELECT md,
dca_agent_count
FROM
(
SELECT modified_date AS md,
SUM(SUM(CASE WHEN deleted_Date IS NULL THEN 1
WHEN deleted_Date IS NOT NULL THEN -1 ELSE 0
END)) OVER (ORDER BY modified_date) AS dca_agent_count
FROM
(
SELECT sa_resource_id,
modified_date,
deleted_Date,
ROW_NUMBER() OVER (
PARTITION BY sa_Resource_id, deleted_Date
ORDER BY modified_date desc
) row_num
FROM mf_Shared_provider_Default.dca_entity_resource_raw
WHERE sa_ResourcE_id IS NOT NULL
AND sa_resource_id IN ('1','2','34','16','17','18')
) t
GROUP BY modified_date
ORDER BY modified_Date
) b
Current Output:
md                       dca_agent_count
2022-01-22 15:46:06.758  2
2022-04-22 15:46:06.758  5
2022-04-29 15:46:06.758  3
2022-05-22 15:46:06.758  1
2022-05-23 22:16:06.758  2
2022-05-24 22:16:06.758  1
2022-05-25 22:16:06.758  2
2022-05-27 22:16:06.758  3
If you look at the output above, all the values are correct except the last row (2022-05-27), where I need a count of 2 instead of 3.
How do I get the cumulative distinct count of sa_resource_ids as of the modified dates, based on the deleted date condition (null/not null), such that the count does not change when the same deleted-date state occurs consecutively?
To me, a DATE has no hours, minutes, or seconds, let alone second fractions, so I renamed the time-containing attributes to %_ts, as they are TIMESTAMPs.
I had to completely start from scratch to solve it.
I think this is the first problem I had to solve with as many as 5 Common Table Expressions:
1. Add a Boolean is_active that is never NULL.
2. Add the previous is_active using LAG(); NULL here means there is no predecessor for the same resource id.
3. Remove the rows whose previous is_active is equal to the current is_active.
4. UNION ALL the positive COUNT DISTINCTs of the active rows and the negative COUNT DISTINCTs of the inactive rows. (Because step 3 removed rows, this can also drop timestamps, such as the last one here.)
5. Get the distinct timestamps from the original input for the final query.
The final query takes CTE 5 and LEFT JOINs it with CTE 4, making a running sum of the obtained distinct counts.
Here goes:
WITH
-- not part of the final query: this is your input data
indata(sa_resource_id,modified_ts,deleted_ts) AS (
SELECT 1,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 2,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 17,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 18,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 17,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 2,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-23 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-24 22:16:06.758',TIMESTAMP '2022-05-24 22:16:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-25 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-27 22:16:06.758',NULL
)
-- real query starts here, replace the following comma with "WITH" ...
,
-- need an "active flag" that is never null
w_active_flag AS (
SELECT
*
, (deleted_ts IS NULL) AS is_active
FROM indata
)
,
-- need current and previous is_active to filter ..
w_prev_flag AS (
SELECT
*
, LAG(is_active) OVER w AS prev_flag
FROM w_active_flag
WINDOW w AS(PARTITION BY sa_resource_id ORDER BY modified_ts)
)
,
-- use obtained filter arguments to filter out two consecutive
-- active or non-active rows for same sa_resource_id
-- this can remove timestamps from the final result
de_duped AS (
SELECT
sa_resource_id
, modified_ts
, is_active
FROM w_prev_flag
WHERE prev_flag IS NULL OR prev_flag <> is_active
)
-- get count distinct "sa_resource_id" only now
,
grp AS (
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) AS dca_agent_count
FROM de_duped
WHERE is_active
GROUP BY modified_ts
UNION ALL
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) * -1 AS dca_agent_count
FROM de_duped
WHERE NOT is_active
GROUP BY modified_ts
)
,
-- get back all input timestamps in a help table
tslist AS (
SELECT DISTINCT
modified_ts
FROM indata
)
SELECT
tslist.modified_ts
, SUM(NVL(dca_agent_count,0)) OVER w AS dca_agent_count
FROM tslist LEFT JOIN grp USING(modified_ts)
WINDOW w AS (ORDER BY tslist.modified_ts);
-- out modified_ts | dca_agent_count
-- out -------------------------+-----------------
-- out 2022-01-22 15:46:06.758 | 2
-- out 2022-04-22 15:46:06.758 | 5
-- out 2022-04-29 15:46:06.758 | 3
-- out 2022-05-22 15:46:06.758 | 1
-- out 2022-05-23 22:16:06.758 | 2
-- out 2022-05-24 22:16:06.758 | 1
-- out 2022-05-25 22:16:06.758 | 2
-- out 2022-05-27 22:16:06.758 | 2
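To trace why the de-duplication step fixes the last row: for resource id 1, the 2022-05-25 row (active) and the 2022-05-27 row (active) carry the same is_active consecutively, so the 05-27 row is filtered out in de_duped and contributes nothing to grp. The final LEFT JOIN against tslist still emits the 05-27 timestamp, and the running sum simply carries the 05-25 total of 2 forward, which is exactly the output the question asked for.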

SQL Dates Selection

I have an OPL_Dates table with start and end dates as below:
dbo.OPL_Dates
ID Start_date End_date
--------------------------------------
12345 1975-01-01 2001-12-31
12345 1989-01-01 2004-12-31
12345 2005-01-01 NULL
12345 2007-01-01 NULL
12377 2009-06-01 2009-12-31
12377 2013-02-07 NULL
12377 2010-01-01 2012-01-01
12489 2011-12-31 NULL
12489 2012-03-01 2012-04-01
The Output I am looking for is:
ID Start_date End_date
-------------------------------------
12345 1975-01-01 2004-12-31
12345 2005-01-01 NULL
12377 2009-06-01 2009-12-31
12377 2010-01-01 2012-01-01
12377 2013-02-07 NULL
12489 2011-12-31 NULL
Basically, I want to show the gaps between the OPL periods (if any); otherwise I need the min of Start_date and the max of End_date for a particular ID. NULL means an open-ended date, which can be converted to "9999-12-31".
The following pretty much does what you want:
with p as (
select v.*, sum(inc) over (partition by v.id order by v.dte) as running_inc
from t cross apply
(values (id, start_date, 1),
(id, coalesce(end_date, '2999-12-31'), -1)
) v(id, dte, inc)
)
select id, min(dte), max(dte)
from (select p.*, sum(case when running_inc = 0 then 1 else 0 end) over (partition by id order by dte desc) as grp
from p
) p
group by id, grp;
Note that it changes the "infinite" end date from NULL to 2999-12-31. This is a convenience, because NULL sorts first in SQL Server ascending sorts.
What is this doing? It unpivots the dates into a single column, with a 1/-1 flag (inc) indicating whether the record is a start or an end. The running sum of this flag then identifies the groups that should be combined: when the running sum reaches 0, a group has ended. To include the end date in the right group, a reverse running sum is needed, but that's a detail.
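As a worked trace of that logic for ID 12377 (NULL end date replaced by 2999-12-31; grp is the reverse running count of zeros):
dte          inc   running_inc   grp
2009-06-01    +1             1     3
2009-12-31    -1             0     3
2010-01-01    +1             1     2
2012-01-01    -1             0     2
2013-02-07    +1             1     1
2999-12-31    -1             0     1
Grouping by (id, grp) and taking MIN(dte) and MAX(dte) yields exactly the three rows expected for 12377.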

How to pick one non-null date from dates - if date is null pick next one

I need to pick one date from each week, and it has to be Friday. However, when Friday is null, meaning no data was entered, I have to find another day in the same week that has data. Can someone share their views on how to solve this type of situation?
As you can see in the following data, Friday of the 2nd week has a null entry, so another day has to be picked.
Day Weekdate Data entry dt Data
1 2/7/2016
2 2/8/2016
3 2/9/2016
4 2/10/2016
5 2/11/2016
6 2/12/2016 2/12/2016 500
7 2/13/2016
1 2/14/2016
2 2/15/2016
3 2/16/2016
4 2/17/2016 2/17/2016 300
5 2/18/2016
6 2/19/2016 NULL NULL
7 2/20/2016
1 2/21/2016
2 2/22/2016
3 2/23/2016
4 2/24/2016
5 2/25/2016
6 2/26/2016 2/26/2016 250
7 2/27/2016
You may try this:
--Not null data
select * from tblData
where DATEPART(dw,weekDate) = 6 and data is not null
Union
Select data.* from
(
select weekDate
from tblData
where DATEPART(dw,weekDate) = 6 and data is null
) nullData --Select Friday with null data
Cross Apply
(
--Find first record with not null data that is within this week
Select top 1 *
From tblData data
Where
data.weekDate between Dateadd(day, -6, nullData.weekDate) and nullData.weekDate
and data.data is not null
Order by data.weekDate desc
) data
You can try something like this to get the data entered for the latest date (Friday first, then every other day) for each week in your table:
SELECT
Weeks.FirstofWeek,
Detail.Day,
Detail.DataEntryDt,
Detail.Data
FROM
( --master list of weeks
SELECT DISTINCT DATEADD(DAY,(1-DATEPART(dw,Weekdate)),Weekdate) AS FirstofWeek
FROM dataTable
) AS Weeks
LEFT OUTER JOIN
( --detail
SELECT
--order first by presence of data, then by date, selecting Friday first:
ROW_NUMBER() OVER (PARTITION BY DATEADD(DAY,(1-DATEPART(dw,Weekdate)),Weekdate) ORDER BY CASE WHEN Data IS NOT NULL THEN 99 ELSE 0 END DESC, CASE WHEN [Day] = 6 THEN 99 ELSE [Day] END DESC) AS RowNum,
[Day],
DATEADD(DAY,(1-DATEPART(dw,Weekdate)),Weekdate) AS FirstofWeek,
Weekdate,
DataEntryDt,
Data
FROM dataTable
) AS Detail
ON Weeks.FirstofWeek = Detail.FirstofWeek
AND Detail.RowNum = 1 --get only top record for week with data present
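One caveat that applies to both answers: DATEPART(dw, ...) depends on the session's @@DATEFIRST setting, and Friday maps to 6 only under the default U.S. setting (DATEFIRST 7, i.e. weeks starting on Sunday). A quick sanity check, as a sketch:
SET DATEFIRST 7;  -- weeks start on Sunday, so Friday = 6
SELECT DATEPART(dw, CAST('20160212' AS date)) AS dw;  -- 2016-02-12 is a Friday; returns 6
Pinning DATEFIRST at the top of the script keeps the DATEPART(dw, ...) = 6 checks deterministic across servers.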