Need to calculate next milestone in the sequence - SQL

I have a dataset something like this
I want to calculate the next clinical milestone for the ID as per the sequence number.
E.g. for ID 665 the next clinical milestone, as per the sequence, should be DBF because it doesn't have a date in the actual column. (We need to ignore intermediate milestones like FPA and FCI where no data is present in the actual column; the data is really dirty and dates can be earlier than the last one in the sequence.)
There is another case where all values in the actual column for an ID are null; in that case, the clinical milestone with the first non-null planned value should be the next one.
E.g. for ID 666, CPC should be the next clinical milestone.
I thought about using the LAG function together with the max of actual for an ID, but I'm not sure how that would work when two rows have the same actual date.

Use MAX() OVER () with a CASE expression to work out the current sequence value for each id, then filter based on that.
WITH resequenced AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY sequence) AS new_sequence
    FROM
        yourTable
    WHERE
        actual IS NOT NULL
        OR planned IS NOT NULL
),
summarised AS
(
    SELECT
        *,
        MAX(CASE WHEN actual IS NOT NULL THEN new_sequence ELSE 0 END) OVER (PARTITION BY id) AS last_sequence
    FROM
        resequenced
)
SELECT
    *
FROM
    summarised
WHERE
    new_sequence = last_sequence + 1
EDIT: Adapted to deal with gaps in both the actual and planned columns.
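To make the filter concrete, here is a hypothetical trace (the milestone names other than DBF and all dates are made up) of what the two CTEs produce for ID 665, where the third row is the last one with an actual date:

id   milestone  actual      planned     new_sequence  last_sequence
665  M1         2021-01-05  2021-01-01  1             3
665  M2         NULL        2021-02-01  2             3
665  M3         2021-03-02  2021-03-01  3             3
665  DBF        NULL        2021-04-01  4             3
665  M5         NULL        2021-05-01  5             3

last_sequence is 3 on every row of the ID, so the final filter new_sequence = last_sequence + 1 keeps only the DBF row, i.e. the next milestone. For an ID with no actual dates at all (like 666), the CASE expression falls back to 0, so the filter keeps new_sequence = 1: the first row that has a planned date.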

Start date and end date assigning based on date ranges and value change

I have two tables, temp_N and temp_C. The table script and data are given below. I am using Teradata.
Table Script and data
The first image is temp_N and the second one is temp_C.
Now I will try to explain my requirement. The key column for these two tables is 'nbr'. These two tables contain all the changes for a particular period of time (this is sample data; the two tables get loaded daily based on the updates). Now I need to merge these two tables into one table with the date ranges assigned correctly. The expected result is given below. To explain the logic behind the expected result: for the first row, fstrtdate is the least date from the two tables, which is 2022-01-31, and for the same row the end date is 2022-07-10 because there is a change in cpnrate on 2022-07-11. The second row starts with 2022-07-11, giving the changed cpnrate. In the third row there is a change in ntr on 2022-08-31 and the data is updated accordingly. Please note that all of these are date fields; there won't be any timestamps, so please ignore the timestamps in the screenshots.
Now I would like to know how to achieve this in SQL, or whether it is possible to achieve at all.
You can combine all the changes into a single table and order by effective start date (fstrtdate). Then you can compute the effective end date as the day prior to the next change, and where one of the data values is NULL use LAG ... IGNORE NULLS to "bring down" the desired previous non-NULL value:
SELECT nbr, fstrtdate,
       PRIOR(LEAD(fstrtdate) OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS fenddate,
       COALESCE(ntr, LAG(ntr) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS ntr,
       COALESCE(cpnrate, LAG(cpnrate) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) AS cpnrate
FROM (
    SELECT nbr, fstrtdate, MAX(ntr) AS ntr, MAX(cpnrate) AS cpnrate
    FROM (
        SELECT nbr, fstrtdate, ntr, NULL (DECIMAL(9,2)) AS cpnrate
        FROM temp_n
        UNION ALL
        SELECT nbr, fstrtdate, NULL (DECIMAL(9,2)) AS ntr, cpnrate
        FROM temp_c
    ) AS COMBINED
    GROUP BY 1, 2
) AS UNIQUESTART
ORDER BY fstrtdate;
The innermost SELECTs make the structure the same for data from both tables with NULLs for the values that come from the other table, so we can do a UNION to form one COMBINED derived table with rows for both types of change events. Note that you should explicitly assign datatype for those added NULL columns to match the datatype for the corresponding column in the other table; I somewhat arbitrarily chose DECIMAL(9,2) above since I didn't know the real data types. They can't be INT as in the example, though, since that would truncate the decimal part. There's no reason to carry along the original fenddate; a new non-overlapping fenddate will be computed in the outer query.
The intermediate GROUP BY is only to combine what would otherwise be two rows in the special case where both ntr and cpnrate changed on the same day for the same nbr. That case is not present in the example data - the dates are already unique - but it might be necessary to do this when processing the full table. The syntax requires an aggregate function, but there should be at most two rows for a (nbr, fstrtdate) group; and when there are two rows, in each of the other columns one row has NULL and the other row does not. In that case either MIN or MAX will return the non-NULL value.
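As a hypothetical illustration (the figures below are made up), if both ntr and cpnrate changed for nbr 233 on the same day, the UNION ALL would emit two half-populated rows that the GROUP BY then collapses via MAX:

nbr  fstrtdate   ntr         cpnrate
233  2022-07-11  250,000.00  NULL
233  2022-07-11  NULL        3.45

becomes

nbr  fstrtdate   ntr         cpnrate
233  2022-07-11  250,000.00  3.45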
In the outer query, the COALESCEs will return the value for that column from the current row in the UNIQUESTART derived table if it's not NULL; otherwise LAG is used to obtain the value from a previous row.
The first two rows in the result won't match the screenshot above but they do accurately reflect the data provided - specifically, the example does not identify a cpnrate for any date prior to 2022-05-11.
nbr  fstrtdate   fenddate    ntr         cpnrate
233  2022-01-31  2022-05-10  311,000.00  NULL
233  2022-05-11  2022-07-10  311,000.00  3.31
...  ...         ...         ...         ...

Function to REPLACE* last previous known value for NULL

I want to fill the NULL values with the last given value for that column. A small sample of the data:
2021-08-15 Bulgaria 1081636
2021-08-16 Bulgaria 1084693
2021-08-17 Bulgaria 1089066
2021-08-18 Bulgaria NULL
2021-08-19 Bulgaria NULL
In this example, the NULL values should be 1089066 until I reach the next non-NULL value.
I tried the answer given in this response, but to no avail. Any help would be appreciated, thank you!
EDIT: Sorry, I got sidetracked with trying to return the last value and forgot my ultimate goal, which is to replace the NULL values with the previous known value.
Therefore the query should be
UPDATE covid_data
SET people_vaccinated = ISNULL(?)
Assuming the number you have is always increasing, you can use MAX aggregate over a window:
SELECT dt
, country
, cnt
, MAX(cnt) OVER (PARTITION BY country ORDER BY dt)
FROM #data
If the number may decrease, the query becomes a little more complex: first we need to mark the rows that have nulls as belonging to the same group as the last row without a null:
SELECT dt
     , country
     , cnt
     , SUM(cnt) OVER (PARTITION BY country, partition)
FROM (
    SELECT country
         , dt
         , cnt
         , SUM(CASE WHEN cnt IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY country ORDER BY dt) AS partition
    FROM #data
) AS d
ORDER BY dt
Here's a working demo on dbfiddle. It returns the same data with an ever-increasing amount, but if you change the number for 08-17 to be lower than that of 08-16, you'll see the MAX(...) method producing wrong results.
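To make that failure mode concrete, suppose the 2021-08-17 count were 1080000 (a made-up number lower than the 2021-08-16 value). The running MAX would then keep reporting the 2021-08-16 value instead of carrying the newer, smaller one forward:

dt          country   cnt       MAX(cnt) OVER (...)
2021-08-16  Bulgaria  1084693   1084693
2021-08-17  Bulgaria  1080000   1084693   <- should be 1080000
2021-08-18  Bulgaria  NULL      1084693   <- last known value is 1080000
2021-08-19  Bulgaria  NULL      1084693   <- last known value is 1080000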
In many datasets it is incorrect to make assumptions about the behaviour of the underlying data. If your goal is simply to fill the blanks that might appear mid-way in a dataset, then the answer to the post you referenced, A: sql server nulls duplicate last known value in table, is still one of the best solutions; here is an adaptation:
SELECT dt
     , country
     , cnt
     , ISNULL(source.cnt, excludeNulls.LastCnt)
FROM #data source
OUTER APPLY (SELECT TOP 1 cnt AS LastCnt
             FROM #data
             WHERE dt < source.dt
               AND cnt IS NOT NULL
             ORDER BY dt DESC) excludeNulls
ORDER BY dt
MAX and LAST_VALUE will give you a value with respect to the entire record set, which would not work with the existing solutions if you had a value for 2021-08-19: in that case the last value would be used to fill the gaps, not the previous non-null value.
When we need to fill in gaps that occur part-way through the results, we need to apply a filter to the window query. TOP 1 ... ORDER BY gives us the ability to filter and sort on entirely different fields from the one that we want to capture, and also means that we can display the last value for fields that are not numeric; see this fiddle for a few other examples: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=372285d29f97dbb9663e8552af6fb7a2
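If the ultimate goal is the UPDATE from the question, the same OUTER APPLY pattern can be reused there. This is only a sketch: it assumes covid_data has columns dt, country and people_vaccinated (the real column names may differ), and that rows are identified per country and date:

UPDATE d
SET d.people_vaccinated = lastKnown.LastCnt
FROM covid_data d
OUTER APPLY (SELECT TOP 1 people_vaccinated AS LastCnt
             FROM covid_data
             WHERE country = d.country
               AND dt < d.dt
               AND people_vaccinated IS NOT NULL
             ORDER BY dt DESC) lastKnown
WHERE d.people_vaccinated IS NULL;

Only the NULL rows are touched, and each one is filled from the most recent earlier row for the same country that has a value.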

Row numbering based on contiguous data?

I need to assign numbers to rows based on a date. The rule is that the same number is assigned to multiple contiguous rows with the same date. When a row's date value differs from the previous row's date value, the number is incremented. The result set would look something like this (the first column would be used to determine row order):
1 7/1/2021 1
2 7/2/2021 2
3 7/2/2021 2
4 7/1/2021 3
5 7/2/2021 4
The value of the date is not what's relevant in this case. As you can see, there are repeats of the same date that get assigned different numeric values because they are not contiguous. I'm struggling to figure out how I would accomplish this.
This is a Gaps & Islands problem. You need to provide the extra ordering columns for the query to make sense.
If you added these, the solution would go along the lines of:
select
    d,
    1 + sum(inc) over (order by ordering_columns) as grp
from (
    select d, ordering_columns,
        case when d <> lag(d) over (order by ordering_columns) then 1 else 0 end as inc
    from t
) x
order by ordering_columns
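Using the first column from the sample as the ordering column, the inner query and the running sum behave like this (the lag() for the first row is NULL, so the <> test is not satisfied and inc stays 0):

order  d          inc  1 + sum(inc)
1      7/1/2021   0    1
2      7/2/2021   1    2
3      7/2/2021   0    2
4      7/1/2021   1    3
5      7/2/2021   1    4

which matches the expected grp values in the question.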

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query to perfect this instance but can't seem to do the trick because I am still receiving duplicates. Hoping I can get help with how to fix this issue.
SELECT DISTINCT
    1.Client,
    1.ID,
    1.Thing,
    1.Status,
    MIN(1.StatusDate) as 'statdate'
FROM
    SAMPLE 1
WHERE
    []
GROUP BY
    1.Client,
    1.ID,
    1.Thing,
    1.Status
My output is as follows
Client Id Thing Status Statdate
CompanyA 123 Thing1 Approved 12/9/2019
CompanyA 123 Thing1 Denied 12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date. I have about 30k rows to filter through, so whatever I use should not overload the query and stop it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY statdate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the first row for each id.
Use whichever of these you feel more comfortable with/understand:
SELECT
    *
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
    FROM sample
    WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you thus collect all the number 1's together, you have your set of "first records".
Or you can join back to a MIN:
SELECT
    *
FROM
    sample s
    INNER JOIN
    (SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
        ON s.ID = mins.ID AND s.StatusDate = mins.minDate
WHERE
    ...
This one prepares a list of all the IDs and their min dates, then joins it back to the main table. You thus get back all the data that was lost during the grouping operation; you cannot simultaneously "keep data" and "throw away data" during a group. If you group by more than just ID, you get more groups (as you have found), and if you only group by ID you lose the other columns. There isn't any way to say "GROUP BY id, AND take the MIN date, AND also take all the other data from the same row as the min date" without doing a "group by id, take min date, then join this data set back to the main dataset to get the other data for that min date". If you try to do it all in a single grouping you'll fail, because you either have to group by more columns or use aggregating functions for the other data in the SELECT, which mixes your data up; once the groups are formed, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records have identical min dates. The ROW_NUMBER form doesn't return duplicated records, but if two records have the same minimum StatusDate then which one you'll get is random. To force a specific one, ORDER BY more columns so you can be sure which will end up with 1.
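For example, a deterministic tie-break can be as simple as adding more columns to the window's ORDER BY (Status is used here purely as an illustrative tie-breaker; pick whichever column makes sense for your data):

ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate, status) as rn

Two rows with the same minimum StatusDate are then ranked by Status as well, so the same row wins every time the query runs.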

Redshift: Find MAX in list disregarding non-incremental numbers

I work for a sports film analysis company. We have teams with unique team IDs and I would like to find the number of consecutive weeks they have uploaded film to our site moving backwards from today. Each upload also has its own row in a separate table that I can join on teamid and has a unique date of when it was uploaded. So far I put together a simple query that pulls each unique DATEDIFF(week) value and groups on teamid.
SELECT teamid, MAX(weekdiff)
FROM (SELECT teamid, DATEDIFF(week, dateuploaded, GETDATE()) AS weekdiff
      FROM leroy_events
      GROUP BY teamid, weekdiff) t
GROUP BY teamid
What I am given is a list of teamIDs and unique weekly date differences. I would like to then find the max for each teamID without breaking an increment of 1. For example, if my data set is:
Team datediff
11453 0
11453 1
11453 2
11453 5
11453 7
11453 13
I would like the max value for team: 11453 to be 2.
Any ideas would be awesome.
I have simplified your example, assuming that I already have a table with a weekdiff column. That would be what you're computing with DATEDIFF.
First, I'm using the LAG() window function to assign the previous value (in the ordered set) of weekdiff to the current row.
Then, using a WHERE condition, I'm retrieving the max(weekdiff) among the rows whose previous value is current_value - 1, i.e. the consecutive weekdiffs.
Data:
create table leroy_events ( teamid int, weekdiff int);
insert into leroy_events values (11453,0),(11453,1),(11453,2),(11453,5),(11453,7),(11453,13);
Code:
WITH initial_data AS (
    SELECT
        teamid,
        weekdiff,
        LAG(weekdiff, 1) OVER (PARTITION BY teamid ORDER BY weekdiff) AS lag_weekdiff
    FROM
        leroy_events
)
SELECT
    teamid,
    MAX(weekdiff) AS max_weekdiff_consecutive
FROM
    initial_data
WHERE weekdiff = lag_weekdiff + 1 -- this ensures retrieving max() without breaking your consecutive increment
GROUP BY 1
SQLFiddle with your sample data to see how this code works.
Result:
teamid max_weekdiff_consecutive
11453 2
You can use SQL window functions to probe relationships between rows of the table. In this case the lag() function can be used to look at the previous row relative to a given order and grouping. That way you can determine whether a given row is part of a group of consecutive rows.
You still need overall to aggregate or filter to reduce the number of rows for each group of interest (i.e. each team) to 1. It's convenient in this case to aggregate. Overall, it might look like this:
select
    team,
    case min(datediff)
        when 0 then max(datediff)
        else -1
    end as max_weeks
from (
    select
        team,
        datediff,
        case
            when (lag(datediff) over (partition by team order by datediff) != datediff - 1)
                then 0
            else 1
        end as is_consec
    from diffs
) cd
where is_consec = 1
group by team
The inline view just adds an is_consec column to the data, marking whether each row is part of a group of consecutive rows. The outer query filters on that column (you cannot filter directly on a window function), and chooses the maximum datediff from the remaining rows for each team.
There are a few subtleties there:
The case expression in the inline view is written as it is to exploit the fact that the lag() computed for the first row of each partition will be NULL, which does not evaluate unequal (nor equal) to any value. Thus the first row in each partition is always marked consecutive.
The case testing min(datediff) in the outer select clause picks up teams that have no record with datediff = 0, and assigns -1 to column max_weeks for them.
It would also have been possible to mark rows non-consecutive if the first in their group did not have datediff = 0, but then you would lose such teams from the results altogether.
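Tracing the sample team through the inline view makes the filtering step visible (the first row's lag() is NULL, so it is marked consecutive):

team   datediff  lag(datediff)  is_consec
11453  0         NULL           1
11453  1         0              1
11453  2         1              1
11453  5         2              0
11453  7         5              0
11453  13        7              0

The outer query keeps only the is_consec = 1 rows, sees min(datediff) = 0, and therefore returns max(datediff) = 2 for team 11453.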