Adjusting Overlapping Dates and Applying Other Rules - sql

I'm trying to design a SELECT query that will modify some dates in a subset of "activity" records for each person in a table of data.
Each person is identified uniquely using "PersonID" and then each activity record using "RecordID".
Each activity record in a person's subset also has dates in its "Start Date" and "End Date" fields.
For example, data like this (sorted by start date then longest duration):
(I've added the yellow bars to give an idea of some of the overlap and gaps between sets of dates).
Where I work, we have a task that involves adding a maximum of 1 "claim" record to associate with each of these activity records. The claim records have their own Start Date and End Date, but each claim record must:
Not cover a duration outside of the Start Date and End Date of the activity record it's being attached to.
Not overlap with the duration of any other claim record in the person's subset of claim records.
Have a duration of at least 1 month, defined as either: starting on the first day of the month and ending on the last day of the month (e.g. 1/12/2018 to 31/12/2018), or starting in a month but ending in a different month (e.g. 31/12/2018 to 1/1/2019).
This is because the claim records are validated against the activity records (by an external validation tool we have no control over).
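The "at least 1 month" rule above can be checked mechanically. The following is only a minimal sketch in Python; the function name `is_claimable` is hypothetical, and it assumes both dates are inclusive and that "a different month" also covers a different year.

```python
import calendar
from datetime import date


def is_claimable(start: date, end: date) -> bool:
    """Return True if the period start..end meets the 'at least 1 month' rule."""
    if end < start:
        return False
    # Case 1: starts in one month and ends in a different month.
    if (start.year, start.month) != (end.year, end.month):
        return True
    # Case 2: same month, so it must run from the first day to the last day.
    last_day = calendar.monthrange(start.year, start.month)[1]
    return start.day == 1 and end.day == last_day


# The two examples from the rule, plus the short February period from record 5.
print(is_claimable(date(2018, 12, 1), date(2018, 12, 31)))  # full calendar month
print(is_claimable(date(2018, 12, 31), date(2019, 1, 1)))   # crosses a month boundary
print(is_claimable(date(2019, 2, 1), date(2019, 2, 4)))     # too short
```

The first two calls return True and the third returns False.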
So, based on the example above, the query might output the following efficient set of dates for each activity record to use as claim records:
A brief run-down of what would happen on each record:
Record 1: For claim record 1, it used the original dates from the activity record, i.e. it creates a claim record that covers the entire activity period. When working through the records in the sort order described, it makes sense to simply claim the full activity period for the first claim record in each person's subset of activity records.
Record 2: For claim record 2, NULLs have been supplied, as there is no period to claim on this activity that hasn't already been claimed in record 1.
Record 3: For claim record 3, the start date has been set to 1/12/2018, because that is the earliest date to claim from that is both within this activity period and after the end month of the last claim (i.e. record 1's 25/11/2018 end date).
The end date has been set to 31/12/2018. You may wonder why it is not set to the activity's original end date of 23/1/2019. If you look ahead to record 5, setting record 3's end date to 23/1/2019 would mean setting record 5's start date to 1/2/2019, and its end date would be 4/2/2019, which is not long enough to make a claim. So it would be more efficient to stop record 3 in December so that record 5 can claim both January AND February. This may be hard to script though!
Record 4: For claim record 4, NULLs have been supplied, as there is no period to claim on this activity that hasn't already been claimed in previous claim records.
Record 5: For claim record 5, the start date has been set to 1/1/2019. See record 3 for an explanation of why.
Record 6: For claim record 6, it used the original dates from the activity record. This was just to illustrate that not all activity records will overlap.
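The run-down above can be approximated by a simple greedy pass. The sketch below is not a full solution: it has no look-ahead, so it does not shorten an earlier claim to help a later one (the record 3 and 5 scenario). It assumes, based on the record 3 description, that a new claim may start no earlier than the first day of the month after the previous claim's end month. All function names and the sample dates are hypothetical, since the original tables are not reproduced here.

```python
import calendar
from datetime import date


def is_claimable(start, end):
    """'At least 1 month' rule: spans two months, or covers one whole month."""
    if end < start:
        return False
    if (start.year, start.month) != (end.year, end.month):
        return True
    return start.day == 1 and end.day == calendar.monthrange(end.year, end.month)[1]


def first_of_next_month(d):
    """First day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def greedy_claims(activities):
    """activities: one person's (start, end) pairs, sorted by start date then
    longest duration. Returns one (claim_start, claim_end) pair per activity,
    or (None, None) where nothing can be claimed."""
    claims = []
    last_end = None  # end date of the most recent claim made
    for start, end in activities:
        claim_start = start
        if last_end is not None:
            claim_start = max(start, first_of_next_month(last_end))
        if is_claimable(claim_start, end):
            claims.append((claim_start, end))
            last_end = end
        else:
            claims.append((None, None))
    return claims


# Hypothetical data shaped like records 1, 2, 3 and 5 in the description.
sample = [
    (date(2018, 9, 10), date(2018, 11, 25)),
    (date(2018, 10, 1), date(2018, 11, 20)),
    (date(2018, 11, 1), date(2019, 1, 23)),
    (date(2019, 1, 20), date(2019, 2, 4)),
]
for claim in greedy_claims(sample):
    print(claim)
```

On this sample the third activity is claimed from 1/12/2018 to 23/1/2019, which leaves the fourth with nothing to claim. Avoiding that would need some comparison of alternative end dates for the earlier claim (for example, trying each month end), which this sketch deliberately leaves out.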
I'm not too sure how to approach this. I've looked at some CTE examples, but nothing that seems to match what I'm trying to do (perhaps too ambitious.. particularly the record 3 & 5 scenario?)
Any help / examples would be much appreciated.

Related

SQL - Update groups of data based on start and end dates

I have a table with dates of service for various hospital stays and want to update the starting and end dates for each claim to match the length of the entire stay. The table below has seven inpatient stays and dates of service for each of those stays. A min_max flag of 1 or 2 means that the dates in that row cover the entire length of that specific stay (each stay is color-coded).
Current table image here
I need to update the dates for all rows within each colored grouping to match the start and end dates of the row that has a min_max flag of 1 or 2 within the same group, so that I can ultimately find the sum of claims in each stay. I could do this manually here or in Excel, but I need it done on a much larger scale with thousands of hospital stays.
Goal table here
TIA!

SQL- Return less than 24 hour or 1 day difference between 2 dates values using self join

Logic: If an admission date is not reported on all claims, combine claims for the same beneficiary and provider that have less than a one-day break between the end date of the first claim and the start date of the next claim. For example, if the end date of the first claim is December 18 and the start date of the next claim is December 19, then combine the claims as a single stay. However, if the second claim has a start date of December 20 or later, then do not combine the claims.
I am working with MS SQL Server, and since the goal is to combine claims with less than a one-day break, I would prefer to delete all but one record for the same beneficiary.
The database is denormalized, with one table and 52 fields, so I would prefer not to use a left or right join because it would duplicate records. I watched this YouTube video https://www.youtube.com/watch?v=Ip6Ty2qmQXg and it shows how to self-join without doubling the records.
Given that, this is what I have come up with so far. Since the question said to combine claims, I want to delete the "duplicate" records, but I do not know how to delete after a self join. See the code below:
-- self join
SELECT *
FROM #Metric2InpatientNoDischargeNoAdmission BEN1, #Metric2InpatientNoDischargeNoAdmission BEN2
WHERE BEN1.BEN_ID = BEN2.BEN_ID
AND BEN1.PERF_PROV_KEY = BEN2.PERF_PROV_KEY
AND BEN1.DTE_FIRST_SVC < BEN2.DTE_FIRST_SVC
AND DATEDIFF(day, BEN1.DTE_LAST_SVC, BEN2.DTE_FIRST_SVC) > 1
Edit:
I ran the above code and it returned more rows than the original table so this code is obviously not right. Help!!! :(
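One common alternative to a self join for this kind of "combine adjacent rows" logic is a gaps-and-islands query using window functions. The sketch below is an illustration only, not the asker's method: it runs against an in-memory SQLite database (3.25 or later, for window functions) from Python, uses a hypothetical table name `claims` with the column names from the query above, and stores dates as ISO text. In SQL Server the day difference would be written with `DATEDIFF(day, ...)` instead of `julianday`. It compares each claim only with the immediately preceding one, so claims nested inside a longer claim would need extra handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (BEN_ID INTEGER, PERF_PROV_KEY INTEGER, "
    "DTE_FIRST_SVC TEXT, DTE_LAST_SVC TEXT)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2018-12-10", "2018-12-18"),
        (1, 10, "2018-12-19", "2018-12-22"),  # starts the day after: same stay
        (1, 10, "2018-12-24", "2018-12-26"),  # two-day break: new stay
        (2, 10, "2018-12-19", "2018-12-20"),  # different beneficiary
    ],
)

STAYS_SQL = """
WITH flagged AS (
    -- new_stay = 1 when there is no previous claim or the break exceeds 1 day
    SELECT *,
           CASE WHEN julianday(DTE_FIRST_SVC)
                     - julianday(LAG(DTE_LAST_SVC) OVER (
                           PARTITION BY BEN_ID, PERF_PROV_KEY
                           ORDER BY DTE_FIRST_SVC)) <= 1
                THEN 0 ELSE 1 END AS new_stay
    FROM claims
),
numbered AS (
    -- a running total of the flags gives each stay its own number
    SELECT *,
           SUM(new_stay) OVER (
               PARTITION BY BEN_ID, PERF_PROV_KEY
               ORDER BY DTE_FIRST_SVC) AS stay_no
    FROM flagged
)
SELECT BEN_ID, PERF_PROV_KEY,
       MIN(DTE_FIRST_SVC) AS stay_start,
       MAX(DTE_LAST_SVC)  AS stay_end
FROM numbered
GROUP BY BEN_ID, PERF_PROV_KEY, stay_no
ORDER BY BEN_ID, PERF_PROV_KEY, stay_start
"""

stays = conn.execute(STAYS_SQL).fetchall()
for stay in stays:
    print(stay)
```

This returns one row per combined stay rather than deleting anything, which avoids the row multiplication that a self join produces.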

Boolean conditions that span rows in Spark

I'm trying to calculate a boolean column based on a group and date range.
I have a table that records transactions with the following row structure:
Person GUID - Date - Payment Amount
There are multiple rows per person.
What I want is a new boolean column, called Recent that is determined by whether the person had a transaction within a time period of say, 3 days prior. It would be True if they have, False if they have not.
Any idea for a query to do this?
It depends on when the start time for the beginning of "prior" is. If it's "now" (the current time), then it's quite easy: you want to find the max date per person and then filter on that being no more than some distance from the current time.
Take a look at window functions in Spark and how they can be used with time series.
To find the max date you'll use an expression such as
max(Date) over (partition by Person) as max_date
Hope this helps.
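If "prior" is instead measured from each transaction rather than from now, a similar window function can compare every row with the same person's previous transaction. The sketch below shows the idea in SQLite (3.25 or later) driven from Python, purely so that it is runnable; the table and column names (`transactions`, `person`, `txn_date`, `amount`) are hypothetical, and in Spark SQL the same shape would presumably use `lag` together with `datediff`. It treats a previous transaction on the same day as recent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (person TEXT, txn_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("a", "2021-01-01", 10.0),
        ("a", "2021-01-03", 5.0),   # 2 days after the previous one
        ("a", "2021-01-10", 7.5),   # 7 days after the previous one
        ("b", "2021-01-05", 3.0),   # no previous transaction
    ],
)

RECENT_SQL = """
SELECT person, txn_date,
       -- 1 if the previous transaction for this person is at most 3 days
       -- earlier, else 0 (also 0 when there is no previous transaction)
       COALESCE(
           julianday(txn_date)
           - julianday(LAG(txn_date) OVER (
                 PARTITION BY person ORDER BY txn_date)) <= 3,
           0) AS recent
FROM transactions
ORDER BY person, txn_date
"""

rows = conn.execute(RECENT_SQL).fetchall()
for row in rows:
    print(row)
```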

Reuse logic to query data based on date filter

I have logic in place to pull records based on date. For example, I have to check whether a record that appeared in a given week also occurred in the next 14 days; if so, that record needs to be flagged. I have used a self join to get those records.
Now I have to pull records for 3 months and see whether each record appeared again. The logic is the same (within the next 14 days), so as it stands I would have to change the date filter in the query for every week to get the data. Is there a way to do it in a single query and get the full 3 months of data?
Let me know if more clarification is required.

storing data ranges - effective representation

I need to store a value for every day in a timeline, i.e. every user of the database should have a status assigned for every day, like this:
from 1.1.2000 to 28.05.2011 - status 1
from 29.05.2011 to 30.01.2012 - status 3
from 1.2.2012 to infinity - status 4
Each day should have only one status assigned, and the last status is open-ended (until another one is given). My question is: what is an effective representation in an SQL database? The obvious solution is to create a row for each change (with the last day the status is assigned in each range), like this:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
01.01.9999 status 4
This has many problems. If I wanted to add another range, say from 15.02.2012, I would need to alter the last row too:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
14.02.2012 status 4
01.01.9999 status 8
It also requires a lot of checking to make sure there are no overlaps or errors, especially if someone wants to modify ranges in the middle of the list. Inserting a new status from 29.01.2012 to 10.02.2012 is hard to implement (the date ranges of status 3 and status 4 would have to shrink accordingly to make space for the new status). Is there a better solution?
I thought about a completely different solution: storing each day's status in a separate row, so there would be a row for every day in the timeline. This would make updates easy: simply enter the new status for the rows with a date between start and end. Of course this would generate a large amount of needless data, so it's a bad solution, but it is coherent and easy to manage. I was wondering if there is something in between, but I guess not.
More context: I want a moderator to be able to assign a status freely to any dates, and to edit it if needed. Most often the moderator will be adding new status date ranges at the end. I don't really need the last status. After the moderator finishes editing a whole month, I need to generate a report based on the status on each day in that month. But at any time the moderator may want to edit data from months ago (which would be reflected in updated reports), and can set one status for, e.g., one year in advance.
You seem to want to use this table for two things - recording the current status and the history of status changes. You should separate the current status out and move it up to the parent (just like the registered date)
User
===============
Registered Date
Current Status
Status History
===============
Uptodate
Status
Your table structure should include the effective and end dates of the status period. This effectively "tiles" the statuses into groups that don't overlap. The last row should have a dummy end date (as you have above) or NULL. Using a value instead of NULL is useful if you have indexes on the end date.
With this structure, to get the status on any given date, you use the query:
select *
from t
where <date> between effdate and enddate
Adding a new status at the end of the period requires two changes:
Modify the row in the table with the enddate = 01/01/9999 to have an enddate of yesterday.
Insert a new row with the effdate of today and an enddate of 01/01/9999.
I would wrap this in a stored procedure.
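As a rough illustration of the two-step change and the point-in-time lookup, here is a sketch using Python's built-in `sqlite3` module in place of a stored procedure. The table name `t` and the columns `effdate` and `enddate` follow the queries above; the `user_id` column, the helper names `add_status` and `status_on`, and the sample rows are assumptions made for the example. Dates are stored as ISO text, and the new status takes effect on a date passed in rather than strictly "today".

```python
import sqlite3

OPEN_END = "9999-01-01"  # dummy end date marking the open-ended row

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (user_id INTEGER, effdate TEXT, enddate TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "2000-01-01", "2011-05-28", 1),
        (1, "2011-05-29", "2012-01-31", 3),
        (1, "2012-02-01", OPEN_END, 4),
    ],
)


def add_status(conn, user_id, effdate, status):
    """Close the open-ended row the day before effdate, then add the new row."""
    with conn:  # both statements commit together, or roll back on error
        conn.execute(
            "UPDATE t SET enddate = date(?, '-1 day') "
            "WHERE user_id = ? AND enddate = ?",
            (effdate, user_id, OPEN_END),
        )
        conn.execute(
            "INSERT INTO t VALUES (?, ?, ?, ?)",
            (user_id, effdate, OPEN_END, status),
        )


def status_on(conn, user_id, day):
    """Return the status in effect on the given day, or None if no tile covers it."""
    row = conn.execute(
        "SELECT status FROM t WHERE user_id = ? AND ? BETWEEN effdate AND enddate",
        (user_id, day),
    ).fetchone()
    return row[0] if row else None


add_status(conn, 1, "2012-02-15", 8)
print(status_on(conn, 1, "2012-02-10"))  # status before the change
print(status_on(conn, 1, "2012-02-20"))  # status after the change
```

Running both statements in one transaction keeps the tiles from overlapping or leaving a gap if the second statement fails.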
To change a status on one date in the past requires splitting one of the historical records in two. Multiple dates may require changing multiple records.
If you have a date range, you can get all tiles that overlap a given time period with the query:
select *
from t
where <periodstart> <= enddate and <periodend> >= effdate
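For a monthly report, the overlap query can also clip each tile to the reporting period. The following is a sketch only, again using SQLite from Python with ISO text dates; the sample rows and the parameter names `period_start` and `period_end` are made up for the example. It relies on SQLite's two-argument scalar `MAX` and `MIN`; other databases typically offer `GREATEST` and `LEAST` or need a `CASE` expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (effdate TEXT, enddate TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2011-05-29", "2012-01-31", 3),
        ("2012-02-01", "2012-02-14", 4),
        ("2012-02-15", "9999-01-01", 8),
    ],
)

REPORT_SQL = """
SELECT status,
       MAX(effdate, :period_start) AS from_date,  -- clip tile start to period
       MIN(enddate, :period_end)   AS to_date     -- clip tile end to period
FROM t
WHERE :period_start <= enddate AND :period_end >= effdate
ORDER BY from_date
"""

rows = conn.execute(
    REPORT_SQL, {"period_start": "2012-02-01", "period_end": "2012-02-29"}
).fetchall()
for row in rows:
    print(row)
```

For February 2012 this yields status 4 for the 1st to the 14th and status 8 for the 15th to the 29th; the status 3 tile does not overlap the period and is filtered out.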