SQL Ranking by consecutive date blocks

I'm trying to rank rows within blocks of consecutive dates. What is the best way to do this? In the example below, the first 3 rows are consecutive; the 4th starts a month after the 3rd ends, so the counting begins again.
Data I'm trying to order:
StartDate  | EndDate    | Rank
-----------+------------+-----
01/01/2016 | 01/02/2016 | 1
01/02/2016 | 01/03/2016 | 2
01/03/2016 | 01/04/2016 | 3
01/05/2016 | 01/06/2016 | 1

You can do this by identifying where a grouping begins, doing a cumulative sum to identify the group, and then a row number:
select t.*,
       row_number() over (partition by grp order by startdate) as rank
from (select t.*,
             -- a row with no predecessor ending on its start date begins a new group
             sum(case when tprev.startdate is null then 1 else 0 end) over (order by t.startdate) as grp
      from t left join
           t tprev
           on t.startdate = tprev.enddate
     ) t;
This particular SQL works for the data you have presented. It will not handle data that overlaps by more than one day, nor multiple records that start on the same day. These can be handled. If your data is more like that, then ask another question with appropriate data in it.
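As a quick check, the query can be run against the sample rows. The sketch below uses Python's built-in sqlite3 module purely as a convenient test harness (it assumes an SQLite build with window-function support, 3.25 or later). The table name t comes from the answer; the dates are rewritten as ISO strings on the assumption that the question uses dd/mm/yyyy, and the alias rnk is used instead of rank to avoid clashing with the function name in some databases.

```python
import sqlite3

# Sample rows from the question, as ISO dates (assumes dd/mm/yyyy in the original).
ROWS = [
    ("2016-01-01", "2016-02-01"),
    ("2016-02-01", "2016-03-01"),
    ("2016-03-01", "2016-04-01"),
    ("2016-05-01", "2016-06-01"),
]

QUERY = """
select startdate, enddate,
       row_number() over (partition by grp order by startdate) as rnk
from (select t.startdate, t.enddate,
             -- 1 marks a row that does not continue any earlier row
             sum(case when tprev.startdate is null then 1 else 0 end)
                 over (order by t.startdate) as grp
      from t left join
           t tprev
           on t.startdate = tprev.enddate
     ) x
order by startdate
"""


def rank_blocks(rows):
    """Load the rows into an in-memory table and return (start, end, rank) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (startdate text, enddate text)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in rank_blocks(ROWS):
    print(row)
```

The rank restarts at 1 on the fourth row because no other row ends on its start date.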

Related

Compare prior row using time values

I have this set of data
What I want to do is compare the Start time to the prior row and if the start time falls between the Start and end time of the prior row then flag it. Whether that flag is binary or x doesn't matter, just needs to be counted.
So that the new column calls out the instances where the start time of the current row is between the Start and End time of the prior row. My results should look like this.
My thoughts are that LAG and/or LEAD need to be used here but I'm horribly novice at both of those. I'm also thinking I need to create a ROW() for these to make it work. Either way, looking for some guidance on this. I need to be able to track conversation times to see how many times an individual is handling simultaneous conversations (usually no more than 2).
Assuming you have a primary key such as ID, as in the example below, you can do something like this:
WITH data
     AS (SELECT * FROM YOUR_TABLE),
     d1
     AS (SELECT d.*,
                Lead(start_date)
                  over (
                    ORDER BY id) lead_start_date
         FROM   data d)
-- marker = 1 on a row when the NEXT row starts inside it;
-- use Lag() on start_date and end_date instead to put the flag on the later row
SELECT id,
       start_date,
       end_date,
       CASE
         WHEN lead_start_date BETWEEN start_date AND end_date THEN 1
         ELSE 0
       END marker
FROM   d1;
One method is exists:
select t.*,
       (case when exists (select 1
                          from t t2
                          where t2.starttime < t.starttime and  -- strictly earlier, so a row never matches itself
                                t2.endtime >= t.starttime
                         )
             then 1 else 0
        end) as dual_convo
from t;
If I understand correctly, I think you can also use a cumulative maximum:
select t.*,
       (case when max(endtime) over (order by starttime, endtime
                                     rows between unbounded preceding and 1 preceding
                                    ) > starttime
             then 1 else 0
        end) as dual_convo
from t;
Your data only has examples where the previous row overlaps. But presumably you could have overlaps on earlier rows, such as:
1 9
2 3
4 5
8 12
Every row but the first overlaps an earlier row (the first one, 1 to 9), but only the second row overlaps its immediately "previous" row.
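To make the difference concrete, here is a small sketch that runs both checks over those four rows: the cumulative maximum (overlap with any earlier row) and a plain lag() (overlap with the immediately previous row only). It uses Python's sqlite3 module as a test harness and assumes an SQLite build with window functions; the integer start and end values and the two output column names are illustrative.

```python
import sqlite3

# The four (start, end) rows from the example above.
ROWS = [(1, 9), (2, 3), (4, 5), (8, 12)]

QUERY = """
select starttime, endtime,
       -- latest end among ALL earlier rows
       case when max(endtime) over (order by starttime, endtime
                                    rows between unbounded preceding and 1 preceding
                                   ) > starttime
            then 1 else 0 end as overlaps_any_earlier,
       -- end of the immediately previous row only
       case when lag(endtime) over (order by starttime, endtime) > starttime
            then 1 else 0 end as overlaps_previous
from t
order by starttime, endtime
"""


def flag_overlaps(rows):
    """Return (start, end, overlaps_any_earlier, overlaps_previous) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (starttime integer, endtime integer)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in flag_overlaps(ROWS):
    print(row)
```

Rows 4-5 and 8-12 are caught only by the cumulative maximum, since the long first row is no longer their direct predecessor.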

A cumulative sum of consecutive workdays that resets to 1 when consecutive days = 0, per ID

I have 3 columns:
Employee ID(numerical)
Day of work(a date yyyy-mm-dd when employee had a shift)
is_consecutive_work_day (1 if days of work are consecutive, else 0)
I need a 4th: Consecutive_work_days (a cumulative sum of is_consecutive_work_day, which resets to 1 when is_consecutive_work_day = 0). So this will go to a maximum of 5 for any employee id. Some will have 1,2,3 others 1,2...etc.
What I am failing to figure out is how to write the 4th column (consecutive_work_days). Not how to write a cumulative sum per employee id, but specifically how to reset it to 1 when is_consecutive_work_day = 0 per employee id.
May I ask for your help regarding this 4th column please? Thanks.
You can use window functions. lag() lets you access the previous day_of_work for the same employee, which you can compare to the current day_of_work: if there is a one day difference, then you can set is_consecutive_work_day to 1.
select
    employee_id,
    day_of_work,
    case
        when day_of_work
            = lag(day_of_work) over(partition by employee_id order by day_of_work)
              + interval 1 day
        then 1
        else 0
    end is_consecutive_work_day
from mytable
To compute the cumulative sum, it is a bit more complicated. We can use a gaps-and-islands technique to put each record in the group it belongs to: basically, every time an is_consecutive_work_day of 0 is met, a new group starts; we can then do a window sum() over each group. Since the first row of each group has a flag of 0, we add 1 so that the count restarts at 1 rather than 0:
select
    employee_id,
    day_of_work,
    is_consecutive_work_day,
    -- 1 for the first day of a streak, plus one for each consecutive day after it
    1 + sum(is_consecutive_work_day)
        over(partition by employee_id, grp order by day_of_work)
        consecutive_work_days
from (
    select
        t.*,
        -- increments on every row flagged 0, i.e. at the start of each streak
        sum(1 - is_consecutive_work_day) over(partition by employee_id order by day_of_work) grp
    from (
        select
            t.*,
            case
                when day_of_work
                    = lag(day_of_work) over(partition by employee_id order by day_of_work)
                      + interval 1 day
                then 1
                else 0
            end is_consecutive_work_day
        from mytable t
    ) t
) t
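The same three steps (flag, group, count) can be tried out on a few rows. The sketch below is an SQLite adaptation run through Python's sqlite3 module, so the date arithmetic uses SQLite's date(x, '+1 day') instead of interval 1 day, and it counts with row_number() so that each streak starts at 1 as the question asks. The sample rows and the CTE names flagged and grouped are made up for illustration; window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical shifts: employee 1 works three days in a row, skips one, then works two more.
ROWS = [
    (1, "2021-03-01"), (1, "2021-03-02"), (1, "2021-03-03"),
    (1, "2021-03-05"), (1, "2021-03-06"),
    (2, "2021-03-01"), (2, "2021-03-04"),
]

QUERY = """
with flagged as (
    select employee_id, day_of_work,
           case when day_of_work = date(lag(day_of_work) over (partition by employee_id
                                                               order by day_of_work), '+1 day')
                then 1 else 0 end as is_consecutive_work_day
    from mytable
),
grouped as (
    select flagged.*,
           -- running count of streak starts = streak id
           sum(1 - is_consecutive_work_day)
               over (partition by employee_id order by day_of_work) as grp
    from flagged
)
select employee_id, day_of_work, is_consecutive_work_day,
       row_number() over (partition by employee_id, grp
                          order by day_of_work) as consecutive_work_days
from grouped
order by employee_id, day_of_work
"""


def consecutive_days(rows):
    """Return (employee_id, day, is_consecutive, consecutive_work_days) tuples."""
    con = sqlite3.connect(":memory:")
    con.execute("create table mytable (employee_id integer, day_of_work text)")
    con.executemany("insert into mytable values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in consecutive_days(ROWS):
    print(row)
```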
Although this seems like a gaps-and-islands problem, there is a simpler solution. Simply find the latest day so far whose is_consecutive_work_day is 0 and take the date difference.
The only caveat is if there is none.
That would be:
select t.*,
       datediff(day_of_work,
                coalesce(max(case when is_consecutive_work_day = 0 then day_of_work end)
                             over (partition by employee_id order by day_of_work),
                         date_add(min(day_of_work) over (partition by employee_id), 1)
                        )
               ) as fourth_column
from t;

Creating one record for a continuous sequence of dates to a new table

We have a table in Microsoft SQL Server 2014, shown below, which has Id, LogId, AccountId, StateCode, Number and LastSentDate columns.
Our goal is to move the data to a new table. When we move it, we need to keep the first and last record of each series. In our data the LastSentDate starts from 5/1 and continues until 5/5, so we should create a new row as shown below (we set FirstSentDate to 5/1 and the log id to the first one that appeared, 28369; since the series ended on 5/5, we set LastSentDate to 5/5 and LastSentLogId to 28752).
if there are some dates with the difference in time, the desired output will be
Since our date series continues the last row in the new table will be
We were trying to group by date and achieve this
WITH t
AS (SELECT LastSentDate d,
ROW_NUMBER() OVER(
ORDER BY LastSentDate) i
FROM [dbo].[RegistrationActivity]
GROUP BY LastSentDate)
SELECT MIN(d),
MAX(d)
FROM t
GROUP BY DATEDIFF(day, i, d);
Use lag() to define where a group begins. Then use a cumulative sum to assign a group id to each group. And finally, extract the data you want. I'm not sure what data you actually want, but here is the idea:
select accountid, min(lastsentdate), max(lastsentdate)
from (select t.*,
             -- a new group starts when the previous date is more than one day earlier
             -- (cast both values to date first if the time of day varies)
             sum(case when prev_lsd >= dateadd(day, -1, lastsentdate) then 0 else 1 end) over (partition by accountid order by lastsentdate) as grp
      from (select t.*, lag(lastsentdate) over (partition by accountid order by lastsentdate) as prev_lsd
            from t
           ) t
     ) t
group by accountid, grp;
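Here is a small runnable sketch of that idea: lag() finds the previous date, a cumulative sum numbers the runs, and the final group by collapses each run to its first and last date. It is an SQLite adaptation driven from Python's sqlite3 module, not SQL Server syntax, so date(x, '+1 day') stands in for dateadd(). The rows, the year 2017 and the times of day are invented for illustration, and the comparison is done on the date part only so that differing times do not break a run.

```python
import sqlite3

# Hypothetical rows: one account, sent on 5/1 through 5/5, then again on 5/8 and 5/9.
ROWS = [
    (1, "2017-05-01 08:00:00"), (1, "2017-05-02 09:30:00"),
    (1, "2017-05-03 08:15:00"), (1, "2017-05-04 08:00:00"),
    (1, "2017-05-05 10:45:00"),
    (1, "2017-05-08 08:00:00"), (1, "2017-05-09 08:05:00"),
]

QUERY = """
with marked as (
    select t.*,
           lag(lastsentdate) over (partition by accountid order by lastsentdate) as prev_lsd
    from t
),
grouped as (
    select marked.*,
           -- 0 continues the current run, 1 starts a new one (also when prev_lsd is null)
           sum(case when date(prev_lsd, '+1 day') >= date(lastsentdate) then 0 else 1 end)
               over (partition by accountid order by lastsentdate) as grp
    from marked
)
select accountid, min(lastsentdate) as first_sent, max(lastsentdate) as last_sent
from grouped
group by accountid, grp
order by accountid, first_sent
"""


def collapse_runs(rows):
    """Return one (accountid, first_sent, last_sent) tuple per run of consecutive days."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (accountid integer, lastsentdate text)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in collapse_runs(ROWS):
    print(row)
```

Carrying the first and last LogId along would need an extra step (for example joining back on the first and last dates), which is left out here.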

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month so I only want to count the first entry, not the 3 day retries, if they exist. My data set looks like this
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies. Just help around how to approach this problem as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
       count(*) unique_fails
       FROM (SELECT month(fail_date) month,
                    CASE
                      WHEN datediff(day,
                                    lag(fail_date) OVER (PARTITION BY user_id
                                                         ORDER BY fail_date),
                                    fail_date) <= 3 THEN
                        1
                      ELSE
                        0
                    END follow_up
                    FROM elbat) x
       WHERE follow_up = 0
       GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know if fail_date actually is some date/time type or just a string. If it's just a string, the date/time-specific functions may not work on it and have to be replaced, or the string has to be converted before passing it to the functions.
If the data spans several years, you might also want to include the year in addition to the month to keep months from different years apart. In the inner SELECT add a column year(fail_date) year, and add year to the list of columns and to the GROUP BY of the outer SELECT.
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
       (case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
             then 0 else 1
        end) as first_failure_flag
from t;
The comparison is >= so that a retry exactly 3 days after the previous failure counts as a retry. Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
       sum(first_failure_flag)
from (select t.*,
             (case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
                   then 0 else 1
              end) as first_failure_flag
      from t
     ) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
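To see the flag-then-sum approach produce the desired output, here is a sketch that runs it on the question's six rows. It is an SQLite adaptation driven from Python's sqlite3 module (not Vertica syntax): date(x, '+3 days') replaces the date subtraction and strftime('%Y-%m', ...) replaces to_char. The year 2018 is an assumption, since the question gives only month and day, and the column alias ym is illustrative.

```python
import sqlite3

# Rows from the question; the year 2018 is assumed.
ROWS = [
    (222, "2018-01-01"), (222, "2018-01-04"),
    (555, "2018-02-15"),
    (777, "2018-03-31"), (777, "2018-04-02"),
    (222, "2018-10-11"),
]

QUERY = """
select strftime('%Y-%m', fail_date) as ym,
       sum(first_failure_flag) as unique_fails
from (select t.*,
             -- 0 when the previous failure for this user is at most 3 days earlier
             case when date(lag(fail_date) over (partition by user_id order by fail_date),
                            '+3 days') >= fail_date
                  then 0 else 1 end as first_failure_flag
      from t
     ) x
group by strftime('%Y-%m', fail_date)
order by ym
"""


def unique_fails_by_month(rows):
    """Return (year-month, unique_fails) tuples, one per month that has any row."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (user_id integer, fail_date text)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in unique_fails_by_month(ROWS):
    print(row)
```

April shows up with a count of 0 because its only row is a retry; months with no rows at all do not appear.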
In a derived table, determine the previous fail_date (prev_fail_date) for each user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure if there is no previous failure, or if the number of days between the current fail_date and prev_fail_date is greater than 3.
The DateDiff() function, together with the If() function, is used to pick out the rows that are not repeated tries.
To group this result by month, you can use the MONTH function.
But the data can span multiple years, so you need to separate them by year as well; you can do a multi-level group by, using the YEAR function too.
Try the following (in MySQL); you can adapt the idea to other RDBMSs as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
       MONTH(dt.fail_date) AS month_fail_date,
       -- a row with no previous failure is a first failure too
       COUNT( IF(dt.prev_fail_date IS NULL
                 OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, dt.user_id, NULL) ) AS unique_fails
FROM (
       SELECT
         t1.user_id,
         t1.fail_date,
         (
           SELECT t2.fail_date
           FROM your_table AS t2
           WHERE t2.user_id = t1.user_id
             AND t2.fail_date < t1.fail_date
           ORDER BY t2.fail_date DESC
           LIMIT 1
         ) AS prev_fail_date
       FROM your_table AS t1
     ) AS dt
GROUP BY
  year_fail_date,
  month_fail_date
ORDER BY
  year_fail_date ASC,
  month_fail_date ASC

How to take only one entry from a table based on an offset to a date column value

I have a requirement to get values from a table based on an offset condition on a date column.
For example, in the table attached below, if any dates in the effectivedate column fall within 15 days of each other, I should return only the first one.
So my expected result would be as below:
Here, for policy A1234, it returns the 6/18/16 entry and skips the 6/12/16 entry, as the two dates are within 15 days of each other and I took the latest one from the list.
If you want to group rows together that are within 15 days of each other, then you have a variant of the gaps-and-islands problem. I would recommend lag() and cumulative sum for this version:
select polno, min(effectivedate), max(expirationdate)
from (select t.*,
             -- 0 keeps the row in the current group, 1 starts a new group
             sum(case when prev_ed >= dateadd(day, -15, effectivedate)
                      then 0 else 1
                 end) over (partition by polno order by effectivedate) as grp
      from (select t.*,
                   -- previous row's expiration date; use lag(effectivedate) to compare effective dates instead
                   lag(expirationdate) over (partition by polno order by effectivedate) as prev_ed
            from t
           ) t
     ) t
group by polno, grp;
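A runnable sketch of this grouping, under a few assumptions: it compares each row's effectivedate with the previous row's effectivedate (the reading the question seems to want), it is written for SQLite and run through Python's sqlite3 module, so date(x, '-15 days') stands in for dateadd(), and the rows for policy B5678 are invented. Only the two A1234 dates come from the question.

```python
import sqlite3

# A1234 dates are from the question; B5678 is a made-up policy with dates far apart.
ROWS = [
    ("A1234", "2016-06-12"), ("A1234", "2016-06-18"),
    ("B5678", "2016-01-01"), ("B5678", "2016-03-01"),
]

QUERY = """
with marked as (
    select t.*,
           lag(effectivedate) over (partition by polno order by effectivedate) as prev_eff
    from t
),
grouped as (
    select marked.*,
           -- 0 when the previous effective date is within 15 days, otherwise start a new group
           sum(case when prev_eff >= date(effectivedate, '-15 days') then 0 else 1 end)
               over (partition by polno order by effectivedate) as grp
    from marked
)
select polno,
       min(effectivedate) as first_effective,
       max(effectivedate) as last_effective,
       count(*) as rows_in_group
from grouped
group by polno, grp
order by polno, first_effective
"""


def group_close_dates(rows):
    """Return (polno, first_effective, last_effective, rows_in_group) per group."""
    con = sqlite3.connect(":memory:")
    con.execute("create table t (polno text, effectivedate text)")
    con.executemany("insert into t values (?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in group_close_dates(ROWS):
    print(row)
```

Selecting first_effective keeps the earliest entry of each group and last_effective the latest, so either reading of the requirement can be served from the same grouping.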