Compare prior row using time values - sql

I have this set of data
What I want to do is compare the Start time to the prior row and if the start time falls between the Start and end time of the prior row then flag it. Whether that flag is binary or x doesn't matter, just needs to be counted.
The new column should call out the instances where the start time of the current row is between the Start and End time of the prior row. My results should look like this.
My thoughts are that LAG and/or LEAD need to be used here but I'm horribly novice at both of those. I'm also thinking I need to create a ROW() for these to make it work. Either way, looking for some guidance on this. I need to be able to track conversation times to see how many times an individual is handling simultaneous conversations (usually no more than 2).

Assuming you have a primary key such as id to order by, you can do something like the following. LAG pulls the prior row's start and end onto the current row so they can be compared:
WITH data
AS (SELECT * FROM YOUR_TABLE),
d1
AS (SELECT d.*,
Lag(start_date)
over (
ORDER BY id) prev_start_date,
Lag(end_date)
over (
ORDER BY id) prev_end_date
FROM data d)
SELECT id,
start_date,
end_date,
CASE
WHEN start_date BETWEEN prev_start_date AND prev_end_date THEN 1
ELSE 0
END marker
FROM d1;

One method is exists:
select t.*,
(case when exists (select 1
from t t2
where t2.starttime < t.starttime and -- strict, so a row never matches itself
t2.endtime >= t.starttime
)
then 1 else 0
end) as dual_convo
from t;
If I understand correctly, I think you can also use a cumulative maximum:
select t.*,
(case when max(endtime) over (order by starttime, endtime
rows between unbounded preceding and 1 preceding
) > starttime
then 1 else 0
end) as dual_convo
from t;
Your data only has examples where the previous row overlaps. But presumably you could have overlaps on earlier rows, such as:
1 9
2 3
4 5
8 12
Every row after the first overlaps an earlier row, but only the second overlaps the row immediately "previous" to it.
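To see the difference between checking only the previous row and checking every earlier row, here is a minimal sketch that runs both checks on the four example rows above. It uses Python's built-in sqlite3 module purely as a convenient test harness (an assumption, not the asker's database), needs SQLite 3.25 or later for window functions, and treats the times as plain integers in a hypothetical table t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (starttime INTEGER, endtime INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 9), (2, 3), (4, 5), (8, 12)])

query = """
SELECT starttime,
       endtime,
       -- looks at the immediately preceding row only
       CASE WHEN LAG(endtime) OVER (ORDER BY starttime, endtime) > starttime
            THEN 1 ELSE 0 END AS overlaps_previous,
       -- looks at the latest end time among all earlier rows
       CASE WHEN MAX(endtime) OVER (ORDER BY starttime, endtime
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) > starttime
            THEN 1 ELSE 0 END AS overlaps_any_earlier
FROM t
ORDER BY starttime, endtime
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only the row (2, 3) is flagged by the LAG check, while the cumulative maximum flags all three rows that start before the first row has ended.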

Related

How can I reference column values from previous rows in BigQuery SQL, in order to perform operations or calculations?

I have sorted my data by start time, and I want to create a new field that rolls up rows whose start time overlaps the previous row's start and end times.
More specifically, I want to write logic that, for a given record X, if the start time is somewhere between the start and end time of the previous row, I want to give record X the same value for the new field as that previous row. If the start time happens after the end time of the previous row, it would get a new value for the new field.
Is something like this possible in BigQuery SQL? Was thinking maybe lag or window function, but not quite sure. Below are examples of what the base table looks like and what I want for the final table.
Any insight appreciated!
Below is for BigQuery Standard SQL
#standardSQL
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 recordID, TIME '12:35:00' startTime, TIME '12:50:00' endTime UNION ALL
SELECT 2, '12:46:00', '12:59:00' UNION ALL
SELECT 3, '14:27:00', '16:05:00' UNION ALL
SELECT 4, '15:48:00', '16:35:00' UNION ALL
SELECT 5, '16:18:00', '17:04:00'
)
SELECT recordID, startTime, endTime,
COUNTIF(newRange) OVER(ORDER BY startTime) AS newRecordID
FROM (
SELECT *,
startTime >= MAX(endTime) OVER(ORDER BY startTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS newRange
FROM `project.dataset.table`
)
-- ORDER BY startTime
with the result:
Row recordID startTime endTime newRecordID
1 1 12:35:00 12:50:00 0
2 2 12:46:00 12:59:00 0
3 3 14:27:00 16:05:00 1
4 4 15:48:00 16:35:00 1
5 5 16:18:00 17:04:00 1
This is a gaps-and-islands problem. What you want to do is assign a group id to non-intersecting groups. You can calculate the non-intersections using window functions.
A record starts a new group if the cumulative maximum value of the end time, ordered by start time and ending at the previous record, is less than the current start time. The rest is just a cumulative sum to assign a group id.
For your data:
select t.*,
sum(case when prev_endtime >= starttime then 0 else 1 end) over (order by starttime) as group_id
from (select t.*,
max(endtime) over (order by starttime rows between unbounded preceding and 1 preceding) as prev_endtime
from t
) t;
The only potential issue is if two records start at exactly the same time. If this can happen, the logic might need to be slightly more complex.
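As a quick check of the group-id idea, the sketch below runs it on the five sample rows from the question. It is only an illustration under some assumptions: Python's built-in sqlite3 module stands in for BigQuery, the table is a hypothetical t, and the times are stored as 'HH:MM:SS' text, which compares correctly as strings. A row opens a new group when the running maximum end time of all earlier rows is earlier than its own start time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (recordID INTEGER, startTime TEXT, endTime TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, "12:35:00", "12:50:00"),
    (2, "12:46:00", "12:59:00"),
    (3, "14:27:00", "16:05:00"),
    (4, "15:48:00", "16:35:00"),
    (5, "16:18:00", "17:04:00"),
])

query = """
SELECT recordID,
       -- cumulative sum of "new group" markers gives the group id
       SUM(CASE WHEN prev_endtime >= startTime THEN 0 ELSE 1 END)
           OVER (ORDER BY startTime) AS group_id
FROM (SELECT t.*,
             -- latest end time among all earlier rows (NULL for the first row)
             MAX(endTime) OVER (ORDER BY startTime
                                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_endtime
      FROM t) AS s
ORDER BY startTime
"""
groups = conn.execute(query).fetchall()
for row in groups:
    print(row)
```

Records 1 and 2 land in group 1 and records 3 to 5 in group 2, which is the same grouping as the 0/1 numbering in the BigQuery result, just starting from 1.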

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month so I only want to count the first entry, not the 3 day retries, if they exist. My data set looks like this
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies. Just help around how to approach this problem as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
count(*) unique_fails
FROM (SELECT month(fail_date) month,
CASE
WHEN datediff(day,
lag(fail_date) OVER (PARTITION BY user_id
ORDER BY fail_date),
fail_date) <= 3 THEN
1
ELSE
0
END follow_up
FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know whether fail_date actually is some date/time type or just a string. If it's just a string, the date/time-specific functions may not work on it and have to be replaced, or the string has to be converted before passing it to the functions.
If the data spans several years, you might also want to include the year in addition to the month to keep months from different years apart. In the inner SELECT add a column year(fail_date) year, and add year to the column list and the GROUP BY of the outer SELECT.
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for a specific user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure if there is no previous failure or if the number of days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function together with the IF() function is used to pick out the cases that are not repeated tries.
To group this result by month, you can use the MONTH function.
The data can come from multiple years, though, so you need to separate the years as well; you can do a multi-level GROUP BY using the YEAR function too.
Try the following (in MySQL); you can get the idea for other RDBMS as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
COUNT( IF(dt.prev_fail_date IS NULL OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, dt.user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC
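To check the lag-and-flag approach against the sample data in the question, here is a minimal sketch. It makes a few assumptions that are not in the original: Python's built-in sqlite3 module is used instead of Vertica or MySQL, the table is a hypothetical fails, and the year 2018 is invented because the sample shows only month and day. A failure counts as a retry when the same user's previous failure is at most three days earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fails (user_id INTEGER, fail_date TEXT)")
conn.executemany("INSERT INTO fails VALUES (?, ?)", [
    (222, "2018-01-01"), (222, "2018-01-04"), (555, "2018-02-15"),
    (777, "2018-03-31"), (777, "2018-04-02"), (222, "2018-10-11"),
])

query = """
SELECT strftime('%Y-%m', fail_date) AS year_month,
       SUM(first_failure)           AS unique_fails
FROM (SELECT fail_date,
             -- 0 for a retry (previous failure within 3 days), 1 otherwise;
             -- a user's first row has no previous failure, so it falls to ELSE
             CASE WHEN julianday(fail_date)
                       - julianday(LAG(fail_date) OVER (PARTITION BY user_id
                                                        ORDER BY fail_date)) <= 3
                  THEN 0 ELSE 1 END AS first_failure
      FROM fails) AS flagged
GROUP BY year_month
ORDER BY year_month
"""
monthly = conn.execute(query).fetchall()
for row in monthly:
    print(row)
```

Summing the flag instead of filtering the retries out keeps April in the output with a count of 0, which matches the desired result in the question.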

How to take only one entry from a table based on an offset to a date column value

I have a requirement to get values from a table based on an offset condition on a date column.
For example, in the table attached below, if any dates in the effectivedate column fall within 15 days of each other, I should return only the first one.
So my expected result would be as below:
Here for A1234 policy, it returns 6/18/16 entry and skipped 6/12/16 entry as the offset between these 2 dates is within 15 days and I took the latest one from the list.
If you want to group rows together that are within 15 days of each other, then you have a variant of the gaps-and-islands problem. I would recommend lag() and cumulative sum for this version:
select polno, min(effectivedate), max(expirationdate)
from (select t.*,
sum(case when prev_ed >= dateadd(day, -15, effectivedate)
then 0 else 1 -- a new group starts only when the gap exceeds 15 days
end) over (partition by polno order by effectivedate) as grp
from (select t.*,
lag(expirationdate) over (partition by polno order by effectivedate) as prev_ed
from t
) t
) t
group by polno, grp;

SQL Ranking by consecutive date blocks

I'm trying to rank consecutive date blocks, but what is the best way to do this? The example below shows the first 3 blocks being consecutive; the 4th has a one-month gap before it, so the counting would begin again.
Data I'm trying to order:
StartDate | EndDate |Rank
----------+-----------+----
01/01/2016| 01/02/2016| 1
01/02/2016| 01/03/2016| 2
01/03/2016| 01/04/2016| 3
01/05/2016| 01/06/2016| 1
You can do this by identifying where a grouping begins, doing a cumulative sum to identify the group, and then a row number:
select t.*,
row_number() over (partition by grp order by startdate) as rank
from (select t.*,
sum(case when tprev.startdate is null then 1 else 0 end) over (order by t.startdate) as grp
from t left join
t tprev
on t.startdate = tprev.enddate
) t;
This particular SQL works for the data you have presented. It will not handle data that overlaps by more than one day, nor multiple records that start on the same day. These can be handled. If your data is more like that, then ask another question with appropriate data in it.
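The self-join approach can be tried on the four rows from the question with the sketch below. The assumptions here are mine: Python's built-in sqlite3 module is the test harness, the table is a hypothetical t, and the dates are read as dd/mm/yyyy and stored as ISO text. A row starts a new group when no other row ends on its start date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (startdate TEXT, enddate TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2016-01-01", "2016-02-01"),
    ("2016-02-01", "2016-03-01"),
    ("2016-03-01", "2016-04-01"),
    ("2016-05-01", "2016-06-01"),
])

query = """
SELECT startdate,
       enddate,
       ROW_NUMBER() OVER (PARTITION BY grp ORDER BY startdate) AS rnk
FROM (SELECT t.*,
             -- no matching predecessor means this row opens a new block
             SUM(CASE WHEN tprev.startdate IS NULL THEN 1 ELSE 0 END)
                 OVER (ORDER BY t.startdate) AS grp
      FROM t
      LEFT JOIN t AS tprev ON t.startdate = tprev.enddate) AS g
ORDER BY startdate
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

The first three rows are numbered 1, 2, 3 and the fourth restarts at 1.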

Count over rows in previous time range partitioned by a specific column

My dataset consists of daily (actually business days) timeseries for different companies from different industries and I work with PostgreSQL. I have an indicator variable in my dataset taking values 1, -1 and most of the times 0. For better readability of the question I refer to days where the indicator variable is unequal to zero as indicator event.
So for all indicator events that are preceded by another indicator event for the same industry in the previous three business days, the indicator variable shall be updated to zero.
We can think of the following example dataset:
day company industry indicator
2012-01-12 A financial 1
2012-01-12 B consumer 0
2012-01-13 A financial 1
2012-01-13 B consumer -1
2012-01-16 A financial 0
2012-01-16 B consumer 0
2012-01-17 A financial 0
2012-01-17 B consumer 0
2012-01-17 C consumer 0
2012-01-18 A financial 0
2012-01-18 B consumer 0
2012-01-18 C consumer 1
So the indicator values that shall be updated to zero are on 2012-01-13 the entry for company A, and on 2012-01-18 the entry for company C, because they are preceded by another indicator event in the same industry within 3 business days.
I tried to accomplish it in the following way:
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
COUNT(CASE WHEN indicator <> 0 THEN 1 END)
OVER (PARTITION BY industry ORDER BY day
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) As cnt
FROM test
) alias
WHERE cnt >= 2)
My idea was to count the indicator events for the current day and the 3 preceding days partitioned by industry. If it counts more than 1, it updates the indicator value to zero.
The weak spot is, that so far it counts over the three preceding rows (partitioned by industry) instead of the three preceding business days. So in the example data, it is not able to update company C on 2012-01-18, because it counts over the last three rows where industry = consumer instead of counting over all rows where industry=consumer for the last three business days.
I tried different methods, like adding another subquery in the third-last line of the code or adding a WHERE EXISTS clause after the third-last line, to ensure that the code counts over the three preceding dates. But nothing worked. I really can't figure out how to do that (I am just learning to work with PostgreSQL).
Do you have any ideas how to fix it?
Or maybe I am thinking in a completely wrong direction and you know another approach how to solve my problem?
DB design
First off, your table should be normalized. industry should be a small foreign key column (typically integer) referencing industry_id of an industry table. Maybe you have that already and only simplified for the sake of the question. Your actual table definition would go a long way.
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
This assumes that indicator is defined NOT NULL. If industry were an integer, this index would be perfectly efficient.
Query
This query identifies rows to be reset:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
SELECT industry, day
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('ISODOW' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) sub
WHERE incident
AND ct > 1
ORDER BY 1, 2;
ISODOW as the extract() parameter is a convenient way to filter out weekends.
Integrate this in your UPDATE:
WITH x AS ( -- only with indicator
SELECT DISTINCT industry, day
FROM tbl t
WHERE indicator <> 0
)
UPDATE tbl t
SET indicator = 0
FROM (
SELECT i.industry, d.day, x.day IS NOT NULL AS incident
, count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
FROM (
SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
FROM (
SELECT generate_series(min(day), max(day), interval '1d')::date AS day
FROM x
) d
WHERE extract('isodow' FROM d.day) < 6
) d
CROSS JOIN (SELECT DISTINCT industry FROM x) i
LEFT JOIN x USING (industry, day)
) u
WHERE u.incident
AND u.ct > 1
AND t.industry = u.industry
AND t.day = u.day;
This should be substantially faster than your solution with correlated subqueries and a function call for every row. Even if that's based on my own previous answer, it's not perfect for this case.
In the meantime I found one possible solution myself (I hope that this isn't against the etiquette of the forum).
Please note that this is only one possible solution. You are very welcome to comment on it or to develop improvements if you want to.
For the first part, the function addbusinessdays, which can add (or subtract) business days to a given date, I am referring to:
http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/
(I just slightly modified it because I don't care for holidays, just for weekends)
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
RETURNS date AS
$BODY$
with alldates as (
SELECT i,
$1 + (i * case when $2 < 0 then -1 else 1 end) AS date
FROM generate_series(0,(abs($2) + 5)*2) i
),
days as (
select i, date, extract('dow' from date) as dow
from alldates
),
businessdays as (
select i, date, d.dow from days d
where d.dow between 1 and 5
order by i
)
select date from businessdays where
case when $2 > 0 then date >=$1 when $2 < 0 then date <=$1 else date =$1 end
limit 1
offset abs($2)
$BODY$
LANGUAGE 'sql' VOLATILE
COST 100;
ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;
For the second part, I am referring to this related question, where I am applying Erwin Brandstetter's correlated subquery approach: Window Functions or Common Table Expressions: count previous rows within range
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
SELECT day, industry
FROM (
SELECT industry, day,
(SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END)
FROM test t1
WHERE t1.industry = t.industry
AND t1.day between addbusinessdays(t.day,-3) and t.day) As cnt
FROM test t
) alias
WHERE cnt >= 2)
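The rule that both answers implement can be sanity-checked outside the database. The sketch below is plain Python rather than PostgreSQL, with a hypothetical helper subtract_business_days that ignores holidays, just like the modified addbusinessdays above. It marks an indicator event for reset when the same industry has another event on one of the three preceding business days; events on the same day are deliberately not counted against each other, which is an assumption on my part.

```python
from datetime import date, timedelta

def subtract_business_days(day, n):
    """Step back n business days (Monday to Friday), ignoring holidays."""
    while n > 0:
        day -= timedelta(days=1)
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            n -= 1
    return day

# (day, company, industry, indicator), copied from the example dataset
rows = [
    (date(2012, 1, 12), "A", "financial", 1),
    (date(2012, 1, 12), "B", "consumer", 0),
    (date(2012, 1, 13), "A", "financial", 1),
    (date(2012, 1, 13), "B", "consumer", -1),
    (date(2012, 1, 16), "A", "financial", 0),
    (date(2012, 1, 16), "B", "consumer", 0),
    (date(2012, 1, 17), "A", "financial", 0),
    (date(2012, 1, 17), "B", "consumer", 0),
    (date(2012, 1, 17), "C", "consumer", 0),
    (date(2012, 1, 18), "A", "financial", 0),
    (date(2012, 1, 18), "B", "consumer", 0),
    (date(2012, 1, 18), "C", "consumer", 1),
]

# Days with an indicator event, per industry
events = {}
for day, company, industry, indicator in rows:
    if indicator != 0:
        events.setdefault(industry, set()).add(day)

# An event is reset if an earlier event of the same industry falls
# within the three preceding business days
to_reset = sorted(
    (day, company)
    for day, company, industry, indicator in rows
    if indicator != 0
    and any(subtract_business_days(day, 3) <= other < day
            for other in events[industry])
)
for item in to_reset:
    print(item)
```

For the example data this selects company A on 2012-01-13 and company C on 2012-01-18, the two rows the question expects to be set to zero.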