Grouping rows based on a consecutive flag in SQL (Redshift) - sql

I've got a tricky problem that I am trying to solve here and can't get my head around it so far.
So the problem is this: I have tracking data, where there are records produced over time. Let's say you have a robot driving around and you record its position once every second. Each of those positions is recorded as one record in the database (we use AWS Redshift).
Each record has a tracking_id which is unique across all records that belong to the same source of the tracking, i.e. unique for the robot. Then I have a record_id which is globally unique, a timestamp, and a flag that indicates if the record was created while the robot was inside or outside a defined zone. And then there is some additional data like coordinates.
Here is a little illustration. The pink box is the zone, the green line is the path of the robot and the blue dots are the produced records.
So now I would like to group records based on the zone flag (have a look at the screenshot below). I want to collapse each sub-path inside the zone into a single record and grab the start and end timestamp and position. The IDs don't matter, so I don't necessarily need to keep the tracking or record IDs even though I listed them in the desired result.
Thanks for the help, I would really appreciate it! Also just solving part of the problem like how to group based on the flag without grabbing first and last values within the sub-paths would help already.

This is a gaps and islands problem. In this case, you want the islands where in_zone happens to be TRUE (and there are two of them). We can use the difference in row number method here:
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY tracking_id ORDER BY timestamp) rn1,
           ROW_NUMBER() OVER (PARTITION BY tracking_id, in_zone ORDER BY timestamp) rn2
    FROM yourTable
)
SELECT
    tracking_id,
    MIN(record_id) AS record_id,
    MIN(timestamp) AS start_timestamp,
    MAX(timestamp) AS end_timestamp,
    (SELECT t2.coordinates FROM yourTable t2
     WHERE t2.record_id = MIN(t1.record_id) AND t2.tracking_id = t1.tracking_id) AS entry_coordinates,
    (SELECT t2.coordinates FROM yourTable t2
     WHERE t2.record_id = MAX(t1.record_id) AND t2.tracking_id = t1.tracking_id) AS exit_coordinates
FROM cte t1
WHERE
    in_zone = 'TRUE'
GROUP BY
    tracking_id,
    rn1 - rn2,
    in_zone
ORDER BY
    tracking_id,
    record_id DESC;
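To see the difference-of-row-numbers trick in action, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Redshift (SQLite gained window functions in 3.25). The table name, columns, and sample rows are invented for the demo, with `in_zone` encoded as 0/1:

```python
import sqlite3

# sqlite3 stands in for Redshift here; the data is made up, with
# in-zone islands at ts 2-3 and ts 6-7.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (tracking_id TEXT, ts INTEGER, in_zone INTEGER)")
con.executemany("INSERT INTO records VALUES (?, ?, ?)",
                [("r1", t, z) for t, z in
                 [(1, 0), (2, 1), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1), (8, 0)]])

islands = con.execute("""
    WITH cte AS (
        SELECT tracking_id, ts, in_zone,
               ROW_NUMBER() OVER (PARTITION BY tracking_id ORDER BY ts) AS rn1,
               ROW_NUMBER() OVER (PARTITION BY tracking_id, in_zone ORDER BY ts) AS rn2
        FROM records
    )
    SELECT tracking_id, MIN(ts) AS start_ts, MAX(ts) AS end_ts
    FROM cte
    WHERE in_zone = 1          -- keep only the in-zone islands
    GROUP BY tracking_id, rn1 - rn2
    ORDER BY start_ts
""").fetchall()
print(islands)  # -> [('r1', 2, 3), ('r1', 6, 7)]
```

Within one island, rn1 and rn2 advance in lockstep, so rn1 - rn2 is constant; a row with the other flag advances rn1 only, shifting the difference for the next island.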

This is a gaps-and-islands problem. I would approach it using LAG() to identify the previous in_zone value and a cumulative sum. You can also use conditional aggregation to get the first and last coordinate values:
SELECT tracking_id, MIN(record_id), MIN(timestamp) as start_timestamp,
       MAX(timestamp) as end_timestamp,
       MAX(CASE WHEN prev_in_zone IS NULL OR prev_in_zone <> in_zone THEN coordinates END) as entry_coordinates,
       MAX(CASE WHEN next_in_zone IS NULL OR next_in_zone <> in_zone THEN coordinates END) as exit_coordinates
FROM (SELECT t.*,
             SUM(CASE WHEN prev_in_zone = in_zone THEN 0 ELSE 1 END) OVER
                 (PARTITION BY tracking_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as grp
      FROM (SELECT t.*,
                   LAG(in_zone) OVER (PARTITION BY tracking_id ORDER BY timestamp) as prev_in_zone,
                   LEAD(in_zone) OVER (PARTITION BY tracking_id ORDER BY timestamp) as next_in_zone
            FROM t
           ) t
     ) t
WHERE in_zone = 'TRUE'
GROUP BY tracking_id, grp;
With much appreciation to Tim, here is a db<>fiddle.
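As a runnable illustration of the LAG()/LEAD() plus cumulative-sum approach, here is a sketch in Python's sqlite3 (standing in for Redshift) with invented sample rows; `in_zone` is modeled as 0/1 and `coord` as a single letter:

```python
import sqlite3

# Made-up data: one robot, two in-zone islands (ts 2-3 and ts 5-6).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (tracking_id TEXT, ts INTEGER, in_zone INTEGER, coord TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", [
    ("r1", 1, 0, "a"), ("r1", 2, 1, "b"), ("r1", 3, 1, "c"),
    ("r1", 4, 0, "d"), ("r1", 5, 1, "e"), ("r1", 6, 1, "f"),
])

out = con.execute("""
    SELECT tracking_id, MIN(ts) AS start_ts, MAX(ts) AS end_ts,
           -- first row of the island: previous flag is NULL or different
           MAX(CASE WHEN prev_in_zone IS NULL OR prev_in_zone <> in_zone
                    THEN coord END) AS entry_coord,
           -- last row of the island: next flag is NULL or different
           MAX(CASE WHEN next_in_zone IS NULL OR next_in_zone <> in_zone
                    THEN coord END) AS exit_coord
    FROM (SELECT t.*,
                 SUM(CASE WHEN prev_in_zone = in_zone THEN 0 ELSE 1 END)
                     OVER (PARTITION BY tracking_id ORDER BY ts) AS grp
          FROM (SELECT tracking_id, ts, in_zone, coord,
                       LAG(in_zone) OVER (PARTITION BY tracking_id ORDER BY ts) AS prev_in_zone,
                       LEAD(in_zone) OVER (PARTITION BY tracking_id ORDER BY ts) AS next_in_zone
                FROM records) t) t
    WHERE in_zone = 1
    GROUP BY tracking_id, grp
    ORDER BY start_ts
""").fetchall()
print(out)  # -> [('r1', 2, 3, 'b', 'c'), ('r1', 5, 6, 'e', 'f')]
```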

Related

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
1. Each ID gets a new rank.
2. Rank #1 is assigned to the ID with the lowest date. The subsequent dates for that particular ID can be higher, but they still get an incremental rank w.r.t. the other IDs. (E.g. the ADF32 series is ranked first because it has the lowest date, even though it ends with dates in 09-Nov; RT659 starts with 13-Aug and is ranked after it.)
3. For a particular ID, if the days are consecutive then the ranks are the same; otherwise the rank increments by 1.
4. For a particular ID, ranks are assigned in ascending date order.
How to formulate a query?
You need two steps:
select
id_col
,dt_col
,dense_rank()
over (order by min_dt, id_col, dt_col - rnk) as part_col
from
(
select
id_col
,dt_col
,min(dt_col)
over (partition by id_col) as min_dt
,rank()
over (partition by id_col
order by dt_col) as rnk
from tab
) as dt
dt_col - rnk calculates the same result for consecutive dates -> same rank
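The dt_col - rnk trick can be checked with a small sketch; here sqlite3 stands in for Teradata, and dates are simplified to invented integer day numbers so the subtraction works without date arithmetic:

```python
import sqlite3

# Made-up data: ADF32 has consecutive days 10-12, then a gap to 20;
# RT659 starts later with consecutive days 15-16.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id_col TEXT, dt_col INTEGER)")
con.executemany("INSERT INTO tab VALUES (?, ?)",
                [("ADF32", 10), ("ADF32", 11), ("ADF32", 12), ("ADF32", 20),
                 ("RT659", 15), ("RT659", 16)])

ranks = con.execute("""
    SELECT id_col, dt_col,
           DENSE_RANK() OVER (ORDER BY min_dt, id_col, dt_col - rnk) AS part_col
    FROM (SELECT id_col, dt_col,
                 MIN(dt_col) OVER (PARTITION BY id_col) AS min_dt,
                 RANK() OVER (PARTITION BY id_col ORDER BY dt_col) AS rnk
          FROM tab) dt
    ORDER BY id_col, dt_col
""").fetchall()
# consecutive days share one rank; the gap to day 20 starts a new rank;
# RT659 ranks after ADF32 because its earliest day is later
print(ranks)
```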
Try DATEDIFF on LAG and then perform a partitioned ranking:
select t.ID_COL, t.dt_col,
       rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from (SELECT ID_COL, dt_col,
             DATEDIFF(day, LAG(dt_col, 1) OVER(ORDER BY dt_col), dt_col) as date_diff
      FROM table1) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
sum(case when prev_dt_col = dt_col - 1 then 0 else 1
end) over
(order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
min(dt_col) over (partition by id_col) as min_dt_col
from t
) t;
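The cumulative-sum variant can be sketched the same way; again sqlite3 stands in for the real database, and dates are simplified to invented integer day numbers so that "previous day" is just dt_col - 1:

```python
import sqlite3

# Same made-up data as the ranking example above: ADF32 days 10,11,12,20
# and RT659 days 15,16.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (id_col TEXT, dt_col INTEGER)")
con.executemany("INSERT INTO tab VALUES (?, ?)",
                [("ADF32", 10), ("ADF32", 11), ("ADF32", 12), ("ADF32", 20),
                 ("RT659", 15), ("RT659", 16)])

ranks = con.execute("""
    SELECT id_col, dt_col,
           -- add 1 whenever a row is NOT the day after its predecessor
           SUM(CASE WHEN prev_dt = dt_col - 1 THEN 0 ELSE 1 END)
               OVER (ORDER BY min_dt, id_col, dt_col) AS ranking
    FROM (SELECT id_col, dt_col,
                 LAG(dt_col) OVER (PARTITION BY id_col ORDER BY dt_col) AS prev_dt,
                 MIN(dt_col) OVER (PARTITION BY id_col) AS min_dt
          FROM tab) t
    ORDER BY id_col, dt_col
""").fetchall()
print(ranks)  # same ranking as the dense_rank approach
```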

Min and max value per group keeping order

I have a small problem in Redshift with grouping; I have a table like the following:
INPUT
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-11'
1 '2020-09-11' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
I want to obtain this output:
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
If I do a simple Min and Max date grouping by the value, it doesn't work.
This is an example of a gaps-and-islands problem. If there are no time gaps in the data, then a difference of row numbers is a simple solution:
select value, min(created), max(updated)
from (select t.*,
row_number() over (order by created) as seqnum,
row_number() over (partition by value order by created) as seqnum_2
from t
) t
group by value, (seqnum - seqnum_2)
order by min(created);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will see how the difference between the row numbers identifies adjacent rows with the same value.
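Here is that query run end-to-end on the question's sample rows, using Python's sqlite3 as a stand-in for Redshift:

```python
import sqlite3

# The question's sample rows: value 1 twice, then 2, then 1 again.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (value INTEGER, created TEXT, updated TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, "2020-09-10", "2020-09-11"), (1, "2020-09-11", "2020-09-13"),
    (2, "2020-09-15", "2020-09-16"), (1, "2020-09-17", "2020-09-18"),
])

out = con.execute("""
    SELECT value, MIN(created) AS created, MAX(updated) AS updated
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (ORDER BY created) AS seqnum,
                 ROW_NUMBER() OVER (PARTITION BY value ORDER BY created) AS seqnum_2
          FROM t) t
    GROUP BY value, seqnum - seqnum_2
    ORDER BY MIN(created)
""").fetchall()
print(out)  # -> [(1, '2020-09-10', '2020-09-13'),
            #     (2, '2020-09-15', '2020-09-16'),
            #     (1, '2020-09-17', '2020-09-18')]
```

Note that the last value-1 row lands in its own group (seqnum - seqnum_2 differs), which is exactly why a plain GROUP BY value would merge it incorrectly.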

Is there a way to split rows into groups based on certain values?

consider this table:
I want to divide these rows into groups based on their id and price values: as long as two rows have the same id and price and are not separated by any other row, they belong to the same group, so I expect the output to be sorta like this:
I tried using window functions but with them I ended up with the last row having the same group as the first 3. Is there something I'm missing?
This is a gaps-and-islands problem. One method is to use lag() to detect changes and then a cumulative sum:
select t.*,
sum(case when prev_price = price then 0 else 1 end) over
(partition by id order by dt) as group_num
from (select t.*,
lag(price) over (partition by id order by dt) as prev_price
from t
) t
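A quick sketch with invented rows (sqlite3 standing in for the asker's database) shows how group_num comes out; the price returns to its first value in the last row, yet that row still gets a new group:

```python
import sqlite3

# Made-up rows for one id: price runs 10,10 then 20 then back to 10.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id TEXT, dt INTEGER, price INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [("a", 1, 10), ("a", 2, 10), ("a", 3, 20), ("a", 4, 10)])

out = con.execute("""
    SELECT id, dt, price,
           SUM(CASE WHEN prev_price = price THEN 0 ELSE 1 END)
               OVER (PARTITION BY id ORDER BY dt) AS group_num
    FROM (SELECT t.*,
                 LAG(price) OVER (PARTITION BY id ORDER BY dt) AS prev_price
          FROM t) t
    ORDER BY dt
""").fetchall()
print(out)  # -> [('a', 1, 10, 1), ('a', 2, 10, 1), ('a', 3, 20, 2), ('a', 4, 10, 3)]
```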

SQL query for backfilling register read values

I have a table with ID, timestamp, and register reads for a day; the register reads are like running totals that start at 12:00 at midnight and end at 11:00 at night.
The problem is that there are some random time intervals in which the cumulative reads may not be present, and I need to backfill those.
The picture below gives a snapshot of the problem. The KWH_RDNG is the difference between two cumulative intervals divided by 1000, but the 4th value, 5.851, is actually the accumulation of 3 missing hours along with the 4th hour's value. It's fine if I simply divide 5.851/4 and distribute it.
The challenge is that these gaps can happen at random intervals and can differ between meters (1st column). I am using SQL Server 2016.
Please help!
This is a gaps and islands problem -- sort of. You need to identify groups of NULL values with the subsequent value. One method is to use a cumulative sum of the non-NULL value on or after each value. This defines the groups.
Then, you need the count and the reading. So, this should do the calculation:
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1;
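Here is a sketch of the descending-count grouping with invented readings; sqlite3 stands in for SQL Server 2016, and the column names (meter, read_utc, kwh_rding) follow the answer:

```python
import sqlite3

# Made-up readings: one non-NULL value at hour 5 that accumulates hours 2-5.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (meter TEXT, read_utc INTEGER, kwh_rding REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    ("m1", 1, 2.0), ("m1", 2, None), ("m1", 3, None),
    ("m1", 4, None), ("m1", 5, 4.0),
])

out = con.execute("""
    SELECT read_utc, kwh_rding, max_kwh / cnt AS new_kwh_rding
    FROM (SELECT t.*,
                 COUNT(*)       OVER (PARTITION BY meter, grp) AS cnt,
                 MAX(kwh_rding) OVER (PARTITION BY meter, grp) AS max_kwh
          FROM (SELECT meter, read_utc, kwh_rding,
                       -- counting non-NULLs in descending time order ties each
                       -- NULL row to the next non-NULL reading after it
                       COUNT(kwh_rding) OVER (PARTITION BY meter
                                              ORDER BY read_utc DESC
                                              ROWS BETWEEN UNBOUNDED PRECEDING
                                                   AND CURRENT ROW) AS grp
                FROM readings) t) t
    WHERE cnt > 1
    ORDER BY read_utc
""").fetchall()
print(out)  # -> [(2, None, 1.0), (3, None, 1.0), (4, None, 1.0), (5, 4.0, 1.0)]
```

The 4.0 reading spans four hours, so each row in its group is assigned 4.0 / 4 = 1.0; hour 1 forms a one-row group and is excluded by cnt > 1.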
You can incorporate this into an update:
with toupdate as (
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1
)
update toupdate
set kwh_rding = max_kwh_rding;

SQL Find the minimum date based on consecutive values

I'm having trouble constructing a query that can find consecutive values meeting a condition. Example data below, note that Date is sorted DESC and is grouped by ID.
To be selected, for each ID the most recent RESULT must be 'Fail', and what I need back is the earliest date in that run of 'Fails'. For ID==1, only the first two values are of interest (the last doesn't count due to the prior 'Complete'). ID==2 doesn't count at all, failing the first condition, and for ID==3, only the first value matters.
A result table might be:
The trick seems to be doing some type of run-length encoding, but even with several attempts manipulating ROW_NUM and an attempt at the tabibitosan method for grouping consecutive values, I've been unable to gain traction.
Any help would be appreciated.
If your database supports window functions, you can do
select id, case when result='Fail' then earliest_fail_date end earliest_fail_date
from (
select t.*
,row_number() over(partition by id order by dt desc) rn
,min(case when result = 'Fail' then dt end) over(partition by id) earliest_fail_date
from tablename t
) x
where rn=1
Use row_number to get the latest row in the table. min() over() to get the earliest fail date for each id. If the first row has status Fail, you select the earliest_fail_date or else it would be null.
It should be noted that the expected result for id=1 is wrong. It should be 2016-09-20 as it is the earliest fail date.
Edit: Having re-read the question, I think this is what you might be looking for: getting the minimum Fail date from the latest consecutive group of Fail rows.
with grps as (
select t.*,row_number() over(partition by id order by dt desc) rn
,row_number() over(partition by id order by dt)-row_number() over(partition by id,result order by dt) grp
from tablename t
)
,maxfailgrp as (
select g.*,
max(case when result = 'Fail' then grp end) over(partition by id) maxgrp
from grps g
)
select id,
case when result = 'Fail' then (select min(dt) from maxfailgrp where id = m.id and grp=m.maxgrp and result = 'Fail') end earliest_fail_date
from maxfailgrp m
where rn=1
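Here is a sketch of the consecutive-fail grouping on invented rows, using Python's sqlite3 as a stand-in; note that the correlated subquery also restricts to result = 'Fail', so a non-Fail row that happens to share a grp value cannot leak into the minimum:

```python
import sqlite3

# Made-up rows: id 1 ends with a run of two Fails (earliest 09-22),
# id 2 ends with Complete, id 3 ends with a single Fail.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablename (id INTEGER, dt TEXT, result TEXT)")
con.executemany("INSERT INTO tablename VALUES (?, ?, ?)", [
    (1, "2016-09-20", "Fail"), (1, "2016-09-21", "Complete"),
    (1, "2016-09-22", "Fail"), (1, "2016-09-25", "Fail"),
    (2, "2016-09-20", "Fail"), (2, "2016-09-25", "Complete"),
    (3, "2016-09-24", "Complete"), (3, "2016-09-25", "Fail"),
])

out = con.execute("""
    WITH grps AS (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt DESC) AS rn,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt)
             - ROW_NUMBER() OVER (PARTITION BY id, result ORDER BY dt) AS grp
        FROM tablename t
    ),
    maxfailgrp AS (
        SELECT g.*,
               MAX(CASE WHEN result = 'Fail' THEN grp END)
                   OVER (PARTITION BY id) AS maxgrp
        FROM grps g
    )
    SELECT id,
           CASE WHEN result = 'Fail'
                THEN (SELECT MIN(dt) FROM maxfailgrp
                      WHERE id = m.id AND grp = m.maxgrp AND result = 'Fail')
           END AS earliest_fail_date
    FROM maxfailgrp m
    WHERE rn = 1
    ORDER BY id
""").fetchall()
print(out)  # -> [(1, '2016-09-22'), (2, None), (3, '2016-09-25')]
```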