SQL Find the minimum date based on consecutive values - sql

I'm having trouble constructing a query that can find consecutive values meeting a condition. Example data below, note that Date is sorted DESC and is grouped by ID.
To be selected, for each ID, the most recent RESULT must be 'Fail', and what I need back is the earliest date in that run of 'Fail' rows. For ID==1, only the first two values are of interest (the last doesn't count because of the prior 'Complete'). ID==2 doesn't count at all, since it fails the first condition, and for ID==3, only the first value matters.
A result table might be:
The trick seems to be doing some type of run-length encoding, but even with several attempts manipulating ROW_NUM and an attempt at the tabibitosan method for grouping consecutive values, I've been unable to gain traction.
Any help would be appreciated.

If your database supports window functions, you can do
select id, case when result='Fail' then earliest_fail_date end earliest_fail_date
from (
select t.*
,row_number() over(partition by id order by dt desc) rn
,min(case when result = 'Fail' then dt end) over(partition by id) earliest_fail_date
from tablename t
) x
where rn=1
row_number() identifies the latest row for each id, and min() over() computes the earliest Fail date for each id. If the latest row has result 'Fail', the query returns earliest_fail_date; otherwise it returns null.
It should be noted that the expected result for id=1 is wrong. It should be 2016-09-20 as it is the earliest fail date.
Edit: Having re-read the question, I think this is what you are looking for: the minimum Fail date from the latest consecutive group of Fail rows.
with grps as (
select t.*,row_number() over(partition by id order by dt desc) rn
,row_number() over(partition by id order by dt)-row_number() over(partition by id,result order by dt) grp
from tablename t
)
,maxfailgrp as (
select g.*,
max(case when result = 'Fail' then grp end) over(partition by id) maxgrp
from grps g
)
select id,
case when result = 'Fail' then (select min(dt) from maxfailgrp where id = m.id and result = 'Fail' and grp=m.maxgrp) end earliest_fail_date
from maxfailgrp m
where rn=1
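To try the row-number-difference idea end to end, here is a minimal sketch that runs it in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The sample rows are hypothetical, since the question's table is not reproduced here, and the final step uses a join on id, result and grp instead of the scalar subquery above, so it is a variant rather than the exact query from the answer.

```python
import sqlite3

# Hypothetical sample rows: (id, dt, result). Not the question's original data.
rows = [
    (1, "2016-09-25", "Fail"),
    (1, "2016-09-24", "Fail"),
    (1, "2016-09-22", "Complete"),
    (1, "2016-09-20", "Fail"),
    (2, "2016-09-25", "Complete"),
    (2, "2016-09-23", "Fail"),
    (3, "2016-09-25", "Fail"),
    (3, "2016-09-21", "Complete"),
]

QUERY = """
with grps as (
    select t.*,
           row_number() over (partition by id order by dt desc) as rn,
           -- same value for every row of one consecutive run of a result
           row_number() over (partition by id order by dt)
         - row_number() over (partition by id, result order by dt) as grp
    from tablename t
),
latest as (
    -- the most recent row per id, with the run it belongs to
    select id, result, grp from grps where rn = 1
)
select l.id, min(g.dt) as earliest_fail_date
from latest l
join grps g
  on g.id = l.id and g.result = l.result and g.grp = l.grp
where l.result = 'Fail'
group by l.id
order by l.id
"""


def earliest_fail_dates(data):
    """Return (id, earliest date of the latest Fail run) for qualifying ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tablename (id integer, dt text, result text)")
    conn.executemany("insert into tablename values (?, ?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


result = earliest_fail_dates(rows)
print(result)
```

Matching on result as well as grp matters here: rows with a different result can end up with the same grp number, so grp alone does not identify a run.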

Related

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
rank #1 is assigned to the ID with the lowest date. However, the subsequent dates for that particular ID can be higher but they will get the incremental rank w.r.t other IDs.
(E.g. ADF32 series will be considered to be ranked first as it had the lowest date, although it ends with dates 09-Nov, and RT659 starts with 13-Aug it will be ranked subsequently)
For a particular ID, if the days are consecutive then ranks are same, else they add by 1.
For a particular ID, ranks are given in date ASC.
How to formulate a query?
You need two steps:
select
id_col
,dt_col
,dense_rank()
over (order by min_dt, id_col, dt_col - rnk) as part_col
from
(
select
id_col
,dt_col
,min(dt_col)
over (partition by id_col) as min_dt
,rank()
over (partition by id_col
order by dt_col) as rnk
from tab
) as dt
dt_col - rnk calculates the same result for consecutive dates, so they get the same rank.
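A small sketch of why the date-minus-rank trick groups consecutive days, run in SQLite via Python's sqlite3 (SQLite 3.25+ for window functions). The rows are hypothetical, and because SQLite has no native date arithmetic the sketch converts dates to a day number with julianday(); in Teradata the plain dt_col - rnk from the query above plays that role.

```python
import sqlite3

# Hypothetical rows: (id_col, dt_col). The IDs echo the question's example.
rows = [
    ("ADF32", "2020-08-01"),
    ("ADF32", "2020-08-02"),
    ("ADF32", "2020-08-05"),
    ("ADF32", "2020-11-09"),
    ("RT659", "2020-08-13"),
    ("RT659", "2020-08-14"),
]

QUERY = """
select id_col, dt_col,
       dense_rank() over (order by min_dt, id_col, day_no - rnk) as part_col
from (
    select id_col, dt_col,
           cast(julianday(dt_col) as integer) as day_no,  -- date as a day count
           min(dt_col) over (partition by id_col) as min_dt,
           rank() over (partition by id_col order by dt_col) as rnk
    from tab
) as dt
order by part_col, dt_col
"""


def rank_runs(data):
    """Rank rows so that consecutive days of one ID share a rank."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tab (id_col text, dt_col text)")
    conn.executemany("insert into tab values (?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


ranked = rank_runs(rows)
for row in ranked:
    print(row)
```

Within one ID, both the day number and the rank grow by 1 on consecutive days, so their difference stays constant until a gap appears.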
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL,t.dt_col,
rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from ( SELECT ID_COL,dt_col,
DATEDIFF(day, Lag(dt_col, 1) OVER(ORDER BY dt_col),dt_col) as date_diff FROM table1 ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
sum(case when prev_dt_col = dt_col - 1 then 0 else 1
end) over
(order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
min(dt_col) over (partition by id_col) as min_dt_col
from t
) t;
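The cumulative-sum formulation can be checked the same way. This is a sketch under assumptions: hypothetical rows, SQLite (3.25+) through Python's sqlite3 instead of Teradata, and julianday() differences standing in for the dt_col - 1 comparison.

```python
import sqlite3

# Hypothetical rows: (id_col, dt_col).
rows = [
    ("ADF32", "2020-08-01"),
    ("ADF32", "2020-08-02"),
    ("ADF32", "2020-08-05"),
    ("ADF32", "2020-11-09"),
    ("RT659", "2020-08-13"),
    ("RT659", "2020-08-14"),
]

QUERY = """
select id_col, dt_col,
       -- add 1 whenever the previous date of the same ID is not exactly one day back
       sum(case when julianday(dt_col) - julianday(prev_dt_col) = 1 then 0 else 1 end)
           over (order by min_dt_col, id_col, dt_col) as ranking
from (
    select t.*,
           lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
           min(dt_col) over (partition by id_col) as min_dt_col
    from t
) as x
order by ranking, dt_col
"""


def cumulative_ranking(data):
    """Rank rows with a running count of 'new run' markers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id_col text, dt_col text)")
    conn.executemany("insert into t values (?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


ranking = cumulative_ranking(rows)
for row in ranking:
    print(row)
```

The first row of each ID has no previous date, so the comparison is null and the CASE falls through to 1, which is what starts a new rank for every ID.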

SQL ZOO Window LAG #8

Question: For each country that has had at least 1000 new cases in a single day, show the date of the peak number of new cases.
Here is some sample data from the covid table.
What I write:
SELECT name,date,MAX(confirmed-lag) AS PeakNew
FROM(
SELECT name, DATE_FORMAT(whn,'%Y-%m-%d') date, confirmed,
LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY whn) lag
FROM covid
ORDER BY confirmed
) temp
GROUP BY name
HAVING PeakNew>=1000
ORDER BY PeakNew DESC;
The result I got is weird, PeakNew seems correct, but the related date is not.
My answer
The right answer
Can anyone help me get the right answer? Thank you!
The below query works perfectly fine for me. Though the dates and values are correct, the output will say otherwise as the order is different. Here the order is by date, then by name.
SELECT z1.name, DATE_FORMAT(c.dt,'%Y-%m-%d'), z1.mx
FROM
(
SELECT z.name, MAX(z.nc) AS 'mx'
FROM (
SELECT DATE(whn) AS 'dt', name, confirmed - LAG(confirmed,1) OVER(PARTITION BY name ORDER BY DATE(whn) ASC) AS 'nc'
FROM covid ) z
WHERE z.nc >= 1000
GROUP BY z.name
) z1
INNER JOIN
(
SELECT DATE(whn) AS 'dt', name, confirmed - LAG(confirmed,1) OVER(PARTITION BY name ORDER BY DATE(whn) ASC) AS 'nc'
FROM covid
) c
ON c.nc = z1.mx
AND c.name = z1.name
ORDER BY 2 ASC
The date value in the outer query doesn't correspond to row where MAX(confirmed-lag) is found - it's just a random date value within that group. Check out the section titled, "The ONLY_FULL_GROUP_BY Issue" in this blog post: https://www.percona.com/blog/2019/05/13/solve-query-failures-regarding-only_full_group_by-sql-mode/ for more information.
I used the ROW_NUMBER() function to get the entire row corresponding to the maximum new cases. However, my final result wasn't ordered the way the answer was, and there's no specification to how it should be ordered, so I still didn't get that satisfying happy emoji.
You need to self join to obtain the date on which the max count occurred:
WITH CTE1 as
(SELECT name,DATE_FORMAT(whn, "%Y-%m-%d") as date,
confirmed - LAG(confirmed, 1) OVER (PARTITION BY name ORDER BY DATE(whn)) as increase
FROM covid
ORDER BY whn),
CTE2 AS
(SELECT name, MAX(increase) as max_increase
FROM CTE1
WHERE increase >999
GROUP BY name)
SELECT c1.name,c1.date,c2.max_increase as peakNewCases
FROM CTE1 as c1
JOIN CTE2 as c2
ON c1.name=c2.name AND c1.increase=c2.max_increase
WITH CTE1 as
(SELECT name, DATE_FORMAT(whn,'%Y-%m-%d') as date_form, confirmed - LAG(confirmed,1) OVER(PARTITION BY name ORDER BY whn) AS newcases
FROM covid
ORDER BY name,whn)
SELECT name, date_form, newcases FROM
(
SELECT name, date_form, newcases, ROW_NUMBER() OVER (PARTITION BY name ORDER BY newcases DESC) as rn
FROM CTE1
WHERE newcases > 999
) cte2
WHERE rn = 1
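To see the ROW_NUMBER() approach pick the whole peak row, here is a minimal sketch against a toy table. The figures are invented, it runs in SQLite (3.25+) via Python's sqlite3, and date(whn) stands in for MySQL's DATE_FORMAT, which SQLite does not have.

```python
import sqlite3

# Hypothetical cumulative counts: (name, whn, confirmed).
rows = [
    ("A", "2020-03-01", 100),
    ("A", "2020-03-02", 1500),
    ("A", "2020-03-03", 4000),
    ("A", "2020-03-04", 4500),
    ("B", "2020-03-01", 10),
    ("B", "2020-03-02", 50),
    ("B", "2020-03-03", 300),
]

QUERY = """
with daily as (
    select name,
           date(whn) as dt,
           confirmed - lag(confirmed) over (partition by name order by whn) as new_cases
    from covid
),
ranked as (
    -- rn = 1 marks the row holding the largest daily increase per country
    select name, dt, new_cases,
           row_number() over (partition by name order by new_cases desc) as rn
    from daily
    where new_cases >= 1000
)
select name, dt, new_cases
from ranked
where rn = 1
order by name
"""


def peak_days(data):
    """Return (name, date of peak, peak new cases) per qualifying country."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table covid (name text, whn text, confirmed integer)")
    conn.executemany("insert into covid values (?, ?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


peaks = peak_days(rows)
print(peaks)
```

Because the date travels with the row that was ranked first, it always belongs to the peak value, unlike a bare column selected next to MAX() under GROUP BY.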

Grouping rows based on a consecutive flag in SQL (Redshift)

I've got a tricky problem that I am trying to solve here and can't get my head around it so far.
So the problem is this: I have tracking data, where there are records produced over time. Let's say you have a robot driving around and you record it's position once every second. Each of those positions is recorded as one record in the database (we use AWS Redshift).
Each record has a tracking_id which is unique across all records that belong to the same source of the tracking, i.e. unique for the robot. Then I have a record_id which is globally unique, a timestamp, and a flag that indicates if the record was created while the robot was inside or outside a defined zone. And then there is some additional data like coordinates.
Here is a little illustration. The pink box is the zone, the green line is the path of the robot and the blue dots are the produced records.
So now I would like to group records based on the zone flag (have a look at the screenshot below). So I want to isolate sub-paths inside the zone into a record and grab the start and end timestamp and position. The IDs don't matter so I don't necessarily need to keep the tracking or record ids even though I listed them in the desired result.
Thanks for the help, I would really appreciate it! Also just solving part of the problem like how to group based on the flag without grabbing first and last values within the sub-paths would help already.
This is a gaps and islands problem. In this case, you want the islands where in_zone happens to be TRUE (and there are two of them). We can use the difference in row number method here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tracking_id ORDER BY timestamp) rn1,
ROW_NUMBER() OVER (PARTITION BY tracking_id, in_zone ORDER BY timestamp) rn2
FROM yourTable
)
SELECT
tracking_id,
MIN(record_id) AS record_id,
MIN(timestamp) AS start_timestamp,
MAX(timestamp) AS end_timestamp,
(SELECT t2.coordinates FROM yourTable t2
WHERE t2.record_id = MIN(t1.record_id) AND t2.tracking_id = t1.tracking_id) AS entry_coordinates,
(SELECT t2.coordinates FROM yourTable t2
WHERE t2.record_id = MAX(t1.record_id) AND t2.tracking_id = t1.tracking_id) AS exit_coordinates
FROM cte t1
WHERE
in_zone = 'TRUE'
GROUP BY
tracking_id,
rn1 - rn2,
in_zone
ORDER BY
tracking_id,
record_id DESC;
This is a gaps-and-islands problem. I would approach it using LAG() to get the previous row's in_zone value and a cumulative sum to number the groups. You can also use conditional aggregation to get the first and last coordinate values:
SELECT tracking_id, MIN(record_id), MIN(timestamp) as start_timestamp,
MAX(timestamp) as end_timestamp,
MAX(CASE WHEN prev_in_zone IS NULL OR prev_in_zone <> in_zone THEN coordinates END) as entry_coordinates,
MAX(CASE WHEN next_in_zone IS NULL OR next_in_zone <> in_zone THEN coordinates END) as exit_coordinates
FROM (SELECT t.*,
SUM( CASE WHEN prev_in_zone = in_zone THEN 0 ELSE 1 END) OVER (PARTITION BY tracking_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as grp
FROM (SELECT t.*,
LAG(in_zone) OVER (PARTITION BY tracking_id ORDER BY timestamp) as prev_in_zone,
LEAD(in_zone) OVER (PARTITION BY tracking_id ORDER BY timestamp) as next_in_zone
FROM t
) t
) t
WHERE in_zone = 'TRUE'
GROUP BY tracking_id, grp;
With much appreciation to Tim, here is a db<>fiddle.
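For the grouping part alone, a minimal sketch of the LAG() plus cumulative-sum idea follows. It runs in SQLite (3.25+) via Python's sqlite3 rather than Redshift, the table name track and its rows are hypothetical, the column is called ts to avoid the timestamp keyword, and in_zone is stored as 0/1 instead of 'TRUE'/'FALSE'. It returns only start, end and row count per island, not the coordinates.

```python
import sqlite3

# Hypothetical tracking rows: (tracking_id, record_id, ts, in_zone).
rows = [
    (1, 101, 1, 0),
    (1, 102, 2, 1),
    (1, 103, 3, 1),
    (1, 104, 4, 0),
    (1, 105, 5, 0),
    (1, 106, 6, 1),
    (1, 107, 7, 1),
    (1, 108, 8, 1),
]

QUERY = """
with flagged as (
    select t.*,
           lag(in_zone) over (partition by tracking_id order by ts) as prev_in_zone
    from track t
),
grouped as (
    -- the running sum increases by 1 each time the flag changes
    select f.*,
           sum(case when prev_in_zone = in_zone then 0 else 1 end)
               over (partition by tracking_id order by ts
                     rows between unbounded preceding and current row) as grp
    from flagged f
)
select tracking_id, min(ts) as start_ts, max(ts) as end_ts, count(*) as n_records
from grouped
where in_zone = 1
group by tracking_id, grp
order by tracking_id, start_ts
"""


def in_zone_islands(data):
    """Return one row per uninterrupted stretch of in_zone = 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table track (tracking_id integer, record_id integer, "
        "ts integer, in_zone integer)"
    )
    conn.executemany("insert into track values (?, ?, ?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


islands = in_zone_islands(rows)
print(islands)
```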

SQL - Window function to get values from previous row where value is not null

I am using Exasol. In other DBMSs it was possible to use analytic functions such as LAST_VALUE() and specify a windowing clause after the ORDER BY within OVER(), like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
The same does not happen if I use CURRENT ROW instead of AND 1 PRECEDING.
Basically, what I want is the last value, according to the ORDER BY, that is NOT the current row. In this example it would be the customer of the previous row.
I know that I could use LAG(customer,1) OVER (...), but the problem is that I want the previous customer that is NOT null, so the offset is not always 1.
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;
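A quick way to see what the two-step query returns is to run it on a few rows. This sketch uses SQLite (3.25+) via Python's sqlite3 instead of Exasol, hypothetical data, and the column name date_x from the question in place of date.

```python
import sqlite3

# Hypothetical rows: (id, date_x, customer); None becomes SQL NULL.
rows = [
    (1, "2020-01-01", "alice"),
    (1, "2020-01-02", None),
    (1, "2020-01-03", None),
    (1, "2020-01-04", "bob"),
    (1, "2020-01-05", None),
]

QUERY = """
select id, date_x, customer,
       -- every row sharing a max_date sees the single non-null customer of that date
       max(customer) over (partition by id, max_date) as max_customer
from (
    select t.*,
           -- latest date so far on which customer was not null
           max(case when customer is not null then date_x end)
               over (partition by id order by date_x) as max_date
    from t
) as x
order by id, date_x
"""


def fill_customers(data):
    """Carry the most recent non-null customer forward over null rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer, date_x text, customer text)")
    conn.executemany("insert into t values (?, ?, ?)", data)
    out = conn.execute(QUERY).fetchall()
    conn.close()
    return out


filled = fill_customers(rows)
for row in filled:
    print(row)
```

Note that a row whose own customer is not null gets its own value back, so this fills the gaps rather than strictly returning the previous row's customer.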

Find the nth value in hive

I am trying to identify the Nth Score Value which is also dependant on another variable.
For example I want to see the nth Transaction amount of each person, the issue I currently have is that my RANK does not re-start the count of n at each name, it just continues down the output like a row count:
Syntax example:
SELECT name, txn_amount, dense_rank() over (order by name,txn_amount desc ) as nth_value FROM payment_table
Any help is greatly appreciated.
P.S I am using HIVE to run this if it helps
You need to partition by one value and order by the other:
SELECT name, txn_amount
FROM (SELECT pt.*,
dense_rank() over (partition by name order by txn_amount desc ) as nth_value
FROM payment_table pt
) pt
WHERE nth_value = X;
The subquery is needed to get a particular value. If you want multiple values in the same row, you can use GROUP BY:
SELECT name,
MAX(CASE WHEN nth_value = 1 THEN txn_amount END) as value_1,
MAX(CASE WHEN nth_value = 2 THEN txn_amount END) as value_2
FROM (SELECT pt.*,
dense_rank() over (partition by name order by txn_amount desc ) as nth_value
FROM payment_table pt
) pt
WHERE nth_value IN (1, 2)
GROUP BY name;
Note: DENSE_RANK() gives tied amounts the same rank, so duplicate values are counted once. If you want duplicates counted separately (so the second value could equal the first), use ROW_NUMBER() instead.
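The difference between the two ranking functions shows up as soon as a name has a repeated amount. A minimal sketch, assuming hypothetical rows and running in SQLite (3.25+) via Python's sqlite3 rather than Hive; the helper nth_amounts is made up for this illustration.

```python
import sqlite3

# Hypothetical rows: (name, txn_amount). "ann" has a duplicated top amount.
rows = [("ann", 50), ("ann", 50), ("ann", 30), ("bob", 20), ("bob", 10)]


def nth_amounts(data, n, ranking="dense_rank"):
    """Return (name, txn_amount) for the n-th amount per name."""
    if ranking not in ("dense_rank", "row_number"):
        raise ValueError("unsupported ranking function")
    # The ranking restarts for every name because of PARTITION BY name.
    query = f"""
        select name, txn_amount
        from (select pt.*,
                     {ranking}() over (partition by name
                                       order by txn_amount desc) as nth_value
              from payment_table pt) ranked
        where nth_value = ?
        order by name
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table payment_table (name text, txn_amount integer)")
    conn.executemany("insert into payment_table values (?, ?)", data)
    out = conn.execute(query, (n,)).fetchall()
    conn.close()
    return out


second_dense = nth_amounts(rows, 2, "dense_rank")
second_rownum = nth_amounts(rows, 2, "row_number")
print(second_dense)
print(second_rownum)
```

With DENSE_RANK() the two 50s share rank 1, so ann's second value is 30; with ROW_NUMBER() the second 50 is row 2.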