I have 10k+ IDs and, for each one of them, there are 10 zones; each zone can be affected in some way
I want to count how long each zone was affected for each ID, ordered by day (considering last week as a whole)
To know if/when a zone was affected, the column AFFECTED_ZONE returns a value from 1 to 10 (indicating which zone it was)
I know the zone was normalized once a subsequent row has AFFECTED_ZONE = 0
So, for example, it looks a little like this:
DATE                 ID  AFFECTED_ZONE
-------------------  --  -------------
2022-12-21 15:00:00  1   1
2022-12-21 15:03:00  1   0
2022-12-21 15:15:00  1   3
2022-12-21 15:25:00  1   0
2022-12-21 16:00:00  1   0
2022-12-21 16:43:00  1   4
2022-12-21 17:00:00  1   0
In this case, zone 1 of ID 1 was affected at 15:00:00 and normalized at 15:03:00 - overall affected time should be 3 min; same thing with zone 4 in this example (affected at 16:43:00 and normalized at 17:00:00 - overall affected time should be 17 min)
For zone 3, the affectation happened at 15:15:00 and was normalized at 15:25:00 (the first 0); the later 0 is not considered - overall affected time should be 10 min
The problem is that, sometimes, it can look like this:
DATE                 ID  AFFECTED_ZONE
-------------------  --  -------------
2022-12-21 15:00:00  1   1
2022-12-21 15:03:00  1   1
2022-12-21 15:15:00  1   0
2022-12-21 15:25:00  1   6
2022-12-21 16:00:00  1   4
2022-12-21 16:43:00  1   3
2022-12-21 17:00:00  1   0
In this case, zone 1 of ID 1 was affected at 15:00:00 and normalized at 15:15:00; the 1 that shows up again at 15:03:00 should be disregarded, since the same zone has already been affected since 15:00:00 - overall affected time should be 15 min
After this, zones 6, 4 and 3 were affected in a row, and normalization only came at 17:00:00; the overall affected times for each zone, respectively, should be 95 min, 60 min and 17 min
I can't figure this second part out. At first, I separated the dates of each event (affectation and normalization) like this:
case when affected_zone <> 0 then date end as affected_at,
case when affected_zone = 0 then date end as normal_at
Then, I added a LEAD() function so that I could subtract the AFFECTED_AT date from the NORMAL_AT date and thus find the overall affected time, like this:
datediff(minutes, affected_at, lead(normal_at) over (partition by id order by date)) as lead
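Putting those pieces together, the full query looks roughly like this (a rough sketch with a placeholder table name; I renamed the output column to lead_minutes since LEAD is a keyword):
select date,
       id,
       affected_zone,
       affected_at,
       normal_at,
       datediff(minutes, affected_at,
                lead(normal_at) over (partition by id order by date)) as lead_minutes
from (select date,
             id,
             affected_zone,
             case when affected_zone <> 0 then date end as affected_at,
             case when affected_zone = 0 then date end as normal_at
      from my_table) t   -- my_table is a placeholder name
order by id, date;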
It works just fine for the first scenario:
DATE                 ID  AFFECTED_ZONE  AFFECTED_AT          NORMAL_AT            LEAD
-------------------  --  -------------  -------------------  -------------------  ----
2022-12-21 15:00:00  1   1              2022-12-21 15:00:00  null                 3
2022-12-21 15:03:00  1   0              null                 2022-12-21 15:03:00  null
2022-12-21 15:15:00  1   3              2022-12-21 15:15:00  null                 10
2022-12-21 15:25:00  1   0              null                 2022-12-21 15:25:00  null
2022-12-21 16:00:00  1   0              null                 2022-12-21 16:00:00  null
2022-12-21 16:43:00  1   4              2022-12-21 16:43:00  null                 17
2022-12-21 17:00:00  1   0              null                 2022-12-21 17:00:00  null
However, for the second one, the LEAD() only returns a value on the last row in which the AFFECTED_AT column is not null, disregarding the other ones, like this:
DATE                 ID  AFFECTED_ZONE  AFFECTED_AT          NORMAL_AT            LEAD
-------------------  --  -------------  -------------------  -------------------  ----
2022-12-21 15:00:00  1   1              2022-12-21 15:00:00  null                 null
2022-12-21 15:03:00  1   1              2022-12-21 15:03:00  null                 12
2022-12-21 15:15:00  1   0              null                 2022-12-21 15:15:00  null
2022-12-21 15:25:00  1   6              2022-12-21 15:25:00  null                 null
2022-12-21 16:00:00  1   4              2022-12-21 16:00:00  null                 null
2022-12-21 16:43:00  1   3              2022-12-21 16:43:00  null                 17
2022-12-21 17:00:00  1   0              null                 2022-12-21 17:00:00  null
I could ignore nulls with the LEAD() function, and it would work well for the cases in which there are different zones one after the other, but it wouldn't work in cases in which the same zone repeats itself, as I would be adding unnecessary time, for example:
DATE                 ID  AFFECTED_ZONE  AFFECTED_AT          NORMAL_AT            LEAD
-------------------  --  -------------  -------------------  -------------------  ----
2022-12-21 15:00:00  1   1              2022-12-21 15:00:00  null                 15
2022-12-21 15:03:00  1   1              2022-12-21 15:03:00  null                 12
2022-12-21 15:15:00  1   0              null                 2022-12-21 15:15:00  null
2022-12-21 15:25:00  1   6              2022-12-21 15:25:00  null                 95
2022-12-21 16:00:00  1   4              2022-12-21 16:00:00  null                 60
2022-12-21 16:43:00  1   3              2022-12-21 16:43:00  null                 17
2022-12-21 17:00:00  1   0              null                 2022-12-21 17:00:00  null
The overall affected time for zone 1 should be 15 min, but if I add everything up it would be 23 min
Any ideas on how to solve this? I'm no expert on Snowflake/SQL (quite the contrary), so I would much appreciate any help!
I can think of two possible approaches; the second is probably the best, but I'll let you decide:
1 - Remove Extra Records
Based on your question, I'm assuming a given AFFECTED_ZONE is only affected once per ID (each occurrence possibly spanning multiple records), i.e.
DATE                 ID  AFFECTED_ZONE
-------------------  --  -------------
2022-12-21 15:00:00  1   1
2022-12-21 15:03:00  1   1
2022-12-21 15:15:00  1   0
2022-12-21 15:25:00  1   6
2022-12-21 16:00:00  1   4
2022-12-21 16:43:00  1   3
2022-12-21 17:00:00  1   0
and not
DATE                 ID  AFFECTED_ZONE
-------------------  --  -------------
2022-12-21 15:00:00  1   1
2022-12-21 15:03:00  1   1
2022-12-21 15:15:00  1   0
2022-12-21 15:25:00  1   1
2022-12-21 16:00:00  1   0
2022-12-21 16:43:00  1   3
2022-12-21 17:00:00  1   0
We could use a LAG function to find each record's previous AFFECTED_ZONE and remove those with the same ID and AFFECTED_ZONE, while ignoring rows where AFFECTED_ZONE = 0. If you do have more than one occurrence of an ID/AFFECTED_ZONE pairing, this process would merge them together.
select foo.id,
       foo.date,
       foo.affected_zone
from (select id,
             date,
             affected_zone,
             lag(affected_zone, 1) over (partition by id
                                         order by date) prev_affected_zone
      from your_table) foo
where ifnull(foo.affected_zone, -1) != ifnull(foo.prev_affected_zone, -1)
   or ifnull(foo.affected_zone, -1) = 0
This approach will give you something like
DATE                 ID  AFFECTED_ZONE
-------------------  --  -------------
2022-12-21 15:00:00  1   1
2022-12-21 15:15:00  1   0
2022-12-21 15:25:00  1   6
2022-12-21 16:00:00  1   4
2022-12-21 16:43:00  1   3
2022-12-21 17:00:00  1   0
This allows you to use your existing LEAD logic.
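For example, something like this (just a sketch with placeholder names, assuming the de-duplicated result above is wrapped in a CTE called deduplicated; with the repeats gone, LEAD ... IGNORE NULLS can safely skip over the in-between affected rows):
with deduplicated as (
    -- the de-duplication query from above
    select foo.id, foo.date, foo.affected_zone
    from (select id, date, affected_zone,
                 lag(affected_zone, 1) over (partition by id order by date) prev_affected_zone
          from your_table) foo
    where ifnull(foo.affected_zone, -1) != ifnull(foo.prev_affected_zone, -1)
       or ifnull(foo.affected_zone, -1) = 0
)
select id,
       affected_zone,
       date as affected_at,
       datediff(minute,
                date,
                lead(case when affected_zone = 0 then date end) ignore nulls
                    over (partition by id order by date)) as affected_minutes
from deduplicated
where affected_zone != 0;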
2 - Use FIRST_VALUE instead of LEAD
Use your current process but replace LEAD with FIRST_VALUE.
FIRST_VALUE will select the first value in an ordered group of values, so we can ignore nulls and return the first normal_at value after our current row.
select date,
       id,
       affected_zone,
       affected_at,
       first_value(normal_at ignore nulls) over (partition by id
                                                 order by date
                                                 rows between current row and unbounded following) normal_at
from (select id,
             date,
             affected_zone,
             case when affected_zone != 0 then date end affected_at,
             case when affected_zone = 0 then date end normal_at
      from your_table) foo
This should give you:
DATE                 ID  AFFECTED_ZONE  AFFECTED_AT          NORMAL_AT
-------------------  --  -------------  -------------------  -------------------
2022-12-21 15:00:00  1   1              2022-12-21 15:00:00  2022-12-21 15:15:00
2022-12-21 15:03:00  1   1              2022-12-21 15:03:00  2022-12-21 15:15:00
2022-12-21 15:15:00  1   0              null                 2022-12-21 15:15:00
2022-12-21 15:25:00  1   6              2022-12-21 15:25:00  2022-12-21 17:00:00
2022-12-21 16:00:00  1   4              2022-12-21 16:00:00  2022-12-21 17:00:00
2022-12-21 16:43:00  1   3              2022-12-21 16:43:00  2022-12-21 17:00:00
2022-12-21 17:00:00  1   0              null                 2022-12-21 17:00:00
You can then do your duration calculation and select the first record for each ID, AFFECTED_ZONE pairing, probably with a ROW_NUMBER.
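That last step could look something like this (again just a sketch with placeholder names; first_value_result stands for the FIRST_VALUE query above, and the ROW_NUMBER keeps only the earliest row of each ID / AFFECTED_ZONE / NORMAL_AT group so a repeated zone isn't counted twice):
with first_value_result as (
    -- the FIRST_VALUE query from above
    select date,
           id,
           affected_zone,
           affected_at,
           first_value(normal_at ignore nulls) over (partition by id
                                                     order by date
                                                     rows between current row and unbounded following) normal_at
    from (select id,
                 date,
                 affected_zone,
                 case when affected_zone != 0 then date end affected_at,
                 case when affected_zone = 0 then date end normal_at
          from your_table) foo
)
select id,
       affected_zone,
       affected_at,
       normal_at,
       datediff(minute, affected_at, normal_at) as affected_minutes
from (select f.*,
             row_number() over (partition by id, affected_zone, normal_at
                                order by date) rn
      from first_value_result f) t
where affected_zone != 0
  and rn = 1;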
I would approach this as a gaps-and-islands problem, which gives us a lot of flexibility to address the various use cases.
My pick would be to define groups of adjacent records that start with one or more affected zones and end with a normalization (affected_zone = 0), using window functions:
select t.*,
       sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
from (
    select t.*,
           lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
    from mytable t
) t
Starting from a mix of the data you provided, which I hope represents the different use cases, this returns:
DATE                     ID  AFFECTED_ZONE  lag_affected_zone  grp
-----------------------  --  -------------  -----------------  ---
2022-12-21 15:00:00.000  1   1              0                  1
2022-12-21 15:03:00.000  1   1              1                  1
2022-12-21 15:15:00.000  1   0              1                  1
2022-12-21 15:17:00.000  1   0              0                  2
2022-12-21 15:25:00.000  1   6              0                  3
2022-12-21 16:00:00.000  1   4              6                  3
2022-12-21 16:43:00.000  1   3              4                  3
2022-12-21 16:50:00.000  1   1              3                  3
2022-12-21 17:00:00.000  1   0              1                  3
You can see how records are being grouped together to form consistent islands. Now we can work on each group: we want to bring the earliest date of each affected zone in the group and compare it to the latest date of the group (which corresponds to the normalization step); we can use aggregation:
select *
from (
    select id, affected_zone, min(date) affected_at,
           max(max(date)) over(partition by id, grp) normalized_at
    from (
        select t.*,
               sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
        from (
            select t.*,
                   lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
            from mytable t
        ) t
    ) t
    group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at
id  affected_zone  affected_at              normalized_at
--  -------------  -----------------------  -----------------------
1   1              2022-12-21 15:00:00.000  2022-12-21 15:15:00.000
1   6              2022-12-21 15:25:00.000  2022-12-21 17:00:00.000
1   4              2022-12-21 16:00:00.000  2022-12-21 17:00:00.000
1   3              2022-12-21 16:43:00.000  2022-12-21 17:00:00.000
1   1              2022-12-21 16:50:00.000  2022-12-21 17:00:00.000
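From here, one more wrap gives the per-zone duration the question asks for (a sketch, assuming minute precision is enough; DATEDIFF with the minute part works the same way in SQL Server and Snowflake):
select id,
       affected_zone,
       affected_at,
       normalized_at,
       datediff(minute, affected_at, normalized_at) as affected_minutes
from (
    select id, affected_zone, min(date) affected_at,
           max(max(date)) over(partition by id, grp) normalized_at
    from (
        select t.*,
               sum(case when lag_affected_zone = 0 then 1 else 0 end) over(partition by id order by date) grp
        from (
            select t.*,
                   lag(affected_zone, 1, 0) over(partition by id order by date) lag_affected_zone
            from mytable t
        ) t
    ) t
    group by id, affected_zone, grp
) t
where affected_zone != 0
order by id, affected_at;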
Here is a demo on DB Fiddle: this is SQL Server, but it uses standard SQL that Snowflake supports as well.
Related
id startTime endTime
1 2022-12-3 13:00:00 2022-12-3 14:00:00
2 2022-12-3 14:00:00 2022-12-3 14:30:00
3 2022-12-3 15:00:00 2022-12-3 15:15:00
4 2022-12-3 15:30:00 2022-12-3 16:30:00
5 2022-12-3 18:30:00 2022-12-3 19:00:00
SELECT startTime, endTime,
       (TIMESTAMPDIFF(MINUTE, startTime, endTime) = 60) AS MinuteDiff
FROM booking
OUTPUT:
id startTime endTime MinuteDiff
1 2022-12-3 13:00:00 2022-12-3 14:00:00 1
2 2022-12-3 14:00:00 2022-12-3 14:30:00 0
3 2022-12-3 15:00:00 2022-12-3 15:15:00 0
4 2022-12-3 15:30:00 2022-12-3 16:30:00 1
5 2022-12-3 18:30:00 2022-12-3 19:00:00 0
This calculates the difference between the startTime and endTime of ID 1; how do I calculate the difference between the endTime of ID 1 and the startTime of ID 2, and so on?
Do try this one:
If you want your last row to be included in your result, use LEFT JOIN; if you don't want to include the last row, use JOIN.
SELECT d.`id`,
       d.`endTime`,
       IFNULL(d1.`startTime`, d.`endTime`),
       IFNULL(TIMESTAMPDIFF(MINUTE, d.`endTime`, d1.`startTime`), 0)
FROM date_table d
LEFT JOIN date_table d1 ON d1.`id` = d.`id` + 1
Or you can use the following with window functions:
SELECT
    id,
    endTime,
    lead(startTime) over (order by id) nextStartDate,
    TIMESTAMPDIFF(MINUTE, endTime, lead(startTime) over (order by id)) as timeDiff
FROM date_table d;
I have a ClickHouse table with some rows like this:
id                   created_at
-------------------  -------------------
6962098097124188161  2022-07-01 00:00:00
6968111372399976448  2022-07-02 00:00:00
6968111483775524864  2022-07-03 00:00:00
6968465518567268352  2022-07-04 00:00:00
6968952917160271872  2022-07-07 00:00:00
6968952924479332352  2022-07-09 00:00:00
I need to resample the time series and get a running count by date, like this:
created_at           count
-------------------  -----
2022-07-01 00:00:00  1
2022-07-02 00:00:00  2
2022-07-03 00:00:00  3
2022-07-04 00:00:00  4
2022-07-05 00:00:00  4
2022-07-06 00:00:00  4
2022-07-07 00:00:00  5
2022-07-08 00:00:00  5
2022-07-09 00:00:00  6
I've tried this:
SELECT
    arrayJoin(
        timeSlots(
            MIN(created_at),
            toUInt32(24 * 3600 * 10),
            24 * 3600
        )
    ) AS ts,
    SUM(COUNT(*)) OVER (ORDER BY ts)
FROM table
but it counts all rows. How can I get the expected result?
Why not use GROUP BY on created_at, like:
select toDate(created_at), count(*) from table_name group by toDate(created_at)
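If you also need the running total with the missing days filled in (as in the expected output), one possible sketch, assuming a ClickHouse version with window functions and ORDER BY ... WITH FILL:
SELECT
    day AS created_at,
    sum(cnt) OVER (ORDER BY day) AS cumulative_count  -- running total; filled days contribute 0
FROM
(
    SELECT toDate(created_at) AS day, count(*) AS cnt
    FROM table_name
    GROUP BY day
    ORDER BY day WITH FILL   -- generate the missing days with cnt = 0
)
ORDER BY created_at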
I have data like this:
no_shift start_time end_time
1 08:00:01 15:00:00
2 15:00:01 20:00:00
3 20:00:01 03:00:00
4 03:00:01 08:00:00
I am using this syntax:
select * from shift_time where '20:15:22' between start_time and end_time
I got null, but if I change the value to
08:01:22 return 1
16:35:12 return 2
05:11:23 return 4
But for 22:02:22 I get null. How do I solve this problem?
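Shifts that cross midnight (like shift 3, 20:00:01 to 03:00:00) can never match a plain BETWEEN, so one way to handle the wrap-around is to treat them separately (a sketch against the same shift_time table; the time literal is just an example):
SELECT *
FROM shift_time
WHERE (start_time <= end_time AND '22:02:22' BETWEEN start_time AND end_time)
   OR (start_time >  end_time AND ('22:02:22' >= start_time OR '22:02:22' <= end_time));
-- the second branch covers shifts that wrap past midnight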
I am trying to collapse data that is in a sequence sorted by date, while grouping on the person and the type.
The data is stored in a SQL Server database and looks like the following -
seq person date type
--- ------ ------------------- ----
1 1 2018-02-10 08:00:00 1
2 1 2018-02-11 08:00:00 1
3 1 2018-02-12 08:00:00 1
4 1 2018-02-14 16:00:00 1
5 1 2018-02-15 16:00:00 1
6 1 2018-02-16 16:00:00 1
7 1 2018-02-20 08:00:00 2
8 1 2018-02-21 08:00:00 2
9 1 2018-02-22 08:00:00 2
10 1 2018-02-23 08:00:00 1
11 1 2018-02-24 08:00:00 1
12 1 2018-02-25 08:00:00 2
13 2 2018-02-10 08:00:00 1
14 2 2018-02-11 08:00:00 1
15 2 2018-02-12 08:00:00 1
16 2 2018-02-14 16:00:00 3
17 2 2018-02-15 16:00:00 3
18 2 2018-02-16 16:00:00 3
This data set contains about 1.2 million records that resemble the above.
The result that I would like to get from this would be -
person start type
------ ------------------- ----
1 2018-02-10 08:00:00 1
1 2018-02-20 08:00:00 2
1 2018-02-23 08:00:00 1
1 2018-02-25 08:00:00 2
2 2018-02-10 08:00:00 1
2 2018-02-14 16:00:00 3
I have the data in the first format by running the following query -
select
    ROW_NUMBER() OVER (ORDER BY date) AS seq,
    person,
    date,
    type
from table
group by person, date, type
I am just not sure how to keep the minimum date with the other distinct values from person and type.
This is a gaps-and-islands problem, so you can use differences of row_number() and use them in grouping:
select person, min(date) as start, type
from (select *,
row_number() over (partition by person order by seq) seq1,
row_number() over (partition by person, type order by seq) seq2
from table
) t
group by person, type, (seq1 - seq2)
order by person, start;
The correct solution using the difference of row numbers is:
select person, type, min(date) as start
from (select t.*,
row_number() over (partition by person order by seq) as seqnum_p,
row_number() over (partition by person, type order by seq) as seqnum_pt
from t
) t
group by person, type, (seqnum_p - seqnum_pt)
order by person, start;
type needs to be included in the GROUP BY.
I need assistance with updating a field/column IsLatest based on a comparison between the current and previous record. I'm using a CTE and I'm able to get the current and previous record, but I'm unable to update the IsLatest field/column, which needs to be set based on the Value field/column of the current and previous record.
Example
Current Output
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 1
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 0
2010-01-02 00:00:00.000 1 30 1
Dates Customer Value IsLatest
2010-01-01 00:00:00.000 1 12 0
2010-01-02 00:00:00.000 1 30 0
2010-01-03 00:00:00.000 1 13 1
Expected Final Output
Dates Customer Value ValueSetId IsLatest
2010-01-01 00:00:00.000 1 12 12 0
2010-01-01 00:00:00.000 1 12 13 0
2010-01-01 00:00:00.000 1 12 14 0
2010-01-02 00:00:00.000 1 30 12 0
2010-01-02 00:00:00.000 1 30 13 0
2010-01-02 00:00:00.000 1 30 14 0
2010-01-03 00:00:00.000 1 13 12 0
2010-01-03 00:00:00.000 1 13 13 0
2010-01-03 00:00:00.000 1 13 14 0
2010-01-04 00:00:00.000 1 14 12 0
2010-01-04 00:00:00.000 1 14 13 0
2010-01-04 00:00:00.000 1 14 14 1
;WITH a AS
(
    SELECT
        Dates, Customer, Value,
        row_number() over (partition by Customer order by Dates desc, ValueSetId desc) rn
    FROM #Customers
)
SELECT Dates, Customer, Value, case when rn = 1 then 1 else 0 end IsLatest
FROM a
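If the goal is to persist IsLatest rather than just select it, one possible sketch (SQL Server; assumes #Customers already has the IsLatest column, and updates through the CTE since it references a single table):
;WITH a AS
(
    SELECT
        IsLatest,
        row_number() over (partition by Customer
                           order by Dates desc, ValueSetId desc) rn
    FROM #Customers
)
UPDATE a
SET IsLatest = CASE WHEN rn = 1 THEN 1 ELSE 0 END;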