Calculate the time length of 0-1 sequence with hive - sql

The linked question 'calculate the time length' already solves the problem of calculating the time length of each sub-sequence.
The data is like:
time(string) id(int)
201801051127 0
201801051130 0
201801051132 0
201801051135 1
201801051141 1
201801051145 0
201801051147 0
Now I have some additional rules:
(1) the time length of the first sequence should begin at '201801051100' and end at the start time of the next sequence, e.g. '201801051135', so the time length of the first sequence is 35;
(2) the time length of the second sequence should begin at its own start time and end at the start time of the next sequence;
(3) the time length of the final sequence should begin at its own start time and end at '201801051200'.
How can I implement these three rules (for the first sequence, the middle sequences, and the final sequence) in Hive, based on the code written in 'calculate the time length':
with q1 as (
select unix_timestamp(time, 'yyyyMMddHHmm')/60 time, id,
case id when lag(id) over(order by time) then null else 1 end
first_in_group
from t
), q2 as (
select time, id, count(first_in_group) over (order by time) grp_id
from q1
)
select min(id) id, max(time) - min(time) minutes
from q2
group by grp_id
order by grp_id

You can achieve that with some minor modifications to the query:
with q1 as (
select unix_timestamp(time, 'yyyyMMddHHmm')/60 time, id,
case id when lag(id) over(order by time) then null else 1 end
first_in_group
from t
), q2 as (
select time, id
from q1
where first_in_group = 1
)
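select id, lead(time, 1, unix_timestamp('201801051200', 'yyyyMMddHHmm')/60)
over (order by time)
-- rule (1): the first group is measured from the fixed window start, not from its first row
- case when lag(time) over (order by time) is null
then unix_timestamp('201801051100', 'yyyyMMddHHmm')/60
else time end
as minutes
from q2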
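The three rules from the question can also be checked outside Hive. The following is a minimal Python sketch, not part of the original answer; the function name `run_durations` and its parameters are hypothetical, and it assumes the rows are already sorted by time.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M"

def run_durations(rows, window_start, window_end):
    """Return (id, minutes) for each run of equal ids.

    rows: list of (time_string, id) sorted by time (assumed).
    The first run is measured from window_start, the last run up to
    window_end, and every other boundary is the start of the next run.
    """
    # Keep only the first row of each run (rows where the id changes).
    starts = []
    prev_id = None
    for time_str, id_ in rows:
        if id_ != prev_id:
            starts.append((datetime.strptime(time_str, FMT), id_))
            prev_id = id_
    if not starts:
        return []
    # Rule (1): the first run begins at the fixed window start.
    starts[0] = (datetime.strptime(window_start, FMT), starts[0][1])
    # Rules (2) and (3): a run ends where the next begins, or at window end.
    ends = [t for t, _ in starts[1:]] + [datetime.strptime(window_end, FMT)]
    return [(id_, int((end - start).total_seconds() // 60))
            for (start, id_), end in zip(starts, ends)]

rows = [
    ("201801051127", 0), ("201801051130", 0), ("201801051132", 0),
    ("201801051135", 1), ("201801051141", 1),
    ("201801051145", 0), ("201801051147", 0),
]
print(run_durations(rows, "201801051100", "201801051200"))
```

For the sample data this gives 35, 10 and 15 minutes for the three runs, which is what the three rules describe.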

Related

Determine time duration based on events without using loops

I have a table with timestamps of 5 different types of events (start, stopped, restart, aborted, and completed).
The given table looks like this:
Time    |EventID |Event
7:38:20 |1       |start
7:40:20 |2       |stopped
7:48:20 |3       |restart
7:50:20 |4       |aborted
8:00:20 |1       |start
8:40:20 |5       |completed
8:58:20 |1       |start
9:00:15 |4       |aborted
I would like to determine the following and display it:
Duration of individual Wash --> From (start or restart) to (stopped or aborted or completed)
Duration of Wash Cycle --> From (start) to (aborted or completed)
Duration of total wash time --> Sum of all individual wash in a Wash cycle
Duration of idle time --> Wash Cycle duration - total wash time duration
So the table should look something like the following:
Time    |EventID |Event     |Duration of individual Wash |Duration of Wash Cycle |Duration of total wash time |Duration of idle time
7:38:20 |1       |start     |0:02:00 |0:12:00 |0:04:00 |0:08:00
7:40:20 |2       |stopped   |NULL    |NULL    |NULL    |NULL
7:48:20 |3       |restart   |0:02:00 |NULL    |NULL    |NULL
7:50:20 |4       |aborted   |NULL    |NULL    |NULL    |NULL
8:00:20 |1       |start     |0:40:00 |0:40:00 |0:40:00 |0:00:00
8:40:20 |5       |completed |NULL    |NULL    |NULL    |NULL
8:58:20 |1       |start     |0:01:55 |0:01:55 |0:01:55 |0:00:00
9:00:15 |4       |aborted   |NULL    |NULL    |NULL    |NULL
So far I have been able to get the duration of the individual wash and the duration of the wash cycle by joining two tables (one with only start, aborted, and completed events; the other with all events). I am stuck on the last two columns. I'm not sure how to approach this problem efficiently without using a while loop or a counter of some sort. I would love some pointers.
Here is my code so far:
SELECT IndivWash.DateTimeStamp as 'Event TimeStamp'
,IndivWash.EventIDNo AS 'Event ID Number'
,IndivWash.EventDesc AS 'Event Description'
-- for the duration of the WASH ----------------------------------------------------
,CASE
WHEN (IndivWash.EventIDNo = '1' OR IndivWash.EventIDNo = '3')
AND (LEAD(IndivWash.EventIDNo) OVER (ORDER BY IndivWash.DateTimeStamp) = '2'
OR LEAD(IndivWash.EventIDNo) OVER (ORDER BY IndivWash.DateTimeStamp) = '4'
OR LEAD(IndivWash.EventIDNo) OVER (ORDER BY IndivWash.DateTimeStamp) = '5')
AND LEAD(IndivWash.EventIDNo) OVER (ORDER BY IndivWash.DateTimeStamp) <> IndivWash.EventIDNo
THEN
DATEDIFF(s, IndivWash.DateTimeStamp, LEAD(IndivWash.DateTimeStamp) OVER (ORDER BY IndivWash.DateTimeStamp))
ELSE
NULL
END AS 'Duration of individual Wash'
-- For the duration of the CYCLE ----------------------------------------------------
,CASE
WHEN WashCycle.EventIDNo = '1'
AND LEAD(WashCycle.EventIDNo) OVER (ORDER BY WashCycle.DateTimeStamp) <> WashCycle.EventIDNo
AND (LEAD(WashCycle.EventIDNo) OVER (ORDER BY WashCycle.DateTimeStamp) = '4' OR
LEAD(WashCycle.EventIDNo) OVER (ORDER BY WashCycle.DateTimeStamp) = '5')
THEN
DATEDIFF(s, WashCycle.DateTimeStamp, LEAD(WashCycle.DateTimeStamp) OVER (ORDER BY WashCycle.DateTimeStamp))
ELSE
NULL
END AS 'Duration of Wash Cycle'
-- ----------------------------------------------------
FROM (
--FROM: table with only start, abort and complete.
-- to differentiate the cycles that are not aborted
SELECT TOP (1000) DateTimeStamp
,EventIDNo
,EventDesc
/*----------CHANGE DATABASE HERE----------*/
FROM Washer.dbo.EventLog_vw
/*----------------------------------------*/
WHERE EventIDNo IN ('1','4','5')
ORDER BY DateTimeStamp
) WashCycle
RIGHT JOIN
(
--FROM: table with all five events
SELECT TOP (1000)
DateTimeStamp
,EventIDNo
,EventDesc
/*----------CHANGE DATABASE HERE----------*/
FROM Washer.dbo.EventLog_vw
/*----------------------------------------*/
WHERE EventIDNo IN ('1','2','3','4','5')
ORDER BY DateTimeStamp
) IndivWash
ON WashCycle.DateTimeStamp=IndivWash.DateTimeStamp
Try this example, which precalculates the cycle IDs CycleStartId and CycleRestartId:
SELECT *,
CASE WHEN EventID IN (1, 3) THEN
DATEDIFF(SS,
MIN(Time) OVER (PARTITION BY CycleRestartId),
MAX(Time) OVER (PARTITION BY CycleRestartId)
)
END AS DurIndSec,
CASE WHEN EventID IN (1) THEN
DATEDIFF(SS,
MIN(Time) OVER (PARTITION BY CycleStartId),
MAX(Time) OVER (PARTITION BY CycleStartId)
)
END AS DurSec,
CASE WHEN EventId = 1 THEN
-- skip the gap before each start/restart so that only wash time is summed
SUM(CASE WHEN EventId IN (1, 3) THEN 0 ELSE TimeDiff END) OVER (PARTITION BY CycleStartId)
END AS TotalWashSec,
CASE WHEN EventId = 1 THEN
SUM(COALESCE(StopIdle, 0)) OVER (PARTITION BY CycleStartId)
END AS DurIdleSec
FROM (
SELECT *,
DATEDIFF(SS, LAG(Time, 1, Time) OVER (ORDER BY Time), Time) as TimeDiff,
SUM(CASE WHEN EventID = 1 THEN 1 ELSE 0 END)
OVER (ORDER BY Time) AS CycleStartId,
SUM(CASE WHEN EventID IN (1, 3) THEN 1 ELSE 0 END)
OVER (ORDER BY Time) AS CycleRestartId,
CASE WHEN LAG(EventId, 1, EventId) OVER (ORDER BY Time) = 2 THEN
DATEDIFF(SS, LAG(Time, 1, Time) OVER (ORDER BY Time), Time)
END AS StopIdle
FROM events
) t
Here the reports are shown in seconds. If you need to format them as time, then you can use the following expression:
CONVERT(varchar(8), DATEADD(SS, <Int in seconds>, '0:00:00'), 114)
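To check the expected numbers independently of any SQL dialect, the four durations can be computed with a plain pass over the events. This is only a reference sketch in Python (it deliberately uses a loop, so it is not a replacement for the set-based query); the function name `cycle_report` and the dictionary keys are hypothetical, and the events are assumed to be sorted by time within a single day.

```python
from datetime import datetime

START, STOPPED, RESTART, ABORTED, COMPLETED = 1, 2, 3, 4, 5

def cycle_report(events):
    """events: list of (time 'H:MM:SS', event_id), sorted by time (assumed).

    Returns one dict per finished wash cycle, with durations in seconds.
    """
    cycles = []
    current = None      # cycle being built, or None between cycles
    cycle_start = None  # time of the 'start' event of the current cycle
    wash_start = None   # start of the wash currently running, if any
    for time_str, event_id in events:
        t = datetime.strptime(time_str, "%H:%M:%S")
        if event_id == START:
            current = {"start": time_str, "cycle": 0, "wash": 0, "idle": 0}
            cycle_start = t
            wash_start = t
        elif current is None:
            continue  # ignore events that precede the first start
        elif event_id == RESTART:
            wash_start = t
        elif event_id in (STOPPED, ABORTED, COMPLETED):
            if wash_start is not None:
                # An individual wash ends here; add it to the cycle total.
                current["wash"] += int((t - wash_start).total_seconds())
                wash_start = None
            if event_id in (ABORTED, COMPLETED):
                # The cycle ends; idle time is cycle time minus wash time.
                current["cycle"] = int((t - cycle_start).total_seconds())
                current["idle"] = current["cycle"] - current["wash"]
                cycles.append(current)
                current = None
    return cycles

events = [
    ("7:38:20", 1), ("7:40:20", 2), ("7:48:20", 3), ("7:50:20", 4),
    ("8:00:20", 1), ("8:40:20", 5), ("8:58:20", 1), ("9:00:15", 4),
]
for row in cycle_report(events):
    print(row)
```

For the first cycle of the sample data this yields 720 seconds of cycle time, 240 seconds of wash time and 480 seconds of idle time, following the definitions given in the question.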

How to find value in a range of following rows - SQL Teradata

I have a table with the following columns:
account, validity_date, validity_month, amount.
For each row I want to check whether the value in the field "amount" exists among the rows of the next month. If yes, the indicator is 1, otherwise 0.
account validity_date validity_month amount **required_column**
------- ------------- --------------- ------- ----------------
123 15oct2019 201910 400 0
123 20oct2019 201910 500 1
123 15nov2019 201911 1000 0
123 20nov2019 201911 500 0
123 20nov2019 201911 2000 1
123 15dec2019 201912 400
123 15dec2019 201912 2000
Can anyone help?
Thanks
validity_month/100*12 + validity_month MOD 100 calculates a month number (for comparing across years, Jan to previous Dec) and the inner ROW_NUMBER reduces multiple rows with the same amount per month to a single row (kind of DISTINCT):
SELECT dt.*
,CASE -- next row is from next month
WHEN Lead(nextMonth IGNORE NULLS)
Over (PARTITION BY account, amount
ORDER BY validity_date)
= (validity_month/100*12 + validity_month MOD 100) +1
THEN 1
ELSE 0
END
FROM
(
SELECT t.*
,CASE -- one row per account/month/amount
WHEN Row_Number()
Over (PARTITION BY account, amount, validity_month
ORDER BY validity_date ) = 1
THEN validity_month/100*12 + validity_month MOD 100
END AS nextMonth
FROM tab AS t
) AS dt
Edit:
The previous query is for exactly matching amounts. For a range match, the query is probably very hard to write with OLAP functions, but easy with a correlated subquery:
SELECT t.*
,CASE
WHEN
( -- check if there's a row in the next month matching the current amount +/- 10 percent
SELECT Count(*)
FROM tab AS t2
WHERE t2.account_ = t.account_
AND (t2.validity_month/100*12 + t2.validity_month MOD 100)
= ( t.validity_month/100*12 + t.validity_month MOD 100) +1
AND t2.amount BETWEEN t.amount * 0.9 AND t.amount * 1.1
) > 0
THEN 1
ELSE 0
END
FROM tab AS t
But then performance might be really bad...
Assuming the values are unique within a month and you have a value for each month for each account, you can simplify this to:
select t.*,
(case when lead(seqnum) over (partition by account, amount order by validity_month) = seqnum + 1
then 1 else 0
end)
from (select t.*,
dense_rank() over (partition by account order by validity_month) as seqnum
from t
) t;
Note: This puts 0 for the last month rather than NULL, but that can easily be adjusted.
You can do this without the subquery by using month arithmetic. It is not clear what the data type of validity_month is. If I assume a number:
select t.*,
(case when lead(floor(validity_month / 100) * 12 + (validity_month mod 100)
) over (partition by account, amount order by validity_month) =
floor(validity_month / 100) * 12 + (validity_month mod 100) + 1
then 1 else 0
end)
from t;
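The month arithmetic used in these answers turns a YYYYMM value into a consecutive month count, so that December and the following January differ by exactly 1. A small Python sketch of the idea, with hypothetical function names and an exact-match rule (no 10% range), assuming validity_month is an integer:

```python
def month_number(yyyymm):
    """Map a YYYYMM integer to a consecutive month count (year * 12 + month)."""
    return (yyyymm // 100) * 12 + yyyymm % 100

def next_month_flags(rows):
    """rows: list of (account, validity_month, amount).

    Returns a list of 0/1 flags in input order: 1 when the same account
    has the same amount in the following calendar month.
    """
    seen = {(acct, month_number(m), amt) for acct, m, amt in rows}
    return [1 if (acct, month_number(m) + 1, amt) in seen else 0
            for acct, m, amt in rows]

rows = [
    (123, 201910, 400), (123, 201910, 500),
    (123, 201911, 1000), (123, 201911, 500), (123, 201911, 2000),
    (123, 201912, 400), (123, 201912, 2000),
]
print(next_month_flags(rows))
print(month_number(202001) - month_number(201912))
```

Note that this sketch returns 0 for rows in the last month, where the question leaves the indicator blank.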
Just to add another way to do this using Standard SQL. This query will return 1 when the condition is met, 0 when it is not, and null when there isn't a next month to evaluate (as implied in your result column).
It is assumed that we're partitioning on the account field. Also includes a 10% range match on the amount field based on the comment made. Note that if you have an id field, you should include it (if two rows have the same account, validity_date, validity_month, amount there will only be one resulting row, due to DISTINCT).
Performance-wise, it should be similar to the answer from @dnoeth.
SELECT DISTINCT
t1.account,
t1.validity_date,
t1.validity_month,
t1.amount,
CASE
WHEN t2.amount IS NOT NULL THEN 1
WHEN MAX(t1.validity_month) OVER (PARTITION BY t1.account) > t1.validity_month THEN 0
ELSE NULL
END AS flag
FROM `project.dataset.table` t1
LEFT JOIN `project.dataset.table` t2
ON
t2.account = t1.account AND
DATE_DIFF(
PARSE_DATE("%Y%m", CAST(t2.validity_month AS STRING)),
PARSE_DATE("%Y%m", CAST(t1.validity_month AS STRING)),
MONTH
) = 1 AND
t2.amount BETWEEN t1.amount * 0.9 AND t1.amount * 1.1;

SQL Date intelligence: filtering data by seconds ran from last known valid result

Help! We're trying to create a new column (Is Valid?) that reproduces the logic below.
It is a binary result:
It is 1 if it is the first known value of an ID.
It is 1 if it is 3 seconds or more after the previous "1" of that ID.
Note 1: this is not the difference in seconds from the previous record.
It is 0 if it is less than 3 seconds after the previous "1" of that ID.
Note 2: there are many IDs in the data set.
Note 3: the original dataset has ID and Date.
Attached is a PoC of the data and the expected result.
You would have to do this using a recursive CTE, which is quite expensive:
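-- PostgreSQL-style syntax: RECURSIVE follows WITH and applies to the whole clause
with recursive tt as (
select t.*, row_number() over (partition by id order by time) as seqnum
from t
),
cte as (
select tt.*, time as grp_start
from tt
where seqnum = 1
union all
select tt.*,
-- keep the previous "1" as the anchor until 3 seconds have passed
(case when tt.time < cte.grp_start + interval '3 second'
then cte.grp_start
else tt.time
end)
from cte join
tt
on tt.id = cte.id and tt.seqnum = cte.seqnum + 1
)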
select cte.*,
(case when grp_start = lag(grp_start) over (partition by id order by time)
then 0 else 1
end) as isValid
from cte;
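The rule is easier to see as a sequential pass, which is what the recursion emulates: each row is compared with the most recent row flagged 1 for the same ID, not with the immediately preceding row. A minimal Python sketch, with a hypothetical function name `flag_valid` and the 3-second threshold taken from the question:

```python
from datetime import datetime, timedelta

def flag_valid(rows, min_gap=timedelta(seconds=3)):
    """rows: list of (id, datetime) in any order.

    Returns a list of (id, time, is_valid) sorted by id and time.
    """
    last_valid = {}  # id -> time of the most recent row flagged 1
    out = []
    for id_, t in sorted(rows):
        anchor = last_valid.get(id_)
        if anchor is None or t - anchor >= min_gap:
            # First row of this id, or at least 3 seconds after the last "1".
            last_valid[id_] = t
            out.append((id_, t, 1))
        else:
            out.append((id_, t, 0))
    return out

base = datetime(2020, 1, 1)
rows = [("a", base + timedelta(seconds=s)) for s in (0, 2, 4, 5, 7)]
rows.append(("b", base + timedelta(seconds=1)))
for id_, t, flag in flag_valid(rows):
    print(id_, t.time(), flag)
```

In this made-up sample, the row at second 4 is flagged 1 even though it is only 2 seconds after the previous record, because it is 4 seconds after the previous "1".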

How can I identify start and end of uninterrupted sequences?

I have a list of events sorted by TITLE and TIME e.g.:
TITLE |TIME
A |11:59
A |12:00
A |12:01
A |12:02
A |12:03
B |12:04
B |12:05
B |12:06
B |12:07
B |12:14
B |12:15
B |12:16
I want to calculate START and END of sequences. Sequence is a set of events in which minutes follow each other without gaps for same TITLE, e.g.:
TITLE |START |END
A |11:59 |12:03
B |12:04 |12:07
B |12:14 |12:16
Assuming all the window functions are supported, you can do this with lag and a running sum to assign groups based on a 1 minute time difference.
select title,min(time) as start_time,max(time) as end_time
from (select title,time,sum(col) over(partition by title order by time) as grp
from (select title,time,
case when time - lag(time) over(partition by title order by time) = 1
/*change this calculation for 1 minute time difference*/
then 0 else 1 end as col
from tbl
) t
) t
group by title,grp
Another way is
select title,min(time),max(time)
from (
select title,time,
time-row_number() over(partition by title order by time) as grp
/*change this calculation to subtract row_number from time*/
from tbl
) t
group by title,grp
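The second query relies on the fact that, within an uninterrupted run, the time in minutes minus the row number stays constant. A Python sketch of the same trick, assuming the times are distinct 'HH:MM' strings within one day; the function names are hypothetical:

```python
from itertools import groupby

def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def sequences(rows):
    """rows: list of (title, 'HH:MM'). Returns (title, start, end) per run."""
    ordered = sorted(rows, key=lambda r: (r[0], to_minutes(r[1])))
    result = []
    for title, title_rows in groupby(ordered, key=lambda r: r[0]):
        # Number the rows per title, like row_number() partitioned by title.
        numbered = enumerate(title_rows)
        # minutes - row_number is constant inside an uninterrupted run.
        for _, run in groupby(numbered, key=lambda p: to_minutes(p[1][1]) - p[0]):
            times = [row[1] for _, row in run]
            result.append((title, times[0], times[-1]))
    return result

rows = [("A", t) for t in ("11:59", "12:00", "12:01", "12:02", "12:03")]
rows += [("B", t) for t in ("12:04", "12:05", "12:06", "12:07",
                            "12:14", "12:15", "12:16")]
for seq in sequences(rows):
    print(seq)
```

Duplicate times for the same title would break the constant-difference property, so they would need to be removed first.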

SQL Server - Conditionally Increment a Counter

What I'm looking to do is create grouped sequences for continuous date ranges. Take the following sample data:
Person|BeginDate |EndDate
A |1/1/2015 |1/31/2015
A |2/1/2015 |2/28/2015
A |4/1/2015 |4/30/2015
A |5/1/2015 |5/31/2015
B |1/1/2015 |1/30/2015
B |8/1/2015 |8/30/2015
B |9/1/2015 |9/30/2015
If BeginDate in the current row is more than 1 day after the EndDate in the previous row, then increment the counter by 1; otherwise assign the counter's current value. The sequencing would look like:
Person|BeginDate |EndDate |Sequence
A |1/1/2015 |1/31/2015|1
A |2/1/2015 |2/28/2015|1
A |4/1/2015 |4/30/2015|2
A |5/1/2015 |5/31/2015|2
B |1/1/2015 |1/30/2015|1
B |8/1/2015 |8/30/2015|2
B |9/1/2015 |9/30/2015|2
Partitioned and reset for each person.
For your testing:
CREATE TABLE ##SequencingTest(
Person char(1)
,BeginDate date
,EndDate date)
INSERT INTO ##SequencingTest
VALUES
('A','1/1/2015','1/31/2015')
,('A','2/1/2015','2/28/2015')
,('A','4/1/2015','4/30/2015')
,('A','5/1/2015','5/31/2015')
,('B','1/1/2015','1/30/2015')
,('B','8/15/2015','8/31/2015')
,('B','9/1/2015','9/30/2015')
You can do this with lag() and then a cumulative sum:
select t.*,
sum(flag) over (partition by person order by begindate) as sequence
from (select t.*,
(case when datediff(day, lag(endDate) over (partition by person order by begindate), begindate) < 2
then 0
else 1
end) as flag
from t
) t;
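The same lag-plus-running-sum logic can be sketched in Python to check the expected counter values. The function name `assign_sequences` is hypothetical, and the sample rows are the ones from the INSERT statement in the question:

```python
from datetime import date

def assign_sequences(rows):
    """rows: list of (person, begin_date, end_date).

    Returns (person, begin_date, end_date, sequence), sorted by person
    and begin date.
    """
    out = []
    prev_person, prev_end, seq = None, None, 0
    for person, begin, end in sorted(rows):
        if person != prev_person:
            seq = 1    # the counter resets for each person
        elif (begin - prev_end).days >= 2:
            seq += 1   # gap of more than one day: start a new sequence
        out.append((person, begin, end, seq))
        prev_person, prev_end = person, end
    return out

rows = [
    ("A", date(2015, 1, 1), date(2015, 1, 31)),
    ("A", date(2015, 2, 1), date(2015, 2, 28)),
    ("A", date(2015, 4, 1), date(2015, 4, 30)),
    ("A", date(2015, 5, 1), date(2015, 5, 31)),
    ("B", date(2015, 1, 1), date(2015, 1, 30)),
    ("B", date(2015, 8, 15), date(2015, 8, 31)),
    ("B", date(2015, 9, 1), date(2015, 9, 30)),
]
for row in assign_sequences(rows):
    print(row)
```

The `>= 2` test mirrors the `datediff(...) < 2` condition in the query: a difference of one day between the previous end date and the current begin date counts as continuous.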
If the continuous end dates are always 1 day before the next start date you could do something really primitive like this:
SELECT S1.Person, S1.BeginDate, S1.EndDate, SUM(S2.Cntr) AS Sequence
FROM Sequencing S1
INNER JOIN (SELECT Person, BeginDate,
CASE WHEN EXISTS (SELECT Person FROM Sequencing S2 WHERE S2.[EndDate] =
DATEADD(d, -1, S1.[BeginDate]) AND S2.Person = S1.Person) THEN 0 ELSE 1 END AS Cntr
FROM [Sequencing] S1
) S2
ON S1.Person = S2.Person
AND S1.BeginDate >= S2.BeginDate
GROUP BY S1.Person, S1.BeginDate, S1.EndDate
ORDER BY S1.Person, S1.BeginDate, S1.EndDate
Note: I think you meant '1/31/2015' and '8/31/2015' as end dates for this to work with your example.
Also, @GordonLinoff's answer is probably better. I simply do not have a version of SQL Server to test it with.