Partition & consecutive in SQL

Fellow stackers,
I have a data set like so:
+---------+------+--------+
| user_id | date | metric |
+---------+------+--------+
|       1 |    1 |      1 |
|       1 |    2 |      1 |
|       1 |    3 |      1 |
|       2 |    1 |      1 |
|       2 |    2 |      1 |
|       2 |    3 |      0 |
|       2 |    4 |      1 |
+---------+------+--------+
I am looking to flag those customers who have 3 consecutive "1"s in the metric column. I have a solution below.
select distinct user_id
from (
    select user_id
          ,metric +
           ifnull( lag(metric, 1) OVER (PARTITION BY user_id ORDER BY date), 0 ) +
           ifnull( lag(metric, 2) OVER (PARTITION BY user_id ORDER BY date), 0 )
               as consecutive_3
    from df
) b
where consecutive_3 = 3
While it works, it is not scalable: one can imagine what the above query would look like if I were looking for 50 consecutive 1s.
May I ask if there is a scalable solution? Any cloud SQL will do. Thank you.

If you only want such users, you can use a sum(). Assuming that metric is only 0 or 1:
select user_id,
       (case when max(metric_3) = 3 then 1 else 0 end) as flag_3
from (select df.*,
             sum(metric) over (partition by user_id
                               order by date
                               rows between 2 preceding and current row
                              ) as metric_3
      from df
     ) df
group by user_id;
By using a windowing clause, you can easily expand to as many adjacent 1s as you like.
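For example, here is a minimal sketch of the 50-in-a-row version (assuming, as above, a table df in which metric is always 0 or 1); only the frame size and the target count change:

select user_id,
       (case when max(metric_50) = 50 then 1 else 0 end) as flag_50
from (select df.*,
             sum(metric) over (partition by user_id
                               order by date
                               rows between 49 preceding and current row
                              ) as metric_50
      from df
     ) df
group by user_id;

The frame is row-based, so this counts 50 consecutive rows per user in date order, exactly as the 3-row version does.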

Related

Get some values from the table by selecting

I have a table:
| id | Number | Address |
| -- | ------ | ------- |
| 1  | 0      | NULL    |
| 1  | 1      | NULL    |
| 1  | 2      | 50      |
| 1  | 3      | NULL    |
| 2  | 0      | 10      |
| 3  | 1      | 30      |
| 3  | 2      | 20      |
| 3  | 3      | 20      |
| 4  | 0      | 75      |
| 4  | 1      | 22      |
| 4  | 2      | 30      |
| 5  | 0      | NULL    |
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh.number =
  (select max(min(t.number)) from table t where t.id = dh.id group by t.address)
But this select does not correctly handle the case when the address first changes and then changes back to a previous value. For example, for id = 1 the group by returns:
| Number |
| ------ |
| NULL   |
| 50     |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
             row_number() over (partition by id order by number desc) as seqnum1,
             row_number() over (partition by id, address order by number desc) as seqnum2
      from t
     ) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
Once per id.
Once per id and address.
The two sequence numbers are equal only on the trailing run of rows that share the most recent address. Aggregation then pulls back the earliest row in this group, which is the number where the last address change happened.
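As a quick illustration (a trace worked out from the sample data above, not extra query output), here is what the two sequence numbers look like for id = 3:

| id | Number | Address | seqnum1 | seqnum2 | equal? |
| -- | ------ | ------- | ------- | ------- | ------ |
| 3  | 3      | 20      | 1       | 1       | yes    |
| 3  | 2      | 20      | 2       | 2       | yes    |
| 3  | 1      | 30      | 3       | 1       | no     |

min(number) over the matching rows is 2, which is exactly where the last address change (30 -> 20) occurred.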
I answered my question myself; if anyone needs it, here is my solution:
select * from table dh1 where dh1.number = (
    select max(x.number)
    from (
        select dh2.id, dh2.number, dh2.address,
               lag(dh2.address) over (order by dh2.number asc) as prev
        from table dh2 where dh1.id = dh2.id
    ) x
    where NVL(x.address, 0) <> NVL(x.prev, 0)
);
The lag() pairs each row with the previous address (NVL treats NULL as 0, so NULL-to-value transitions count as changes), and max(x.number) picks the latest row where the address actually changed.

How to transform a range of records to the values of the record after that range in SQL?

I am trying to replace some bad input records within a specific date range with correct records. However, I'm not sure there is an efficient way to do so. Therefore my question is: how do I transform a (static) range of records to the values of the record after that range in SQL? Below you will find an example to clarify what I am trying to achieve.
In this example you can see that customer number 1 belongs to group number 0 (None) in the period from 25-06-2020 to 29-06-2020. From 30-06-2020 to 05-07-2020 this group number changes from 0 to 11 for customer number 1. This static period contains the wrong records, and should be changed to the values that are valid on 06-07-2020 (group number == 10). Is there a way to do this?
If I understand correctly, you can use window functions to get the data on that particular date and case logic to assign it to the specific date range:
select t.*,
       (case when date >= '2020-07-01' and date <= '2020-07-05'
             then max(case when date = '2020-07-06' then group_number end) over
                      (partition by customer_number)
             else group_number
        end) as imputed_group_number,
       (case when date >= '2020-07-01' and date <= '2020-07-05'
             then max(case when date = '2020-07-06' then role end) over
                      (partition by customer_number)
             else role
        end) as imputed_role
from t;
If you want to update the values in the bad date range, you can use a join back to the same table (aliased tt below):
update t
    set group_number = tt.group_number,
        role = tt.role
from t tt
where tt.customer_number = t.customer_number and
      tt.date = '2020-07-06' and
      t.date >= '2020-07-01' and t.date <= '2020-07-05';
I think that window function first_value() does what you want:
select
date,
customer_number,
first_value(group_number) over(partition by customer_number order by date) group_number,
first_value(role) over(partition by customer_number order by date) role
from mytable
You can do the following as an example. Here I have chosen the criterion that if role = 'Leader' it is a bad record, and therefore you apply the next available group_number (in column group_number1) and role (in role1).
I have used a smaller subset of the rows from your Excel example.
select date1
      ,customer_number
      ,group_number
      ,case when role = 'Leader' then
            (select t1.group_number
             from t t1
             where t1.date1 > t.date1
               and t1.role <> 'Leader'
             order by t1.date1 asc
             limit 1)
       else group_number
       end as group_number1
      ,role
      ,case when role = 'Leader' then
            (select t1.role
             from t t1
             where t1.date1 > t.date1
               and t1.role <> 'Leader'
             order by t1.date1 asc
             limit 1)
       else role
       end as role1
from t
order by 1
+------------+-----------------+--------------+---------------+--------+--------+
| DATE1      | CUSTOMER_NUMBER | GROUP_NUMBER | GROUP_NUMBER1 | ROLE   | ROLE1  |
+------------+-----------------+--------------+---------------+--------+--------+
| 2020-06-25 | 1               | 0            | 0             | None   | None   |
| 2020-06-26 | 1               | 0            | 0             | None   | None   |
| 2020-06-27 | 1               | 0            | 0             | None   | None   |
| 2020-06-28 | 1               | 0            | 0             | None   | None   |
| 2020-06-29 | 1               | 0            | 0             | None   | None   |
| 2020-06-30 | 1               | 11           | 10            | Leader | Member |
| 2020-07-01 | 1               | 11           | 10            | Leader | Member |
| 2020-07-06 | 1               | 10           | 10            | Member | Member |
+------------+-----------------+--------------+---------------+--------+--------+
db fiddle link: https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=c95d12ced067c1df94947848b5a94c14

Single query to split out data of one column, into two columns, from the same table based on different criteria [SQL]

I have the following data in a table. This is a single column from a table that has multiple columns, but only data from this column needs to be pulled into a two-column output using a query:
+----------------+
| DataText       |
+----------------+
| 1 DEC20 DDD    |
| 1 JUL20 DDD    |
| 1 JAN21 DDD    |
| 1 JUN20 DDD500 |
| 1 JUN20 DDD500 |
| 1 JUN20DDDD500 |
| 1 JUN20DDDD500 |
| 1 JUL20 DDD800 |
| 1 JUL20 DDD800 |
| 1 JUL20DDDD800 |
| 1 JUL20DDDD400 |
| 1 JUL20DDDD400 |
+----------------+
Required result: distinct values based on the first 13 characters of the data, split into two columns, "short data" and "long data", but only giving the first 13 characters in the output for both columns:
+-------------+-------------+
| ShortData   | LongData    |
+-------------+-------------+
| 1 DEC20 DDD | 1 JUN20 DDD |
| 1 JUL20 DDD | 1 JUN20DDDD |
| 1 JAN21 DDD | 1 JUL20 DDD |
|             | 1 JUL20DDDD |
+-------------+-------------+
Something like:
Select
    (Select DISTINCT LEFT(DataText, 13)
     From myTable
     Where LEN(DataText) = 13) As ShortData,
    (Select DISTINCT LEFT(DataText, 13)
     From myTable
     Where LEN(DataText) > 13) As LongData
I would also like to query/"scan" the table only once if possible. I can't get any of the SO examples modified to make such a query work.
This is quite ugly, but doable. As a starter, you need a column that defines the order of the rows - I assumed that you have such a column, and that is called id.
Then you can select the distinct texts, put them in separate groups depending on their length, and finally pivot:
select
    max(case when grp = 0 then dataText end) shortData,
    max(case when grp = 1 then dataText end) longData
from (
    select
        dataText,
        grp,
        row_number() over(partition by grp order by id) rn
    from (
        select
            id,
            case when len(dataText) <= 13 then 0 else 1 end grp,
            substring(dataText, 1, 13) dataText
        from (select min(id) id, dataText from mytable group by dataText) t
    ) t
) t
group by rn
If you are content with ordering the records by the string column itself, it is a bit simpler (and, for your sample data, it produces the same results):
select
    max(case when grp = 0 then dataText end) shortData,
    max(case when grp = 1 then dataText end) longData
from (
    select
        dataText,
        grp,
        row_number() over(partition by grp order by dataText) rn
    from (
        select distinct
            case when len(dataText) <= 13 then 0 else 1 end grp,
            substring(dataText, 1, 13) dataText
        from mytable
    ) t
) t
group by rn
Demo on DB Fiddle:
shortData | longData
:---------- | :------------
1 DEC20 DDD | 1 JUL20 DDD80
1 JAN21 DDD | 1 JUL20DDDD40
1 JUL20 DDD | 1 JUL20DDDD80
null | 1 JUN20 DDD50
null | 1 JUN20DDDD50

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of IDs and return the calculated value(s) for each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
|  1 |     8091 |   0.9  |
|  1 |     8092 |  20    |
|  1 |     8095 |   0.22 |
|  1 |     8096 |   0.23 |
|  1 |     8098 |   0.23 |
|  2 |     8095 |  12    |
|  2 |     8096 |  18    |
|  2 |     8097 |   3    |
|  2 |     8098 |   0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
|  1 |                             1 |
|  1 |                             2 |
|  1 |                             1 |
|  2 |                             1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not carry over to the next value, as with ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum classifying rows into groups: increment the group counter whenever a row cannot continue a run, i.e. when credit >= 1 or when date_key is not exactly one more than the previous date_key. Thereafter it is just a group by.
select id, count(*) as range_days_credit_lt_1
from (select t.*
            ,sum(case when credit < 1 and date_key = prev_date_key + 1
                      then 0 else 1 end)
                 over(partition by id order by date_key) as grp
      from (select t.*
                  ,lag(date_key) over(partition by id order by date_key) as prev_date_key
            from tbl t
           ) t
     ) t
where credit < 1
group by id, grp
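A quick trace of the inner query for ID 1 (worked out from the sample data above, not extra output) shows how grp isolates each run:

+----+----------+--------+-----+
| ID | DATE_KEY | CREDIT | grp |
+----+----------+--------+-----+
|  1 |     8091 |   0.9  |  1  |
|  1 |     8092 |  20    |  2  |
|  1 |     8095 |   0.22 |  3  |
|  1 |     8096 |   0.23 |  3  |
|  1 |     8098 |   0.23 |  4  |
+----+----------+--------+-----+

After keeping only credit < 1 rows and grouping by id, grp, the counts are 1, 2 and 1 as required; the date gap before 8098 is what pushes it into its own group.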
The key is to collapse each consecutive sequence and compute its length. I struggled a bit and achieved this in a relatively clumsy way:
with t_test as
(
    select num, row_number() over (order by num) as rn
    from
    (
        select explode(array(1, 3, 4, 5, 6, 9, 10, 15)) as num
    ) nums
)
select length(sign) + 1 as run_length
from
(
    select explode(continue_sign) as sign
    from
    (
        select split(concat_ws('', collect_list(if(d > 1, 'v', cast(d as string)))), 'v') as continue_sign
        from
        (
            select t0.num - t1.num as d
            from t_test t0
            join t_test t1 on t0.rn = t1.rn + 1
        ) diffs
    ) s
) runs
Get the previous number b in the sequence for each original a;
Check whether a - b == 1, which shows whether there is a "gap" (marked as 'v');
Merge all the a - b values into one string, then split on 'v' and compute the lengths.
To get the ID column out as well, another string encoding the id would have to be built along the same lines.
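For the sample array, the intermediate values work out as follows (a worked trace, not extra query output):

num:    1   3   4   5   6   9   10   15
d:          2   1   1   1   3    1    5     -- t0.num - t1.num for adjacent rows
string: 'v111v1v'                           -- each d > 1 becomes 'v'
split:  ['', '111', '1', '']                -- split on 'v'
length(sign) + 1: 1, 4, 2, 1                -- run lengths of {1}, {3,4,5,6}, {9,10}, {15}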

Summing dates across multiple rows in SQL?

We have a Table that stores alarms for certain SetPoints in our system. I'm attempting to write a query that first gets the difference between two dates (spread across two rows), and then sums all of the date differences to get a total sum for the amount of time the setpoint was in alarm.
We have another database where I've accomplished something similar, but in that case both the startTime and endTime were in the same row. Here that approach is not adequate.
Some example Data
| Row | TagID | SetPointID | EventLogTime            | InAlarm |
----------------------------------------------------------------
|  1  |   1   |     2      | 2016-01-01 01:49:18.070 |    1    |
|  2  |   1   |     1      | 2016-01-01 03:23:39.970 |    1    |
|  3  |   1   |     2      | 2016-01-01 03:23:40.070 |    0    |
|  4  |   1   |     1      | 2016-01-01 08:04:01.260 |    0    |
|  5  |   1   |     2      | 2016-01-01 08:04:01.370 |    1    |
|  6  |   1   |     1      | 2016-01-01 11:40:36.367 |    1    |
|  7  |   1   |     2      | 2016-01-01 11:40:36.503 |    0    |
|  8  |   1   |     1      | 2016-01-01 13:00:30.263 |    0    |
Results
| TagID | SetPointID | TotalTimeInAlarm |
-----------------------------------------
|   1   |     1      | 6.004443 (hours) |
|   1   |     2      | 5.182499 (hours) |
Essentially, what I need to do is get the start time and end time for each tag and each setpoint, then get the total time in alarm. I'm thinking CTEs might be able to help, but I'm not sure.
I believe the pseudo query logic would be similar to
Declare @startTime DATETIME, @endTime DATETIME
SELECT TagID,
       SetPointID,
       ABS(First Occurrence of InAlarm = True (since last occurrence WHERE InAlarm = False)
           - First Occurrence of InAlarm = False (since last occurrence WHERE InAlarm = True))
       -- IF no InAlarm = False use @endTime.
GROUP BY TagID, SetPointID
You can use the LEAD windowed function (or LAG) to do this pretty easily. This assumes the rows always come in 1-0-1-0 pairs for InAlarm; if that doesn't happen, it will throw things off, and you would need business rules for those situations in any event.
;WITH CTE_Timespans AS
(
SELECT
TagID,
SetPointID,
InAlarm,
EventLogTime,
LEAD(EventLogTime, 1) OVER (PARTITION BY TagID, SetPointID ORDER BY EventLogTime) AS EndingEventLogTime
FROM
My_Table
)
SELECT
TagID,
SetPointID,
SUM(DATEDIFF(SS, EventLogTime, EndingEventLogTime))/3600.0 AS TotalTime
FROM
CTE_Timespans
WHERE
InAlarm = 1
GROUP BY
TagID,
SetPointID
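If the last event for a tag/setpoint can be an open alarm (the "no InAlarm = False" case from the question), one option is to substitute a cutoff time for the missing LEAD value. A minimal sketch, assuming a @endTime variable holding the end of the reporting window (the literal below is only an example):

DECLARE @endTime DATETIME = '2016-01-02 00:00:00';  -- assumed cutoff, pick your own

;WITH CTE_Timespans AS
(
    SELECT
        TagID,
        SetPointID,
        InAlarm,
        EventLogTime,
        -- fall back to the cutoff when no later event exists
        COALESCE(LEAD(EventLogTime, 1) OVER (PARTITION BY TagID, SetPointID
                                             ORDER BY EventLogTime),
                 @endTime) AS EndingEventLogTime
    FROM
        My_Table
)
SELECT
    TagID,
    SetPointID,
    SUM(DATEDIFF(SS, EventLogTime, EndingEventLogTime))/3600.0 AS TotalTime
FROM
    CTE_Timespans
WHERE
    InAlarm = 1
GROUP BY
    TagID,
    SetPointID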
One easy way is to use OUTER APPLY to get the next date that is not InAlarm
SELECT mt.TagID,
mt.SetPointID,
SUM(DATEDIFF(ss,mt.EventLogTime,oa.EventLogTime)) / 3600.0 AS [TotalTimeInAlarm]
FROM MyTable mt
OUTER APPLY (SELECT MIN([EventLogTime]) EventLogTime
FROM MyTable mt2
WHERE mt.TagID = mt2.TagID
AND mt.SetPointID = mt2.SetPointID
AND mt2.EventLogTime > mt.EventLogTime
AND InAlarm = 0
) oa
WHERE mt.InAlarm = 1
GROUP BY mt.TagID,
mt.SetPointID
LEAD() might perform better if using MSSQL 2012+
In SQL Server 2014+:
SELECT tagId, setPointId, SUM(DATEDIFF(second, pt, eventLogTime)) / 3600. AS diff
FROM (
SELECT *,
LAG(inAlarm) OVER (PARTITION BY tagId, setPointId ORDER BY eventLogTime, row) ppa,
LAG(eventLogTime) OVER (PARTITION BY tagId, setPointId ORDER BY eventLogTime, row) pt
FROM (
SELECT LAG(inAlarm) OVER (PARTITION BY tagId, setPointId ORDER BY eventLogTime, row) pa,
*
FROM mytable
) q
WHERE EXISTS
(
SELECT pa
EXCEPT
SELECT inAlarm
)
) q
WHERE ppa = 1
AND inAlarm = 0
GROUP BY
tagId, setPointId
The inner WHERE filters out consecutive events with the same alarm state, keeping only the transitions; the outer WHERE then keeps each alarm-clearing transition, whose pt is the time the alarm was raised.
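The EXISTS (SELECT pa EXCEPT SELECT inAlarm) predicate is a null-safe inequality test: EXCEPT treats NULLs as equal, so the row survives whenever pa and inAlarm differ, including when pa is NULL (the first event in a partition). A sketch of the same filter written out explicitly against the same mytable:

SELECT *
FROM (
    SELECT LAG(inAlarm) OVER (PARTITION BY tagId, setPointId ORDER BY eventLogTime, row) pa,
           *
    FROM mytable
) q
WHERE pa <> inAlarm
   OR (pa IS NULL AND inAlarm IS NOT NULL)
   OR (pa IS NOT NULL AND inAlarm IS NULL)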