find the consecutive values in impala - sql

I have the data set below with ID, Date and Value. I want to flag every ID that has the value 0 on three consecutive days.
id date value
1 8/10/2021 1
1 8/11/2021 0
1 8/12/2021 0
1 8/13/2021 0
1 8/14/2021 5
2 8/10/2021 2
2 8/11/2021 3
2 8/12/2021 0
2 8/13/2021 0
2 8/14/2021 6
3 8/10/2021 3
3 8/11/2021 4
3 8/12/2021 0
3 8/13/2021 0
3 8/14/2021 0
Output
id date value Flag
1 8/10/2021 1 Y
1 8/11/2021 0 Y
1 8/12/2021 0 Y
1 8/13/2021 0 Y
1 8/14/2021 5 Y
2 8/10/2021 2 N
2 8/11/2021 3 N
2 8/12/2021 0 N
2 8/13/2021 0 N
2 8/14/2021 6 N
3 8/10/2021 3 Y
3 8/11/2021 4 Y
3 8/12/2021 0 Y
3 8/13/2021 0 Y
3 8/14/2021 0 Y
Thank you.

Using the window count() function, you can count the 0s in the frame [current row, 2 following] (ordered by date), so a three-row frame is evaluated for each row:
count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following) cnt
If cnt equals 3, three consecutive 0s were found, and the case expression produces Y for each row with cnt=3: case when cnt=3 then 'Y' else 'N' end. Note that the frame counts consecutive rows, so it matches consecutive days only if each id has exactly one row per day.
To propagate the 'Y' flag to the whole id group, use max(...) over (partition by id).
Demo with your data example (tested on Hive):
with mydata as (--Data example, dates converted to sortable format yyyy-MM-dd
select 1 id,'2021-08-10' date_, 1 value union all
select 1,'2021-08-11',0 union all
select 1,'2021-08-12',0 union all
select 1,'2021-08-13',0 union all
select 1,'2021-08-14',5 union all
select 2,'2021-08-10',2 union all
select 2,'2021-08-11',3 union all
select 2,'2021-08-12',0 union all
select 2,'2021-08-13',0 union all
select 2,'2021-08-14',6 union all
select 3,'2021-08-10',3 union all
select 3,'2021-08-11',4 union all
select 3,'2021-08-12',0 union all
select 3,'2021-08-13',0 union all
select 3,'2021-08-14',0
) --End of data example, use your table instead of this CTE
select id, date_, value,
max(case when cnt=3 then 'Y' else 'N' end) over (partition by id) flag
from
(
select id, date_, value,
count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following ) cnt
from mydata
)s
order by id, date_ --remove ordering if not necessary
--added it to get result in the same order
Result:
id date_ value flag
1 2021-08-10 1 Y
1 2021-08-11 0 Y
1 2021-08-12 0 Y
1 2021-08-13 0 Y
1 2021-08-14 5 Y
2 2021-08-10 2 N
2 2021-08-11 3 N
2 2021-08-12 0 N
2 2021-08-13 0 N
2 2021-08-14 6 N
3 2021-08-10 3 Y
3 2021-08-11 4 Y
3 2021-08-12 0 Y
3 2021-08-13 0 Y
3 2021-08-14 0 Y
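For readers who want to check the sliding-frame logic outside the database, here is a minimal plain-Python sketch of the same idea. It is not Impala or Hive code; the function name flag_ids and the run_length parameter are made up for illustration, and dates are assumed to be in a sortable format such as yyyy-MM-dd.

```python
from collections import defaultdict


def flag_ids(rows, run_length=3):
    """Return {id: 'Y' or 'N'} for rows given as (id, date, value) tuples.

    Hypothetical helper: mirrors count(...) over (rows between current row
    and 2 following) followed by max(...) over (partition by id).
    """
    by_id = defaultdict(list)
    for id_, date_, value in rows:
        by_id[id_].append((date_, value))

    flags = {}
    for id_, items in by_id.items():
        items.sort(key=lambda item: item[0])  # order by date within the id
        values = [value for _, value in items]
        # cnt for each row: zeros in the frame [current row, run_length - 1 following]
        cnts = [
            sum(1 for v in values[i:i + run_length] if v == 0)
            for i in range(len(values))
        ]
        # propagate the flag to the whole id
        flags[id_] = "Y" if any(c == run_length for c in cnts) else "N"
    return flags
```

Like the SQL, this counts consecutive rows rather than calendar days, so it assumes one row per id per day.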

You can identify the ids by comparing lag()s. Then spread the value across all rows. The following gets the flag on the third 0:
select t.*,
(case when value = 0 and prev_value_date_2 = prev_date_2
then 'Y' else 'N'
end) as flag_on_row
from (select t.*,
lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
lag(date, 2) over (partition by id order by date) as prev_date_2
from t
) t;
The above logic uses lag() so it is easy to extend to longer streaks of 0s. The "2" is looking two rows behind, so if the lagged values are the same, then there are three rows in a row with the same value.
And to spread the value:
select t.*, max(flag_on_row) over (partition by id) as flag
from (select t.*,
(case when value = 0 and prev_value_date_2 = prev_date_2
then 'Y' else 'N'
end) as flag_on_row
from (select t.*,
lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
lag(date, 2) over (partition by id order by date) as prev_date_2
from t
) t
) t;
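The two-lag comparison can be traced by hand with a small plain-Python sketch. This is only an illustration of the logic, not SQL; the function name flag_on_row and the back parameter are hypothetical, and dates are assumed to sort correctly as given.

```python
def flag_on_row(rows, back=2):
    """Flag the row that completes a streak of back + 1 zeros.

    rows: (id, date, value) tuples. Mirrors comparing
    lag(date, 2) over (partition by value, id order by date) with
    lag(date, 2) over (partition by id order by date).
    """
    dates_by_id = {}        # id -> dates already seen for that id
    dates_by_id_value = {}  # (id, value) -> dates already seen for that pair
    out = []
    for id_, date_, value in sorted(rows, key=lambda r: (r[0], r[1])):
        seen_id = dates_by_id.setdefault(id_, [])
        seen_value = dates_by_id_value.setdefault((id_, value), [])
        # lag(date, back): the date `back` rows earlier, or None if there is none
        prev_date = seen_id[-back] if len(seen_id) >= back else None
        prev_value_date = seen_value[-back] if len(seen_value) >= back else None
        hit = value == 0 and prev_date is not None and prev_date == prev_value_date
        out.append((id_, date_, value, "Y" if hit else "N"))
        seen_id.append(date_)
        seen_value.append(date_)
    return out
```

If both lags point at the same date, the two preceding rows of the id carry the same value as the current row, which is why only the last row of each three-row streak is flagged here.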

Related

Finding Latest First x among consecutive x from table

I am trying to write a query that finds the first date of the latest run of 1s in each group, as shown below. For example, for Group 1 it shouldn't be 1/2/2022, since a later run starts at 1/6/2022. It shouldn't be 1/7/2022 for Group 1 either.
Please let me know if you have any idea.
Thanks!
Table x (AsOfDate, Group_Id, Value)
AsOfDate Group_Id Value
1/1/2022 1 0
1/1/2022 2 1
1/2/2022 1 1
1/2/2022 2 1
1/3/2022 1 0
1/3/2022 2 0
1/4/2022 1 0
1/4/2022 2 0
1/5/2022 1 0
1/5/2022 2 1
1/6/2022 1 1
1/6/2022 2 0
1/7/2022 1 1
1/7/2022 2 0
Output
AsOfDate Group_Id
1/6/2022 1
1/5/2022 2
What you want is to find the earliest date of the last group of continuous rows with Value = 1.
Use the LAG() window function to mark the rows where Value changes; the cumulative sum of those marks (grp) identifies each continuous group.
Use dense_rank() ordered by grp descending to find the latest group (r = 1).
Use min() to get the "first" AsOfDate.
select AsOfDate = min(AsOfDate),
Group_Id
from
(
select *, r = dense_rank() over (partition by Group_Id, Value
order by grp desc)
from
(
select *, grp = sum(g) over (partition by Group_Id order by AsOfDate)
from
(
select *, g = case when Value <> lag(Value) over (partition by Group_Id
order by AsOfDate)
then 1
else 0
end
from x
) x
) x
) x
where Value = 1
and r = 1
group by Group_Id
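As a cross-check of the steps above, the same result can be sketched in plain Python with itertools.groupby. The function name first_date_of_latest_run is hypothetical, and the sketch assumes AsOfDate values that sort chronologically (for example ISO strings or date objects).

```python
from itertools import groupby


def first_date_of_latest_run(rows, target=1):
    """Return {group_id: first AsOfDate of the latest run of `target`}.

    rows: (as_of_date, group_id, value) tuples with sortable dates.
    """
    result = {}
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # by group, then date
    for group_id, group_rows in groupby(ordered, key=lambda r: r[1]):
        # Split the group's rows into runs of equal consecutive values.
        for value, run in groupby(group_rows, key=lambda r: r[2]):
            if value == target:
                # Later runs overwrite earlier ones, so the last run wins.
                result[group_id] = next(run)[0]
    return result
```

Groups that never contain the target value are simply absent from the result, which matches the where Value = 1 filter in the query.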

count zeros between 1s in same column

I've data like this.
ID IND
1 0
2 0
3 1
4 0
5 1
6 0
7 0
I want to count the zeros before the value 1. So that, the output will be like below.
ID IND OUT
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
Is it possible without pl/sql? I tried to find the differences between row numbers but couldn't achieve it.
The match_recognize clause, introduced in Oracle 12.1, can make quick work of such "row pattern recognition" problems. The solution is just a bit complex due to the special treatment of a "last row" with IND = 0, but it is straightforward otherwise.
As usual, the with clause is not part of the solution; I include it to test the query. Remove it and use your actual table and column names.
with
inputs (id, ind) as (
select 1, 0 from dual union all
select 2, 0 from dual union all
select 3, 1 from dual union all
select 4, 0 from dual union all
select 5, 1 from dual union all
select 6, 0 from dual union all
select 7, 0 from dual
)
select id, ind, out
from inputs
match_recognize(
order by id
measures case classifier() when 'Z' then 0
when 'O' then count(*) - 1
else count(*) end as out
all rows per match
pattern ( Z* ( O | X ) )
define Z as ind = 0, O as ind != 0
);
ID IND OUT
---------- ---------- ----------
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
You can treat this as a gaps-and-islands problem. You can define the "islands" by the number of "1"s on or after each row. Then use a window function:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then sum(1 - ind) over (partition by grp)
else 0
end) as num_zeros
from (select t.*,
sum(ind) over (order by id desc) as grp
from t
) t;
If id is sequential with no gaps, you can do this without a subquery:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
-- subtract ind so that a row with ind = 1 does not count itself
then id - coalesce(lag(case when ind = 1 then id end ignore nulls) over (order by id), min(id) over () - 1) - ind
else 0
end) as num_zeros
from t;
I would suggest removing the case conditions and just using the then clause for the expression, so the value is on all rows.
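The expected OUT column can also be reproduced with a single pass and a running counter, which may help when checking the queries above. This is a plain-Python sketch, not Oracle SQL; the function name zeros_before_ones is made up, and the input is assumed to be the IND values already ordered by ID.

```python
def zeros_before_ones(inds):
    """For each row, emit the number of zeros since the previous 1.

    The count is emitted only on rows where ind == 1 and on the last row;
    every other row gets 0, as in the expected output.
    """
    out = []
    zeros = 0
    last = len(inds) - 1
    for i, ind in enumerate(inds):
        if ind == 0:
            zeros += 1
        if ind == 1 or i == last:
            out.append(zeros)
            zeros = 0  # start counting again after a 1
        else:
            out.append(0)
    return out
```

For the sample data, zeros_before_ones([0, 0, 1, 0, 1, 0, 0]) gives [0, 0, 2, 0, 1, 0, 2].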

Select top rows until value in specific column has appeared twice

I have the following query where I am trying to select all records, ordered by date, until the second time EmailApproved = 1 is found. The second record where EmailApproved = 1 should not be selected.
create table #Test (id int, EmailApproved bit, Created datetime);
insert into #Test (id, EmailApproved, Created)
values
(1,0,'2011-03-07 03:58:58.423')
, (2,0,'2011-02-21 04:55:52.103')
, (3,0,'2011-01-29 13:24:02.103')
, (4,1,'2010-10-12 14:41:54.217')
, (5,0,'2010-10-12 14:34:15.903')
, (6,0,'2010-10-12 10:10:19.123')
, (7,1,'2010-08-27 12:07:16.073')
, (8,1,'2010-08-25 12:15:49.413')
, (9,0,'2010-08-25 12:14:51.970')
, (10,1,'2010-04-12 16:43:44.777');
select *
, case when Row1 = Row2 then 1 else 0 end Row1EqualsRow2
from (
select id, EmailApproved, Created
, row_number() over (partition by EmailApproved order by Created desc) Row1
, row_number() over (order by Created desc) Row2
from #Test
) X
--where Row1 = Row2
order by Created desc;
Which produces the following results:
id EmailApproved Created Row1 Row2 Row1EqualsRow2
1 0 2011-03-07 03:58:58.423 1 1 1
2 0 2011-02-21 04:55:52.103 2 2 1
3 0 2011-01-29 13:24:02.103 3 3 1
4 1 2010-10-12 14:41:54.217 1 4 0
5 0 2010-10-12 14:34:15.903 4 5 0
6 0 2010-10-12 10:10:19.123 5 6 0
7 1 2010-08-27 12:07:16.073 2 7 0
8 1 2010-08-25 12:15:49.413 3 8 0
9 0 2010-08-25 12:14:51.970 6 9 0
10 1 2010-04-12 16:43:44.777 4 10 0
What I actually want is:
id EmailApproved Created Row1 Row2 Row1EqualsRow2
1 0 2011-03-07 03:58:58.423 1 1 1
2 0 2011-02-21 04:55:52.103 2 2 1
3 0 2011-01-29 13:24:02.103 3 3 1
4 1 2010-10-12 14:41:54.217 1 4 0
5 0 2010-10-12 14:34:15.903 4 5 0
6 0 2010-10-12 10:10:19.123 5 6 0
Note: Row1, Row2 & Row1EqualsRow2 are just working columns to show my calculations.
Steps:
1. Create a row number, rn, over all rows in case id is not in sequence.
2. Create a row number, approv_rn, partitioned by EmailApproved, so we know when EmailApproved = 1 occurs for the second time.
3. Use an outer apply to find the row number of the second instance of EmailApproved = 1.
4. In the where clause, filter out all rows where the row number is >= the value found in step 3.
If fewer than two EmailApproved = 1 records exist, the outer apply returns null, in which case all available rows are returned.
with test as
(
select *,
rn = row_number() over (order by Created desc),
approv_rn = row_number() over (partition by EmailApproved
order by Created desc)
from #Test
)
select *
from test t
outer apply
(
select x.rn
from test x
where x.EmailApproved = 1
and x.approv_rn = 2
) x
where t.rn < x.rn or x.rn is null
order by t.Created desc;
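The cut-off rule itself is easy to state procedurally, so here is a small plain-Python sketch of it for comparison. It is not T-SQL; the function name rows_before_second_approved is hypothetical, and Created is assumed to be a sortable value such as an ISO timestamp string.

```python
def rows_before_second_approved(rows):
    """Return rows, newest first, up to but excluding the second approved row.

    rows: (id, email_approved, created) tuples. If fewer than two approved
    rows exist, every row is returned.
    """
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)  # Created desc
    kept = []
    approved_seen = 0
    for row in ordered:
        if row[1] == 1:
            approved_seen += 1
            if approved_seen == 2:
                break  # the second approved row is not selected
        kept.append(row)
    return kept
```

With the sample data this keeps ids 1 through 6, the same rows as the desired result in the question.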

Can I start a new group when value changes from 0 to 1?

Can I somehow assign a new group to a row when a value in a column changes in T-SQL?
I would be grateful for a solution that works for an unlimited number of repeating values, without CTEs and functions. I made a solution that works for up to 100 consecutive identical numbers (with
coalesce(lag() over(), lag() over(), lag() over())), but it is too bulky,
and I cannot make a solution for an unlimited number of consecutive identical numbers.
Data
id somevalue
1 0
2 1
3 1
4 0
5 0
6 1
7 1
8 1
9 0
10 0
11 1
12 0
13 1
14 1
15 0
16 0
Expected
id somevalue group
1 0 1
2 1 2
3 1 2
4 0 3
5 0 3
6 1 4
7 1 4
8 1 4
9 0 5
10 0 5
11 1 6
12 0 7
13 1 8
14 1 8
15 0 9
16 0 9
If you just want a group identifier, you can use:
select t.*,
min(id) over (partition by somevalue, seqnum - seqnum_1) as grp
from (select t.*,
row_number() over (order by id) as seqnum,
row_number() over (partition by somevalue order by id) as seqnum_1
from t
) t;
If you want them enumerated . . . well, you can enumerate the id above using dense_rank(). Or you can use lag() and a cumulative sum:
select t.*,
sum(case when somevalue = prev_sv then 0 else 1 end) over (order by id) as grp
from (select t.*,
lag(somevalue) over (order by id) as prev_sv
from t
) t;
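The lag() plus cumulative sum idea amounts to "add 1 whenever the value differs from the previous row". A plain-Python sketch of that rule follows; it is only an illustration, and the names assign_groups and _MISSING are made up. The input is assumed to be the somevalue column already ordered by id.

```python
_MISSING = object()  # stands in for the NULL that lag() returns on the first row


def assign_groups(values):
    """Return a running group number that increases on every value change."""
    groups = []
    grp = 0
    prev = _MISSING
    for value in values:
        if value != prev:  # first row or a change: start a new group
            grp += 1
        groups.append(grp)
        prev = value
    return groups
```

Because only the previous value is remembered, the length of a run of identical numbers does not matter.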
Here's a different approach:
First I created a view to provide the group increment on each row:
create view increments as
select
n2.id,n2.somevalue,
case when n1.somevalue=n2.somevalue then 0 else 1 end as increment
from
(select 0 as id,1 as somevalue union all select * from mytable) n1
join mytable n2
on n2.id = n1.id+1
Then I used this view to produce the group values as cumulative sums of the increments:
select id, somevalue,
(select sum(increment) from increments i1 where i1.id <= i2.id)
from increments i2

Adjusting table based on previous values in BigQuery

I have a table that looks like below:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|0| 0
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 0
2 |3/1/16|1| 0
3 |3/1/16|2| 0
I'm trying to make it so that flag is populated if X=2 in the PREVIOUS month. As such, it should look like this:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|2| 1
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 1
2 |3/1/16|1| 0
3 |3/1/16|2| 1
I use this in SQL:
`select ID, date, X, flag into Work_Table from t
(
Select ID, date, X, flag,
Lag(X) Over (Partition By ID Order By date Asc) As Prev into Flag_table
From Work_Table
)
Update [dbo].[Flag_table]
Set flag = 1
where prev = '2'
UPDATE t
Set t.flag = [dbo].[Flag_table].flag FROM T
JOIN [dbo].[Flag_table]
ON t.ID= [dbo].[Flag_table].ID where T.date = [dbo].[Flag_table].date`
However I cannot do this in Bigquery. Any ideas?
Below is for BigQuery Standard SQL
#standardSQL
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
You can test / play with it using dummy data from your question as
#standardSQL
WITH `project.dataset.work_table` AS (
SELECT 1 id, '1/1/16' dt, 2 x, 0 flag UNION ALL
SELECT 2, '1/1/16', 0, 0 UNION ALL
SELECT 3, '1/1/16', 0, 0 UNION ALL
SELECT 1, '2/1/16', 0, 0 UNION ALL
SELECT 2, '2/1/16', 1, 0 UNION ALL
SELECT 3, '2/1/16', 2, 0 UNION ALL
SELECT 1, '3/1/16', 2, 0 UNION ALL
SELECT 2, '3/1/16', 1, 0 UNION ALL
SELECT 3, '3/1/16', 2, 0
)
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
ORDER BY dt, id
with result as
Row id dt x flag
1 1 1/1/16 2 0
2 2 1/1/16 0 0
3 3 1/1/16 0 0
4 1 2/1/16 0 1
5 2 2/1/16 1 0
6 3 2/1/16 2 0
7 1 3/1/16 2 0
8 2 3/1/16 1 0
9 3 3/1/16 2 1
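To see why LAG(x = 2) yields this result, the rule can be restated as a plain-Python sketch that remembers the previous x per id. This is not BigQuery code; the function name flag_previous and the target parameter are hypothetical, and dt is assumed to be sortable (ISO dates in the test data, rather than the 1/1/16 strings above).

```python
def flag_previous(rows, target=2):
    """Return (id, dt, x, flag) with flag = 1 if the id's previous x was target.

    rows: (id, dt, x) tuples. The first row of each id has no previous row,
    so its flag is 0, like IF(NULL, 1, 0) in the query.
    """
    last_x = {}  # id -> x of the most recent earlier row
    out = []
    for id_, dt, x in sorted(rows, key=lambda r: (r[1], r[0])):
        flag = 1 if last_x.get(id_) == target else 0
        out.append((id_, dt, x, flag))
        last_x[id_] = x
    return out
```

The flag depends only on the row immediately before it within the same id, so id 1 is flagged in the second month but not in the third.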