Adjusting table based on previous values in BigQuery

I have a table that looks like below:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|0| 0
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 0
2 |3/1/16|1| 0
3 |3/1/16|2| 0
I'm trying to make it so that flag is populated if X=2 in the PREVIOUS month. As such, it should look like this:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|2| 1
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 1
2 |3/1/16|1| 0
3 |3/1/16|2| 1
I use this in SQL:
`-- copy the source rows into a work table
select ID, date, X, flag into Work_Table from t;
-- add the previous X per ID, ordered by date
Select ID, date, X, flag,
Lag(X) Over (Partition By ID Order By date Asc) As Prev
into Flag_table
From Work_Table;
-- flag rows whose previous X was 2
Update [dbo].[Flag_table]
Set flag = 1
where prev = '2';
-- write the flag back to the original table
UPDATE t
Set t.flag = [dbo].[Flag_table].flag FROM t
JOIN [dbo].[Flag_table]
ON t.ID = [dbo].[Flag_table].ID AND t.date = [dbo].[Flag_table].date;`
However I cannot do this in Bigquery. Any ideas?

Below is for BigQuery Standard SQL
#standardSQL
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
You can test it with the dummy data from your question:
#standardSQL
WITH `project.dataset.work_table` AS (
SELECT 1 id, '1/1/16' dt, 2 x, 0 flag UNION ALL
SELECT 2, '1/1/16', 0, 0 UNION ALL
SELECT 3, '1/1/16', 0, 0 UNION ALL
SELECT 1, '2/1/16', 0, 0 UNION ALL
SELECT 2, '2/1/16', 1, 0 UNION ALL
SELECT 3, '2/1/16', 2, 0 UNION ALL
SELECT 1, '3/1/16', 2, 0 UNION ALL
SELECT 2, '3/1/16', 1, 0 UNION ALL
SELECT 3, '3/1/16', 2, 0
)
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
ORDER BY dt, id
with the result:
Row id dt x flag
1 1 1/1/16 2 0
2 2 1/1/16 0 0
3 3 1/1/16 0 0
4 1 2/1/16 0 1
5 2 2/1/16 1 0
6 3 2/1/16 2 0
7 1 3/1/16 2 0
8 2 3/1/16 1 0
9 3 3/1/16 2 1
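To check the logic without a BigQuery project, here is a minimal plain-Python sketch of what the LAG expression computes. It is an emulation, not BigQuery code: the function name flag_from_previous is made up for illustration, and the dates are rewritten as ISO strings so that they sort chronologically.

```python
from itertools import groupby
from operator import itemgetter

# Rows from the question as (id, dt, x). Dates are rewritten as ISO strings
# so they sort chronologically (an assumption of this sketch).
rows = [
    (1, "2016-01-01", 2), (2, "2016-01-01", 0), (3, "2016-01-01", 0),
    (1, "2016-02-01", 0), (2, "2016-02-01", 1), (3, "2016-02-01", 2),
    (1, "2016-03-01", 2), (2, "2016-03-01", 1), (3, "2016-03-01", 2),
]


def flag_from_previous(rows):
    """Emulate IF(LAG(x = 2) OVER (PARTITION BY id ORDER BY dt), 1, 0)."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition key, then order key
    for _, partition in groupby(ordered, key=itemgetter(0)):
        prev_x = None  # LAG returns NULL on the first row of a partition
        for id_, dt, x in partition:
            out.append((id_, dt, x, 1 if prev_x == 2 else 0))
            prev_x = x
    return sorted(out, key=itemgetter(1, 0))  # ORDER BY dt, id


flagged = flag_from_previous(rows)
for row in flagged:
    print(row)
```

The first row of each id has no previous row, so its flag stays 0, which is the same thing IF does with the NULL that LAG returns there.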

Related

find the consecutive values in impala

I have the data set below with ID, Date and Value. I want to flag each ID where three consecutive days have value 0.
id date value
1 8/10/2021 1
1 8/11/2021 0
1 8/12/2021 0
1 8/13/2021 0
1 8/14/2021 5
2 8/10/2021 2
2 8/11/2021 3
2 8/12/2021 0
2 8/13/2021 0
2 8/14/2021 6
3 8/10/2021 3
3 8/11/2021 4
3 8/12/2021 0
3 8/13/2021 0
3 8/14/2021 0
output
id date value Flag
1 8/10/2021 1 Y
1 8/11/2021 0 Y
1 8/12/2021 0 Y
1 8/13/2021 0 Y
1 8/14/2021 5 Y
2 8/10/2021 2 N
2 8/11/2021 3 N
2 8/12/2021 0 N
2 8/13/2021 0 N
2 8/14/2021 6 N
3 8/10/2021 3 Y
3 8/11/2021 4 Y
3 8/12/2021 0 Y
3 8/13/2021 0 Y
3 8/14/2021 0 Y
Thank you.
Using the window count() function, you can count the 0s in the frame [current row, 2 following] (ordered by date). This is a frame of three consecutive rows, evaluated for each row:
count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following ) cnt
If cnt equals 3, three consecutive 0s were found, so the case expression produces 'Y' for each row with cnt=3: case when cnt=3 then 'Y' else 'N' end.
To propagate the 'Y' flag to the whole id group, use max(...) over (partition by id).
Demo with your data example (tested on Hive):
with mydata as (--Data example, dates converted to sortable format yyyy-MM-dd
select 1 id,'2021-08-10' date_, 1 value union all
select 1,'2021-08-11',0 union all
select 1,'2021-08-12',0 union all
select 1,'2021-08-13',0 union all
select 1,'2021-08-14',5 union all
select 2,'2021-08-10',2 union all
select 2,'2021-08-11',3 union all
select 2,'2021-08-12',0 union all
select 2,'2021-08-13',0 union all
select 2,'2021-08-14',6 union all
select 3,'2021-08-10',3 union all
select 3,'2021-08-11',4 union all
select 3,'2021-08-12',0 union all
select 3,'2021-08-13',0 union all
select 3,'2021-08-14',0
) --End of data example, use your table instead of this CTE
select id, date_, value,
max(case when cnt=3 then 'Y' else 'N' end) over (partition by id) flag
from
(
select id, date_, value,
count(case when value=0 then 1 else null end) over(partition by id order by date_ rows between current row and 2 following ) cnt
from mydata
)s
order by id, date_ --remove ordering if not necessary
--added it to get result in the same order
Result:
id date_ value flag
1 2021-08-10 1 Y
1 2021-08-11 0 Y
1 2021-08-12 0 Y
1 2021-08-13 0 Y
1 2021-08-14 5 Y
2 2021-08-10 2 N
2 2021-08-11 3 N
2 2021-08-12 0 N
2 2021-08-13 0 N
2 2021-08-14 6 N
3 2021-08-10 3 Y
3 2021-08-11 4 Y
3 2021-08-12 0 Y
3 2021-08-13 0 Y
3 2021-08-14 0 Y
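For anyone who wants to step through the frame logic outside Hive or Impala, the following is a rough plain-Python sketch of the same idea. The function name flag_three_zeros and the run parameter are illustrative assumptions, not part of the query above.

```python
from collections import defaultdict

# Same sample rows as the CTE above: (id, date_, value)
mydata = [
    (1, "2021-08-10", 1), (1, "2021-08-11", 0), (1, "2021-08-12", 0),
    (1, "2021-08-13", 0), (1, "2021-08-14", 5),
    (2, "2021-08-10", 2), (2, "2021-08-11", 3), (2, "2021-08-12", 0),
    (2, "2021-08-13", 0), (2, "2021-08-14", 6),
    (3, "2021-08-10", 3), (3, "2021-08-11", 4), (3, "2021-08-12", 0),
    (3, "2021-08-13", 0), (3, "2021-08-14", 0),
]


def flag_three_zeros(rows, run=3):
    """Return {id: 'Y' or 'N'}; 'Y' when some frame of `run` rows is all zeros."""
    by_id = defaultdict(list)
    for id_, _date, value in sorted(rows, key=lambda r: (r[0], r[1])):
        by_id[id_].append(value)  # partition by id, order by date_
    flags = {}
    for id_, values in by_id.items():
        # cnt per row = zeros in the frame [current row, run - 1 following]
        cnts = [sum(v == 0 for v in values[i:i + run]) for i in range(len(values))]
        # max(case when cnt = run then 'Y' else 'N' end) over (partition by id)
        flags[id_] = "Y" if run in cnts else "N"
    return flags


flags = flag_three_zeros(mydata)
print(flags)
```

Frames near the end of a partition hold fewer than three rows, so they can never reach a count of 3, just like the SQL frame.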
You can identify the ids by comparing lag()s. Then spread the value across all rows. The following gets the flag on the third 0:
select t.*,
(case when value = 0 and prev_value_date_2 = prev_date_2
then 'Y' else 'N'
end) as flag_on_row
from (select t.*,
lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
lag(date, 2) over (partition by id order by date) as prev_date_2
from t
) t;
The above logic uses lag() so it is easy to extend to longer streaks of 0s. The "2" is looking two rows behind, so if the lagged values are the same, then there are three rows in a row with the same value.
And to spread the value:
select t.*, max(flag_on_row) over (partition by id) as flag
from (select t.*,
(case when value = 0 and prev_value_date_2 = prev_date_2
then 'Y' else 'N'
end) as flag_on_row
from (select t.*,
lag(date, 2) over (partition by value, id order by date) as prev_value_date_2,
lag(date, 2) over (partition by id order by date) as prev_date_2
from t
) t
) t;
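The lag() comparison can be traced by hand with a small plain-Python sketch. This is only an emulation under assumptions: the helper name third_zero_rows is invented here, and the sample rows are the ones from the question with dates in yyyy-MM-dd form.

```python
# Sample rows from the question: (id, date, value)
t = [
    (1, "2021-08-10", 1), (1, "2021-08-11", 0), (1, "2021-08-12", 0),
    (1, "2021-08-13", 0), (1, "2021-08-14", 5),
    (2, "2021-08-10", 2), (2, "2021-08-11", 3), (2, "2021-08-12", 0),
    (2, "2021-08-13", 0), (2, "2021-08-14", 6),
    (3, "2021-08-10", 3), (3, "2021-08-11", 4), (3, "2021-08-12", 0),
    (3, "2021-08-13", 0), (3, "2021-08-14", 0),
]


def third_zero_rows(rows):
    """Return (id, date) pairs where flag_on_row would be 'Y'."""
    by_id = {}
    for id_, date, value in sorted(rows):
        by_id.setdefault(id_, []).append((date, value))
    hits = []
    for id_, series in by_id.items():
        for i, (date, value) in enumerate(series):
            # lag(date, 2) over (partition by id order by date)
            prev_date_2 = series[i - 2][0] if i >= 2 else None
            # lag(date, 2) over (partition by value, id order by date)
            same_value = [d for d, v in series[:i] if v == value]
            prev_value_date_2 = same_value[-2] if len(same_value) >= 2 else None
            # A NULL on either side never compares equal in SQL
            if value == 0 and prev_date_2 is not None and prev_value_date_2 == prev_date_2:
                hits.append((id_, date))
    return hits


hits = third_zero_rows(t)
print(hits)
```

When both lags land on the same date, the row in between must carry the same value too, which is why the comparison detects a streak of three.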

count zeros between 1s in same column

I've data like this.
ID IND
1 0
2 0
3 1
4 0
5 1
6 0
7 0
I want to count the zeros before the value 1. So that, the output will be like below.
ID IND OUT
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
Is it possible without pl/sql? I tried to find the differences between row numbers but couldn't achieve it.
The match_recognize clause, introduced in Oracle 12.1, can do quick work of such "row pattern recognition" problems. The solution is just a bit complex due to the special treatment of a "last row" with IND = 0, but it is straightforward otherwise.
As usual, the with clause is not part of the solution; I include it to test the query. Remove it and use your actual table and column names.
with
inputs (id, ind) as (
select 1, 0 from dual union all
select 2, 0 from dual union all
select 3, 1 from dual union all
select 4, 0 from dual union all
select 5, 1 from dual union all
select 6, 0 from dual union all
select 7, 0 from dual
)
select id, ind, out
from inputs
match_recognize(
order by id
measures case classifier() when 'Z' then 0
when 'O' then count(*) - 1
else count(*) end as out
all rows per match
pattern ( Z* ( O | X ) )
define Z as ind = 0, O as ind != 0
);
ID IND OUT
---------- ---------- ----------
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
You can treat this as a gaps-and-islands problem. You can define the "islands" by the number of "1"s on or after each row. Then use a window function:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then sum(1 - ind) over (partition by grp)
else 0
end) as num_zeros
from (select t.*,
sum(ind) over (order by id desc) as grp
from t
) t;
If id is sequential with no gaps, you can do this without a subquery:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
-- subtract ind so that the "1" row itself is not counted as a zero
then id - coalesce(lag(case when ind = 1 then id end ignore nulls) over (order by id), min(id) over () - 1) - ind
else 0
end) as num_zeros
from t;
I would suggest removing the case conditions and just using the then clause for the expression, so the value is on all rows.
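As a sanity check of the gaps-and-islands version, here is a small plain-Python sketch of the same reverse running sum. It is an emulation rather than Oracle code, and the function name zeros_before_ones is an assumption made for this example.

```python
# Sample rows from the question: (id, ind)
t = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 0)]


def zeros_before_ones(rows):
    """Return {id: num_zeros} using a running sum of ind taken in descending id order."""
    if not rows:
        return {}
    ordered = sorted(rows, reverse=True)  # order by id desc
    grp = 0
    labelled = []
    zeros_in_grp = {}
    for id_, ind in ordered:
        grp += ind  # sum(ind) over (order by id desc)
        labelled.append((id_, ind, grp))
        # sum(1 - ind) over (partition by grp)
        zeros_in_grp[grp] = zeros_in_grp.get(grp, 0) + (1 - ind)
    last_id = ordered[0][0]  # the row where row_number() over (order by id desc) = 1
    return {
        id_: (zeros_in_grp[g] if ind == 1 or id_ == last_id else 0)
        for id_, ind, g in labelled
    }


num_zeros = zeros_before_ones(t)
print(sorted(num_zeros.items()))
```

Each "1" closes the island that contains the zeros before it, and the trailing zeros form a final island that is reported on the last row.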

Calculating date intervals from a daily-grained fact table

I have the data for student absence which I got after some transformations. The data is day by day:
WITH datasample AS (
SELECT 1 AS StudentID, 20180101 AS DateID, 0 AS AbsentToday, 0 AS AbsentYesterday UNION ALL
SELECT 1, 20180102, 1, 0 UNION ALL
SELECT 1, 20180103, 1, 1 UNION ALL
SELECT 1, 20180104, 1, 1 UNION ALL
SELECT 1, 20180105, 1, 1 UNION ALL
SELECT 1, 20180106, 0, 1 UNION ALL
SELECT 2, 20180101, 0, 0 UNION ALL
SELECT 2, 20180102, 1, 0 UNION ALL
SELECT 2, 20180103, 1, 1 UNION ALL
SELECT 2, 20180104, 0, 1 UNION ALL
SELECT 2, 20180105, 1, 0 UNION ALL
SELECT 2, 20180106, 1, 1 UNION ALL
SELECT 2, 20180107, 0, 1
)
SELECT *
FROM datasample
ORDER BY StudentID, DateID
I need to add a column (AbsencePeriodInMonth) which would calculate the student's absence period during the month.
For example, StudentID=1 was absent in one consecutive period during the month and StudentID=2 had two periods, something like this:
StudentID DateID AbsentToday AbsentYesterday AbsencePeriodInMonth
1 20180101 0 0 0
1 20180102 1 0 1
1 20180103 1 1 1
1 20180104 1 1 1
1 20180105 1 1 1
1 20180106 0 1 0
2 20180101 0 0 0
2 20180102 1 0 1
2 20180103 1 1 1
2 20180104 0 1 0
2 20180105 1 0 2
2 20180106 1 1 2
2 20180107 0 1 0
My goal is actually to calculate the consecutive absent days prior to each day in the fact table. I think I can do it once I have the AbsencePeriodInMonth column, by adding this to my query after the *:
,CASE WHEN AbsentToday = 1 THEN DENSE_RANK() OVER(PARTITION BY StudentID, AbsencePeriodInMonth ORDER BY DateID)
ELSE 0
END AS DaysAbsent
Any idea on how I can add that AbsencePeriodInMonth or maybe calculate the consecutive absent days in some other way?
You can identify each period by counting the number of 0s beforehand. Then you can enumerate them using dense_rank().
select ds.*,
(case when absenttoday = 1 then dense_rank() over (partition by studentid order by grp)
else 0
end) as AbsencePeriodInMonth
from (select ds.*, sum(case when absenttoday = 0 then 1 else 0 end) over (partition by studentid order by dateid) as grp
from datasample ds
) ds
order by StudentID, DateID;
Here is a SQL Fiddle.
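A related way to look at it, sketched here in plain Python: count period starts (AbsentToday = 1 and AbsentYesterday = 0) instead of counting 0s. This is a different technique from the query above, swapped in because it does not depend on how many present days sit between two periods; the dense_rank() above ranks every grp value, including ones that occur only on present days, so with several consecutive present days the period numbers can skip. The function name absence_periods is made up, and the sketch assumes AbsentYesterday is reliable and that no period starts before the first row.

```python
# Sample rows: (StudentID, DateID, AbsentToday, AbsentYesterday)
datasample = [
    (1, 20180101, 0, 0), (1, 20180102, 1, 0), (1, 20180103, 1, 1),
    (1, 20180104, 1, 1), (1, 20180105, 1, 1), (1, 20180106, 0, 1),
    (2, 20180101, 0, 0), (2, 20180102, 1, 0), (2, 20180103, 1, 1),
    (2, 20180104, 0, 1), (2, 20180105, 1, 0), (2, 20180106, 1, 1),
    (2, 20180107, 0, 1),
]


def absence_periods(rows):
    """Return (StudentID, DateID, AbsencePeriodInMonth, DaysAbsent) per row."""
    periods = {}  # periods started so far, per student
    streak = {}   # current run of absent days, per student
    out = []
    for sid, date_id, today, yesterday in sorted(rows):
        if today == 1 and yesterday == 0:
            periods[sid] = periods.get(sid, 0) + 1  # a new period starts here
        streak[sid] = (streak.get(sid, 0) + 1) if today == 1 else 0
        period = periods.get(sid, 0) if today == 1 else 0
        out.append((sid, date_id, period, streak[sid]))
    return out


result = absence_periods(datasample)
for row in result:
    print(row)
```

The fourth value is the running count of consecutive absent days, which is the DaysAbsent figure the question is ultimately after.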
Using Recursive CTE and Dense_Rank
WITH datasample AS (
SELECT 1 AS StudentID, 20180101 AS DateID, 0 AS AbsentToday, 0 AS AbsentYesterday UNION ALL
SELECT 1, 20180102, 1, 0 UNION ALL
SELECT 1, 20180103, 1, 1 UNION ALL
SELECT 1, 20180104, 1, 1 UNION ALL
SELECT 1, 20180105, 1, 1 UNION ALL
SELECT 1, 20180106, 0, 1 UNION ALL
SELECT 2, 20180101, 0, 0 UNION ALL
SELECT 2, 20180102, 1, 0 UNION ALL
SELECT 2, 20180103, 1, 1 UNION ALL
SELECT 2, 20180104, 0, 1 UNION ALL
SELECT 2, 20180105, 1, 0 UNION ALL
SELECT 2, 20180106, 1, 1 UNION ALL
SELECT 2, 20180107, 0, 1
), cte as
(Select *,DateID as dd
from datasample
where AbsentToday = 1 and AbsentYesterday = 0
union all
Select d.*, c.dd
from datasample d
join cte c
on d.StudentID = c.StudentID and d.DateID = c.DateID + 1
where d.AbsentToday = 1
), cte1 as
(
Select *, DENSE_RANK() over (partition by StudentId order by dd) as de
from cte
)
Select d.*, IsNull(c.de,0) as AbsencePeriodInMonth
from cte1 c
right join datasample d
on d.StudentID = c.StudentID and c.DateID = d.DateID
order by d.StudentID, d.DateID

Summing numbers across multiple columns in BigQuery

I have a query which returns many columns which are either 1 or 0 depending on a user's interaction with many points of a website. My data looks like this:
UserID Variable_1 Variable_2 Variable_3 Variable_4 Variable_5
User 1 1 0 1 0 0
User 2 0 0 1 0 0
User 3 0 0 0 0 1
User 4 0 1 1 1 1
User 5 1 0 0 0 1
Each variable is defined with its own line of code like:
MAX(IF(LOWER(hits_product.productbrand) LIKE "Variable_1",1,0)) AS Variable_1,
I'd like to have one column that sums up all the variables per user, which looks like this:
UserID Total Variable_1 Variable_2 Variable_3 Variable_4 Variable_5
User 1 2 1 0 1 0 0
User 2 3 1 1 1 0 0
User 3 0 0 0 0 0 0
User 4 5 1 1 1 1 1
User 5 3 1 0 1 0 1
What is the most elegant way to achieve this?
Even though a simple COUNT(DISTINCT) happens to suffice for OP's particular case, I still wanted to answer the original question of how to sum up all numerical columns into one Total without depending on the number and names of those columns.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
UserID,
( SELECT SUM(CAST(value AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(\d+),?')) value
) Total,
* EXCEPT(UserID)
FROM t
This can be tested / played with using dummy data from question
#standardSQL
WITH t AS (
SELECT 'User 1' UserID, 1 Variable_1, 0 Variable_2, 1 Variable_3, 0 Variable_4, 0 Variable_5 UNION ALL
SELECT 'User 2', 1, 1, 1, 0, 0 UNION ALL
SELECT 'User 3', 0, 0, 0, 0, 0 UNION ALL
SELECT 'User 4', 1, 1, 1, 1, 1 UNION ALL
SELECT 'User 5', 1, 0, 1, 0, 1
)
SELECT
UserID,
( SELECT SUM(CAST(value AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':(\d+),?')) value
) Total,
* EXCEPT(UserID)
FROM t
ORDER BY UserID
result is
Row UserID Total Variable_1 Variable_2 Variable_3 Variable_4 Variable_5
1 User 1 2 1 0 1 0 0
2 User 2 3 1 1 1 0 0
3 User 3 0 0 0 0 0 0
4 User 4 5 1 1 1 1 1
5 User 5 3 1 0 1 0 1
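The JSON-plus-regex trick can be tried outside BigQuery with a short plain-Python sketch. Here json.dumps with compact separators only approximates what TO_JSON_STRING produces, and the function name total_of_numeric_fields is an assumption for illustration.

```python
import json
import re

# Same dummy rows as the CTE above, one dict per row
t = [
    {"UserID": "User 1", "Variable_1": 1, "Variable_2": 0, "Variable_3": 1, "Variable_4": 0, "Variable_5": 0},
    {"UserID": "User 2", "Variable_1": 1, "Variable_2": 1, "Variable_3": 1, "Variable_4": 0, "Variable_5": 0},
    {"UserID": "User 3", "Variable_1": 0, "Variable_2": 0, "Variable_3": 0, "Variable_4": 0, "Variable_5": 0},
    {"UserID": "User 4", "Variable_1": 1, "Variable_2": 1, "Variable_3": 1, "Variable_4": 1, "Variable_5": 1},
    {"UserID": "User 5", "Variable_1": 1, "Variable_2": 0, "Variable_3": 1, "Variable_4": 0, "Variable_5": 1},
]


def total_of_numeric_fields(row):
    """Serialize the row to compact JSON and sum every integer that follows a colon."""
    text = json.dumps(row, separators=(",", ":"))  # e.g. {"UserID":"User 1","Variable_1":1,...}
    return sum(int(v) for v in re.findall(r":(\d+),?", text))


totals = [total_of_numeric_fields(row) for row in t]
print(totals)
```

One thing to keep in mind: the pattern only looks for digits right after a colon, so a string value that itself contains something like ":5" would be counted too, and negative or decimal numbers are not handled.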
A simple method uses a subquery or CTE:
select t.*, (v1 + v2 + v3 . . . ) as total
from (<your query here>
) t;
Not knowing what the data looks like, it is quite possible that count(distinct hits_product.productbrand) would also do the trick.
How about folding the multiple variable columns into one repeated 'variables' column of KeyValue messages, where the key is your variable name and the value is a number? It can greatly simplify your calculation.

oracle coding sql

I have data in the below format
g_name amt flag
g1 0 0
g1 0 0
g1 10 1
g1 0 0
g1 15 2
g1 0 0
and I need it in the format below.
n1 should hold the amt from the row where flag hits 1 and keep retaining it till the end; similarly, n2 should hold the amt from the row where flag hits 2 and keep retaining it till the end. Please help me with any window functions, without needing joins.
g_name amt flag n1 n2
g1 0 0 0 0
g1 0 0 0 0
g1 10 1 10 0
g1 0 0 10 0
g1 15 2 10 15
g1 0 0 10 15
I added a column for ordering - change as needed. I also added a few more rows with a different g_name, presumably this must be done "by g_name".
This is a good test case for the first_value() analytic function. It has the ability to ignore nulls - so we make the amt NULL when flag is not 1 (or 2, etc.) and then apply first_value() with the proper PARTITION BY and ORDER BY clauses.
with
test_data ( id, g_name, amt, flag ) as (
select 1, 'g1', 0, 0 from dual union all
select 2, 'g1', 0, 0 from dual union all
select 3, 'g1', 10, 1 from dual union all
select 4, 'g1', 0, 0 from dual union all
select 5, 'g1', 15, 2 from dual union all
select 6, 'g1', 0, 0 from dual union all
select 1, 'g2', 0, 0 from dual union all
select 2, 'g2', 4, 1 from dual union all
select 3, 'g2', 3, 2 from dual union all
select 4, 'g2', 0, 0 from dual
)
-- end of test data; solution (SQL query) begins below this line
select id, g_name, amt, flag,
coalesce (first_value(case when flag = 1 then amt end ignore nulls)
over (partition by g_name order by id), 0) as n1,
coalesce (first_value(case when flag = 2 then amt end ignore nulls)
over (partition by g_name order by id), 0) as n2
from test_data
order by g_name, id
;
ID G_NAME AMT FLAG N1 N2
--- ------ ---------- ---------- ---------- ----------
1 g1 0 0 0 0
2 g1 0 0 0 0
3 g1 10 1 10 0
4 g1 0 0 10 0
5 g1 15 2 10 15
6 g1 0 0 10 15
1 g2 0 0 0 0
2 g2 4 1 4 0
3 g2 3 2 4 3
4 g2 0 0 4 3
SQL tables represent unordered sets. There is no ordering, unless a column specifies that ordering. Let me assume that such a column exists.
If so, you can do this with analytic functions:
select t.*,
max(case when flag = 1 then amt else 0 end) over (order by ??) as n1,
max(case when flag = 2 then amt else 0 end) over (order by ??) as n2
from t;
The ?? specifies the ordering.
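Both answers boil down to "remember the amt from the first row with a given flag and carry it forward". A minimal plain-Python sketch of that idea follows; it emulates the first_value(... ignore nulls) version, the function name carry_forward is invented for this example, and the id column is the assumed ordering column from the test data above.

```python
# Test data from the first answer: (id, g_name, amt, flag)
test_data = [
    (1, "g1", 0, 0), (2, "g1", 0, 0), (3, "g1", 10, 1),
    (4, "g1", 0, 0), (5, "g1", 15, 2), (6, "g1", 0, 0),
    (1, "g2", 0, 0), (2, "g2", 4, 1), (3, "g2", 3, 2), (4, "g2", 0, 0),
]


def carry_forward(rows):
    """Return (id, g_name, amt, flag, n1, n2) per row, ordered by g_name, id."""
    first_seen = {}  # (g_name, flag) -> amt on the first row carrying that flag
    out = []
    for id_, g_name, amt, flag in sorted(rows, key=lambda r: (r[1], r[0])):
        first_seen.setdefault((g_name, flag), amt)  # keep only the first value
        n1 = first_seen.get((g_name, 1), 0)  # coalesce(first_value(...), 0)
        n2 = first_seen.get((g_name, 2), 0)
        out.append((id_, g_name, amt, flag, n1, n2))
    return out


carried = carry_forward(test_data)
for row in carried:
    print(row)
```

Because the lookup happens after the current row is recorded, the row that introduces flag 1 or 2 already shows its own amt, which mirrors the default window frame ending at the current row.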