How do I add an auto-increment counter based on conditions, with a conditional reset, in Google BigQuery?

I have a table in BigQuery and am having trouble deriving an incremental field based on a condition.
Basically, the first week the score falls below 95% the field should return Stage 1. If the score is below 95% for a second straight week it should return Stage 2, and so on. However, if the score goes above 95%, the counter resets to "Good", and the next time the score drops below 95% it starts again at Stage 1.

You can use row_number() -- but after assigning a group based on the count of > 95% values up to each row:
select t.*,
       (case when row_number() over (partition by grp order by month, week) = 1
             then 'Good'
             -- cast so that concat() receives two strings
             else concat('Stage ', cast(row_number() over (partition by grp order by month, week) - 1 as string))
        end) as level
from (select t.*,
             countif(score > 0.95) over (order by month, week) as grp
      from t
     ) t;

Consider below
select * except(grp),
       (case when Average_score >= 95 and 1 = row_number() over grps then 'Good'
             else format('Stage %i', row_number() over grps - sign(grp))
        end) as Level
from (
  select *, countif(Average_score >= 95) over (order by Month, Week) as grp
  from `project.dataset.table`
)
window grps as (partition by grp order by Month, Week)
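To make the intended result easier to check by hand, here is a minimal plain-Python sketch of the same labelling rule. It is not BigQuery code; the function name `label_levels` and the sample scores are made up for illustration, and it assumes the rows are already ordered by Month, Week.

```python
def label_levels(scores, threshold=95):
    """Label each weekly score 'Good' or 'Stage N'; N restarts after a good week."""
    levels = []
    streak = 0  # consecutive weeks below the threshold
    for score in scores:
        if score >= threshold:
            streak = 0
            levels.append("Good")
        else:
            streak += 1
            levels.append(f"Stage {streak}")
    return levels


# Hypothetical weekly scores, already in (Month, Week) order.
scores = [97, 93, 90, 96, 94]
print(label_levels(scores))
# ['Good', 'Stage 1', 'Stage 2', 'Good', 'Stage 1']
```

The `streak` variable plays the role of `row_number() over grps - sign(grp)` in the query: it counts rows since the last score at or above the threshold.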

Related

How to get increment number when there are any change in a column in Bigquery?

I have date, id, and flag columns in this table. How can I get a value column that holds an incrementing number and restarts from 1 whenever the flag column changes?
Consider below approach
select * except(changed, grp),
       row_number() over(partition by id, grp order by date) value
from (
  select *, countif(changed) over(partition by id order by date) grp
  from (
    select *,
           ifnull(flag != lag(flag) over(partition by id order by date), true) changed
    from `project.dataset.table`
  )
)
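As a rough cross-check of what this query is meant to return, the same "restart at 1 on every flag change" rule can be sketched in plain Python. The function name `run_counters` and the sample rows are hypothetical, and dates are assumed to be ISO strings so that they sort chronologically.

```python
from itertools import groupby


def run_counters(rows):
    """rows: (id, date, flag) tuples.

    Returns (id, date, flag, value) tuples where value restarts at 1
    whenever flag changes within an id.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    # groupby only merges adjacent rows, so each group is one run of equal flags
    for _, run in groupby(ordered, key=lambda r: (r[0], r[2])):
        for value, (id_, date, flag) in enumerate(run, start=1):
            out.append((id_, date, flag, value))
    return out


# Hypothetical sample data.
rows = [
    ("a", "2021-01-01", True),
    ("a", "2021-01-02", True),
    ("a", "2021-01-03", False),
    ("a", "2021-01-04", True),
    ("b", "2021-01-01", False),
    ("b", "2021-01-02", False),
]
print([r[3] for r in run_counters(rows)])
# [1, 2, 1, 1, 1, 2]
```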
You seem to want to count the number of falses since the last true. You can use:
select t.* except (grp),
       (case when flag
             then 1
             else row_number() over (partition by id, grp order by date) - 1
        end)
from (select t.*,
             countif(flag) over (partition by id order by date) as grp
      from t
     ) t;
If you know that the dates have no gaps, you can actually do this without a subquery:
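select t.*,
       (case when flag then 1
             else date_diff(date,
                            -- "order by date" limits the max to rows up to the current one,
                            -- i.e. the most recent date on which flag was true
                            max(case when flag then date end) over (partition by id order by date),
                            day)
        end)
from t;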

How can I reset the count to 0 in SQL when a condition is false?

I have a SQL table with the data shown in the picture.
I need to write a query that counts, for each ticker, the number of consecutive days per year on which the close_value is greater than the open_value. If the close_value is less than the open_value, the counter must be reset to zero, and I have to save the value of the counter at that moment.
This is an example of a gaps-and-islands problem. You can use the difference of row_numbers():
select ticker, min(date), max(date), min(open_value), max(close_value),
       count(*) as num_rows
from (select t.*,
             row_number() over (partition by ticker order by date) as seqnum,
             row_number() over (partition by ticker, (case when close_value > open_value then 1 else 2 end) order by date) as seqnum_2
      from t
     ) t
where close_value > open_value
group by ticker, (seqnum - seqnum_2);
This returns all such periods. You haven't specified what the result set should look like, but this should be pretty close.
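If the difference-of-row-numbers trick is unfamiliar, the following plain-Python sketch may help show why `seqnum - seqnum_2` is constant within each streak. The function name `up_streaks` and the sample prices are invented for illustration; it only mirrors the grouping idea, not the full select list.

```python
from collections import defaultdict


def up_streaks(rows):
    """rows: (ticker, date, open_value, close_value) tuples.

    Returns (ticker, first_date, last_date, num_rows) for every run of
    consecutive rows where close_value > open_value.
    """
    seq_all = defaultdict(int)    # row_number() over (partition by ticker)
    seq_class = defaultdict(int)  # row_number() over (partition by ticker, up/down)
    islands = defaultdict(list)
    for ticker, date, open_value, close_value in sorted(rows, key=lambda r: (r[0], r[1])):
        is_up = close_value > open_value
        seq_all[ticker] += 1
        seq_class[(ticker, is_up)] += 1
        if is_up:
            # both counters advance together inside a streak, so the
            # difference only changes when a "down" row interrupts it
            key = (ticker, seq_all[ticker] - seq_class[(ticker, is_up)])
            islands[key].append(date)
    return [(ticker, dates[0], dates[-1], len(dates))
            for (ticker, _), dates in islands.items()]


# Hypothetical prices: up, up, down, up, up, up.
rows = [
    ("X", "2021-01-01", 10, 11),
    ("X", "2021-01-02", 11, 12),
    ("X", "2021-01-03", 12, 11),
    ("X", "2021-01-04", 11, 13),
    ("X", "2021-01-05", 13, 14),
    ("X", "2021-01-06", 14, 15),
]
print(up_streaks(rows))
# [('X', '2021-01-01', '2021-01-02', 2), ('X', '2021-01-04', '2021-01-06', 3)]
```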

How to select specific rows in a "group by" groups using conditions on multiple columns?

I have the following table with many userId (in the example only one userId for demo purpose):
For every userId I want to extract two rows:
The first row should be isTransaction = 0 and the earliest date!
The second row should have isTransaction = 1, a device different from that of the first row, and the earliest date after that of the first row.
That is, the output should be:
Time userId device isTransaction
2021-01-27 10187675 mobile 0
2021-01-30 10187675 web 1
I tried to rank rows with partitioning and ordering but it didn't work:
Select * from
(SELECT *, rank() over(partition by userId, device, isTransaction order by isTransaction, Time) as rnk
FROM table 1)
where rnk=1
order by Time
Please help! It would also be good to check that the time difference between these two rows does not exceed 30 days; otherwise, the userId should be dropped.
You can first identify the earliest time for 0. Then enumerate the rows and take only the first one:
select t.*
from (select t.*,
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      -- ">=" keeps the earliest isTransaction = 0 row itself
      where time >= time_0
     ) t
where seqnum = 1;
This satisfies the two conditions you enumerated.
Then, buried in the text, you want to eliminate rows where the difference is greater than 30 days. That is a little trickier . . . but not too hard:
select t.*
from (select t.*,
             min(case when isTransaction = 1 then time end) over (partition by userid) as time_1,
             row_number() over (partition by userid, isTransaction order by time) as seqnum
      from (select t.*,
                   min(case when isTransaction = 0 then time end) over (partition by userid order by time) as time_0
            from t
           ) t
      -- ">=" keeps the earliest isTransaction = 0 row itself
      where time >= time_0
     ) t
where seqnum = 1 and
      time_1 < timestamp_add(time_0, interval 30 day);
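For comparison, here is a plain-Python sketch of the selection rules as stated in the question, including the "different device" condition, which the queries above do not check. Everything here is an assumption made for illustration: the function name `first_pair`, the tuple layout, the inclusive 30-day limit, and all sample rows except the two that appear in the question's expected output.

```python
from datetime import date


def first_pair(rows, max_gap_days=30):
    """rows: (time, user_id, device, is_transaction) tuples.

    Returns {user_id: (first_row, second_row)}; users without a valid
    pair within max_gap_days are left out.
    """
    by_user = {}
    for row in sorted(rows, key=lambda r: (r[1], r[0])):
        by_user.setdefault(row[1], []).append(row)

    result = {}
    for user_id, user_rows in by_user.items():
        # earliest row with isTransaction = 0
        first = next((r for r in user_rows if r[3] == 0), None)
        if first is None:
            continue
        # earliest later row with isTransaction = 1 on a different device
        second = next((r for r in user_rows
                       if r[3] == 1 and r[0] > first[0] and r[2] != first[2]), None)
        if second is not None and (second[0] - first[0]).days <= max_gap_days:
            result[user_id] = (first, second)
    return result


rows = [
    (date(2021, 1, 27), 10187675, "mobile", 0),
    (date(2021, 1, 28), 10187675, "mobile", 1),  # hypothetical: same device, skipped
    (date(2021, 1, 30), 10187675, "web", 1),
    (date(2021, 2, 5), 10187675, "web", 0),      # hypothetical
]
for row in first_pair(rows)[10187675]:
    print(row)
```

With this sample, the two printed rows are the 2021-01-27 mobile row and the 2021-01-30 web row, as in the expected output from the question.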

count consecutive record with timestamp interval requirement

Referring to this post (link), I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   row_number() over (partition by taxi order by time) as seqnum,
                   row_number() over (partition by taxi, client order by time) as seqnum_c
            from t
           ) t
      group by t.taxi, t.client, (seqnum - seqnum_c)
      having count(*) >= 2
     )
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA count as 1, AAA count as 1 and BB count as 1, so total of 3 count)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how to do it.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   -- a new group starts whenever the client changes or the
                   -- previous ride is 2 hours or more in the past
                   sum(case when prev_client = client and
                                 prev_time > time - interval '2 hour'
                            then 0
                            else 1
                       end) over (partition by taxi order by time) as grp
            from (select t.*,
                         lag(client) over (partition by taxi order by time) as prev_client,
                         lag(time) over (partition by taxi order by time) as prev_time
                  from t
                 ) t
           ) t
      group by t.taxi, t.client, grp
      having count(*) >= 2
     ) t
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
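The grouping rule (same client as the previous ride of the same taxi, and less than 2 hours later) can also be sketched in plain Python to sanity-check expected counts. The function name `count_repeat_runs` and the sample rides are hypothetical.

```python
from datetime import datetime, timedelta


def count_repeat_runs(rows, max_gap=timedelta(hours=2)):
    """rows: (taxi, client, time) tuples.

    Counts, per taxi, the runs of two or more consecutive rides with the
    same client where each ride starts less than max_gap after the previous one.
    """
    counts = {}
    prev = {}     # taxi -> (client, time) of the previous ride, like lag()
    run_len = {}  # taxi -> length of the current run
    for taxi, client, time in sorted(rows, key=lambda r: (r[0], r[2])):
        last = prev.get(taxi)
        continues = (last is not None and last[0] == client
                     and time - last[1] < max_gap)
        run_len[taxi] = run_len[taxi] + 1 if continues else 1
        if run_len[taxi] == 2:  # count each qualifying run exactly once
            counts[taxi] = counts.get(taxi, 0) + 1
        prev[taxi] = (client, time)
    return counts


def at(hour, minute=0):
    return datetime(2021, 1, 1, hour, minute)


# Hypothetical rides.
rows = [
    ("Tom", "A", at(8)), ("Tom", "A", at(8, 30)), ("Tom", "B", at(9)),
    ("Tom", "A", at(10)), ("Tom", "A", at(13)),  # 3-hour gap: not a run
    ("Tom", "B", at(13, 30)), ("Tom", "B", at(14)),
    ("Bob", "C", at(9)), ("Bob", "C", at(9, 45)),
]
print(count_repeat_runs(rows))
# {'Bob': 1, 'Tom': 2}
```

Taxis without any qualifying run simply do not appear in the result, which matches the `having count(*) >= 2` filter in the query.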

Reactivation SQL

I have the following:
with t as (
      SELECT advertisable, EXTRACT(YEAR from day) as yy, EXTRACT(MONTH from day) as mon,
             ROUND(SUM(cost)/1e6) as val
      FROM adcube dac
      WHERE advertisable IN (SELECT advertisable
                             FROM adcube dac
                             GROUP BY advertisable
                             HAVING SUM(cost)/1e6 > 100
                            )
      GROUP BY advertisable, EXTRACT(YEAR from day), EXTRACT(MONTH from day)
     )
select advertisable, min(yy * 10000 + mon) as yyyymm
from (select t.*,
             (row_number() over (partition by advertisable order by yy, mon) -
              row_number() over (partition by advertisable, val order by yy, mon)
             ) as grp
      from t
     ) as foo
group by advertisable, grp, val
having count(*) >= 6 and val = 0
;
This tracks the activation date of an account that stops spending for 4 months. However, I would like to track the reactivation date instead: if an account starts spending again after 4 months, how can I see the new start date for that account?
You want to find accounts where val > 0 and there are 4 (or 6) preceding records with 0s.
Here is an idea:
Calculate the groups of similar values as in your query.
Assign a sequential number to each group (val_seqnum).
Then pull the previous value and sequence number for each record.
Now, you want the records where the following is true:
val > 0
prev_val = 0
The previous val_seqnum >= 4 (or whatever your threshold).
The following query should do this (assuming the same definition of t):
select t.*
from (select t.*,
             lag(val) over (partition by advertisable order by yy, mon) as prev_val,
             lag(val_seqnum) over (partition by advertisable order by yy, mon) as prev_val_seqnum
      from (select t.*,
                   row_number() over (partition by advertisable, val, grp order by yy, mon) as val_seqnum
            from (select t.*,
                         (row_number() over (partition by advertisable order by yy, mon) -
                          row_number() over (partition by advertisable, val order by yy, mon)
                         ) as grp
                  from t
                 ) t
           ) t
     ) t
where val > 0 and prev_val = 0 and prev_val_seqnum >= 4;
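The three conditions amount to "a month with spend that directly follows at least 4 zero months". A minimal plain-Python sketch of that rule for a single advertisable is shown below; the function name `reactivations`, the `(yyyymm, val)` layout, and the sample values are assumptions for illustration only.

```python
def reactivations(monthly, min_idle=4):
    """monthly: (yyyymm, val) tuples for one advertisable, ordered by month.

    Returns the months where val > 0 directly follows at least
    min_idle consecutive months with val = 0.
    """
    result = []
    zero_run = 0  # plays the role of prev_val_seqnum for a run of zeros
    for yyyymm, val in monthly:
        if val > 0:
            if zero_run >= min_idle:
                result.append(yyyymm)
            zero_run = 0
        else:
            zero_run += 1
    return result


# Hypothetical monthly values.
monthly = [
    (202101, 5), (202102, 0), (202103, 0), (202104, 0),
    (202105, 0), (202106, 3), (202107, 0), (202108, 2),
]
print(reactivations(monthly))
# [202106]
```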
I think this can be radically simpler (and faster):
SELECT advertisable, ym AS reactivation_ym
FROM  (
   SELECT advertisable
        , date_trunc('month', day) AS ym
        , SUM(cost) < 500000 AS asleep
        , count(SUM(cost) < 500000 OR NULL)
             OVER (PARTITION BY advertisable
                   ORDER BY date_trunc('month', day)
                   ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS ct
   FROM   adcube dac
   JOIN  (
      SELECT advertisable
      FROM   adcube
      GROUP  BY 1
      HAVING SUM(cost) > 1e8  -- really 100000000 ?
      ) x USING (advertisable)
   GROUP  BY 1, 2
   ) sub
WHERE  NOT asleep
AND    ct = 4;
This builds on a couple of assumptions to fill in for missing information.
I largely untangled your calculations and simplified the code, making it shorter and faster than your original.
Count for each advertisable how many of the last 4 months had a total cost below 500000. The row qualifies only if all 4 (existing) months are below the threshold. (If you don't have rows for all months, you need to decide how to handle missing rows; that information is not available in your question.)
This uses count() as a window aggregate function with a custom frame. Here is a recent related answer with a detailed explanation:
Querying count on daily basis with date constraints over multiple weeks
How can you "nest" count() and sum()?
They are not really nested. It's a window function over an aggregate function. Details:
Get the distinct sum of a joined table column
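To illustrate the custom frame (ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING), here is a small plain-Python sketch of the same sliding count for one advertisable. The function name `reactivation_months`, the month labels, and the sample costs are hypothetical; like the query, it assumes one row per month with no missing months.

```python
def reactivation_months(monthly_cost, threshold=500_000, window=4):
    """monthly_cost: (month, total_cost) tuples for one advertisable, ordered by month.

    Returns the months that are not "asleep" and whose `window`
    preceding months were all below the threshold.
    """
    asleep = [cost < threshold for _, cost in monthly_cost]
    result = []
    for i, (month, _) in enumerate(monthly_cost):
        # frame: the rows from `window` PRECEDING up to 1 PRECEDING
        ct = sum(asleep[max(0, i - window):i])
        if not asleep[i] and ct == window:
            result.append(month)
    return result


# Hypothetical monthly totals.
monthly_cost = [
    ("2021-01", 2_000_000), ("2021-02", 0), ("2021-03", 100_000),
    ("2021-04", 0), ("2021-05", 0), ("2021-06", 900_000),
    ("2021-07", 0), ("2021-08", 700_000),
]
print(reactivation_months(monthly_cost))
# ['2021-06']
```

Note that 2021-08 does not qualify here: its frame covers 2021-04 to 2021-07, and 2021-06 was not below the threshold.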