Reactivation SQL

I have the following:
with t as (
      SELECT advertisable, EXTRACT(YEAR from day) as yy, EXTRACT(MONTH from day) as mon,
             ROUND(SUM(cost)/1e6) as val
      FROM adcube dac
      WHERE advertisable IN (SELECT advertisable
                             FROM adcube dac
                             GROUP BY advertisable
                             HAVING SUM(cost)/1e6 > 100
                            )
      GROUP BY advertisable, EXTRACT(YEAR from day), EXTRACT(MONTH from day)
     )
select advertisable, min(yy * 10000 + mon) as yyyymm
from (select t.*,
             (row_number() over (partition by advertisable order by yy, mon) -
              row_number() over (partition by advertisable, val order by yy, mon)
             ) as grp
      from t
     ) as foo
group by advertisable, grp, val
having count(*) >= 6 and val = 0;
This tracks the date at which an account stops spending for 4 months. However, I would like to track the reactivation date instead: if an account starts spending again after 4 months, I want to see the new start date for that account.

You want to find accounts where val > 0 and there are 4 (or 6) preceding records with 0s.
Here is an idea:
Calculate the groups of similar values as in your query.
Assign a sequential number to each group (val_seqnum).
Then pull the previous value and sequence number for each record.
Now, you want the records where the following is true:
val > 0
prev_val = 0
The previous val_seqnum >= 4 (or whatever your threshold).
The following query should do this (assuming the same definition of t):
select t.*
from (select t.*,
             lag(val) over (partition by advertisable order by yy, mon) as prev_val,
             lag(val_seqnum) over (partition by advertisable order by yy, mon) as prev_val_seqnum
      from (select t.*,
                   row_number() over (partition by advertisable, val, grp order by yy, mon) as val_seqnum
            from (select t.*,
                         (row_number() over (partition by advertisable order by yy, mon) -
                          row_number() over (partition by advertisable, val order by yy, mon)
                         ) as grp
                  from t
                 ) t
           ) t
     ) t
where val > 0 and prev_val = 0 and prev_val_seqnum >= 4;

I think this can be radically simpler (and faster):
SELECT advertisable, ym AS reactivation_ym
FROM  (
   SELECT advertisable
        , date_trunc('month', day) AS ym
        , SUM(cost) < 500000 AS asleep
        , count(SUM(cost) < 500000 OR NULL)
             OVER (PARTITION BY advertisable
                   ORDER BY date_trunc('month', day)
                   ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) AS ct
   FROM   adcube dac
   JOIN  (
      SELECT advertisable
      FROM   adcube
      GROUP  BY 1
      HAVING SUM(cost) > 1e8  -- really 100000000?
      ) x USING (advertisable)
   GROUP  BY 1, 2
   ) sub
WHERE  NOT asleep
AND    ct = 4;
Building on a couple of assumptions to fill in for missing information.
I largely untangled your calculations and simplified the code, making it shorter and faster than your original.
For each advertisable, count how many of the last 4 months had a total cost below 500000. The row only qualifies if all 4 (existing) months are below the threshold. (If you don't have rows for all months, you need to decide how to handle missing rows; that information is not in your question. One way to fill in the missing months is sketched below.)
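If months can be missing, one way to densify the time grid (a sketch, assuming Postgres; the date bounds are placeholders) is to cross-join the advertisables with generate_series() and left-join the actual rows:
SELECT a.advertisable, m.ym, COALESCE(SUM(d.cost), 0) < 500000 AS asleep
FROM  (SELECT DISTINCT advertisable FROM adcube) a
CROSS JOIN generate_series(timestamp '2015-01-01'
                         , timestamp '2016-12-01'
                         , interval '1 month') AS m(ym)
LEFT  JOIN adcube d ON d.advertisable = a.advertisable
                   AND date_trunc('month', d.day) = m.ym
GROUP BY 1, 2;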
This uses count() as a window aggregate function with a custom frame. Here is a recent related answer with a detailed explanation:
Querying count on daily basis with date constraints over multiple weeks
How can you "nest" count() and sum()?
They are not really nested. It's a window function over an aggregate function. Details:
Get the distinct sum of a joined table column
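A minimal illustration of the pattern with the same table: the inner SUM(cost) is a plain aggregate computed per group, and the outer SUM(...) OVER (...) is a window function evaluated afterwards over those aggregated rows:
SELECT advertisable
     , date_trunc('month', day) AS ym
     , SUM(cost) AS monthly_cost                 -- plain aggregate, one row per group
     , SUM(SUM(cost)) OVER (PARTITION BY advertisable
                            ORDER BY date_trunc('month', day)) AS running_cost  -- window over the aggregate
FROM adcube
GROUP BY 1, 2;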

Related

How do I add an autoincrement Counter based on Conditions and conditional reset in Google-Bigquery

I have my table in BigQuery and have a problem getting an incremental field based on a condition.
Basically, every time the score dips below 95% it returns Stage 1 for the first week. If it stays below 95% for a second straight week it returns Stage 2, and so on. However, if it goes above 95% the counter resets to "Good", and thereafter returns Stage 1 again when it drops below 95%.
You can use row_number() -- but after assigning a group based on the count of > 95% values up to each row:
select t.*,
       (case when row_number() over (partition by grp order by month, week) = 1
             then 'Good'
             else concat('Stage ', cast(row_number() over (partition by grp order by month, week) - 1 as string))
        end) as level
from (select t.*,
             countif(score > 0.95) over (order by month, week) as grp
      from t
     ) t;
Consider below
select * except(grp),
       (case when Average_score >= 95 and 1 = row_number() over grps then 'Good'
             else format('Stage %i', row_number() over grps - sign(grp))
        end) as Level
from (
  select *, countif(Average_score >= 95) over (order by Month, Week) as grp
  from `project.dataset.table`
)
window grps as (partition by grp order by Month, Week)
If applied to the sample data in your question, it produces the expected output.

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select min_e_date
     , sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
      from final
      group by 1
     ) c
group by min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
Edit: Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
       count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           order by date_trunc('week', event_date)
           rows between 3 preceding and current row  -- one row per week after aggregation
       ) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:
select t.date,
       count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
       count(distinct tprev.customer_id) as trailing_30
from t join
     t tprev
     on tprev.date >= t.date - interval '29 day' and
        tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
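For instance (a sketch, assuming the query above is wrapped in a CTE named daily and that there is one row per day), keep the latest day and then every 7th day back:
WITH daily AS (
   SELECT t.date,
          count(distinct case when tprev.date >= t.date - interval '6 day'
                              then tprev.customer_id end) as trailing_7,
          count(distinct tprev.customer_id) as trailing_30
   FROM t JOIN
        t tprev
        ON tprev.date >= t.date - interval '29 day' AND
           tprev.date <= t.date
   GROUP BY t.date
)
SELECT date, trailing_7, trailing_30
FROM (SELECT d.*, row_number() OVER (ORDER BY date DESC) AS rn
      FROM daily d
     ) d
WHERE rn % 7 = 1;   -- rn = 1 is the latest day, then every 7th day before it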
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
      select customer_id, date,
             lead(date) over (partition by customer_id order by date) as next_date,
             sum(sum(inc)) over (partition by customer_id order by date) as cnt
      from ((select t.customer_id, t.date, 1 as inc
             from t
            ) union all
            (select t.customer_id, t.date + interval '7 day', -1 as inc
             from t
            )
           ) tt
      group by customer_id, date   -- needed for the aggregate inside the window function
     ),
     cd2 as (
      select customer_id, min(date) as enter_date, max(date) as exit_date
      from (select cd.*,
                   sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
            from (select cd.*,
                         lag(cnt) over (partition by customer_id order by date) as prev_cnt
                  from cd
                 ) cd
           ) cd
      group by customer_id, grp
      having max(cnt) > 0
     )
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
       from cd2
      ) union all
      (select customer_id, exit_date as dte, -1 as inc
       from cd2
      )
     ) cd2
group by dte;

count consecutive record with timestamp interval requirement

Referring to this post: link, I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   row_number() over (partition by taxi order by time) as seqnum,
                   row_number() over (partition by taxi, client order by time) as seqnum_c
            from t
           ) t
      group by t.taxi, t.client, (seqnum - seqnum_c)
      having count(*) >= 2
     ) t
group by taxi;
and got my answer perfectly, like this:
Tom 3 (AA counts as 1, AAA counts as 1 and BB counts as 1, so a total count of 3)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how to do it.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   sum(case when prev_client = client and
                                 prev_time > time - interval '2 hour'
                            then 0 else 1    -- start a new group on a break
                       end) over (partition by taxi order by time) as grp
            from (select t.*,
                         lag(client) over (partition by taxi order by time) as prev_client,
                         lag(time) over (partition by taxi order by time) as prev_time
                  from t
                 ) t
           ) t
      group by t.taxi, t.client, grp
      having count(*) >= 2
     ) t
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
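For example, if your database were SQL Server (an assumption; it has no interval literals), the comparison could use DATEADD() instead, e.g. for the group-start flag:
select t.*,
       case when lag(client) over (partition by taxi order by time) = client and
                 lag(time) over (partition by taxi order by time) > dateadd(hour, -2, time)
            then 0 else 1
       end as new_group_flag   -- same "start a new group on a break" logic as above
from t;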

Filter rows in PostgreSQL based on values of consecutive rows in one column

So I'm working with the following PostgreSQL table:
[screenshot: 10 rows from the PostgreSQL table]
For each business_id, I want to filter out those businesses where the review_count isn't above a specific review_count threshold for 2 consecutive months (or rows). The threshold depends on the city the business_id is in (for example, in the screenshot above, we can assume rows with city = Charlotte have a review_count threshold of >= 2, and those with city = Las Vegas have a review_count threshold of >= 3). If a business_id does not have at least one instance of consecutive months with review_counts above the specified threshold, I want to filter it out.
I want this query to return only the business_ids that meet this condition (as well as all the other columns in the table that go along with that business_id). The composite primary key on this table is (business_id, year, month).
Some months, as you may notice, are missing from the data (month 9 of the second business_id). If that is the case, I do NOT want to count 2 rows as 'consecutive months'. For example, for the business in Las Vegas, I do NOT want to consider months 8 and 10 as 'consecutive months', even though they appear in consecutive rows.
I've tried something like this, but have kind of run into a wall and don't think it's getting me far:
SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
FROM us_business_monthly_review_growth
GROUP BY business_id, year, month
HAVING (city = 'Las Vegas'
AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))
I'm new to Postgres and Stack Overflow, so if you have any feedback on the way I asked this question please don't hesitate to let me know! =)
UPDATE:
Thanks to some help from @Gordon Linoff, I found the following solution:
SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT DISTINCT business_id
                      FROM (SELECT *,
                                   lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                                   lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                                   lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
                            FROM us_businesses_monthly_growth_and_avg
                           ) AS usga
                      WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1)
                         OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1));
You can do this with lag():
select distinct business_id
from (select t.*,
             lag(year) over (partition by business_id order by year, month) as prev_year,
             lag(month) over (partition by business_id order by year, month) as prev_month,
             lag(review_count) over (partition by business_id order by year, month) as prev_review_count
      from us_business_monthly_review_growth t
     ) t
where review_count >= $threshold and prev_review_count >= $threshold and
      (year * 12 + month) = (prev_year * 12 + prev_month) + 1;
The only trick is setting the threshold value. I have no idea how you plan on doing that.
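One option (a sketch, assuming you keep the per-city thresholds in a lookup table, hypothetically named city_thresholds(city, threshold)) is to join it in rather than hard-coding $threshold:
select distinct t.business_id
from (select g.*,
             lag(year) over (partition by business_id order by year, month) as prev_year,
             lag(month) over (partition by business_id order by year, month) as prev_month,
             lag(review_count) over (partition by business_id order by year, month) as prev_review_count
      from us_business_monthly_review_growth g
     ) t
join city_thresholds ct
     on ct.city = t.city
where t.review_count >= ct.threshold and
      t.prev_review_count >= ct.threshold and
      (t.year * 12 + t.month) = (t.prev_year * 12 + t.prev_month) + 1;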
Please try...
SELECT business_id
FROM
(
    SELECT business_id AS business_id,
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 +
           LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) -
           ( year * 12 + month ) AS diffInDates,
           review_count AS review_count
    FROM us_business_monthly_review_growth
    ORDER BY business_id,
             year,
             month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  AND tblCityThresholds.threshold <= review_count
GROUP BY business_id;
In formulating this answer I first used the following code to test that LAG() performed as hoped...
SELECT business_id,
       LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       ( year * 12 + month ) -
       ( LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 +
         LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;
Here I was trying to get LAG() to refer to values on the next row, but the output showed that it was referring to the previous row in that comparison. Unfortunately, I wanted to compare current values with the next ones to see if the next record had the same business_id, etc. So I changed the 1 in LAG() to -1, giving me...
SELECT business_id,
       LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 +
         LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) -
       ( year * 12 + month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;
As this gave me the desired results, I added city to allow a JOIN between the results and an assumed table holding the details of each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion since I am not sure what you have / would call it. This completed the inner SELECT statement.
I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output as per your criteria. Note: it is assumed that the city field will always have a corresponding entry in tblCityThresholds.
I then used GROUP BY to ensure no repetition of a business_id.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Further Reading
https://www.postgresql.org/docs/8.4/static/functions-window.html (regarding LAG())

SQL SELECT rows where the difference between consecutive columns is less than X

Basically the same as Mysql: Find rows, where timestamp difference is less than x, but I want to stop at the first value whose timestamp difference is larger than X.
I got so far:
SELECT *
FROM (
SELECT *,
(LEAD(datetime) OVER (ORDER BY datetime)) - datetime AS difference
FROM history
) AS sq
WHERE difference < '00:01:00'
Which seems to correctly return all rows where the difference between the row and the one "behind" it is less than a minute, but that means I still get large jumps in the datetimes, which I don't want - I want to select the most recent "run" of rows, where a "run" is defined as "the timestamps in datetime differ by less than a minute".
e.g., I have rows whose hypothetical timestamps are as follows:
24, 22, 21, 19, 18, 12, 11, 9, 7...
And my limit of differences is 3, i.e. I want the run of the rows whose difference between "timestamps" is less than 3; therefore just:
24, 22, 21, 19, 18
Is this possible in SQL?
You can use lag to get the previous row's timestamp and check whether the current row is within 3 minutes of it. Reset the group when the condition fails. Once this grouping is done, find the latest such group with max, then select all the rows from that latest group.
Include a partition by clause in the window functions lag, sum and max if this has to be done for each id in the table (a sketch follows the two queries below).
with grps as (
     select x.*, sum(col) over(order by dt) grp
     from (select t.*
                  -- checking if the current row's timestamp is within 3 minutes of the previous row
                  ,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
                                    AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
           from t) x
     )
select dt
from (select g.*, max(grp) over() maxgrp  -- getting the latest group
      from grps g
     ) g
where grp = maxgrp
The above would return the members of the latest group even if it has only one row. To avoid such results, get the latest group that has more than one row:
with grps as (
     select x.*, sum(col) over(order by dt) grp
     from (select t.*
                  ,case WHEN dt BETWEEN LAG(dt) OVER (ORDER BY dt)
                                    AND LAG(dt) OVER (ORDER BY dt) + interval '3 minute' THEN 0 ELSE 1 END col
           from t) x
     )
   , grpcnts as (select g.*, count(*) over(partition by grp) grpcnt from grps g)
select dt
from (select g.*, max(grp) over() maxgrp
      from grpcnts g
      where grpcnt > 1
     ) g
where grp = maxgrp
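As mentioned above, if this has to be done per id, each window function gets a partition by clause (a sketch, assuming a hypothetical id column):
with grps as (
     select x.*, sum(col) over(partition by id order by dt) grp
     from (select t.*
                  ,case WHEN dt BETWEEN LAG(dt) OVER (PARTITION BY id ORDER BY dt)
                                    AND LAG(dt) OVER (PARTITION BY id ORDER BY dt) + interval '3 minute'
                        THEN 0 ELSE 1 END col
           from t) x
     )
select id, dt
from (select g.*, max(grp) over(partition by id) maxgrp  -- latest group per id
      from grps g
     ) g
where grp = maxgrp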
You can do this by using a flag based on the lead() or lag() values. I believe this does what you want:
SELECT h.*
FROM (SELECT h.*,
             SUM( (next_datetime > datetime + interval '1 minute')::int ) OVER (ORDER BY datetime DESC) as grp
      FROM (SELECT h.*,
                   LEAD(h.datetime) OVER (ORDER BY h.datetime) as next_datetime
            FROM history h
           ) h
     ) h
WHERE grp IS NULL OR grp = 0;
This can be easily solved with a recursive CTE (it selects your rows one by one and stops when there is no row within interval '1 min'):
with recursive h as (
      select * from (
            select *
            from history
            order by history.datetime desc
            limit 1
      ) s
      union all
      select * from (
            select history.*
            from h
            join history on history.datetime >= h.datetime - interval '1 min'
                        and history.datetime < h.datetime
            order by history.datetime desc
            limit 1
      ) s
)
select * from h
This should be efficient if you have an index on history.datetime. Though, if you care about performance, you should test it against the window-function-based ones. (I personally get a headache when I see as many subqueries and window functions as this problem needs. The irony in my answer is that PostgreSQL does not support the ORDER BY clause directly inside recursive CTEs, so I had to use two otherwise-meaningless subqueries to "hide" them.)
rextester