Distinct in Window Functions. BigQuery - sql

I'm trying to do something like this in BigQuery
COUNT(DISTINCT user_id) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
In other words, I have a table with Date, Userid, Sample and Application ID. I need to count the cumulative number of unique active users for each day starting from the beginning of the month and ending with the current day.
The function works properly without DISTINCT; however, that gives me a total count of users, which is not what I need.
I tried some tricks with dense_rank, but they don't work here either.
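(For reference, the usual dense_rank workaround for a windowed distinct count is sketched below; it only yields the distinct total for the whole partition, because DENSE_RANK must be ordered by user_id rather than by date, so it cannot honor a cumulative date frame.)
-- sketch of the classic trick (column names as in the question):
-- dr = DENSE_RANK() OVER (PARTITION BY DATE_TRUNC(date, MONTH), sample, app_id ORDER BY user_id)
-- MAX(dr) OVER (PARTITION BY DATE_TRUNC(date, MONTH), sample, app_id) = COUNT(DISTINCT user_id) per partition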
Are there any ways to calculate the number of distinct users using window functions?
-------------UPDATED----------------
Here is the full query, so you can better understand what I need:
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,COUNT(distinct active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6

This specific question is a duplicate and already answered here
... here is the full query ...
As for how to apply the above to your particular query - see below (not tested, and based fully on your code):
#standardSQL
WITH mtd1 AS (
SELECT
'MonthToDate' AS TIMELINE
,fd.date DATE
,td.SAMPLE AS SAMPLE
,td.APPNAME AS APP_ID
,SUM(fd.revenue) AS REVENUE
,td.user_id ACTIVE_USERS
FROM `DWH.DailyUser` fd
JOIN `DWH.Depositors` td USING (userid)
GROUP BY 1,2,3,4,6
), mtd2 AS (
SELECT
TIMELINE
,DATE
,SAMPLE
,APP_ID
,SUM(REVENUE) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS REVENUE
,ARRAY_AGG(ACTIVE_USERS) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ACTIVE_USERS
FROM mtd1
), mtd AS (
SELECT * REPLACE((SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS)
FROM mtd2
)
SELECT * FROM mtd
WHERE EXTRACT(day FROM DATE) = EXTRACT(day FROM CURRENT_DATE)
GROUP BY 1,2,3,4,5,6
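A note on the syntax above: SELECT * REPLACE(expr AS col) is BigQuery standard SQL for keeping every column while overwriting one of them, so the ACTIVE_USERS array is swapped for its distinct count without re-listing the other columns.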

You can use ARRAY_AGG, then count the distinct elements in each array. Note that your query will run out of memory if the arrays end up being too big, though.
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd2 as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,ARRAY_AGG(active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
), mtd AS (
SELECT * EXCEPT(ACTIVE_USERS),
(SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS
FROM mtd2
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6

One method for implementing count(distinct) uses row_number() and then counts the "1"s:
select SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY date) as Active_Users
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id, user_id ORDER BY DATE) as seqnum
FROM t
) t
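The idea: seqnum = 1 only on the first row for each user within the month/sample/app partition, so the running sum of those flags as of any date equals the number of distinct users seen so far that month. (The table t is a placeholder for your joined source.)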

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to group users who have overlapping voucher periods into a single row (showing the minimum start date and the maximum end date).
For example, given records where row 1 and row 2 have overlapping dates, I would like to combine them into a single row with min(Start_Date) and max(End_Date). (The sample records and expected result were shown as images in the original post.)
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0 end) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
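As a concrete sanity check (data invented), take one id with vouchers A (Jan 1-10), B (Jan 5-20) and C (Feb 1-5). You can plug this CTE in as the users table:
with users as (
    select 1 as id, 'A' as voucher_code, date '2020-01-01' as start_date, date '2020-01-10' as end_date
    union all select 1, 'B', date '2020-01-05', date '2020-01-20'
    union all select 1, 'C', date '2020-02-01', date '2020-02-05'
)
For A, prev_end_date is NULL, starting grp 1; for B, 2020-01-10 >= 2020-01-05, so B stays in grp 1; for C, 2020-01-20 < 2020-02-01, so the cumulative sum increments to grp 2. The query then returns (1, 2020-01-01, 2020-01-20) and (1, 2020-02-01, 2020-02-05).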
Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date AS start_date, MAX(end_date) AS end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date
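Note that the fiddle above appears to target SQL Server; on Redshift the same idea requires spelling the clause as WITH RECURSIVE (supported in later releases), but the anchor/recursive-member structure is otherwise the same.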

SQL Rolling LTV (Lifetime Value)

I am trying to get a rolling calculation of customer lifetime value. The basic formula I am using would be SUM(revenue) / COUNT(DISTINCT customers), but I am running into issues when trying to compute those numbers from whatever day it is, moving backward. The code below isn't correct; I also tried a PARTITION version that didn't work either.
CREATE TEMP TABLE customer_revenue AS
(
SELECT TRUNC(timestamp) AS "order_date", COUNT(DISTINCT customer_email) AS "customers",
SUM(revenue)-SUM(discount)-SUM(shipping)-SUM(tax) AS "revenue"
FROM public.fact_shopify_orders
GROUP BY TRUNC(timestamp)
);
SELECT TRUNC(SO.timestamp) AS "date", SUM(CR.revenue) / COUNT(customers) AS "LTV"
FROM customer_revenue CR
LEFT JOIN public.fact_shopify_orders SO ON CR.order_date = SO.timestamp
WHERE CR.order_date <= SO.timestamp
GROUP BY TRUNC(SO.timestamp)
ORDER BY TRUNC(SO.timestamp) DESC
I think you want rolling sums and count(distinct). The latter is a little tricky but you can emulate it easily using a flag based on the first time the customer is seen:
SELECT date,
( SUM(SUM(net_revenue)) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(SUM( (seqnum = 1)::int )) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date;
EDIT:
I think Redshift supports window functions with aggregation... but there is some database out there that does not. You can try this:
SELECT date,
( SUM(net_revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(num_firsts) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT date, SUM(net_revenue) as net_revenue,
SUM( (seqnum = 1)::int ) as num_firsts
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date
) so;
Here is a similar version running in Postgres.
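(The linked example is not reproduced here; a minimal sketch of the Postgres variant, assuming the same fact_shopify_orders columns, might look like this - the quoting is needed because timestamp is a type keyword:)
SELECT date,
       (SUM(SUM(net_revenue)) OVER (ORDER BY date))::numeric /  -- ::numeric guards against integer division
       SUM(SUM((seqnum = 1)::int)) OVER (ORDER BY date) AS ltv
FROM (SELECT so."timestamp"::date AS date,
             (revenue - discount - shipping - tax) AS net_revenue,
             ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY "timestamp") AS seqnum
      FROM fact_shopify_orders so
     ) so
GROUP BY date;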

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in a trailing 30-day window, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
Edit: additional data was attached as an image in the original post. Data collection begins on 11/23; no data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count(say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between 3 week preceding and current row
) as rolling_cnt
from final
group by 1
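One caveat: RANGE frames with an offset (3 week preceding) are not supported everywhere - Redshift, for instance, only allows UNBOUNDED PRECEDING / CURRENT ROW with RANGE - so with exactly one row per week, ROWS BETWEEN 3 PRECEDING AND CURRENT ROW is an equivalent workaround.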
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then customer_id end) as trailing_7,
count(distinct customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
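For example (a sketch, assuming the result above is wrapped in a CTE called daily_counts and the dates are continuous), keeping one row per week could look like:
select *
from daily_counts
where date_part(dow, date) = 0; -- or whichever weekday your reporting week ends on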
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and exit the time periods in which they are counted. This is a pain with two different time frames, so here is what it looks like for one of them (the 7-day window).
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
group by customer_id, date
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;
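For the trailing-30-day count in the question, the same structure should presumably work with date + interval '30 day' as the exit record instead of '7 day'.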

Active customers for each day who were active in last 30 days

I have a BQ table, user_events that looks like the following:
event_date | user_id | event_type
Data is for Millions of users, for different event dates.
I want to write a query that will give me, for every day, the list of users who were active in the last 30 days.
The query below gives me the unique users on only that day; I can't get it to cover the last 30 days for each date. Help is appreciated.
SELECT
user_id,
event_date
FROM
[TableA]
WHERE
1=1
AND user_id IS NOT NULL
AND event_date >= DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY')
GROUP BY
1,
2
ORDER BY
2 DESC
Below is for BigQuery Standard SQL and makes a few assumptions about your case:
there is only one row per date per user
a user is considered active in the last 30 days if the user has at least 5 (it can be any number - even just 1) entries/rows within those 30 days
If the above makes sense - see below:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM `yourTable`
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
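The UNIX_DATE() wrapper is what makes this frame work: BigQuery requires a numeric ORDER BY key for RANGE frames, so converting the date to a day number lets 30 PRECEDING mean 30 calendar days rather than 30 rows.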
If assumption #1 above is not correct, you can simply add pre-grouping as a sub-select:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date
UPDATE
From comments: if the user has any event with event_type IN ('view', 'conversion', 'productDetail', 'search'), they will be considered active. That means any kind of event triggered within the app.
So you can go with the below, I think:
#standardSQL
SELECT
user_id, event_date
FROM (
SELECT
user_id, event_date,
(COUNT(1)
OVER(PARTITION BY user_id
ORDER BY UNIX_DATE(event_date)
RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING)
) >= 5 AS activity
FROM (
SELECT user_id, event_date
FROM `yourTable`
WHERE event_type IN ('view', 'conversion', 'productDetail', 'search')
GROUP BY user_id, event_date
)
)
WHERE activity
GROUP BY user_id, event_date
-- ORDER BY event_date

How can I create a week-to-date metric in vertica?

I have a table which stores year-to-date metrics once per client per day. Simplified, the schema looks roughly like this; let's call this table history:
bus_date | client_id | ytd_costs
I'd like to create a view that adds week-to-date costs; essentially, any cost that occurs after the prior Friday would be considered part of the week-to-date. Currently I have the following, but I'm concerned about the switch-case logic.
Here is an example of the logic I have right now to show that this works.
I also got to use the timeseries clause which I've never used before...
;with history as (
select bus_date,client_id,ts_first_Value(value,'linear') "ytd_costs"
from (select {ts'2016-10-07'} t,1 client_id,5.0 "value" union all select {ts'2016-10-14'},1, 15) k
timeseries bus_Date as '1 day' over (partition by client_id order by t)
)
,history_with_wtd as (select bus_date
,client_id
,ytd_costs
,ytd_costs - decode(
dayofweek(bus_date)
,6,first_value(ytd_costs) over (partition by client_id order by bus_date range '1 week' preceding)
,first_value(ytd_costs) over (partition by client_id,date_trunc('week',bus_date+3) order by bus_date)
) as "wtd_costs"
,ytd_costs - 5 "expected_wtd"
from history)
select *
from history_with_wtd
where date_trunc('week',bus_date) = '2016-10-10'
In SQL Server I could just use the lag function, as I can pass a variable to the look-back clause, but in Vertica no such option exists.
How about partitioning by week starting on Saturday? First grab the first day of the week, then offset so the week starts on Saturday: trunc(bus_date + 1,'D') - 1.
Also notice the window frame is from the start of the partition (Saturday, unbounded preceding) to the current row.
select
bus_date
,client_id
,ytd_costs
,ytd_costs - first_value(ytd_costs) over (
partition by client_id, trunc(bus_date + 1,'D') - 1
order by bus_date
range between unbounded preceding and current row) wtd_costs
from sos.history
order by client_id, bus_date