BigQuery, WAU from DAU - google-bigquery

I am trying to calculate WAU from DAU using this link:
Fix the MAU problem while calculating DAU and MAU on Amazon Redshift
But the solution in the link works for Redshift. I am trying to do the same thing in BigQuery, but it's giving me this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join
My code is here:
WITH
data AS (
SELECT
EXTRACT(date FROM active_time) AS active_date,
COUNT(distinct user_id) AS DAU
FROM
abc
GROUP BY
active_date )
SELECT
active_date,
DAU,
(select count(distinct user_id)
from abc
where EXTRACT(date FROM abc.active_time) between DATE_SUB(data.active_date, interval 7 day) and data.active_date
) as WAU
from data
Could someone please help? Thanks in advance.

Here is one way to achieve your results (note: it's a bit of a hack to get past the left join equality check in BigQuery, but it does the job — the join adds an always-true equality condition so the date-range condition is allowed):
select
  d.active_date,
  d.DAU as dau,
  count(distinct e.user_id) as wau
from data d
left join abc e
  -- always-true equality so BigQuery accepts the date-range condition on a left join
  on array_length(split(cast(d.active_date as string), '-')) = array_length(split(cast(extract(date from e.active_time) as string), '-'))
  and extract(date from e.active_time) between date_sub(d.active_date, interval 7 day) and d.active_date
group by 1, 2
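Another way to sidestep the restriction, without the always-true equality trick, is a plain CROSS JOIN with the date-range filter in the WHERE clause; the error only applies to outer joins. This is just a sketch against your schema, and it relies on every date in data having at least one event in its trailing window (true here, since data is built from abc and the window includes the day itself):
select
  d.active_date,
  d.DAU as dau,
  count(distinct e.user_id) as wau
from data d
cross join abc e
where extract(date from e.active_time) between date_sub(d.active_date, interval 7 day) and d.active_date
group by 1, 2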

Related

Counting working days between two dates in data table using calendar table

I have a data table, which consists of 4 dates per row:
table example
Also I have calendar table with holidays and weekends for my location.
calendar table
What I need is to count the number of working days for the following pairs in the data table:
task_work_end_date and task_got_to_work_date
task_got_to_work_date and task_assigned_date
I have tried the following select, but it always shows 1 working day, because I'm grouping by calendar_date in the subquery:
select data_table.*, days.work_days
from data_table
left join (
select calendar_date, count(calendar_date) as work_days
from calendar_table
where type_of_day IN ('workday', 'workday shortened')
group by calendar_date ) days
ON days.calendar_date between task_assigned_date and task_got_to_work_date
Please advise on SQL to join these tables correctly.
If you are on SQL Server, then use OUTER APPLY as follows:
select d.*, days.work_days
from data_table d
outer apply (
select count(calendar_date) as work_days
from calendar_table c
where c.type_of_day IN ('workday', 'workday shortened')
and c.calendar_date between d.task_assigned_date and d.task_got_to_work_date) days
A lateral join is definitely one way to solve the problem (that is the apply syntax in the other answers).
A more generic answer is simply a correlated subquery:
select d.*,
(select count(*)
from calendar_table c
where c.type_of_day in ('workday', 'workday shortened') and
c.calendar_date between d.task_assigned_date and d.task_got_to_work_date
) as work_days
from data_table d;
Note: If performance is an issue, there may be other approaches. If that is the case, accept one of the answers here and ask a new question.
To use a left join, you need to change how you are grouping. You can list the actual columns in data_table in the group by and the select as well.
select data_table.*, count(days.calendar_date)
from data_table
left join calendar_table days
ON days.calendar_date between task_assigned_date and task_got_to_work_date
and type_of_day IN ('workday', 'workday shortened')
group by data_table.*
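For concreteness (only a sketch, since the question never shows data_table's full column list): with the three date columns named in the question plus a hypothetical task_id key, the spelled-out version would look like:
-- task_id is a hypothetical key column; substitute data_table's real column list
select d.task_id, d.task_assigned_date, d.task_got_to_work_date, d.task_work_end_date,
       count(days.calendar_date) as work_days
from data_table d
left join calendar_table days
  on days.calendar_date between d.task_assigned_date and d.task_got_to_work_date
  and days.type_of_day in ('workday', 'workday shortened')
group by d.task_id, d.task_assigned_date, d.task_got_to_work_date, d.task_work_end_date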
Another option would be to outer apply and get the count this way:
select data_table.*, days.work_days
from data_table
outer apply (
select count(calendar_date) as work_days
from calendar_table
where type_of_day IN ('workday', 'workday shortened')
and calendar_date between task_assigned_date and task_got_to_work_date) days
This solution worked perfectly for me in Postgres:
data_table
join calendar_table calendar
  ON tsrange(task_assigned_date, task_got_to_work_date) && tsrange(calendar.start_time, calendar.end_time)
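Since this turned out to be Postgres, the OUTER APPLY answers above also carry over almost verbatim: Postgres spells it as a LATERAL join. A sketch using the column names from the question:
-- cross join lateral is the Postgres counterpart of APPLY; the subquery always returns one row
select d.*, w.work_days
from data_table d
cross join lateral (
  select count(*) as work_days
  from calendar_table c
  where c.type_of_day in ('workday', 'workday shortened')
    and c.calendar_date between d.task_assigned_date and d.task_got_to_work_date
) w;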

LEFT OUTER JOIN Error creating a subquery on bigquery

I'm trying to evaluate MAU, WAU and DAU from an event table in my BigQuery project...
I created a query to find DAU and, from it, WAU and MAU,
but it does not work; I received this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
Here is my query:
WITH dau AS (
SELECT
date,
COUNT(DISTINCT(events.device_id)) as DAU_explorer
FROM `workspace.event_table` as events
GROUP BY 1
)
SELECT
date,
dau,
(SELECT
COUNT(DISTINCT(device_id))
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -30 DAY) AND dau.date
) AS mau,
(SELECT
COUNT(DISTINCT(device_id)) as DAU_explorer
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -7 DAY) AND dau.date
) AS wau
FROM dau
Where is my error? Is it not possible to run subqueries like this in BigQuery?
Try this instead:
#standardSQL
WITH data AS (
SELECT DATE(creation_date) date, owner_user_id device_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT IF(i<31,device_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,device_id,null)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30
And if you are looking for a more efficient solution, try approximate results.
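For example, here is a sketch of the same query with APPROX_COUNT_DISTINCT in place of the exact distinct counts (the numbers are approximate, but the aggregation is much lighter on memory for large tables):
#standardSQL
WITH data AS (
  SELECT DATE(creation_date) date, owner_user_id device_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
  -- approximate distinct counts over the trailing 30-day and 7-day windows
  , APPROX_COUNT_DISTINCT(IF(i<31, device_id, NULL)) unique_30_day_users
  , APPROX_COUNT_DISTINCT(IF(i<8, device_id, NULL)) unique_7_day_users
FROM data, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30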

SQL - Unequal left join BigQuery

New here. I am trying to get daily and weekly active users over time; users have 30 days before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period and I have the simplified orders table with the base info that I need to calculate this.
I am trying to do a Left Join to get the status by date using the following SQL Query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." I was wondering how I can get around this. I am using Standard SQL within BigQuery.
Thank you
Below is for BigQuery Standard SQL and mostly reproduces the logic in your query, with the exception of not including days where no activity at all is found.
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to exactly reproduce your logic, you can wrap the above in a final LEFT JOIN as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2

How to limit datasets using _table_suffix on complex query?

I understand how _TABLE_SUFFIX works and have successfully used it before on simpler queries. I'm currently trying to build an application that will get active users from 100+ datasets but have been running into resource limits. In order to bypass these resource limits I'm going to loop and run the query multiple times and limit how much it selects at once using _TABLE_SUFFIX.
Here is my current query:
WITH allTables AS (SELECT
app,
date,
SUM(CASE WHEN period = 30 THEN users END) as days_30
FROM (
SELECT
CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
dates.date as date,
periods.period as period,
COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.app_events_*` as activity
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) AS event
CROSS JOIN (
SELECT DISTINCT
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC') as date
FROM `table.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) as event) as dates
CROSS JOIN (
SELECT
period
FROM (
SELECT 30 as period
)
) as periods
WHERE
dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
AND
FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3
)
GROUP BY 1,2)
SELECT
app as target,
UNIX_SECONDS(date) as datapoint_time,
SUM(days_30) as datapoint_value
FROM allTables
WHERE date >= TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP, Day, 'UTC'), INTERVAL -30 DAY)
GROUP BY date,1
ORDER BY date ASC
This currently gives me:
Error: Syntax error: Expected ")" but got keyword CROSS at [14:3]
So my question is, how can I limit the amount of data I pull in using this query and _TABLE_SUFFIX? I feel like I'm missing something very simple here. Any help would be great, thanks!
The CROSS JOIN UNNEST(event_dim) AS event (and the cross join following it) needs to come before the WHERE clause. You can read more in the query syntax documentation.
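In other words, the inner query's FROM/WHERE skeleton should be reordered roughly like this (a sketch of just that fragment of your query; note the parentheses around the OR once the suffix filter is merged into the main WHERE):
FROM `table.app_events_*` AS activity
CROSS JOIN UNNEST(event_dim) AS event
CROSS JOIN ( /* the dates subquery, with the same reordering applied inside it */ ) AS dates
CROSS JOIN ( /* the periods subquery, unchanged */ ) AS periods
WHERE (_TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
   OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502')
  AND dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
  AND FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3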

Netezza not supporting sub query and similar... any workaround?

I'm sure this will be a very simple question for most of you, but it is driving me crazy...
I have a table like this (simplifying):
| customer_id | date | purchase amount |
I need to extract, for each day, the number of customers that made a purchase that day, and the number of customers that made at least one purchase in the 30 days prior to that day.
I tried using a subquery like this:
select purch_date as date, count (distinct customer_id) as DAU,
count(distinct (select customer_id from table where purch_date<= date and purch_date>date-30)) as MAU
from table
group by purch_date
Netezza returns an error saying that subqueries are not supported and that I should consider rewriting the query. But how?
I tried using a CASE WHEN statement, but it did not work. In fact, the following:
select purch_date as date, count (distinct customer_id) as DAU,
count(distinct case when (purch_date<= date and purch_date>date-30) then player_id else null end) as MAU
from table
group by purch_date
returned no errors, but the MAU and DAU columns are the same (which is wrong).
Can anybody help me, please? Thanks a lot.
I don't believe Netezza supports subqueries in the select list... move them to the FROM clause:
select purch_date as date, count(distinct customer_id) as DAU
from table
group by purch_date
select purch_date as date, count(distinct customer_ID) as MAU
from table
where purch_date<= date and purch_date>date-30
group by purch_date
I hope that's right for MAU and DAU. Join them to get the combined results:
select a.date, a.dau, b.mau
from
(select purch_date as date, count(distinct customer_id) as DAU
from table
group by purch_date) a
left join
(select purch_date as date, count(distinct customer_ID) as MAU
from table
where purch_date<= date and purch_date>date-30
group by purch_date) b
on b.date = a.date
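If you want the trailing 30-day window computed without any correlated subqueries, another pattern Netezza handles fine is a single range self-join. A sketch against the simplified table from the question ("table" is the question's placeholder name, and the purch_date - 30 arithmetic may need casting depending on your column types):
select d.purch_date,
       -- DAU: customers whose purchase falls on the anchor day itself
       count(distinct case when p.purch_date = d.purch_date then p.customer_id end) as DAU,
       -- MAU: customers with any purchase in the trailing 30 days
       count(distinct p.customer_id) as MAU
from (select distinct purch_date from table) d
join table p
  on p.purch_date <= d.purch_date
 and p.purch_date > d.purch_date - 30
group by d.purch_date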
I finally got it :) For anyone interested, here is how I solved it:
select a.date_dt, max(a.dau), count(distinct b.player_id)
from (select dt.cal_day_dt as date_dt,
count(distinct s.player_id) as dau
FROM IA_PLAYER_SALES_HOURLY s
join IA_DATES dt on dt.date_key = s.date_key
group by dt.cal_day_dt
order by dt.cal_day_dt
) a
join (
select dt.cal_day_dt as date_dt,
s.player_id as player_id
FROM IA_PLAYER_SALES_HOURLY s
join IA_DATES dt on dt.date_key = s.date_key
order by dt.cal_day_dt
) b on b.date_dt <= a.date_dt and b.date_dt > a.date_dt - 30
group by a.date_dt
order by a.date_dt;
Hope this is helpful.