LEFT OUTER JOIN error when creating a subquery on BigQuery - SQL

I'm trying to compute MAU, WAU and DAU from an event table in BigQuery.
I wrote a query that finds DAU and then uses it to find WAU and MAU,
but it does not work; I receive this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
Here is my query:
WITH dau AS (
  SELECT
    date,
    COUNT(DISTINCT events.device_id) AS DAU_explorer
  FROM `workspace.event_table` AS events
  GROUP BY 1
)
SELECT
  date,
  dau,
  (SELECT COUNT(DISTINCT device_id)
   FROM `workspace.event_table` AS events
   WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -30 DAY) AND dau.date
  ) AS mau,
  (SELECT COUNT(DISTINCT device_id)
   FROM `workspace.event_table` AS events
   WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -7 DAY) AND dau.date
  ) AS wau
FROM dau
Where is my error? Is it not possible to run subqueries like this on BigQuery?

Try this instead:
#standardSQL
WITH data AS (
  SELECT DATE(creation_date) date, owner_user_id device_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
  , COUNT(DISTINCT IF(i < 31, device_id, NULL)) unique_30_day_users
  , COUNT(DISTINCT IF(i < 8, device_id, NULL)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30
And if you are looking for a more efficient solution, try approximate results.
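For example, a minimal sketch reusing the same data CTE as above, with APPROX_COUNT_DISTINCT (which trades a small, statistically bounded error for much lower memory use) in place of the exact counts:
#standardSQL
WITH data AS (
  -- same CTE as above
  SELECT DATE(creation_date) date, owner_user_id device_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date) = 2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
  , APPROX_COUNT_DISTINCT(IF(i < 31, device_id, NULL)) unique_30_day_users
  , APPROX_COUNT_DISTINCT(IF(i < 8, device_id, NULL)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp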

Related

BigQuery, WAU from DAU

I am trying to calculate WAU from DAU using this link:
Fix the MAU problem while calculating DAU and MAU on Amazon Redshift
But the solution in the link works for Redshift. I am trying to do the same thing in BigQuery, but it's giving me this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join
My code is here:
WITH data AS (
  SELECT
    EXTRACT(date FROM active_time) AS active_date,
    COUNT(DISTINCT user_id) AS DAU
  FROM abc
  GROUP BY active_date
)
SELECT
  active_date,
  DAU,
  (SELECT COUNT(DISTINCT user_id)
   FROM abc
   WHERE EXTRACT(date FROM abc.active_time) BETWEEN DATE_SUB(data.active_date, INTERVAL 7 DAY) AND data.active_date
  ) AS WAU
FROM data
Could someone please help? Thanks in advance.
Here is one way to achieve your results (note: it's kind of an ugly hack to get past the left join equality check in BigQuery, but it does the job):
with daily as (
  -- one row per user per active day, derived from the raw events
  select extract(date from active_time) as active_date, user_id
  from abc
)
select
  a.active_date,
  count(distinct a.user_id) as dau,
  array_length(array_agg(distinct b.user_id)) as wau
from daily a
left join daily b
  -- both sides always evaluate to 3, so this condition is always true,
  -- but it is formally an equality of fields from both sides of the join
  on array_length(split(cast(a.active_date as string), '-')) = array_length(split(cast(b.active_date as string), '-'))
  and b.active_date between date_sub(a.active_date, interval 7 day) and a.active_date
group by 1
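The trick works because split(cast(active_date as string), '-') always produces three elements for a DATE, so both array_length() calls return 3 and the equality is always true; it exists purely to satisfy BigQuery's demand for an equality of fields from both sides of the join, while the BETWEEN predicate does the real filtering. Any always-true equality built from fields on both sides would do, for example (a hypothetical variant):
on length(cast(a.active_date as string)) = length(cast(b.active_date as string))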

CASE AND WHEN SQL

I have transactional data of customers' purchases. I am trying to select customers from the last month and calculate recency as the average gap between their purchases (AVG(gap)).
SELECT
customer_id,
(
CASE WHEN day::DATE<= '2015-05-01'::DATE AND day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
THEN
(
SELECT
AVG(gap)
FROM
(
SELECT
customer_id,
( day- LAG(day) OVER ( PARTITION BY customer_id ORDER BY day ) ) AS gap
FROM
baskets
JOIN
basket_lines
USING
( basket_id )
GROUP BY 1
) a
) b
ELSE 0
) AS A
FROM
baskets
JOIN
basket_lines
USING
(basket_id)
GROUP BY
1;
However, I get an error like:
ERROR: syntax error at or near "b"
LINE 45: GROUP BY 1)a)b ELSE 0) AS A
^
Does this mean I cannot use a subquery after THEN?
A subquery in the THEN clause does not take an alias. Also, you must end your CASE expression with END:
SELECT customer_id,
       (CASE WHEN day::DATE <= '2015-05-01'::DATE AND
                  day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
             THEN (SELECT AVG(gap)
                   FROM (SELECT customer_id,
                                (day - LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) AS gap
                         FROM baskets
                         JOIN basket_lines USING (basket_id)
                         GROUP BY 1
                        ) a)
             ELSE 0
        END) AS A
FROM baskets
JOIN basket_lines USING (basket_id)
GROUP BY 1;
But you have a scalar subquery in your SELECT statement. This is probably not optimal, and we can likely rewrite your query using a join.
I propose the following refactor:
WITH cte AS (
  SELECT customer_id,
         (day - LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) AS gap
  FROM baskets
  INNER JOIN basket_lines USING (basket_id)
  WHERE day::DATE <= '2015-05-01'::DATE AND
        day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
)
SELECT customer_id,
       AVG(gap) AS cust_avg
FROM cte
GROUP BY customer_id;

SQL - Unequal left join BigQuery

New here. I am trying to get the daily and weekly active users over time; users have 30 days without activity before they are considered inactive. My goal is to create graphs that can be split by user_id to show cohorts, regions, categories, etc.
I have created a date table to get every day for the time period, and I have a simplified orders table with the base info that I need to calculate this.
I am trying to do a left join to get the status by date, using the following SQL query:
WITH daily_use AS (
SELECT
__key__.id AS user_id
, DATE_TRUNC(date(placeOrderDate), day) AS activity_date
FROM `analysis.Order`
where isBuyingGroupOrder = TRUE
AND testOrder = FALSE
GROUP BY 1, 2
),
dates AS (
SELECT DATE_ADD(DATE "2016-01-01", INTERVAL d.d DAY) AS date
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY __key__.id) -1 AS d
FROM `analysis.Order`
ORDER BY __key__.id
LIMIT 1096
) AS d
ORDER BY 1 DESC
)
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
LEFT JOIN daily_use
ON wd.date >= daily_use.activity_date
AND wd.date < DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
I am getting this error in BigQuery: "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join", and I was wondering how I can work around it. I am using standard SQL within BigQuery.
Thank you
Below is for BigQuery Standard SQL and mostly reproduces the logic in your query, except that it does not include days where no activity at all is found:
#standardSQL
SELECT
daily_use.user_id
, wd.date AS DATE
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
-- ORDER BY 1,2
If for whatever reason you still need to reproduce your logic exactly, you can wrap the above in a final left join, as below:
#standardSQL
SELECT *
FROM dates AS wd
LEFT JOIN (
SELECT
daily_use.user_id
, wd.date AS date
, MIN(DATE_DIFF(wd.date, daily_use.activity_date, DAY)) AS days_since_last_action
FROM dates AS wd
CROSS JOIN daily_use
WHERE wd.date BETWEEN
daily_use.activity_date AND DATE_ADD(daily_use.activity_date, INTERVAL 30 DAY)
GROUP BY 1,2
) AS daily_use
USING (date)
-- ORDER BY 1,2

How to limit datasets using _table_suffix on complex query?

I understand how _TABLE_SUFFIX works and have successfully used it before on simpler queries. I'm currently trying to build an application that gets active users from 100+ datasets, but I have been running into resource limits. To get around these limits, I'm going to loop and run the query multiple times, limiting how much it selects at once using _TABLE_SUFFIX.
Here is my current query:
WITH allTables AS (SELECT
app,
date,
SUM(CASE WHEN period = 30 THEN users END) as days_30
FROM (
SELECT
CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
dates.date as date,
periods.period as period,
COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.app_events_*` as activity
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) AS event
CROSS JOIN (
SELECT DISTINCT
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC') as date
FROM `table.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) as event) as dates
CROSS JOIN (
SELECT
period
FROM (
SELECT 30 as period
)
) as periods
WHERE
dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
AND
FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3
)
GROUP BY 1,2)
SELECT
app as target,
UNIX_SECONDS(date) as datapoint_time,
SUM(days_30) as datapoint_value
FROM allTables
WHERE date >= TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP, Day, 'UTC'), INTERVAL -30 DAY)
GROUP BY date,1
ORDER BY date ASC
This currently gives me:
Error: Syntax error: Expected ")" but got keyword CROSS at [14:3]
So my question is, how can I limit the amount of data I pull in using this query and _TABLE_SUFFIX? I feel like I'm missing something very simple here. Any help would be great, thanks!
The CROSS JOIN UNNEST(event_dim) AS event (and the cross join following it) needs to come before the WHERE clause. You can read more in the query syntax documentation.
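For illustration, a sketch of the reordered inner query (the dates and periods subqueries from your question are elided here; the dates subquery needs the same reordering internally). Note that once the _TABLE_SUFFIX filters move into the shared WHERE clause, the OR between the two suffix ranges needs its own parentheses so it combines correctly with the date conditions:
SELECT
  CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
  dates.date as date,
  periods.period as period,
  COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.app_events_*` as activity
CROSS JOIN UNNEST(event_dim) AS event
CROSS JOIN ( ... ) as dates
CROSS JOIN ( ... ) as periods
WHERE
  (_TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
   OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502')
  AND dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
  AND FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1, 2, 3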

grouping by column but getting multiple results for each

I am trying to calculate the median response time for conversations on each date for the last X days.
I use the query below, but for some reason it generates multiple rows with the same date.
with grouping as (
SELECT a.id, d.date, extract(epoch from (first_response_at - started_at)) as response_time
FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 2) AS offs
) d
LEFT OUTER JOIN apps a on true
LEFT OUTER JOIN conversations c ON (d.date=to_char(date_trunc('day'::varchar, c.started_at), 'YYYY-MM-DD')) and a.id = c.app_id
and c.app_id = a.id and c.first_response_at > (current_date - (2 || ' days')::interval)::date
)
select
*
from grouping
where grouping.id = 'ASnYW1-RgCl0I'
Any ideas?
First, a number of issues with your query, assuming there aren't any parts you haven't shown us:
You don't need a CTE for this query.
From table apps you only use column id whose value is the same as c.app_id. You can remove the table apps and select c.app_id for the same result.
When you use to_char() you do not first have to date_trunc() to a date; the to_char() function handles that.
generate_series() also works with timestamps. Just enter day values with an interval and cast the end result to date before using it.
So, removing all the flotsam, we end up with the query below, which does exactly the same as the one in your question, but now we can at least see what is going on: every conversation matching a date produces its own row (and the unconditioned join to apps multiplies rows further), which is why you get multiple rows with the same date.
SELECT c.app_id, d.date::date AS date,
       extract(epoch from (first_response_at - started_at)) AS response_time
FROM generate_series(CURRENT_DATE - 2, CURRENT_DATE, interval '1 day') d(date)
LEFT JOIN conversations c ON d.date::date = c.started_at::date
                          AND c.app_id = 'ASnYW1-RgCl0I'
                          AND c.first_response_at > CURRENT_DATE - 2;
You don't calculate the median response time anywhere, so that is a big problem you need to solve. This only requires data from table conversations and would look somewhat like this to calculate the median response time for the past 2 days:
SELECT app_id, started_at::date AS start_date,
       percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
FROM conversations
WHERE app_id = 'ASnYW1-RgCl0I'
  AND first_response_at > CURRENT_DATE - 2
GROUP BY 1, 2;
When we fold the two queries and put the parameters handily in a single place, this is the final result:
SELECT p.id, d.date::date AS date,
       extract(epoch from (c.median_response)) AS response_time
FROM (VALUES ('ASnYW1-RgCl0I', 2)) p(id, days)
JOIN generate_series(CURRENT_DATE - p.days, CURRENT_DATE, interval '1 day') d(date) ON true
LEFT JOIN LATERAL (
    SELECT started_at::date AS start_date,
           percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
    FROM conversations
    WHERE app_id = p.id
      AND first_response_at > CURRENT_DATE - p.days
    GROUP BY 1
) c ON d.date::date = c.start_date;
If you want to change the id of the app or the number of days to look back, you only have to change the VALUES clause accordingly. You can also wrap the whole thing in a SQL function and convert the VALUES clause into two parameters.
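A minimal sketch of such a function (the name median_response_times is hypothetical; the cast on extract() keeps the declared return type stable across PostgreSQL versions):
CREATE FUNCTION median_response_times(_app_id text, _days int)
RETURNS TABLE (start_date date, response_time double precision)
LANGUAGE sql STABLE AS $$
  SELECT d.date::date,
         -- NULL on days without conversations, thanks to the left join
         extract(epoch from c.median_response)::double precision
  FROM generate_series(CURRENT_DATE - _days, CURRENT_DATE, interval '1 day') d(date)
  LEFT JOIN LATERAL (
    SELECT started_at::date AS start_date,
           percentile_disc(0.5) WITHIN GROUP (ORDER BY first_response_at - started_at) AS median_response
    FROM conversations
    WHERE app_id = _app_id
      AND first_response_at > CURRENT_DATE - _days
    GROUP BY 1
  ) c ON d.date::date = c.start_date;
$$;
-- usage:
SELECT * FROM median_response_times('ASnYW1-RgCl0I', 2);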