How to limit datasets using _TABLE_SUFFIX on a complex query?

I understand how _TABLE_SUFFIX works and have successfully used it before on simpler queries. I'm currently trying to build an application that will get active users from 100+ datasets but have been running into resource limits. In order to bypass these resource limits I'm going to loop and run the query multiple times and limit how much it selects at once using _TABLE_SUFFIX.
Here is my current query:
WITH allTables AS (SELECT
app,
date,
SUM(CASE WHEN period = 30 THEN users END) as days_30
FROM (
SELECT
CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
dates.date as date,
periods.period as period,
COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.app_events_*` as activity
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) AS event
CROSS JOIN (
SELECT DISTINCT
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC') as date
FROM `table.app_events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502'
CROSS JOIN
UNNEST(event_dim) as event) as dates
CROSS JOIN (
SELECT
period
FROM (
SELECT 30 as period
)
) as periods
WHERE
dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
AND
FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3
)
GROUP BY 1,2)
SELECT
app as target,
UNIX_SECONDS(date) as datapoint_time,
SUM(days_30) as datapoint_value
FROM allTables
WHERE date >= TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP, Day, 'UTC'), INTERVAL -30 DAY)
GROUP BY date,1
ORDER BY date ASC
This currently gives me:
Error: Syntax error: Expected ")" but got keyword CROSS at [14:3]
So my question is, how can I limit the amount of data I pull in using this query and _TABLE_SUFFIX? I feel like I'm missing something very simple here. Any help would be great, thanks!

The CROSS JOIN UNNEST(event_dim) AS event (and the cross join following it) needs to come before the WHERE clause. You can read more in the query syntax documentation.
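In other words, the inner select ends up shaped roughly like this (a minimal sketch only; the periods sub-select is condensed for brevity, and note that the OR'd suffix conditions need parentheses once they share a WHERE clause with the other filters):
SELECT
CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
dates.date as date,
periods.period as period,
COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.app_events_*` as activity
CROSS JOIN
UNNEST(event_dim) AS event
CROSS JOIN (
SELECT DISTINCT
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC') as date
FROM `table.app_events_*`
CROSS JOIN
UNNEST(event_dim) as event
-- the sub-select's own suffix filter also moves below its CROSS JOIN
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502') as dates
CROSS JOIN (
SELECT 30 as period) as periods
WHERE
-- suffix filter now sits in the same WHERE clause as the other conditions
(_TABLE_SUFFIX BETWEEN '20170101' AND '20170502'
OR _TABLE_SUFFIX BETWEEN 'intraday_20170101' AND 'intraday_20170502')
AND dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
AND FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3
The surrounding allTables CTE and the outer select stay as you already have them.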

Related

Count of active user sessions per hour

For each user login to our website, we insert a record into our user_session table with the user's login and logout timestamps. If I wanted to produce a graph of the number of logins per hour over time, it would be easy with the following SQL.
SELECT
date_trunc('hour',login_time) AS "time",
count(*)
FROM user_session
group by time
order by time
Time would be the X-axis and count would be the Y-axis.
But what I really need is the number of active sessions in each hour where "active" means
login_time <= foo and logout_time >= foo where foo is the particular time slot.
How can I do this in a single SELECT statement?
One brute force method generates the hours and then uses a lateral join or correlated subquery to do the calculation:
select gs.ts, us.num_active
from generate_series('2021-03-21'::timestamp, '2021-03-22'::timestamp, interval '1 hour') gs(ts) left join lateral
(select count(*) as num_active
from user_session us
where us.login_time <= gs.ts and
us.logout_time > gs.ts
) us
on 1=1;
A more efficient method -- particularly for longer periods of time -- is to pivot the times and keep an incremental count of ins and outs:
with cte as (
select date_trunc('hour', login_time) as hh, count(*) as inc
from user_session
group by hh
union all
select date_trunc('hour', logout_time + interval '1 hour') as hh, - count(*) as inc
from user_session
group by hh
)
select hh, sum(inc) as net_in_hour,
sum(sum(inc)) over (order by hh) as active_in_hour
from cte
group by hh;

PostgreSQL - generating an hourly list

I have an API that counts events from a table and groups them by the hour of day and severity, which I use to draw a graph. This is my current query:
SELECT
extract(hour FROM time) AS hours,
alarm. "severity",
COUNT(*)
FROM
alarm
WHERE
date = '2019-06-12'
GROUP BY
extract(hour FROM time),
alarm."severity"
ORDER BY
extract(hour FROM time),
alarm."severity"
What I really want to do is get a list of hours from 00 to 24 with the corresponding event counts, and 0 if there are no events in that hour. Is there a way to make Postgres generate such a structure?
Use generate_series() to generate the hours and a cross join for the severities:
SELECT gs.h, s.severity, COUNT(a.time)
FROM GENERATE_SERIES(0, 23, 1) gs(h) CROSS JOIN
(SELECT DISTINCT a.severity FROM alarm
) s LEFT JOIN
alarm a
ON extract(hour FROM a.time) = gs.h AND
a.severity = s.severity AND
a.date = '2019-06-12'
GROUP BY gs.h, s.severity
ORDER BY gs.h, s.severity;

LEFT OUTER JOIN Error creating a subquery on bigquery

I'm trying to evaluate MAU, WAU and DAU from an event table in my BigQuery project...
I created a query to find DAU and, from it, to find WAU and MAU,
but it does not work; I received this error:
LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
It's my query
WITH dau AS (
SELECT
date,
COUNT(DISTINCT(events.device_id)) as DAU_explorer
FROM `workspace.event_table` as events
GROUP BY 1
)
SELECT
date,
dau,
(SELECT
COUNT(DISTINCT(device_id))
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -30 DAY) AND dau.date
) AS mau,
(SELECT
COUNT(DISTINCT(device_id)) as DAU_explorer
FROM `workspace.event_table` as events
WHERE events.date BETWEEN DATE_ADD(dau.date, INTERVAL -7 DAY) AND dau.date
) AS wau
FROM dau
Where is my error? Is it not possible to run subqueries like this on BigQuery?
Try this instead:
#standardSQL
WITH data AS (
SELECT DATE(creation_date) date, owner_user_id device_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT IF(i<31,device_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,device_id,null)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30
And if you are looking for a more efficient solution, try approximate results.
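For instance, a rough sketch of the same query with APPROX_COUNT_DISTINCT swapped in (assuming approximate uniques are acceptable for your use case; the counts become statistical estimates rather than exact values):
#standardSQL
WITH data AS (
SELECT DATE(creation_date) date, owner_user_id device_id
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
)
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
-- approximate distinct counts are much cheaper than exact COUNT(DISTINCT ...) at scale
, APPROX_COUNT_DISTINCT(IF(i<31,device_id,null)) unique_30_day_users
, APPROX_COUNT_DISTINCT(IF(i<8,device_id,null)) unique_7_day_users
FROM `data`, UNNEST(GENERATE_ARRAY(1, 30)) i
GROUP BY 1
ORDER BY date_grp
LIMIT 100
OFFSET 30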

Billing tier issues with 30 day active user query within bigquery

Is there a way, using BigQuery, that I can run this query and not have to use such a huge billing tier? It ranges anywhere from 11 to 20 on the billing tier. Is my only option to crank up the billing tier and let the charges flow?
WITH allTables AS (SELECT
app,
date,
SUM(CASE WHEN period = 1 THEN users END) as days_1
FROM (
SELECT
CONCAT(user_dim.app_info.app_id, ':', user_dim.app_info.app_platform) as app,
dates.date as date,
periods.period as period,
COUNT(DISTINCT user_dim.app_info.app_instance_id) as users
FROM `table.*` as activity
CROSS JOIN
UNNEST(event_dim) AS event
CROSS JOIN (
SELECT DISTINCT
TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC') as date
FROM `table.*`
CROSS JOIN
UNNEST(event_dim) as event) as dates
CROSS JOIN (
SELECT
period
FROM (
SELECT 1 as period
)
) as periods
WHERE
dates.date >= TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event.timestamp_micros), DAY, 'UTC')
AND
FLOOR(TIMESTAMP_DIFF(dates.date, TIMESTAMP_MICROS(event.timestamp_micros), DAY)/periods.period) = 0
GROUP BY 1,2,3
)
GROUP BY 1,2)
SELECT
app as target,
UNIX_SECONDS(date) as datapoint_time,
SUM(days_1) as datapoint_value
FROM allTables
WHERE
date >= TIMESTAMP_ADD(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP, Day, 'UTC'), INTERVAL -1 DAY)
GROUP BY date,1
ORDER BY date ASC

BigQuery Tier 20 or higher required

I'm attempting to run the following query within BigQuery:
SELECT
FORMAT_TIMESTAMP('%Y-%m-%d', TIMESTAMP_MICROS(date)) as target,
SUM(CASE WHEN period = 7 THEN users END) as days_07,
SUM(CASE WHEN period = 14 THEN users END) as days_14,
SUM(CASE WHEN period = 30 THEN users END) as days_30
FROM (
SELECT
activity.date as date,
periods.period as period,
COUNT(DISTINCT user) as users
FROM (
SELECT
event.timestamp_micros as date,
user_dim.app_info.app_instance_id as user
FROM `table.*`
CROSS JOIN
UNNEST(event_dim) as event
) as activity
CROSS JOIN (
SELECT
event.timestamp_micros as date
FROM `table.*`
CROSS JOIN
UNNEST(event_dim) as event
GROUP BY event.timestamp_micros
) as dates
CROSS JOIN (
SELECT period
FROM
(
SELECT 7 as period
UNION ALL
SELECT 14 as period
UNION ALL
SELECT 30 as period
)
) as periods
WHERE
dates.date >= activity.date
AND
SAFE_CAST(FLOOR(TIMESTAMP_DIFF(TIMESTAMP_MICROS(dates.date), TIMESTAMP_MICROS(activity.date), DAY)/periods.period) AS INT64) = 0
GROUP BY 1,2
)
GROUP BY date
ORDER BY date DESC
It works and will select the active users for specific time frames if I run it on a single table, but within my actual application I'm going to be running this on all my datasets (40+). When I attempt to run it on a single dataset with all of its tables (dataset.*), I get this error:
Query exceeded resource limits for tier 1. Tier 20 or higher required.
I'm unsure what I can do now. I'm thinking that I might have to end up moving this to code instead of SQL for performance's sake.
I think I see the reason this query is CPU-expensive and gets "promoted" to that high billing tier.
The reason is that the sub-selects dates and activity have a huge number of rows, because each row represents a timestamp in microseconds, so no pre-grouping is happening at all.
So, I recommend transforming this:
FROM (
SELECT
event.timestamp_micros as date,
user_dim.app_info.app_instance_id as user
FROM `table.*`
CROSS JOIN
UNNEST(event_dim) as event
) as activity
into
FROM (
SELECT DISTINCT
DATE(TIMESTAMP_MICROS(event.timestamp_micros)) AS DATE,
user_dim.app_info.app_instance_id AS user
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
CROSS JOIN UNNEST(event_dim) AS event
) AS activity
and, respectively, this:
CROSS JOIN (
SELECT
event.timestamp_micros as date
FROM `table.*`
CROSS JOIN
UNNEST(event_dim) as event
GROUP BY event.timestamp_micros
) as dates
into
CROSS JOIN (
SELECT DATE(TIMESTAMP_MICROS(event.timestamp_micros)) AS DATE
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
CROSS JOIN UNNEST(event_dim) AS event
GROUP BY 1
) AS dates
The above change will make the number of rows much lower, so the CROSS JOIN will not be that expensive.
Of course, you then need to modify the other pieces of your query accordingly, to accommodate the fact that the date fields are now of DATE type and not microseconds anymore.
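For example (only a sketch of one such modification, applied to the date filter near the end of the query, and assuming both sub-selects were rewritten as above so that dates.date and activity.date are now DATEs):
WHERE
dates.date >= activity.date
-- DATE_DIFF on DATE values replaces TIMESTAMP_DIFF(TIMESTAMP_MICROS(...), ...)
AND SAFE_CAST(FLOOR(DATE_DIFF(dates.date, activity.date, DAY)/periods.period) AS INT64) = 0
Similarly, the outer FORMAT_TIMESTAMP('%Y-%m-%d', TIMESTAMP_MICROS(date)) would become FORMAT_DATE('%Y-%m-%d', date) once date is a DATE.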
Hope this helps!