BigQuery: compute average time between two custom events - google-bigquery

I'm attempting to determine the average time between two events in my Firebase analytics data using BigQuery. The table looks something like this:
I'd like to collect the timestamp_micros for the LOGIN_CALL and LOGIN_CALL_OK events, subtract LOGIN_CALL from LOGIN_CALL_OK, and compute the average of this across all rows.
#standardSQL
SELECT AVG(
(SELECT
event.timestamp_micros
FROM
`table`,
UNNEST(event_dim) AS event
where event.name = "LOGIN_CALL_OK") -
(SELECT
event.timestamp_micros
FROM
`table`,
UNNEST(event_dim) AS event
where event.name = "LOGIN_CALL"))
from `table`
I've managed to list either the low or the high numbers, but any time I try to do any math on them I run into errors I'm struggling to pull apart. The approach above seems like it should work, but I get the following error:
Error: Scalar subquery produced more than one element
I read this error to mean that each of the UNNEST() functions is returning an array, and not a single value, which is causing AVG to barf. I've tried to unnest once and apply a "low" and "hi" name to the values, but can't figure out how to filter using event_dim.name correctly.

I couldn't fully test this one, but maybe it will work for you:
WITH data AS(
SELECT STRUCT('1' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1497088800000000), ('20170610', 'LOGIN_CALL', 1498088800000000), ('20170610', 'LOGIN_CALL_OK', 1498888800000000), ('20170610', 'EVENT2', 159788800000000), ('20170610', 'LOGIN_CALL', 1599088800000000), ('20170610', 'LOGIN_CALL_OK', 1608888800000000)] event_dim union all
SELECT STRUCT('2' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1497688500400000), ('20170610', 'LOGIN_CALL', 1497788800000000)] event_dim UNION ALL
SELECT STRUCT('3' as user_id) user_dim, ARRAY< STRUCT<date string, name string, timestamp_micros INT64> > [('20170610', 'EVENT1', 1487688500400000), ('20170610', 'LOGIN_CALL', 1487788845000000), ('20170610', 'LOGIN_CALL_OK', 1498888807700000)] event_dim
)
SELECT
  AVG(time_diff) avg_time_diff
FROM (
  SELECT
    CASE
      WHEN e.name = 'LOGIN_CALL'
        AND LEAD(e.name, 1) OVER (PARTITION BY user_dim.user_id ORDER BY e.timestamp_micros ASC) = 'LOGIN_CALL_OK'
      THEN TIMESTAMP_DIFF(
        TIMESTAMP_MICROS(LEAD(e.timestamp_micros, 1) OVER (PARTITION BY user_dim.user_id ORDER BY e.timestamp_micros ASC)),
        TIMESTAMP_MICROS(e.timestamp_micros),
        DAY)
    END AS time_diff
  FROM data,
  UNNEST(event_dim) e
  WHERE e.name IN ('LOGIN_CALL', 'LOGIN_CALL_OK')
)
I've simulated 3 users with the same schema that you have in the Firebase schema.
Basically, I first applied the UNNEST operation so as to have each value of event_dim.name, and then applied a filter to keep only the events you are interested in, that is, "LOGIN_CALL" and "LOGIN_CALL_OK".
As Mosha commented above, you do need some identification for these rows, as otherwise you won't know which event succeeded which; that's why the partitioning of the analytic functions takes user_dim.user_id as input as well.
After that, it's just TIMESTAMP operations to get the differences when appropriate (when the leading event is "LOGIN_CALL_OK" and the current one is "LOGIN_CALL", take the difference; this is expressed in the CASE expression).
In the TIMESTAMP_DIFF function you can choose which part of the date you want to analyze, such as seconds, minutes, days and so on.
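For instance, with two literal microsecond timestamps (the values here are arbitrary), the same interval can be reported in days or in seconds just by changing the last argument:
SELECT
  TIMESTAMP_DIFF(TIMESTAMP_MICROS(1498888800000000),
                 TIMESTAMP_MICROS(1498088800000000), DAY) AS diff_days,
  TIMESTAMP_DIFF(TIMESTAMP_MICROS(1498888800000000),
                 TIMESTAMP_MICROS(1498088800000000), SECOND) AS diff_seconds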

Related

Populating empty bins in a histogram generated using SQL

In Redshift I can create a histogram – in this case it's binning a column named metric into 100ms buckets
select floor(metric / 100) * 100 as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin
There's a danger that some of the bins might be empty and won't appear in the result set, so I want to use generate_series to create the empty bins e.g.
select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin
and union the two sets of results together to produce the 'full' histogram
select bin, sum(impressions)
from
(
select floor(metric/100)*100 as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin
)
union
(
select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin
)
group by bin
order by bin
The challenge is that calculating the maxMetricValue requires a subquery i.e. select max(metric)… etc and I'd like to avoid that
Is there a way I can calculate the max value from the histogram query and use that instead?
Edit:
Something like this seems along the right lines but Redshift doesn't like it
with histogram as (
select cast(floor(metric/100)*100 as integer) as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin)
select bin, sum(impressions)
from (
select * from histogram
union
select *, 0 as impressions from generate_series(0, (select max(bin) from histogram), 100) as bin
)
group by bin
order by bin
I get this error, but there are no INFO messages visible: ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
If I remove the cast I get: ERROR: function generate_series(integer, double precision, integer) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.
If I try using cast or convert in the parameter for generate_series I get the first error again!
Edit 2:
I presume the above query is failing because Redshift is trying to execute generate_series on a compute node rather than the leader node, but I'm not sure.
First off, generate_series is a leader-node-only function and will throw an error when used in combination with user data. A recursive CTE is the way to do this, but since this isn't what you want I won't get into it.
You could create a numbers table and calculate the min, max and count from the other data you know. You could then outer join on some condition that will never match.
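One way that idea could look (a sketch; it assumes a pre-built numbers table with a single integer column n holding 0, 1, 2, ...; tablename, metric and the epoch filter come from the question):
with bounds as (
  select max(metric) as max_metric
  from tablename
  where epoch > date_part(epoch, dateadd(day, -1, sysdate))
    and metric is not null
)
select n.n * 100 as bin, count(t.metric) as impressions
from numbers n
cross join bounds b
left join tablename t
  on floor(t.metric / 100) = n.n
 and t.epoch > date_part(epoch, dateadd(day, -1, sysdate))
 and t.metric is not null
where n.n * 100 <= b.max_metric
group by 1
order by 1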
However, I expect you will be much better off with the UNION ALL you already have.

Selecting the first and last event per user, per day

I have a Google Analytics event which fires on my website when certain interactions are made; this may or may not fire for a user in a session, or can fire many times.
I'd like to return results showing the userID and the value of the first and last event label, per day. I have tried to do this with MAX(hits.eventInfo.eventLabel), but when I fact-check my results this is not returning the last value for that user in the day as I was expecting.
SELECT Date,
customDimension.value AS UserID,
MAX(hits.eventInfo.eventLabel) AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
AND hits.eventInfo.eventAction = "Value"
AND customDimension.index = 2
GROUP BY Date, UserID
For example, the query above returns results where user X has the following MAX() value:
20180806 User_x 69.96
But when I look at the details of that user's interactions on the day, I see:
Based on this, I would expect to see 79.95 as my MAX() result as it has the highest hit number; instead I seem to have selected a value from somewhere in the middle of the session - how can I adjust my query to ensure I select the last event value?
When you are looking for the maximum value of column colA while doing GROUP BY - obviously MAX(colA) will work.
But when you are looking for the value in column colA based on the maximum value in column colB - you should use STRING_AGG(colA ORDER BY colB DESC LIMIT 1) or something similar using ARRAY_AGG().
So, in your case, I think it will be something like below (you should tune it further):
STRING_AGG(eventInfo.eventLabel ORDER BY hitNumber DESC LIMIT 1) AS last_value
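Plugged into the query from the question, that might look roughly like this (a sketch using ARRAY_AGG; all column names are taken from the question's query and may need further tuning):
SELECT Date,
  customDimension.value AS UserID,
  ARRAY_AGG(hits.eventInfo.eventLabel ORDER BY hits.hitNumber DESC LIMIT 1)[OFFSET(0)] AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(t.hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE parse_date('%y%m%d', _table_suffix) BETWEEN
  DATE_SUB(current_date(), INTERVAL 1 DAY) AND
  DATE_SUB(current_date(), INTERVAL 1 DAY)
  AND hits.eventInfo.eventAction = "Value"
  AND customDimension.index = 2
GROUP BY Date, UserID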
In your case, one should work with subqueries on the hits array. This allows full control over what you want to have. I used the example GA data from Google, so labels are different, but I wrote it in a way you can easily modify to fit your needs:
SELECT
date,
fullvisitorid,
visitstarttime,
(SELECT value FROM t.customDimensions WHERE index=2) userId,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber ASC LIMIT 1
) AS firstEventLabel,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber DESC LIMIT 1
) AS lastEventLabel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
LIMIT 1000 -- for testing
Basically, I'm querying the events, ordering them by hitNumber ascending or descending, and limiting to one to only have one result per row. The line with userId also shows how to properly get a custom dimension value.
If you are very new to this concept of working with arrays you can learn all about it here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
MAX() should work. The one time it would return an unexpected value is if it is operating on a string, not a number.
Does this fix the problem?
MAX(CAST(hits.eventInfo.eventLabel AS FLOAT64)) AS last_value

Select the date of a UserIDs first/most recent purchase

I am working with Google Analytics data in BigQuery, looking to aggregate the date of the last visit and the first visit up to UserID level; however, my code is currently returning the max visit date for that user, so long as they have purchased within the selected date range, because I am using MAX().
If I remove MAX() I have to GROUP by DATE, which I don't want as this then returns multiple rows per UserID.
Here is my code which returns a series of dates per user - last_visit_date is currently working, as it's the only date that can simply look at the last date of user activity. Any advice on how I can get last_ord_date to select the date on which the order actually occurred?
SELECT
customDimension.value AS UserID,
# Last order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MAX(DATE)),
"unknown") AS last_ord_date,
# first visit date
IF(SUM(totals.newvisits) IS NOT NULL,
(MAX(DATE)),
"unknown") AS first_visit_date,
# last visit date
MAX(DATE) AS last_visit_date,
# first order date
IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
(MIN(DATE)),
"unknown") AS first_ord_date
FROM
`XXX.XXX.ga_sessions_20*` AS t
CROSS JOIN
UNNEST (hits) AS hits
CROSS JOIN
UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
UNNEST(hits.product) AS hits_product
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND customDimension.index = 2
AND customDimension.value NOT LIKE "true"
AND customDimension.value NOT LIKE "false"
AND customDimension.value NOT LIKE "undefined"
AND customDimension.value IS NOT NULL
GROUP BY
UserID
The most efficient and clearest way to do this (and also the most portable) is to have a simple table/view that has two columns, userid and last_purchase, and another that has two columns, userid and first_visit.
Then you inner join it with the original raw table on userid and hit timestamp to get, say, the session IDs you're interested in. Three steps, but simple, readable, and easy to maintain.
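A minimal sketch of that idea, written here as a single query with a WITH clause for brevity (in practice you could materialize the first part as the table/view described above); the table name, customDimension filter and transaction check are taken from the question and may need adjusting:
WITH user_last_purchase AS (
  SELECT customDimension.value AS userid, MAX(t.date) AS last_purchase
  FROM `XXX.XXX.ga_sessions_20*` AS t
  CROSS JOIN UNNEST(t.hits) AS hits
  CROSS JOIN UNNEST(t.customdimensions) AS customDimension
  WHERE customDimension.index = 2
    AND hits.transaction.transactionId IS NOT NULL
  GROUP BY userid
)
-- join back to the raw sessions to pull out the rows on each user's last purchase date
SELECT s.date, s.visitStartTime, p.userid
FROM `XXX.XXX.ga_sessions_20*` AS s
CROSS JOIN UNNEST(s.customdimensions) AS cd
JOIN user_last_purchase AS p
  ON cd.value = p.userid AND s.date = p.last_purchase
WHERE cd.index = 2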
It's very easy to hit too much complexity with a query that relies on first or last purchase/action (just look at the UNNEST operations you have there), such that it becomes unusable and you'll spend way too much time trying to figure out the meaning of the output.
Also keep in mind that using the wildcard in the query has a limit of 1000 tables, so your last and first visits are in a rolling window of 1000 days.

Query multiple params in multiple tables with TABLE_DATE_RANGE for Firebase Analytics

From the events I have in the application, I intend to get a stat for the most played audios within an article. In the event I send the articleId and the audioID that has been played.
I want to obtain result rows like this, ordered by number of occurrences:
| ID of the article | ID of the audio | number of occurrences
Since Firebase Analytics exports to BigQuery on a daily basis and I want those events per month, I created a query that takes the values from multiple tables, and mixed it with the info I found in this thread.
The resulting query is:
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(event_dim.name) as Number_Of_Plays
FROM
TABLE_DATE_RANGE([project-id:my_app_id.app_events_], DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'), CURRENT_TIMESTAMP()), UNNEST(event_dim) AS x
WHERE event_dim.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
Unfortunately this query does not parse correctly and gives me an error:
Error: Table name cannot be resolved: dataset name is missing.
I am pretty sure the issue is related to querying multiple tables in a range, but not sure how to fix it. Thanks.
The other answer you reference is using Standard SQL, and you are trying to use TABLE_DATE_RANGE, which is only available in Legacy SQL.
This is the query in Standard SQL that allows you to query multiple tables:
#standardSQL
SELECT
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Article_ID') AS Article_ID,
(SELECT params.value.int_value FROM x.params
WHERE params.key = 'Audio_ID') AS Audio_ID,
COUNT(x.name) as Number_Of_Plays
FROM
`project-id.my_app_id.app_events_*`, UNNEST(event_dim) AS x
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_ADD(current_date(), INTERVAL -30 DAY)) AND FORMAT_DATE('%Y%m%d', current_date())
AND x.name = 'Audio_Play'
GROUP BY Audio_ID, Article_ID
ORDER BY Number_Of_Plays desc
See the FROM clause, `project-id.my_app_id.app_events_*`, and the WHERE _TABLE_SUFFIX BETWEEN syntax line.

BigQuery - counting number of events within a sliding time frame

I would like to count the number of events within a sliding time frame.
For example, say I would like to know how many bids were in the last 1000 seconds for the Google stock (GOOG).
I'm trying the following query:
SELECT
symbol,
start_date,
start_time,
bid_price,
count(if(max(start_time)-start_time<1000,1,null)) over (partition by symbol order by start_time asc) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
The logic is as follows: the partition window (by symbol) is ordered by the bid time (leaving aside the bid date for the sake of simplicity).
For each window (defined by the row at the "head" of the window) I would like to count the number of rows whose start_time is less than 1000 seconds before the "head" row's time.
I'm trying to use max(start_time) to get the top row in the window. This doesn't seem to work and I get an error:
Error: MAX is an analytic function and must be accompanied by an OVER clause.
Is it possible to have two analytic functions in one column (both count and max in this case)? Is there a different solution to the problem presented?
Try using the RANGE clause.
SELECT
symbol,
start_date,
start_time,
bid_price,
count(market_center) over (partition by symbol order by start_time RANGE 1000 PRECEDING) cnt
FROM [bigquery-samples:nasdaq_stock_quotes.quotes]
where symbol = 'GOOG'
order by 2, 3
I used market_center just as a counter; additional fields can be used as well.
Note: the RANGE clause is not documented in the BigQuery Query Reference; however, it's standard SQL window-frame syntax which appears to work in this case.
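In today's standard SQL dialect the same sliding window can be written with an explicit frame clause, for example (a sketch; it assumes start_time is a numeric value expressed in seconds, as the arithmetic in the question implies):
#standardSQL
SELECT
  symbol,
  start_date,
  start_time,
  bid_price,
  COUNT(*) OVER (PARTITION BY symbol ORDER BY start_time RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) AS cnt
FROM `bigquery-samples.nasdaq_stock_quotes.quotes`
WHERE symbol = 'GOOG'
ORDER BY start_date, start_time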