Selecting the first and last event per user, per day

I have a Google Analytics event that fires on my website when certain interactions are made. It may or may not fire for a user in a session, or it can fire many times.
I'd like to return results showing the UserID and the value of the first and last event label, per day. I have tried to do this with MAX(hits.eventInfo.eventLabel), but when I fact-check my results it is not returning the last value for that user in the day as I was expecting.
SELECT Date,
  customDimension.value AS UserID,
  MAX(hits.eventInfo.eventLabel) AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE PARSE_DATE('%y%m%d', _table_suffix) BETWEEN
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits.eventInfo.eventAction = "Value"
  AND customDimension.index = 2
GROUP BY Date, UserID
For example, the query above returns results where user X has the following MAX() value:
20180806 User_x 69.96
But when I look at the details of that user's interactions on the day, I see:
Based on this, I would expect to see 79.95 as my MAX() result, as it has the highest hit number; instead I seem to have selected a value from somewhere in the middle of the session. How can I adjust my query to ensure I select the last event value?

When you are looking for the maximum value of column colA while doing GROUP BY - obviously MAX(colA) will work.
But when you are looking for the value in column colA based on the maximum value in column colB - you should use STRING_AGG(colA ORDER BY colB DESC LIMIT 1) or something similar using ARRAY_AGG().
So, in your case, I think it will be something like below (you should tune it further):
STRING_AGG(hits.eventInfo.eventLabel ORDER BY hits.hitNumber DESC LIMIT 1) AS last_value
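Plugged back into your query, a rough sketch could look as below. Ordering by visitStartTime first is my addition, to break ties when a user has several sessions in the same day - tune as needed:
SELECT Date,
  customDimension.value AS UserID,
  -- first/last label within the day, ordered by session start then hit position
  STRING_AGG(hits.eventInfo.eventLabel
    ORDER BY t.visitStartTime ASC, hits.hitNumber ASC LIMIT 1) AS first_value,
  STRING_AGG(hits.eventInfo.eventLabel
    ORDER BY t.visitStartTime DESC, hits.hitNumber DESC LIMIT 1) AS last_value
FROM `project.dataset.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE PARSE_DATE('%y%m%d', _table_suffix) BETWEEN
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND hits.eventInfo.eventAction = "Value"
  AND customDimension.index = 2
GROUP BY Date, UserID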

In your case I would work with subqueries on the hits array. This gives you full control over what you want to have. I used the example GA data from Google, so the labels are different, but I wrote it in a way you can easily modify to fit your needs:
SELECT
date,
fullvisitorid,
visitstarttime,
(SELECT value FROM t.customDimensions WHERE index=2) userId,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber ASC LIMIT 1
) AS firstEventLabel,
(SELECT
--STRUCT(hour, minute, hitNumber, eventinfo.eventlabel) -- for testing, comment out next line
eventInfo.eventLabel
FROM t.hits
WHERE type='EVENT' AND eventInfo.eventAction <> '' -- modify to fit your condition
ORDER BY hitNumber DESC LIMIT 1
) AS lastEventLabel
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
LIMIT 1000 -- for testing
Basically, I'm querying the events, ordering them by hitNumber ascending or descending, and limiting to one so there is only one result per row. The line with userId also shows how to properly get a custom dimension value.
If you are very new to this concept of working with arrays you can learn all about it here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays

MAX() should work. The one time it would return an unexpected value is if it is operating on a string, not a number.
Does this fix the problem?
MAX(CAST(hits.eventInfo.eventLabel AS FLOAT64)) AS last_value
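To see why, compare string and numeric MAX() on some made-up labels - lexicographically, '9.50' sorts above '100.00':
-- string MAX compares character by character, not numerically
SELECT
  MAX(v) AS string_max,                   -- '9.50'
  MAX(CAST(v AS FLOAT64)) AS numeric_max  -- 100.0
FROM UNNEST(['9.50', '100.00', '69.96']) AS v;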

Related

Grafana, postgresql: aggregate function calls cannot contain window function calls

In Grafana, we want to show bars indicating the maximum of 15-minute averages in the chosen time interval. Our data comes in regular 1-minute intervals. The database is PostgreSQL.
To show the 15-minute averages, we use the following query:
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value,
'15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
ORDER BY time
To show bars indicating the maximum of raw values in the chosen time interval, we use the following query:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(rawvalue) AS value,
'Interval Max' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
A naive combination of both solutions does not work:
SELECT
$__timeGroup(timestamp,'$INTERVAL') AS time,
MAX(AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING)) AS value,
'Interval Max 15-min Average' AS metric
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
GROUP BY $__timeGroup(timestamp,'$INTERVAL')
ORDER BY time
We get the error: "pq: aggregate function calls cannot contain window function calls".
There is a suggestion on SO to use WITH (Count by criteria over partition), but I do not know how to use it in our case.
Use the first query as a CTE (or WITH) for the second one. The ORDER BY clause of the CTE, the WHERE clause of the second query, and the metric column of the CTE are no longer needed. Alternatively you can use the first query as a derived table in the FROM clause of the second one; a sketch of that variant follows the CTE version.
with t as
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
)
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM t
GROUP BY 1 ORDER BY 1;
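The derived-table alternative would look like this (same Grafana macros and placeholders as above):
SELECT
$__timeGroup(time,'$INTERVAL') AS time,
MAX(value) AS value,
'Interval Max 15-min Average' AS metric
FROM
(
SELECT
timestamp AS time,
AVG(rawvalue) OVER(ORDER BY timestamp ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING) AS value
FROM database.schema
WHERE $__timeFilter(timestamp) AND device = '$Device'
) t
GROUP BY 1 ORDER BY 1;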
Unrelated, but what are $__timeFilter and $__timeGroup? Their semantics are clear, but where do they come from? BTW you may find this function useful.

Select the date of a UserID's first/most recent purchase

I am working with Google Analytics data in BigQuery, looking to aggregate the date of last visit and first visit up to UserID level. However, my code currently returns the max visit date for that user, so long as they have purchased within the selected date range, because I am using MAX().
If I remove MAX() I have to GROUP BY DATE, which I don't want, as this then returns multiple rows per UserID.
Here is my code, which returns a series of dates per user. last_visit_date is currently working, as it's the only date that can simply look at the last date of user activity. Any advice on how I can get last_ord_date to select the date on which the order actually occurred?
SELECT
  customDimension.value AS UserID,
  # Last order date
  IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
    MAX(DATE),
    "unknown") AS last_ord_date,
  # First visit date
  IF(SUM(totals.newvisits) IS NOT NULL,
    MAX(DATE),
    "unknown") AS first_visit_date,
  # Last visit date
  MAX(DATE) AS last_visit_date,
  # First order date
  IF(COUNT(DISTINCT hits.transaction.transactionId) > 0,
    MIN(DATE),
    "unknown") AS first_ord_date
FROM
  `XXX.XXX.ga_sessions_20*` AS t
CROSS JOIN
  UNNEST(hits) AS hits
CROSS JOIN
  UNNEST(t.customdimensions) AS customDimension
CROSS JOIN
  UNNEST(hits.product) AS hits_product
WHERE
  PARSE_DATE('%y%m%d', _table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND customDimension.index = 2
  AND customDimension.value NOT LIKE "true"
  AND customDimension.value NOT LIKE "false"
  AND customDimension.value NOT LIKE "undefined"
  AND customDimension.value IS NOT NULL
GROUP BY
  UserID
The most efficient, clearest (and most portable) way to do this is to have a simple table/view with two columns, userid and last_purchase, and another with two columns, userid and first_visit.
Then you inner join them with the original raw table on userid and hit timestamp to get, say, the session IDs you're interested in. Three steps, but simple, readable and easy to maintain; see the sketch below.
It's very easy for a query that relies on first or last purchase/action to hit so much complexity (just look at the unnest operations you have there) that it becomes unusable, and you'll spend way too much time trying to figure out the meaning of the output.
Also keep in mind that using the wildcard in the query has a limit of 1000 tables, so your last and first visits are in a rolling window of 1000 days.
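A minimal sketch of the first helper (userid, last_purchase), assuming - as in your query - that the UserID lives in custom dimension 2, and treating any hit with a transactionId as a purchase:
SELECT
  customDimension.value AS userid,
  -- 'date' is a YYYYMMDD string in the GA export, so MAX() sorts chronologically
  MAX(date) AS last_purchase
FROM `XXX.XXX.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
CROSS JOIN UNNEST(t.customdimensions) AS customDimension
WHERE customDimension.index = 2
  AND hits.transaction.transactionId IS NOT NULL  -- purchase hits only
GROUP BY userid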

Show transactions from a user who saw X page/s in their session

I am working with Google Analytics data in BigQuery.
I'd like to show a list of transaction IDs from users who visited a particular page on the website in their session. I've unnested hits.page.pagePath in order to identify a particular page, but since I don't know which row the actual transaction ID will occur on, I am having trouble returning meaningful results.
My code looks like this, but it returns 0 results: all the transaction IDs are NULL, since they do not occur on the rows where the page path meets the hits.page.pagePath LIKE "%clear-out%" condition:
SELECT hits.transaction.transactionId AS orderid
FROM `xxx.xxx.ga_sessions_20*` AS t
CROSS JOIN UNNEST(hits) AS hits
WHERE PARSE_DATE('%y%m%d', _table_suffix) BETWEEN
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
  DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND totals.transactions > 0
  AND hits.page.pagePath LIKE "%clear-out%"
  AND hits.transaction.transactionId IS NOT NULL
How can I, for example, return the transaction IDs for all sessions in which the user viewed a page where hits.page.pagePath LIKE "%clear-out%"?
When cross joining, you're repeating the whole session for each hit. Use this nested info per hit to look for your page - not the cross-joined hits.
You're unfortunately giving both the same name; it's better to keep them separate. Here's what it could look like:
SELECT
h.transaction.transactionId AS orderId
--,ARRAY( (SELECT AS STRUCT hitnumber, page.pagePath, transaction.transactionId FROM t.hits ) ) AS hitInfos -- test: show all hits in this session
FROM
`google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` AS t
CROSS JOIN t.hits AS h
WHERE
totals.transactions > 0 AND h.transaction.transactionId IS NOT NULL
AND
-- use the repeated hits nest (not the cross joined 'h') to check all pagePaths in the session
(SELECT LOGICAL_OR(page.pagePath LIKE "/helmets/%") FROM t.hits )
LOGICAL_OR() is an aggregation function for OR - so if any hit matches the condition it returns TRUE
(This query uses the openly available GA data from Google. It's a bit old but good to play around with.)
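If you prefer, an EXISTS subquery over the nested hits does the same job as LOGICAL_OR() here (a sketch against the same public dataset):
SELECT
h.transaction.transactionId AS orderId
FROM
`google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` AS t
CROSS JOIN t.hits AS h
WHERE
totals.transactions > 0 AND h.transaction.transactionId IS NOT NULL
-- EXISTS over the repeated hits nest: TRUE as soon as any hit in the session matches
AND EXISTS (SELECT 1 FROM t.hits WHERE page.pagePath LIKE "/helmets/%")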

BigQuery "Response Too Large To Return" when trying to ORDER BY

I have a query that finds all the events on a user's "last" day (i.e., a day after which they do not show up again for 2+ weeks). I want to whittle this down to the last N events they perform before leaving, in order of most recent to least recent.
I created an unordered table without issue, but when I try to ORDER BY timestamp DESC, it then gives me a "Response too large to return" error.
Why do I get this error when trying to sort (no GROUP BYs or anything), but not on the unordered table?
EDITED TO ADD QUERY BELOW
This query gives me the table with events for users who have not shown up in the last 14 days.
SELECT user.user_key as user_key, user.lastTime as lastTime, evt.actiontime as actiontime, evt.actiontype as actiontype, evt.action_parameters.parameter_name as actionParameterName
FROM (
SELECT user_key , MAX(actiontime) AS lastTime, DATE(MAX(actiontime)) as lastDate
FROM [db.action_log]
WHERE DATEDIFF(CURRENT_TIMESTAMP(), actiontime) >= 14
GROUP EACH BY user_key
HAVING DATEDIFF(CURRENT_TIMESTAMP(), lastTime) >= 14) as user
JOIN EACH(
SELECT user_key, actiontime, actiontype, action_parameters.parameter_name, DATE(actiontime) as actionDate
FROM [db.action_log]
WHERE DATEDIFF(CURRENT_TIMESTAMP(), actiontime) >= 14) as evt
ON (user.user_key = evt.user_key) AND (user.lastDate = evt.actionDate)
WHERE actiontime <= lastTime;
And this runs just fine. I want to use GROUP_CONCAT() to turn the actions into a list, but first I need to sort by actiontime (descending) so that the most recent event is the first in the list. But when I run:
SELECT user_key, lastTime, actiontime, user_level, actiontype, actionParameterName
FROM [db.lastActions]
ORDER BY actiontime DESC;
I get "Response too large to return."
BQ sometimes shouts at you when you have too many combinations of GROUP BY, ORDER BY and GROUP_CONCAT() in a single query.
However, I believe a straight GROUP_CONCAT() does not create an ordered list, so doing an ORDER BY before the concat is essentially meaningless in this case. This question should solve your problem though.
For completeness' sake, I have found several useful ways to get around getting shouted at (for whatever reason):
1) Break up your script into smaller segments if possible. This way you have more control if you are playing around with the data.
2) Instead of using ORDER BY and then LIMIT, it will be quicker and may be more useful to use more WHERE clauses (sketched below).
3) Use subselects, wisely. For example, grouping first and then ordering should always be quicker than ordering first.
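As a sketch of point 2 against the asker's table, filter first so the ordered result stays small (the 60-day window and the row cap are placeholders, not values from the question):
SELECT user_key, lastTime, actiontime, user_level, actiontype, actionParameterName
FROM [db.lastActions]
-- narrow the window before sorting so the response stays under the size cap
WHERE DATEDIFF(CURRENT_TIMESTAMP(), actiontime) <= 60
ORDER BY actiontime DESC
LIMIT 100000;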

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming GROUP EACH BY is here not parallelizable.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to get to the same error, and I was able to reproduce it by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the PERCENTILE_RANK() by day, can you add an additional filter to only analyze a fraction of the days (for example, only the last month)? A sketch of such a filter follows.
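Applied to the original query, that extra filter might look like this (legacy SQL, matching the question; the 30-day cutoff is a placeholder):
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
-- placeholder cutoff: keep only the last ~30 days before ranking
AND timestamp >= DATE_ADD(CURRENT_TIMESTAMP(), -30, "DAY")
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;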