How to see the whole session path in BigQuery? - sql

I'm learning standard SQL in BigQuery and I have a task where I have to show what users did after entering checkout - which specific URLs they visited. I figured out something like this, but it only shows me one previous step, and I need to see at least 5 of them. Is this possible? Thank you
WITH path_and_prev AS (
  SELECT ARRAY(
    SELECT AS STRUCT hits.page.pagePath
      , LAG(hits.page.pagePath) OVER(ORDER BY i) prevPagePath
    FROM UNNEST(hits) hits WITH OFFSET i
  ) x
  FROM `xxxx.ga_sessions_20160801`
)
SELECT COUNT(*) AS cnt, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE REGEXP_CONTAINS(pagePath, r'(checkout/cart)')
GROUP BY 2, 3
ORDER BY cnt DESC

Here is the official GA schema for the BQ export:
https://support.google.com/analytics/answer/3437719?hl=en
(Just a tip: feel free to export it to a sheet (Excel, Google, or whatever) and indent it decently to ease understanding of the nesting :) )
The only way to safely reconstruct session behaviour is to use hits.hitNumber. Since pagePath sits under page, which sits under hits, hitNumber will always be populated :)
It's up to you whether to filter on filled pagePath only, while still displaying the hitNumber value.
Tell me if the solution matches your issue, or correct me :)
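To illustrate, here is a minimal sketch of the hitNumber approach applied to the original question, assuming the standard export schema and keeping the placeholder table name from the question: it orders the hits of each session by hitNumber and pulls the five pages seen before every checkout/cart pageview.
SELECT *
FROM (
  SELECT
    fullVisitorId,
    visitId,
    h.hitNumber,
    h.page.pagePath AS pagePath,
    -- each LAG offset reaches one step further back within the same session
    LAG(h.page.pagePath, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber) AS prev1,
    LAG(h.page.pagePath, 2) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber) AS prev2,
    LAG(h.page.pagePath, 3) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber) AS prev3,
    LAG(h.page.pagePath, 4) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber) AS prev4,
    LAG(h.page.pagePath, 5) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber) AS prev5
  FROM `xxxx.ga_sessions_20160801`,
    UNNEST(hits) AS h
  WHERE h.type = 'PAGE'  -- only pageviews, so events don't show up as steps
)
WHERE REGEXP_CONTAINS(pagePath, r'checkout/cart')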

Related

BigQuery Session & Hit level understanding

I want to ask about your knowledge regarding the concept of Events:
Hit level
Session level
How can I map this logic in BigQuery (standard SQL), and also:
Sessions
Events per Session
Unique Events
Please can somebody guide me to understand these concepts?
Sometimes totals.visits is taken as the session, and sometimes visitId is taken as the session.
To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that here. A session is a collection of hits, and a hit is one of the following: a pageview, an event, a social interaction, or a transaction.
Now, to see how that is represented in the BQ schema, you can look here. visitId and visitorId will help you define a session (as opposed to a user).
Then you can count the hits (totals.hits) that are events of the type you want.
It could look something like:
SELECT
  visitId,
  SUM(CASE WHEN hits.type = "EVENT" THEN 1 ELSE 0 END) AS event_hits
FROM `dataset.table_*`, UNNEST(hits) AS hits
GROUP BY 1
That should work to get you an overview. If you need to slice and dice the event details (i.e. hits.eventInfo.*), then I suggest you make one query for all the visitIds and one for all the relevant events and their respective visitIds.
I hope that works!
Cheers
You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically treat them as tables. I'd recommend studying Working with Arrays and applying/transferring every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
  fullvisitorid,
  visitStartTime,
  (SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) AS events,
  (SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory, eventInfo.eventAction, eventInfo.eventLabel))
     FROM UNNEST(hits) WHERE type='EVENT') AS uniqueEvents,
  (SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
  totals.visits=1
LIMIT
  1000
Example for Flattening to hit level
There's also the possibility of using fields in arrays for grouping if you cross join the arrays with their parent rows:
SELECT
  h.type,
  COUNT(1) AS hits
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t
  CROSS JOIN t.hits AS h
WHERE
  totals.visits=1
GROUP BY
  1
Regarding the relation between visitId and Sessions you can read this answer.
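As a small aside (this query is my illustration against the public sample dataset, not part of the original answer): since visitId alone is not unique across users, a common convention is to build a session identifier from fullVisitorId plus visitId.
-- Sketch: one row per valid session, with an explicit session identifier
SELECT
  CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS session_id,
  totals.hits AS hits_in_session
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
  totals.visits=1
LIMIT 10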

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session"-scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId (example screenshot omitted).
Grouping the fields doesn't help either. In GA, I've compared the number of users against the number of users split by device, and the user counts differ.
This explains what's going on: a user gets grouped under multiple devices. My conclusion is that at some point during the user's session their browser user-agent changes, and in the subsequent hit a new device type is set in GA.
I'd have hoped GA would use either the first or the last value to avoid this scenario, but apparently it doesn't. So, accepting this as a "flaw" in GA, I'd rather just pick one value. What's the best way to select the last or first device value from the below query?
SELECT
  fullVisitorId,
  visitId,
  device.deviceCategory
FROM (
  SELECT *
  FROM `project.dataset.ga_sessions_*` desktop
  UNION ALL
  SELECT *
  FROM `project.dataset.ga_sessions_*` mobile
) table
I've tried doing a sub-select and using STRING_AGG(), attempting to order by hits.time and limiting to one value, and that still produces another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
Below were screenshots of queries around this particular fullVisitorId and visitId, with the UNION removed: first as-is, then with visitStartTime added, then with visitStartTime and hits.time added (screenshots not reproduced here).
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId and visitId, and use MAX or MIN of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Similar, but if the deviceCategory result isn't constrained to just the valid deviceCategory members, you can use a CASE to check whether MAX(deviceCategory) = MIN(deviceCategory) and, if they differ, return 'Multiple Devices'.
3 - You could go further: count the number of different devices used, construct a concatenation that lists them in some way, etc. (a sketch follows at the end of this answer).
I'm going to write up number 2 for you. In your question you have 2 different queries, one with [date] and one without - I'll provide both.
Without [date]:
SELECT
  fullVisitorId,
  visitId,
  CASE WHEN MAX(device.deviceCategory) = MIN(device.deviceCategory)
       THEN MAX(device.deviceCategory)
       ELSE 'Multiple Devices'
  END AS deviceCategory,
  {metric aggregations here}
FROM (
  SELECT *
  FROM `project.dataset.ga_sessions_*` desktop
  UNION ALL
  SELECT *
  FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With [date]:
SELECT
  [date],
  fullVisitorId,
  visitId,
  CASE WHEN MAX(device.deviceCategory) = MIN(device.deviceCategory)
       THEN MAX(device.deviceCategory)
       ELSE 'Multiple Devices'
  END AS deviceCategory,
  {metric aggregations here}
FROM (
  SELECT *
  FROM `project.dataset.ga_sessions_*` desktop
  UNION ALL
  SELECT *
  FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY [date], fullVisitorId, visitId
I'm assuming here that the SELECTs and UNION you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs; otherwise you will still be double-counting.
I hope this helps.
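For completeness, a minimal sketch of option 3 (same placeholder tables as above; the aggregation choice is my illustration) that counts and lists the distinct devices per session:
-- Sketch for option 3: how many and which devices each session reported
SELECT
  fullVisitorId,
  visitId,
  COUNT(DISTINCT device.deviceCategory) AS device_count,
  STRING_AGG(DISTINCT device.deviceCategory ORDER BY device.deviceCategory) AS devices
FROM (
  SELECT *
  FROM `project.dataset.ga_sessions_*` desktop
  UNION ALL
  SELECT *
  FROM `project.dataset.ga_sessions_*` mobile
) t
GROUP BY fullVisitorId, visitId
HAVING device_count > 1  -- keep only the sessions that reported more than one device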
It's simply not possible to have two values for one row in this field - a row can only contain one value, so you must be looking at two different rows.
There are 2 possibilities:
You're actually querying two separate datasets / two different views - that's not clearly visible from the example code. The client id (= fullVisitorId) is only unique per property (the tracking id, the UA-xxxxx stuff). If you query two views from different properties, you have to expect to see the same ids twice.
Given that they come from one property, the two rows could actually be one session split at midnight, which means visitId stays the same but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime... that would be curious.
Try adding visitStartTime and see what happens.
If you're querying two different properties, use a user id to combine the sessions, or separate them by adding a constant - you can't simply combine them.
SELECT 'property_A' AS constant FROM ...
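A slightly fuller sketch of that idea (the dataset names are hypothetical):
-- Tag each property with a constant so identical client ids stay distinguishable
SELECT 'property_A' AS property, fullVisitorId, visitId
FROM `project.datasetA.ga_sessions_*`
UNION ALL
SELECT 'property_B' AS property, fullVisitorId, visitId
FROM `project.datasetB.ga_sessions_*`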
hth

Getting flattened hit data from Google Analytics in Big Query over a range of tables

The following is a query to flatten "hit" results piped to BigQuery from Google Analytics:
SELECT
*
FROM flatten(flatten(flatten(flatten(flatten(flatten(flatten(flatten(flatten([ga_sessions_20171116], hits), hits.product), hits.product.customDimensions), hits.product.customMetrics), hits.promotion), hits.experiment), hits.customDimensions), hits.customVariables), hits.customMetrics)
Limit 20
How would one do the same across a range of tables, or is that even possible? I tried:
SELECT
*
FROM flatten(flatten(flatten(flatten(flatten(flatten(flatten(flatten(flatten([ga_sessions_2017111*], hits), hits.product), hits.product.customDimensions), hits.product.customMetrics), hits.promotion), hits.experiment), hits.customDimensions), hits.customVariables), hits.customMetrics)
WHERE _TABLE_SUFFIX BETWEEN '0' and '10'
Limit 20
But it didn't work. Does anyone know how one would do this?
Working with nested data in Standard SQL is much easier than it is in Legacy SQL (because of the query syntax and the predictability of its behavior).
That being said, do consider using it. Your query in Standard SQL becomes something like:
SELECT
fullvisitorid visitor_id,
prods.productSku sku,
custd.index index
FROM `project_id.dataset_id.ga_sessions_*`,
UNNEST(hits) hits,
UNNEST(hits.product) prods,
UNNEST(prods.customDimensions) custd
WHERE _TABLE_SUFFIX BETWEEN '20171110' and '20171111'
LIMIT 1000
This is just an example but hopefully it's enough to understand the concept.
hits is a repeated field of structs so it's something like:
hits = [{'hitNumber': 1, 'product': [{'productSku': 'sku0'}]}, {'hitNumber': 2}, ...]
When you apply the unnest(hits) AS unnested_hits operation it becomes:
unnested_hits = {'hitNumber': 1, 'product': [{'productSku': 'sku0'}]},
{'hitNumber': 2}
...
So if you call it "unnested_hits", referencing that alias gives you the flattened data. You can keep going, e.g. unnesting the field product inside unnested_hits.
For a deeper understanding of these concepts, make sure to read through the docs, they are quite well written and you can learn pretty much everything you'll need to work effectively in BigQuery.
As a final note, you are selecting all fields from GA. As the old saying goes, every time someone runs a query of the type "SELECT * FROM table", a panda somewhere in the world dies.
You must be very careful running this type of query in BQ, as you'll get billed by the volume of data processed; make sure you bring in only what is absolutely necessary.

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so it selects a massive quantity of data - just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way, and I gather that I'll use up all of my company's data quota fairly quickly if I keep using it.
Here's the offending query in full:
WITH
s1 AS (
  SELECT
    fullVisitorId,
    visitId,
    LOWER(h.page.pagePath) AS path,
    device.deviceCategory AS platform,
    MIN(h.time) AS s1_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
    AND (LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%'
         OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%')
    AND totals.visits = 1
    AND h.hour < 21
    AND h.hour >= 20
    AND h.type = "PAGE"
  GROUP BY
    path,
    platform,
    fullVisitorId,
    visitId
  ORDER BY
    fullVisitorId ASC, visitId ASC
),
confirmations AS (
  SELECT
    fullVisitorId,
    visitId,
    MIN(h.time) AS confirmation_time
  FROM
    `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
  WHERE
    _TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
    AND h.type = "PAGE"
    AND (LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%'
         OR LOWER(h.page.pagePath) LIKE '{confirmation-url-2}%')
    AND totals.visits = 1
    AND h.hour < 21
    AND h.hour >= 20
  GROUP BY
    fullVisitorId,
    visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure multiple different routes, but I'd settle for sustainability at this point.
There are probably a few ways you can optimize your query, but it seems like that won't entirely solve your issue (as I'll try to explain further below).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
  path,
  platform,
  COUNT(path) views,
  COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
FROM (
  SELECT
    fv,
    v,
    platform,
    path,
    first_hn,
    MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
  FROM (
    SELECT
      fullvisitorid fv,
      visitid v,
      device.devicecategory platform,
      LOWER(hits.page.pagepath) path,
      MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE NULL END) first_hn,
      MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') THEN hits.hitnumber ELSE NULL END) last_hn
    FROM `project_id.data_set.ga_sessions_20170112`,
      UNNEST(hits) hits
    WHERE
      REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
      AND totals.visits = 1
      AND hits.type = 'PAGE'
    GROUP BY
      fv, v, path, platform
  )
)
GROUP BY
  path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel-start URL (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Also, by using the MIN and MAX operations and the analytic functions you get some optimizations in the query.
There are a few points to make here, though:
If you insert WHERE hits.hour = 20, BigQuery still has to scan the whole table to find which rows have hour 20 and which don't. That means the 23GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions and it took around 31 days of data to reach 23GB. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless you have a considerably high traffic volume coming from your data source.
Given current pricing for BigQuery, 23GB would cost you roughly $0.11 to process, which is quite cost-efficient.
Another thing I could imagine is that you are running this query several times a day and have no cache or proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change that much in the end, as it seems you have quite a high volume of data. Processing 23GB a few times shouldn't be a problem, but if you are concerned about reaching your quota then it sounds like you are running this query several times a day.
That being the case, see if using either a cache flag or saving the results into another table and querying that instead helps. Also, you could start saving daily tables with just the sessions you are interested in (those containing the URL patterns you are looking for) and then run your final query against these newly created tables, which would let you query over a bigger range of days while spending much less.
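As a hedged sketch of that last suggestion (the destination table name is hypothetical, and I'm assuming the CREATE TABLE ... AS SELECT DDL form is available to you):
-- Materialize only the sessions that touch the funnel URLs,
-- then run the funnel query against this much smaller table
CREATE TABLE `project.dataset.funnel_sessions_20170107` AS
SELECT s.*
FROM `project.dataset.ga_sessions_20170107` AS s
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(s.hits) AS h
  WHERE h.type = 'PAGE'
    AND REGEXP_CONTAINS(LOWER(h.page.pagePath), r'/catalog/|product|success')
)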

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

How would you overcome the above restriction?
I am trying to find flows based on sequences of 3 records using the LEAD and LAG window functions, and then calculate some aggregations (count, sum, etc.) of their attributes.
When I run my queries on a small sample of data, everything is fine and the GROUP BY runs OK, but when running on a larger data set I get: "Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead."
In many other cases switching to GROUP EACH BY does the trick...
However, as I use window functions, I cannot use EACH...
Any suggestions? Best practices?
Here is a sample query based on the Wikipedia sample data. It shows the frequency of title editing by different contributors. The WHERE condition is just to limit the response size; if we remove the "B" we get results, but if we add it we get the "use EACH" recommendation.
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM (
  SELECT title, contributor_id,
    LEAD(contributor_id) OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A,B]') = true
)
GROUP BY title
Thanks
I guess your particular use case is different from the sample query, but let me comment on what I'm able to see:
You found a way to make GROUP EACH and OVER() coexist: surrounding the OVER() query with another one allows you to change the GROUP BY into a GROUP EACH BY. However, this query's problem is not there.
Let's forget about GROUP and GROUP EACH and look at the core query:
SELECT title, contributor_id, LEAD(contributor_id)
OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
FROM [publicdata:samples.wikipedia]
WHERE REGEXP_MATCH(title, r'^[A,B]')
This query fails with r'^[A,B]' and works with r'^[A]', and it highlights an OVER() limitation: like GROUP BY and ORDER BY, it only works when the data fits in one machine, as these operations are not parallelizable. As the answer for r'^[A]' reveals, one machine can handle a lot of data - though sometimes not enough. That's why BigQuery offers the parallelizable GROUP EACH BY; however, there is no parallelizable "OVER EACH BY" we could use here.
The workaround I would apply here is exactly what you are doing: run the OVER() on just a fraction of the data at a time.
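For instance, a sketch of that slicing idea in legacy SQL (my illustration of the workaround, not the asker's code): because the window is partitioned by title, the A-titles and B-titles are independent, so you can window each slice separately and combine them with a comma in the FROM clause (legacy SQL's UNION ALL) before grouping.
SELECT
  title,
  COUNT(CASE WHEN contributor_id <> LeadContributor THEN 1 ELSE NULL END) AS different,
  COUNT(CASE WHEN contributor_id = LeadContributor THEN 1 ELSE NULL END) AS same,
  COUNT(*) AS total
FROM
  (SELECT title, contributor_id,
     LEAD(contributor_id) OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^A')),
  (SELECT title, contributor_id,
     LEAD(contributor_id) OVER(PARTITION BY title ORDER BY timestamp) AS LeadContributor
   FROM [publicdata:samples.wikipedia]
   WHERE REGEXP_MATCH(title, r'^B'))
GROUP EACH BY title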
(btw, let me say I love the sample query... it's an interesting question with an interesting answer!)