Google Analytics query: landing page and page paths - sql

I am a newbie with SQL and BigQuery so I had hoped you could help me with a standard SQL query I am working on.
The data set is from a Google Analytics roll-up property.
The objective of this query is to get, for each session: the date, the GA property, the number of transactions, the total revenue, some session-scoped custom dimensions, and two other elements I can't seem to work out.
1) I would like to add the landing page for each session. There are some resources on the internet for that, but so far I haven't succeeded. Do you have any idea?
2) I would also like to add funnel steps based on page paths, e.g. a column "Step 1" containing 1 or 0 depending on whether the session contains a page view on a certain page path. Do you have any idea how to do that?
This is my current query (sorry if it's not well formatted):
SELECT
date,
visitId,
hits.sourcePropertyInfo.sourcePropertyDisplayName AS service,
totals.transactions AS transactions,
totals.totalTransactionRevenue AS revenue,
ARRAY(
SELECT STRUCT(
MAX(IF(cd.index=3, cd.value, NULL)) AS endUserProvider,
MAX(IF(cd.index=2, cd.value, NULL)) AS connection,
MAX(IF(cd.index=10, cd.value, NULL)) AS sid,
MAX(IF(cd.index=11, cd.value, NULL)) AS price,
MAX(IF(cd.index=12, cd.value, NULL)) AS period,
MAX(IF(cd.index=13, cd.value, NULL)) AS serviceId,
MAX(IF(cd.index=14, cd.value, NULL)) AS promotion)
FROM UNNEST(hits.customDimensions) cd) result
FROM `wide-oasis-135923.126764585.ga_sessions_*`,
UNNEST(hits) hits
WHERE _TABLE_SUFFIX = '20171026'
LIMIT 100;
Thank you very much
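A minimal sketch of both points, assuming the landing page is taken to be the pagePath of the first PAGE hit of the session and that a funnel step is a simple prefix match on pagePath. The sample rows and the paths '/offer' and '/checkout' are hypothetical stand-ins for the real export; only hits.hitNumber, hits.type and hits.page.pagePath are real field names. Note that the hits array is read inside subqueries instead of being cross joined, so the result stays at one row per session:

```sql
-- Hypothetical sample data standing in for ga_sessions_*.
WITH raw AS (
  SELECT 1 AS visitId, 1 AS hitNumber, 'PAGE' AS type, '/home' AS pagePath UNION ALL
  SELECT 1, 2, 'PAGE', '/offer' UNION ALL
  SELECT 1, 3, 'PAGE', '/checkout' UNION ALL
  SELECT 2, 1, 'PAGE', '/offer'
),
-- Rebuild the nested shape of the export: one row per session with a hits array.
sessions AS (
  SELECT
    visitId,
    ARRAY_AGG(STRUCT(hitNumber AS hitNumber,
                     type AS type,
                     STRUCT(pagePath AS pagePath) AS page)) AS hits
  FROM raw
  GROUP BY visitId
)
SELECT
  visitId,
  -- Landing page: path of the earliest PAGE hit (an assumption, see above).
  (SELECT h.page.pagePath
   FROM UNNEST(hits) AS h
   WHERE h.type = 'PAGE'
   ORDER BY h.hitNumber
   LIMIT 1) AS landingPage,
  -- Funnel flags: 1 if any PAGE hit in the session matches the step's path.
  IF(EXISTS(SELECT 1 FROM UNNEST(hits) AS h
            WHERE h.type = 'PAGE' AND h.page.pagePath LIKE '/offer%'), 1, 0) AS step1,
  IF(EXISTS(SELECT 1 FROM UNNEST(hits) AS h
            WHERE h.type = 'PAGE' AND h.page.pagePath LIKE '/checkout%'), 1, 0) AS step2
FROM sessions
ORDER BY visitId;
```

Against the real table, the two CTEs would be dropped and the final SELECT pointed at the ga_sessions_* table directly.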

Related

BigQuery - Transactions in internal promo report

As in the question here (Replicate Internal Promotion report in BigQuery with transactions), I want to rebuild the internal promo report from Google Analytics.
I was able to get PromoViews and PromoClicks, but I don't get the transactions...
My query looks like this
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.promoid,
hp.promoname,
hp.promocreative,
hp.promoposition,
h.promotionActionInfo.promoIsView AS promoview,
h.promotionActionInfo.promoIsClick AS promoclick
FROM [MYDATA], UNNEST (hits) as h, UNNEST (h.promotion) as hp
When I sum promoview and promoclick I get exactly the same results as in Google Analytics.
The official Google Documentation says:
>How transactions are attributed
>The Internal Promotion report attributes transactions to either an internal-promotion click or internal-promotion view.
>
>Each hit in an ecommerce session can have:
>
>0 or 1 internal-promotion clicks
>0 or more internal-promotion views
>
>Internal-promotion click attribution
>If a hit includes a single internal-promotion click, then that internal-promotion is credited for the transaction.
>
>If a session includes multiple internal-promotion clicks, then the last-clicked internal-promotion is credited for the transaction.
>
>If a hit includes zero internal-promotions clicks but one of that user’s previous hits does include an internal-promotion click, then the internal promotion from the previous click is credited for the transaction.
>
>Internal-promotion view attribution
>If none of the conditions above is true but a hit includes one or more internal-promotion views, then the transaction is credited to all promotional views within the session.
https://support.google.com/analytics/answer/6014872?hl=en
Keeping this in mind, my approach was to do a join with a separate table where I query the enhanced ecommerce data using sessionid as the join key
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.v2ProductName AS ProductName,
h.transaction.transactionId AS TransactionId,
hp.productQuantity as Quantity,
FROM [MYDATA], UNNEST (hits) AS h, UNNEST (h.product) as hp
WHERE
h.eCommerceAction.action_type = "6" AND
(hp.isImpression IS NULL) AND
(
(h.promotionActionInfo.promoIsView is true) OR
(h.promotionActionInfo.promoIsClick is true)
)
But it seems that my WHERE clause (filtering the promoviews and clicks) is not working like I expect, because I receive an empty table as a result.
Can anybody help me with this?
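One possible explanation, offered as an assumption rather than a confirmed diagnosis: the promo view and click flags sit on the hit where the promotion was shown or clicked, while action_type "6" sits on the purchase hit, so requiring both on the same hit may match nothing. The quoted documentation attributes at session level, which suggests joining purchase hits to earlier promo clicks of the same session instead. A rough sketch of the "last click wins" rule only, on hypothetical already-flattened hit rows (sessionid, hitNumber, promoId, promoIsClick and transactionId are simplified stand-ins for the nested fields):

```sql
-- Hypothetical flattened hits: one row per hit, promo and transaction fields side by side.
WITH raw AS (
  SELECT 'A' AS sessionid, 1 AS hitNumber, 'promo_1' AS promoId,
         TRUE AS promoIsClick, CAST(NULL AS STRING) AS transactionId UNION ALL
  SELECT 'A', 2, 'promo_2', TRUE, NULL UNION ALL
  SELECT 'A', 3, NULL, NULL, 'T-100' UNION ALL
  SELECT 'B', 1, 'promo_1', TRUE, NULL
),
clicks AS (
  SELECT sessionid, hitNumber, promoId
  FROM raw
  WHERE promoIsClick
),
purchases AS (
  SELECT sessionid, hitNumber, transactionId
  FROM raw
  WHERE transactionId IS NOT NULL
)
SELECT
  p.sessionid,
  p.transactionId,
  -- Keep only the most recent click that happened before the purchase hit.
  ARRAY_AGG(c.promoId ORDER BY c.hitNumber DESC LIMIT 1)[OFFSET(0)] AS creditedPromo
FROM purchases AS p
JOIN clicks AS c
  ON c.sessionid = p.sessionid
 AND c.hitNumber < p.hitNumber
GROUP BY p.sessionid, p.transactionId;
```

With this sample data, transaction T-100 in session A is credited to promo_2, the last promo clicked before the purchase. The view-based fallback from the documentation is not covered here.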

How to see whole session path in BigQuery?

I'm learning standard SQL in BigQuery and I have a task where I have to show what users did after entering checkout, i.e. which specific URLs they visited. I came up with something like this, but it only shows one previous step and I need to see at least five of them. Is this possible? Thank you.
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT hits.page.pagePath
, LAG(hits.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) hits WITH OFFSET i
) x
FROM `xxxx.ga_sessions_20160801`
)
SELECT COUNT(*) as cnt, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE regexp_contains (pagePath,r'(checkout/cart)')
GROUP BY 2,3
ORDER BY
cnt desc
Here is the official GA schema for the BigQuery export:
https://support.google.com/analytics/answer/3437719?hl=en
(Just a tip: feel free to export it to a sheet (Excel, Google Sheets or whatever) and indent it properly, which makes the nesting easier to understand :) )
The only safe way to reconstruct session behaviour is to order the hits by hits.hitNumber. Since pagePath is under page, which is under hits, hitNumber will always be populated :)
It is up to you to filter on hits with a non-empty pagePath only, while still displaying the hitNumber value.
Tell me if this solution matches your issue, or correct me :)
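To make the hitNumber idea concrete, here is a minimal sketch that lists up to five pages following each checkout hit, ordered by hitNumber. The sample rows and paths are hypothetical; only hits.hitNumber and hits.page.pagePath are real field names from the export:

```sql
-- Hypothetical sample data standing in for ga_sessions_YYYYMMDD.
WITH raw AS (
  SELECT 1 AS visitId, 1 AS hitNumber, '/home' AS pagePath UNION ALL
  SELECT 1, 2, '/checkout/cart' UNION ALL
  SELECT 1, 3, '/checkout/shipping' UNION ALL
  SELECT 1, 4, '/checkout/payment' UNION ALL
  SELECT 1, 5, '/thank-you'
),
sessions AS (
  SELECT
    visitId,
    ARRAY_AGG(STRUCT(hitNumber AS hitNumber,
                     STRUCT(pagePath AS pagePath) AS page)) AS hits
  FROM raw
  GROUP BY visitId
)
SELECT
  s.visitId,
  h.hitNumber AS checkoutHit,
  -- Up to five pages seen after the checkout hit, in hit order.
  ARRAY(
    SELECT nxt.page.pagePath
    FROM UNNEST(s.hits) AS nxt
    WHERE nxt.hitNumber > h.hitNumber
    ORDER BY nxt.hitNumber
    LIMIT 5
  ) AS nextPages
FROM sessions AS s, UNNEST(s.hits) AS h
WHERE REGEXP_CONTAINS(h.page.pagePath, r'checkout/cart');
```

Flipping the comparison to nxt.hitNumber < h.hitNumber and ordering descending would give the preceding pages instead.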

Unable to get the right sessions count using Bigquery standard SQL

I want to get the total number of sessions, but because I am unnesting 'hits' and 'hits.product' at the same time, as shown in the code below, I get a lower session count than the one I see in GA. I suspect that this filters out the sessions that don't have any products.
I can also handle this without UNNEST by using ARRAY, as shown below:
ARRAY(SELECT DISTINCT v2ProductCategory FROM UNNEST(hits.product)) AS v2ProductCategory
Is there any way to pull all the sessions with their product category, product name and hit info (hits.time, hits.page.pagePath) without using ARRAY, so that I can use the result in my further analysis?
select count(distinct session) from(SELECT
fullvisitorid,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS session,
hits.time,
hits.page.pagePath,
hits.eCommerceAction.action_type,
product.v2ProductCategory,
product.v2ProductName
FROM
`XXXXXXXXXXXXXXX`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN "20170930"
AND "20170930")
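The suspicion is consistent with how a comma (cross) join on UNNEST behaves: a hit whose product array is empty produces no rows, so sessions with no products disappear. A LEFT JOIN on the UNNEST keeps those hits with NULL product fields. A minimal sketch on hypothetical sample data (session names and product values are made up):

```sql
-- Hypothetical sample: s1 has a product on its hit, s2 has an empty product array.
WITH sessions AS (
  SELECT 's1' AS session,
         [STRUCT(1 AS hitNumber,
                 [STRUCT('Shoes' AS v2ProductName)] AS product)] AS hits
  UNION ALL
  SELECT 's2',
         [STRUCT(1 AS hitNumber,
                 ARRAY<STRUCT<v2ProductName STRING>>[] AS product)]
)
SELECT
  -- Cross join: hits without products are dropped, and their sessions with them.
  (SELECT COUNT(DISTINCT session)
   FROM sessions, UNNEST(hits) AS h, UNNEST(h.product) AS p) AS sessions_cross_join,
  -- Left join: hits without products are kept, with p set to NULL.
  (SELECT COUNT(DISTINCT session)
   FROM sessions, UNNEST(hits) AS h
   LEFT JOIN UNNEST(h.product) AS p) AS sessions_left_join;
```

On this sample the first count is 1 and the second is 2. In the original query, the equivalent change would be replacing the last comma join with LEFT JOIN UNNEST(hits.product) AS product.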

Query for selecting sequence of hits consumes large quantity of data

I'm trying to measure the conversion rate through alternative funnels on a website. My query has been designed to output a count of sessions that viewed the relevant start URL and a count of sessions that hit the confirmation page strictly in that order. It does this by comparing the times of the hits.
My query appears to return accurate figures, but in doing so selects a massive quantity of data, just under 23GB for what I've attempted to limit to one hour of one day. I don't seem to have written my query in a particularly efficient way and gather that I'll use up all of my company's data quota fairly quickly if I continue to use it.
Here's the offending query in full:
WITH
s1 AS (
SELECT
fullVisitorId,
visitId,
LOWER(h.page.pagePath) AS path,
device.deviceCategory AS platform,
MIN(h.time) AS s1_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
LOWER(h.page.pagePath) LIKE '{funnel-start-url-1}%' OR LOWER(h.page.pagePath) LIKE '{funnel-start-url-2}%'
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
AND
h.type = "PAGE"
GROUP BY
path,
platform,
fullVisitorId,
visitId
ORDER BY
fullVisitorId ASC, visitId ASC
),
confirmations AS (
SELECT
fullVisitorId,
visitId,
MIN(h.time) AS confirmation_time
FROM
`project.dataset.ga_sessions_*`, UNNEST(hits) AS h
WHERE
_TABLE_SUFFIX BETWEEN '20170107' AND '20170107'
AND
h.type = "PAGE"
AND
LOWER(h.page.pagePath) LIKE '{confirmation-url-1}%' OR LOWER(h.page.pagePath) LIKE '{confirmations-url-2}%'
AND
totals.visits = 1
AND
h.hour < 21
AND
h.hour >= 20
GROUP BY
fullVisitorId,
visitId
)
SELECT
platform,
path,
COUNT(path) AS Views,
SUM(
CASE
WHEN s1.s1_time < confirmations.confirmation_time
THEN 1
ELSE 0
END
) AS SubsequentPurchases
FROM
s1
LEFT JOIN
confirmations
ON
s1.fullVisitorId = confirmations.fullVisitorId
AND
s1.visitId = confirmations.visitId
GROUP BY
platform,
path
What is it about this query that means it has to process so much data? Is there a better way to get at these numbers? Ideally any method should be able to measure the multiple different routes, but I'd settle for sustainability at this point.
There are probably a few ways that you can optimize your query but it seems like it won't entirely solve your issue (as I'll further try to explain).
As for the query, this one does the same but avoids re-selecting data and the LEFT JOIN operation:
SELECT
path,
platform,
COUNT(path) views,
COUNT(CASE WHEN last_hn > first_hn THEN 1 END) SubsequentPurchases
from(
SELECT
fv,
v,
platform,
path,
first_hn,
MAX(last_hn) OVER(PARTITION BY fv, v) last_hn
from(
SELECT
fullvisitorid fv,
visitid v,
device.devicecategory platform,
LOWER(hits.page.pagepath) path,
MIN(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product') THEN hits.hitnumber ELSE null END) first_hn,
MAX(CASE WHEN REGEXP_CONTAINS(hits.page.pagepath, r'success') then hits.hitnumber ELSE null END) last_hn
FROM `project_id.data_set.ga_sessions_20170112`,
UNNEST(hits) hits
WHERE
REGEXP_CONTAINS(hits.page.pagepath, r'/catalog/|product|success')
AND totals.visits = 1
AND hits.type = 'PAGE'
GROUP BY
fv, v, path, platform
)
)
GROUP BY
path, platform
HAVING NOT REGEXP_CONTAINS(path, r'success')
first_hn tracks the funnel start URLs (for which I used the terms "catalog" and "product") and last_hn tracks the confirmation URLs (for which I used the term "success", but you could add more values to the regex). Using MIN and MAX together with the analytic function also avoids reading the table twice and joining the results.
There are a few points though to make here:
If you insert WHERE hits.hour = 20, BigQuery still has to scan the whole table to tell the rows where the hour is 20 from the rest. That means the 23 GB you observed still accounts for the whole day.
For comparison, I tested your query against our ga_sessions tables and it took around 31 days of data to reach 23 GB. As you are not selecting that many fields, it shouldn't be that easy to reach this amount unless your data source has a considerably high traffic volume.
Given current pricing for BigQuery, 23 GB would cost you roughly $0.11 to process, which is quite cost-efficient.
Another thing I could imagine is that you are running this query several times a day and have no cache or proper architecture for these operations.
All this being said, you can optimize your query, but I suspect it won't change that much in the end, as you seem to have quite a high volume of data. Processing 23 GB a few times shouldn't be a problem, but if you are concerned that it will reach your quota then you are probably running this query several times a day.
This being the case, see if using a cache flag, or saving the results into another table and then querying that table instead, will help. Also, you could start saving daily tables with just the sessions you are interested in (those having the URL patterns you are looking for) and then run your final query against these newly created tables, which would allow you to query a bigger range of days while spending much less.
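One more thing that may be worth checking in the original query: AND binds more tightly than OR, so a condition written as "suffix filter AND path LIKE a OR path LIKE b AND ..." leaves the second LIKE branch outside the _TABLE_SUFFIX filter. If that is what happens, rows from every table matched by the wildcard can satisfy the WHERE clause, which would increase the data processed. A small self-contained illustration of the precedence, with made-up suffixes and paths:

```sql
-- Hypothetical rows: a table suffix and a page path per row.
WITH t AS (
  SELECT '20170107' AS suffix, '/start-a' AS path UNION ALL
  SELECT '20170108', '/start-b' UNION ALL
  SELECT '20170108', '/other'
)
SELECT
  suffix,
  path,
  -- Parsed as (suffix = ... AND path LIKE a) OR (path LIKE b).
  suffix = '20170107' AND path LIKE '/start-a%' OR path LIKE '/start-b%' AS unparenthesized,
  -- The date filter applies to both path alternatives.
  suffix = '20170107' AND (path LIKE '/start-a%' OR path LIKE '/start-b%') AS parenthesized
FROM t
ORDER BY suffix, path;
```

The '/start-b' row from suffix 20170108 passes the unparenthesized condition but not the parenthesized one, so wrapping the OR alternatives in parentheses keeps the date filter in force.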

Page Sequencing in BigQuery

I am trying to build a segment for users who hit page /company first. This cannot be their only page in the visit.
SELECT totals.pageviews, visitid, COUNT(distinct visitorid) as TotalUsers
FROM (TABLE_DATE_RANGE([1234567.ga_sessions_],TIMESTAMP('2015-01-05'),TIMESTAMP('2015-01-11')))
WHERE hits.hitnumber = 1
AND page.pageTitle = 'Company Page'
GROUP EACH BY visitorid, visitid
LIMIT 10;
This isn't right because this just matches visitors with 1 total hit for their session. I want to specify that hit #1 should equal 'Company Page'.
Is it possible to also create a sequence of pages? For example: a segment that hit Page x, Page y, Page Z - in that specific order.
If I understood it right, you could get what you want by querying on the fullvisitorid (be careful visitorid is deprecated) and the first page as your company page.
Then just make a query on those visitors whose id is in that subset.
Something like:
SELECT fullvisitorid, hits.page.pageTitle
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
HAVING fullvisitorid IN
(SELECT fullvisitorid
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.hitnumber = 1 AND hits.page.pageTitle = 'London Cycle Helmet')
LIMIT 1000
Where I used the BigQuery public sample of GA (google.com:analytics-bigquery:LondonCycleHelmet) and 'London Cycle Helmet' as the first page.
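For reference, a rough standard SQL sketch of the same idea at session level, assuming one row per page hit and using hypothetical visitors and paths. It keeps sessions whose first page is '/company' and that contain more than one page, and flags sessions that start with a given page sequence:

```sql
-- Hypothetical flattened page hits: one row per page view.
WITH raw AS (
  SELECT 'u1' AS fullVisitorId, 1 AS visitId, 1 AS hitNumber, '/company' AS pagePath UNION ALL
  SELECT 'u1', 1, 2, '/pricing' UNION ALL
  SELECT 'u1', 1, 3, '/signup' UNION ALL
  SELECT 'u2', 7, 1, '/company' UNION ALL
  SELECT 'u3', 9, 1, '/pricing' UNION ALL
  SELECT 'u3', 9, 2, '/company'
),
-- One ordered array of pages per session.
paths AS (
  SELECT
    fullVisitorId,
    visitId,
    ARRAY_AGG(pagePath ORDER BY hitNumber) AS pages
  FROM raw
  GROUP BY fullVisitorId, visitId
)
SELECT
  fullVisitorId,
  visitId,
  ARRAY_TO_STRING(pages, ' > ') AS path,
  -- Sequence check: does the session start with these pages in this order?
  STARTS_WITH(ARRAY_TO_STRING(pages, ' > '),
              '/company > /pricing > /signup') AS followsSequence
FROM paths
WHERE pages[OFFSET(0)] = '/company'   -- first page is the company page
  AND ARRAY_LENGTH(pages) > 1;        -- and it is not the only page
```

On this sample only the u1 session is returned: u2 has a single page and u3 does not start on '/company'. The string prefix check is a simplification and would also match longer paths that merely begin with '/signup'.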