Unable to get the right sessions count using BigQuery standard SQL

I want to get the total number of sessions, but because I am unnesting both 'hits' and 'hits.product' as shown in the code below, I get a lower session count than the one I see in GA. I suspect the query is filtering out the sessions that don't have any products.
I can avoid this without the second UNNEST by building an array instead, as shown below:
ARRAY(SELECT DISTINCT v2ProductCategory FROM UNNEST(hits.product)) AS v2ProductCategory
Is there a way to pull all sessions with their product category, product name, and hit info (hits.time, hits.page.pagePath) without using ARRAY? I will be using the result in further analysis.
select count(distinct session) from(SELECT
fullvisitorid,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS session,
hits.time,
hits.page.pagePath,
hits.eCommerceAction.action_type,
product.v2ProductCategory,
product.v2ProductName
FROM
`XXXXXXXXXXXXXXX`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN "20170930"
AND "20170930")
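One possible way to keep hits that have no products, sketched here on the assumption that the undercount comes from the comma (cross) join against UNNEST(hits.product): a LEFT JOIN keeps every hit and returns NULL product fields where the product array is empty. The public sample table and the date are stand-ins for the masked table name, not part of the original query.

```sql
#standardSQL
SELECT COUNT(DISTINCT session) AS sessions
FROM (
  SELECT
    fullVisitorId,
    CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS session,
    hits.time,
    hits.page.pagePath,
    hits.eCommerceAction.action_type,
    product.v2ProductCategory,  -- NULL for hits without products
    product.v2ProductName
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,  -- stand-in table
    UNNEST(hits) AS hits
    -- LEFT JOIN keeps hits whose product array is empty
    LEFT JOIN UNNEST(hits.product) AS product
  WHERE
    _TABLE_SUFFIX BETWEEN '20161104' AND '20161104'
)
```

The inner query still returns one row per hit and product, so it can feed further analysis without an ARRAY column.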

Related

BigQuery - Transactions in internal promo report

As in the question here (Replicate Internal Promotion report in BigQuery with transactions), I want to rebuild the internal promo report from Google Analytics.
I was able to get PromoViews and PromoClicks, but I can't get the transactions.
My query looks like this
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.promoid,
hp.promoname,
hp.promocreative,
hp.promoposition,
promotionActionInfo.promoIsView as promoview,
promotionActionInfo.promoIsClick as promoclick
FROM [MYDATA], UNNEST (hits) as h, UNNEST (h.promotion) as hp
When I sum promoview and promoclick, I get exactly the same results as in Google Analytics.
The official Google documentation says:
>How transactions are attributed
>The Internal Promotion report attributes transactions to either an internal-promotion click or internal-promotion view.
>
>Each hit in an ecommerce session can have:
>
>0 or 1 internal-promotion clicks
>0 or more internal-promotion views
>
>Internal-promotion click attribution
>If a hit includes a single internal-promotion click, then that internal-promotion is credited for the transaction.
>
>If a session includes multiple internal-promotion clicks, then the last-clicked internal-promotion is credited for the transaction.
>
>If a hit includes zero internal-promotions clicks but one of that user’s previous hits does include an internal-promotion click, then the internal promotion from the previous click is credited for the transaction.
>
>Internal-promotion view attribution
>If none of the conditions above is true but a hit includes one or more internal-promotion views, then the transaction is credited to all promotional views within the session.
https://support.google.com/analytics/answer/6014872?hl=en
Keeping this in mind, my approach was to do a join with a separate table where I query the enhanced ecommerce data using sessionid as the join key
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.v2ProductName AS ProductName,
h.transaction.transactionId AS TransactionId,
hp.productQuantity as Quantity,
FROM [MYDATA], UNNEST (hits) AS h, UNNEST (h.product) as hp
WHERE
h.eCommerceAction.action_type = "6" AND
(hp.isImpression IS NULL) AND
(
(h.promotionActionInfo.promoIsView is true) OR
(h.promotionActionInfo.promoIsClick is true)
)
But it seems that my WHERE clause (filtering the promoviews and clicks) is not working like I expect, because I receive an empty table as a result.
Can anybody help me with this?
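A likely cause, stated as an assumption rather than a confirmed diagnosis: the promo view or click and the purchase (action_type "6") are usually recorded on different hits, so a WHERE clause that requires both on the same hit matches nothing. A minimal sketch that tests for each at session level instead is shown below; the public sample table is a stand-in for [MYDATA], and only click attribution is covered.

```sql
#standardSQL
SELECT
  CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS sessionid,
  -- promo IDs clicked on any hit of the session
  ARRAY(
    SELECT DISTINCT hp.promoId
    FROM UNNEST(hits) AS h, UNNEST(h.promotion) AS hp
    WHERE h.promotionActionInfo.promoIsClick
      AND hp.promoId IS NOT NULL
  ) AS clicked_promos,
  -- transaction IDs recorded on any purchase hit of the session
  ARRAY(
    SELECT DISTINCT h.transaction.transactionId
    FROM UNNEST(hits) AS h
    WHERE h.eCommerceAction.action_type = '6'
      AND h.transaction.transactionId IS NOT NULL
  ) AS transaction_ids
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20161104`  -- stand-in table
WHERE
  EXISTS (SELECT 1 FROM UNNEST(hits) AS h WHERE h.promotionActionInfo.promoIsClick)
  AND EXISTS (SELECT 1 FROM UNNEST(hits) AS h WHERE h.eCommerceAction.action_type = '6')
```

Crediting only the last-clicked promotion, or falling back to views, would need the hit order (hitNumber) on top of this.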

Why such a big discrepancy in unique events between Google Analytics and BigQuery?

I am trying to get the number of unique events in BigQuery and despite my efforts, the results are not even close to what I see in GA. Certain rows have up to 50% difference between BQ and GA, and I can't figure out why. Total events and users are exactly the same as in GA, it's only unique events that don't match.
I am using a CONCAT function to build the sessionID, and when used to calculate total sessions for a given period, it returns a very close number to what I see in GA. But as soon as I use it with the event category column, the numbers are off.
This is my query:
SELECT h.eventInfo.eventCategory,
count(h.eventInfo.eventCategory) as total_events,
count(distinct CONCAT(fullVisitorId, CAST(visitId AS STRING))) as unique_events
FROM `marketing-stack.12345678.ga_sessions_20190525` as ga,
UNNEST(ga.hits) as h
GROUP BY h.eventInfo.eventCategory
For example, the top event looks like this in GA:
4276 total events - 3155 unique events - 1510 users
And in BigQuery:
4276 total events - 1566 unique events - 1510 users
Am I doing something wrong in the query or is there a difference between GA and BQ in regards to unique events and how you count them that I don't grasp?
I'd appreciate any help or input because I'm at a loss here!
You are counting users with events and not unique events ...
Action and label mustn't be NULL when you COUNT(DISTINCT ) them.
SUM( (SELECT
COUNT(DISTINCT CONCAT(h.eventInfo.eventCategory,
coalesce(h.eventinfo.eventaction, ''),
coalesce(h.eventinfo.eventlabel, '')
)) FROM t.hits h ) ) uniqueEvents
See also here
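The fragment above can be placed in a full query along these lines. This is a sketch that assumes a unique event is one distinct combination of session, category, action and label; the sample table stands in for the original one.

```sql
#standardSQL
SELECT
  h.eventInfo.eventCategory AS eventCategory,
  COUNT(*) AS total_events,
  -- one unique event per session + category + action + label combination
  COUNT(DISTINCT CONCAT(
    fullVisitorId, ':', CAST(visitId AS STRING), ':',
    h.eventInfo.eventCategory, ':',
    COALESCE(h.eventInfo.eventAction, ''), ':',
    COALESCE(h.eventInfo.eventLabel, '')
  )) AS unique_events
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20161104` AS ga,  -- stand-in table
  UNNEST(ga.hits) AS h
WHERE h.type = 'EVENT'
GROUP BY 1
```

Counting only the session key, as in the question, returns the number of sessions with that category, which is why the figure sits so close to the user count.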
One possibility is collisions in the CONCAT(). You can try using a separator:
count(distinct CONCAT(fullVisitorId, ':', CAST(visitId AS STRING))) as unique_events
This is just a possibility.
Another possibility is that one or the other value is NULL. COALESCE() can help:
count(distinct CONCAT(COALESCE(fullVisitorId, ''), ':', COALESCE(CAST(visitId AS STRING), ''))) as unique_events
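To illustrate the collision case with made-up literal values: without a separator, two different pairs of strings can concatenate to the same result.

```sql
#standardSQL
SELECT
  -- '12' + '3' and '1' + '23' both give '123'
  CONCAT('12', '3') = CONCAT('1', '23') AS collides_without_separator,
  -- '12:3' and '1:23' stay distinct
  CONCAT('12', ':', '3') = CONCAT('1', ':', '23') AS collides_with_separator
```

The first column is true and the second is false.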

BigQuery: filter out hits based on hit and product scope dimensions

In BigQuery, there is the Google Analytics based query as is stated below and this works correctly.
#standardSQL
SELECT
Date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`[projectID].[DatasetID].ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
GROUP BY
Date
In this query, I need to exclude all hits where within a hit...
..GA custom dimension #23 (hit-scope) contains value 'editor'
OR
..GA custom dimension #6 (product-scope) matches regular expression value '^63.....$'
OR
..GA hits.page.pagePath matches regular expression value 'gebak|cake'
Note: it is not the intention to apply the 3 conditions as stated above on session-level (as in this screenshot) but on hit-level, since I'd like to reproduce numbers from another GA view than the view from which the data is loaded to BigQuery. In this other GA view the 3 conditions as are stated above are set as view filters.
The 'best' query thus far is the one below (based on the post of Martin Weitzmann below). However, the dataset is not filtered in this query (in other words, the conditions do not work).
SELECT Date,
-- hits,
SUM(totals.transactions),
SUM(totals.visits)
FROM (
(
SELECT date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`[projectID].[DatasetID].ga_sessions_*` t
WHERE
_TABLE_SUFFIX BETWEEN '20181217'
AND '20181217'
AND totals.visits > 0
))
GROUP BY Date
Does anyone know how to achieve the desired output?
Thanks a lot in advance!
Note: the projectID and datasetID have been masked in both queries because of privacy concerns.
Own arrays approach
You can create your own hits and product arrays by using sub-queries on the original and feeding their output back into array functions. In those subqueries you can filter out your hits and products:
#standardsql
SELECT
date,
hits
--SUM(totals.visits) AS Sessions,
--SUM(totals.transactions) AS Transactions
FROM
(
SELECT
date, totals,
-- create own hits array
ARRAY(
SELECT AS STRUCT
hitnumber,
page,
-- create own product array
ARRAY(
SELECT AS STRUCT productSku, productQuantity
FROM h.product AS p
WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%')
) AS product
FROM t.hits as h
WHERE
NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
AND
(SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
) AS hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
)
--GROUP BY 1
LIMIT 100
I left this example in an ungrouped state, but you can easily adjust it by commenting out the hits and group accordingly ...
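A possible grouped version is sketched below. It assumes that sessions and transactions should be recomputed from the hits that survive the filters, because the totals record is precomputed per session and does not change when hits are dropped. The product filter is written as a hit-level exclusion here, and the regular expression from the question replaces the LIKE pattern; both are interpretations, not the answer's original code.

```sql
#standardSQL
SELECT
  date,
  -- a session only counts if at least one hit survives the filters
  COUNTIF(ARRAY_LENGTH(hits) > 0) AS Sessions,
  -- transactions are counted from the surviving hits, not from totals
  SUM((SELECT COUNT(DISTINCT h.transaction.transactionId)
       FROM UNNEST(hits) AS h)) AS Transactions
FROM (
  SELECT
    date,
    ARRAY(
      SELECT AS STRUCT h.hitNumber, h.page, h.transaction
      FROM t.hits AS h
      WHERE NOT REGEXP_CONTAINS(h.page.pagePath, r'gebak|cake')
        AND (SELECT COUNT(1) = 0 FROM h.customDimensions
             WHERE index = 23 AND value LIKE '%editor%')
        AND (SELECT COUNT(1) = 0 FROM h.product AS p, p.customDimensions AS cd
             WHERE cd.index = 6 AND REGEXP_CONTAINS(cd.value, r'^63.....$'))
    ) AS hits
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20161104` AS t
  WHERE totals.visits > 0
)
GROUP BY date
```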
Segmentation approach
I think you just need the right sub-query in your WHERE statement:
#standardsql
SELECT
date,
SUM(totals.visits) AS Sessions,
SUM(totals.transactions) AS Transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
(SELECT COUNT(1)=0 FROM t.hits h
WHERE
(SELECT count(1)>0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
OR
(SELECT count(1)>0 from h.product p, p.customdimensions cd WHERE index=6 AND value like '63%')
OR
REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
)
GROUP BY date
Since all your groups are on session level, you don't need any flattening (resp. cross joins with arrays) on the main table, which is costly.
In your outermost WHERE you enter the hits array with a subquery - it's like a for-each on rows. Here you can already count occasions of REGEXP_CONTAINS(page.pagePath,r'gebak|cake').
For the other cases, you write a subquery again to enter the respective array - in the first case, customDimensions within hits. This is like a nested for-each inside the other one (subquery in a subquery).
In the second case, I'm simply flattening - but within the subquery only: product with its customDimensions. So this is a one-time nested for-each as well because I was lazy and cross-joined. I could've written another Subquery instead of the cross-join, so basically a triple-nested for-each (subquery in a subquery in a subquery).
Since I'm counting cases that I want to exclude, my outer condition is COUNT(1)=0.
I could only test it with ga sample data .. so it's kind of untested. But I guess you get the idea.
Just a quick example/idea on how to use WITH and REGEXP_CONTAINS on a public dataset:
WITH CD6 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions6Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS prod,
UNNEST(prod.customDimensions) AS cd
WHERE cd.index=6
AND NOT REGEXP_CONTAINS(cd.value,r'^63.....$')
GROUP BY cd.value
),
CD23 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions23Sum
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
-- custom dimension 23 is hit-scoped, so read it from hits.customDimensions
UNNEST(hits.customDimensions) AS cd
WHERE cd.index=23
AND NOT REGEXP_CONTAINS(cd.value,r'editor')
GROUP BY cd.value
)
select CD6.Sessions6Sum + CD23.Sessions23Sum from CD6, CD23
You can find more information on REGEXP_CONTAINS and the other regular expression functions in the official BigQuery documentation.

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session" scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId fields. See the example below:
Grouping the fields doesn't help either. In GA, I've checked the number of users vs number of users split by different devices. The user count is different:
This explains what's going on: a user can be grouped under multiple devices. My conclusion is that at some point during the user's session, their browser user-agent changes and, in the subsequent hit, a new device type is set in GA.
I'd have hoped GA would use either the first or the last value to avoid this scenario, but I guess it doesn't. Accepting this as a "flaw" in GA, I'd rather pick one value. What's the best way to select the first or last device value from the query below?
SELECT
fullVisitorId,
visitId,
device.deviceCategory
FROM (
SELECT
*
FROM
`project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT
*
FROM
`project.dataset.ga_sessions_*` mobile ) table
I've tried doing a sub-select and using STRING_AGG(), attempting to order by hits.time and limiting to one value and that still creates another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
See below queries around this particular fullVisitorId and visitId - UNION has been removed:
visitStartTime added:
visitStartTime and hits.time added:
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId, visitId and use MAX or MIN of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Option two is similar but, if the deviceCategory result can be anything (i.e. isn't constrained in the results to just the valid deviceCategory members), you can use a CASE to check MAX(deviceCategory) = MIN(deviceCategory) and if they are different, return 'Multiple Devices'
3 - You could go further, counting the number of different devices used, construct a concatenation that lists them in some way, etc.
I'm going to write up Number 2 for you. In your question, you have 2 different queries: one with [date] and one without - I'll provide both.
Without [date]:
SELECT
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With date (written without square brackets, which denote an array literal in BigQuery standard SQL):
SELECT
date,
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY date, fullVisitorId, visitId
I'm assuming here that the Selects and Union that you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs, otherwise you will still be double-counting.
I hope this helps.
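For option 1, if the first or last value is preferred over MIN or MAX, one possible sketch uses ARRAY_AGG with ORDER BY and LIMIT 1. It assumes that visitStartTime differs between the duplicate rows and can serve as the ordering key, which the question does not confirm.

```sql
#standardSQL
SELECT
  fullVisitorId,
  visitId,
  -- earliest non-NULL device value recorded for the session
  ARRAY_AGG(device.deviceCategory IGNORE NULLS
            ORDER BY visitStartTime ASC LIMIT 1)[SAFE_OFFSET(0)] AS first_deviceCategory,
  -- latest non-NULL device value recorded for the session
  ARRAY_AGG(device.deviceCategory IGNORE NULLS
            ORDER BY visitStartTime DESC LIMIT 1)[SAFE_OFFSET(0)] AS last_deviceCategory
FROM `project.dataset.ga_sessions_*`
GROUP BY fullVisitorId, visitId
```

If the duplicate rows share the same visitStartTime, the pick between them is arbitrary, much like MIN or MAX.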
It's simply not possible to have two values for one row in this field, because it can only contain one value.
There are 2 possibilities:
you're actually querying two separate datasets/ two different views - that's not clearly visible with the example code. Client id (=fullvisitorid) is only unique per Property (Tracking Id, the UA-xxxxx stuff). If you query two different views from different properties you have to expect to get same ids used twice.
Given they are coming from one property, these two rows could actually be one session on a midnight split, which means visitId stays the same, but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime ... that would be curious.
Try using visitStartTime and see what happens.
If you're using two different properties, you can't combine the sessions on the client id. Either use a user id to combine them, or keep them separate by adding a constant per property:
SELECT 'property_A' AS constant FROM ...
hth
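The constant idea could look like the sketch below. The dataset names dataset_a and dataset_b are hypothetical placeholders for the two properties' exports.

```sql
#standardSQL
-- Tag each row with the property it came from so identical ids stay apart
SELECT 'property_A' AS property, fullVisitorId, visitId, visitStartTime,
       device.deviceCategory
FROM `project.dataset_a.ga_sessions_*`  -- hypothetical dataset name
UNION ALL
SELECT 'property_B' AS property, fullVisitorId, visitId, visitStartTime,
       device.deviceCategory
FROM `project.dataset_b.ga_sessions_*`  -- hypothetical dataset name
```

Grouping by property, fullVisitorId and visitId then keeps sessions from the two properties apart even when the ids coincide.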

BigQuery accessing CustomDimensions with new SQL syntax

I am migrating to the new SQL syntax in BigQuery, since it seems more flexible. However, I am a bit stuck when it comes to accessing the fields in customDimensions. I am writing something quite simple like this:
SELECT
cd.customDimensions.index,
cd.customDimensions.value
FROM `xxxxx.ga_sessions_20170312`, unnest(hits) cd
limit 100
But I get the error
Error: Cannot access field index on a value with type ARRAY<STRUCT<index INT64, value STRING>>
However, if I run something like this, it works perfectly fine:
SELECT
date,
SUM((SELECT SUM(latencyTracking.pageLoadTime) FROM UNNEST(hits))) pageLoadTime,
SUM((SELECT SUM(latencyTracking.serverResponseTime) FROM UNNEST(hits))) serverResponseTime
FROM `xxxxxx.ga_sessions_20170312`
group by 1
Is there some different logic when it comes to query the customDimensions?
If the intention is to retrieve all custom dimensions in a flattened form, then join with UNNEST(customDimensions) as well:
#standardSQL
SELECT
cd.index,
cd.value
FROM `xxxxx.ga_sessions_20170312`,
unnest(hits) hit,
unnest(hit.customDimensions) cd
limit 100;
SELECT
fullvisitorid,
( SELECT MAX(IF(index=1,value, NULL))FROM UNNEST(hits.customDimensions)) AS CustomDimension1,
( SELECT MAX(IF(index=2,value, NULL))FROM UNNEST(hits.customDimensions)) AS CustomDimension2
FROM
`XXXXXXX`, unnest(hits) as hits