BigQuery - Transactions in internal promo report - sql

As in the question Replicate Internal Promotion report in BigQuery with transactions, I want to rebuild the Internal Promotion report from Google Analytics.
I was able to get PromoViews and PromoClicks, but I can't get the transactions.
My query looks like this:
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.promoid,
hp.promoname,
hp.promocreative,
hp.promoposition,
h.promotionActionInfo.promoIsView as promoview,
h.promotionActionInfo.promoIsClick as promoclick
FROM [MYDATA], UNNEST (hits) as h, UNNEST (h.promotion) as hp
When I sum promoview and promoclick, I get exactly the same results as in Google Analytics.
The official Google documentation says:
>How transactions are attributed
>The Internal Promotion report attributes transactions to either an internal-promotion click or internal-promotion view.
>
>Each hit in an ecommerce session can have:
>
>0 or 1 internal-promotion clicks
>0 or more internal-promotion views
>Internal-promotion click attribution
>If a hit includes a single internal-promotion click, then that internal-promotion is credited for the transaction.
>
>If a session includes multiple internal-promotion clicks, then the last-clicked internal-promotion is credited for the transaction.
>
>If a hit includes zero internal-promotions clicks but one of that user’s previous hits does include an internal-promotion click, then the internal promotion from the previous click is credited for the transaction.
>
>Internal-promotion view attribution
>If none of the conditions above is true but a hit includes one or more internal-promotion views, then the transaction is credited to all promotional views within the session.
https://support.google.com/analytics/answer/6014872?hl=en
Keeping this in mind, my approach was to do a join with a separate table where I query the enhanced ecommerce data using sessionid as the join key
SELECT
clientid,
fullvisitorid,
visitstarttime,
concat(fullvisitorid, cast(visitstarttime AS string)) AS sessionid,
hp.v2ProductName AS ProductName,
h.transaction.transactionId AS TransactionId,
hp.productQuantity as Quantity,
FROM [MYDATA], UNNEST (hits) AS h, UNNEST (h.product) as hp
WHERE
h.eCommerceAction.action_type = "6" AND
(hp.isImpression IS NULL) AND
(
(h.promotionActionInfo.promoIsView is true) OR
(h.promotionActionInfo.promoIsClick is true)
)
But it seems that my WHERE clause (filtering the promoviews and clicks) is not working like I expect, because I receive an empty table as a result.
Can anybody help me with this?
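One possible reason the filter returns nothing is that the purchase hit (action_type "6") may carry no promotion flags of its own, because the click or view happened on an earlier hit of the session. A minimal sketch of the last-click rule quoted above, assuming promo clicks and transactions are collected separately and joined per session. The table name `project.dataset.ga_sessions_*` and the aliases click_hit, transaction_hit and credited_promo are placeholders, not part of the original query, and the view-attribution fallback is not covered:

```sql
-- Sketch only: `project.dataset.ga_sessions_*` is a placeholder table name.
WITH promo_clicks AS (
  -- one row per promotion click, with the hit number it occurred on
  SELECT
    CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS sessionid,
    h.hitNumber AS click_hit,
    hp.promoId,
    hp.promoName
  FROM `project.dataset.ga_sessions_*`, UNNEST(hits) AS h, UNNEST(h.promotion) AS hp
  WHERE h.promotionActionInfo.promoIsClick
),
transactions AS (
  -- one row per purchase hit
  SELECT
    CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS sessionid,
    h.hitNumber AS transaction_hit,
    h.transaction.transactionId AS transactionid
  FROM `project.dataset.ga_sessions_*`, UNNEST(hits) AS h
  WHERE h.eCommerceAction.action_type = '6'
)
SELECT
  t.sessionid,
  t.transactionid,
  -- keep only the most recent click at or before the purchase hit
  ARRAY_AGG(STRUCT(c.promoId, c.promoName)
            ORDER BY c.click_hit DESC LIMIT 1)[OFFSET(0)] AS credited_promo
FROM transactions AS t
JOIN promo_clicks AS c
  ON c.sessionid = t.sessionid
 AND c.click_hit <= t.transaction_hit
GROUP BY t.sessionid, t.transactionid
```

Because the join is an inner join, transactions in sessions without any promotion click drop out; those would need the view-based rule instead.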

Related

BigQuery Session & Hit level understanding

I want to ask about the concept of Events at two levels:
Hit level
Session level
How can I map this logic in BigQuery (standard SQL), and also calculate:
Sessions
Events Per Session
Unique Events
Can somebody please guide me to understand these concepts?
Sometimes totals.visitors is described as the session, and sometimes visitId is taken as the session.
To achieve that you need to grapple a little with a few different concepts. The first is "what is a session" in GA lingo; you can find that here. A session is a collection of hits. A hit is one of the following: pageview, event, social interaction or transaction.
Now to see how that is represented in the BQ schema, you can look here. fullVisitorId and visitId together will help you define a session (as opposed to a user).
Then you can count the hits in each session that are events of the type you want.
It could look something like:
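-- Count event hits per session; hits must be unnested in standard SQL
select fullVisitorId, visitId,
sum(case when h.type = "EVENT" then 1 else 0 end) as events
from `dataset.table_*`, unnest(hits) as h
group by 1, 2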
That should work to get you an overview. If you need to slice and dice the event details (e.g. hits.eventInfo.*), then I suggest you make one query for all the visitIds and one for all the relevant events and their respective visitId.
I hope that works!
Cheers
You can think of these concepts like this:
every row is a session
technically every row with totals.visits=1 is a valid session
hits is an array containing structs which contain information for every hit
You can write subqueries on arrays - basically treat them as tables. I'd recommend studying Working with Arrays and applying every exercise directly to hits, if possible.
Example for subqueries on session level
SELECT
fullvisitorid,
visitStartTime,
(SELECT SUM(IF(type='EVENT',1,0)) FROM UNNEST(hits)) events,
(SELECT COUNT(DISTINCT CONCAT(eventInfo.eventCategory,eventInfo.eventAction,eventInfo.eventLabel) )
FROM UNNEST(hits) WHERE type='EVENT') uniqueEvents,
(SELECT SUM(IF(type='PAGE',1,0)) FROM UNNEST(hits)) pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
WHERE
totals.visits=1
LIMIT
1000
Example for Flattening to hit level
There's also the possibility to use fields in arrays for grouping if you cross join arrays with their parent row
SELECT
h.type,
COUNT(1) hits
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS t CROSS JOIN t.hits AS h
WHERE
totals.visits=1
GROUP BY
1
Regarding the relation between visitId and Sessions you can read this answer.

Google Analytics session-scoped fields returning multiple values

I've discovered that there are certain GA "session" scoped fields in BigQuery that have multiple values for the same fullVisitorId and visitId fields. See the example below:
Grouping the fields doesn't help either. In GA, I've checked the number of users vs number of users split by different devices. The user count is different:
This explains what's going on: a user can be grouped under multiple devices. My conclusion is that at some point during the user's session, their browser user-agent changes and, in the subsequent hit, a new device type is set in GA.
I'd have hoped GA would use either the first or last value to avoid this scenario, but I guess it doesn't. Accepting this as a "flaw" in GA, I'd rather pick one value. What's the best way to select the last or first device value from the query below?
SELECT
fullVisitorId,
visitId,
device.deviceCategory
FROM (
SELECT
*
FROM
`project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT
*
FROM
`project.dataset.ga_sessions_*` mobile ) table
I've tried doing a sub-select and using STRING_AGG(), attempting to order by hits.time and limiting to one value and that still creates another row.
I've tested and found that the below fields all have the same issue:
visitNumber
totals.hits
totals.pageviews
totals.timeOnSite
trafficSource.campaign
trafficSource.medium
trafficSource.source
device.deviceCategory
totals.sessionQualityDim
channelGrouping
device.mobileDeviceInfo
device.mobileDeviceMarketingName
device.mobileDeviceModel
device.mobileInputSelector
device.mobileDeviceBranding
UPDATE
See below queries around this particular fullVisitorId and visitId - UNION has been removed:
visitStartTime added:
visitStartTime and hits.time added:
Well, from the looks of things, I think you have 3 options:
1 - Group by fullVisitorId, visitId and use MAX or MIN of deviceCategory. That should prevent a device switcher from being double-counted. It's kind of arbitrary, but then so is the GA data.
2 - Option two is similar but, if the deviceCategory result can be anything (i.e. isn't constrained in the results to just the valid deviceCategory members), you can use a CASE to check whether MAX(deviceCategory) = MIN(deviceCategory) and, if they are different, return 'Multiple Devices'.
3 - You could go further: count the number of different devices used, construct a concatenation that lists them in some way, etc.
I'm going to write up Number 2 for you. In your question, you have 2 different queries: one with [date] and one without - I'll provide both.
Without [date]:
SELECT
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY fullVisitorId, visitId
With [date]:
SELECT
[date],
fullVisitorId,
visitId,
case when max(device.deviceCategory) = min(device.deviceCategory)
then max(device.deviceCategory)
else 'Multiple Devices'
end as deviceCategory,
{metric aggregations here}
FROM
(SELECT *
FROM `project.dataset.ga_sessions_*` desktop
UNION ALL
SELECT *
FROM `project.dataset.ga_sessions_*` mobile
) table
GROUP BY [date], fullVisitorId, visitId
I'm assuming here that the Selects and Union that you gave are sound.
Also, I should point out that those {metric aggregations} should be something other than SUMs, otherwise you will still be double-counting.
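Options 1 and 3 can be sketched along the same lines. This is only an illustration: it assumes a single placeholder table `project.dataset.ga_sessions_*` rather than the UNION from the question, it treats the row with the earliest visitStartTime as the "first" value, and the aliases first_device, device_count and devices are made up for the example:

```sql
-- Sketch only: `project.dataset.ga_sessions_*` is a placeholder table name.
SELECT
  fullVisitorId,
  visitId,
  -- Option 1 variant: device category of the earliest row for the session
  -- (switch ASC to DESC to pick the last one instead)
  ARRAY_AGG(device.deviceCategory
            ORDER BY visitStartTime ASC LIMIT 1)[OFFSET(0)] AS first_device,
  -- Option 3: how many distinct categories appeared, and which ones
  COUNT(DISTINCT device.deviceCategory) AS device_count,
  STRING_AGG(DISTINCT device.deviceCategory, ', '
             ORDER BY device.deviceCategory) AS devices
FROM `project.dataset.ga_sessions_*`
GROUP BY fullVisitorId, visitId
```

Picking the first value by visitStartTime is deterministic, whereas MIN/MAX picks alphabetically; which one is preferable depends on how the duplicated rows arise.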
I hope this helps.
It's simply not possible to have two values for one row in this field, because it can only contain one value.
There are 2 possibilities:
You're actually querying two separate datasets or two different views; that's not clearly visible from the example code. The client id (= fullVisitorId) is only unique per property (tracking id, the UA-xxxxx stuff). If you query two different views from different properties, you have to expect the same ids to be used twice.
Given they are coming from one property, these two rows could actually be one session on a midnight split, which means visitId stays the same, but visitStartTime changes. But that would also mean the decision algorithm for device type changed in the meantime ... that would be curious.
Try using visitStartTime and see what happens.
If you're using two different properties, use a user id to combine the sessions, or keep them separate by adding a constant; otherwise you can't combine them.
SELECT 'property_A' AS constant FROM ...
hth

Unable to get the right sessions count using Bigquery standard SQL

I want to get the total number of sessions, but because I am unnesting hits and hits.product at the same time, as shown in the code below, I get a lower session count than I see in GA. I suspect that it filters out the sessions that don't have any products.
I can also handle this without UNNEST by using ARRAY, as shown below:
ARRAY(SELECT DISTINCT v2ProductCategory FROM UNNEST(hits.product)) AS v2ProductCategory
Is there a way to pull all the sessions with their product category, product name and hit info (hits.time, hits.page.pagePath) without using ARRAY, so that I can use them in further analysis?
select count(distinct session) from(SELECT
fullvisitorid,
CONCAT(CAST(fullVisitorId AS string),CAST(visitId AS string)) AS session,
hits.time,
hits.page.pagePath,
hits.eCommerceAction.action_type,
product.v2ProductCategory,
product.v2ProductName
FROM
`XXXXXXXXXXXXXXX`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN "20170930"
AND "20170930")
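The comma before UNNEST behaves like a cross join, so any hit with an empty product array produces no rows, and sessions made up only of such hits disappear from the count. A minimal sketch of one way around this, assuming LEFT JOIN UNNEST is acceptable in place of the comma joins; `project.dataset.ga_sessions_*` stands in for the masked table name, and h is used as the hit alias to avoid shadowing the hits column:

```sql
-- Sketch only: `project.dataset.ga_sessions_*` is a placeholder for the masked table.
SELECT
  COUNT(DISTINCT session) AS sessions
FROM (
  SELECT
    CONCAT(CAST(fullVisitorId AS STRING), CAST(visitId AS STRING)) AS session,
    h.time,
    h.page.pagePath,
    h.eCommerceAction.action_type,
    product.v2ProductCategory,   -- NULL for hits without products
    product.v2ProductName
  FROM `project.dataset.ga_sessions_*`
  LEFT JOIN UNNEST(hits) AS h            -- keeps sessions even if hits is empty
  LEFT JOIN UNNEST(h.product) AS product -- keeps hits even if product is empty
  WHERE _TABLE_SUFFIX = '20170930'
)
```

The count may still differ slightly from the GA interface, since GA counts only rows where totals.visits = 1 as sessions.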

How to select google analytics segment in google big query? SQL

I created a Google Analytics data source in Tableau.
The data source uses the "new user" segment.
Now I would like to export the Google Analytics data to Google BigQuery and build the same data source in Tableau from BigQuery.
I checked the GA data in the BigQuery project, but there is no segment in BigQuery.
How can I query by the "new user" segment in Google BigQuery?
You can look at BigQuery GA Schema to see all fields that are exported there.
The field totals.newVisits has what you are looking for:
select
hits.transaction.transactionid tid,
date,
totals.pageviews pageviews,
hits.item.itemquantity item_qtd,
hits.transaction.transactionrevenue / 1e6 rvn,
totals.bounces bounces,
fullvisitorid fv,
visitid v,
totals.timeonsite tos,
totals.newVisits new_visit
FROM
`project_id.dataset_id.ga_sessions*`,
unnest(hits) hits
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-05-10')
AND TIMESTAMP('2017-05-10')
group by
tid, date, pageviews, item_qtd, rvn, bounces, fv, v, tos, new_visit
Notice that this field is defined at the session level.
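To restrict results to the segment rather than just display the flag, the field can go into the WHERE clause. A minimal sketch, assuming the "new user" segment corresponds to sessions where totals.newVisits = 1; the wildcard table name and the new_users alias are placeholders:

```sql
-- Sketch only: `project_id.dataset_id.ga_sessions_*` is a placeholder table name.
SELECT
  date,
  COUNT(DISTINCT fullVisitorId) AS new_users
FROM `project_id.dataset_id.ga_sessions_*`
WHERE _TABLE_SUFFIX = '20170510'
  AND totals.newVisits = 1   -- field is NULL for returning visits
GROUP BY date
```

Since totals.newVisits sits on the session row, no UNNEST of hits is needed for this filter.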

Page Sequencing in BigQuery

I am trying to build a segment for users who hit page /company first. This cannot be their only page in the visit.
SELECT totals.pageviews, visitid, COUNT(distinct visitorid) as TotalUsers
FROM (TABLE_DATE_RANGE([1234567.ga_sessions_],TIMESTAMP('2015-01-05'),TIMESTAMP('2015-01-11')))
WHERE hits.hitnumber = 1
AND page.pageTitle = 'Company Page'
GROUP EACH BY visitorid, visitid
LIMIT 10;
This isn't right because this just matches visitors with 1 total hit for their session. I want to specify that hit #1 should equal 'Company Page'.
Is it possible to also create a sequence of pages? For example: a segment that hit Page x, Page y, Page Z - in that specific order.
If I understood it right, you could get what you want by querying on fullvisitorid (be careful, visitorid is deprecated) with your company page as the first page.
Then just make a query on those visitors whose id is in that subset.
Something like:
SELECT fullvisitorid, hits.page.pageTitle
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
HAVING fullvisitorid IN
(SELECT fullvisitorid
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.hitnumber = 1 AND hits.page.pageTitle = 'London Cycle Helmet')
LIMIT 1000
Here I used the BigQuery public sample of GA (google.com:analytics-bigquery:LondonCycleHelmet) with 'London Cycle Helmet' as the first page.
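The queries above use legacy SQL. For comparison, a rough standard SQL sketch of the same idea, assuming the export lives in a placeholder table `project.dataset.ga_sessions_*` and that "not their only page" can be expressed as totals.pageviews > 1:

```sql
-- Sketch only: `project.dataset.ga_sessions_*` is a placeholder table name.
SELECT
  COUNT(DISTINCT fullVisitorId) AS total_users
FROM `project.dataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20150105' AND '20150111'
  AND totals.pageviews > 1              -- more than one page in the visit
  AND EXISTS (
    -- the first hit of the session is the company page
    SELECT 1
    FROM UNNEST(hits) AS h
    WHERE h.hitNumber = 1
      AND h.page.pageTitle = 'Company Page'
  )
```

A page sequence could be handled in a similar way, for example by adding further EXISTS conditions that compare hit numbers, though that is not shown here.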