Query across multiple datasets and a dynamic date range in BigQuery

I have a query that collects data for a dynamic date range (the last 7 days) from one dataset in BigQuery. My data source is Google Analytics, so I have other datasets connected with an identical schema, and I'd like the query to return data from those as well. Usually I would use UNION ALL for this, but the query contains a complex categorization (the CASE expression below) that needs to be updated regularly, and I'd rather not maintain it separately for each dataset.
Could you advise on how to query across datasets, or suggest a more elegant way to handle the UNION ALL approach?
SELECT
Date,
COUNT(DISTINCT VisitId) AS users,
COUNT(VisitId) AS sessions,
SUM(totals.transactions) AS orders,
CASE
# Organic Search - Google
WHEN ( channelGrouping LIKE "Organic Search"
OR trafficSource.source LIKE "com.google.android.googlequicksearchbox")
AND trafficSource.source LIKE "%google%" THEN "Organic Search - Google"
ELSE "Other"
END AS Channel,
hits.page.hostname AS site
FROM
`xxx.dataset1.ga_sessions_20*`
CROSS JOIN
UNNEST (hits) AS hits
WHERE
parse_DATE('%y%m%d',
_table_suffix) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 day)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
AND totals.visits = 1
AND hits.isEntrance IS TRUE
GROUP BY
Date,
Channel,
hits.isEntrance
ORDER BY
Users DESC
UPDATE: Thanks to the responses below, I have got as far as the following. It queries all datasets in the UNION, but the date range is not applied and all data is queried instead. Any ideas why it's not being picked up?
SELECT
Date,
LOWER(hits.page.hostname) AS site,
IFNULL(COUNT(VisitId),0) AS sessions,
IFNULL(SUM(totals.transactions),0) AS orders,
IFNULL(ROUND(SUM(totals.transactions)/COUNT(VisitId),4),0) AS conv_rate,
# Channel definition starts here
CASE
# Organic Search - Google
WHEN ( channelGrouping LIKE "Organic Search"
OR trafficSource.source LIKE "com.google.android.googlequicksearchbox")
AND trafficSource.source LIKE "%google%" THEN "Organic Search - Google"
ELSE "Other"
END AS Channel
FROM (
SELECT * FROM `xxx.43786551.ga_sessions_20*` UNION ALL
SELECT * FROM `xxx.43786097.ga_sessions_20*` UNION ALL
SELECT * FROM `xxx.43786092.ga_sessions_20*`
WHERE PARSE_DATE('%Y%m%d',_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
)
CROSS JOIN UNNEST (hits) AS hits
WHERE totals.visits = 1
AND hits.isEntrance IS TRUE
GROUP BY
Date,
channel,
hits.isEntrance,
site
HAVING hits.isEntrance IS TRUE

#standardSQL
SELECT
DATE,
COUNT(DISTINCT VisitId) AS users,
COUNT(VisitId) AS sessions,
SUM(totals.transactions) AS orders,
CASE
# Organic Search - Google
WHEN ( channelGrouping LIKE "Organic Search"
OR trafficSource.source LIKE "com.google.android.googlequicksearchbox")
AND trafficSource.source LIKE "%google%" THEN "Organic Search - Google"
ELSE "Other"
END AS Channel,
hits.page.hostname AS site
FROM (
SELECT * FROM `xxx.dataset1.ga_sessions_20*` WHERE PARSE_DATE('%y%m%d',_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL SELECT * FROM `xxx.dataset2.ga_sessions_20*` WHERE PARSE_DATE('%y%m%d',_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL SELECT * FROM `xxx.dataset3.ga_sessions_20*` WHERE PARSE_DATE('%y%m%d',_TABLE_SUFFIX) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
)
CROSS JOIN UNNEST (hits) AS hits
WHERE totals.visits = 1
AND hits.isEntrance IS TRUE
GROUP BY
DATE,
Channel,
site
ORDER BY
Users DESC
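A WHERE clause belongs to the single SELECT it is written in, so in the UPDATE query the filter on _TABLE_SUFFIX restricts only the last branch of the UNION ALL, while the query above repeats it in every branch. Since the suffix matched by ga_sessions_20* has a two-digit year, the format string is '%y%m%d'. To avoid maintaining the repeated branches by hand, the inner subquery could be generated. The following is only a minimal sketch in Python: the function name build_union and its parameters are hypothetical, and the project and dataset names are the placeholders from the question.

```python
def build_union(project, datasets, days_back=7):
    """Return a UNION ALL subquery with the date filter repeated in every branch.

    Hypothetical helper: it only assembles a SQL string and does not call BigQuery.
    """
    date_filter = (
        "PARSE_DATE('%y%m%d', _TABLE_SUFFIX) "
        f"BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {int(days_back)} DAY) "
        "AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)"
    )
    branches = [
        f"SELECT * FROM `{project}.{dataset}.ga_sessions_20*` WHERE {date_filter}"
        for dataset in datasets
    ]
    return "\nUNION ALL\n".join(branches)


# Placeholder names taken from the question.
sql = build_union("xxx", ["dataset1", "dataset2", "dataset3"])
print(sql)
```

The generated text goes inside FROM ( ... ), so the CASE categorization in the outer query is still written only once.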

Related

Split credit for transactions and Revenue between clicks on events (UA GA)

In this task, the idea is to assign the sales credit (transactions and revenue) equally to the events that were clicked on during the user's session. The output table would look like the one below, except that the revenue and transactions are split if the user had two events, three events, and so on.
The three scenarios below show how transactions and revenue should be shared between events. Does anyone have an idea how to adapt the code?
I include code that assigns sales to events but does not split the revenue and transaction credit; this is the code that needs to be modified.
three scenarios
output table
Grateful in advance for any help
with event_home_page as (select q.* except(isEntrance), if (isEntrance = true, 'true', 'false') isEntrance
from (
select
PARSE_DATE('%Y%m%d', CAST(date AS STRING)) as true_date,
hits.isEntrance,
hits.eventInfo.eventCategory,
hits.eventInfo.eventAction,
hits.eventInfo.eventLabel,
concat(fullvisitorid, cast(visitstarttime as string)) ID,
count(*) click
FROM `ga360.123456.ga_sessions_*`, unnest (hits) as hits
WHERE
_table_suffix = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
and hits.page.pagePath in ('www.example.com/')
and regexp_contains(hits.eventInfo.eventCategory, '^clickable_element.*')
group by 1,2,3,4,5,6) q
),
transactions as (
select
PARSE_DATE('%Y%m%d', CAST(date AS STRING)) as true_date,
concat(fullvisitorid, cast(visitstarttime as string)) ID,
sum(totals.totalTransactionRevenue/1000000) as all_revenue,
sum(totals.transactions) all_transactions
FROM `ga360.123456.ga_sessions_*`
WHERE
_table_suffix = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
group by 1,2
)
select hp.true_date, hp.isEntrance, hp.eventCategory, hp.eventAction, hp.eventLabel, hp.click, t.all_revenue revenue, t.all_transactions transactions
from event_home_page hp left join transactions t on hp.true_date=t.true_date and hp.id=t.id
order by revenue desc
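The equal split described in the question can be sketched as follows. This is a hedged illustration in Python, not the asker's method: it assumes that "equally" means each distinct event in a session receives 1/n of that session's revenue and transactions, and the function name split_credit and the sample event labels are made up for the example.

```python
def split_credit(events, revenue, transactions):
    """Share one session's revenue and transactions equally across its events.

    Assumption: every event in the session gets the same share (1/n).
    """
    n = len(events)
    if n == 0:
        return []
    return [
        {"event": event, "revenue": revenue / n, "transactions": transactions / n}
        for event in events
    ]


# A session with two clicked events, one transaction and a revenue of 100.
rows = split_credit(["banner", "menu"], revenue=100.0, transactions=1)
for row in rows:
    print(row)
```

In the query above, the same idea would roughly correspond to dividing t.all_revenue and t.all_transactions by a per-session event count such as COUNT(*) OVER (PARTITION BY hp.ID).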

Get most recent record for each id

I am trying to get a list of all users in a database. I also have another table that contains only the users who are members.
The issue is that some of those who are members today could earlier have been customers, members, or neither, so there can be duplicates.
What I want is to pick only the most recent record, based on the date column present in the data.
Here are the 2 tables output:
User table:
Users table
Members table:
Members table
I want to left join the tables, keeping all distinct records from the users table and, from the members table, only the matching record with the most recent cd.value.
WITH users AS(
SELECT
fullVisitorId AS Clientid
FROM `records`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 10 DAY))
AND
FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND
totals.visits = 1
), members As(
SELECT
MAX(date) AS date,
fullVisitorId AS Clientid,
cd.value AS CD_value,
cd.index AS CD_index
FROM `records`,
UNNEST(customDimensions) AS cd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 10 DAY))
AND
FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND
totals.visits = 1
AND
cd.index = 6
group by
Clientid,
CD_value,
CD_index
)
SELECT
users.ClientId AS clientId,
members.CD_value
from users
LEFT JOIN members ON users.ClientId = members.Clientid
group by
members.CD_value,
clientId
order by
clientId ASC
Try using ROW_NUMBER():
WITH users AS(
SELECT
fullVisitorId AS Clientid
FROM `records`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 10 DAY))
AND
FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND
totals.visits = 1
), members As(
SELECT
date AS date,
fullVisitorId AS Clientid,
cd.value AS CD_value,
cd.index AS CD_index,
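-- one row number sequence per client, newest first; partitioning by the value as well
-- would keep one row per (client, value) pair and leave the duplicates in place
row_number() over(partition by fullVisitorId
order by date desc, visitStartTime desc) rn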
FROM `records`,
UNNEST(customDimensions) AS cd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 10 DAY))
AND
FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND
totals.visits = 1
AND
cd.index = 6
), m2 as ( select * from members where rn=1)
SELECT distinct
users.ClientId AS clientId,
m2.CD_value
from users
LEFT JOIN m2 ON users.ClientId = m2.Clientid
order by
clientId ASC
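The idea behind the ROW_NUMBER() approach, keeping exactly one row per client id, namely the one with the latest date, can be sketched outside SQL as well. The following Python snippet is only an illustration; the field names client_id, date and cd_value and the sample records are invented for the example.

```python
def latest_per_id(records):
    """Keep, for each client id, the record with the greatest date."""
    latest = {}
    for rec in records:
        current = latest.get(rec["client_id"])
        # YYYYMMDD strings sort chronologically, so string comparison is enough here.
        if current is None or rec["date"] > current["date"]:
            latest[rec["client_id"]] = rec
    return latest


records = [
    {"client_id": "A", "date": "20240101", "cd_value": "customer"},
    {"client_id": "A", "date": "20240105", "cd_value": "member"},
    {"client_id": "B", "date": "20240103", "cd_value": "member"},
]
latest = latest_per_id(records)
print(latest["A"]["cd_value"])
```

Client A has two different values over time, and only the newer one survives, which is the behaviour the question asks for.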

Issue creating a Google Analytics "Returning Users" metric in BigQuery

Taking what has been described at https://webmasters.stackexchange.com/a/87523 as well as my own understanding, I've come up with what I think would be considered "Returning Users".
1. First, a query to show users who had their first "latest visit" within a two-year time period:
SELECT
parsedDate,
CASE
# return fullVisitorId when the first latest visit is between 2 years and today
WHEN parsedDate BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE() THEN fullVisitorId
END fullVisitorId
FROM (
SELECT
# convert the date field from string to date and get the latest date
PARSE_DATE('%Y%m%d',
MAX(date)) parsedDate,
fullVisitorId
FROM
`project.dataset.ga_sessions_*`
WHERE
# only show fullVisitorId if first visit
totals.newVisits = 1
GROUP BY
fullVisitorId)
2. Then a separate query to select some fields within a specific date range:
SELECT
PARSE_DATE('%Y%m%d',
date) parsedDate,
fullVisitorId,
visitId,
totals.newVisits,
totals.visits,
totals.bounces,
device.deviceCategory
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX = "20180118"
3. Joining these two queries together to find "Returning Users":
SELECT
q1.parsedDate date,
COUNT(DISTINCT q1.fullVisitorId) users,
# Default way to determine New Users
SUM(q1.newVisits) newVisits,
# Number of "New Users" based on my queries (matches with default way above)
COUNT(DISTINCT IF(q2.parsedDate < q1.parsedDate, NULL, q2.fullVisitorId)) newUsers,
# Number of "Returning Users" based on my queries
COUNT(DISTINCT IF(q2.parsedDate < q1.parsedDate, q2.fullVisitorId, NULL)) returningUsers
FROM (
(SELECT
PARSE_DATE('%Y%m%d',
date) parsedDate,
fullVisitorId,
visitId,
totals.newVisits,
totals.visits,
totals.bounces,
device.deviceCategory
FROM
`project.dataset.ga_sessions_*`
WHERE
_TABLE_SUFFIX = "20180118") q1
LEFT JOIN (
SELECT
parsedDate,
CASE
# return fullVisitorId when the first latest visit is between 2 years and today
WHEN parsedDate BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR) AND CURRENT_DATE() THEN fullVisitorId
END fullVisitorId
FROM (
SELECT
# convert the date field from string to date and get the latest date
PARSE_DATE('%Y%m%d',
MAX(date)) parsedDate,
fullVisitorId
FROM
`project.dataset.ga_sessions_*`
WHERE
# only show fullVisitorId if first visit
totals.newVisits = 1
GROUP BY
fullVisitorId)) q2
ON q1.fullVisitorId = q2.fullVisitorId)
GROUP BY
date
Results in BQ
Un-sampled new/returning visitors split by Users report for the same period in GA
Questions/Issues:
Given that newVisits (the default field) and newUsers (my calculation) give the same result, which is in line with the New Visitor users in the GA report, why is there a mismatch between GA's Returning Visitor users and my returningUsers calculation in BQ? Can these two even be compared? What am I missing?
Is my approach the most efficient and least verbose way of going about this?
Is there a better way to get the figures, something I'm missing?
SOLUTION
Based on Martin's answer below, I managed to create the "Returning Users" metric/field within the context of the query I was running:
SELECT
date,
deviceCategory,
# newUsers - SUM result if it's a new user
SUM(IF(userType="New Visitor", 1, 0)) newUsers,
# returningUsers - COUNT DISTINCT fullvisitorId if it's a returning user
COUNT(DISTINCT IF(userType="Returning Visitor", fullvisitorid, NULL)) returningUsers,
COUNT(DISTINCT fullvisitorid) users,
SUM(visits) sessions
FROM (
SELECT
date,
fullVisitorId,
visitId,
totals.visits,
device.deviceCategory,
IF(totals.newVisits IS NOT NULL, "New Visitor", "Returning Visitor") userType
FROM
`project.dataset.ga_sessions_20180118` )
GROUP BY
deviceCategory,
date
Google Analytics uses approximations for users (fullvisitorid), even if it says "based on 100%". You get better user numbers when using an unsampled report.
Another thing to mention: fullvisitorids are taken into consideration even if totals.visits != 1, while sessions are only counted where totals.visits = 1.
Also, users are double-counted if they were new and then returned. Meaning, this should give you the correct numbers:
SELECT
totals.newVisits IS NOT NULL AS isNew,
COUNT(DISTINCT fullvisitorid) AS visitors,
SUM(totals.visits) AS sessions
FROM
`project.dataset.ga_sessions_20180214`
GROUP BY
1
If you want to avoid double counting you can use this, where a user is counted as new even if she returned:
WITH
visitors AS (
SELECT
fullvisitorid,
-- check if any visit of this visitor was new - will be used for grouping later
MAX(totals.newVisits ) isNew,
SUM(totals.visits) as sessions
FROM
`project.dataset.ga_sessions_20180214`
GROUP BY 1
)
SELECT
isNew IS NOT NULL AS isNew,
COUNT(1) AS visitors,
sum(sessions) as sessions
FROM
visitors
GROUP BY 1
Of course these numbers match with GA only in totals.
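The double counting mentioned above can be illustrated with a small Python model. This is only a sketch under assumptions: the three sample sessions are invented, and new_visit stands in for totals.newVisits (1 for a first session, None otherwise).

```python
sessions = [
    {"user": "u1", "new_visit": 1},     # first session of u1
    {"user": "u1", "new_visit": None},  # u1 comes back on the same day
    {"user": "u2", "new_visit": None},  # u2 was first seen on an earlier day
]

# Distinct users per type: a user can show up under both types.
new_users = {s["user"] for s in sessions if s["new_visit"] is not None}
returning_users = {s["user"] for s in sessions if s["new_visit"] is None}
all_users = {s["user"] for s in sessions}

# One classification per user: "new" wins if any session of that user was new.
is_new = {}
for s in sessions:
    is_new[s["user"]] = is_new.get(s["user"], False) or s["new_visit"] is not None

dedup_new = sum(1 for flag in is_new.values() if flag)
dedup_returning = sum(1 for flag in is_new.values() if not flag)

print(len(new_users), len(returning_users), len(all_users))
print(dedup_new, dedup_returning)
```

With the per-type counts, new plus returning exceeds the total number of users because u1 is in both sets; with the per-user classification the two groups add up to the total.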

How to return a correct aggregate total from unnested data (Google Analytics data in BigQuery)

I am running some queries on GA data in BigQuery and I keep encountering the same problem when I want to return a sum from unnested data: my totals are much higher than expected. I suspect that every unnested row is being counted, resulting in an inaccurate total. Here is an example:
SELECT DATE, SUM(totals.transactions)
FROM `PROJECTNAME.43786551.ga_sessions_20*` AS GBP
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY DATE
Returns:
1 20171122 12967
Which is as expected. Next, I want to use a field from hits., which requires me to unnest hits, making my query:
SELECT DATE, SUM(totals.transactions), MIN(hits.page.hostname) AS site
FROM `PROJECTNAME.43786551.ga_sessions_20*` AS GBP
CROSS JOIN UNNEST (hits) as hits
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY DATE
However, the results for this now show:
20171122 2320004 www.hostname.com
The count of transactions is much higher; I assume it is counting every unnested row. How can I get around this when I want to aggregate a session-level field but also use a field from the unnested hits?
You should probably do something like this:
SELECT DATE, SUM(totals.transactions),
(SELECT MIN(hit.page.hostname) FROM UNNEST (GBP.hits) AS hit) AS site
FROM `PROJECTNAME.43786551.ga_sessions_20*` AS GBP
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 1 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY DATE, site
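Why the total inflates can be seen in a small Python model of the two queries. It is only a sketch with invented data: each session carries a session-level transactions value and a list of hit hostnames, and the cross join with the unnested hits is modelled as one output row per hit.

```python
sessions = [
    {"transactions": 1, "hits": ["www.hostname.com"] * 3},
    {"transactions": 2, "hits": ["www.hostname.com"] * 5},
]

# Without UNNEST: one row per session, so each value is summed once.
correct = sum(s["transactions"] for s in sessions)

# With CROSS JOIN UNNEST(hits): one row per hit, so the session-level
# value is repeated for every hit of that session before it is summed.
inflated = sum(s["transactions"] for s in sessions for _ in s["hits"])

# Scalar subquery per session: the hostname is derived from the hits
# while the session itself still contributes exactly one row.
per_session_site = [min(s["hits"]) for s in sessions]

print(correct, inflated, per_session_site)
```

The scalar subquery in the answer above works like the last step: it reads from the hits array without multiplying the session rows.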

Google BigQuery Visit data per date

I'd like to retrieve the number of visits that contain a hit with a given custom dimension, split by date.
The query below gives me this as a single total for all selected dates, but how do I get it split by date?
Many thanks in advance!
select sum(sessions) as total_sessions, from (
select
fullvisitorid,
count(distinct visitid) as sessions,
from (TABLE_DATE_RANGE([XXX.ga_sessions_], TIMESTAMP('2016-09-01'), TIMESTAMP('2016-09-03')))
where totals.visits = 1
AND hits.customDimensions.index = 3
AND hits.customDimensions.value = 'play'
group each by fullvisitorid
)
ga_sessions tables have a date field (see the Analytics to BigQuery Export schema).
So, if you want to stay with BigQuery Legacy SQL for the query above, you can use this date field, as in the example below:
SELECT date, SUM(sessions) AS total_sessions FROM (
SELECT
date,
fullvisitorid,
COUNT(DISTINCT visitid) AS sessions
FROM (TABLE_DATE_RANGE([XXX.ga_sessions_], TIMESTAMP('2016-09-01'), TIMESTAMP('2016-09-03')))
WHERE totals.visits = 1
AND hits.customDimensions.index = 3
AND hits.customDimensions.value = 'play'
GROUP BY date, fullvisitorid
)
GROUP BY date
If you can and want to migrate from BigQuery Legacy SQL to BigQuery Standard SQL, you can use the example below:
SELECT
_TABLE_SUFFIX AS date,
COUNTIF(EXISTS (SELECT 1 FROM UNNEST(hits), UNNEST(customDimensions)
WHERE index = 3 AND value = 'play')) AS sessions
FROM `XXX.ga_sessions_*`
WHERE totals.visits = 1
# the suffix of ga_sessions_* has the form YYYYMMDD, without dashes
AND _TABLE_SUFFIX BETWEEN '20160901' AND '20160903'
GROUP BY date
See more details about using Wildcard Tables
Can you try this with your table using standard SQL (uncheck "Use Legacy SQL" under "Show Options")? I may have misunderstood the question, but it computes the total number of visits for each day matching the condition on customDimensions, which I believe is what you want.
SELECT
date,
COUNTIF(EXISTS (SELECT 1 FROM UNNEST(hits), UNNEST(customDimensions)
WHERE index = 3 AND value = 'play')) as sessions
FROM `XXX.ga_sessions_*`
WHERE totals.visits = 1
GROUP BY date;
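The logic of COUNTIF(EXISTS(...)) grouped by date, counting a session once if any of its hits carries the matching custom dimension, can be sketched in Python as follows. The function name sessions_per_date and the sample sessions are hypothetical; only the index 3 and the value 'play' come from the question.

```python
from collections import Counter


def sessions_per_date(sessions, index=3, value="play"):
    """Count, per date, the sessions with at least one matching hit-level custom dimension."""
    counts = Counter()
    for session in sessions:
        matched = any(
            cd["index"] == index and cd["value"] == value
            for hit in session["hits"]
            for cd in hit["customDimensions"]
        )
        if matched:
            # A session counts once, however many of its hits match.
            counts[session["date"]] += 1
    return dict(counts)


play = {"customDimensions": [{"index": 3, "value": "play"}]}
other = {"customDimensions": [{"index": 3, "value": "pause"}]}
sessions = [
    {"date": "20160901", "hits": [play, play]},
    {"date": "20160901", "hits": [other]},
    {"date": "20160902", "hits": [other, play]},
]
result = sessions_per_date(sessions)
print(result)
```

The first session has two matching hits but is counted once, which is the difference between testing for existence and counting unnested rows.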