Google BigQuery realtime does not match GA report - google-bigquery

Hello, I would like to check real-time data using the Google BigQuery realtime export table.
However, my simple query does not match the GA reports. I wrote a query that shows the number of sessions per hour, but the results are off by 10 to 30%.
Is the accuracy of the BigQuery realtime export just not that good, or am I making a mistake?
WITH noDuplicateTable AS (
  SELECT
    ARRAY_AGG(t ORDER BY exportTimeUsec DESC LIMIT 1)[OFFSET(0)].*
  FROM
    `tablename_20*` AS t
  WHERE
    _TABLE_SUFFIX = FORMAT_DATE("%y%m%d", CURRENT_DATE('Asia/Seoul'))
  GROUP BY
    t.visitKey
),
session AS (
  SELECT
    ROW_NUMBER() OVER () AS sessionRow,
    FORMAT_TIMESTAMP('%H', TIMESTAMP_SECONDS(time), 'Asia/Seoul') AS startTime,
    SUM(session) AS session,
    (SUM(session) - SUM(isVisit)) AS uniqueSession,
    (SUM(isVisit) / SUM(session) * 100) AS bounce,
    SUM(totalPageView) AS totalPageView
  FROM (
    SELECT
      COUNT(visitId) AS session,
      visitStartTime AS time,
      SUM(IFNULL(totals.bounces, 0)) AS isVisit,
      SUM(totals.pageviews) AS totalPageView
    FROM
      noDuplicateTable
    GROUP BY
      visitStartTime
  )
  GROUP BY startTime
)
SELECT * FROM session
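One hedged pointer (my note; the question itself is unanswered here): the realtime export also creates a per-day deduplicating view, ga_realtime_sessions_view_YYYYMMDD, which Google recommends querying instead of the raw ga_realtime_sessions_YYYYMMDD tables, since the raw tables can contain multiple snapshots per session. A minimal sketch, assuming that view exists in your dataset:
-- Sessions per hour from the deduplicating realtime view (assumed name/date)
SELECT
  FORMAT_TIMESTAMP('%H', TIMESTAMP_SECONDS(visitStartTime), 'Asia/Seoul') AS startTime,
  COUNT(1) AS sessions
FROM `dataset.ga_realtime_sessions_view_20190101`
GROUP BY startTime
ORDER BY startTime
Even then, some discrepancy against the GA interface is expected, since realtime data has not yet gone through full processing.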

Related

SQL aggregated subquery - Athena

Using AWS Athena, I want to get the total recovered per day by computing total recovered amount / total advances.
Here is the code:
SELECT a.advance_date
,sum(a.advance_amount) as "advance_amount"
,sum(a.advance_fee) as "advance_fee"
,(SELECT
sum(credit_recovered+fee_recovered) / (a.advance_amount+a.advance_fee)
FROM ncmxmy.ageing_recovery_raw_parquet
WHERE advance_date = a.advance_date
AND date(recovery_date) <= DATE_ADD('day', 0, a.advance_date)
) as "day_0"
FROM ageing_summary_advance_parquet a
GROUP BY a.advance_date
ORDER BY a.advance_date
I am getting an error
"("sum"((credit_recovered + fee_recovered)) / (a.advance_amount + a.advance_fee))' must be an aggregate expression or appear in GROUP BY clause"
The division raises the error because the denominator references individual columns from the ageing_summary_advance_parquet table rather than aggregates. As I read the query, you need to divide by the grouped sums of the advance_amount and advance_fee columns. In that case, we can compute both grouped sets by advance_date and join them for the division. Please let me know if this query helps:
WITH cte1 (sum_adv_date, advance_date) as
(SELECT
sum(credit_recovered+fee_recovered) as sum_adv_date, advance_date
FROM ncmxmy.ageing_recovery_raw_parquet
WHERE date(recovery_date) <= DATE_ADD('day', 0, advance_date)
GROUP BY advance_date
),
cte2 (advance_date, advance_amount, advance_fee) as
(SELECT
a.advance_date
,sum(a.advance_amount) as "advance_amount"
,sum(a.advance_fee) as "advance_fee"
FROM ageing_summary_advance_parquet a
GROUP BY a.advance_date
)
SELECT cte2.advance_amount, cte2.advance_fee,
(cte1.sum_adv_date/(cte2.advance_amount+cte2.advance_fee)) as "day_0"
FROM cte1 inner join cte2 on cte1.advance_date = cte2.advance_date
ORDER BY cte1.advance_date
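One optional refinement (my suggestion, not part of the original answer): if some advance dates have no recoveries yet, the INNER JOIN drops them. A LEFT JOIN from cte2 with COALESCE keeps those dates with a zero ratio:
SELECT cte2.advance_date, cte2.advance_amount, cte2.advance_fee,
  COALESCE(cte1.sum_adv_date, 0) / (cte2.advance_amount + cte2.advance_fee) AS "day_0"
FROM cte2 LEFT JOIN cte1 ON cte1.advance_date = cte2.advance_date
ORDER BY cte2.advance_date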

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question for the GBQ gurus.
Here are two queries that are identical in purpose.
First:
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[
OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
Second:
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The first one uses OFFSET to extract data from nested fields. According to the execution details report, the query requires about 1.5 MB of shuffling.
The second query uses UNNEST to reach the nested data, and the amount of shuffled bytes is around (!) 75 MB.
The amount of processed data is the same in both cases.
Now, the question is:
Does this mean that, according to this article about optimizing communication between slots, I should use OFFSET instead of UNNEST to get data stored in nested fields?
Thanks!
Let's consider the following examples, which use a BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries reference the same record inside a GA public table. They show that a join with UNNEST returns one row per element of the array, while OFFSET(0) returns a single row containing only the first element of the array.
The reason for the large difference in data shuffling is that UNNEST performs a JOIN operation, which requires the data to be reorganized, whereas the OFFSET approach simply takes the first element of each array.
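One caveat worth adding (my note, not from the original answer): hits[OFFSET(0)] raises an error when the array is empty, which is why the first query filters on ARRAY_LENGTH(hits) > 0. A minimal sketch against the same public table using SAFE_OFFSET, which returns NULL instead of erroring:
-- SAFE_OFFSET returns NULL rather than an error for out-of-range indexes
WITH t AS (
  SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
  WHERE visitId = 1501571504
)
SELECT hits[SAFE_OFFSET(0)].page.pagePath FROM t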

How to query the daily cost of a specific product in BigQuery?

I exported billing to BigQuery, and I want to get the total cost of Translate for a specific date (like April 1, 2019) or for a month.
The Google docs sample query aggregates monthly:
SELECT
invoice.month,
SUM(cost)
+ SUM(IFNULL((SELECT SUM(c.amount)
FROM UNNEST(credits) c), 0))
AS total,
(SUM(CAST(cost * 1000000 AS int64))
+ SUM(IFNULL((SELECT SUM(CAST(c.amount * 1000000 as int64))
FROM UNNEST(credits) c), 0))) / 1000000
AS total_exact
FROM `project.dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
GROUP BY 1
ORDER BY 1 ASC
;
but I created my query this way:
$myVariable=
"SELECT
COUNT(*) total_times,
SUM(cost) total_cost
FROM
`project.dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE
service.description = 'Translate' AND (usage_end_time >= timestamp('2019-04-04 00:00:00') AND usage_end_time <= timestamp('2019-04-04 23:59:59'))";
I want to get the total cost of the current day and the total cost from the first day of the month to the current day.
sample:
1. 2019/04/04: 4223.05 - (882 Times)
2. 2019/04/Total: 16505.43 - (3882 Times)
You can further add details to your working query:
SELECT
service.description,
timestamp_trunc(usage_start_time,DAY) as time_fragment,
ROUND(SUM(cost)
+ SUM(IFNULL((SELECT SUM(c.amount)
FROM UNNEST(credits) c), 0)),3)
AS total,
round((SUM(CAST(cost * 1000000 AS int64))
+ SUM(IFNULL((SELECT SUM(CAST(c.amount * 1000000 as int64))
FROM UNNEST(credits) c), 0))) / 1000000,3)
AS total_exact
FROM `project.dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE service.description='Translate'
GROUP BY 1,2
ORDER BY 2 desc;
which displays one row per service per day.
You can go further, into HOURly granularity, if you change DAY to HOUR in the timestamp_trunc on line 3 of the query.
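For the month-to-date total the question also asks for, a minimal sketch (assuming the same export table; the WHERE clause on usage_start_time is my addition):
-- Month-to-date total for one service, from the 1st of the current month
SELECT
  COUNT(*) AS total_times,
  ROUND(SUM(cost)
    + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0)), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
WHERE service.description = 'Translate'
  AND usage_start_time >= TIMESTAMP(DATE_TRUNC(CURRENT_DATE(), MONTH))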

How to calculate the average time of the checkout process?

I am trying to calculate the average time our customers spend during the checkout process using Google BigQuery.
As I am very new to SQL and BQ, I am running the following code, trying to get the timestamp for checkout start and the timestamp for checkout completion, and then calculate the average.
SELECT
month, chekout_start as Checkout Started,
time_to_transaction as Checkout
FROM ((SELECT
MONTH(TIMESTAMP(date)) AS month,
TIME(AVG(TimeToCheckout)) AS time_to_transaction
FROM (
SELECT
date,
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as hit_time,
(TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000 + hits.time*1000))) - TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000)) ) ) AS TimeToCheckout
FROM (TABLE_DATE_RANGE([data.ga_sessions_],
TIMESTAMP('2018-01-01'), TIMESTAMP('2018-12-31')))
WHERE totals.transactions>=1
)
GROUP BY month) transaction
INNER JOIN
(
SELECT
MONTH(TIMESTAMP(date)) AS month,
TIME(AVG(TimeToCheckout)) AS checkout_start
FROM (
SELECT
date,
fullVisitorId,
timestamp(integer(visitStartTime*1000000)) as start_time,
timestamp(integer(visitStartTime*1000000 + hits.time*1000)) as hit_time,
(TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000 + hits.time*1000))) - TIMESTAMP_TO_SEC(timestamp(integer(visitStartTime*1000000)) ) ) AS TimeToCheckout
FROM (TABLE_DATE_RANGE([data.ga_sessions_],
TIMESTAMP('2018-01-01'), TIMESTAMP('2018-12-31')))
WHERE (hits.page.pagePath = 'checkout/buy')
)
GROUP BY month
) checkout_start
ON transaction.month = checkout_start.month)
ORDER BY month ASC
The desired outcome is a table with one row per month, showing the average checkout start time and the average time to transaction.
However, I am getting the error 'Encountered " "transaction "" at line 18, column 17. Was expecting: ")" ...'. Can you please have a look at my code and explain what I am doing wrong? Thanks!
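For reference, a minimal Standard SQL sketch of the same idea (my assumption, not the asker's code; the error above occurs in legacy SQL at the subquery alias, and Standard SQL sidesteps that aliasing issue). Since hits[].time is milliseconds from session start, the average time from session start to the first 'checkout/buy' hit per month can be computed as:
-- Average seconds from session start to first checkout page hit, per month
SELECT
  FORMAT_DATE('%Y-%m', PARSE_DATE('%Y%m%d', date)) AS month,
  AVG((SELECT MIN(h.time) FROM UNNEST(hits) h
       WHERE h.page.pagePath = 'checkout/buy') / 1000) AS avg_seconds_to_checkout
FROM `data.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20181231'
  AND totals.transactions >= 1
GROUP BY month
ORDER BY month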

Cohort/ Retention query in BigQuery using Google Analytics exported data

I need help formulating a cohort/retention query.
I am trying to build a query that looks at visitors who performed Action X on their first visit (in the time frame) and then counts how many days later they returned to perform Action X again.
The output I (eventually) need looks like the tables further below.
The table I am dealing with is an export of Google Analytics to BigQuery.
Could anyone help me with this, or has anyone written a similar query that I can adapt?
Thanks
Just to give you a simple idea / direction.
Below is for BigQuery Standard SQL:
#standardSQL
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
You can test it with the dummy data below, taken from your question:
#standardSQL
WITH `OutputFromQuery` AS (
SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT
Date_of_action_first_taken,
ROUND(100 * later_1_day / Visits) AS later_1_day,
ROUND(100 * later_2_days / Visits) AS later_2_days,
ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`
The OutputFromQuery data is as below:
Date_of_action_first_taken Visits later_1_day later_2_days later_3_days
01.07.17 1000 800 400 300
02.07.17 1000 860 780 860
29.07.17 1000 780 120 0
30.07.17 1000 710 0 0
and the final output is:
Date_of_action_first_taken later_1_day later_2_days later_3_days
01.07.17 80.0 40.0 30.0
02.07.17 90.0 78.0 86.0
29.07.17 80.0 12.0 0.0
30.07.17 70.0 0.0 0.0
I found this query in "Turn Your App Data into Answers with Firebase and BigQuery" (Google I/O '19).
It should work :)
#standardSQL
###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH
new_user_cohort AS (
SELECT DISTINCT
user_pseudo_id as new_user_id
FROM
`[your_project].[your_firebase_table].events_*`
WHERE
event_name = '[chosen_event]' AND
#set the date from when starting cohort analysis
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),
num_new_users AS (
SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),
#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
COUNT(DISTINCT user_pseudo_id) as num_engaged_users
FROM
`[your_project].[your_firebase_table].events_*`
INNER JOIN
new_user_cohort ON new_user_id = user_pseudo_id
WHERE
event_name = 'user_engagement' AND
_TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
GROUP BY
event_day
)
####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
event_day,
num_engaged_users,
num_users_in_cohort,
ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
engaged_users_by_day
CROSS JOIN
num_new_users
ORDER BY
event_day
So I think I may have cracked it... from this output I would then need to manipulate it (pivot it; see the conditional-aggregation sketch after the query below) to make it look like the desired output.
Can anyone review this for me and let me know what you think?
WITH
cohort_items AS (
SELECT
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 +
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN' -- placeholder for the actual action filter
GROUP BY 2
),
user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number
FROM `Table123` A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h
WHERE
A._TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN' -- placeholder for the actual action filter
GROUP BY 1,2),
cohort_size AS (
SELECT
cohort_day,
count(1) as number_of_users
FROM
cohort_items
GROUP BY 1
ORDER BY 1
),
retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A
LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)
SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users / S.number_of_users as percentage
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
Thank you in advance!
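For the pivot step mentioned above, a minimal conditional-aggregation sketch (my addition; it assumes the retention_table and cohort_size CTEs from the query above, and you would extend the IF(...) pattern with one column per day):
-- Pivot day_number into one column per day, as a percentage of the cohort
SELECT
  B.cohort_day,
  S.number_of_users AS total_users,
  ROUND(100 * SUM(IF(B.day_number = 1, B.number_of_users, 0)) / S.number_of_users, 1) AS later_1_day,
  ROUND(100 * SUM(IF(B.day_number = 2, B.number_of_users, 0)) / S.number_of_users, 1) AS later_2_days,
  ROUND(100 * SUM(IF(B.day_number = 3, B.number_of_users, 0)) / S.number_of_users, 1) AS later_3_days
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL
GROUP BY B.cohort_day, S.number_of_users
ORDER BY B.cohort_day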
If you use some of the techniques available in BigQuery, you can potentially solve this type of problem with very cost- and performance-effective solutions. As an example:
SELECT
init_date,
ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
SELECT
init_date,
ARRAY_AGG(STRUCT(days, freq)) data
FROM(
SELECT
init_date,
data AS days,
COUNT(data) freq
FROM(
SELECT
init_date,
ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
FROM(
SELECT
MIN(date) init_date,
ARRAY_AGG(DISTINCT date) dts
FROM `Table123`
WHERE TRUE
AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
GROUP BY fullvisitorid
)
),
UNNEST(data) data
GROUP BY init_date, days
)
GROUP BY init_date
)
I tested this query against our GA data, selecting customers who interacted with our recommendation system (as you can see in the WHERE EXISTS filter). In the example results (absolute values of freq omitted for privacy reasons), at day 28, for instance, 8% of customers came back 1 day later and interacted with the system again.
I recommend playing around with this query and seeing if it works well for you. It's simpler, cheaper, faster, and hopefully easier to maintain.