Which part of my query is wrong? UNNEST function - sql

I couldn't figure out which part of my code is wrong.
I used a UNNEST function but the error msg is still
'Cannot access field productSKU on a value with type ARRAY>' in Google Bigquery.
My query is below:
SELECT
hits.product.productSKU AS product_SKU,
hits.product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits.product) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
Group by
hits.product.productSKU
Order by
v2ProductName DESC

Assuming the overall logic of your query reflect what you want to achieve - below is correct version that fixes unnest'ing part as well as adds missing field in group by - hope you see what gets corrected
#standardSQL
SELECT
product.productSKU AS product_SKU,
product.v2ProductName AS Product_Name,
SUM(totals.transactionRevenue) AS Total_Revenue,
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hit,
UNNEST(hit.product) AS product
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731' AND totals.transactions >= 1
GROUP BY product_SKU, Product_Name
ORDER BY v2ProductName DESC

Related

Using OFFSET instead of UNNEST for nested fields in Google Bigquery

A quick question to GBQ gurus.
Here are two queries that are identical in their purpose
first
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits[
OFFSET(0)].time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits[OFFSET(0)].eventInfo.eventAction,
hits[OFFSET(0)].TRANSACTION.transactionId,
hits[OFFSET(0)].TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`
WHERE
ARRAY_LENGTH(hits) > 0
AND _table_suffix BETWEEN '20200201'
AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
second
SELECT
fullVisitorId AS userid,
CONCAT(fullVisitorId, visitStartTime) AS session,
visitStartTime + (hits.time / 1000) AS eventtime,
date,
trafficSource.campaign,
trafficSource.source,
trafficSource.medium,
trafficSource.adContent,
trafficSource.adwordsClickInfo.campaignId,
geoNetwork.region,
geoNetwork.city,
trafficSource.keyword,
totals.visits AS visits,
device.deviceCategory AS deviceType,
hits.eventInfo.eventAction,
hits.TRANSACTION.transactionId,
hits.TRANSACTION.transactionRevenue,
SUBSTR(channelGrouping,0,3) AS newchannelGrouping
FROM
`some_site.ga_sessions_*`, UNNEST(hits) hits
WHERE
_table_suffix BETWEEN '20200201' AND '20200201'
AND fullVisitorId IN (
SELECT
DISTINCT(fullVisitorId)
FROM
`some_site.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_table_suffix BETWEEN '20200201'
AND '20200201'
AND (hits.TRANSACTION.transactionId != 'None')
)
The 1st one uses OFFSET to extract data from nested fields. According to execution details report, the query requires about 1.5 MB of shuffling.
The 2nd query uses UNNEST to reach nested data. And the amount of shuffled bytes is around (!) 75 MB
The amount of processed data is the same in both cases.
Now, the question is:
Does that mean that according to this article which concerns optimizing communication between slots I should uses OFFSET instead of UNNEST to get the data stored in nested fields?
Thanks!
Let's consider following examples with using BigQuery public dataset.
UNNEST - returns 6 results:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT h FROM t, UNNEST(hits) h
OFFSET - returns 1 result:
WITH t AS (SELECT * FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` WHERE visitId = 1501571504 )
SELECT hits[OFFSET(0)] FROM t
Both queries are referencing to the same record inside a GA public table. They show that using a join with UNNEST will bring one row per element inside the array and using OFFSET(0) will bring only one row with the first element of the array.
The reason for difference in high data shuffling is because the UNNEST performs a JOIN operation, which requires the data to be organized in a specific way. The OFFSET approach takes only the first element of the array.

Converting Legacy SQL to Standard SQL - Enhannced Ecommerce

I am in no way a coder so I have tried but falling over on this.
I want to use this query from Googles Google Analytics Big Query Cookbook
Products purchased by customers who purchased product A (Enhanced Ecommerce)
I have pasted the code below
Into Standard SQL.
I have made a few attemps but am falling over and not
Thank you in advance
John
SELECT hits.product.productSKU AS other_purchased_products,
COUNT(hits.product.productSKU) AS quantity
FROM (
SELECT fullVisitorId, hits.product.productSKU, hits.eCommerceAction.action_type
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM TABLE_DATE_RANGE([bigquery-public-data:google_analytics_sample.ga_sessions_],
TIMESTAMP('2017-04-01'), TIMESTAMP('2017-04-20'))
WHERE hits.product.productSKU CONTAINS 'GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND hits.product.productSKU IS NOT NULL
AND hits.product.productSKU !='GGOEYOCR077799'
AND hits.eCommerceAction.action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC;
Below is pure equivalent in BigQuery Standard SQL (no any optimizations, improvements, etc. - just pure translation from legacy to standard)
SELECT productSKU AS other_purchased_products, COUNT(productSKU) AS quantity
FROM (
SELECT fullVisitorId, prod.productSKU, hit.eCommerceAction.action_type
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
)
WHERE fullVisitorId IN (
SELECT fullVisitorId
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) hit, UNNEST(hit.product) prod
WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170420'
AND prod.productSKU LIKE '%GGOEYOCR077799%'
AND hit.eCommerceAction.action_type = '6'
GROUP BY fullVisitorId
)
AND productSKU IS NOT NULL
AND productSKU !='GGOEYOCR077799'
AND action_type = '6'
GROUP BY other_purchased_products
ORDER BY quantity DESC
obviously produces exactly same result as legacy version

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get new visitor count and use the customDimension at the same time? The sum(total.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
In case you have a dimension on hit scope and really need to flatten the table, you need to build a session id you can count distinct. That is because you are repeating all session scoped fields on hit-scope by applying the cross-join:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (comma means cross-join in stnd sql)
`table` t, t.hits h
GROUP BY 1,2,3
So in case for new visitors I only provide a session id if totals.newVisits=1 - else the if-statement provides NULL which is not countable.
If you have something similar on product-scope, you'd need to create an ID that takes into account session and hit.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note, that I'm left joining the product array. Since it sometimes is empty a cross-join would destroy all hits information: cross-join with empty table results in empty table.
Hope that helps!

How to use multiple custom dimensions in Google Big Query

Is there a way to use multiple custom dimensions in GBQ without using the Max function? My problem of using Max function is that it only saves the max pax_num, but I would like to have the count of visitors for all of the combinations of ( Date,product.v2ProductCategory,eCommerceAction.action_type
,product.v2ProductName). Note the pax_num is number of pax on that ticket. I need every combination of the dest+pax_num, not the dest+max(pax_num)
SELECT
Date
,count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as visitor
, product.v2ProductCategory as product_category
,max(if(customDimensions.index=2, customDimensions.value,null)) as dest
,max((if(customDimensions.index=21, customDimensions.value,null)) ) as pax_num
,eCommerceAction.action_type as Action_type
,product.v2ProductName as product_name
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions) AS customDimensions
CROSS JOIN UNNEST(hit.product) AS product
GROUP BY
Date
,product.v2ProductCategory
,eCommerceAction.action_type
,product.v2ProductName
Not sure if this is what you are looking for, but if you include the field pax_num in the group by you might already find what you need, like so:
select
date,
count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as sessions,
product.v2ProductCategory category,
max(if(customDimensions.index=2, customDimensions.value, null)) as dest,
if(customDimensions.index=21, customDimensions.value,null) as pax_num,
eCommerceAction.action_type as act_type,
product.v2ProductName as product_name
from `table` as t,
unnest(hits) as hit,
unnest(hit.customDimensions) customDimensions,
unnest(hit.product) as product
group by
date,
category,
act_type,
pax_num,
product_name
having pax_num is not null
You gave as an example the pax_num values "paxnum_5" and "paxnum_6". If you insert the value pax_num in the group by operation, the count aggregation should happen on the level of pax_num which would preserve the values (and not mix everything into the max value as before).
Also, notice that if you count the distinct combination of fullvisitorids and visitids you are actually computing the total amount of sessions and not visitors (their definition is not the same).
Add the fullvisitorID solve the problem
SELECT
Date
,concat(fullVisitorID,cast(visitID as string)) as visitorID
,count(distinct( concat(FULLVISITORID,cast(visitID as string)))) as visitor
, product.v2ProductCategory as product_category
,max(if(customDimensions.index=2, customDimensions.value,null)) as dest
,max((if(customDimensions.index=21, customDimensions.value,null)) ) as pax_num
,eCommerceAction.action_type as Action_type
,product.v2ProductName as product_name
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions) AS customDimensions
CROSS JOIN UNNEST(hit.product) AS product
GROUP BY
Date
,product.v2ProductCategory
,eCommerceAction.action_type
,product.v2ProductName
,visitorID

Showing all results even using GROUP BY CLAUSE

Query :
How to sort by months ?
select format(datee,'mmm-yyyy') as [Months],sum(amount) as Amount
from ledger_broker
where ref_from like 'Purchase'
group by format(datee,'mmm-yyyy')
order by format(datee,'mmm-yyyy') desc
Output :
Try grouping by the same exact column which you select:
SELECT t.[Months], t.Amount
FROM
(
SELECT MONTH(datee) AS theMonth, YEAR(datee) AS theYear,
FORMAT(datee,'mmm-yyyy') AS [Months], SUM(amount) AS Amount
FROM ledger_transporter
WHERE ref_from LIKE 'Purchase'
GROUP BY MONTH(datee), YEAR(datee), FORMAT(datee, 'mmm-yyyy')
) t
ORDER BY t.theYear DESC, t.theMonth DESC
One way to order by date is to select the numeric month and year in your query.
change group by datee to group by format(datee,'mmm-yyyy').
select distinct format(datee,'mmm-yyyy') as [Months], sum(amount) as Amount
from ledger_transporter
where ref_from like 'Purchase'
group by format(datee,'mmm-yyyy')
order by Month(datee)
The reason is that your date, which I assume is say 01-FEB-2016 and 02-FEB-2016, is different and if you group by it, you will get 2 different records for it.
However, for format(datee,'mmm-yyyy'), ie FEB-2016, both of these dates are same. Hence the mismatch