Use row_number() in BigQuery in CTE - google-bigquery

I'm trying to add a row number per orderId with this query, but I get the error message "Unrecognized name: BQ" when selecting the columns from the subquery. I haven't used it much, so I'm not sure what I'm doing wrong. Can anyone spot it?
WITH BQ AS(
SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions)
AS customDimensions WHERE customDimensions.index = 6) as
orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxxx-xxxx.xxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
--AND hits.eventinfo.eventaction IN ('complete purchase')
--AND hits.transaction.transactionId IS NULL
--and hits.page.pagePath != '/backend-transaction'
--and hits.eventinfo.eventaction != 'backend transaction' )
)
SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
BQ.orderId_bq
from BQ
)
I also tried this, but then it doesn't recognize 'flat.orderId' instead:
WITH BQ AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxxx-xxxxxxxx.ga_sessions_20210801` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
),
flat AS (
SELECT
*
from bq
)
SELECT
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY flat.orderId_bq
ORDER BY flat.hitnumber) as RN,
flat.orderId_bq,
flat.event_action,
flat.trx_id,
flat.page,
flat.hitnumber
FROM flat
)

Query that worked:
WITH raw AS
(SELECT
(SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 6) as orderId_bq,
hits.eventinfo.eventaction as event_action,
hits.transaction.transactionId as trx_id,
hits.page.pagePath as page,
hitnumber AS hitnumber
FROM `xxx-xxx.xxxxxx.ga_sessions_*` t,
UNNEST(HITS) as hits
WHERE (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 8) = 'se'
AND (SELECT customDimensions.value FROM UNNEST(t.customDimensions) AS customDimensions WHERE customDimensions.index = 4) = 'soffadirekt'
AND _TABLE_SUFFIX BETWEEN '20211001' AND '20211002'
)
SELECT
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber,
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY raw.hitnumber DESC) as RN,
raw.event_action,
raw.orderId_bq,
raw.trx_id,
raw.page,
raw.hitnumber
FROM raw) AS raw

The error Unrecognized name: BQ occurs because you are selecting BQ.* columns in the outer query, but nothing named BQ exists there: the subquery in FROM has no alias. Adding AS BQ to the subquery should work. See query:
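SELECT
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber,
FROM (SELECT Row_number()
OVER( PARTITION BY BQ.orderId_bq
ORDER BY BQ.hitnumber) as RN,
BQ.orderId_bq,
-- the outer SELECT can only reference columns projected here
BQ.event_action,
BQ.trx_id,
BQ.page,
BQ.hitnumber
FROM BQ) AS BQ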
Just a suggestion: it might be better to use a different alias for the subquery so your query is more readable.
I tested this on a table of mine, adding AS subq1 to my subquery. Here is a simple test:
WITH subq1 AS (SELECT amount_paid,customer from `my-project.test_dataset.myTable`)
SELECT
subq1.RN,
subq1.customer,
subq1.amount_paid
FROM
(SELECT ROW_NUMBER()
OVER(ORDER BY subq1.amount_paid) as RN,
subq1.customer,
subq1.amount_paid
FROM subq1) AS subq1
LIMIT 3
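The same pattern can be tried without access to a GA export. The sketch below assumes a small inline table (the `orders` CTE and its values are made up for illustration) and numbers the rows per order ID. The derived table gets its own alias, `numbered`, which is a hypothetical name chosen so it does not shadow the CTE.

```sql
-- Hypothetical inline data standing in for the flattened GA hits.
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT('A1' AS orderId_bq, 3 AS hitnumber, '/cart' AS page),
    STRUCT('A1', 1, '/home'),
    STRUCT('B2', 2, '/checkout'),
    STRUCT('B2', 5, '/thanks')
  ])
)
SELECT
  numbered.orderId_bq,
  numbered.hitnumber,
  numbered.page,
  numbered.RN
FROM (
  SELECT
    -- restart the numbering for every order, lowest hitnumber first
    ROW_NUMBER() OVER (PARTITION BY orderId_bq ORDER BY hitnumber) AS RN,
    orderId_bq,
    hitnumber,
    page
  FROM orders
) AS numbered  -- without this alias, numbered.* is unrecognized in the outer SELECT
ORDER BY numbered.orderId_bq, numbered.RN
```

Every column the outer SELECT references, including RN, has to be listed in the inner SELECT.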

Related

Unrecognized name: clientId BigQuery

I'm new to SQL and I'm trying to write a query:
SELECT
clientId,
pagePath,
SUM(CASE
WHEN isExit IS NOT NULL THEN last_interaction
ELSE
nextTime
END
) AS time_on_page
FROM (
SELECT
hits.page.pagePath,
hits.isExit,
hits.time/1000 AS hits_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitid ORDER BY hits.time ASC) AS nextTime,
MAX(
IF
(hits.isInteraction = TRUE,
hits.time / 1000,
0)) OVER (PARTITION BY fullVisitorId, visitid) AS last_interaction
FROM
`merck-bigquery.1===.ga_sessions_20201231`,
UNNEST(hits) AS hits
WHERE
hits.type = "PAGE"
AND hits.page.hostname = 'www.msdmed.ru' )
GROUP BY
1
ORDER BY
2 ASC
BigQuery returns the error Unrecognized name: clientId
I don't understand what's wrong with this query, because clientId is a default field in the BQ schema.
The outer query can see only the fields listed in the inner query's SELECT. Either remove clientId from the outer query or add clientId explicitly to the inner query.
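A minimal sketch of that scoping rule, assuming made-up inline rows (the `sessions` CTE, its values, and the `seconds` column are hypothetical). Note that every non-aggregated column in the outer SELECT also has to appear in GROUP BY, so the sketch groups by both clientId and pagePath rather than by position 1 only.

```sql
-- Hypothetical inline rows; clientId and pagePath mimic the GA export fields.
WITH sessions AS (
  SELECT * FROM UNNEST([
    STRUCT('c1' AS clientId, '/home' AS pagePath, 10.0 AS seconds),
    STRUCT('c1', '/home', 5.0),
    STRUCT('c2', '/pricing', 7.5)
  ])
)
SELECT
  clientId,
  pagePath,
  SUM(seconds) AS time_on_page
FROM (
  -- clientId must be projected here, or the outer SELECT cannot see it
  SELECT clientId, pagePath, seconds
  FROM sessions
)
GROUP BY clientId, pagePath
ORDER BY pagePath
```

Removing clientId from the inner SELECT reproduces the "Unrecognized name: clientId" error.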

How can I find the previous page with Bigquery

I want to find the previous page for hits where the current page is a product page.
For example, the current page is 'https://www.emag.ro/telefon-mobil-apple-iphone-x-64gb-4g-space-grey-mqac2rm-a/pd/DN094NBBM' and the previous page is 'https://www.emag.ro/search/telefoane-mobile/IPHONE/c?ref=srcql'
How can I use hitnumber to count how many users showed this behavior?
I tried the two queries below and want to JOIN them, but I don't know the best way to do it.
I also tried the LAG function, but I'm not sure it catches all the users.
Thank you in advance.
with
view_product as (
SELECT
ga.fullVisitorId AS GA_USER_ID,
date as date,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'viewproduct'
)
,
SEARCH_page_WITH_REF_SRCQL as (
select
date as date,
ga.fullVisitorId AS GA_USER_ID,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'search'
AND (SELECT VALUE FROM h.customDimensions WHERE index =8) LIKE 'srcql'
)
select
COUNT(DISTINCT GA_USER_ID) AS USERS,
COUNT(DISTINCT SessionID) AS SESSIONS,
previous_page_from_srcql
from (
select
t1.ga_user_id,
t1.sessionid,
t2.hitnumber > t1.hitnumber as previous_page_from_srcql
from SEARCH_page_WITH_REF_SRCQL as t1
inner join view_product as t2
on t1.ga_user_id = t2.ga_user_id
group by
previous_page_from_srcql
Try UNNEST WITH OFFSET. It can give you an easy way to later determine that one row came after the other:
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT session.page.pagePath
, LAG(session.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) session WITH OFFSET i
) x
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
)
SELECT COUNT(*) c, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE pagePath='/vests/yellow.html'
AND prevPagePath='/vests/'
GROUP BY 2,3
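To try the WITH OFFSET idea without the public dataset, here is a sketch on inline data. The `sessions` CTE, the `session_id` column, and the page paths are hypothetical; each session is reduced to an ordered array of page paths.

```sql
-- Hypothetical sessions, each with an ordered array of page paths.
WITH sessions AS (
  SELECT 1 AS session_id, ['/search', '/product', '/cart'] AS pages
  UNION ALL
  SELECT 2, ['/home', '/product']
  UNION ALL
  SELECT 3, ['/search', '/product']
),
path_and_prev AS (
  SELECT
    session_id,
    ARRAY(
      SELECT AS STRUCT
        pagePath,
        -- i is the array position, so ordering by it gives the hit order
        LAG(pagePath) OVER (ORDER BY i) AS prevPagePath
      FROM UNNEST(pages) AS pagePath WITH OFFSET i
    ) AS x
  FROM sessions
)
SELECT
  COUNT(DISTINCT session_id) AS sessions,
  pagePath,
  prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE pagePath = '/product'
  AND prevPagePath = '/search'
GROUP BY pagePath, prevPagePath
```

Because LAG runs inside each session's own array, the previous page never leaks across sessions. Here sessions 1 and 3 match, while session 2 reached the product page from '/home'.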

NTILE() in BigQuery for non-uniform buckets

I'm trying to perform RFM segmentation on the Google Merchandise Store sample dataset on BigQuery. In my SQL query, NTILE(5) divides the rows into 5 buckets based on row ordering and returns the bucket number assigned to each row. In this case, all buckets are of equal size. I would like to find out how to create buckets of different sizes instead. For example, bucket 1 contains the bottom 10%, bucket 2 contains the next 20% of records, etc. Thank you!
#standard SQL
SELECT
fullVisitorId,
NTILE(5) OVER (ORDER BY last_order_date) AS rfm_recency,
NTILE(5) OVER (ORDER BY count_order) AS rfm_frequency,
NTILE(5) OVER (ORDER BY avg_amount) AS rfm_monetary
FROM (
SELECT
fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE
_table_suffix BETWEEN "101"
AND "801"
AND totals.totalTransactionRevenue IS NOT NULL
GROUP BY
fullVisitorId )
You can use row_number() and count(*) to define your own buckets:
SELECT fullVisitorId,
(CASE WHEN seqnum_r <= 0.1 * cnt THEN 1
WHEN seqnum_r <= 0.3 * cnt THEN 2
ELSE 3
END) as bin_r,
. . .
FROM (SELECT fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
(AVG(totals.totalTransactionRevenue) / 1000000) AS avg_amount,
COUNT(*) OVER () as cnt,
ROW_NUMBER() OVER (ORDER BY MAX(date)) as seqnum_r,
ROW_NUMBER() OVER (ORDER BY COUNT(*)) as seqnum_f,
ROW_NUMBER() OVER (ORDER BY AVG(totals.totalTransactionRevenue)) as seqnum_m
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE _table_suffix BETWEEN "101" AND "801" AND
totals.totalTransactionRevenue IS NOT NULL
GROUP BY fullVisitorId
) rfm
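A reduced sketch of the same idea for a single metric, assuming ten made-up customers (the `customers` CTE and its values are hypothetical). The thresholds are the same 10% / 30% cut-offs as above.

```sql
-- Hypothetical customers 1..10 whose order count equals their id.
WITH customers AS (
  SELECT id AS customer_id, id AS count_order
  FROM UNNEST(GENERATE_ARRAY(1, 10)) AS id
)
SELECT
  customer_id,
  count_order,
  CASE
    WHEN seqnum_f <= 0.1 * cnt THEN 1  -- bottom 10% of rows
    WHEN seqnum_f <= 0.3 * cnt THEN 2  -- next 20% of rows
    ELSE 3                             -- remaining rows
  END AS bin_f
FROM (
  SELECT
    customer_id,
    count_order,
    COUNT(*) OVER () AS cnt,                              -- total number of rows
    ROW_NUMBER() OVER (ORDER BY count_order) AS seqnum_f  -- position in the ordering
  FROM customers
) AS ranked
ORDER BY customer_id
```

Since ROW_NUMBER() breaks ties arbitrarily, customers with equal values can land in different bins; adding a tiebreaker column to the ORDER BY makes the assignment repeatable.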
Below is for BigQuery Standard SQL and assumes your initial query works for you. The SQL UDF NON_UNIFORM_BUCKET() does the trick:
#standard SQL
CREATE TEMP FUNCTION NON_UNIFORM_BUCKET(i INT64) AS (
CASE
WHEN i = 1 THEN 1
WHEN i IN (2, 3) THEN 2
WHEN i IN (4, 5, 6) THEN 3
WHEN i = 7 THEN 4
ELSE 5
END
);
SELECT
fullVisitorId,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY last_order_date)) AS rfm_recency,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY count_order)) AS rfm_frequency,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY avg_amount)) AS rfm_monetary
FROM (
SELECT
fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE
_table_suffix BETWEEN "101"
AND "801"
AND totals.totalTransactionRevenue IS NOT NULL
GROUP BY
fullVisitorId )
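To see which bucket sizes this mapping produces, the sketch below applies the same UDF to twenty made-up rows (the `customers` CTE is hypothetical) and counts the rows per bucket. With deciles 1, 2-3, 4-6, 7, and 8-10 mapped to buckets 1 to 5, the shares should come out as roughly 10%, 20%, 30%, 10%, and 30%.

```sql
-- Same mapping as above: deciles are merged into five buckets of unequal width.
CREATE TEMP FUNCTION NON_UNIFORM_BUCKET(i INT64) AS (
  CASE
    WHEN i = 1 THEN 1
    WHEN i IN (2, 3) THEN 2
    WHEN i IN (4, 5, 6) THEN 3
    WHEN i = 7 THEN 4
    ELSE 5
  END
);

-- Hypothetical data: twenty customers, so each decile holds two rows.
WITH customers AS (
  SELECT id AS customer_id
  FROM UNNEST(GENERATE_ARRAY(1, 20)) AS id
)
SELECT
  bucket,
  COUNT(*) AS customers_in_bucket
FROM (
  SELECT NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY customer_id)) AS bucket
  FROM customers
)
GROUP BY bucket
ORDER BY bucket
```

Finer splits (for example 5% steps) would need NTILE(20) and a correspondingly longer CASE.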

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get the new visitor count and use the customDimension at the same time? The SUM(totals.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
If you have a hit-scoped dimension and really need to flatten the table, you need to build a session ID that you can count distinctly. That is because the cross join repeats all session-scoped fields on every hit:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (a comma means cross join in standard SQL)
`table` t, t.hits h
GROUP BY 1,2,3
So for new visitors I only provide a session ID if totals.newVisits=1; otherwise the IF expression returns NULL, which COUNT ignores.
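The effect is easy to check on a tiny flattened table. The sketch below assumes made-up hit-level rows (the `flattened` CTE and its values are hypothetical): one new session with three hits and one returning session with two hits.

```sql
-- Hypothetical hit-level rows: session fields are repeated on every hit.
WITH flattened AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS fullvisitorid, 100 AS visitstarttime, 1 AS newVisits, 1 AS hitNumber),
    STRUCT('u1', 100, 1, 2),
    STRUCT('u1', 100, 1, 3),
    STRUCT('u2', 200, CAST(NULL AS INT64), 1),
    STRUCT('u2', 200, CAST(NULL AS INT64), 2)
  ])
)
SELECT
  -- naive sum counts the new session once per hit
  SUM(newVisits) AS inflated_new_visits,
  -- session ID only for new sessions; NULLs are skipped by COUNT
  COUNT(DISTINCT IF(newVisits = 1,
    CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
    NULL)) AS new_visitors,
  COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS all_sessions
FROM flattened
```

The naive SUM reports 3 because the new session has three hits, while the distinct session ID count reports 1 new visitor out of 2 sessions.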
If you have something similar at product scope, you'd need to create an ID that takes both session and hit into account.
E.g. counting pageviews for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, cast(visitstarttime AS STRING),CAST(hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note that I'm left joining the product array. Since it is sometimes empty, a cross join would destroy all the hit information: a cross join with an empty table results in an empty table.
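A small sketch of that difference, assuming two made-up hits where the second one has an empty product array (the `hits` CTE and the SKU values are hypothetical):

```sql
-- Hypothetical hits; the second one has an empty product array.
WITH hits AS (
  SELECT 1 AS hitNumber, ['sku-1', 'sku-2'] AS product
  UNION ALL
  SELECT 2, ARRAY<STRING>[]
)
-- the comma (cross join) drops hit 2 because its array has no elements
SELECT 'cross join' AS join_type, hitNumber, productSku
FROM hits, UNNEST(hits.product) AS productSku
UNION ALL
-- the left join keeps hit 2 and returns NULL for productSku
SELECT 'left join', hitNumber, productSku
FROM hits LEFT JOIN UNNEST(hits.product) AS productSku
ORDER BY join_type, hitNumber
```

The cross join rows cover only hit 1, while the left join rows also include hit 2 with a NULL productSku, so hit-level counts stay intact.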
Hope that helps!

FULL OUTER JOIN throwing error "wrap select in parentheses"

I am having trouble with part of my SQL query. I receive this error:
Error: Syntax error: Each subquery argument for table-valued function calls must be enclosed in parentheses. To fix this, replace SELECT... with (SELECT...) at [32:5]
This is at the SELECT after the FULL OUTER JOIN EACH. I'd argue that I have done that, so I do not know what is wrong here; any suggestions would be much appreciated.
I am trying to create a funnel that more accurately separates previous customers from new ones. There are 3 levels in total in the funnel; for "simplicity" I'll only show two.
SELECT
COUNT(s0.firstHit) AS pageId1,
SUM(s0.exit) AS pageId2,
COUNT(s1.firstHit) AS pageId3,
SUM(s1.exit) AS pageId4
FROM(
SELECT
s0.fullVisitorId,
s0.visitId,
s0.firstHit,
s0.exit,
s1.firstHit,
s1.exit
FROM (
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND 1 = 1
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId'))
AND EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE (SELECT COUNT(value) FROM UNNEST(hits.customDimensions) custd WHERE index=20) > 0)
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s0
FULL OUTER JOIN EACH(
SELECT
fullVisitorId,
visitId,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s1
ON
s0.fullVisitorId = s1.fullVisitorId
AND s0.visitId = s1.visitId ) s01
You can write this query without any JOIN operations.
For instance:
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstExitFunnelFlag,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE (REGEXP_CONTAINS(page.pagePath, r'pageId')) OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) secondFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId') OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) AS secondFunnelExitFlag
FROM `dataset.ga_sessions_2017*`
WHERE 1 = 1
AND _TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
Notice that a single SELECT returns information on all visitors who viewed page "pageId" and also on visitors who viewed this page and fired the customDimension with index=20.
For each step in your funnel analysis you can add new columns to the result, such as firstFunnelHit and secondFunnelHit.
By avoiding expensive JOINs you can query terabytes of data and still get results in seconds.
In the last subquery:
SELECT
fullVisitorId,
visitId
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId
it looks to me like you need a comma after visitId in the third line.
Best of luck.
There is no EACH keyword when using standard SQL; this is specific to legacy SQL. Remove that word and your query will probably work.
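A minimal sketch of the join in standard SQL, assuming two made-up per-session result sets (the `s0` and `s1` CTEs and their values are hypothetical; the column names follow the subqueries in the question):

```sql
-- Hypothetical per-session results for two funnel steps.
WITH s0 AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS fullVisitorId, 1 AS visitId, 3 AS firstHit, 0 AS exitFlag),
    STRUCT('u2', 1, 1, 1)
  ])
),
s1 AS (
  SELECT * FROM UNNEST([
    STRUCT('u1' AS fullVisitorId, 1 AS visitId, 5 AS firstHit, 1 AS exitFlag),
    STRUCT('u3', 2, 2, 0)
  ])
)
SELECT
  COUNT(s0.firstHit) AS step0_sessions,
  SUM(s0.exitFlag) AS step0_exits,
  COUNT(s1.firstHit) AS step1_sessions,
  SUM(s1.exitFlag) AS step1_exits
FROM s0
FULL OUTER JOIN s1  -- plain FULL OUTER JOIN; EACH exists only in legacy SQL
  ON s0.fullVisitorId = s1.fullVisitorId
  AND s0.visitId = s1.visitId
```

Sessions present on only one side come back with NULLs for the other side, which COUNT and SUM skip, so each step is counted independently.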