How can I find the previous page with Bigquery - google-bigquery

I want to find the previous page when the current page is a product page.
For example, the current page is 'https://www.emag.ro/telefon-mobil-apple-iphone-x-64gb-4g-space-grey-mqac2rm-a/pd/DN094NBBM' and the previous page is 'https://www.emag.ro/search/telefoane-mobile/IPHONE/c?ref=srcql'.
Using hitNumber, how can I count how many users showed this behavior?
I tried the two queries below and want to JOIN them, but I don't know the best way to do it.
I also tried the LAG function, but I'm not sure it catches all users.
Thank you in advance.
with
view_product as (
SELECT
ga.fullVisitorId AS GA_USER_ID,
date as date,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'viewproduct'
)
,
SEARCH_page_WITH_REF_SRCQL as (
select
date as date,
ga.fullVisitorId AS GA_USER_ID,
h.hitnumber as hitnumber,
CONCAT(ga.fullVisitorId, cast(ga.visitId AS string)) AS SessionID,
(SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) AS PAGETYPE,
(SELECT VALUE FROM h.customDimensions WHERE index =8) as ref_parameter,
visitid as visitid,
h.page.pagePath as page_path
FROM
`emagbigquery.0` ga,
UNNEST(hits) AS h
WHERE h.type='PAGE'
AND _TABLE_SUFFIX = '20190115'
AND (SELECT VALUE FROM h.customDimensions WHERE INDEX = 10) = 'search'
AND (SELECT VALUE FROM h.customDimensions WHERE index =8) LIKE 'srcql'
)
select
COUNT(DISTINCT GA_USER_ID) AS USERS,
COUNT(DISTINCT SessionID) AS SESSIONS,
previous_page_from_srcql
from (
select
t1.ga_user_id,
t1.sessionid,
t2.hitnumber > t1.hitnumber as previous_page_from_srcql
from SEARCH_page_WITH_REF_SRCQL as t1
inner join view_product as t2
on t1.ga_user_id = t2.ga_user_id
)
group by
previous_page_from_srcql

Try UNNEST WITH OFFSET. It can give you an easy way to later determine that one row came after the other:
WITH path_and_prev AS (
SELECT ARRAY(
SELECT AS STRUCT session.page.pagePath
, LAG(session.page.pagePath) OVER(ORDER BY i) prevPagePath
FROM UNNEST(hits) session WITH OFFSET i
) x
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910`
)
SELECT COUNT(*) c, pagePath, prevPagePath
FROM path_and_prev, UNNEST(x)
WHERE pagePath='/vests/yellow.html'
AND prevPagePath='/vests/'
GROUP BY 2,3
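To see how the LAG approach behaves on a small case, here is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module instead of BigQuery. The table `hits`, its columns, and the page paths are made up for illustration, and it assumes a SQLite build with window functions (3.25 or later). A flat table with one row per hit stands in for UNNEST(hits), and PARTITION BY session_id plays the role of the per-session array.

```python
import sqlite3

# Toy stand-in for unnested hits: one row per (session, hit). All values are invented.
rows = [
    ("s1", 1, "/search"),
    ("s1", 2, "/product/42"),
    ("s2", 1, "/home"),
    ("s2", 2, "/product/42"),
    ("s3", 1, "/search"),
    ("s3", 2, "/home"),
    ("s3", 3, "/product/42"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (session_id TEXT, hit_number INTEGER, page_path TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)

query = """
WITH path_and_prev AS (
  SELECT session_id,
         page_path,
         -- previous page within the same session, ordered by hit number
         LAG(page_path) OVER (PARTITION BY session_id ORDER BY hit_number) AS prev_page_path
  FROM hits
)
SELECT COUNT(DISTINCT session_id)
FROM path_and_prev
WHERE page_path = '/product/42'
  AND prev_page_path = '/search'
"""
sessions_from_search = conn.execute(query).fetchone()[0]
print(sessions_from_search)
```

Only the session where the search page comes immediately before the product page is counted. A session that visits another page in between (s3 here) is excluded, which is worth keeping in mind when deciding whether LAG "catches all users".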

Related

Unrecognized name: clientId BigQuery

I'm new to SQL and I'm trying to write this query:
SELECT
clientId,
pagePath,
SUM(CASE
WHEN isExit IS NOT NULL THEN last_interaction
ELSE
nextTime
END
) AS time_on_page
FROM (
SELECT
hits.page.pagePath,
hits.isExit,
hits.time/1000 AS hits_time,
LEAD(hits.time/1000, 1) OVER (PARTITION BY fullVisitorId, visitid ORDER BY hits.time ASC) AS nextTime,
MAX(
IF
(hits.isInteraction = TRUE,
hits.time / 1000,
0)) OVER (PARTITION BY fullVisitorId, visitid) AS last_interaction
FROM
`merck-bigquery.1===.ga_sessions_20201231`,
UNNEST(hits) AS hits
WHERE
hits.type = "PAGE"
AND hits.page.hostname = 'www.msdmed.ru' )
GROUP BY
1
ORDER BY
2 ASC
BigQuery returns the error Unrecognized name: clientId
I don't understand what's wrong with this query, because clientId is a default field in the BQ schema.
The outer query can see only the fields listed in the inner query. Try removing clientId from the outer query or adding clientId explicitly to the inner query.
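The scoping rule can be reproduced on a toy table. This is a minimal sketch using SQLite via Python's sqlite3 rather than BigQuery; the table `sessions` and its columns are hypothetical, and the error text differs between engines, but the cause is the same: the subquery does not expose the column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (client_id TEXT, page_path TEXT, seconds REAL)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [("c1", "/a", 10.0), ("c1", "/a", 5.0), ("c2", "/b", 7.0)],
)

# The inner query does not select client_id, so the outer query cannot see it.
broken = """
SELECT client_id, page_path, SUM(seconds)
FROM (SELECT page_path, seconds FROM sessions)
GROUP BY client_id, page_path
"""
try:
    conn.execute(broken)
    broken_error = None
except sqlite3.OperationalError as exc:
    broken_error = str(exc)

# Adding client_id to the inner SELECT makes it visible to the outer query.
fixed = """
SELECT client_id, page_path, SUM(seconds)
FROM (SELECT client_id, page_path, seconds FROM sessions)
GROUP BY client_id, page_path
ORDER BY client_id, page_path
"""
totals = conn.execute(fixed).fetchall()
print(broken_error)
print(totals)
```

Note that the fixed version also groups by both non-aggregated columns, which the original query would need as well.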

BigQuery (Google Analytics data): query two different 'hits.customDimensions.index' in the same 'hits.hitNumber'

My goal:
Count 1 for the session if the following two hits.customDimensions.index entries and their associated hits.customDimensions.value appear in the same hits.hitNumber (every row is one session as long as the main query is still nested):
['hits.customDimensions.index' = 43 with associated 'hits.customDimensions.value' IN ('login', 'payment', 'order', 'thankyou')] AND ['hits.customDimensions.index' = 10 with associated 'hits.customDimensions.value' = 'checkout'] (in the same hits.hitNumber)
My problem:
I don't know how to query two different hits.customDimensions.value in the same hits.hitNumber in one subquery, without separate WITH tables. If that is possible, which I'm sure it is, the query would be short and simple. Since I don't know how to express this in a subquery, I use a workaround that adds up to 5 WITH tables. I would appreciate a simpler way to query this use case.
Explanation of the workaround query:
Table 1: queries everything except the 'problem metric'
Tables 2-3: each table queries one hits.customDimensions.index with its associated hits.customDimensions.value, filtered for the correct value, plus sessionId and hitNumber
Table 4: left join of table 2 with table 3 on date, sessionId and hitNumber. Basically, if hitNumber combined with sessionId matches between table 2 and table 3, I count 1
Table 5: left join of table 1 with table 4 to combine the data
#Table1 - complete data except session_atleast_loginCheckout
WITH
prepared_data AS (
SELECT
date,
SUM((SELECT 1 FROM UNNEST(hits) WHERE CAST(eCommerceAction.action_type AS INT64) BETWEEN 4 AND 6 LIMIT 1)) AS sessions_atleast_basket,
#insert in this row query for sessions_atleast_loginCheckout
SUM((SELECT 1 FROM UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd WHERE index = 43 AND value IN ('payment', 'order', 'thankyou') LIMIT 1)) AS sessions_atleast_payment,
FROM
`big-query-221916.172008714.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND totals.visits = 1
GROUP BY
date
),
#Table2 - data for hits.customDimensions.index = 10 AND associated hits.customDimensions.value = 'checkout' with hits.hitNumber and sessionId (join later based on hitNumber and sessionId)
loginCheckout_index10_pagetype_data AS (
SELECT
date AS date,
CONCAT(fullVisitorId, '/', CAST( visitStartTime AS STRING)) AS sessionId,
h.hitNumber AS hitNumber,
IF(hcd.value IS NOT NULL, 1, NULL) AS pagetype_checkout
FROM
`big-query-221916.172008714.ga_sessions_*` AS o, UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND hcd.index = 10 AND VALUE = 'checkout' AND h.type = 'PAGE' AND totals.visits = 1),
#Table3 - data for hits.customDimensions.index = 43 AND associated hits.customDimensions.value IN ('login', 'register', 'payment', 'order','thankyou') with hits.hitNumber and sessionId (join later based on hitNumber and sessionId)
loginCheckout_index43_pagelevel1_data AS (
SELECT
date AS date,
CONCAT(fullVisitorId, '/', CAST( visitStartTime AS STRING)) AS sessionId,
h.hitNumber AS hitNumber,
IF(hcd.value IS NOT NULL, 1, NULL) AS pagelevel1_login_to_thankyou
FROM
`big-query-221916.172008714.ga_sessions_*` AS o, UNNEST(hits) as h, UNNEST(h.customDimensions) as hcd
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND hcd.index = 43 AND VALUE IN ('login', 'register', 'payment', 'order', 'thankyou') AND h.type = 'PAGE'
),
#table4 - left join table2 and table 3 on sessionId and hitNumber to get sessions_atleast_loginCheckout
loginChackout_output_data AS(
SELECT
a.date AS date,
COUNT(DISTINCT a.sessionId) AS sessions_atleast_loginCheckout
FROM
loginCheckout_index10_pagetype_data AS a
LEFT JOIN
loginCheckout_index43_pagelevel1_data AS b
ON
a.date = b.date AND
a.sessionId = b.sessionId AND
a.hitNumber = b.hitNumber
WHERE
pagelevel1_login_to_thankyou IS NOT NULL
GROUP BY
date
)
#table5 - leftjoin table1 with table4 to get all data together
SELECT
prep.date,
prep.sessions_atleast_basket,
log.sessions_atleast_loginCheckout,
prep.sessions_atleast_payment
FROM
prepared_data AS prep
LEFT JOIN
loginChackout_output_data as log
ON
prep.date = log.date AND
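It's a bit like Inception, but it may help to keep in mind that the input of UNNEST() is an array and the output is a set of table rows, so you can nest one subquery over hits and, inside it, further subqueries over each hit's customDimensions: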
SELECT
SUM(totals.visits) as sessions
FROM
`big-query-221916.172008714.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND -- the following two hits.customDimensions.index and associated hits.customDimensions.value appear in the same hits.hitNumber
(SELECT COUNT(1)>0 as hitsCountMoreThanZero FROM UNNEST(hits) AS h
WHERE
-- index 43, value IN ('login', 'payment', 'order', 'thankyou')
(select count(1)>0 from unnest(h.customdimensions) where index=43 and value IN ('login', 'payment', 'order', 'thankyou'))
AND
-- index 10, value = 'checkout'
(select count(1)>0 from unnest(h.customdimensions) where index=10 and value='checkout')
)
GROUP BY
date
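The "both conditions on the same hit" logic can be checked on a flattened toy table. This is a minimal sketch in SQLite via Python's sqlite3, not BigQuery: the table `custom_dims` (one row per session, hit and custom dimension) and all its values are invented, and the two EXISTS clauses correspond to the two nested subqueries over h.customDimensions.

```python
import sqlite3

# One row per (session, hit, custom dimension index, value). All values are invented.
rows = [
    ("s1", 1, 10, "checkout"), ("s1", 1, 43, "login"),      # both on the same hit
    ("s2", 1, 10, "checkout"), ("s2", 2, 43, "payment"),    # split across two hits
    ("s3", 1, 10, "product"),  ("s3", 1, 43, "order"),      # wrong page type
    ("s4", 3, 10, "checkout"), ("s4", 3, 43, "thankyou"),   # matches on hit 3
    ("s4", 4, 10, "checkout"), ("s4", 4, 43, "thankyou"),   # and again on hit 4
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE custom_dims (session_id TEXT, hit_number INTEGER, idx INTEGER, value TEXT)"
)
conn.executemany("INSERT INTO custom_dims VALUES (?, ?, ?, ?)", rows)

query = """
SELECT COUNT(DISTINCT h.session_id)
FROM (SELECT DISTINCT session_id, hit_number FROM custom_dims) AS h
WHERE EXISTS (
        SELECT 1 FROM custom_dims c
        WHERE c.session_id = h.session_id AND c.hit_number = h.hit_number
          AND c.idx = 43 AND c.value IN ('login', 'payment', 'order', 'thankyou'))
  AND EXISTS (
        SELECT 1 FROM custom_dims c
        WHERE c.session_id = h.session_id AND c.hit_number = h.hit_number
          AND c.idx = 10 AND c.value = 'checkout')
"""
sessions_with_login_checkout = conn.execute(query).fetchone()[0]
print(sessions_with_login_checkout)
```

Session s2 is not counted because its two dimensions sit on different hits, and s4 is counted once even though two of its hits qualify.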

How to calculate running sums with append-only rows

I have a table where rows are never mutated but only inserted; they are immutable records. It has the following fields:
id: int
user_id: int
created: datetime
is_cool: boolean
likes_fruits: boolean
An object is tied to a user, and the "current" object for a given user is the one that has the latest created date. E.g. if I want to update is_cool for a user, I'd append a record with a new created timestamp and is_cool=true.
I want to calculate how many users are is_cool at the end of each day. I.e. I'd like the output table to have the columns:
day: some kind of date_trunc('day', created)
cool_users_count: number of users that have is_cool at the end of this day.
What SQL query can I write that does this? FWIW I'm using Presto (or Redshift if I need to).
Note that there are other columns, e.g. likes_fruits, which means a record where is_cool is false does not mean is_cool was just changed to false - it could have been false for a while.
This is what procedural pseudo-code would look like to represent what I'd want to do in SQL:
// rows = ...
min_date = min([row.created for row in rows])
max_date = max([row.created for row in rows])
counts_by_day = {}
for date in range(min_date, max_date):
rows_up_until_date = [row for row in rows if row.created <= date]
latest_row_by_user = rows_up_until_date.reduce(
{},
(acc, row) => acc[row.user_id] = row,
)
counts_by_day[date] = latest_row_by_user.filter(row => row.is_cool).length
You can do this using just a query. Try using a sum on the boolean and a group by:
select date(created), sum(is_cool)
from my_table
group by date(created)
or if you need the number of users
select t.date_created, count(*) num_user
from (
select distinct date(created) date_created, user_id
from my_table
where is_cool = TRUE
) t
group by t.date_created
or if you need the last value of is_cool for each user on each day
select date(max_date), sum(is_cool)
from (
select t.user_id, t.max_date, m.is_cool
from my_table m
inner join (
select max(created) max_date, user_id
from my_table
group by user_id, date(created)
) t on t.max_date = m.created
and t.user_id = m.user_id
where m.is_cool = TRUE
) t2
group by date(max_date)
A correlated subquery might be the simplest solution. The following gets the value of is_cool for each user on each date:
select u.user_id, d.date,
(select t.is_cool
from t
where t.user_id = u.user_id and
t.created < dateadd(day, 1, d.date)
order by t.created desc
limit 1
) as is_cool
from (select distinct date(created) as date
from t
) d cross join
(select distinct user_id
from t
) u ;
Then aggregate:
select date, sum(is_cool)
from (select u.user_id, d.date,
(select t.is_cool
from t
where t.user_id = u.user_id and
t.created < dateadd(day, 1, d.date)
order by t.created desc
limit 1
) as is_cool
from (select distinct date(created) as date
from t
) d cross join
(select distinct user_id
from t
) u
) ud
group by date;
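Here is a minimal sketch of the correlated-subquery approach on toy data, run in SQLite through Python's sqlite3 rather than Presto or Redshift. The rows are invented, SQLite's date(day, '+1 day') is swapped in for dateadd, is_cool is stored as 0/1, and COALESCE treats a user with no record yet as not cool (an assumption the original answer does not state).

```python
import sqlite3

# (user_id, created, is_cool) with is_cool stored as 0/1. All rows are invented.
rows = [
    (1, "2024-01-01 09:00:00", 1),
    (2, "2024-01-01 10:00:00", 0),
    (2, "2024-01-02 08:00:00", 1),
    (1, "2024-01-03 12:00:00", 0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, created TEXT, is_cool INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

query = """
SELECT d.day,
       SUM(COALESCE(
             (SELECT t.is_cool
              FROM t
              WHERE t.user_id = u.user_id
                AND t.created < date(d.day, '+1 day')  -- latest record up to end of day
              ORDER BY t.created DESC
              LIMIT 1),
             0)) AS cool_users_count
FROM (SELECT DISTINCT date(created) AS day FROM t) AS d
CROSS JOIN (SELECT DISTINCT user_id FROM t) AS u
GROUP BY d.day
ORDER BY d.day
"""
cool_by_day = conn.execute(query).fetchall()
print(cool_by_day)
```

User 1 has no record on 2024-01-02 but is still counted as cool that day, because the subquery picks up the latest record at or before the end of each day rather than only records created on that day.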

Google Big Query: Get New Visitor Count using Custom Dimension

select PARSE_DATE('%Y%m%d', t.date) as Date
,count(distinct(fullvisitorid)) as User
,SUM( totals.newVisits ) AS New_Visitors
,(if(customDimensions.index=1, customDimensions.value,null)) as Orig
FROM `table` as t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
group by Date, orig
Is there a way to get the new visitor count and use the custom dimension at the same time? The SUM(totals.newVisits) doesn't work.
Thanks
Below is for BigQuery Standard SQL
SELECT DATE
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( newVisits ) AS New_Visitors
,Orig
FROM (
SELECT PARSE_DATE('%Y%m%d', t.date) AS DATE
,fullvisitorid
,totals.newVisits AS newVisits
,(IF(customDimensions.index=1, customDimensions.value,NULL)) AS Orig
FROM `table` AS t
CROSS JOIN UNNEST(hits) AS hit
CROSS JOIN UNNEST(hit.customDimensions ) AS customDimensions
GROUP BY DATE, orig, fullvisitorid, newVisits
)
GROUP BY DATE, Orig
The best way in your case is to remove the cross-joins and use sub-selects instead:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM UNNEST(customDimensions) WHERE index=1) Orig
,COUNT(DISTINCT(fullvisitorid)) AS User
,SUM( totals.newVisits ) AS New_Visitors
FROM
`table` t
GROUP BY Orig, Date
If you have a hit-scoped dimension and really need to flatten the table, build a session ID that you can count distinct. The cross join repeats every session-scoped field on each hit row, so a plain SUM over those fields would overcount:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,h.page.pagePathLevel1
,COUNT(DISTINCT(fullvisitorid)) AS User
-- create session id and count distinct
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)) ) AS all_sessions
-- only count distinct session id of sessions where totals.newVisits = 1
,COUNT(DISTINCT
IF(totals.newVisits=1,
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING)),
NULL )
) AS New_Visitors
FROM
-- flatten table to hit scope (a comma means cross join in standard SQL)
`table` t, t.hits h
GROUP BY 1,2,3
So for new visitors I only provide a session ID if totals.newVisits=1; otherwise the IF returns NULL, which COUNT(DISTINCT ...) ignores.
If you have something similar on product scope, you need to create an ID that takes both session and hit into account.
E.g. counting pages for productSku:
SELECT
PARSE_DATE('%Y%m%d', t.date) AS Date
,(SELECT value FROM h.customDimensions WHERE index=2) justAHitCd
,p.productSku
,COUNT(DISTINCT fullvisitorid) AS users
,COUNT(DISTINCT CONCAT(fullvisitorid, CAST(visitstarttime AS STRING))) AS sessions
,COUNT(DISTINCT
IF(h.type='PAGE',
CONCAT(fullvisitorid, CAST(visitstarttime AS STRING), CAST(h.hitNumber AS STRING)),
NULL)
) as pageviews
,COUNT(1) AS products
FROM
`table` t, t.hits h LEFT JOIN h.product p
GROUP BY 1,2,3
Note that I'm left joining the product array. Since it is sometimes empty, a cross join would drop those hits entirely: a cross join with an empty table results in an empty table.
Hope that helps!
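The effect of the join type on the pageview count can be sketched with two flat toy tables. This uses SQLite via Python's sqlite3 as a stand-in for BigQuery: `hits` and `products` are hypothetical tables that emulate the hits array and its nested product array, an inner join stands in for the cross join with an array, and a '-' separator is added inside the concatenated ID (the original concatenates without one).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (session_id TEXT, hit_number INTEGER, type TEXT)")
conn.execute("CREATE TABLE products (session_id TEXT, hit_number INTEGER, sku TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [("s1", 1, "PAGE"), ("s1", 2, "PAGE"), ("s2", 1, "PAGE"), ("s2", 2, "EVENT")],
)
# Only hit 2 of session s1 carries products; every other hit has an empty product list.
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("s1", 2, "A"), ("s1", 2, "B")],
)


def count_pageviews(join_kind: str) -> int:
    """Count distinct PAGE hits after joining hits to products with the given join."""
    query = f"""
    SELECT COUNT(DISTINCT CASE WHEN h.type = 'PAGE'
                               THEN h.session_id || '-' || h.hit_number END)
    FROM hits h
    {join_kind} JOIN products p
      ON p.session_id = h.session_id AND p.hit_number = h.hit_number
    """
    return conn.execute(query).fetchone()[0]


pageviews_inner = count_pageviews("INNER")
pageviews_left = count_pageviews("LEFT")
print(pageviews_inner, pageviews_left)
```

With the inner join, only the one hit that has products survives. With the left join, all three PAGE hits are counted, and the hit with two products is still counted once because the CASE expression yields the same ID for both of its rows and NULL for non-PAGE hits.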

FULL OUTER JOIN throwing error "wrap select in parentheses"

I am having trouble with part of my SQL query. I receive this error:
Error: Syntax error: Each subquery argument for table-valued function calls must be enclosed in parentheses. To fix this, replace SELECT... with (SELECT...) at [32:5]
This is at the SELECT after the FULL OUTER JOIN EACH. I'd argue that I have already done that, and I don't know what is wrong here, so any suggestions would be much appreciated.
I am trying to create a funnel that more accurately separates previous customers from new ones. There are 3 levels in the funnel in total; for "simplicity" I'll only show two.
SELECT
COUNT(s0.firstHit) AS pageId1,
SUM(s0.exit) AS pageId2,
COUNT(s1.firstHit) AS pageId3,
SUM(s1.exit) AS pageId4
FROM(
SELECT
s0.fullVisitorId,
s0.visitId,
s0.firstHit,
s0.exit,
s1.firstHit,
s1.exit
FROM (
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND 1 = 1
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId'))
AND EXISTS(SELECT 1 FROM UNNEST(hits) hits WHERE (SELECT COUNT(value) FROM UNNEST(hits.customDimensions) custd WHERE index=20) > 0)
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s0
FULL OUTER JOIN EACH(
SELECT
fullVisitorId,
visitId,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId) AS s1
ON
s0.fullVisitorId = s1.fullVisitorId
AND s0.visitId = s1.visitId ) s01
You can write this query without any JOIN operations.
For instance:
SELECT
fullvisitorid,
visitid,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId')) AS firstExitFunnelFlag,
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE (REGEXP_CONTAINS(page.pagePath, r'pageId')) OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) secondFunnelHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, r'pageId') OR ((select count(1) from unnest(hits) h, unnest(h.customDimensions) custd where custd.index = 20) > 0)) AS secondFunnelExitFlag
FROM `dataset.ga_sessions_2017*`
WHERE 1 = 1
AND _TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
Notice that a single SELECT returns information about all visitors who viewed the page "pageId" as well as visitors who viewed this page and fired the custom dimension at index=20.
For each step in your funnel analysis you can add new result columns, such as firstFunnelHit and secondFunnelHit.
By avoiding expensive JOINs you can query terabytes of data and still get results in seconds.
In the last subquery:
SELECT
fullVisitorId,
visitId
(SELECT MIN(hitNumber) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS firstHit,
(SELECT MAX(IF(isExit, 1, 0)) FROM UNNEST(hits) WHERE REGEXP_CONTAINS(page.pagePath, 'pageId')) AS exitFlag
FROM
`<ID>.ga_sessions_2017*`
WHERE
_TABLE_SUFFIX BETWEEN '0601' AND '0602'
AND totals.visits = 1
GROUP BY
fullVisitorId,
visitId
It looks to me like you need a comma after visitId in the third line.
Best of luck.
There is no EACH keyword when using standard SQL; this is specific to legacy SQL. Remove that word and your query will probably work.