Please help me with a BigQuery query. I need to build a closed funnel of user step events in a mobile app over one week.
The table looks like this:
I need to count all unique users who went from step 1 to step 2 and so on through step 6 during this period. Between these steps they may trigger other, unrelated events. What matters is that each unique user passes through the steps in order within the given period.
Please tell me how to create such a funnel?
There are multiple ways to achieve this. Here is one approach, shown on sample data; it is not the most efficient, but it is explicit and easy to follow:
with data as (
select 'a' as user_id, cast('2020-01-01 04:45:00' as timestamp) as event_timestamp, '1' as step_name
union all
select 'b' as user_id, cast('2020-01-01 04:50:00' as timestamp) as event_timestamp, '1' as step_name
union all
select 'a' as user_id, cast('2020-01-01 05:00:00' as timestamp) as event_timestamp, '2' as step_name
union all
select 'a' as user_id, cast('2020-01-01 05:15:00' as timestamp) as event_timestamp, '3' as step_name
union all
select 'b' as user_id, cast('2020-01-01 04:55:00' as timestamp) as event_timestamp, '2' as step_name
union all
select 'c' as user_id, cast('2020-01-01 04:58:00' as timestamp) as event_timestamp, '1' as step_name
union all
select 'a' as user_id, cast('2020-01-01 05:16:00' as timestamp) as event_timestamp, '4' as step_name
union all
select 'b' as user_id, cast('2020-01-01 05:16:00' as timestamp) as event_timestamp, '3' as step_name
),
data2 as (
select a.user_id, a.step_name step_1, b.step_name step_2, c.step_name step_3, d.step_name step_4 from ( select user_id, event_timestamp, step_name from data where step_name = '1') a
left join data b on (a.user_id = b.user_id and a.event_timestamp < b.event_timestamp and b.step_name = '2')
left join data c on (b.user_id = c.user_id and b.event_timestamp < c.event_timestamp and c.step_name = '3')
left join data d on (c.user_id = d.user_id and c.event_timestamp < d.event_timestamp and d.step_name = '4')
)
select * from (
select 'step_1' as event_name, count(distinct user_id) as n_users from data2 where step_1 is not null
group by 1
union all
select 'step_2' as event_name, count(distinct user_id) as n_users from data2 where (step_1 is not null and step_2 is not null)
group by 1
union all
select 'step_3' as event_name, count(distinct user_id) as n_users from data2 where (step_1 is not null and step_2 is not null and step_3 is not null)
group by 1
union all
select 'step_4' as event_name, count(distinct user_id) as n_users from data2 where (step_1 is not null and step_2 is not null and step_3 is not null and step_4 is not null)
group by 1
)
order by 1
You can further optimize this based on your specific filters, conditions, etc.
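To cross-check the funnel numbers outside BigQuery, the same ordering rule can be sketched in plain Python. This is only an illustrative sketch, not the query above: the names `FUNNEL_STEPS`, `EVENTS`, and `funnel_counts` are made up here, and it assumes a user reaches a step only if that step's event happens strictly after the event matched for the previous step. Unrelated events in between are ignored.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical names; the rows mirror the sample data in the answer above.
FUNNEL_STEPS = ["1", "2", "3", "4"]

EVENTS = [
    ("a", "2020-01-01 04:45:00", "1"),
    ("b", "2020-01-01 04:50:00", "1"),
    ("a", "2020-01-01 05:00:00", "2"),
    ("a", "2020-01-01 05:15:00", "3"),
    ("b", "2020-01-01 04:55:00", "2"),
    ("c", "2020-01-01 04:58:00", "1"),
    ("a", "2020-01-01 05:16:00", "4"),
    ("b", "2020-01-01 05:16:00", "3"),
]


def funnel_counts(events, steps):
    """Count distinct users who reached each step in order."""
    by_user = defaultdict(list)
    for user_id, ts, step in events:
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        by_user[user_id].append((parsed, step))

    reached = [0] * len(steps)
    for user_events in by_user.values():
        depth = 0       # how many funnel steps this user has completed
        last_ts = None  # timestamp of the event matched for the previous step
        for ts, step in sorted(user_events):
            if depth == len(steps):
                break
            # Advance only on the next expected step, strictly later in time.
            if step == steps[depth] and (last_ts is None or ts > last_ts):
                depth += 1
                last_ts = ts
        for i in range(depth):
            reached[i] += 1
    return dict(zip(steps, reached))


counts = funnel_counts(EVENTS, FUNNEL_STEPS)
```

Taking the earliest qualifying event for each step is enough here, because an earlier match never rules out a later step that a later match would allow.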
I want to create a column that shows, as TRUE or FALSE, whether a row holds the latest order status based on created_at.
Is there a way to achieve this without a subquery in Snowflake?
Here is my example data:
WITH t1 AS (
SELECT 'A' AS id, 'created' AS status, '2021-05-18 18:30:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'created' AS status, '2021-05-19 11:30:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'pending' AS status, '2021-05-19 12:00:00'::timestamp AS created_at UNION ALL
SELECT 'A' AS id, 'successful' AS status, '2021-05-20 18:30:00'::timestamp AS created_at
)
Using windowed MAX:
WITH t1(id, status, created_at) AS (
SELECT 'A', 'created', '2021-05-18 18:30:00'::timestamp UNION ALL
SELECT 'A', 'created', '2021-05-19 11:30:00'::timestamp UNION ALL
SELECT 'A', 'pending', '2021-05-19 12:00:00'::timestamp UNION ALL
SELECT 'A', 'successful', '2021-05-20 18:30:00'::timestamp AS created_at
)
SELECT *, created_at = MAX(created_at) OVER(PARTITION BY ID) AS is_final_order_status
FROM t1;
Output:
A ROW_NUMBER wrapped in a CASE expression could also work:
SELECT id, status, created_at
, CASE
WHEN 1 = ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at DESC)
THEN 'TRUE'
ELSE 'FALSE'
END is_final_order_status
FROM t1
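The two answers above agree on the sample data but can differ when two rows of the same id share the latest created_at: a comparison against the windowed MAX flags every tied row, while ROW_NUMBER flags exactly one of them. The plain-Python sketch below illustrates that difference; the function names are hypothetical, and the tie-break by input order is an assumption of this sketch (in SQL the winner among ties is not defined unless the ORDER BY is extended).

```python
# Rows mirror the sample data: (id, status, created_at as an ISO string).
ROWS = [
    ("A", "created", "2021-05-18 18:30:00"),
    ("A", "created", "2021-05-19 11:30:00"),
    ("A", "pending", "2021-05-19 12:00:00"),
    ("A", "successful", "2021-05-20 18:30:00"),
]


def flag_by_max(rows):
    """Mimics created_at = MAX(created_at) OVER (PARTITION BY id)."""
    latest = {}
    for id_, _status, created_at in rows:
        if id_ not in latest or created_at > latest[id_]:
            latest[id_] = created_at
    return [created_at == latest[id_] for id_, _status, created_at in rows]


def flag_by_row_number(rows):
    """Mimics ROW_NUMBER() ... ORDER BY created_at DESC = 1.

    Assumption of this sketch: among tied rows, the first one in input
    order wins.
    """
    winner = {}
    for index, (id_, _status, created_at) in enumerate(rows):
        if id_ not in winner or created_at > rows[winner[id_]][2]:
            winner[id_] = index
    return [winner[row[0]] == index for index, row in enumerate(rows)]


# Two rows tied on the latest timestamp show where the approaches diverge.
TIED = [
    ("B", "pending", "2021-05-20 18:30:00"),
    ("B", "successful", "2021-05-20 18:30:00"),
]
```

ISO-formatted timestamp strings compare in chronological order, which is why the sketch can compare them as plain strings.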
I want to create a column that flags whether an id has a straight order process, i.e. ids that never have the order_status pending or info_required.
e.g. id a has pending, so is_straight will be false. b has no pending or info_required, so it should be true.
Here is the example data:
WITH t1 AS (
SELECT 'a' AS id, 'created' AS status, '2021-11-02 15:04:07'::timestamp AS created_at UNION ALL
SELECT 'a' AS id, 'created' AS status, '2021-11-03 13:23:34'::timestamp AS created_at UNION ALL
SELECT 'a' AS id, 'pending' AS status, '2021-11-07 04:04:46'::timestamp AS created_at UNION ALL
SELECT 'a' AS id, 'successful' AS status, '2021-11-07 13:25:05'::timestamp AS created_at UNION ALL
SELECT 'b' AS id, 'created' AS status, '2021-11-11 16:19:07'::timestamp AS created_at UNION ALL
SELECT 'b' AS id, 'successful' AS status, '2021-11-13 17:57:55'::timestamp AS created_at UNION ALL
SELECT 'c' AS id, 'created' AS status, '2021-11-15 01:09:23'::timestamp AS created_at UNION ALL
SELECT 'c' AS id, 'info_required' AS status, '2021-11-17 11:06:00'::timestamp AS created_at UNION ALL
SELECT 'c' AS id, 'successful' AS status, '2021-11-21 23:35:46'::timestamp AS created_at
)
Using windowed COUNT_IF:
SELECT *,
COUNT_IF(status IN ('pending', 'info_required')) OVER(PARTITION BY id) = 0
AS is_straight
FROM t1;
Output:
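For the sample rows, the expected flag per id can be cross-checked with a short plain-Python sketch of the same rule: an id is "straight" when none of its rows has a blocking status. The names `straight_flags` and `BLOCKING` are hypothetical, and timestamps are left out because the rule does not depend on them.

```python
# (id, status) pairs taken from the sample data above.
ROWS = [
    ("a", "created"),
    ("a", "created"),
    ("a", "pending"),
    ("a", "successful"),
    ("b", "created"),
    ("b", "successful"),
    ("c", "created"),
    ("c", "info_required"),
    ("c", "successful"),
]

BLOCKING = ("pending", "info_required")


def straight_flags(rows, blocking=BLOCKING):
    """Return {id: True if the id never had a blocking status}."""
    blocked_ids = {id_ for id_, status in rows if status in blocking}
    return {id_: id_ not in blocked_ids for id_, _status in rows}


flags = straight_flags(ROWS)
```

In the windowed SQL version every row of an id carries the same flag; the sketch collapses that to one value per id.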
I am using
with t1 as
(
SELECT
DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
fullVisitorId,
product.productSKU,
product.v2ProductName,
case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr,
UNNEST(hits) hits,
UNNEST(product) product
GROUP BY 1,2,3,4,5
)
select
month,
sum(views_pdp) as pdp
,sum(add_cart) as add_cart
,sum(conversions) as conversions
,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;
Which returns
month pdp add_cart conversions f0_
2021-02-01 500 100 20 10
2021-01-01 600 200 30 20
I know that f0_ ( count(distinct(hits.transaction.transactionId)) ) is bad here because of product.productSKU and product.v2ProductName grouping.
In general, when user makes an order with 3 items in his basket, I want to count this as one order, whereas now it is counted as 3.
This count(distinct(hits.transaction.transactionId)) as transaction_id_cnt results in the correct output if I comment out product.productSKU and product.v2ProductName.
Running this query:
with t1 as
(
SELECT
DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
fullVisitorId,
-- product.productSKU, # commented out
-- product.v2ProductName, # commented out
case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr,
UNNEST(hits) hits,
UNNEST(product) product
GROUP BY 1,2,3
)
select
month,
sum(views_pdp) as pdp
,sum(add_cart) as add_cart
,sum(conversions) as conversions
,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;
This returns what is expected, but now I don't have productSKU and v2ProductName, which I need. I suspect the problem is that each ordered product becomes its own row in BigQuery, so when I group by product name and SKU, I count the unique transactions per product and then sum those counts.
How can I achieve the correct summation of count(distinct(hits.transaction.transactionId)) without losing the grouping by product.productSKU and product.v2ProductName which explodes this metric?
In the GROUP BY query you could collect them into arrays instead, so that you don't have to group by them:
ARRAY_AGG(DISTINCT product.productSKU IGNORE NULLS) AS productSKU_list,
ARRAY_AGG(DISTINCT product.v2ProductName IGNORE NULLS) AS productName_list,
Update per your comment below: if you want to use them in a further GROUP BY, save them as a string instead of an array.
STRING_AGG(DISTINCT product.productSKU, ',') AS productSKU_list,
STRING_AGG(DISTINCT product.v2ProductName, ',') AS productName_list,
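The overcounting described in the question can be reproduced in a few lines of plain Python. This is a sketch on made-up data: the transaction ids and SKUs below are hypothetical and only illustrate why summing per-product distinct counts differs from one overall distinct count, and how collecting the products per transaction keeps them without grouping by them.

```python
from collections import defaultdict

# Hypothetical hits: (transaction_id, product_sku).
# T1 is one order with three items, T2 is one order with one item.
HITS = [
    ("T1", "sku-1"),
    ("T1", "sku-2"),
    ("T1", "sku-3"),
    ("T2", "sku-1"),
]


def sum_of_per_sku_distinct(hits):
    """Distinct transactions per SKU, then summed: counts T1 three times."""
    per_sku = defaultdict(set)
    for transaction_id, sku in hits:
        per_sku[sku].add(transaction_id)
    return sum(len(transactions) for transactions in per_sku.values())


def overall_distinct(hits):
    """Distinct transactions across all rows: each order counted once."""
    return len({transaction_id for transaction_id, _sku in hits})


def skus_per_transaction(hits):
    """Keep product detail as a list per order instead of grouping by it."""
    per_transaction = defaultdict(set)
    for transaction_id, sku in hits:
        per_transaction[transaction_id].add(sku)
    return {t: sorted(skus) for t, skus in per_transaction.items()}
```

The last function plays the role of the ARRAY_AGG(DISTINCT ...) columns: the product information survives, but it no longer splits an order across several groups.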
In bigquery using legacy sql I have created a monstrous query that returns the following display of visits per day for a site that I released 2018-02-26:
Row date name release_date visits_count
1 20180226 a_name 20180226 2179
2 20180227 a_name 20180226 9522
3 20180228 a_name 20180226 1593
4 20180301 a_name 20180226 300
...
What I really want is
Row name release count_release count_release+1 count_release_rest
1 a_name 20180226 2179 9522 1893
Thus, I want the actual visit count for release date, the day after the release date and all counts after that should just be summed.
I'm new to bigquery (and kind of new to sql...). Is there a way to define my first display as a "subtable" or something like that so that I can do this or what approach would you recommend?
There are a lot of ways to achieve this. The simplest is to compare the dates with a CASE expression:
select name, sum(case when date = release_date then visits_count else 0 end) as release_count,
sum(case when date = DATE_ADD(release_date,1,"DAY") then visits_count else 0 end) as release_count1,
sum(case when date > DATE_ADD(release_date,1,"DAY") then visits_count else 0 end) as release_count_other
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT '20180226' date, 'a_name' name, '20180226' release_date, 2179 visits_count UNION ALL
SELECT '20180227', 'a_name', '20180226', 9522 UNION ALL
SELECT '20180228', 'a_name', '20180226', 1593 UNION ALL
SELECT '20180301', 'a_name', '20180226', 300
)
SELECT name, release_date,
SUM(CASE WHEN date = release_date THEN visits_count END) count_release,
SUM(CASE WHEN PARSE_DATE('%Y%m%d', date) = DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY) THEN visits_count END) count_release_next_day,
SUM(CASE WHEN PARSE_DATE('%Y%m%d', date) > DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY) THEN visits_count END) count_release_rest
FROM `project.dataset.table`
GROUP BY name, release_date
Alternatively, the query above can be refactored to avoid repeating PARSE_DATE, which makes it more compact and easier to maintain:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '20180226' date, 'a_name' name, '20180226' release_date, 2179 visits_count UNION ALL
SELECT '20180227', 'a_name', '20180226', 9522 UNION ALL
SELECT '20180228', 'a_name', '20180226', 1593 UNION ALL
SELECT '20180301', 'a_name', '20180226', 300
)
SELECT name, release_date,
SUM(CASE WHEN date = release_date THEN visits_count END) count_release,
SUM(CASE WHEN visit = release_next_day THEN visits_count END) count_release_next_day,
SUM(CASE WHEN visit > release_next_day THEN visits_count END) count_release_rest
FROM `project.dataset.table`,
UNNEST([STRUCT<visit DATE, release_next_day DATE>(
PARSE_DATE('%Y%m%d', date),
DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY))]) x
GROUP BY name, release_date
In both cases the result is:
Row name release_date count_release count_release_next_day count_release_rest
1 a_name 20180226 2179 9522 1893
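The bucketing rule behind both queries (release day, the day after, everything later) can also be checked with a small plain-Python sketch on the same four rows. The function name `bucket_visits` is hypothetical; unlike SUM over a CASE without ELSE, which yields NULL for an empty bucket, this sketch starts every bucket at 0.

```python
from datetime import datetime, timedelta

# (date, name, release_date, visits_count), as in the sample data above.
ROWS = [
    ("20180226", "a_name", "20180226", 2179),
    ("20180227", "a_name", "20180226", 9522),
    ("20180228", "a_name", "20180226", 1593),
    ("20180301", "a_name", "20180226", 300),
]


def bucket_visits(rows):
    """Sum visits per (name, release_date) into three buckets."""
    totals = {}
    for date_str, name, release_str, visits in rows:
        visit = datetime.strptime(date_str, "%Y%m%d").date()
        release = datetime.strptime(release_str, "%Y%m%d").date()
        next_day = release + timedelta(days=1)
        bucket = totals.setdefault(
            (name, release_str),
            {"count_release": 0, "count_release_next_day": 0, "count_release_rest": 0},
        )
        if visit == release:
            bucket["count_release"] += visits
        elif visit == next_day:
            bucket["count_release_next_day"] += visits
        elif visit > next_day:
            bucket["count_release_rest"] += visits
        # Visits before the release date fall into no bucket, as in the SQL.
    return totals


totals = bucket_visits(ROWS)
```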
For example, say a ticket starts in New status. I want to get the MAX date of the New status and the MAX date of the Completed status, and calculate the difference between the two (MAX Completed minus MAX New).
ex.
SELECT t.ID,
MAX(update_date) WHERE t.status = 'New' start_time,
MAX(update_date) WHERE t.status = 'Completed' stop_time,
DATEDIFF(second, MAX(update_date), MAX(update_date)) elapsed_sec
FROM xxx.dbo t
GROUP BY t.ID;
Thank you so much,
P
SELECT
t.id
,DATEDIFF(second, start_time, stop_time) elapsed_sec
FROM (
SELECT DISTINCT
ID,
(SELECT MAX(update_date) from xxx.dbo WHERE status = 'New' AND ID=t2.ID) start_time,
(SELECT MAX(update_date) from xxx.dbo WHERE status = 'Completed' AND ID=t2.ID) stop_time
FROM xxx.dbo t2
) t
I would suggest doing this using conditional aggregation and not with correlated subqueries:
SELECT t.ID,
MAX(CASE WHEN t.status = 'New' THEN update_date END) as start_time,
MAX(CASE WHEN t.status = 'Completed' THEN update_date END) as stop_time,
DATEDIFF(second,
MAX(CASE WHEN t.status = 'New' THEN update_date END),
MAX(CASE WHEN t.status = 'Completed' THEN update_date END)
) as elapsed_sec
FROM xxx.dbo t
GROUP BY t.ID;
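To try the conditional-aggregation pattern without a SQL Server instance, here is a minimal sketch that runs the same MAX(CASE WHEN ...) shape against an in-memory SQLite database from Python. Several things are assumptions of this sketch rather than part of the answer: the table name `tickets`, the sample rows, and the use of SQLite's `strftime('%s', ...)` as a stand-in for `DATEDIFF(second, ...)`, which SQLite does not have.

```python
import sqlite3

# Hypothetical sample rows: (id, status, update_date as an ISO string).
SAMPLE_ROWS = [
    (1, "New", "2021-01-01 10:00:00"),
    (1, "New", "2021-01-01 10:05:00"),
    (1, "Completed", "2021-01-01 10:06:30"),
    (2, "New", "2021-01-02 09:00:00"),
]

QUERY = """
    SELECT id,
           MAX(CASE WHEN status = 'New' THEN update_date END) AS start_time,
           MAX(CASE WHEN status = 'Completed' THEN update_date END) AS stop_time,
           -- SQLite stand-in for DATEDIFF(second, start_time, stop_time)
           CAST(strftime('%s', MAX(CASE WHEN status = 'Completed' THEN update_date END)) AS INTEGER)
         - CAST(strftime('%s', MAX(CASE WHEN status = 'New' THEN update_date END)) AS INTEGER)
           AS elapsed_sec
    FROM tickets
    GROUP BY id
    ORDER BY id
"""


def elapsed_seconds_per_ticket(rows):
    """Load rows into a throwaway table and run the aggregation once."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT, update_date TEXT)")
        conn.executemany("INSERT INTO tickets VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = elapsed_seconds_per_ticket(SAMPLE_ROWS)
```

A ticket that has no Completed row still appears in the result, with NULL (None in Python) for stop_time and elapsed_sec, because the CASE yields no value for MAX to pick up.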