ClickHouse window function difficulties - how to work with date windows - SQL

I have web sessions with UTM tags (different traffic channels: cpc, smm, push). Some sessions have tags, but organic sessions have no UTM tags. I want to fill in the organic sessions with the tags of the preceding sessions.
The rules I want to apply:
The push channel stays only on the session in which it was recorded.
Every other non-empty channel is carried forward to all empty sessions on the current and the next day.
Existing channels are not overwritten: if a cpc session came first and an smm session followed on the same day, the sessions are attributed to cpc first and then to smm, for the current and next day.
ClickHouse version: 22.8.10.29
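To make the rules above concrete, here is a minimal Python sketch of how I read them, not ClickHouse code. It assumes the sessions of a single install_id are already sorted by start time, that an organic session has an empty utm_medium, and that "current and next day" means a date difference of at most one day. The function name `fill_organic_sessions` and the `carry_days` parameter are made up for illustration.

```python
from datetime import date


def fill_organic_sessions(sessions, carry_days=1):
    """Return the final channel for each session of ONE install_id.

    sessions: list of (session_date, utm_medium) tuples, already sorted by
    start time; utm_medium is '' for organic sessions (assumption).
    """
    last_channel = ""  # most recent non-empty, non-push channel
    last_date = None   # date of the session that set last_channel
    final = []
    for session_date, medium in sessions:
        if medium == "push":
            # Rule 1: push stays on its own session and is never carried forward.
            final.append(medium)
        elif medium:
            # A tagged session keeps its tag and becomes the new carried channel.
            last_channel, last_date = medium, session_date
            final.append(medium)
        elif last_date is not None and (session_date - last_date).days <= carry_days:
            # Rule 2: empty session on the same or the next day inherits the channel.
            final.append(last_channel)
        else:
            final.append("")
    return final


example = [
    (date(2023, 1, 1), "cpc"),
    (date(2023, 1, 1), ""),
    (date(2023, 1, 1), "smm"),
    (date(2023, 1, 1), "push"),
    (date(2023, 1, 2), ""),
    (date(2023, 1, 3), ""),
]
print(fill_organic_sessions(example))
# ['cpc', 'cpc', 'smm', 'push', 'smm', '']
```

In this reading the organic session right after cpc gets cpc, the one on the next day gets smm (the latest non-push channel), and the session two days later stays empty.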

Main idea: use arrays, with UNION ALL for the push channel.
select install_id, session_id, date_uz , started_at, utm_medium, utm_medium_final
from (
SELECT *, arrayFirst(x -> x!='', arrayReverse(utm_medium_array)) as utm_medium_new,
maxIf(date_uz, utm_medium_new = utm_medium) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_date,
if(date_uz - last_date < 2, utm_medium_new, '') utm_medium_final
--any(utm_medium_new) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING ) as h,
from (
select install_id, session_id, utm_medium, date_uz , started_at,
groupArray(utm_medium) OVER (PARTITION BY install_id ORDER BY started_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS utm_medium_array
from marketing.sessions_with_attribution swa
where date_uz >=today()-50
and utm_medium!='push'
and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50','57927c21-e862-4729-b38e-f663aa9d227d')
)
union ALL
select install_id, session_id, utm_medium, date_uz , started_at,
[] utm_medium_array, utm_medium , null, utm_medium
from marketing.sessions_with_attribution swa
where date_uz >=today()-50
and utm_medium = 'push'
and install_id in ('1cc69a1f-eb17-4be6-8bfc-a5dee2dd9c50','57927c21-e862-4729-b38e-f663aa9d227d')
)
order by install_id, started_at

Related

ETL query needs some changes to get it right

Hello, I have a query that works, but when I remove two filters (the two WHERE clauses at the end) it no longer works as expected. The filters still have to be removed from the query.
I have accounts 1000001, 1000002, 1000003, 1000004 and 1000005.
I only get account 1000005. I'm pretty sure this is caused by the window MAX function, but I don't see how to fix it.
I want to get the values for all the accounts.
SELECT a12.month_id,
a12.populate_id AS account_id,
LAST_VALUE(current_bal IGNORE NULLS) OVER
(PARTITION BY Populate_id ORDER BY date_id ASC ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_dly_bal
FROM (SELECT TO_CHAR(date_id, 'YYYYMM') AS month_id,
date_id,
account_id AS "account_id",
MAX(account_id) OVER (PARTITION by TO_CHAR(date_id, 'YYYYMM')) as populate_id,
current_bal
FROM (SELECT t.date_id, ad.account_id, ad.current_bal
FROM timedate t
FULL OUTER JOIN (SELECT src_extract_dt, account_id, current_bal
FROM account_dly
WHERE account_id = 1000001) ad
on t.date_id = ad.src_extract_dt
WHERE TO_CHAR(date_id, 'YYYYMM') = '201908'
order by t.date_id)) a12;
https://i.stack.imgur.com/xphVh.png
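A small Python sketch of what MAX(account_id) OVER (PARTITION BY month) does once the account_id = 1000001 filter is gone. With the filter, every month contains a single account, so the maximum is that account. Without it, every row of the month receives the largest account_id. The helper `max_over_partition` and the sample rows are hypothetical and only mimic the window function.

```python
def max_over_partition(rows, partition_key, value_key):
    """Mimic MAX(value) OVER (PARTITION BY partition_key): one result per input row."""
    partition_max = {}
    for row in rows:
        value = row[value_key]
        if value is None:  # SQL MAX ignores NULLs (e.g. dates without a balance row)
            continue
        part = row[partition_key]
        if part not in partition_max or value > partition_max[part]:
            partition_max[part] = value
    # Every row of a partition gets the same maximum.
    return [partition_max.get(row[partition_key]) for row in rows]


accounts = (1000001, 1000002, 1000003, 1000004, 1000005)
rows = [{"month_id": "201908", "account_id": a} for a in accounts]
rows.append({"month_id": "201908", "account_id": None})  # date with no account row

populate_ids = max_over_partition(rows, "month_id", "account_id")
print(populate_ids)
# [1000005, 1000005, 1000005, 1000005, 1000005, 1000005]
```

So populate_id collapses to 1000005 for every row, which matches the symptom described. Keeping the accounts apart would probably need the account itself in the partitioning, but that depends on how the date scaffold is joined.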

Double counting problem in Rolling weekly / monthly active endpoints

Here is my current code to calculate DAE, WAE and MAE:
select event_timestamp as day, Section, users as DAE, SUM(users)
OVER (PARTITION BY Section
ORDER BY event_timestamp
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as WAE,
SUM(users)
OVER (PARTITION BY Section
ORDER BY event_timestamp
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) as MAE
from (
select count(distinct user_pseudo_id) as users, Section, event_timestamp
from
(select distinct *
from
(
select *,
CASE
WHEN param_value = "Names" or param_value = "SingleName" THEN 'Names'
ELSE param_value
END AS Section
from(
select user_pseudo_id, DATE_TRUNC(EXTRACT(DATE from TIMESTAMP_MICROS(event_timestamp)), DAY) as event_timestamp, event_name, params.value.string_value as param_value
from `rayn-deen-app.analytics_317927526.events_*`, unnest(event_params) as params
where (event_name = 'screen_view' and params.key = 'firebase_screen' and (
# Promises
params.value.string_value = "Promises"
# Favourites
or params.value.string_value = "Favourites"
))
group by user_pseudo_id, event_timestamp, event_name, param_value
order by event_timestamp, user_pseudo_id) raw
) base
order by event_timestamp) as events_table
group by Section, event_timestamp
)
The problem is that WAE and MAE count the same users repeatedly. For example, if user A was a "daily active user" on 4 days of a week, the WAE count treats that as 4 users instead of one. I need to remove these repeated counts somehow.
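The difference between summing daily counts and counting distinct users over the window can be sketched in Python. This is only an illustration for a single Section, assuming the data is available as a mapping from day to the set of user ids active on that day. The name `rolling_active_users` is made up.

```python
from datetime import date, timedelta


def rolling_active_users(daily_users, window_days):
    """Count DISTINCT users in the window ending on each day.

    daily_users: dict mapping a date to the set of user ids active that day.
    """
    result = {}
    for day in sorted(daily_users):
        window_start = day - timedelta(days=window_days - 1)
        active = set()
        for other_day, users in daily_users.items():
            if window_start <= other_day <= day:
                active |= users  # set union removes repeat visitors
        result[day] = len(active)
    return result


daily = {
    date(2023, 1, 1): {"A"},
    date(2023, 1, 2): {"A", "B"},
    date(2023, 1, 3): {"A"},
    date(2023, 1, 4): {"A"},
}
wae = rolling_active_users(daily, 7)
summed = sum(len(users) for users in daily.values())
print(wae[date(2023, 1, 4)], summed)
# 2 5
```

Summing the daily counts gives 5 for the week, while only 2 distinct users were active. In SQL this means the distinct count has to be computed over the raw user rows in the date range, not over the already aggregated daily numbers.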

SQL query to get session duration from Firebase events

I need to know the duration of each of my users' sessions. I use BigQuery for this, and the query below is my attempt to get the time. Some context first:
the ga_session_id parameter is propagated to every event in a session, so for each ga_session_id I want to subtract the timestamp of session_start (the start of the session) from the timestamp of the last event with that ga_session_id.
WITH grps AS (
SELECT event_timestamp, event_name,
(SELECT value.int_value FROM UNNEST(event_params)
WHERE key = "ga_session_id") AS sessionid,
COUNTIF(event_name = 'session_start') OVER (ORDER BY event_timestamp) as grp
FROM `nodal-descent-XXXXX.analytics_XXXXXX.events_intraday_*`
)
SELECT min(event_timestamp), max(event_timestamp),
timestamp_diff(timestamp_micros(max(event_timestamp)),
timestamp_micros(min(event_timestamp)), second) as se
FROM grps
An example of the data I have:
Can anyone help me complete the query so that it does this for each ga_session_id?
If I understand your question, you are looking to add the session id to your current query. If so, try the following:
select
ep.value.int_value as ga_session_id
, min(event_timestamp) min_ses
, max(event_timestamp) max_ses
, timestamp_diff(timestamp_micros(max(event_timestamp)), timestamp_micros(min(event_timestamp)), second) as se
from bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131,
UNNEST(event_params) ep
where ep.key='ga_session_id'
group by ga_session_id
order by ga_session_id
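The grouping idea in the query above (min and max timestamp per ga_session_id) can be sketched in Python as follows. It assumes event timestamps in microseconds and events given as (ga_session_id, event_timestamp) pairs. The function name `session_durations` is hypothetical. If the same ga_session_id can occur for different users, the key would need to include the user id as well.

```python
def session_durations(events):
    """Return {ga_session_id: duration in whole seconds}.

    events: iterable of (ga_session_id, event_timestamp_micros) pairs.
    """
    bounds = {}  # session id -> (first timestamp, last timestamp)
    for session_id, ts in events:
        first, last = bounds.get(session_id, (ts, ts))
        bounds[session_id] = (min(first, ts), max(last, ts))
    # Convert the microsecond span of each session to whole seconds.
    return {sid: (last - first) // 1_000_000 for sid, (first, last) in bounds.items()}


events = [
    (111, 1_000_000),
    (111, 4_500_000),
    (222, 10_000_000),
    (111, 2_000_000),
]
durations = session_durations(events)
print(durations)
# {111: 3, 222: 0}
```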

Finding the journey made by users in Google BigQuery

I'm looking to find the journey made by users on a particular website. The schema of my dataset is the same as Google Merchandise Store, which can be found here: https://support.google.com/analytics/answer/3437719?hl=en
From the Google BigQuery cookbook, I've implemented and modified the SQL code provided to get the sequence of hits made by every customer.
SELECT
fullVisitorId AS id,
visitId AS visitid,
visitNumber AS visitnumber,
h.hitNumber AS hitNumber,
CASE
WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
WHEN h.eventInfo.eventAction = "Search" THEN "Search"
WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
END AS journey
FROM
`dataset`,
UNNEST(hits) AS h
WHERE
h.type="PAGE"
OR h.type="EVENT"
ORDER BY
fullVisitorId,
visitId,
visitNumber,
hitNumber
A snippet of the result I got is as follows:
fullVisitorId  visitId  visitNumber  hitnumber  journey
001            1001     1            1          Homepage
001            1001     1            2          Search
001            1001     1            3          null
001            1001     1            4          Search
001            1001     1            5          Listing Page
001            1001     1            6          Lead
001            1001     1            2          Search
001            1001     1            7          Lead
002            1002     1            1          Search
...
What I need is another column that shows the journey taken by each visitor up to the first "Lead", ignoring duplicates (e.g. if the visitor views 5 search pages back-to-back, the journey should show "Search" only once).
I.e. for visitor 001 on visit 1001, the column will show:
Homepage -> Search -> Listing Page -> Lead
I hope the question is clear. Appreciate any help given! :)
Below is for BigQuery Standard SQL and applies extra logic to your existing/current query
#standardSQL
SELECT
fullVisitorId, visitId,
STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
SELECT
fullVisitorId, visitId,
MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
FROM (
SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
FROM `your_current_query`
WINDOW win AS (
PARTITION BY fullVisitorId, visitId
ORDER BY visitNumber, hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
)
WHERE grp = 0
GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
So you can just wrap your existing query as shown here:
#standardSQL
WITH `your_current_query` AS (
SELECT
fullVisitorId, -- not aliased to id: the outer query refers to fullVisitorId
visitId AS visitid,
visitNumber AS visitnumber,
h.hitNumber AS hitNumber,
CASE
WHEN h.eventInfo.eventAction = "Lead" THEN "Lead"
WHEN h.eventInfo.eventAction = "Homepage" THEN "Homepage"
WHEN h.eventInfo.eventAction = "Search" THEN "Search"
WHEN h.eventInfo.eventAction = "High Intent Use" THEN "High Intent Use"
WHEN h.eventInfo.eventAction = "Listing Page" THEN "Listing Page"
END AS journey
FROM
`dataset`,
UNNEST(hits) AS h
WHERE
h.type="PAGE"
OR h.type="EVENT"
)
SELECT
fullVisitorId, visitId,
STRING_AGG(journey, ' -> ' ORDER BY visitNumber, hitnumber) journey_path
FROM (
SELECT
fullVisitorId, visitId,
MIN(visitNumber) visitNumber, MIN(hitnumber) hitnumber, journey
FROM (
SELECT *, COUNTIF(journey = 'Lead') OVER(win) grp
FROM `your_current_query`
WINDOW win AS (
PARTITION BY fullVisitorId, visitId
ORDER BY visitNumber, hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
)
WHERE grp = 0
GROUP BY fullVisitorId, visitId, journey
)
GROUP BY fullVisitorId, visitId
--- ORDER BY fullVisitorId, visitId
Applied to the sample rows in your question, the query above should produce the following result:
Row  fullVisitorId  visitId  journey_path
1    001            1001     Homepage -> Search -> Listing Page -> Lead
2    002            1002     Search
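For readers who find the nested query hard to follow, here is the same logic as a rough Python sketch for a single visit: keep the rows up to and including the first "Lead", take each journey value at its first occurrence, skip nulls, and join in hit order. The function name `journey_path` is made up, and the hits are assumed to be (hit_number, journey) pairs.

```python
def journey_path(hits):
    """Build the de-duplicated journey of ONE visit up to the first 'Lead'.

    hits: list of (hit_number, journey) pairs; journey may be None.
    """
    steps = []
    seen = set()
    for _, step in sorted(hits, key=lambda hit: hit[0]):
        if step is None:
            continue  # unmatched actions are skipped, as STRING_AGG skips NULLs
        if step not in seen:
            seen.add(step)  # first occurrence decides the position in the path
            steps.append(step)
        if step == "Lead":
            break  # nothing after the first Lead is part of the journey
    return " -> ".join(steps)


visit_1001 = [
    (1, "Homepage"), (2, "Search"), (3, None), (4, "Search"),
    (5, "Listing Page"), (6, "Lead"), (2, "Search"), (7, "Lead"),
]
print(journey_path(visit_1001))
# Homepage -> Search -> Listing Page -> Lead
```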
I'd suggest using STRING_AGG to build a string of the journey steps. Adding DISTINCT to the aggregation will show each journey step only once per user.
Something like:
STRING_AGG(DISTINCT(journey), '->') as propensity_banding_subset
You could then use a regex to cut the string off after the first 'Lead', unless somebody can suggest a better way to do this in the original string aggregation.
I took Mikhail's great approach and turned it into a more scalable version for those who have really large amounts of data. The idea is the same, but it is applied in a subquery on the hits array.
SELECT
fullVisitorId AS id,
visitId AS visitid,
visitNumber AS visitnumber,
ARRAY(
(SELECT AS STRUCT *
FROM
(SELECT AS STRUCT
hitNumber,
page.pagePath, -- pagePath instead of CASE-WHEN with events
count(page.pagePath) over (win) elNumber
FROM t.hits
WHERE type IN ('PAGE', 'EVENT')
WINDOW win AS (
PARTITION BY page.pagePath
ORDER BY hitnumber
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
ORDER BY hitNumber)
WHERE elNumber=0
-- instead of 'Lead' I used '/signin.html'
AND hitNumber < (SELECT MIN(hitNumber) FROM t.hits WHERE page.pagePath='/signin.html')
)
) AS journey
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
limit 1000
I used the actual sample data and couldn't find the events from the example there, so I simply used page paths. It should be easy to adapt, though.
This version also returns nested data, not a flat table, which saves space when the result is stored as a table and makes queries on it faster.
There is also no grouping involved: everything happens within the subquery on the array only, which allows very fast processing due to parallelization.

SQL rollup on sessions

I have an impression event table that has a bunch of timestamps and marked start/end boundaries. I am trying to roll it up to have a metric that says "this session contains at least 1 impression with feature x". I'm not sure how exactly to do this. Any help would be appreciated. Thanks.
I want to roll this up into something that looks like:
account, session_start, session_end, interacted_with_feature
3004514, 2018-02-23 13:43:35.475, 2018-02-23 13:43:47.377, FALSE
where it is simple for me to say if this session had any interactions with the feature or not.
Perhaps aggregation does what you want:
select account, min(timestamp), max(timestamp), max(interacted_with_feature)
from t
group by account;
I was able to solve this with conditional cumulative sums to generate a session group ID for each row.
with cte as (
select *
, sum(case when session_boundary = 'start' then 1 else 0 end)
over (partition by account order by timestamp rows unbounded preceding)
as session_num
from raw_sessions
)
select account
, session_num
, min(timestamp) as session_start
, max(timestamp) as session_end
, bool_or(interacted_with_feature) as interacted_with_feature
from cte
group by account, session_num
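The conditional cumulative sum can also be illustrated outside SQL. Below is a minimal Python sketch of the same idea, assuming each row is a dict with the columns used above (account, timestamp, session_boundary, interacted_with_feature) and that timestamps sort correctly as given. The function name `roll_up_sessions` is hypothetical.

```python
def roll_up_sessions(rows):
    """Group rows into sessions by counting 'start' markers per account."""
    session_num = {}  # account -> running count of 'start' rows seen so far
    sessions = {}     # (account, session_num) -> rolled-up session
    for row in sorted(rows, key=lambda r: (r["account"], r["timestamp"])):
        account = row["account"]
        if row["session_boundary"] == "start":
            session_num[account] = session_num.get(account, 0) + 1
        key = (account, session_num.get(account, 0))
        session = sessions.setdefault(key, {
            "session_start": row["timestamp"],
            "session_end": row["timestamp"],
            "interacted_with_feature": False,
        })
        # Rows arrive in timestamp order, so the latest row is the session end.
        session["session_end"] = row["timestamp"]
        if row["interacted_with_feature"]:
            session["interacted_with_feature"] = True  # same effect as bool_or
    return sessions


rows = [
    {"account": 3004514, "timestamp": "2018-02-23 13:43:35.475", "session_boundary": "start", "interacted_with_feature": False},
    {"account": 3004514, "timestamp": "2018-02-23 13:43:40.000", "session_boundary": None, "interacted_with_feature": False},
    {"account": 3004514, "timestamp": "2018-02-23 13:43:47.377", "session_boundary": "end", "interacted_with_feature": False},
    {"account": 3004514, "timestamp": "2018-02-23 14:00:00.000", "session_boundary": "start", "interacted_with_feature": True},
    {"account": 3004514, "timestamp": "2018-02-23 14:00:05.000", "session_boundary": "end", "interacted_with_feature": False},
]
rolled = roll_up_sessions(rows)
for key, session in rolled.items():
    print(key, session)
```

Each 'start' row bumps the counter for its account, so all rows up to the next 'start' share one session number, which is what the windowed sum in the CTE produces.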