User Life Cycle SQL Query Logic in Snowflake

I am working on building a query to track the life cycle of a user through the platform via events. The table EVENTS has three columns: USER_ID, DATE_TIME and EVENT_NAME. Below is a snapshot of the table,
My query should return the result below (the first timestamp of the registered event, followed by the next log_in timestamp after it, and finally the next landing_page timestamp after that),
Below is my query:
WITH FIRST_STEP AS
(SELECT
USER_ID,
MIN(CASE WHEN EVENT_NAME = 'registered' THEN DATE_TIME ELSE NULL END) AS REGISTERED_TIMESTAMP
FROM EVENTS
GROUP BY 1
),
SECOND_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'log_in'
ORDER BY DATE_TIME
),
THIRD_STEP AS
(SELECT * FROM EVENTS
WHERE EVENT_NAME = 'landing_page'
ORDER BY DATE_TIME
)
SELECT
a.USER_ID,
a.REGISTERED_TIMESTAMP,
(SELECT
CASE WHEN b.DATE_TIME >= a.REGISTRATIONS_TIMESTAMP THEN b.DATE_TIME END AS LOG_IN_TIMESTAMP
FROM SECOND_STEP
LIMIT 1
),
(SELECT
CASE WHEN c.DATE_TIME >= LOG_IN_TIMESTAMP THEN c.DATE_TIME END AS LANDING_PAGE_TIMESTAMP
FROM THIRD_STEP
LIMIT 1
)
FROM FIRST_STEP AS a
LEFT JOIN SECOND_STEP AS b ON a.USER_ID = b.USER_ID
LEFT JOIN THIRD_STEP AS c ON b.USER_ID = c.USER_ID;
Unfortunately, I am getting a "SQL compilation error: Unsupported subquery type cannot be evaluated" error when I try to run the query.

This is a perfect use case for MATCH_RECOGNIZE.
The pattern you are looking for is register anything* login anything* landing, and the measures are min(iff(event_name = '...', date_time, null)) for each step. Pattern symbols that are not listed in the DEFINE clause (here, anything) match any row.
Check:
https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1
https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html
Set the output to one row per match.
Untested sample query:
select *
from events
match_recognize(
    partition by user_id
    order by date_time
    measures
        min(iff(event_name = 'registered', date_time, null)) as registered_timestamp,
        min(iff(event_name = 'log_in', date_time, null)) as log_in_timestamp,
        min(iff(event_name = 'landing_page', date_time, null)) as landing_page_timestamp
    one row per match
    pattern(register anything* login anything* landing)
    define
        register as event_name = 'registered',
        login as event_name = 'log_in',
        landing as event_name = 'landing_page'
);

Related

Oracle SQL - Timestamp splits query result into 2 rows, need all in one row

I need a time-based query (using a random or the current timestamp) with all results in one row. My current query is as follows:
WITH started AS
(
SELECT f.*, CURRENT_DATE + ROWNUM / 24
FROM
(
SELECT
d.route_name,
d.op_name,
d.route_step_name,
nvl(MAX(DECODE(d.complete_reason, NULL, d.op_STARTS)), 0) started_units,
round(nvl(MAX(DECODE(d.complete_reason, 'PASS', d.op_complete)), 0) / d.op_starts * 100, 2) yield
FROM
(
SELECT route_name,
op_name,
route_step_name,
complete_reason,
complete_quantity,
sum(start_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON) op_starts,
sum(complete_quantity) OVER(PARTITION BY route_name, op_name, COMPLETE_REASON ) op_complete
FROM FTPC_LT_PRDACT.tracked_object_history
WHERE route_name = 'HEADER FINAL ASSEMBLY'
AND OP_NAME NOT LIKE '%DISPOSITION%'
and (tobj_type = 'Lot')
AND xfr_insert_pid IN
(
SELECT xfr_start_id
FROM FTPC_LT_PRDACT.xfr_interval_id
WHERE last_modified_time <= SYSDATE
AND OP_NAME NOT LIKE '%DISPOSITION%'
and complete_reason = 'PASS' OR complete_reason IS NULL
)
) d
GROUP BY d.route_name, d.op_name, d.route_step_name, complete_reason, d.op_starts
ORDER BY d.route_step_name
) f
),
queued AS
(
SELECT
ts.route_name,
ts.queue_name,
o.op_name,
sum (th.complete_quantity) queued_units
FROM
FTPC_LT_PRDACT.tracked_object_HISTORY th,
FTPC_LT_PRDACT.tracked_object_status ts,
FTPC_LT_PRDACT.route_arc a,
FTPC_LT_PRDACT.route_step r,
FTPC_LT_PRDACT.operation o,
FTPC_LT_PRDACT.lot l
WHERE r.op_key = o.op_key
and l.lot_key = th.tobj_key
AND a.to_node_key = r.route_step_key
AND a.from_node_key = ts.queue_key
and th.tobj_history_key = ts.tobj_history_key
AND a.main_path = 1
AND (ts.tobj_type = 'Lot')
AND O.OP_NAME NOT LIKE '%DISPOSITION%'
and th.route_name = 'HEADER FINAL ASSEMBLY'
GROUP BY ts.route_name, ts.queue_name, o.op_name
)
SELECT
started.route_name,
started.op_name,
started.route_step_name,
max(started.yield) started_yield,
max(started.started_units) started_units,
case when queued.queue_name is NULL then 'N/A' else queued.queue_name end QUEUE_NAME,
case when queued.queued_units is NULL then 0 else queued.queued_units end QUEUED_UNITS
FROM started
left JOIN queued ON started.op_name = queued.op_name
group by started.route_name, started.op_name, started.route_step_name, queued.queue_name, QUEUED_UNITS
order by started.route_step_name asc
;
Current Query (as expected) but missing timestamp:
I need a timestamp on each individual row so that a different application can display the results. Any help would be greatly appreciated! When I try to add a timestamp, my query result is altered:
Query once timestamp is added:
Edit: I need to display the query in a visualization tool. That tool is time-based and will skew the table results unless there is a datetime associated with each field. The datetime value can be random, but it cannot be the same for each result.
The query is to be displayed on a live dashboard; every time the application is refreshed, the query is expected to update.
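One rough sketch of the idea already attempted with CURRENT_DATE + ROWNUM / 24, moved into an outer SELECT so that every output row gets its own datetime without disturbing the grouping (illustrative only; the placeholder inner query stands in for the full statement above, and the column name row_ts is made up):
-- Sketch: offset SYSDATE by the row number of the final result so each row
-- carries a distinct datetime; the value itself is arbitrary for the dashboard.
SELECT x.*,
       SYSDATE + (ROWNUM / (24 * 60 * 60)) AS row_ts   -- one second apart per row
FROM (
       -- put the whole existing SELECT ... FROM started LEFT JOIN queued ...
       -- GROUP BY ... ORDER BY ... statement here, unchanged
       SELECT 'placeholder' AS example_col FROM dual
     ) x;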

Column into multiple columns with distinct count

I have a table which looks like:
event_date | event_name | user_id
---------- | ---------- | -------
20220407   | n1         | a
20220407   | n2         | b
20220407   | n3         | a
20220408   | n1         | a
20220408   | n1         | a
20220408   | n2         | c
Each row represents a single event with params (it's actually a BigQuery table with data from Firebase).
I want to select only the needed events and put their distinct-user counts, grouped by day, into another table, like this:
date     | n1 distinct users count | n2 distinct users count
-------- | ----------------------- | -----------------------
20220407 | 1                       | 1
20220408 | 2                       | 0
I've tried something like:
SELECT COUNT (DISTINCT user_pseudo_id) as users
,event_date
event_name,
case app_info.id when 'com.kaspersky.standalone-vpn' then 'KSeC-iOS'
when 'com.kaspersky.secure.connection' then 'KSeC-Android'
when 'com.kaspersky.securityadvisor' then 'KSC-iOS'
when 'com.kaspersky.security.cloud' then 'KSC-Android'
else app_info.id end as product
, SUBSTRING(device.language, 1, 2) as language
, geo.country
, app_info.version as app_version
FROM `ksec-android.analytics_156657667.events_*`
WHERE (event_name = 'first_open' OR event_name = 'user_engagement' OR 'event_name' = 'app_remove')
and _table_suffix >= FORMAT_DATE("%Y%m%d",(date_sub(CURRENT_DATE(), interval 1 day)))
group by event_date
,product
,language
,country
,app_version
,event_name
) src
pivot
(
count(users)
for event_name in ([first_open], [user_engagement], [app_remove])
) piv
group by event_date
,product
,language
,country
,app_version
I really don't get it; I would be so thankful for help.
Consider the below approach:
select * from your_table
pivot (count(distinct user_id) as count for event_name in ('n1', 'n2'))
If applied to the sample data in your question, the output is:
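The same approach, applied to the real table from the attempt above, might look roughly like this (untested sketch; only the columns needed for the pivot are kept, and any extra dimensions such as product or country left in the subquery would become additional implicit group-by keys):
select *
from (
  select event_date, event_name, user_pseudo_id
  from `ksec-android.analytics_156657667.events_*`
  where event_name in ('first_open', 'user_engagement', 'app_remove')
    and _table_suffix >= format_date('%Y%m%d', date_sub(current_date(), interval 1 day))
)
pivot (count(distinct user_pseudo_id) as users
       for event_name in ('first_open', 'user_engagement', 'app_remove'))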

Combine multiple rows with different dates with overlapping variables (to capture first and last change dates)

I have the following data represented in a table like this:
User | Type    | Date
---- | ------- | ----------
A    | Mobile  | 2019-01-10
A    | Mobile  | 2019-01-20
A    | Desktop | 2019-03-01
A    | Desktop | 2019-03-20
A    | Email   | 2021-01-01
A    | Email   | 2020-01-02
A    | Desktop | 2021-01-03
A    | Desktop | 2021-01-04
A    | Desktop | 2021-01-05
Using PostgreSQL - I want to achieve the following:
User | First_Type | First_Type_Initial_Date | Last_Type | Last_Type_Initial_Date
---- | ---------- | ----------------------- | --------- | ----------------------
A    | Mobile     | 2019-01-10              | Desktop   | 2021-01-03
So for each user, I want to capture the initial type and date, and also, on the same row (but in different columns), the last type they "switched" to, with the first date that switch occurred rather than the last record of activity on that type.
Consider using a LAG window function and conditional aggregation, combined via multiple CTEs and self-joins:
WITH sub AS (
SELECT "user"
, "type"
, "date"
, CASE
WHEN LAG("type") OVER(PARTITION BY "user" ORDER BY "date") = "type"
THEN 0
ELSE 1
END "shift"
FROM myTable
), agg AS (
SELECT "user"
, MIN(CASE WHEN shift = 1 THEN "date" END) AS min_shift_dt
, MAX(CASE WHEN shift = 1 THEN "date" END) AS max_shift_dt
FROM sub
GROUP BY "user"
)
SELECT agg."user"
, s1."type" AS first_type
, s1."date" AS first_type_initial_date
, s2."type" AS last_type
, s2."date" AS last_type_initial_date
FROM agg
INNER JOIN sub AS s1
ON agg."user" = s1."user"
AND agg.min_shift_dt = s1."date"
INNER JOIN sub AS s2
ON agg."user" = s2."user"
AND agg.max_shift_dt = s2."date"
Online Demo
user | first_type | first_type_initial_date | last_type | last_type_initial_date
---- | ---------- | ----------------------- | --------- | ----------------------
A    | Mobile     | 2019-01-10 00:00:00     | Desktop   | 2021-01-03 00:00:00
Here is my solution with only window functions and no joins:
with
prep as (
    select *,
           lag("Type") over(partition by "User" order by "Date") as "Lasttype"
    from your_table_name
)
-- the explicit order by and full-partition frame make first_value/last_value deterministic
select distinct "User",
       first_value("Type") over w as "First_Type",
       first_value("Date") over w as "First_Type_Initial_Date",
       last_value("Type") over w as "Last_Type",
       last_value("Date") over w as "Last_Type_Initial_Date"
from prep
where "Type" <> "Lasttype" or "Lasttype" is null
window w as (partition by "User" order by "Date"
             rows between unbounded preceding and unbounded following)
;
I think this will work, but it sure feels ugly. There might be a better way to do this.
SELECT a.User, a.Type AS First_Type, a.Date AS FirstTypeInitialDate, b.Type AS Last_Type, b.Date AS LastTypeInitialDate
FROM table a
INNER JOIN table b ON a.User = b.User
WHERE a.Date = (SELECT MIN(c.Date) FROM table c WHERE c.User = a.User)
AND b.Date = (SELECT MIN(d.Date) FROM table d WHERE d.User = b.User
AND d.Type = (SELECT e.Type FROM table e WHERE e.User = d.User
AND e.Date = (SELECT MAX(f.Date) FROM table f WHERE f.User = e.User)))

SQL count DISTINCT ONCE user_id multiple attributes

Hello there, I can't manage to get a good result for the following case:
I have a table which is like this:
UserID | Label
------ | -------
1      | Private
1      | Public
2      | Private
3      | Hidden
4      | Public
5      | Hidden
I want the following to happen depending on which labels a user has assigned:
Private and Hidden are treated the same: let's say Business
Public: BtoC
Public and Private and/or Hidden: both
So in the end I have a count(DISTINCT UserID) of
Business 3
BtoC 1
both 1
I have tried to use CASE WHEN but it doesn't work. My current total query looks like this:
SELECT gen_month,
count(DISTINCT cu.id) as leads,
a.label
FROM generate_series(DATE_TRUNC('month', CURRENT_DATE::date - 96*INTERVAL '1 month'), CURRENT_DATE::date, '1 month') m(gen_month)
LEFT OUTER JOIN company_user AS cu
ON (date_trunc('month', cu.creation_date) = date_trunc('month', gen_month))
LEFT JOIN user u
ON u.user_id = cu.id
LEFT join user_account_status as uas
on cu.id = uas.user_id
LEFT JOIN account as a
on uas.account_id = a.id
where gen_month >= DATE_TRUNC('month',NOW() - INTERVAL '5 months')
group by m.gen_month, a.label
order by gen_month
So my main problem now is that each user is counted once under every attribute.
How can I make a user_id count only once, under a condition like CASE WHEN the user_id appears with Public and (Private or Hidden) THEN count(DISTINCT user_id) as Both?
Addition: it's MySQL/MariaDB and PostgreSQL, but for a start I would be happy with Postgres.
This does not implement your full total query, but for counting users per category you can do:
with the_table(UserID , Label) as(
select 1 ,'Private' union all
select 1 ,'Public' union all
select 2 ,'Private' union all
select 3 ,'Hidden' union all
select 4 ,'Public' union all
select 5 ,'Hidden'
)
select result, count(*) from (
select UserID, case when min(Label) = 'Public' then 'BtoC' when max(Label) in('Private','Hidden') then 'Business' else 'both' end as result
from the_table
group by UserID
) t
group by result
with
my_table(user_id, label) as (values
(1,'Private'),
(1,'Public'),
(2,'Private'),
(3,'Hidden'),
(4,'Public'),
(5,'Hidden')),
t as (
select
user_id,
string_agg('{'||label||'}', '') as labels
from my_table
group by user_id),
tt as (
select
user_id,
labels,
case
when
position('{Public}' in labels) > 0 and (position('{Private}' in labels) > 0 or position('{Hidden}' in labels) > 0) then 'Both'
when
position('{Private}' in labels) > 0 or position('{Hidden}' in labels) > 0 then 'Business'
when
position('{Public}' in labels) > 0 then 'BtoC'
end as kind
from t)
select kind, count(*) from tt group by kind;
For MariaDB use GROUP_CONCAT() instead of PostgreSQL string_agg().
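A rough MariaDB equivalent of that aggregation step (untested sketch; GROUP_CONCAT defaults to a comma separator, so an empty SEPARATOR is given explicitly):
-- MariaDB sketch: build the same '{Label}{Label}...' string per user
select user_id,
       group_concat(concat('{', label, '}') separator '') as labels
from my_table
group by user_id;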
Note that the CASE expression checks conditions in order of appearance and returns the value for the first satisfied condition.
PS: Using PostgreSQL arrays, the conditions would be more elegant.
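A minimal sketch of that array-based variant (my own illustration, not part of the original answer), assuming the same my_table(user_id, label) sample data and using array_agg with the @> containment and && overlap operators:
-- Sketch: aggregate labels into an array per user, then test membership
with t as (
    select user_id, array_agg(label) as labels
    from my_table
    group by user_id
)
select case
           when labels @> array['Public']
                and labels && array['Private', 'Hidden'] then 'Both'
           when labels && array['Private', 'Hidden'] then 'Business'
           when labels @> array['Public'] then 'BtoC'
       end as kind,
       count(*)
from t
group by 1;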

LAG within CASE giving false negative offset

TL;DR: scroll down to TASK 2.
I am dealing with the following data set:
email,createdby,createdon
a#b.c,jsmith,2016-10-10
a#b.c,nsmythe,2016-09-09
a#b.c,vstark,2016-11-11
b#x.y,ajohnson,2015-02-03
b#x.y,elear,2015-01-01
...
and so on. Each email is guaranteed to have at least one duplicate in the data set.
Now, there are two tasks to resolve; I resolved one of them but am struggling with the other one. I will now present both tasks for completeness.
TASK 1 (resolved):
For each row, for each email, return an additional column with the name of the user that created the first record with this email.
Expected result for the above sample data set:
email,createdby,createdon,original_createdby
a#b.c,jsmith,2016-10-10,nsmythe
a#b.c,nsmythe,2016-09-09,nsmythe
a#b.c,vstark,2016-11-11,nsmythe
b#x.y,ajohnson,2015-02-03,elear
b#x.y,elear,2015-01-01,elear
Code to get the above:
;WITH q0 -- this is just a security measure in case there are unique emails in the data set
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Brief explanation: I partition the data set by email, then number the rows in each partition ordered by creation date, and finally return the [createdby] value from the record (rn-1) rows back. Works exactly as expected.
Now, similar to the above, there is TASK 2:
TASK 2:
For each row, for each email, return the name of the user who created the first duplicate, i.e. the name of the user where rn = 2.
Expected result:
email,createdby,createdon,first_dupl_createdby
a#b.c,jsmith,2016-10-10,jsmith
a#b.c,nsmythe,2016-09-09,jsmith
a#b.c,vstark,2016-11-11,jsmith
b#x.y,ajohnson,2015-02-03,ajohnson
b#x.y,elear,2015-01-01,ajohnson
I want to keep things performant, so I am trying to employ the LEAD/LAG functions:
WITH q0
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, CASE q1.rn
WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
, q1.rn
Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition return [createdby] from (rn-2) records ago (i.e. for rn = 2 we're staying on the same record, for rn = 3 we're going 1 record back, for rn = 4 - 2 records back and so on).
An issue comes up on the
ELSE LAG(q1.createdby, q1.rn - 2)
operation. Apparently, against any logic, despite the existence of the preceding line (WHEN 1 THEN...), the ELSE block is also evaluated for rn = 1, resulting in a negative offset value passed to the LAG function:
Msg 8730, Level 16, State 2, Line 37
Offset parameter for Lag and Lead functions cannot be a negative value.
When I comment out that ELSE line, the whole thing works fine but obviously I am not getting any results in the first_dupl_createdby column for rn > 1.
QUESTION:
Is there any way of rewriting the above CASE expression (in TASK 2) so that it always returns the value from the record where rn = 2 within each partition, but - and this is the important bit - without doing a self-JOIN? (I know I could prepare the rows where rn = 2 in a separate subquery, but this would mean extra scans on the whole table and an unnecessary self-JOIN.)
I think you can simply use the MAX window function, as you are trying to get the value from the row where rn = 2 in each partition.
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
You can use a similar query to get the result for rn = 1 in the first scenario as well.
You can get the information for each email using row_number() and conditional aggregation:
select email,
max(case when seqnum = 1 then createdby end) as createdby_first,
max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
row_number() over (partition by email order by createdon) as seqnum
from t
) t
group by email;
You can join this information back to the original data to get the information you want. I don't see how lag() naturally would be used to solve this problem.
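A rough sketch of that join-back (my illustration, not part of the original answer), reusing the aggregated result as a derived table:
-- Sketch: join the per-email second creator back onto every row of t
select t.email,
       t.createdon,
       t.createdby,
       agg.createdby_second as first_dupl_createdby
from t
join (select email,
             max(case when seqnum = 2 then createdby end) as createdby_second
      from (select t.*,
                   row_number() over (partition by email order by createdon) as seqnum
            from t
           ) x
      group by email
     ) agg
  on agg.email = t.email
order by t.email, t.createdon;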
/shrug
; WITH duplicate_email_addresses AS (
SELECT email
FROM t
GROUP
BY email
HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
SELECT email
, createdon
, createdby
, Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
FROM t
WHERE EXISTS (
SELECT *
FROM duplicate_email_addresses
WHERE email = t.email
)
)
, second_duplicate_record AS ( -- Why do you need any more than this?
SELECT email
, createdon
, createdby
FROM records_with_duplicate_email_addresses
WHERE sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
, records_with_duplicate_email_addresses.createdon
, records_with_duplicate_email_addresses.createdby
, second_duplicate_record.createdby AS first_duplicate_createdby
FROM records_with_duplicate_email_addresses
INNER
JOIN second_duplicate_record
ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;