SQL - timeline based queries

I have a table of events which has:
user_id
event_name
event_time
There are event names of types: meeting_started, meeting_ended, email_sent
I want to create a query that counts the number of times an email has been sent during a meeting.
UPDATE: I'm using Google BigQuery.
Example query:
SELECT
    event_name,
    count(distinct user_id) users
FROM events_table
WHERE event_name IN ('meeting_started', 'meeting_ended')
group by 1
How can I achieve that?
Thanks!

You can do this in BigQuery using last_value():
Presumably, an email is sent during a meeting if the most recent "meeting" event is 'meeting_started'. So you can solve this by getting the most recent meeting event for each event and then filtering:
select et.*
from (select et.*,
             last_value(case when event_name in ('meeting_started', 'meeting_ended') then event_name end ignore nulls) over
                 (partition by user_id order by event_time) as last_meeting_event
      from events_table et
     ) et
where event_name = 'email_sent' and last_meeting_event = 'meeting_started'

This reads like some kind of gaps-and-islands problem, where an island is a meeting, and you want emails that belong to islands.
How do we define an island? Assuming that meeting starts and ends properly interleave, we can just compare the count of starts and ends on a per-user basis. If there are more starts than ends, then a meeting is in progress. Using this logic, you can get all emails that were sent during a meeting like so:
select *
from (
select e.*,
countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
countif(event_name = 'meeting_ended' ) over(partition by user_id order by event_time) as cnt_ended
from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended
It is unclear where you want to go from here. If you want the count of such emails, just use select count(*) instead of select * in the outer query.
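The running-count logic can also be sketched outside SQL. Below is a minimal Python simulation of the gaps-and-islands approach, assuming events are given as `(user_id, event_name, event_time)` tuples (the tuple shape and sample data are illustrative, not from the question):

```python
# Simulate the running count of meeting starts/ends per user to decide
# whether each email_sent event falls inside a meeting.
from collections import defaultdict

def emails_during_meetings(events):
    """events: list of (user_id, event_name, event_time) tuples."""
    counts = defaultdict(lambda: {"started": 0, "ended": 0})
    in_meeting_emails = []
    # process each user's events in time order, like the window's ORDER BY
    for user_id, name, time in sorted(events, key=lambda e: (e[0], e[2])):
        c = counts[user_id]
        if name == "meeting_started":
            c["started"] += 1
        elif name == "meeting_ended":
            c["ended"] += 1
        elif name == "email_sent" and c["started"] > c["ended"]:
            # more starts than ends seen so far -> a meeting is in progress
            in_meeting_emails.append((user_id, time))
    return in_meeting_emails

events = [
    (1, "meeting_started", 1), (1, "email_sent", 2), (1, "meeting_ended", 3),
    (1, "email_sent", 4),      # outside any meeting
    (2, "email_sent", 1),      # user never in a meeting
]
print(emails_during_meetings(events))  # → [(1, 2)]
```

`len()` of the result corresponds to the `select count(*)` variant of the outer query.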

Count of unique identifiers based on 2 columns

I have a simple question: I need to calculate how many user_ids attended multiple events (confirmed or registered). There are 24 distinct event_ids in the real dataset. Here is my query, which did not give me the results I was looking for. Where did I go wrong? This is done in BigQuery, if that info is needed.
SELECT COUNT(user_id) as volunteer_count,
status,
COUNT(DISTINCT event_id) as event_count
FROM `nice-incline.events_participations.events_participations`
WHERE status = 'CONFIRMED' OR status = 'REGISTERED'
GROUP BY status
HAVING event_count > 1;
Below is a sample table.
event_id | user_id | status
-------- | ------- | ----------
378398   | 1783965 | confirmed
418729   | 4518485 | registered
378398   | 4518485 | registered
418729   | 4432831 | canceled
The expected result would just be a count of user_ids who have attended multiple (>1) event_ids. So in this case, we would have 1, since user '4518485' attended 2 events and is registered.
Use below
select count(*) from (
select user_id
from your_table
where status in ('confirmed', 'registered')
group by user_id
having count(*) > 1
)
or (with same output)
select count(*) from (
select distinct user_id
from your_table
where status in ('confirmed', 'registered')
qualify count(*) over(partition by user_id) > 1
)
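The grouping logic of both variants can be sketched in a few lines of Python, assuming rows as `(event_id, user_id, status)` tuples (an illustrative shape matching the sample table):

```python
# Count users with more than one confirmed/registered participation,
# mirroring GROUP BY user_id HAVING count(*) > 1.
from collections import Counter

def multi_event_users(rows):
    """rows: list of (event_id, user_id, status) tuples."""
    per_user = Counter(
        user_id for _, user_id, status in rows
        if status in ("confirmed", "registered")
    )
    return sum(1 for n in per_user.values() if n > 1)

rows = [
    (378398, 1783965, "confirmed"),
    (418729, 4518485, "registered"),
    (378398, 4518485, "registered"),
    (418729, 4432831, "canceled"),
]
print(multi_event_users(rows))  # → 1
```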

Efficiently join latest entry of first table to second table depending on entity characteristics of first table

Looking for an efficient solution to join two tables, with the caveat that characteristics of the second table should determine what is joined to the first table (in Google BigQuery).
Let's say I have two tables: one with events (id, session, event_date) and a second with policies applying to events (event_id, policy, create_date). I want to determine which policy applied to an event, based on the policy create date and the event date.
CREATE TEMP TABLE events AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS id, "A" AS session, "2021-11-05" AS event_date),
(1, "B", "2021-12-17"),
(2, "A", "2021-08-13")
])
);
CREATE TEMP TABLE policies AS (
SELECT *
FROM UNNEST([
STRUCT(1 AS event_id, "foo" AS policy, "2021-01-01" AS create_date),
(1, "bar", "2021-12-01"),
(2, "foo", "2021-02-01")
])
);
In my example, the result should look like this if I get the latest policy create_date that existed by the time of the event (event_date).
id | session | policy_create_date
-- | ------- | ------------------
1  | A       | 2021-01-01
1  | B       | 2021-12-01
2  | A       | 2021-02-01
The following solution provides the result I want, but it creates an N:N JOIN and can become quite big and computationally intensive if both tables get large (especially if I have many of the same events and many policy changes). Hence, I'm looking for a solution that is more efficient than the one below and avoids the N:N JOIN.
SELECT
e.id,
e.session,
MAX(p.create_date) AS policy_create_date -- get latest policy amongst all policies for an event_id that existed before the session took place
FROM events e
INNER JOIN policies p
ON e.id = p.event_id -- match event and policy based on event_id
AND p.create_date < e.event_date -- match only policies that existed before the session of the event took place
GROUP BY 1, 2
TY!!!
Edit: I adjusted the known but inefficient solution to better reflect my goal. Of course, I want the policy in the end, but that is not in focus here.
You can try a window function:
WITH cte AS (
SELECT e.id, e.session, p.policy
, row_number() over(partition by e.id, e.session order by p.create_date desc) rn
FROM events e
INNER JOIN policies p
ON e.id = p.event_id AND p.create_date < e.event_date
)
SELECT c.id, c.session, c.policy
FROM cte c
where rn=1
I have tried the following code on Postgres, but there shouldn't be anything in it that is Postgres-specific.
Your query can be reorganised using a subquery to:
SELECT
e.id,
e.session,
(SELECT MAX(create_date) FROM policies AS p WHERE e.id = p.event_id AND p.create_date < e.event_date) AS policy_create_date
FROM events e
WHERE policy_create_date IS NOT NULL
While this query should show similar performance, it makes it easier to spot the problem with the overall query: while finding the MAX, the database has already found and read the row from policies with the highest date, but you are not getting the value of the policy column out. So, you need to do a second join.
Using a lateral join you can get the complete relevant row from policies in one go.
SELECT
e.id,
e.session,
p2.policy,
p2.create_date
FROM events AS e
INNER JOIN LATERAL
(SELECT
*
FROM policies AS p
WHERE e.id = p.event_id AND p.create_date < e.event_date
ORDER BY p.create_date DESC
LIMIT 1) AS p2
ON TRUE;
This should use an index on policies, so time should increase linearly with the size of events and logarithmically with the size of policies.
Nevertheless, you can't expect great performance when you do this for large result sets, because there will be lots of cache misses while accessing the policies table.
Another option is to interleave the two tables, then use LAST_VALUE() to look back to find the policy data...
WITH
interleave AS
(
SELECT
id AS event_id,
event_date AS event_date,
session AS event_session,
NULL AS policy_label,
NULL AS policy_date
FROM
events
UNION ALL
SELECT
event_id,
create_date,
NULL,
policy,
create_date
FROM
policies
),
lookback AS
(
SELECT
event_id,
event_session,
event_date,
LAST_VALUE(policy_label IGNORE NULLS) OVER event_order AS policy_label,
LAST_VALUE(policy_date IGNORE NULLS) OVER event_order AS policy_date
FROM
interleave
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY event_date,
event_session NULLS FIRST
ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
)
)
SELECT
event_id,
event_session,
event_date,
policy_label,
policy_date
FROM
lookback
WHERE
event_session IS NOT NULL
This presumes that the events table is vastly larger than the policies table.
I'd also recommend ensuring the tables are partitioned by the event_id and clustered by their respective date column.
Another option is to use LEAD() to find a policy's "expiry" date, then use that in the join...
WITH
policy_range AS
(
SELECT
event_id,
policy,
create_date,
LEAD(create_date, 1, DATE '9999-12-31') OVER event_order AS expiry_date
FROM
policies
WINDOW
event_order AS (
PARTITION BY event_id
ORDER BY create_date
)
)
SELECT
e.id,
e.session,
e.event_date,
p.policy,
p.create_date
FROM
policy_range AS p
INNER JOIN
events AS e
ON e.id = p.event_id
AND e.event_date > p.create_date
AND e.event_date <= p.expiry_date
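The LEAD() idea — turning each policy into a validity range and joining events into that range — can be sketched in Python, assuming `(id, session, event_date)` and `(event_id, policy, create_date)` tuples with ISO date strings (which compare correctly as strings):

```python
# Mirror the LEAD() approach: give each policy a [create_date, expiry_date)
# validity range, then match events whose date falls inside a range.
from collections import defaultdict

def policies_for_events(events, policies):
    """events: (id, session, event_date); policies: (event_id, policy, create_date)."""
    by_id = defaultdict(list)
    for event_id, policy, create_date in policies:
        by_id[event_id].append((create_date, policy))
    ranges = defaultdict(list)
    for event_id, rows in by_id.items():
        rows.sort()
        for i, (start, policy) in enumerate(rows):
            # LEAD(create_date) is the next policy's start; "9999-12-31" if none
            end = rows[i + 1][0] if i + 1 < len(rows) else "9999-12-31"
            ranges[event_id].append((start, end, policy))
    out = []
    for eid, session, event_date in events:
        for start, end, policy in ranges[eid]:
            # event_date > create_date AND event_date <= expiry_date, as in the join
            if start < event_date <= end:
                out.append((eid, session, policy, start))
    return out

events = [(1, "A", "2021-11-05"), (1, "B", "2021-12-17"), (2, "A", "2021-08-13")]
policies = [(1, "foo", "2021-01-01"), (1, "bar", "2021-12-01"), (2, "foo", "2021-02-01")]
print(policies_for_events(events, policies))
# → [(1, 'A', 'foo', '2021-01-01'), (1, 'B', 'bar', '2021-12-01'), (2, 'A', 'foo', '2021-02-01')]
```

Because the ranges are disjoint, each event matches at most one policy, which is what removes the N:N blow-up of the original join.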

Write a SQL Query to find different users linked to same User ID

So I'm currently on a project where I'm working with multiple sources, and one of them is SAP data.
I need to return "duplicates" in essence and find all the different users, that are linked to the same SAP User ID. There are entries that are valid however, as the data describes access roles to the different SAP systems. So it is normal if the same user occurs more than once. But I need to find where there is a different name assigned to the same User ID.
This is what I currently have:
select *
from (
select *,
row_number() over (partition by FULL_NAME order by USER_ID) as row_number
from SAP_TABLE
) as rows order by USER_ID desc
Any help would be appreciated. Thanks!
You would partition by the user_id
select *
from (
select *,
count(distinct (full_name)) over (partition by user_id) as rnk
from SAP_TABLE
) as rows
where rnk>1
order by USER_ID desc
Are you looking for this?
select count(distinct FULL_NAME),
USER_ID
from SAP_TABLE
group by USER_ID
having count(distinct FULL_NAME) > 1
You can use this piece of code to find all the user id's that occur more than once.
SELECT USER_ID, COUNT(*)
FROM SAP_TABLE
GROUP BY 1
HAVING COUNT(*) > 1;

Generic SQL Question (Big Query) - removing rows after a date that is different for each customer

I want to remove all customer hits that I see on my site after they have registered. However, not all customers will register on the same day, so I cannot simply filter on a specific date. I have a registration indicator of 1 or 0 and a hit timestamp, along with unique identifiers for the specific customers. I have tried this:
rank() over (partition by customer_id, registration_ind order by hit_timestamp asc) rnk
However, this still partitions by customer and isn't working for what I want.
Any help please?
Thanks
Is this what you want?
select t.*
from (select t.*,
min(case when registration_ind = 1 then hit_timestamp end) over (partition by customer_id) as registration_timestamp
from t
) t
where registration_timestamp is null or
hit_timestamp < registration_timestamp;
It returns all rows before the first registration timestamp.
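The same per-customer "first registration" cutoff can be sketched in Python, assuming rows as `(customer_id, registration_ind, hit_timestamp)` tuples (an illustrative shape):

```python
# Keep only hits strictly before each customer's first registration,
# mirroring MIN(case when registration_ind = 1 then hit_timestamp end)
# OVER (partition by customer_id) followed by the filter.
def hits_before_registration(rows):
    """rows: list of (customer_id, registration_ind, hit_timestamp) tuples."""
    first_reg = {}
    for cust, reg, ts in rows:
        if reg == 1 and (cust not in first_reg or ts < first_reg[cust]):
            first_reg[cust] = ts
    return [
        (cust, reg, ts) for cust, reg, ts in rows
        # customers who never registered are kept in full
        if cust not in first_reg or ts < first_reg[cust]
    ]

rows = [
    ("c1", 0, 1), ("c1", 1, 2), ("c1", 0, 3),  # hit at t=3 is dropped
    ("c2", 0, 5),                               # never registered: kept
]
print(hits_before_registration(rows))  # → [('c1', 0, 1), ('c2', 0, 5)]
```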

How to find total sessions played in BigQuery?

How do I find the total number of sessions played by all users in a one-month time frame? The user_engagement event has a session_count parameter which increments on each session. The issue is that a user who plays 10 sessions will have session counts 1 to 10. So how am I supposed to add only the max session count (i.e. 10 in this instance), and similarly for all users?
SELECT
SUM(session_count) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
FROM
`xyz.analytics_111.events_*`
WHERE
event_name = "user_engagement" AND (_TABLE_SUFFIX BETWEEN "20200201" AND "20200229")
AND platform = "ANDROID"
Try below (BigQuery Standard SQL)
#standardSQL
SELECT
SUM(session_count) AS total_sessions,
COUNT(user_pseudo_id) AS users
FROM (
SELECT user_pseudo_id, MAX(session_count) session_count
FROM `xyz.analytics_111.events_*`
WHERE event_name = "user_engagement"
AND _TABLE_SUFFIX BETWEEN "20200201" AND "20200229"
AND platform = "ANDROID"
GROUP BY user_pseudo_id
)
I am unclear on what your data looks like. If there is one row per session, then you can simply use:
SELECT COUNT(*) AS total_sessions,
COUNT(DISTINCT user_pseudo_id) AS users
. . .
If you can have multiple events per session, you can use a hacky approach:
SELECT COUNT(DISTINCT CONCAT(user_pseudo_id, ':', CAST(session_count as string)))
I offer this because sometimes, in a complex query, it is simpler to just tweak a single expression. Otherwise, Mikhail's solution is reasonable.
However, I would suggest window functions instead:
SELECT SUM(CASE WHEN seqnum = 1 THEN session_count END) AS total_sessions,
       COUNT(DISTINCT user_pseudo_id) AS users
FROM (SELECT e.*,
             ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_count DESC) as seqnum
      FROM `xyz.analytics_111.events_*` e
      WHERE e.event_name = 'user_engagement' AND
            _TABLE_SUFFIX BETWEEN '20200201' AND '20200229' AND
            platform = 'ANDROID'
     ) e;
The reason I recommend this is that you can keep the rest of the calculations without changing them. That is handy in a complex query.
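All of these answers reduce to the same aggregation: take each user's maximum session_count, then sum those maxima. A minimal Python sketch, assuming rows as `(user_pseudo_id, session_count)` tuples (an illustrative shape):

```python
# Total sessions = sum over users of each user's maximum session_count,
# mirroring the GROUP BY user_pseudo_id / MAX(session_count) subquery.
def total_sessions(rows):
    """rows: list of (user_pseudo_id, session_count) tuples.
    Returns (total_sessions, distinct_users)."""
    max_per_user = {}
    for user, count in rows:
        max_per_user[user] = max(count, max_per_user.get(user, 0))
    return sum(max_per_user.values()), len(max_per_user)

rows = [("u1", 1), ("u1", 2), ("u1", 10), ("u2", 1), ("u2", 3)]
print(total_sessions(rows))  # → (13, 2)
```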