Optimizing sum() over(order by...) clause throwing 'resources exceeded' error - sql

I'm computing a sessions table from event data from our website in BigQuery. The events table has around 12 million events (pretty small). After I add in the logic to create sessions, I want to sum all sessions and assign a global_session_id. I'm doing that using a sum() over (order by ...) clause, which throws a "resources exceeded" error. I know that the order by clause without a partition forces all the data to be processed on a single node, and that this is what exhausts the compute resources, but I'm not sure what changes I can make to my code to achieve the same result. Any workarounds, advice, or explanations are greatly appreciated.
with sessions_1 as ( /* Tie a visitor's last event and last campaign to current event. */
select visitor_id as session_user_id,
sent_at,
context_campaign_name,
event,
id,
LAG(sent_at,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event,
LAG(context_campaign_name,1) OVER (PARTITION BY visitor_id ORDER BY sent_at) as last_event_campaign_name
from tracks_2
),
sessions_2 as ( /* Flag events that begin a new session. */
select *,
case
when context_campaign_name != last_event_campaign_name
or context_campaign_name is null and last_event_campaign_name is not null
or context_campaign_name is not null and last_event_campaign_name is null
then 1
when unix_seconds(sent_at)
- unix_seconds(last_event) >= (60 * 30)
or last_event is null
then 1
else 0
end as is_new_session
from sessions_1
),
sessions_3 as ( /* Assign events sessions numbers for total sessions and total user sessions. */
select id as event_id,
sum(is_new_session) over (order by session_user_id, sent_at) as global_session_id
#sum(is_new_session) over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
)
select * from sessions_3

It might help if you defined a CTE with just the sessions, rather than at the event level. If this works:
select session_user_id, sent_at,
row_number() over (order by session_user_id, sent_at) as global_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id, sent_at;
If that works, you can join this back to the original event-level data and then use a max() window function to assign it to all events. Something like:
select e.*,
max(s.global_session_id) over (partition by e.session_user_id order by e.event_at) as global_session_id
from events e left join
(<above query>) s
on s.session_user_id = e.session_user_id and s.sent_at = e.event_at;
If that doesn't work, you can construct the global id from a per-user session number plus a per-user offset:
select us.*, us.user_session_id + s.offset as global_session_id
from (select session_user_id, sent_at,
row_number() over (partition by session_user_id order by sent_at) as user_session_id
from materialized_result_of_sessions_2_query
where is_new_session = 1
) us join
(select session_user_id, count(*) as cnt,
sum(count(*)) over (order by session_user_id) - count(*) as offset
from materialized_result_of_sessions_2_query
where is_new_session = 1
group by session_user_id
) s
on us.session_user_id = s.session_user_id;
This might still fail if almost all users are unique and the sessions are short.
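To see why the offset construction gives the same numbering as one global window, here is a minimal sketch run against SQLite through Python's sqlite3 module rather than BigQuery. The session_starts table and its rows are made up for illustration, the offset column is renamed session_offset, and the per-user count is computed in a subquery first (equivalent to sum(count(*)) over (...) in the query above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per session start (i.e. rows where is_new_session = 1).
conn.execute("create table session_starts (session_user_id text, sent_at text)")
conn.executemany(
    "insert into session_starts values (?, ?)",
    [
        ("a", "2020-01-01 10:00"), ("a", "2020-01-01 12:00"),
        ("b", "2020-01-01 09:00"),
        ("c", "2020-01-02 08:00"), ("c", "2020-01-02 11:00"), ("c", "2020-01-03 07:00"),
    ],
)

# Per-user session number plus the count of sessions belonging to all earlier users.
# Only the small per-user aggregate needs a global ordering.
offset_sql = """
select us.session_user_id, us.sent_at,
       us.user_session_id + s.session_offset as global_session_id
from (select session_user_id, sent_at,
             row_number() over (partition by session_user_id order by sent_at) as user_session_id
      from session_starts
     ) us join
     (select session_user_id,
             sum(cnt) over (order by session_user_id) - cnt as session_offset
      from (select session_user_id, count(*) as cnt
            from session_starts
            group by session_user_id) c
     ) s
     on us.session_user_id = s.session_user_id
order by us.session_user_id, us.sent_at
"""

# Reference: the single global window that the question wants to avoid.
direct_sql = """
select session_user_id, sent_at,
       row_number() over (order by session_user_id, sent_at) as global_session_id
from session_starts
order by session_user_id, sent_at
"""

offset_rows = conn.execute(offset_sql).fetchall()
direct_rows = conn.execute(direct_sql).fetchall()
print(offset_rows == direct_rows)
```

The partitioned row_number() can be spread across workers, and the unpartitioned window only sees one row per user, which is why this tends to be lighter than ordering every session globally.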

Related

SQL Server LEAD function

-- FIRST LOGIN DATE
WITH CTE_FIRST_LOGIN AS
(
SELECT
PLAYER_ID, EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE ASC) AS RN
FROM
ACTIVITY
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS
(
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM
ACTIVITY A
JOIN
CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE
NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY
A.PLAYER_ID
)
-- FRACTION
SELECT
NULLIF(ROUND(1.00 * COUNT(CTE_CONSEC.PLAYER_ID) / COUNT(DISTINCT PLAYER_ID), 2), 0) AS FRACTION
FROM
ACTIVITY
JOIN
CTE_CONSEC_PLAYERS CTE_CONSEC ON CTE_CONSEC.PLAYER_ID = ACTIVITY.PLAYER_ID
I am getting the following error when I run this query.
[42S22] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column name 'NEXT_DATE'. (207) (SQLExecDirectW)
This is a leetcode medium question 550. Game Play Analysis IV. I wanted to know why it can't identify the column NEXT_DATE here and what am I missing? Thanks!
The problem is in this CTE:
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM ACTIVITY A
JOIN CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY A.PLAYER_ID
)
Note that you are creating NEXT_DATE as a column alias in this CTE but also referring to it in the WHERE clause of the same SELECT. This is invalid because, by SQL's logical clause ordering, WHERE is evaluated before the SELECT list, so the NEXT_DATE alias does not exist yet; within the same query level an alias can only be referenced in the ORDER BY clause, which is the last evaluated clause in a SQL query or subquery. You don't have an ORDER BY clause in this subquery, so the NEXT_DATE alias is visible only to [sub]queries that both come after and reference your CTE_CONSEC_PLAYERS CTE.
To fix this you'd probably want two CTEs like this (untested):
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS_pre AS (
SELECT
A.PLAYER_ID,
C.RN,
A.EVENT_DATE,
-- partition by player so LEAD returns that player's next login date
LEAD(A.EVENT_DATE,1) OVER (PARTITION BY A.PLAYER_ID ORDER BY A.EVENT_DATE) NEXT_DATE
FROM ACTIVITY A
-- match on the date as well, so each ACTIVITY row picks up its own RN
JOIN CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID AND A.EVENT_DATE = C.EVENT_DATE
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
MAX(NEXT_DATE) AS NEXT_DATE
FROM CTE_CONSEC_PLAYERS_pre
WHERE NEXT_DATE = DATEADD(DAY, 1, EVENT_DATE) AND RN = 1
GROUP BY PLAYER_ID
)
You gave every table an alias (for example JOIN CTE_FIRST_LOGIN C has the alias C), and every column access is via the alias. You need to add the correct alias from the correct table to NEXT_DATE.
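Your primary issue is that NEXT_DATE is an alias for a window function result, and therefore cannot be referred to in the WHERE clause of the same query level because of SQL's order of operations: WHERE is evaluated before window functions are computed.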
But it seems this query is over-complicated.
The problem to be solved appears to be: how many players logged in the day after they first logged in, as a percentage of all players.
This can be done in a single pass (no joins), by using multiple window functions together:
WITH CTE_FIRST_LOGIN AS (
SELECT
PLAYER_ID,
EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS RN,
-- if EVENT_DATE is a datetime and can have multiple per day then group by CAST(EVENT_DATE AS date) first
LEAD(EVENT_DATE, 1) OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS NextDate
FROM ACTIVITY
),
BY_PLAYERS AS (
SELECT
c.PLAYER_ID,
SUM(CASE WHEN c.RN = 1 AND c.NextDate = DATEADD(DAY, 1, c.EVENT_DATE)
THEN 1 END) AS IsConsecutive
FROM CTE_FIRST_LOGIN AS c
GROUP BY c.PLAYER_ID
)
SELECT ROUND(
1.00 *
COUNT(c.IsConsecutive) /
NULLIF(COUNT(*), 0)
,2) AS FRACTION
FROM BY_PLAYERS AS c;
You could theoretically merge BY_PLAYERS into the outer query and use COUNT(DISTINCT ...), but splitting them feels cleaner.
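A minimal way to try the single-pass idea without a SQL Server instance is to run the same shape of query in SQLite through Python's sqlite3 module. This is only a sketch: the sample rows are made up, dates are stored as ISO text, and SQLite's date(x, '+1 day') stands in for DATEADD(DAY, 1, x).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table activity (player_id integer, event_date text)")
# Made-up sample: players 1 and 4 come back the day after their first login.
conn.executemany(
    "insert into activity values (?, ?)",
    [
        (1, "2016-03-01"), (1, "2016-03-02"),
        (2, "2017-06-25"),
        (3, "2016-03-02"), (3, "2018-07-03"),
        (4, "2020-01-05"), (4, "2020-01-06"), (4, "2020-01-07"),
    ],
)

fraction_sql = """
with first_login as (
    select player_id, event_date,
           row_number() over (partition by player_id order by event_date) as rn,
           lead(event_date) over (partition by player_id order by event_date) as next_date
    from activity
),
by_players as (
    -- one row per player; is_consecutive stays NULL when the condition never holds
    select player_id,
           sum(case when rn = 1 and next_date = date(event_date, '+1 day')
                    then 1 end) as is_consecutive
    from first_login
    group by player_id
)
select round(1.0 * count(is_consecutive) / nullif(count(*), 0), 2) as fraction
from by_players
"""

fraction = conn.execute(fraction_sql).fetchone()[0]
print(fraction)
```

Both window functions are computed in the first CTE and only filtered in the next one, which is the same trick that avoids the "Invalid column name" error.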

What is the difference in syntax between the following queries?

I have a huge table of policies and I need to find all policies with invalid movements. For example, if inforce - premium_paid to terminated - premium_paid is invalid, then I would need to find all policies with this movement.
My query was initially as follows:
SELECT *,
LEAD(STAT) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) AS NEXT_STAT,
LEAD(EVENT) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) AS NEXT_EVENT
FROM TABLE
WHERE STAT = 'inforce'
AND EVENT = 'premium_paid'
AND NEXT_STAT = 'terminated'
AND NEXT_EVENT = 'premium_paid'
ORDER BY STAT, EVENT, NEXT_STAT, NEXT_EVENT
However, when I ran it, the compiler said that my column names 'NEXT_STAT' and 'NEXT_EVENT' were invalid. Then, when I tweaked it to the following, it worked:
SELECT *
FROM (
SELECT *,
LEAD(STAT) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) AS NEXT_STAT,
LEAD(EVENT) OVER (PARTITION BY ID ORDER BY PROCDT, PROCTIME) AS NEXT_EVENT
FROM TABLE) AS a
WHERE a.STAT = 'inforce'
AND a.EVENT = 'premium_paid'
AND a.NEXT_STAT = 'terminated'
AND a.NEXT_EVENT = 'premium_paid'
ORDER BY a.STAT, a.EVENT, a.NEXT_STAT, a.NEXT_EVENT
Thus, I am just curious to know why my initial query did not work.

How to find the max time difference in PostgreSQL that a value stayed on the same state?

I am working on a university project and this came up:
I have a table like this:
And I want to get the max duration that an actuator stayed in one state. For example, cool0 was in state False for 18 minutes.
The result table should look like this:
NAME COOL0
State False
Duration 18
This is a gaps and islands problem. Your data is a bit hard to follow, but I think:
select actuator, state, min(actuator_time), max(actuator_time)
from (select t.*,
row_number() over (partition by actuator order by actuator_time) as seqnum,
row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
from t
) t
group by actuator, state, (seqnum - seqnum_s)
For the maximum per actuator, use distinct on:
select distinct on (actuator) actuator, state, min(actuator_time), max(actuator_time)
from (select t.*,
row_number() over (partition by actuator order by actuator_time) as seqnum,
row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
from t
) t
group by actuator, state, (seqnum - seqnum_s)
order by actuator, max(actuator_time) - min(actuator_time) desc;
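For anyone who wants to watch the row_number() difference trick at work, here is a small sketch in SQLite via Python's sqlite3 module. The assumptions are mine: actuator_time is stored as integer minutes so the subtraction is trivial, the rows are invented, and because SQLite has no DISTINCT ON, a row_number() filter picks the longest run per actuator instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: actuator_time is minutes since some start point.
conn.execute("create table t (actuator text, state text, actuator_time integer)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        ("cool0", "True", 0), ("cool0", "False", 5), ("cool0", "False", 10),
        ("cool0", "False", 23), ("cool0", "True", 30), ("cool0", "True", 32),
        ("heat0", "True", 0), ("heat0", "True", 7), ("heat0", "False", 9),
    ],
)

longest_sql = """
with islands as (
    -- seqnum - seqnum_s is constant within a run of consecutive rows sharing a state
    select actuator, state,
           max(actuator_time) - min(actuator_time) as duration
    from (select t.*,
                 row_number() over (partition by actuator order by actuator_time) as seqnum,
                 row_number() over (partition by actuator, state order by actuator_time) as seqnum_s
          from t
         ) numbered
    group by actuator, state, seqnum - seqnum_s
),
ranked as (
    select islands.*,
           row_number() over (partition by actuator order by duration desc) as rn
    from islands
)
select actuator, state, duration
from ranked
where rn = 1
order by actuator
"""

longest = conn.execute(longest_sql).fetchall()
print(longest)
```

Note that the duration here runs from the first to the last row of a run; if a state should count until the next state change, the end of each island would have to come from the following row instead.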

How to generate session_id by sql?

My tracking system does not generate session IDs.
I have user_id & event_date_time.
I need a new session_id for each user's session that starts 30 minutes or more after last event_date_time of each user.
My final goal is to calculate median session time.
I tried to generate session_id=1 and session_id=2 once event_date_time-next_event_time>30 and guid=guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_guid,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null
If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
Compare neighbouring rows: if the guids differ or the time difference is 30 minutes or more, assign 1, otherwise 0. Then sum these values analytically.
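The flag-then-running-sum idea can be checked on a toy table. The sketch below uses SQLite through Python's sqlite3 module instead of Oracle, so the date arithmetic is different: timestamps are ISO text and strftime('%s', ...) converts them to epoch seconds. The sample rows are invented. Because the window is already partitioned by guid, the first event of each user has a NULL lag and falls into the else branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table event_table (id integer, guid text, event_date text)")
conn.executemany(
    "insert into event_table values (?, ?, ?)",
    [
        (1, "u1", "2021-05-01 10:00:00"),
        (2, "u1", "2021-05-01 10:10:00"),
        (3, "u1", "2021-05-01 10:45:00"),  # 35 minutes after the previous event
        (4, "u1", "2021-05-01 10:50:00"),
        (5, "u2", "2021-05-01 09:00:00"),
        (6, "u2", "2021-05-01 11:00:00"),  # 2 hours after the previous event
    ],
)

sessions_sql = """
select id, guid, event_date,
       sum(chg) over (partition by guid order by event_date) as session_id
from (select id, guid, event_date,
             -- chg = 1 marks the first event of a session
             case when cast(strftime('%s', event_date) as integer)
                     - cast(strftime('%s', lag(event_date)
                                over (partition by guid order by event_date)) as integer)
                     < 30 * 60
                  then 0 else 1
             end as chg
      from event_table) a
order by guid, event_date
"""

rows = conn.execute(sessions_sql).fetchall()
for row in rows:
    print(row)
```

The session_id produced this way restarts at 1 for every guid, so (guid, session_id) together identify a session; grouping by that pair and taking max(event_date) - min(event_date) would give the session lengths needed for a median.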
I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userId, f.event_date
);
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp a;

How to return all the rows in the yellow census blocks?

Hey, the schema is like this: for the whole dataset, we order by machine_id first, then by ss2k. After that, for each machine, we should find all the rows that belong to a run of at least 5 consecutive rows with flag = 'census'. In this dataset, the result should be all the yellow rows.
I cannot return the last 4 rows of the yellow blocks by using this:
drop table if exists qz_panel_census_228_rank;
create table qz_panel_census_228_rank as
select t.*
from (select t.*,
count(*) filter (where flag = 'census') over (partition by machine_id, date order by ss2k rows between current row and 4 following) as census_cnt5,
count(*) filter (where flag = 'census') over (partition by machine_id, date) as count_census,
row_number() over (partition by machine_id, date order by ss2k) as seqnum,
count(*) over (partition by machine_id, date) as cnt
from qz_panel_census_228 t
) t
where census_cnt5 = 5
group by 1,2,3,4,5,6,7,8,9,10,11
DISTRIBUTED BY (machine_id);
You were close, but you need to search in both directions:
select t.*
from (select t.*,
case when count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between 4 preceding and current row) = 5
or count(*) filter (where flag = 'census')
over (partition by machine_id, date
order by ss2k
rows between current row and 4 following) = 5
then 1
else 0
end as flag
from qz_panel_census_228 t
) t
where flag = 1
Edit:
This approach will not work unless you add an extra count for each possible 5 row window, e.g. 3 preceding and 1 following, 2 preceding and 2 following, etc. This results in ugly code and is not very flexible.
The common way to solve this gaps & islands problem is to assign consecutive rows to a common group first:
select *
from
(
select t2.*,
count(*) over (partition by machine_id, date, grp) as cnt
from
(
select t1.*
from (select t.*,
-- keep the same number for 'census' rows
sum(case when flag = 'census' then 0 else 1 end)
over (partition by machine_id, date
order by ss2k
rows unbounded preceding) as grp
from qz_panel_census_228 t
) t1
where flag = 'census' -- only census rows
) as t2
) t3
where cnt >= 5 -- only groups of at least 5 census rows
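Here is a small sketch of the grouping approach on invented data, run in SQLite through Python's sqlite3 module. To keep it short, the table is called panel and the date column is left out, so the partition is by machine_id only; both are simplifications of mine, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table panel (machine_id integer, ss2k integer, flag text)")

sample = []
# machine 1: five census rows, one other row, then only three census rows
for ss2k, flag in enumerate(["census"] * 5 + ["other"] + ["census"] * 3, start=1):
    sample.append((1, ss2k, flag))
# machine 2: one other row followed by six census rows
for ss2k, flag in enumerate(["other"] + ["census"] * 6, start=1):
    sample.append((2, ss2k, flag))
conn.executemany("insert into panel values (?, ?, ?)", sample)

runs_sql = """
select machine_id, ss2k
from (select t2.*,
             count(*) over (partition by machine_id, grp) as cnt
      from (select t1.*
            from (select t.*,
                         -- grp only increases on non-census rows, so a run of
                         -- consecutive census rows shares one grp value
                         sum(case when flag = 'census' then 0 else 1 end)
                             over (partition by machine_id
                                   order by ss2k
                                   rows between unbounded preceding and current row) as grp
                  from panel t
                 ) t1
            where flag = 'census'
           ) t2
     ) t3
where cnt >= 5
order by machine_id, ss2k
"""

census_rows = conn.execute(runs_sql).fetchall()
print(census_rows)
```

The three trailing census rows of machine 1 are dropped because their group has only three members, while every row of a qualifying run is returned, including the last four that the fixed-size window in the question missed.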
Wow, there has to be a better way of doing this, but the only way I could figure out was to create blocks of consecutive 'census' values. This looks awful but might be a catalyst to a better idea.
with q1 as (
select
machine_id, recorded, ss2k, flag, date,
case
when flag = 'census' and
lag (flag) over (order by machine_id, ss2k) != 'census'
then 1
else 0
end as block
from foo
),
q2 as (
select
machine_id, recorded, ss2k, flag, date,
sum (block) over (order by machine_id, ss2k) as group_id,
case when flag = 'census' then 1 else 0 end as census
from q1
),
q3 as (
select
machine_id, recorded, ss2k, flag, date, group_id,
sum (census) over (partition by group_id order by ss2k) as max_count
from q2
),
groups as (
select group_id
from q3
group by group_id
having max (max_count) >= 5
)
select
q2.machine_id, q2.recorded, q2.ss2k, q2.flag, q2.date
from
q2
join groups g on q2.group_id = g.group_id
where
q2.flag = 'census'
If you run each query within the with clauses in isolation, I think you will see how this evolves.