In my problem, I want to be able to track whether a state has shifted from 04a. Lapsing - Lowering Engagement to 03d. Engaged - Very High after trigger_send_date has occurred.
I believe a window function is required that checks whether the state is 04a. Lapsing - Lowering Engagement before trigger_send_date and then measures whether it changes after trigger_send_date, but I can't figure out how to write this. I made a start below, but have difficulty continuing!
Ideally I'd like a new column that is True/False as to whether that switch has occurred within 31 days of trigger_send_date.
SELECT
cust_id,
state_date,
trigger_send_date,
total_state,
IF (
total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Trying to write this column
sum(IF (
    trigger_send_date >= state_date
    AND total_state IN ("04a. Lapsing - Lowering Engagement"),
    1,
    null
)) OVER (
    PARTITION BY cust_id
    ORDER BY
        state_date
) as lapsed_and_returned_within_31_days
FROM
base
ORDER BY
state_date,
trigger_send_date
Does anyone have any tips to help me write this?
This is what my table looks like with expected result as right-most column if it helps!
Let me preface my answer by saying that I don't have access to Spark SQL, so the below is written in MySQL (it would probably work in SQL Server as well). I've had a look at the docs and the window frame should still work; you might need to make some tweaks.
The window frame tells the window function which rows to look at: by including UNBOUNDED PRECEDING you're telling the function to include every row before the current row, and by using UNBOUNDED FOLLOWING you're telling it to include every row after the current row.
I tried to include another test, for a customer that was engaged before the trigger date and it seems to work. Obviously if you provided some sample data we could test further.
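As a minimal sketch of that frame behaviour (using Python's sqlite3 rather than MySQL, with made-up dates), a MIN over ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING sees the whole partition regardless of which row is current:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base (cust_id INTEGER, state_date TEXT);
INSERT INTO base VALUES (1, '2022-03-07'), (1, '2022-03-13'), (1, '2022-03-19');
""")

# With an unbounded frame in both directions, every row in the partition
# sees the same MIN, independent of the current row's position.
rows = con.execute("""
SELECT state_date,
       MIN(state_date) OVER (
           PARTITION BY cust_id
           ORDER BY state_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS first_date
FROM base
ORDER BY state_date
""").fetchall()
for r in rows:
    print(r)
```

Every row reports '2022-03-07' as first_date, which is why the answer below can pick up a post-trigger date from any row in the partition.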
DROP TABLE IF EXISTS Base;
CREATE TABLE Base
(
cust_id BIGINT,
state_date DATE,
trigger_send_date DATE,
total_state VARCHAR(256)
);
INSERT INTO Base (cust_id,state_date, trigger_send_date, total_state) VALUES
(9177819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-03-13','2022-03-14','04a. Lapsing - Lowering Engagement'),
(9177819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(9177819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-07','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-10','2022-03-14','04a. Lapsing - Lowering Engagement'),
(819375032029782,'2022-03-11','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-03-19','2022-03-14','03d. Engaged - Very High'),
(819375032029782,'2022-05-07','2022-03-14','03d. Engaged - Very High');
With LapsedCTE AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY state_date DESC) AS `RNum`
FROM Base
WHERE state_date <= trigger_send_date
AND LEFT(total_state, 3) IN ('03d','04a')
)
SELECT b.cust_id, b.state_date, b.trigger_send_date, b.total_state,
IF (
b.total_state IN ("04a. Lapsing - Lowering Engagement"),
True,
False
) as lapse,
-- Here we find the MIN engaged date (you can add other states if needed) AFTER the trigger date.
-- Then we compare that to the trigger_send_date from the list of customers that were lapsed prior to
-- the trigger_send_date (this will be empty for non-lapsed customers, so it will default to 0 in our results column).
-- Then we do a DATEDIFF between the trigger date and the engaged date; if the value is less than or
-- equal to 31 days, Robert is your Mother's Brother.
IF(DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), l.trigger_send_date) <= 31, 1, 0) AS `lapsed_and_returned_within_31_days`
-- Here's some other stuff just to show you the inner working of the above
/*
DATEDIFF(
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), b.trigger_send_date) AS `engaged_time_lag_days`,
MIN(IF(b.state_date > b.trigger_send_date AND LEFT(b.total_state, 3) IN ('03d'), b.state_date, NULL))
OVER (PARTITION BY b.cust_id ORDER BY b.state_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `first_engaged_date_after_trigger`
*/
FROM Base b
LEFT JOIN LapsedCTE l ON l.cust_id = b.cust_id AND l.RNum = 1 AND LEFT(l.total_state, 3) IN ('04a');
It would be possible to remove the CTE if you need to; it just makes things a bit cleaner.
Here's a runnable DBFiddle just in case you don't have access to a MySQL database.
Related
I have a table with equipment failure and resolved date. Until the failure is resolved, entries for each day will show as failed. Once the issue is resolved data will start from the next failure date. Below is an example
I want an output which will give me the first failure time for each resolved timestamp like
I tried to do a left join between the resolved timestamp and the failure dates and take the min, but that doesn't work.
Consider the approach below
select type,
max(timestamp) resolved_timestamp,
min(timestamp) first_failure_timestamp
from (
select *, countif(status='resolved') over win as grp
from your_table
window win as (partition by type order by timestamp rows between unbounded preceding and 1 preceding)
)
group by type, grp
If applied to the sample data in your question, the output is
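The grouping trick behind that query can be sketched in portable SQL (here via Python's sqlite3 with hypothetical sample data, using COUNT(CASE ...) in place of BigQuery's countif): a running count of 'resolved' rows over a frame ending at the previous row gives each failure run its own group number:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (type TEXT, ts TEXT, status TEXT);
INSERT INTO events VALUES
  ('A', '2023-01-01', 'failed'),
  ('A', '2023-01-02', 'failed'),
  ('A', '2023-01-03', 'resolved'),
  ('A', '2023-01-05', 'failed'),
  ('A', '2023-01-06', 'resolved');
""")

# Each row's grp counts the 'resolved' rows strictly before it, so a
# failure run and its resolving row share one group; MIN(ts) per group is
# the first failure and MAX(ts) is the resolution.
rows = con.execute("""
SELECT type, MAX(ts) AS resolved_ts, MIN(ts) AS first_failure_ts
FROM (
    SELECT *,
           COUNT(CASE WHEN status = 'resolved' THEN 1 END) OVER (
               PARTITION BY type ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS grp
    FROM events
)
GROUP BY type, grp
ORDER BY first_failure_ts
""").fetchall()
print(rows)
```

COUNT is used instead of SUM here because COUNT over an empty frame yields 0 rather than NULL, keeping the very first row in group 0.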
I am trying to form a query for returning, reactivated and WAU as defined below:
Returning WAU - active last week
WAU - not active last week, but active within last 30 days
Reactivated WAU – not seen in 30+ days
I have a table for the past 60 days containing cust_id and login date, but I can't get the lag function to work (Teradata ODBC connection). I keep getting this error:
[3706] Syntax error: Data Type "logindate" does not match a Defined
Type name. My format is: select .... lag(logindate, 1) over (partition
by cust_id order by 1 asc) as lag_ind from ( ....
Please help for the 3 cases above.
You can aggregate to get the expected answer:
select cust_id,
case
when max(logindate) > current_date - 7 -- active last week
then 'Returning WAU'
when max(logindate) > current_date - 30 -- not active last week, but active within last 30 days
then 'WAU'
else 'Reactivated WAU' -- not seen in 30+ days
end
from tab
group by 1
Regarding the issue with LAG: it was only introduced in Teradata 16.10, so on earlier releases you have to rewrite
lag(logindate, 1)
over (partition by cust_id
order by col asc) as lag_ind
as
max(logindate)
over (partition by cust_id
order by col asc
rows between 1 preceding and 1 preceding) as lag_ind
Hint: never use ORDER BY 1 in an OLAP function, here it's the literal value one and not the first column.
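The equivalence of the two forms can be sketched (in Python's sqlite3 with made-up login dates; Teradata syntax differs slightly): a MAX over a one-row frame ending at the previous row returns exactly the value of that single row, which is what LAG(x, 1) returns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE logins (cust_id INTEGER, logindate TEXT);
INSERT INTO logins VALUES (1, '2024-01-01'), (1, '2024-01-05'), (1, '2024-01-09');
""")

# LAG(x, 1) and MAX(x) over ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
# both see exactly one row: the previous one (NULL on the first row).
rows = con.execute("""
SELECT logindate,
       LAG(logindate, 1) OVER w AS via_lag,
       MAX(logindate) OVER (
           PARTITION BY cust_id ORDER BY logindate
           ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
       ) AS via_max
FROM logins
WINDOW w AS (PARTITION BY cust_id ORDER BY logindate)
ORDER BY logindate
""").fetchall()
for _, a, b in rows:
    assert a == b
print(rows)
```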
I have been working with window functions a fair amount but I don't think I understand enough about how they work to answer why they behave the way they do.
For the query that I was working on (below), why am I required to take my aggregated field and add it to the group by? (In the second half of my query below I am unable to produce a result if I don't include "Events" in my second group by)
With Data as (
Select
CohortDate as month
,datediff(week,CohortDate,EventDate) as EventAge
,count(distinct case when EventDate is not null then GUID end) as Events
From MyTable
where month >= [getdate():month] - interval '12 months'
group by 1, 2
order by 1, 2
)
Select
month
,EventAge
,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2
I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?
Thanks a ton!
What you are looking for is presumably a cumulative sum. That would be:
select month, EventAge,
sum(sum(Events)) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from data
group by 1, 2
order by 1, 2 ;
Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:
select me.*,
sum(sum_events) over (partition by month
order by SubAge asc
rows between unbounded preceding and current row
) as TotEvents
from (select month, EventAge, sum(events) as sum_events
from data
group by 1, 2
) me
order by 1, 2 ;
This is pretty much an exact shorthand for that query. The window function is evaluated after aggregation. You want to sum the SUM of the events after the aggregation; hence, you need sum(sum(events)). After the aggregation, events is no longer available.
The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.
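The subquery variant can be sketched concretely (in Python's sqlite3, with hypothetical data): aggregate first, then run the cumulative window over the grouped sums — the same thing sum(sum(events)) expresses in a single level:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data (month TEXT, EventAge INTEGER, events INTEGER);
INSERT INTO data VALUES
  ('2024-01', 0, 2), ('2024-01', 0, 3),
  ('2024-01', 1, 4),
  ('2024-02', 0, 5);
""")

# The inner query collapses rows per (month, EventAge); the outer window
# then accumulates those grouped sums within each month.
rows = con.execute("""
SELECT month, EventAge,
       SUM(sum_events) OVER (
           PARTITION BY month ORDER BY EventAge
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS TotEvents
FROM (
    SELECT month, EventAge, SUM(events) AS sum_events
    FROM data
    GROUP BY 1, 2
)
ORDER BY 1, 2
""").fetchall()
print(rows)
```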
My problem seems simple on paper:
For a given date, give me active users for that given date, active users in given_Date()-7, active users in a given_Date()-30
i.e. sample data.
"timestamp" "user_public_id"
"23-Sep-15" "805a47023fa611e58ebb22000b680490"
"28-Sep-15" "d842b5bc5b1711e5a84322000b680490"
"01-Oct-15" "ac6b5f70b95911e0ac5312313d06dad5"
"21-Oct-15" "8c3e91e2749f11e296bb12313d086540"
"29-Nov-15" "b144298810ee11e4a3091231390eb251"
For 01-Oct the count for today would be 1, last_7_days would be 3, and last_30_days would be 3+n (where n is the count of the user_ids that fall in dates preceding Oct 1st within a 30-day window).
I am on Amazon Redshift. Can somebody provide a sample SQL to help me get started?
The output should look like this:
"timestamp" "users_today", "users_last_7_days", "users_30_days"
"01-Oct-15" 1 3 (3+n)
I know posting incomplete solutions is frowned upon, but this is not getting any other attention, so I thought I would do my bit.
I have been pulling my hair out trying to nut this one out, alas, I am a beginner and something is not clicking for me. Perhaps yourself or others will be able to drastically improve my answer, but I think I am on the right track.
SELECT replace(convert(varchar, [timestamp], 111), '/','-') AS [timestamp], -- to get date in same format as you require
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE ([TIMESTAMP]) = ([timestamp])) AS users_today,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-7,[TIMESTAMP]) AND [TIMESTAMP]) AS users_last_7_days ,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-30,[TIMESTAMP]) AND [timestamp]) AS users_last_30_days
FROM #SIMPLE
GROUP BY [timestamp]
Starting with this:
CREATE TABLE #SIMPLE (
[timestamp] datetime, user_public_id varchar(32)
)
INSERT INTO #SIMPLE
VALUES('23-Sep-15','805a47023fa611e58ebb22000b680490'),
('28-Sep-15','d842b5bc5b1711e5a84322000b680490'),
('01-Oct-15','ac6b5f70b95911e0ac5312313d06dad5'),
('21-Oct-15','8c3e91e2749f11e296bb12313d086540'),
('29-Nov-15','b144298810ee11e4a3091231390eb251')
The problem I am having is that each row contains the same counts, despite my grouping by [timestamp].
Step 1 -- Create a table which has daily counts.
create temp table daily_mobile_Sessions as
select "timestamp" ,
count(user_public_id) over (partition by "timestamp" ) as "today"
from mobile_sessions
group by 1, mobile_sessions.user_public_id
order by 1 DESC
Step 2 -- From the table above. We create yet another table which can use the "today" field, and we apply the window function to Sum the counts.
select "timestamp", today,
sum(today) over (order by "timestamp" rows between 6 PRECEDING and CURRENT ROW) as "last_7days",
sum(today) over (order by "timestamp" rows between 29 PRECEDING and CURRENT ROW) as "last_30days"
from daily_mobile_Sessions group by "timestamp" , 2 order by 1 desc
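Step 2 can be sketched on a small scale (in Python's sqlite3 with made-up daily counts, using a 3-day window for brevity); note that ROWS BETWEEN n PRECEDING AND CURRENT ROW counts rows, not calendar days, so this assumes exactly one row per day:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily (day TEXT, today INTEGER);
INSERT INTO daily VALUES
  ('2015-10-01', 1), ('2015-10-02', 2), ('2015-10-03', 1),
  ('2015-10-04', 3), ('2015-10-05', 1);
""")

# Rolling sum over the current row plus the two preceding rows,
# i.e. a 3-day window when there is one row per day.
rows = con.execute("""
SELECT day, today,
       SUM(today) OVER (ORDER BY day
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS last_3days
FROM daily
ORDER BY day
""").fetchall()
print(rows)
```

If days can be missing from the table, a calendar table or a RANGE frame would be needed instead of a ROWS frame.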
Ok, initially this was just a joke we had with a friend of mine, but it turned into an interesting technical question :)
I have the following stuff table:
CREATE TABLE stuff
(
id serial PRIMARY KEY,
volume integer NOT NULL DEFAULT 0,
priority smallint NOT NULL DEFAULT 0
);
The table contains the records for all of my stuff, with respective volume and priority (how much I need it).
I have a bag with specified volume, say 1000. I want to select from the table all stuff I can put into a bag, packing the most important stuff first.
This seems like the case for using window functions, so here is the query I came up with:
select s.*, sum(volume) OVER previous_rows as total
from stuff s
where total < 1000
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
order by priority desc
The problem with it, however, is that Postgres complains:
ERROR: column "total" does not exist
LINE 3: where total < 1000
If I remove this filter, total column gets properly calculated, results properly sorted but all stuff gets selected, which is not what I want.
So, how do I do this? How do I select only items that can fit into the bag?
I don't know if this qualifies as "more elegant" but it is written in a different manner than Cybernate's solution (although it is essentially the same)
WITH window_table AS
(
SELECT s.*,
sum(volume) OVER previous_rows as total
FROM stuff s
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
)
SELECT *
FROM window_table
WHERE total < 1000
ORDER BY priority DESC
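As a runnable sketch of the CTE approach (in Python's sqlite3 with hypothetical volumes and priorities): the running total is materialized in the CTE, and only then can WHERE filter on it, which is exactly why the original query's `where total < 1000` failed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stuff (id INTEGER PRIMARY KEY, volume INTEGER, priority INTEGER);
INSERT INTO stuff (volume, priority) VALUES
  (400, 5), (500, 4), (300, 3), (200, 2);
""")

# The window alias 'total' only exists once the CTE's SELECT has run,
# so the outer WHERE can reference it safely.
rows = con.execute("""
WITH window_table AS (
    SELECT *,
           SUM(volume) OVER (ORDER BY priority DESC
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS total
    FROM stuff
)
SELECT id, volume, priority, total
FROM window_table
WHERE total < 1000
ORDER BY priority DESC
""").fetchall()
print(rows)
```

With a bag of 1000, the two highest-priority items (400 + 500 = 900) fit; the next item would push the running total to 1200, so it is excluded.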
If by "more elegant" you mean something that avoids the sub-select, then the answer is "no"
I haven't worked with PostgreSQL. However, my best guess would be using an inline view.
SELECT a.*
FROM (
SELECT s.*, sum(volume) OVER previous_rows AS total
FROM stuff AS s
WINDOW previous_rows AS (
ORDER BY priority desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY priority DESC
) AS a
WHERE a.total < 1000;