Tabulating Profit and Loss for Backtesting using BigQuery - google-bigquery

I have this BigQuery dataframe where a 1 in long_entry or short_entry represents entering a trade at that time with a corresponding long/short position, while a 1 in long_exit or short_exit means exiting a trade. I would like to have 2 new columns: one called long_pnl, which tabulates the PnL generated from individual long trades, and another called short_pnl, which tabulates the PnL generated from individual short trades.
There is at most 1 open trade/position at any point in time in this backtest.
Below is my dataframe. As we can see, a long trade is entered on 26/2/2019 and closed on 1/3/2019, and its PnL is $64.45 (4210.12 - 4145.67), while a short trade is entered on 4/3/2019 and closed on 5/3/2019 with a PnL of -$119.11 (4100.12 - 4219.23, a loss).
date price long_entry long_exit short_entry short_exit
0 24/2/2019 4124.25 0 0 0 0
1 25/2/2019 4130.67 0 0 0 0
2 26/2/2019 4145.67 1 0 0 0
3 27/2/2019 4180.10 0 0 0 0
4 28/2/2019 4200.05 0 0 0 0
5 1/3/2019 4210.12 0 1 0 0
6 2/3/2019 4198.10 0 0 0 0
7 3/3/2019 4210.34 0 0 0 0
8 4/3/2019 4100.12 0 0 1 0
9 5/3/2019 4219.23 0 0 0 1
I hope to have an output like this, with another column for short_pnl:
date price long_entry long_exit short_entry short_exit long_pnl
0 24/2/2019 4124.25 0 0 0 0 NaN
1 25/2/2019 4130.67 0 0 0 0 NaN
2 26/2/2019 4145.67 1 0 0 0 64.45
3 27/2/2019 4180.10 0 0 0 0 NaN
4 28/2/2019 4200.05 0 0 0 0 NaN
5 1/3/2019 4210.12 0 1 0 0 NaN
6 2/3/2019 4198.10 0 0 0 0 NaN
7 3/3/2019 4210.34 0 0 0 0 NaN
8 4/3/2019 4100.12 0 0 1 0 NaN
9 5/3/2019 4219.23 0 0 0 1 NaN

Below is for BigQuery Standard SQL
#standardSQL
WITH temp1 AS (
  SELECT PARSE_DATE('%d/%m/%Y', dt) dt, CAST(price AS NUMERIC) price, long_entry, long_exit, short_entry, short_exit
  FROM `project.dataset.table`
), temp2 AS (
  SELECT dt, price, long_entry, long_exit, short_entry, short_exit,
    -- running count of entries plus the lagged running count of exits:
    -- every row of one trade, from entry through exit, shares one group number
    SUM(long_entry) OVER(ORDER BY dt) + SUM(long_exit) OVER(ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) long_grp,
    SUM(short_entry) OVER(ORDER BY dt) + SUM(short_exit) OVER(ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) short_grp
  FROM temp1
)
SELECT dt, price, long_entry, long_exit, short_entry, short_exit,
  -- on the entry row only: exit price (the latest dt in the group) minus entry price
  IF(long_entry = 0, NULL,
    FIRST_VALUE(price) OVER(PARTITION BY long_grp ORDER BY dt DESC) -
    LAST_VALUE(price) OVER(PARTITION BY long_grp ORDER BY dt DESC)
  ) long_pnl,
  -- for shorts the sign flips: entry price minus exit price
  IF(short_entry = 0, NULL,
    LAST_VALUE(price) OVER(PARTITION BY short_grp ORDER BY dt DESC) -
    FIRST_VALUE(price) OVER(PARTITION BY short_grp ORDER BY dt DESC)
  ) short_pnl
FROM temp2
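To see why the grouping works, here are the long_grp values temp2 produces for the sample data (an illustration, not query output; the very first row's group is NULL because the lagged sum has no preceding rows, which is harmless since no entry occurs there):
dt          long_entry  long_exit  long_grp
2019-02-24  0           0          NULL
2019-02-25  0           0          0
2019-02-26  1           0          1    <- trade opens
2019-02-27  0           0          1
2019-02-28  0           0          1
2019-03-01  0           1          1    <- trade closes; group 1 = one whole trade
2019-03-02  0           0          2
All rows from entry to exit land in the same group, so FIRST_VALUE/LAST_VALUE over that partition pick out the exit and entry prices.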
Applying the above to the sample data in your question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '24/2/2019' dt, 4124.25 price, 0 long_entry, 0 long_exit, 0 short_entry, 0 short_exit UNION ALL
SELECT '25/2/2019', 4130.67, 0, 0, 0, 0 UNION ALL
SELECT '26/2/2019', 4145.67, 1, 0, 0, 0 UNION ALL
SELECT '27/2/2019', 4180.10, 0, 0, 0, 0 UNION ALL
SELECT '28/2/2019', 4200.05, 0, 0, 0, 0 UNION ALL
SELECT '1/3/2019', 4210.12, 0, 1, 0, 0 UNION ALL
SELECT '2/3/2019', 4198.10, 0, 0, 0, 0 UNION ALL
SELECT '3/3/2019', 4210.34, 0, 0, 0, 0 UNION ALL
SELECT '4/3/2019', 4100.12, 0, 0, 1, 0 UNION ALL
SELECT '5/3/2019', 4219.23, 0, 0, 0, 1
), temp1 AS (
SELECT PARSE_DATE('%d/%m/%Y', dt) dt, CAST(price AS numeric) price, long_entry, long_exit, short_entry, short_exit
FROM `project.dataset.table`
), temp2 AS (
SELECT dt, price, long_entry, long_exit, short_entry, short_exit,
SUM(long_entry) OVER(ORDER BY dt) + SUM(long_exit) OVER(ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) long_grp,
SUM(short_entry) OVER(ORDER BY dt) + SUM(short_exit) OVER(ORDER BY dt ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) short_grp
FROM temp1
)
SELECT dt, price, long_entry, long_exit, short_entry, short_exit,
IF(long_entry = 0, NULL,
FIRST_VALUE(price) OVER(PARTITION BY long_grp ORDER BY dt DESC) -
LAST_VALUE(price) OVER(PARTITION BY long_grp ORDER BY dt DESC)
) long_pnl,
IF(short_entry = 0, NULL,
LAST_VALUE(price) OVER(PARTITION BY short_grp ORDER BY dt DESC) -
FIRST_VALUE(price) OVER(PARTITION BY short_grp ORDER BY dt DESC)
) short_pnl
FROM temp2
-- ORDER BY dt
result will be
Row dt price long_entry long_exit short_entry short_exit long_pnl short_pnl
1 2019-02-24 4124.25 0 0 0 0 null null
2 2019-02-25 4130.67 0 0 0 0 null null
3 2019-02-26 4145.67 1 0 0 0 64.45 null
4 2019-02-27 4180.1 0 0 0 0 null null
5 2019-02-28 4200.05 0 0 0 0 null null
6 2019-03-01 4210.12 0 1 0 0 null null
7 2019-03-02 4198.1 0 0 0 0 null null
8 2019-03-03 4210.34 0 0 0 0 null null
9 2019-03-04 4100.12 0 0 1 0 null -119.11
10 2019-03-05 4219.23 0 0 0 1 null null
I feel there should be a "shorter" solution - but the above is still good enough to use, I think.
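For what it's worth, here is a sketch of one possibly shorter variant (not battle-tested; it leans on the at-most-one-open-position assumption and reuses temp1 from above): the exit price for an entry row is simply the next price at which the matching exit flag is 1, which FIRST_VALUE ... IGNORE NULLS can pick out directly.
#standardSQL
SELECT dt, price, long_entry, long_exit, short_entry, short_exit,
  IF(long_entry = 0, NULL,
    FIRST_VALUE(IF(long_exit = 1, price, NULL) IGNORE NULLS)
      OVER(ORDER BY dt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) - price
  ) long_pnl,
  IF(short_entry = 0, NULL,
    price - FIRST_VALUE(IF(short_exit = 1, price, NULL) IGNORE NULLS)
      OVER(ORDER BY dt ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
  ) short_pnl
FROM temp1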

Related

SQL decreasing sum by a percentage

I have a table like this:
timestamp    type  value
08.01.2023   1     5
07.01.2023   0     20
06.01.2023   1     1
05.01.2023   0     50
04.01.2023   0     50
03.01.2023   1     1
02.01.2023   1     1
01.01.2023   1     1
Type 1 means a deposit, type 0 means a withdrawal.
When the type is 1, the value is the exact amount the user deposited, so we can simply sum it; when the type is 0, the value is a withdrawal expressed as a percentage of the current total (e.g. with 3 deposited, a withdrawal of 50 leaves 1.5).
What I'm looking for is to create another column with the current deposited amount. For the example above it would look like this:
timestamp    type  value  deposited
08.01.2023   1     5      6.4
07.01.2023   0     20     1.4
06.01.2023   1     1      1.75
05.01.2023   0     50     0.75
04.01.2023   0     50     1.5
03.01.2023   1     1      3
02.01.2023   1     1      2
01.01.2023   1     1      1
I can't figure out how to build a sum like this that subtracts a percentage of the previous total.
You are trying to carry state over time, so you either need a UDTF to do the carry work for you, or a recursive CTE:
with data(transaction_date, type, value) as (
select to_date(column1, 'dd.mm.yyyy'), column2, column3
from values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
), pre_process_data as (
select *
,iff(type = 0, 0, value)::number as add -- deposit amount (0 for withdrawals)
,iff(type = 0, value, 0)::number as per -- withdrawal percentage (0 for deposits)
,row_number()over(order by transaction_date asc) as rn
from data
), rec_cte_block as (
with recursive rec_sub_cte as (
select
p.*,
p.add::number(20,4) as deposited
from pre_process_data as p
where p.rn = 1
union all
select
p.*,
-- carry the prior total forward: add any deposit, then apply any percentage deduction
round(div0((r.deposited + p.add)*(100-p.per), 100), 2) as deposited
from rec_sub_cte as r
join pre_process_data as p
on p.rn = r.rn + 1
)
select *
from rec_sub_cte
)
select * exclude(add, per, rn)
from rec_cte_block
order by 1;
I wrote the recursive CTE this way because there is currently an open incident (bug) when IFF or CASE is used inside the recursive part of the CTE.
TRANSACTION_DATE  TYPE  VALUE  DEPOSITED
2023-01-01        1     1      1
2023-01-02        1     1      2
2023-01-03        1     1      3
2023-01-04        0     50     1.5
2023-01-05        0     50     0.75
2023-01-06        1     1      1.75
2023-01-07        0     20     1.4
2023-01-08        1     5      6.4
A solution without recursion or a UDTF:
create table depo (timestamp date,type int, value float);
insert into depo values
(cast('01.01.2023' as date),1, 1.0)
,(cast('02.01.2023' as date),1, 1.0)
,(cast('03.01.2023' as date),1, 1.0)
,(cast('04.01.2023' as date),0, 50.0)
,(cast('05.01.2023' as date),0, 50.0)
,(cast('06.01.2023' as date),1, 1.0)
,(cast('07.01.2023' as date),0, 20.0)
,(cast('08.01.2023' as date),1, 5.0)
;
with t0 as (
  select *
    -- start a new partition each time 100% (or more) is withdrawn
    ,sum(case when type=0 and value>=100 then 1 else 0 end) over (order by timestamp) gr
  from depo
)
,t1 as (
  select timestamp as dt, type, gr
    ,case when type=1 then value else 0 end depo
    ,case when type=0 then ((100.0-value)/100.0) else 0.0 end pct
    -- sum of log2 of all future "keep" multipliers; power(2.0, totLog)
    -- recovers the cumulative product applied from this row onwards
    ,sum(case when type=0 and value<100 then log((100.0-value)/100.0, 2.0)
              when type=0 and value>=100 then null
              else 0.0
         end)
     over (partition by gr order by timestamp rows between current row
           and unbounded following) totLog
  from t0
)
,t2 as (
  select *
    ,case when type=1 then
        -- deposit: scale earlier deposits by the multipliers between then and now, then add this deposit
        isnull(sum(depo*power(cast(2.0 as float),totLog))
               over (partition by gr order by dt rows between unbounded preceding and 1 preceding)
               ,0)/power(cast(2.0 as float),totLog)
        +depo
      else
        -- withdrawal: the same running balance multiplied by this row's "keep" percentage
        isnull(sum(depo*power(cast(2.0 as float),totLog))
               over (partition by gr order by dt rows between unbounded preceding and 1 preceding)
               ,0)/power(cast(2.0 as float),totLog)*pct
      end rest
  from t1
)
select dt, type, depo, pct*100 pct
  ,rest-lag(rest,1,0) over (order by dt) movement
  ,rest
from t2
order by dt
dt          type  depo  pct  movement  rest
2023-01-01  1     1     0    1         1
2023-02-01  1     1     0    1         2
2023-03-01  1     1     0    1         3
2023-04-01  0     0     50   -1.5      1.5
2023-05-01  0     0     50   -0.75     0.75
2023-06-01  1     1     0    1         1.75
2023-07-01  0     0     80   -0.35     1.4
2023-08-01  1     5     0    5         6.4
I think it is better to perform this kind of calculation on the client side or in a middle tier. Sequential calculations are difficult to implement in SQL. In some special cases you can use logarithmic expressions, but it is clearer and easier to implement through recursion, as @Simeon showed.
To expand on @ValNik's answer:
The first simple step is to change "deduct 20%, then deduct 50%, then deduct 30%" into a multiplication...
x - 20% - 50% - 30%
=>
x * 0.8 * 0.5 * 0.7
=>
x * 0.28
The second trick is to understand how to calculate a cumulative PRODUCT() when you only have a cumulative SUM() OVER (), using the properties of logarithms...
a * b == exp( log(a) + log(b) )
0.8 * 0.5 * 0.7
=>
exp( log(0.8) + log(0.5) + log(0.7) )
=>
exp( -0.2231 + -0.6931 + -0.3567 )
=>
exp( -1.2730 )
=>
0.28
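As a quick hedged illustration (T-SQL, to match the dialect of the code further down; the column names are mine), a running product recovered from a running sum of logs:
select i, m,
       exp(sum(log(m)) over (order by i)) as running_product -- 0.8, 0.4, 0.28
from (values (1, 0.8), (2, 0.5), (3, 0.7)) v(i, m);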
The next trick is easier to explain with integers rather than percentages: break the original problem down into one that can be solved using "cumulative sum" and "cumulative product"...
Current working:
row_id  type  value  equation                      result
1       +     10     0 + 10                        10
2       +     20     (0 + 10 + 20)                 30
3       *     2      (0 + 10 + 20) * 2             60
4       +     30     (0 + 10 + 20) * 2 + 30        90
5       *     3      ((0 + 10 + 20) * 2 + 30) * 3  270
Rearranged working:
row_id  type  value  CUMPROD  new equation              result
1       +     10     2*3=6    (10*6) / 6                10
2       +     20     2*3=6    (10*6 + 20*6) / 6         30
3       *     2      3=3      (10*6 + 20*6) / 3         60
4       +     30     3=3      (10*6 + 20*6 + 30*3) / 3  90
5       *     3      =1       (10*6 + 20*6 + 30*3) / 1  270
CUMPROD is the "cumulative product" of all future "multiplication values".
The equation is then the "cumulative sum" of value * CUMPROD divided by the current CUMPROD.
So...
row 1 : SUM(10*6 ) / 6 => SUM(10 )
row 2 : SUM(10*6, 20*6 ) / 6 => SUM(10, 20)
row 3 : SUM(10*6, 20*6 ) / 3 => SUM(10, 20) * 2
row 4 : SUM(10*6, 20*6, 30*3) / 3 => SUM(10, 20) * 2 + SUM(30)
row 5 : SUM(10*6, 20*6, 30*3) / 1 => SUM(10, 20) * 2*3 + SUM(30) * 3
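To make that concrete, here is a hedged sketch of the integer toy example in T-SQL (the CTE and column names are mine); it reproduces the result column above via a cumulative sum divided by the cumulative product of future multipliers:
with ops(row_id, op, val) as (
    select * from (values
        (1, '+', 10.0), (2, '+', 20.0), (3, '*', 2.0),
        (4, '+', 30.0), (5, '*', 3.0)
    ) v(row_id, op, val)
), with_cumprod as (
    select *,
           -- cumulative product of all FUTURE multipliers, recovered via exp(sum(log))
           exp(isnull(sum(case when op = '*' then log(val) end)
                      over (order by row_id rows between 1 following and unbounded following), 0)) as cumprod
    from ops
)
select row_id, op, val, cumprod,
       sum(case when op = '+' then val * cumprod else 0 end)
           over (order by row_id rows unbounded preceding) / cumprod as result
from with_cumprod;
This returns 10, 30, 60, 90, 270 for the result column, matching the rearranged working.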
The only things to be cautious of are:
LOG(0) is undefined, tending to negative infinity (which would happen when deducting 100%)
Deducting more than 100% makes no sense
So, I copied @ValNik's code that creates a new partition every time 100% or more is deducted (forcing everything in the next partition to start at zero again).
This gives the following SQL (a re-arranged version of @ValNik's code):
WITH
partition_when_deduct_everything AS
(
SELECT
*,
SUM(
CASE WHEN type = 0 AND value >= 100 THEN 1 ELSE 0 END
)
OVER (
ORDER BY timestamp
)
AS deduct_everything_id,
CASE WHEN type = 1 THEN value
ELSE 0
END
AS deposit,
CASE WHEN type = 1 THEN 1.0 -- Deposits == Deduct 0%
WHEN value >= 100 THEN 1.0 -- Treat "deduct everything" as a special case
ELSE (100.0-value)/100.0 -- Change "deduct 20%" to "multiply by 0.8"
END
AS multiplier
FROM
your_table
)
,
cumulative_product_of_multipliers as
(
SELECT
*,
EXP(
ISNULL(
SUM(
LOG(multiplier)
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN 1 FOLLOWING
AND UNBOUNDED FOLLOWING
)
, 0
)
)
AS future_multiplier
FROM
partition_when_deduct_everything
)
SELECT
*,
ISNULL(
SUM(
deposit * future_multiplier
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
),
0
)
/
future_multiplier
AS rest
FROM
cumulative_product_of_multipliers
Demo : https://dbfiddle.uk/mrioIMiB
The way this should really be solved is with a UDTF, because it requires sorting the data once and traversing it only once, and if you have different partitions (user_id etc.) the work can be done in parallel:
create or replace function carry_value_state(_TYPE float, _VALUE float)
returns table (DEPOSITED float)
language javascript
as
$$
{
    initialize: function(argumentInfo, context) {
        this.carried_value = 0.0; // running balance carried across rows
    },
    processRow: function (row, rowWriter, context) {
        if (row._TYPE === 1) {
            // deposit: add the amount
            this.carried_value += row._VALUE;
        } else {
            // withdrawal: deduct a percentage of the balance, clamped to [0, 100]
            let limited = Math.max(Math.min(row._VALUE, 100.0), 0.0);
            this.carried_value -= (this.carried_value * limited) / 100;
        }
        rowWriter.writeRow({DEPOSITED: this.carried_value});
    }
}
$$;
which then gets used like:
select d.*,
c.*
from data as d
,table(carry_value_state(d.type::float, d.value::float) over (order by transaction_date)) as c
order by 1;
so for the data we have been using in the example, that gives:
TRANSACTION_DATE  TYPE  VALUE  DEPOSITED
2023-01-01        1     1      1
2023-01-02        1     1      2
2023-01-03        1     1      3
2023-01-04        0     50     1.5
2023-01-05        0     50     0.75
2023-01-06        1     1      1.75
2023-01-07        0     20     1.4
2023-01-08        1     5      6.4
Yes, the results are now in floating point, so you should double-round to avoid FP representation problems, like:
round(round(c.deposited, 6), 2) as deposited
An alternative approach using MATCH_RECOGNIZE(), POW() and SUM().
I would not recommend using MATCH_RECOGNIZE() unless you have to; it's fiddly and can waste time, but it does look elegant.
with data(transaction_date, type, value) as (
select
to_date(column1, 'dd.mm.yyyy'),
column2,
column3
from
values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
)
select *
from data
    match_recognize(
        order by transaction_date
        measures
            sum(iff(CLASSIFIER() = 'ROW_WITH_DEPOSIT', value, 0)) DEPOSITS,
            pow(iff(CLASSIFIER() = 'ROW_WITH_WITHDRAWL', value / 100, 1), count(row_with_withdrawl.*)) DISCOUNT_FROM_WITHDRAWL,
            CLASSIFIER() TRANS_TYPE,
            first(transaction_date) as start_date,
            last(transaction_date) as end_date,
            count(*) as rows_in_sequence,
            count(row_with_deposit.*) as num_deposits,
            count(row_with_withdrawl.*) as num_withdrawls
        after match skip past last row
        pattern((row_with_deposit+ | row_with_withdrawl+))
        define
            row_with_deposit as type = 1,
            row_with_withdrawl as type = 0
    );

Last page visited by a user on the website and never returned

I am trying to find the last page a user visited before not returning to the website within a particular duration:
For Example:
Date visitstarttime ldapid page_1 page_2 Page_3 Page_4
2018-08-01 1590805941 1000123 1 0 0 0
2018-07-30 1590200345 1000123 0 1 0 0
2018-07-20 1580100098 1000100 0 1 0 0
2018-07-18 1570000987 1000100 0 0 1 0
2018-07-12 1550200678 1000007 0 1 0 0
2018-07-09 1530287323 1000007 0 0 0 1
I am trying to find only the users who visited a page and never returned within a particular duration of time. I am expecting the output below:
Date visitstarttime ldapid page_1 page_2 Page_3 Page_4
2018-08-01 1590805941 1000123 1 0 0 0
2018-07-20 1580100098 1000100 0 1 0 0
2018-07-12 1550200678 1000007 0 1 0 0
Since we cannot use aggregate and GROUP BY functions together in GBQ, is there a way around it?
Query:
SELECT
MAX(date) as Max_date,
Max(visitStartTime) as Max_Time,
ldapid
FROM
(
SELECT
CAST(CONCAT(SUBSTR(date,0,4), '-', SUBSTR(date,5,2),'-',SUBSTR(date,7,2)) AS date ) AS date,
visitStartTime,
-- fullVisitorId,
-- hit.page.pagePath AS pagepath,
(
SELECT
x.value
FROM
UNNEST(hit.customDimensions) x
WHERE
x.index = 9)
AS ldapid,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/recommendations-and-references%',
1,
0))
FROM
UNNEST(hits))
AS Recommendation_References,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/interview-sign-up%',
1,
0))
FROM
UNNEST(hits))
AS Interview_Sign_Up,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/transcript-setup%',
1,
0))
FROM
UNNEST(hits))
AS Transcript_Setup,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/transcript-upload%',
1,
0))
FROM
UNNEST(hits))
AS Transcript_Upload,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/pre-interview-questions%',
1,
0))
FROM
UNNEST(hits))
AS Pre_Interview_Questions,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/interview-prep%',
1,
0))
FROM
UNNEST(hits))
AS Interview_Prep,
(
SELECT
MAX(
IF
(page.pagePath LIKE '%/applicant-center/#interview/interview-day-details%',
1,
0))
FROM
UNNEST(hits))
AS Interview_day_details
FROM
`tfa-big-query.74006564.ga_sessions_*`,
UNNEST(hits) AS hit
WHERE
_TABLE_SUFFIX BETWEEN '20190801' AND '20200529'
GROUP BY
date,
visitStartTime,
--fullVisitorId,
-- pagepath,
ldapid,
Recommendation_References,
Interview_Sign_Up,
Transcript_Setup,
Transcript_Upload,
Pre_Interview_Questions,
Interview_Prep,
Interview_day_details
ORDER BY visitStartTime DESC, date DESC
)
WHERE
( Recommendation_References >= 1
OR
Interview_Sign_Up >= 1
OR
Transcript_Setup >= 1
OR
Transcript_Upload>= 1
OR
Pre_Interview_Questions >= 1
OR
Interview_Prep >= 1
OR
Interview_day_details >= 1) and ldapid IS NOT NULL
GROUP BY
Max_date,
Max_Time,
ldapid
Since we cannot use aggregate and GROUP BY functions together in GBQ, is there a way around it?
I see no relationship between your question and the query in the question. But based on your question and the sample data, use lead():
select t.*
from (select t.*,
lead(visitstarttime) over (partition by ldapid order by visitstarttime) as next_visitstarttime
from t
) t
where next_visitstarttime is null or next_visitstarttime > visitstarttime + <whatever>
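For example, a sketch in BigQuery Standard SQL with a hypothetical 90-day window (substitute your own duration; this assumes visitstarttime is a UNIX timestamp in seconds, as the sample data suggests, and `project.dataset.table` stands in for your table):
#standardSQL
SELECT *
FROM (
  SELECT t.*,
         LEAD(visitstarttime) OVER (PARTITION BY ldapid ORDER BY visitstarttime) AS next_visitstarttime
  FROM `project.dataset.table` t
)
WHERE next_visitstarttime IS NULL
   OR next_visitstarttime > visitstarttime + 90 * 24 * 60 * 60 -- 90 days in seconds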

SQL query which converts sets or range of records on the basis of record before and after rows of that range

Suppose this table
Day Present Absent Holiday
1/1/2019 1 0 0
1/2/2019 0 1 0
1/3/2019 0 0 1
1/4/2019 0 0 1
1/5/2019 0 0 1
1/6/2019 0 1 0
1/7/2019 1 0 0
1/8/2019 0 1 0
1/9/2019 0 0 1
1/10/2019 0 1 0
I want to mark as zero all holidays that fall between absences: if an employee is absent both before and after a run of holidays, those holidays become absent days for him. I don't want to use a loop; I want a set-based approach.
As a select, you can use lead() and lag():
select t.*,
       (case when prev_absent = 1 and next_absent = 1 and holiday = 1
             then 0 else holiday
        end) as new_holiday
from (select t.*,
             lag(absent) over (order by day) as prev_absent,
             lead(absent) over (order by day) as next_absent
      from t
     ) t;
If this does what you want, then you can incorporate it into an update:
with toupdate as (
      select t.*,
             (case when prev_absent = 1 and next_absent = 1 and holiday = 1
                   then 0 else holiday
              end) as new_holiday
      from (select t.*,
                   lag(absent) over (order by day) as prev_absent,
                   lead(absent) over (order by day) as next_absent
            from t
           ) t
)
update toupdate
    set holiday = new_holiday
    where holiday <> new_holiday;
EDIT:
You can also do this with joins:
select t.*,
       (case when tprev.absent = 1 and tnext.absent = 1 and t.holiday = 1
             then 0 else t.holiday
        end) as new_holiday
from t left join
     t tprev
     on tprev.day = dateadd(day, -1, t.day) left join
     t tnext
     on tnext.day = dateadd(day, 1, t.day);
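Note that lag()/lead() only looks one row away, so a run of several holidays (such as 1/3/2019-1/5/2019 in the sample) is not flipped by the above. A hedged sketch that handles runs, assuming a dialect that supports IGNORE NULLS (e.g. Oracle or BigQuery): for each holiday row, fetch the absent flag of the nearest non-holiday day on either side and check that both were absences:
select t.*,
       (case when holiday = 1
                  and last_value(case when holiday = 0 then absent end ignore nulls)
                      over (order by day rows between unbounded preceding and 1 preceding) = 1
                  and first_value(case when holiday = 0 then absent end ignore nulls)
                      over (order by day rows between 1 following and unbounded following) = 1
             then 0 else holiday
        end) as new_holiday
from t;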

How to get running total from consecutive columns in Oracle SQL

I am having trouble displaying consecutive holidays from an existing date dataset in Oracle SQL. For example, in December 2017 between the 20th and 30th, there are the following days off (because of Christmas and weekend days):
23.12.2017 Saturday
24.12.2017 Sunday
25.12.2017 Christmas
30.12.2017 Saturday
Now I want my result dataset to look like this (RUNTOT is needed):
DAT ISOFF RUNTOT
20.12.2017 0 0
21.12.2017 0 0
22.12.2017 0 0
23.12.2017 1 1
24.12.2017 1 2
25.12.2017 1 3
26.12.2017 0 0
27.12.2017 0 0
28.12.2017 0 0
29.12.2017 0 0
30.12.2017 1 1
That means when "ISOFF" changes I want to count (or sum) the consecutive rows where "ISOFF" is 1.
I tried to approach a solution with an analytic function, where I sum "ISOFF" up to the current row.
SELECT DAT,
ISOFF,
SUM (ISOFF)
OVER (ORDER BY DAT ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS RUNTOT
FROM (TIME_DATASET)
WHERE DAT BETWEEN DATE '2017-12-20' AND DATE '2017-12-30'
ORDER BY 1
What I get now is the following dataset:
DAT ISOFF RUNTOT
20.12.2017 0 0
21.12.2017 0 0
22.12.2017 0 0
23.12.2017 1 1
24.12.2017 1 2
25.12.2017 1 3
26.12.2017 0 3
27.12.2017 0 3
28.12.2017 0 3
29.12.2017 0 3
30.12.2017 1 4
How can I reset the running total if ISOFF changes to 0? Or is this the wrong approach to solve this problem?
Thank you for your help!
This is a gaps-and-islands problem. Here is one method that assigns the groups by the number of 0s up to that row:
select t.*,
       (case when isoff = 1
             -- minus 1, because the first row of each group is the working day that opened it
             then row_number() over (partition by grp order by dat) - 1
        end) as runtot
from (select t.*,
             sum(case when isoff = 0 then 1 else 0 end) over (order by dat) as grp
      from TIME_DATASET t
     ) t;
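For the sample range, grp (the running count of zeros) works out like this (illustration only; non-holiday rows get a NULL runtot here, so wrap the case in coalesce(..., 0) if you need zeros):
DAT         ISOFF  GRP  RUNTOT
20.12.2017  0      1
21.12.2017  0      2
22.12.2017  0      3
23.12.2017  1      3    1
24.12.2017  1      3    2
25.12.2017  1      3    3
26.12.2017  0      4
27.12.2017  0      5
28.12.2017  0      6
29.12.2017  0      7
30.12.2017  1      7    1
Within group 3 the holiday rows are rows 2-4 by date, which is why row_number() needs the minus 1.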
You may use recursive subquery factoring; the precondition is that your dates are consecutive without gaps (or that you have some other row-number sequence to follow in steps of one).
WITH t1(dat, isoff, runtot) AS (
SELECT dat, isoff, 0 runtot
FROM tab
WHERE DAT = DATE'2017-12-20'
UNION ALL
SELECT t2.dat, t2.isoff,
case when t2.isoff = 0 then 0 else runtot + t2.isoff end as runtot
FROM tab t2, t1
WHERE t2.dat = t1.dat + 1
)
SELECT dat, isoff, runtot
FROM t1;
DAT ISOFF RUNTOT
------------------- ---------- ----------
20.12.2017 00:00:00 0 0
21.12.2017 00:00:00 0 0
22.12.2017 00:00:00 0 0
23.12.2017 00:00:00 1 1
24.12.2017 00:00:00 1 2
25.12.2017 00:00:00 1 3
26.12.2017 00:00:00 0 0
27.12.2017 00:00:00 0 0
28.12.2017 00:00:00 0 0
29.12.2017 00:00:00 0 0
30.12.2017 00:00:00 1 1
Another variation, which doesn't need a subquery or CTE but does need all days to be present and have the same time, is - for the holiday dates only (where isoff = 1) - to see how many days it's been since the last non-holiday date:
select dat,
isoff,
case
when isoff = 1 then
coalesce(dat - max(case when isoff = 0 then dat end)
over (order by dat range between unbounded preceding and 1 preceding), 1)
else 0
end as runtot
from time_dataset
order by dat;
DAT ISOFF RUNTOT
---------- ---------- ----------
2017-12-20 0 0
2017-12-21 0 0
2017-12-22 0 0
2017-12-23 1 1
2017-12-24 1 2
2017-12-25 1 3
2017-12-26 0 0
2017-12-27 0 0
2017-12-28 0 0
2017-12-29 0 0
2017-12-30 1 1
The coalesce() is there in case the first date in the range is a holiday - as there is no previous non-holiday date to compare against, that subtraction would get null.
db<>fiddle with a slightly larger data set.

Create a lapsed concept based on logic across every row per ID

I am trying to get to a lapsed_date, which is when there are >12 weeks (i.e. 84 days) for a given ID between:
1) onboarded_at and current_date (if no applied_at exists) - this means lapsed_now if >84 days
2) onboarded_at and min(applied_at) (if one exists)
3) each consecutive applied_at
4) max(applied_at) and current_date - this means lapsed_now if >84 days
If there are multiple instances where the ID lapsed, then we only show the latest lapsed date.
The attempt I have works for most but not all cases. Can you assist in making it work universally?
Sample set:
CREATE TABLE #t
(
id VARCHAR(10),
rank INTEGER,
onboarded_at DATE,
applied_at DATE
);
INSERT INTO #t VALUES
('A',1,'20180101','20180402'),
('A',2,'20180101','20180403'),
('A',3,'20180101','20180504'),
('B',1,'20180201','20180801'),
('C',1,'20180301','20180401'),
('C',2,'20180301','20180501'),
('C',3,'20180301','20180901'),
('D',1,'20180401',null)
Best attempt:
SELECT onb.id,
onb.rank,
onb.onboarded_at,
onb.applied_at,
onb.lapsed_now,
CASE WHEN lapsed_now = 1 OR lapsed_previous = 1
THEN 1
ELSE 0
END lapsed_ever,
CASE WHEN lapsed_now = 1
THEN DATEADD(DAY, 84, lapsed_now_date)
ELSE min_applied_at_add_84
END lapsed_date
FROM
(SELECT *,
CASE
WHEN DATEDIFF(DAY, onboarded_at, MIN(ISNULL(applied_at, onboarded_at)) over (PARTITION BY id)) >= 84
THEN 1
WHEN DATEDIFF(DAY, MAX(applied_at) OVER (PARTITION BY id), GETDATE()) >= 84
THEN 1
ELSE 0
END lapsed_now,
CASE
WHEN MAX(DATEDIFF(DAY, onboarded_at, ISNULL(applied_at, GETDATE()))) OVER (PARTITION BY id) >= 84
THEN 1
ELSE 0
END lapsed_previous,
MAX(applied_at) OVER (PARTITION BY id) lapsed_now_date,
DATEADD(DAY, 84, MIN(CASE WHEN applied_at IS NULL THEN onboarded_at ELSE applied_at END) OVER (PARTITION BY id)) min_applied_at_add_84
FROM #t
) onb
Current solution:
id rank onboarded_at applied_at lapsed_now lapsed_ever lapsed_date
A 1 2018-01-01 2018-04-02 1 1 2018-07-27
A 2 2018-01-01 2018-04-03 1 1 2018-07-27
A 3 2018-01-01 2018-05-04 1 1 2018-07-27
B 1 2018-02-01 2018-08-01 1 1 2018-10-24
C 1 2018-03-01 2018-04-01 0 1 2018-06-24
C 2 2018-03-01 2018-05-01 0 1 2018-06-24
C 3 2018-03-01 2018-09-01 0 1 2018-06-24
D 1 2018-04-01 null 1 1 2018-06-24
Expected solution:
id rank onboarded_at applied_at lapsed_now lapsed_ever lapsed_date
A 1 2018-01-01 2018-04-02 1 1 2018-07-27 (not max lapsed date)
A 2 2018-01-01 2018-04-03 1 1 2018-07-27
A 3 2018-01-01 2018-05-04 1 1 2018-07-27 (May 4 + 84)
B 1 2018-02-01 2018-08-01 0 1 2018-04-26 (Feb 1 + 84)
C 1 2018-03-01 2018-04-01 0 1 2018-07-24
C 2 2018-03-01 2018-05-01 0 1 2018-07-24 (May 1 + 84)
C 3 2018-03-01 2018-09-01 0 1 2018-07-24
D 1 2018-04-01 null 1 1 2018-06-24
Bit of guesswork here, but hopefully this does the trick:
SELECT res.id,
res.rank,
res.onboarded_at,
res.applied_at,
res.lapsed_now,
CASE WHEN lapsed_now = 1 OR lapsed_previous = 1
THEN 1
ELSE 0
END lapsed_ever,
CASE
WHEN lapsed_now = 1
THEN DATEADD(DAY, 84, lapsed_now_date)
WHEN applied_difference_gt84 IS NOT NULL
THEN DATEADD(DAY, 84, applied_difference_gt84)
WHEN DATEDIFF(DAY, min_applied_at_add_84, GETDATE()) < 84
THEN DATEADD(DAY, 84, onboarded_at)
ELSE min_applied_at_add_84
END lapsed_date
FROM (
SELECT *, MAX(applied_difference) OVER (PARTITION BY id ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) applied_difference_gt84
FROM
(
SELECT *,
CASE
WHEN DATEDIFF(DAY, onboarded_at, MIN(ISNULL(applied_at, onboarded_at)) over (PARTITION BY id)) >= 84
AND DATEDIFF(DAY, MAX(applied_at) OVER (PARTITION BY id), GETDATE()) >= 84
THEN 1
WHEN DATEDIFF(DAY, ISNULL(MAX(applied_at) OVER (PARTITION BY id), onboarded_at), GETDATE()) >= 84
THEN 1
ELSE 0
END lapsed_now,
CASE
WHEN MAX(DATEDIFF(DAY, onboarded_at, ISNULL(applied_at, GETDATE()))) OVER (PARTITION BY id) >= 84
THEN 1
ELSE 0
END lapsed_previous,
CASE
WHEN DATEDIFF(MONTH, applied_at, LEAD(applied_at, 1) OVER (PARTITION BY id ORDER BY rank)) >= 2
THEN applied_at
ELSE NULL
END applied_difference,
ISNULL(MAX(applied_at) OVER (PARTITION BY id), onboarded_at) lapsed_now_date,
DATEADD(DAY, 84, MIN(CASE WHEN applied_at IS NULL THEN onboarded_at ELSE applied_at END) OVER (PARTITION BY id)) min_applied_at_add_84
FROM #t
) onb
) res
Results:
id rank onboarded_at applied_at lapsed_now lapsed_ever lapsed_date
A 1 2018-01-01 2018-04-02 1 1 2018-07-27
A 2 2018-01-01 2018-04-03 1 1 2018-07-27
A 3 2018-01-01 2018-05-04 1 1 2018-07-27
B 1 2018-02-01 2018-08-01 0 1 2018-04-26
C 1 2018-03-01 2018-04-01 0 1 2018-07-24
C 2 2018-03-01 2018-05-01 0 1 2018-07-24
C 3 2018-03-01 2018-09-01 0 1 2018-07-24
D 1 2018-04-01 (null) 1 1 2018-06-24
It's a bit messy because of the need to calculate the difference between the applied_at dates.
@Jim, inspired by your answer, I created the following solution.
I think it is easily understandable and intuitive, given the lapsed criteria:
SELECT id, onboarded_at, applied_at,
       max(case when (zero_applicants is not null and current_date - onboarded_at > 84)
                  or (last_applicant is not null and current_date - last_applicant > 84)
                then 1 else 0 end) over (partition by id) lapsed_now,
       max(case when (zero_applicants is not null and current_date - onboarded_at > 84)
                  or (one_applicant is not null and applied_at - onboarded_at > 84)
                  or (one_applicant is not null and current_date - applied_at > 84)
                  or (next_applicant is not null and next_applicant - applied_at > 84)
                  or (last_applicant is not null and current_date - last_applicant > 84)
                then 1 else 0 end) over (partition by id) lapsed_ever,
       max(case when zero_applicants is not null and current_date - onboarded_at > 84 then onboarded_at + 84
                when one_applicant is not null and applied_at - onboarded_at > 84 then onboarded_at + 84
                when one_applicant is not null and current_date - applied_at > 84 then applied_at + 84
                when next_applicant is not null and next_applicant - applied_at > 84 then applied_at + 84
                when last_applicant is not null and current_date - last_applicant > 84 then last_applicant + 84
           end) over (partition by id) lapsed_date
from (
    select *,
           case when max(applied_at) over (partition by id) is null then onboarded_at end as zero_applicants,
           case when count(applied_at) over (partition by id) = 1 then onboarded_at end as one_applicant,
           case when count(applied_at) over (partition by id) > 1 then lead(applied_at, 1) over (partition by id order by applied_at) end as next_applicant,
           case when lead(applied_at, 1) over (partition by id order by applied_at) is null then max(applied_at) over (partition by id) end as last_applicant
    from #t
) res
order by id, applied_at