Conditionally increment previous value and propagate forward using Amazon Redshift (SQL)

Using Amazon Redshift (SQL), I've got a table of timestamps that I'd like to split into separate phases when the time between entries is above some threshold.
For example, using a threshold of 60 units for this input:
  id  ts
1  a   1
2  a   4
3  a  12
4  a  90
5  a  94
6  a 101
7  a 404
8  a 412
9  a 413
I'd like to return this:
  id  ts  dt phase
1  a   1  NA     1
2  a   4   3     1
3  a  12   8     1
4  a  90  78     2
5  a  94   4     2
6  a 101   7     2
7  a 404 303     3
8  a 412   8     3
9  a 413   1     3
This is straightforward in R (which I'm most familiar with) using a simple for loop and ifelse which increments the previous phase value by 1 if dt > 60:
# sample data
library(dplyr)
df <- data.frame(id = rep("a", 9),
                 ts = c(1, 4, 12, 90, 94, 101, 404, 412, 413)) %>%
  mutate(dt = c(NA, diff(ts)))

# add default minimum phase value, 1
df$phase <- 1

# for loop
for (i in 2:nrow(df)) {
  df$phase[i] <- ifelse(df$dt[i] > 60, df$phase[i - 1] + 1, df$phase[i - 1])
}
However, my attempts using the lag function and case / when in SQL have not been successful.
-- sample data
CREATE TABLE sampledata (
conversationid varchar(10), ts integer
);
INSERT INTO sampledata (conversationid, ts)
VALUES
('a', 1),
('a', 4),
('a', 12),
('a', 90),
('a', 94),
('a', 101),
('a', 404),
('a', 412),
('a', 413);
-- query
SELECT *,
       CASE
         WHEN dt > 60 THEN LAG(period) OVER (PARTITION BY conversationid ORDER BY ts) + 1
         ELSE LAG(period) OVER (PARTITION BY conversationid ORDER BY ts)
       END AS period
FROM (
  SELECT *,
         ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS dt,
         1 AS period
  FROM sampledata
) t
ORDER BY ts
;
-- output
id   ts   dt  period  period
a      1       1
a      4    3  1       1
a     12    8  1       1
a     90   78  1       2
a     94    4  1       1
a    101    7  1       1
a    404  303  1       2
a    412    8  1       1
a    413    1  1       1
I'm able to increment the phase value on rows where dt > 60, but not propagate the incremented phase value across subsequent rows.
I guess this may have something to do with the lag function operating across all rows at once rather than row by row, and/or with being unable to update the original phase value on the fly (a second phase column is created instead).

You are close. You want a cumulative sum based on the lag difference:
SELECT sd.*,
       SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END)
           OVER (PARTITION BY conversationid ORDER BY ts
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
FROM (SELECT sd.*,
             ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS diff
      FROM sampledata sd
     ) sd
ORDER BY ts;
As a side note, I would expect you to use ORDER BY conversationid, ts, rather than just the time.
And finally, the above will start the periods at 0 (it should identify them correctly, just number them awkwardly). The following tweak does the enumeration as you specifically request, by letting the NULL diff on each partition's first row fall into the ELSE branch:
SELECT sd.*,
       SUM(CASE WHEN diff < 60 THEN 0 ELSE 1 END)
           OVER (PARTITION BY conversationid ORDER BY ts
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
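Putting it together against the sample table (the inner subquery is unchanged from the first query):
SELECT sd.*,
       SUM(CASE WHEN diff < 60 THEN 0 ELSE 1 END)
           OVER (PARTITION BY conversationid ORDER BY ts
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS period
FROM (SELECT sd.*,
             ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS diff
      FROM sampledata sd
     ) sd
ORDER BY conversationid, ts;
The first row of each conversation has a NULL diff, which falls into the ELSE branch and seeds the count at 1, so the sample rows get period = 1, 1, 1, 2, 2, 2, 3, 3, 3 as requested.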


SQL decreasing sum by a percentage

I have a table like this:
timestamp   type  value
08.01.2023  1     5
07.01.2023  0     20
06.01.2023  1     1
05.01.2023  0     50
04.01.2023  0     50
03.01.2023  1     1
02.01.2023  1     1
01.01.2023  1     1
Type 1 means a deposit, type 0 means a withdrawal.
When the type is 1, the value is the exact amount the user deposited, so we can simply sum it; but when the type is 0, the value is a withdrawal expressed as a percentage.
What I'm looking for is to create another column with the current deposited amount. For the example above it would look like this:
timestamp   type  value  deposited
08.01.2023  1     5      5.4
07.01.2023  0     20     1.4
06.01.2023  1     1      1.75
05.01.2023  0     50     0.75
04.01.2023  0     50     1.5
03.01.2023  1     1      3
02.01.2023  1     1      2
01.01.2023  1     1      1
I can't figure out how to make a sum like this which would subtract percentage of previous total
You are trying to carry state over time, so you either need a UDTF to do the carry work for you, or a recursive CTE:
with data(transaction_date, type, value) as (
select to_date(column1, 'dd.mm.yyyy'), column2, column3
from values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
), pre_process_data as (
select *
,iff(type = 0, 0, value)::number as add
,iff(type = 0, value, 0)::number as per
,row_number()over(order by transaction_date asc) as rn
from data
), rec_cte_block as (
with recursive rec_sub_cte as (
select
p.*,
p.add::number(20,4) as deposited
from pre_process_data as p
where p.rn = 1
union all
select
p.*,
round(div0((r.deposited + p.add)*(100-p.per), 100), 2) as deposited
from rec_sub_cte as r
left join pre_process_data as p
where p.rn = r.rn+1
)
select *
from rec_sub_cte
)
select * exclude(add, per, rn)
from rec_cte_block
order by 1;
I wrote the recursive CTE this way because there is currently an open issue when IFF or CASE is used inside the recursive part of the CTE.
TRANSACTION_DATE  TYPE  VALUE  DEPOSITED
2023-01-01        1     1      1
2023-01-02        1     1      2
2023-01-03        1     1      3
2023-01-04        0     50     1.5
2023-01-05        0     50     0.75
2023-01-06        1     1      1.75
2023-01-07        0     20     1.4
2023-01-08        1     5      6.4
A solution without recursion or a UDTF:
create table depo (timestamp date,type int, value float);
insert into depo values
(cast('01.01.2023' as date),1, 1.0)
,(cast('02.01.2023' as date),1, 1.0)
,(cast('03.01.2023' as date),1, 1.0)
,(cast('04.01.2023' as date),0, 50.0)
,(cast('05.01.2023' as date),0, 50.0)
,(cast('06.01.2023' as date),1, 1.0)
,(cast('07.01.2023' as date),0, 20.0)
,(cast('08.01.2023' as date),1, 5.0)
;
with t0 as(
select *
,sum(case when type=0 and value>=100 then 1 else 0 end)over(order by timestamp) gr
from depo
)
,t1 as (select timestamp as dt,type,gr
,case when type=1 then value else 0 end depo
,case when type=0 then ((100.0-value)/100.0) else 0.0 end pct
,sum(case when type=0 and value<100 then log((100.0-value)/100.0,2.0)
when type=0 and value>=100 then null
else 0.0
end)
over(partition by gr order by timestamp ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) totLog
from t0
)
,t2 as(
select *
,case when type=1 then
isnull(sum(depo*power(cast(2.0 as float),totLog))
over(partition by gr order by dt rows between unbounded preceding and 1 preceding)
,0)/power(cast(2.0 as float),totLog)
+depo
else
isnull(sum(depo*power(cast(2.0 as float),totLog))
over(partition by gr order by dt rows between unbounded preceding and 1 preceding)
,0)/power(cast(2.0 as float),totLog)*pct
end rest
from t1
)
select dt,type,depo,pct*100 pct
,rest-lag(rest,1,0)over(order by dt) movement
,rest
from t2
order by dt
dt          type  depo  pct  movement  rest
2023-01-01  1     1     0    1         1
2023-02-01  1     1     0    1         2
2023-03-01  1     1     0    1         3
2023-04-01  0     0     50   -1.5      1.5
2023-05-01  0     0     50   -0.75     0.75
2023-06-01  1     1     0    1         1.75
2023-07-01  0     0     80   -0.35     1.4
2023-08-01  1     5     0    5         6.4
I think it is better to perform this kind of calculation on the client side or in a middle tier.
Sequential calculations are difficult to implement in SQL; in some special cases you can use logarithmic expressions, but it is clearer and easier to implement through recursion, as @Simeon showed.
To expand on @ValNik's answer:
The first simple step is to change "deduct 20%, then deduct 50%, then deduct 30%" into a multiplication...
x - 20% - 50% - 30%
=> x * 0.8 * 0.5 * 0.7
=> x * 0.28
The second trick is to understand how to calculate a cumulative PRODUCT() when you only have a cumulative SUM() OVER (), using the properties of logarithms...
a * b == exp( log(a) + log(b) )
0.8 * 0.5 * 0.7
=> exp( log(0.8) + log(0.5) + log(0.7) )
=> exp( -0.2231 + -0.6931 + -0.3567 )
=> exp( -1.2730 )
=> 0.28
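As a quick sanity check, the identity can be verified with a hypothetical one-off query (SQL Server syntax, where LOG() is the natural logarithm):
-- 0.8 * 0.5 * 0.7 computed as the exponential of a sum of logs
SELECT EXP(LOG(0.8) + LOG(0.5) + LOG(0.7)) AS product;  -- ~0.28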
The next trick is easier to explain with integers rather than percentages. That is to be able to break down the original problem in to one that can be solved using "cumulative sum" and "cumulative product"...
Current working:
row_id  type  value  equation                      result
1       +     10     0 + 10                        10
2       +     20     (0 + 10 + 20)                 30
3       *     2      (0 + 10 + 20) * 2             60
4       +     30     (0 + 10 + 20) * 2 + 30        90
5       *     3      ((0 + 10 + 20) * 2 + 30) * 3  270
Rearranged working:
row_id  type  value  CUMPROD  new equation              result
1       +     10     2*3=6    (10*6) / 6                10
2       +     20     2*3=6    (10*6 + 20*6) / 6         30
3       *     2      3=3      (10*6 + 20*6) / 3         60
4       +     30     3=3      (10*6 + 20*6 + 30*3) / 3  90
5       *     3      =1       (10*6 + 20*6 + 30*3) / 1  270
CUMPROD is the "cumulative product" of all future "multiplication values".
The equation is then the "cumulative sum" of value * CUMPROD divided by the current CUMPROD.
So...
row 1 : SUM(10*6 ) / 6 => SUM(10 )
row 2 : SUM(10*6, 20*6 ) / 6 => SUM(10, 20)
row 3 : SUM(10*6, 20*6 ) / 3 => SUM(10, 20) * 2
row 4 : SUM(10*6, 20*6, 30*3) / 3 => SUM(10, 20) * 2 + SUM(30)
row 5 : SUM(10*6, 20*6, 30*3) / 1 => SUM(10, 20) * 2*3 + SUM(30) * 3
The only things to be cautious of are:
LOG(0) is undefined (it tends to negative infinity), which would happen when deducting 100%
Deducting more than 100% makes no sense
So, I copied @ValNik's code that creates a new partition every time 100% or more is deducted (forcing everything in the next partition to start at zero again).
This gives the following SQL (a re-arranged version of @ValNik's code):
WITH
partition_when_deduct_everything AS
(
SELECT
*,
SUM(
CASE WHEN type = 0 AND value >= 100 THEN 1 ELSE 0 END
)
OVER (
ORDER BY timestamp
)
AS deduct_everything_id,
CASE WHEN type = 1 THEN value
ELSE 0
END
AS deposit,
CASE WHEN type = 1 THEN 1.0 -- Deposits == Deduct 0%
WHEN value >= 100 THEN 1.0 -- Treat "deduct everything" as a special case
ELSE (100.0-value)/100.0 -- Change "deduct 20%" to "multiply by 0.8"
END
AS multiplier
FROM
your_table
)
,
cumulative_product_of_multipliers as
(
SELECT
*,
EXP(
ISNULL(
SUM(
LOG(multiplier)
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN 1 FOLLOWING
AND UNBOUNDED FOLLOWING
)
, 0
)
)
AS future_multiplier
FROM
partition_when_deduct_everything
)
SELECT
*,
ISNULL(
SUM(
deposit * future_multiplier
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
),
0
)
/
future_multiplier
AS rest
FROM
cumulative_product_of_multipliers
Demo : https://dbfiddle.uk/mrioIMiB
So how this should really be solved is with a UDTF, because it requires sorting the data once and traversing the data only once, and if you have different partitions (user_id, etc.) the work can be done in parallel:
create or replace function carry_value_state(_TYPE float, _VALUE float)
returns table (DEPOSITED float)
language javascript
as
$$
{
initialize: function(argumentInfo, context) {
this.carried_value = 0.0;
},
processRow: function (row, rowWriter, context){
if(row._TYPE === 1) {
this.carried_value += row._VALUE;
} else {
let limited = Math.max(Math.min(row._VALUE, 100.0), 0.0);
this.carried_value -= (this.carried_value * limited) / 100;
}
rowWriter.writeRow({DEPOSITED: this.carried_value});
}
}
$$;
which then gets used like:
select d.*,
c.*
from data as d
,table(carry_value_state(d.type::float, d.value::float) over (order by transaction_date)) as c
order by 1;
so for the data we have been using in the example, that gives:
TRANSACTION_DATE
TYPE
VALUE
DEPOSITED
2023-01-01
1
1
1
2023-01-02
1
1
2
2023-01-03
1
1
3
2023-01-04
0
50
1.5
2023-01-05
0
50
0.75
2023-01-06
1
1
1.75
2023-01-07
0
20
1.4
2023-01-08
1
5
6.4
Yes, the results are now in floating point, so you should double-round to avoid floating-point representation problems, like:
round(round(c.deposited, 6), 2) as deposited
An alternative approach uses MATCH_RECOGNIZE(), POW() and SUM().
I would not recommend using MATCH_RECOGNIZE() unless you have to; it's fiddly and can waste time, but it does look elegant.
with data(transaction_date, type, value) as (
select
to_date(column1, 'dd.mm.yyyy'),
column2,
column3
from
values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
)
select *
from data match_recognize(
    order by transaction_date
    measures
        sum(iff(CLASSIFIER() = 'ROW_WITH_DEPOSIT', value, 0)) DEPOSITS,
        pow(iff(CLASSIFIER() = 'ROW_WITH_WITHDRAWL', value / 100, 1), count(row_with_withdrawl.*)) DISCOUNT_FROM_WITHDRAWL,
        CLASSIFIER() TRANS_TYPE,
        first(transaction_date) as start_date,
        last(transaction_date) as end_date,
        count(*) as rows_in_sequence,
        count(row_with_deposit.*) as num_deposits,
        count(row_with_withdrawl.*) as num_withdrawls
    after match skip past last row
    pattern((row_with_deposit+ | row_with_withdrawl+))
    define
        row_with_deposit as type = 1,
        row_with_withdrawl as type = 0
);

TSQL - dates overlapping - number of days

I have the following table on SQL Server:
ID  FROM        TO          OFFER NUMBER
1   2022.01.02  9999.12.31  1
1   2022.01.02  2022.02.10  2
2   2022.01.05  2022.02.15  1
3   2022.01.02  9999.12.31  1
3   2022.01.15  2022.02.20  2
3   2022.02.03  2022.02.25  3
4   2022.01.16  2022.02.05  1
5   2022.01.17  2022.02.13  1
5   2022.02.05  2022.02.13  2
The range includes the start date but excludes the end date.
The date 9999.12.31 is given (comes from another system), but we could use the last day of the current quarter instead.
I need to find a way to determine the number of days when the customer sees exactly one, two, or three offers. The following picture shows the method for id 3:
The expected results should be like (without using the last day of the quarter):
ID  days with 1 offer  days with 2 offers  days with 3 offers
1   2913863            39                  0
2   41                 0                   0
3   2913861            24                  17
4   20                 0                   0
5   19                 8                   0
I've found this article but it did not enlighten me.
Also, I have limited privileges; I am not able to declare a variable, for example, so I need to use "basic" T-SQL.
Please provide a detailed explanation besides the code.
Thanks in advance!
The following will (for each ID) extract all distinct dates, construct non-overlapping date ranges to test, and will count up the number of offers per range. The final step is to sum and format.
The fact that the start dates are inclusive and the end dates are exclusive while sometimes non-intuitive for the human, actually works well in algorithms like this.
DECLARE @Data TABLE (Id INT, FromDate DATETIME, ToDate DATETIME, OfferNumber INT)
INSERT @Data
VALUES
    (1, '2022-01-02', '9999-12-31', 1),
    (1, '2022-01-02', '2022-02-10', 2),
    (2, '2022-01-05', '2022-02-15', 1),
    (3, '2022-01-02', '9999-12-31', 1),
    (3, '2022-01-15', '2022-02-20', 2),
    (3, '2022-02-03', '2022-02-25', 3),
    (4, '2022-01-16', '2022-02-05', 1),
    (5, '2022-01-17', '2022-02-13', 1),
    (5, '2022-02-05', '2022-02-13', 2)
;
WITH Dates AS ( -- Gather distinct dates
SELECT Id, Date = FromDate FROM @Data
UNION --(distinct)
SELECT Id, Date = ToDate FROM @Data
),
Ranges AS ( --Construct non-overlapping ranges (The ToDate = NULL case will be ignored later)
SELECT ID, FromDate = Date, ToDate = LEAD(Date) OVER(PARTITION BY Id ORDER BY Date)
FROM Dates
),
Counts AS ( -- Calculate days and count offers per date range
SELECT R.Id, R.FromDate, R.ToDate,
Days = DATEDIFF(DAY, R.FromDate, R.ToDate),
Offers = COUNT(*)
FROM Ranges R
JOIN @Data D ON D.Id = R.Id
AND D.FromDate <= R.FromDate
AND D.ToDate >= R.ToDate
GROUP BY R.Id, R.FromDate, R.ToDate
)
SELECT Id
,[Days with 1 Offer] = SUM(CASE WHEN Offers = 1 THEN Days ELSE 0 END)
,[Days with 2 Offers] = SUM(CASE WHEN Offers = 2 THEN Days ELSE 0 END)
,[Days with 3 Offers] = SUM(CASE WHEN Offers = 3 THEN Days ELSE 0 END)
FROM Counts
GROUP BY Id
The WITH clause introduces Common Table Expressions (CTEs) which progressively build up intermediate results until a final select can be made.
Results:
Id  Days with 1 Offer  Days with 2 Offers  Days with 3 Offers
1   2913863            39                  0
2   41                 0                   0
3   2913861            24                  17
4   20                 0                   0
5   19                 8                   0
Alternately, the final select could use a pivot. Something like:
SELECT Id,
[Days with 1 Offer] = ISNULL([1], 0),
[Days with 2 Offers] = ISNULL([2], 0),
[Days with 3 Offers] = ISNULL([3], 0)
FROM (SELECT Id, Offers, Days FROM Counts) C
PIVOT (SUM(Days) FOR Offers IN ([1], [2], [3])) PVT
ORDER BY Id
See This db<>fiddle for a working example.
Find all date points for each ID. For each date point, find the number of overlapping offers.
Refer to the comments within the query.
with
dates as
(
-- get all date points
select ID, theDate = FromDate from offers
union -- union to exclude any duplicate
select ID, theDate = ToDate from offers
),
cte as
(
select ID = d.ID,
Date_Start = d.theDate,
Date_End = LEAD(d.theDate) OVER (PARTITION BY ID ORDER BY theDate),
TheCount = c.cnt
from dates d
cross apply
(
-- Count no of overlapping
select cnt = count(*)
from offers x
where x.ID = d.ID
and x.FromDate <= d.theDate
and x.ToDate > d.theDate
) c
)
select ID, TheCount, days = sum(datediff(day, Date_Start, Date_End))
from cte
where Date_End is not null
group by ID, TheCount
order by ID, TheCount
Result:
ID  TheCount  days
1   1         2913863
1   2         39
2   1         41
3   1         2913861
3   2         29
3   3         12
4   1         20
5   1         19
5   2         8
To get to the required format, use PIVOT
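For example, a sketch that swaps the final SELECT of the query above for a PIVOT (reusing the cte defined above; column names as in the result shown):
select ID,
       [Days with 1 Offer]  = isnull([1], 0),
       [Days with 2 Offers] = isnull([2], 0),
       [Days with 3 Offers] = isnull([3], 0)
from (
    select ID, TheCount,
           days = datediff(day, Date_Start, Date_End)
    from cte
    where Date_End is not null
) c
pivot (sum(days) for TheCount in ([1], [2], [3])) pvt
order by ID;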
dbfiddle demo

Estimation of Cumulative value every 3 months in SQL

I have a table like this:
ID Date Prod
1 1/1/2009 5
1 2/1/2009 5
1 3/1/2009 5
1 4/1/2009 5
1 5/1/2009 5
1 6/1/2009 5
1 7/1/2009 5
1 8/1/2009 5
1 9/1/2009 5
And I need to get the following result:
ID Date Prod CumProd
1 2009/03/01 5 15 ---Each 3 months
1 2009/06/01 5 30 ---Each 3 months
1 2009/09/01 5 45 ---Each 3 months
What could be the best approach to take in SQL?
You can try the below, using window functions:
DEMO Here
select * from
(
select *,sum(prod) over(order by DATEPART(qq,dateval)) as cum_sum,
row_number() over(partition by DATEPART(qq,dateval) order by dateval) as rn
from t
)A where rn=1
How about just filtering on the month number?
select t.*
from (select id, date, prod, sum(prod) over (partition by id order by date) as running_prod
from t
) t
where month(date) in (3, 6, 9, 12);

Tag consecutive non zero rows into distinct partitions?

Suppose we have this simple schema and data:
DROP TABLE #builds
CREATE TABLE #builds (
Id INT IDENTITY(1,1) NOT NULL,
StartTime INT,
IsPassed BIT
)
INSERT INTO #builds (StartTime, IsPassed) VALUES
(1, 1),
(7, 1),
(10, 0),
(15, 1),
(21, 1),
(26, 0),
(34, 0),
(44, 0),
(51, 1),
(60, 1)
SELECT StartTime, IsPassed, NextStartTime,
CASE IsPassed WHEN 1 THEN 0 ELSE NextStartTime - StartTime END Duration
FROM (
SELECT
LEAD(StartTime) OVER (ORDER BY StartTime) NextStartTime,
StartTime, IsPassed
FROM #builds
) x
ORDER BY StartTime
It produces the following result set:
StartTime IsPassed NextStartTime Duration
1 1 7 0
7 1 10 0
10 0 15 5
15 1 21 0
21 1 26 0
26 0 34 8
34 0 44 10
44 0 51 7
51 1 60 0
60 1 NULL 0
I need to summarize the consecutive non-zero Duration values and show them at the StartTime of the first row in the batch, i.e. I need to get to this:
StartTime Duration
10 5
26 25
I just can't figure out how to do it.
PS: The real table contains many more rows, of course.
This is a gaps and islands problem, requiring partitioning each section where IsPassed is constant into a different group. That can be done by computing the difference between ROW_NUMBER() over the entire table against partitioned by IsPassed. You can then SUM the Duration Values for each group where IsPassed = False and take the MIN(StartTime) to give the StartTime of the first row of the group:
WITH CTE AS (
SELECT StartTime, IsPassed,
LEAD(StartTime) OVER (ORDER BY StartTime) AS NextStartTime
FROM #builds
),
CTE2 AS (
SELECT StartTime, IsPassed, NextStartTime,
CASE IsPassed WHEN 1 THEN 0 ELSE NextStartTime - StartTime END Duration,
ROW_NUMBER() OVER (ORDER BY StartTime) -
ROW_NUMBER() OVER (PARTITION BY IsPassed ORDER BY StartTime) AS grp
FROM CTE
)
SELECT MIN(StartTime) AS StartTime, SUM(Duration) AS Duration
FROM CTE2
WHERE IsPassed = 0
GROUP BY grp
ORDER BY MIN(StartTime)
Output:
StartTime Duration
10 5
26 25
Demo on dbfiddle
Your approach is unnecessarily complicated. You simply need to assign the 0s to groups that include exactly the following 1.
You can do this by counting the number of "1"s on or after each row. Of course, this also assigns a grouping to the rows with no "0"s. These can be filtered out by ensuring that there is at least one 0 in each group:
select min(StartTime), max(startTime) - min(startTime)
from (select b.*,
sum(case when IsPassed = 1 then 1 else 0 end) over (order by StartTime desc) as grp
from builds b
) b
group by grp
having min(convert(int, IsPassed)) = 0
order by min(StartTime);
Here is a db<>fiddle.
Or an alternative method uses no aggregation at all. It simply gets the next "1" starttime for each row and then filters down to the first "0" row:
select StartTime, next_1_starttime - StartTime
from (select b.*,
lag(IsPassed) over (order by StartTime) as prev_IsPassed,
min(case when IsPassed = 1 then StartTime end) over (order by StartTime desc) as next_1_starttime
from builds b
) b
where IsPassed = 0 and (prev_IsPassed = 1 or prev_IsPassed is null)
order by StartTime;
This probably has the best performance of the alternatives.

How to calculate moving sum with reset based on condition in Teradata SQL?

I have this data and I want to sum the field USAGE_FLAG, resetting when it drops to 0 or moves to a new ID, keeping the dataset ordered by SU_ID and WEEK:
SU_ID WEEK USAGE_FLAG
100 1 0
100 2 7
100 3 7
100 4 0
101 1 0
101 2 7
101 3 0
101 4 7
102 1 7
102 2 7
102 3 7
102 4 0
So I want to create this table:
SU_ID WEEK USAGE_FLAG SUM
100 1 0 0
100 2 7 7
100 3 7 14
100 4 0 0
101 1 0 0
101 2 7 7
101 3 0 0
101 4 7 7
102 1 7 7
102 2 7 14
102 3 7 21
102 4 0 0
I have tried the MSUM() function using GROUP BY, but it won't keep the order I want above; it groups the 7s and the week numbers together, which I don't want.
Anyone know if this is possible to do? I'm using Teradata.
In standard SQL a running sum can be done using a windowing function:
select su_id,
week,
usage_flag,
sum(usage_flag) over (partition by su_id order by week) as running_sum
from the_table;
I know Teradata supports windowing functions, I just don't know whether it also supports an order by in the window definition.
Resetting the sum is a bit more complicated. You first need to create "group IDs" that change each time the usage_flag goes to 0. The following works in PostgreSQL, I don't know if this works in Teradata as well:
select su_id,
week,
usage_flag,
sum(usage_flag) over (partition by su_id, group_nr order by week) as running_sum
from (
select t1.*,
sum(group_flag) over (partition by su_id order by week) as group_nr
from (
select *,
case
when usage_flag = 0 then 1
else 0
end as group_flag
from the_table
) t1
) t2
order by su_id, week;
Try the code below; with the use of the RESET WHEN clause it works fine.
select su_id,
week,
usage_flag,
SUM(usage_flag) OVER (
PARTITION BY su_id
ORDER BY week
RESET WHEN usage_flag < /* preceding row */ SUM(usage_flag) OVER (
PARTITION BY su_id ORDER BY week
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
ROWS UNBOUNDED PRECEDING
)
from emp_su;
Please try the SQL below:
select su_id,
week,
usage_flag,
SUM(usage_flag) OVER (PARTITION BY su_id ORDER BY week
RESET WHEN usage_flag = 0
ROWS UNBOUNDED PRECEDING
)
from emp_su;
Here RESET WHEN usage_flag = 0 resets the running sum whenever usage_flag drops to 0.