I have a table like
timestamp   type  value
08.01.2023  1     5
07.01.2023  0     20
06.01.2023  1     1
05.01.2023  0     50
04.01.2023  0     50
03.01.2023  1     1
02.01.2023  1     1
01.01.2023  1     1
Type 1 means a deposit, type 0 means a withdrawal.
When the type is 1, the value is the exact amount the user deposited, so we can simply sum it. When the type is 0, the value is a withdrawal expressed as a percentage of the current total.
What I'm looking for is to create another column with the current deposited amount. For the example above it would look like this:
timestamp   type  value  deposited
08.01.2023  1     5      6.4
07.01.2023  0     20     1.4
06.01.2023  1     1      1.75
05.01.2023  0     50     0.75
04.01.2023  0     50     1.5
03.01.2023  1     1      3
02.01.2023  1     1      2
01.01.2023  1     1      1
I can't figure out how to build a running sum like this, where each withdrawal subtracts a percentage of the previous total.
You are trying to carry state over time, so you either need a UDTF to do the carry work for you, or a recursive CTE:
with data(transaction_date, type, value) as (
select to_date(column1, 'dd.mm.yyyy'), column2, column3
from values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
), pre_process_data as (
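-- split each row into an additive part (deposits) and a percentage part
-- (withdrawals), and number the rows so the recursion can walk them in order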
select *
,iff(type = 0, 0, value)::number as add
,iff(type = 0, value, 0)::number as per
,row_number()over(order by transaction_date asc) as rn
from data
), rec_cte_block as (
with recursive rec_sub_cte as (
select
p.*,
p.add::number(20,4) as deposited
from pre_process_data as p
where p.rn = 1
union all
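-- recursive step: carry the prior total forward, add any deposit,
-- then apply the withdrawal percentage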
select
p.*,
round(div0((r.deposited + p.add)*(100-p.per), 100), 2) as deposited
from rec_sub_cte as r
left join pre_process_data as p
where p.rn = r.rn+1
)
select *
from rec_sub_cte
)
select * exclude(add, per, rn)
from rec_cte_block
order by 1;
I wrote the recursive CTE this way because there is currently an open incident when IFF or CASE is used inside the recursive part of the CTE.
TRANSACTION_DATE  TYPE  VALUE  DEPOSITED
2023-01-01        1     1      1
2023-01-02        1     1      2
2023-01-03        1     1      3
2023-01-04        0     50     1.5
2023-01-05        0     50     0.75
2023-01-06        1     1      1.75
2023-01-07        0     20     1.4
2023-01-08        1     5      6.4
A solution without recursion or a UDTF:
create table depo (timestamp date,type int, value float);
insert into depo values
(cast('01.01.2023' as date),1, 1.0)
,(cast('02.01.2023' as date),1, 1.0)
,(cast('03.01.2023' as date),1, 1.0)
,(cast('04.01.2023' as date),0, 50.0)
,(cast('05.01.2023' as date),0, 50.0)
,(cast('06.01.2023' as date),1, 1.0)
,(cast('07.01.2023' as date),0, 20.0)
,(cast('08.01.2023' as date),1, 5.0)
;
with t0 as(
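-- gr: start a new group every time 100% (or more) is withdrawn,
-- so the running total can restart from zero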
select *
,sum(case when type=0 and value>=100 then 1 else 0 end)over(order by timestamp) gr
from depo
)
,t1 as (select timestamp as dt,type,gr
,case when type=1 then value else 0 end depo
,case when type=0 then ((100.0-value)/100.0) else 0.0 end pct
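-- totLog: running sum of log2(multiplier) from the current row to the end of
-- the group, i.e. the log of the product of current and future multipliers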
,sum(case when type=0 and value<100 then log((100.0-value)/100.0,2.0)
when type=0 and value>=100 then null
else 0.0
end)
over(partition by gr order by timestamp ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) totLog
from t0
)
,t2 as(
select *
,case when type=1 then
isnull(sum(depo*power(cast(2.0 as float),totLog))
over(partition by gr order by dt rows between unbounded preceding and 1 preceding)
,0)/power(cast(2.0 as float),totLog)
+depo
else
isnull(sum(depo*power(cast(2.0 as float),totLog))
over(partition by gr order by dt rows between unbounded preceding and 1 preceding)
,0)/power(cast(2.0 as float),totLog)*pct
end rest
from t1
)
select dt,type,depo,pct*100 pct
,rest-lag(rest,1,0)over(order by dt) movement
,rest
from t2
order by dt
dt          type  depo  pct  movement  rest
2023-01-01  1     1     0    1         1
2023-02-01  1     1     0    1         2
2023-03-01  1     1     0    1         3
2023-04-01  0     0     50   -1.5      1.5
2023-05-01  0     0     50   -0.75     0.75
2023-06-01  1     1     0    1         1.75
2023-07-01  0     0     80   -0.35     1.4
2023-08-01  1     5     0    5         6.4
(Note: CAST parsed the 'dd.mm.yyyy' literals month-first, so the output dates differ from the question's input; the running totals are what matter here.)
I think it is better to perform this kind of calculation on the client side or in a middle tier. Sequential calculations are difficult to implement in SQL. In some special cases you can use logarithmic expressions, but it is clearer and easier to implement through recursion, as @Simeon showed.
To expand on @ValNik's answer:
The first simple step is to change "deduct 20%, then deduct 50%, then deduct 30%" into a multiplication...
X - 20% - 50% - 30%
=>
x * 0.8 * 0.5 * 0.7
=>
x * 0.28
The second trick is to understand how to calculate a cumulative PRODUCT() when you only have a cumulative sum, SUM() OVER (), using the properties of logarithms...
a * b == exp( log(a) + log(b) )
0.8 * 0.5 * 0.7
=>
exp( log(0.8) + log(0.5) + log(0.7) )
=>
exp( -0.2231 + -0.6931 + -0.3567 )
=>
exp( -1.2730 )
=>
0.28
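For example, here is a minimal, self-contained sketch of a cumulative product built from SUM() OVER alone (the table and column names are made up for illustration):
with rates(rn, multiplier) as (
    select column1, column2
    from values (1, 0.8), (2, 0.5), (3, 0.7)
)
select rn,
       multiplier,
       -- cumulative product via exp of a cumulative sum of logs
       exp(sum(ln(multiplier)) over (order by rn)) as cumulative_product
from rates;
-- the last row returns 0.28, matching the direct multiplication above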
The next trick is easier to explain with integers rather than percentages. That is to break the original problem down into one that can be solved using "cumulative sum" and "cumulative product"...
Current working:
row_id  type  value  equation                      result
1       +     10     0 + 10                        10
2       +     20     (0 + 10 + 20)                 30
3       *     2      (0 + 10 + 20) * 2             60
4       +     30     (0 + 10 + 20) * 2 + 30        90
5       *     3      ((0 + 10 + 20) * 2 + 30) * 3  270
Rearranged working:
row_id  type  value  CUMPROD  new equation              result
1       +     10     2*3=6    (10*6) / 6                10
2       +     20     2*3=6    (10*6 + 20*6) / 6         30
3       *     2      3=3      (10*6 + 20*6) / 3         60
4       +     30     3=3      (10*6 + 20*6 + 30*3) / 3  90
5       *     3      =1       (10*6 + 20*6 + 30*3) / 1  270
CUMPROD is the "cumulative product" of all future "multiplication values".
The equation is then the "cumulative sum" of value * CUMPROD divided by the current CUMPROD.
So...
row 1 : SUM(10*6 ) / 6 => SUM(10 )
row 2 : SUM(10*6, 20*6 ) / 6 => SUM(10, 20)
row 3 : SUM(10*6, 20*6 ) / 3 => SUM(10, 20) * 2
row 4 : SUM(10*6, 20*6, 30*3) / 3 => SUM(10, 20) * 2 + SUM(30)
row 5 : SUM(10*6, 20*6, 30*3) / 1 => SUM(10, 20) * 2*3 + SUM(30) * 3
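As a sanity check, here is a self-contained sketch that reproduces the integer example with exactly this trick (SQL Server flavour to match the code below; table and column names are mine):
WITH ops(row_id, op, value) AS (
    SELECT * FROM (VALUES (1, '+', 10.0), (2, '+', 20.0), (3, '*', 2.0),
                          (4, '+', 30.0), (5, '*', 3.0)) v(row_id, op, value)
),
cum AS (
    SELECT *,
           -- CUMPROD of all future "multiplication values", via logs
           EXP(ISNULL(SUM(CASE WHEN op = '*' THEN LOG(value) END)
                          OVER (ORDER BY row_id
                                ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0)) AS cumprod
    FROM ops
)
SELECT row_id, op, value,
       -- cumulative sum of value * CUMPROD, divided by the current CUMPROD
       SUM(CASE WHEN op = '+' THEN value * cumprod ELSE 0 END)
           OVER (ORDER BY row_id ROWS UNBOUNDED PRECEDING) / cumprod AS result
FROM cum;
-- results: 10, 30, 60, 90, 270 -- matching the table above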
The only things to be cautious of are:
LOG(0) tends to negative infinity (which would happen when deducting 100%)
Deducting more than 100% makes no sense
So, I copied @ValNik's trick of creating a new partition every time 100% or more is deducted (forcing everything in the next partition to start again from zero).
This gives the following SQL (a rearranged version of @ValNik's code):
WITH
partition_when_deduct_everything AS
(
SELECT
*,
SUM(
CASE WHEN type = 0 AND value >= 100 THEN 1 ELSE 0 END
)
OVER (
ORDER BY timestamp
)
AS deduct_everything_id,
CASE WHEN type = 1 THEN value
ELSE 0
END
AS deposit,
CASE WHEN type = 1 THEN 1.0 -- Deposits == Deduct 0%
WHEN value >= 100 THEN 1.0 -- Treat "deduct everything" as a special case
ELSE (100.0-value)/100.0 -- Change "deduct 20%" to "multiply by 0.8"
END
AS multiplier
FROM
your_table
)
,
cumulative_product_of_multipliers as
(
SELECT
*,
EXP(
ISNULL(
SUM(
LOG(multiplier)
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN 1 FOLLOWING
AND UNBOUNDED FOLLOWING
)
, 0
)
)
AS future_multiplier
FROM
partition_when_deduct_everything
)
SELECT
*,
ISNULL(
SUM(
deposit * future_multiplier
)
OVER (
PARTITION BY deduct_everything_id
ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
),
0
)
/
future_multiplier
AS rest
FROM
cumulative_product_of_multipliers
Demo : https://dbfiddle.uk/mrioIMiB
The way this should really be solved is with a UDTF, because it requires sorting the data only once and traversing it only once, and if you have different partitions (user_id, etc.) the work can be done in parallel:
create or replace function carry_value_state(_TYPE float, _VALUE float)
returns table (DEPOSITED float)
language javascript
as
$$
{
initialize: function(argumentInfo, context) {
this.carried_value = 0.0;
},
processRow: function (row, rowWriter, context){
if(row._TYPE === 1) {
this.carried_value += row._VALUE;
} else {
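// clamp the withdrawal percentage to the 0..100 range before applying it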
let limited = Math.max(Math.min(row._VALUE, 100.0), 0.0);
this.carried_value -= (this.carried_value * limited) / 100;
}
rowWriter.writeRow({DEPOSITED: this.carried_value});
}
}
$$;
which then gets used like:
select d.*,
c.*
from data as d
,table(carry_value_state(d.type::float, d.value::float) over (order by transaction_date)) as c
order by 1;
So, for the data we have been using in the example, that gives:
TRANSACTION_DATE  TYPE  VALUE  DEPOSITED
2023-01-01        1     1      1
2023-01-02        1     1      2
2023-01-03        1     1      3
2023-01-04        0     50     1.5
2023-01-05        0     50     0.75
2023-01-06        1     1      1.75
2023-01-07        0     20     1.4
2023-01-08        1     5      6.4
Yes, the results are now in floating point, so you should double-round to avoid FP representation problems, like:
round(round(c.deposited, 6) , 2) as deposited
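Putting that together with the earlier query (a sketch, reusing the same data CTE as the recursive answer):
select d.transaction_date,
       d.type,
       d.value,
       round(round(c.deposited, 6), 2) as deposited
from data as d,
     table(carry_value_state(d.type::float, d.value::float)
           over (order by d.transaction_date)) as c
order by 1;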
An alternative approach using MATCH_RECOGNIZE(), POW() and SUM().
I would not recommend using MATCH_RECOGNIZE() unless you have to; it's fiddly and can waste time, although it does look elegant.
with data(transaction_date, type, value) as (
select
to_date(column1, 'dd.mm.yyyy'),
column2,
column3
from
values
('08.01.2023', 1, 5),
('07.01.2023', 0, 20),
('06.01.2023', 1, 1),
('05.01.2023', 0, 50),
('04.01.2023', 0, 50),
('03.01.2023', 1, 1),
('02.01.2023', 1, 1),
('01.01.2023', 1, 1)
)
select *
from data match_recognize(
    order by transaction_date
    measures
        sum(iff(CLASSIFIER() = 'ROW_WITH_DEPOSIT', value, 0)) DEPOSITS,
        pow(iff(CLASSIFIER() = 'ROW_WITH_WITHDRAWL', value / 100, 1),
            count(row_with_withdrawl.*)) DISCOUNT_FROM_WITHDRAWL,
        CLASSIFIER() TRANS_TYPE,
        first(transaction_date) as start_date,
        last(transaction_date) as end_date,
        count(*) as rows_in_sequence,
        count(row_with_deposit.*) as num_deposits,
        count(row_with_withdrawl.*) as num_withdrawls
    after match skip past last row
    pattern((row_with_deposit+ | row_with_withdrawl+))
    define
        row_with_deposit as type = 1,
        row_with_withdrawl as type = 0
);
Related
I'm trying to calculate the total on an interest-bearing account accounting for deposits/withdrawals with BigQuery.
Example scenario:
Daily interest rate = 10%
Value added/removed on every day: [100, 0, 29, 0, -100] (negative means amount removed)
The totals for each day are:
Day 1: 0*1.1 + 100 = 100
Day 2: 100*1.1 + 0 = 110
Day 3: 110*1.1 + 29 = 150
Day 4: 150*1.1 + 0 = 165
Day 5: 165*1.1 - 100 = 81.5
This would be trivial to implement in a language like Python:
daily_changes = [100, 0, 29, 0, -100]
interest_rate = 0.1
result = []
for day, change in enumerate(daily_changes):
if day == 0:
result.append(change)
else:
result.append(result[day-1]*(1+interest_rate) + change)
print(result)
# Result: [100, 110.00000000000001, 150.00000000000003, 165.00000000000006, 81.50000000000009]
My difficulty lies in calculating values for row N when they depend on row N-1 (the usual SUM(...) OVER (ORDER BY...) solution does not suffice here).
Here's a CTE to test with the mock data in this example.
with raw_data as (
select 1 as day, numeric '100' as change union all
select 2 as day, numeric '0' as change union all
select 3 as day, numeric '29' as change union all
select 4 as day, numeric '0' as change union all
select 5 as day, numeric '-100' as change
)
select * from raw_data
You may try the query below:
SELECT day,
ROUND((SELECT SUM(c * POW(1.1, day - o - 1))
FROM t.changes c WITH OFFSET o), 2) AS totals
FROM (
SELECT *, ARRAY_AGG(change) OVER (ORDER BY day) changes
FROM raw_data
) t;
+-----+--------+
| day | totals |
+-----+--------+
| 1 | 100.0 |
| 2 | 110.0 |
| 3 | 150.0 |
| 4 | 165.0 |
| 5 | 81.5 |
+-----+--------+
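This works because each day's change compounds independently, so the balance on a given day is just a sum of scaled changes:
total(day) = SUM over offsets o of change[o] * 1.1 ^ (day - o - 1)
which is exactly what the POW() expression computes over the accumulated array.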
Another option, with the use of a recursive CTE:
with recursive raw_data as (
select 1 as day, numeric '100' as change union all
select 2 as day, numeric '0' as change union all
select 3 as day, numeric '29' as change union all
select 4 as day, numeric '0' as change union all
select 5 as day, numeric '-100' as change
), iterations as (
select *, change as total
from raw_data where day = 1
union all
select r.day, r.change, 1.1 * i.total + r.change
from iterations i join raw_data r
on r.day = i.day + 1
)
select *
from iterations
with output:
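day  change  total
1    100     100
2    0       110
3    29      150
4    0       165
5    -100    81.5
matching the hand-computed totals from the question.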
I want to create a flag in BigQuery which will return 1 when true and 0 when false. The statement works fine when it has to return the "else" value, which is 0. However, when it satisfies the condition, it returns two rows, with both 1 and 0 in them. Why is this happening?
Below is the code used:
WITH table AS(
SELECT
id,
month,
ROUND((text/(month_days/7)), 2) AS value
FROM (SELECT id, extract(month FROM date) AS month,
(32 - EXTRACT(DAY FROM DATE_ADD(DATE_TRUNC(DATE(date), MONTH), INTERVAL 31 DAY))) AS month_days,
sum(text_sent) AS text
FROM table1
WHERE date BETWEEN '2020-01-01 00:00:00 UTC' AND '2020-06-30 00:00:00 UTC'
GROUP BY 1,2,3)),
table_flag AS(
SELECT
id,
CASE
WHEN month = 1 AND value > 100 THEN 1
WHEN month = 2 AND value > 150 THEN 1
WHEN month = 3 AND value > 130 THEN 1
WHEN month = 4 AND value > 200 THEN 1
WHEN month = 5 AND value > 235 THEN 1
WHEN month = 6 AND value > 125 THEN 1
WHEN month = 7 AND value > 324 THEN 1
WHEN month = 8 AND value > 160 THEN 1
WHEN month = 9 AND value > 350 THEN 1
WHEN month = 10 AND value > 80 THEN 1
WHEN month = 11 AND value > 245 THEN 1
ELSE 0
END AS value_flag
FROM
table)
SELECT
t.id,
t.value,
t.month,
tf.value_flag
FROM
table t
LEFT JOIN
table_flag tf
ON
t.id = tf.id
WHERE t.id IS NOT NULL
GROUP BY 1,2,3,4
ORDER BY 1
I have also tried nested IF, but that doesn't work either:
SELECT DISTINCT(id),
(IF((month = 1 AND value > 100), 1,
(IF((month = 2 AND value > 150), 1,
(IF((month = 3 AND value > 130), 1,
(IF((month = 4 AND value > 200), 1,
(IF((month = 5 AND value > 235), 1,
(IF((month = 6 AND value > 125), 1,
(IF((month = 7 AND value > 324), 1,
(IF((month = 8 AND value > 160), 1,
(IF((month = 9 AND value > 350), 1,
(IF((month = 10 AND value > 80), 1,
(IF((month = 11 AND value > 245), 1,0))))))))))))))))))))))
AS value_flag
FROM table
This is how the output looks right now (screenshot omitted); it is completely wrong and NOT what I want. Please suggest an alternate method (if any) to do it.
P.S.: This is my first question here, please let me know if any other information is needed. Thanks in advance for the help!
Both table and table_flag have several rows with the same id, so for each row in table, BigQuery finds several matching rows in table_flag. To see the fan-out in miniature (a sketch with hypothetical values):
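-- if id 7 appears twice in table and twice in table_flag...
WITH t AS (SELECT 7 AS id, 1 AS month UNION ALL SELECT 7, 2),
     tf AS (SELECT 7 AS id, 1 AS value_flag UNION ALL SELECT 7, 0)
SELECT t.id, t.month, tf.value_flag
FROM t LEFT JOIN tf ON t.id = tf.id;
-- ...the join returns 2 x 2 = 4 rows for id 7, one per combination,
-- which is where the duplicated 1/0 flags come from
To remove duplicates we could add month to table_flag and to the ON clause. But we actually do not need the last LEFT JOIN at all. Try this: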
WITH table AS(
SELECT
id,
month,
ROUND((text/(month_days/7)), 2) AS value
FROM (
SELECT
id,
extract(month FROM date) AS month,
(32 - EXTRACT(DAY FROM DATE_ADD(DATE_TRUNC(DATE(date), MONTH), INTERVAL 31 DAY))) AS month_days,
sum(text_sent) AS text
FROM table1
WHERE
date BETWEEN '2020-01-01 00:00:00 UTC' AND '2020-06-30 00:00:00 UTC'
AND id IS NOT NULL
GROUP BY 1,2,3
)
)
SELECT
id,
value,
month,
CASE
WHEN month = 1 AND value > 100 THEN 1
WHEN month = 2 AND value > 150 THEN 1
WHEN month = 3 AND value > 130 THEN 1
WHEN month = 4 AND value > 200 THEN 1
WHEN month = 5 AND value > 235 THEN 1
WHEN month = 6 AND value > 125 THEN 1
WHEN month = 7 AND value > 324 THEN 1
WHEN month = 8 AND value > 160 THEN 1
WHEN month = 9 AND value > 350 THEN 1
WHEN month = 10 AND value > 80 THEN 1
WHEN month = 11 AND value > 245 THEN 1
ELSE 0
END AS value_flag
FROM table
ORDER BY 1
or this:
SELECT
id,
month,
ROUND((text/(month_days/7)), 2) AS value,
CASE
WHEN month = 1 AND ROUND((text/(month_days/7)), 2) > 100 THEN 1
WHEN month = 2 AND ROUND((text/(month_days/7)), 2) > 150 THEN 1
WHEN month = 3 AND ROUND((text/(month_days/7)), 2) > 130 THEN 1
WHEN month = 4 AND ROUND((text/(month_days/7)), 2) > 200 THEN 1
WHEN month = 5 AND ROUND((text/(month_days/7)), 2) > 235 THEN 1
WHEN month = 6 AND ROUND((text/(month_days/7)), 2) > 125 THEN 1
WHEN month = 7 AND ROUND((text/(month_days/7)), 2) > 324 THEN 1
WHEN month = 8 AND ROUND((text/(month_days/7)), 2) > 160 THEN 1
WHEN month = 9 AND ROUND((text/(month_days/7)), 2) > 350 THEN 1
WHEN month = 10 AND ROUND((text/(month_days/7)), 2) > 80 THEN 1
WHEN month = 11 AND ROUND((text/(month_days/7)), 2) > 245 THEN 1
ELSE 0
END AS value_flag
FROM (
SELECT
id,
extract(month FROM date) AS month,
(32 - EXTRACT(DAY FROM DATE_ADD(DATE_TRUNC(DATE(date), MONTH), INTERVAL 31 DAY))) AS month_days,
sum(text_sent) AS text
FROM table1
WHERE
date BETWEEN '2020-01-01 00:00:00 UTC' AND '2020-06-30 00:00:00 UTC'
AND id IS NOT NULL
GROUP BY 1,2,3
)
ORDER BY 1
I have a table like this:
Year, DividendYield
1950, .1
1951, .2
1952, .3
I now want to calculate the total running shares. In other words, if the dividend is re-invested in new shares, it will look now like this:
Original Number of Shares purchased Jan 1, 1950 is 1
1950, .1, 1.1 -- yield of .1 reinvested in new shares results in .1 new shares, totaling 1.1
1951, .2, 1.32 -- (1.1 (Prior Year Total shares) * .2 (dividend yield) + 1.1 = 1.32)
1953, .3, 1.716 -- (1.32 * .3 + 1.32 = 1.716)
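In other words, the running total is a cumulative product: total(n) = (1 + y1) * (1 + y2) * ... * (1 + yn); e.g. 1.1 * 1.2 = 1.32 and 1.32 * 1.3 = 1.716.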
The closest I have been able to come up with is this:
declare @startingShares int = 1
; with cte_data as (
Select *,
@startingShares * DividendYield as NewShares,
(@startingShares * DividendYield) + @startingShares as TotalShares from DividendTest
)
select *, Sum(TotalShares) over (order by id) as RunningTotal from cte_data
But only the first row is correct.
Id Year DividendYield NewShares TotalShares RunningTotal
1 1950 0.10 0.10 1.10 1.10
2 1951 0.20 0.20 1.20 2.30
3 1953 0.30 0.30 1.30 3.60
How do I do this with SQL? I was trying not to resort to a loop to process this.
You want a cumulative multiplication. I think a recursive CTE is actually the simplest solution:
with tt as (
select t.*, row_number() over (order by year) as seqnum
from t
),
cte as (
select tt.year, convert(float, tt.yield) as yield, tt.seqnum
from tt
where seqnum = 1
union all
select tt.year, (tt.yield + 1) * (cte.yield + 1) - 1, tt.seqnum
from cte join
tt
on tt.seqnum = cte.seqnum + 1
)
select cte.*
from cte;
Here is a db<>fiddle.
You can also phrase this using logs and exponents:
select t.*,
exp(sum(log(1 + yield)) over (order by year)) - 1
from t;
This should be fine for most purposes, but I find that for longer series this introduces numerical errors more quickly than the recursive CTE.
I'm looking for a beautiful, easy-to-read, smart SQL query (SQLite engine) to aggregate data into columns. It is easier to explain with an example:
Data table :
id elapsedtime httpcode
1 0.0 200
2 0.1 200
3 0.3 301
4 0.6 404
5 1.0 200
6 1.1 404
7 1.2 500
Expected result set: one column per httpcode, with the number of occurrences of each code per time bucket. In this example the time aggregation is 0.2 s (but it could be aggregated per second, or per 10 s). I'm interested only in some expected http codes:
time code_200 code_404 code_500 code_other
0.0 2 0 0 0
0.2 0 0 0 1
0.4 0 1 0 0
0.6 0 0 0 0
0.8 0 0 0 0
1.0 1 1 1 0
It is not mandatory for "time" to be continuous. In the previous example, the lines with time 0.6 and 0.8 can be removed.
For the moment, I do this with 4 different queries (one per http code) and aggregate the results in my application:
select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_200
from test
where (httpcode=200)
group by time
But I'm pretty sure I can achieve this with a single query. Unfortunately I haven't mastered the UNION keyword...
Is there a way to get such data in a single SELECT?
See SQLFiddle : http://sqlfiddle.com/#!5/2081f/3/1
Found a nicer solution than my original post, which I'll leave in, in case you're curious. Here's the nicer solution:
with t1 as (
select
0.2 * cast (elapsedtime/ 0.2 as int) as time,
case httpcode when 200 then 1 else 0 end code_200,
case httpcode when 404 then 1 else 0 end code_404,
case httpcode when 500 then 1 else 0 end code_500,
case when httpcode not in (200, 404, 500) then 1 else 0 end code_other
from test
)
select time,
sum(code_200) as count_200,
sum(code_404) as count_404,
sum(code_500) as count_500,
sum(code_other) as count_other
from t1
group by time;
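With the sample data this returns the following (the empty 0.6 and 0.8 buckets are simply absent, since no rows fall into them):
time  count_200  count_404  count_500  count_other
0.0   2          0          0          0
0.2   0          0          0          1
0.4   0          1          0          0
1.0   1          1          1          0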
Old solution:
This might not be too easy on the eye, but it more or less works. The only difference between your desired output and what I get with this is that time groupings that have no values (0.6 and 0.8 in your example) are omitted:
with
t_all as (select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as total
from test
group by time
),
t_200 as (select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_200
from test
where (httpcode=200)
group by time),
t_404 as (select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_404
from test
where (httpcode=404)
group by time),
t_500 as (select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_500
from test
where (httpcode=500)
group by time),
t_other as (select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_other
from test
where (httpcode not in (200, 404, 500))
group by time)
select
t_all.time,
total,
ifnull(code_200,0) as count_200,
ifnull(code_404,0) as count_404,
ifnull(code_500,0) as count_500,
ifnull(code_other,0) as count_other
from t_all
left join t_200 on t_all.time = t_200.time
left join t_404 on t_all.time = t_404.time
left join t_500 on t_all.time = t_500.time
left join t_other on t_all.time = t_other.time;
This may help you:
select
0.2 * cast (elapsedtime/ 0.2 as int) as time, count(id) as code_200,
0 as code_404,
0 as code_500,
0 as code_other
from
test
where (httpcode=200)
group by time
union
select
0.2 * cast (elapsedtime/ 0.2 as int) as time,0 as code_200,
count(id) as code_404,
0 as code_500,
0 as code_other
from
test
where (httpcode=404)
group by time
union
select
0.2 * cast (elapsedtime/ 0.2 as int) as time,0 as code_200,
0 as code_404,
count(id) as code_500,
0 as code_other
from
test
where (httpcode=500)
group by time
union
select
0.2 * cast (elapsedtime/ 0.2 as int) as time,0 as code_200,
0 as code_404,
0 as code_500,
count(id) as code_other
from
test
where (httpcode<>200 and httpcode <> 404 and httpcode <> 500)
group by time
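Note that each UNION branch emits its own rows, so the same time bucket can appear up to four times, once per code. A sketch of merging them with one more level of aggregation (same table and bucketing as above):
with u as (
    select 0.2 * cast(elapsedtime / 0.2 as int) as time,
           count(id) as code_200, 0 as code_404, 0 as code_500, 0 as code_other
    from test where httpcode = 200 group by time
    union all
    select 0.2 * cast(elapsedtime / 0.2 as int) as time,
           0, count(id), 0, 0
    from test where httpcode = 404 group by time
    union all
    select 0.2 * cast(elapsedtime / 0.2 as int) as time,
           0, 0, count(id), 0
    from test where httpcode = 500 group by time
    union all
    select 0.2 * cast(elapsedtime / 0.2 as int) as time,
           0, 0, 0, count(id)
    from test where httpcode not in (200, 404, 500) group by time
)
select time,
       sum(code_200) as code_200,
       sum(code_404) as code_404,
       sum(code_500) as code_500,
       sum(code_other) as code_other
from u
group by time;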
Using Amazon Redshift (SQL), I've got a table of timestamps that I'd like to split into separate phases when the time between entries is above some threshold.
For example, using a threshold of 60 units for this input:
   id  ts
1  a     1
2  a     4
3  a    12
4  a    90
5  a    94
6  a   101
7  a   404
8  a   412
9  a   413
I'd like to return this:
   id   ts   dt  phase
1  a     1   NA      1
2  a     4    3      1
3  a    12    8      1
4  a    90   78      2
5  a    94    4      2
6  a   101    7      2
7  a   404  303      3
8  a   412    8      3
9  a   413    1      3
This is straightforward in R (which I'm most familiar with) using a simple for loop and ifelse which increments the previous phase value by 1 if dt > 60:
library(dplyr)  # needed for %>% and mutate

# sample data
df <- data.frame(id = rep("a", 9),
ts = c(1, 4, 12, 90, 94, 101, 404, 412, 413)) %>%
mutate(dt = c(NA, diff(ts)))
# add default minimum phase value, 1
df$phase<- 1
# for loop
for(i in 2:nrow(df)) {
df$phase[i] <- ifelse(df$dt[i] > 60, df$phase[i-1] + 1, df$phase[i-1])
}
However, my attempts using the lag function and case / when in SQL have not been successful.
-- sample data
CREATE TABLE sampledata (
conversationid varchar(10), ts integer
);
INSERT INTO sampledata (conversationid, ts)
VALUES
('a', 1),
('a', 4),
('a', 12),
('a', 90),
('a', 94),
('a', 101),
('a', 404),
('a', 412),
('a', 413);
-- query
SELECT *,
CASE
WHEN dt > 60 THEN LAG(period) OVER (PARTITION BY conversationid ORDER BY ts) + 1
ELSE LAG(period) OVER (PARTITION BY conversationid ORDER BY ts)
END AS period
FROM (
SELECT *,
ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS dt,
1 AS period
FROM sampledata
)
ORDER BY ts
;
-- output
id   ts   dt   period  period
a     1        1
a     4    3   1       1
a    12    8   1       1
a    90   78   1       2
a    94    4   1       1
a   101    7   1       1
a   404  303   1       2
a   412    8   1       1
a   413    1   1       1
I'm able to increment the phase value on rows where dt > 60, but not propagate the incremented phase value across subsequent rows.
I guess this may be something to do with the LAG function operating across all rows at once rather than row by row, and/or being unable to update the original period value on the fly (a second period column is created instead).
You are close. You want a cumulative sum based on the lag difference:
SELECT sd.*,
       SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END)
           OVER (PARTITION BY conversationid ORDER BY ts
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as period
FROM (SELECT sd.*,
             (ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts)) AS diff
      FROM sampledata sd
     ) sd
ORDER BY ts;
As a side note, I would expect you to use ORDER BY conversationid, ts, rather than just the time.
And finally, the above will start the periods at 0 (it correctly identifies them, just numbers them from zero). The following tweak does the enumeration as you specifically request:
SELECT sd.*,
       (1 + SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END)
                OVER (PARTITION BY conversationid ORDER BY ts
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as period
FROM (SELECT sd.*,
             (ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts)) AS diff
      FROM sampledata sd
     ) sd
ORDER BY ts;