Recursive loop in BigQuery for capped cumulative sum? - sql

I'd like to be able to implement a "capped" cumulative sum in BigQuery using SQL.
Here's what I mean: I have a table whose rows have the amount by which a value is increased/decreased each day, but the value cannot go below 0 or above 100. I want to compute the cumulative sum of the changes to keep track of this value.
As an example, consider the following table:
day | change
--------------
1 | 70
2 | 50
3 | 20
4 | -30
5 | 10
6 | -90
7 | 20
I want to make a column that has the capped cumulative sum so that it looks like this:
day | change | capped cumsum
----------------------------
1 | 70 | 70
2 | 50 | 100
3 | 20 | 100
4 | -30 | 70
5 | 10 | 80
6 | -90 | 0
7 | 20 | 20
Simply doing SUM(change) OVER (ORDER BY day) and capping the values at 100 and 0 won't work. I need some sort of recursive loop, and I don't know how to implement this in BigQuery.
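To illustrate, here is the naive windowed attempt (a sketch only, clamping the running total after the fact); it goes wrong from day 4 onward because the cap is never applied to the intermediate totals:
-- Naive attempt: compute the plain running total, then clamp it.
-- Yields 70, 100, 100, 100, 100, 30, 50 instead of the expected
-- 70, 100, 100, 70, 80, 0, 20.
SELECT day, change,
       LEAST(GREATEST(SUM(change) OVER (ORDER BY day), 0), 100) AS capped_cumsum
FROM your_table;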
Eventually I'd also like to do this over partitions, so that if I have something like
day | class | change
--------------
1 | A | 70
1 | B | 12
2 | A | 50
2 | B | 83
3 | A | -30
3 | B | 17
4 | A | 10
5 | A | -90
6 | A | 20
I can do the capped cumulative sum partitioned over each class.

I need some sort of recursive loop and I don't know how to implement this in BigQuery
Super naïve / cursor-based approach:
declare cumulative_change int64 default 0;
create temp table temp_table as (
  select *, 0 as capped_cumsum from your_table where false
);
for rec in (select * from your_table order by day)
do
  set cumulative_change = cumulative_change + rec.change;
  set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
  insert into temp_table (select rec.*, cumulative_change);
end for;
select * from temp_table order by day;
If applied to the sample data in your question, this produces the expected capped cumulative sum shown above.
Slightly modified option using an array instead of a temp table:
declare cumulative_change int64 default 0;
declare result array<struct<day int64, change int64, capped_cumsum int64>>;
for rec in (select * from your_table order by day)
do
  set cumulative_change = cumulative_change + rec.change;
  set cumulative_change = case when cumulative_change < 0 then 0 when cumulative_change > 100 then 100 else cumulative_change end;
  set result = array(select as struct * from unnest(result) union all select as struct rec.*, cumulative_change);
end for;
select * from unnest(result) order by day;
P.S. I like none of the above options so far :o)
Meantime, these approaches might work for relatively small tables / data sets.

Using a RECURSIVE CTE can be another option:
DECLARE sample ARRAY<STRUCT<day INT64, change INT64>> DEFAULT [
  (1, 70), (2, 50), (3, 20), (4, -30), (5, 10), (6, -90), (7, 20)
];
WITH RECURSIVE ccsum AS (
  SELECT 0 AS n, vals[OFFSET(0)] AS change,
         CASE
           WHEN vals[OFFSET(0)] > 100 THEN 100
           WHEN vals[OFFSET(0)] < 0 THEN 0
           ELSE vals[OFFSET(0)]
         END AS cap_csum
  FROM sample
  UNION ALL
  SELECT n + 1 AS n, vals[OFFSET(n + 1)] AS change,
         CASE
           WHEN cap_csum + vals[OFFSET(n + 1)] > 100 THEN 100
           WHEN cap_csum + vals[OFFSET(n + 1)] < 0 THEN 0
           ELSE cap_csum + vals[OFFSET(n + 1)]
         END AS cap_csum
  FROM ccsum, sample
  WHERE n < ARRAY_LENGTH(vals) - 1
),
sample AS (
  SELECT ARRAY_AGG(change ORDER BY day) vals FROM UNNEST(sample)
)
SELECT * EXCEPT(n) FROM ccsum ORDER BY n;
If applied to the sample data, the output matches the expected capped cumulative sum from the question.

Eventually I'd also like to do this over partitions ...
Consider the solution below:
create temp function cap_value(value int64, lower_boundary int64, upper_boundary int64) as (
  least(greatest(value, lower_boundary), upper_boundary)
);
with recursive temp_table as (
  select *, row_number() over(partition by class order by day) as n from your_table
), iterations as (
  select 1 as n, day, class, change, cap_value(change, 0, 100) as capped_cumsum
  from temp_table
  where n = 1
  union all
  select t.n, t.day, t.class, t.change, cap_value(i.capped_cumsum + t.change, 0, 100) as capped_cumsum
  from temp_table t
  join iterations i
    on t.n = i.n + 1
   and t.class = i.class
)
select * except(n) from iterations
order by class, day
If applied to the sample data in your question, this produces the capped cumulative sum within each class.

Related

Running SUM in T-SQL

Sorry for the bad title, but I wasn't sure what to call it.
I have a table looking like this:
Id | Count
----------
1  | 1
2  | 5
3  | 8
4  | 3
5  | 6
6  | 8
7  | 3
8  | 1
I'm trying to make a select from this table where every time the SUM of row1 + row2 + row3 (etc) reaches 10, then it's a "HIT", and the count starts over again.
Requested output:
Id | Count | HIT
----------------
1  | 1     | N     Count = 1
2  | 5     | N     Count = 6
3  | 8     | Y     Count = 14 (over 10)
4  | 3     | N     Count = 3
5  | 6     | N     Count = 9
6  | 8     | Y     Count = 17 (over 10)
7  | 3     | N     Count = 3
8  | 1     | N     Count = 4
How do I do this with the best performance? I have no idea.
You can't do this using window/analytic functions, because the breakpoints are not known in advance. Sometimes, it is possible to calculate the breakpoints. However, in this case, the breakpoints depend on a non-linear function of the previous values (I can't think of a better word than "non-linear" right now). That is, sometimes adding "1" to an earlier value has zero effect on the calculation for the current row. Sometimes it has a big effect. The implication is that the calculation has to start at the beginning and iterate through the data.
A minor modification to the problem would be solvable using such functions. If the problem were, instead, to carry over the excess amount for each group (instead of restarting the sum), the problem would be solvable using cumulative sums (and some other trickery).
Recursive queries (which others have provided) or a sequential operation are the best way to approach this problem. Unfortunately, the problem doesn't have a set-based method for solving it.
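To make that concrete, here is a sketch of the carry-over variant described above (an assumption on my part: SQL Server 2012+ for the ordered window frame, and the Table1 naming used elsewhere in this thread). When the excess carries over instead of resetting, a row is a hit exactly when the plain running total crosses a multiple of 10, which integer division exposes directly:
-- Carry-over variant only: flag a hit each time the raw running total
-- crosses a multiple of 10 (integer division on INT floors the quotient).
SELECT Id, [Count],
       CASE WHEN SUM([Count]) OVER (ORDER BY Id) / 10
             > (SUM([Count]) OVER (ORDER BY Id) - [Count]) / 10
            THEN 'Y' ELSE 'N' END AS HIT
FROM Table1;
This does not answer the original restart-at-zero question; it only illustrates why the modified problem is set-based while the original is not.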
You could use recursive queries.
Please note the following query assumes the id values are all in sequence; otherwise, use ROW_NUMBER() to create a new id:
WITH cte AS (
  SELECT id, [Count], [Count] AS total_count
  FROM Table1
  WHERE id = 1
  UNION ALL
  SELECT t2.id, t2.[Count], CASE WHEN t1.total_count >= 10 THEN t2.[Count] ELSE t1.total_count + t2.[Count] END
  FROM Table1 t2
  INNER JOIN cte t1
          ON t2.id = t1.id + 1
)
SELECT *
FROM cte
ORDER BY id
I'm really hoping someone can show us if it's possible to do this using straight-forward window functions. That's the real challenge.
In the meantime, here is how I would do it using recursion. This handles gaps in the sequence, and handles the edge case of the first row already being >= 10.
I also added the maxrecursion hint to remove the default recursion limit. But I honestly don't know how well it will run with larger amounts of data.
with NumberedRows as (
select Id, Cnt,
row_number() over(order by id) as rn
from CountTable
), RecursiveCTE as (
select Id, Cnt, rn,
case when Cnt >= 10 then 0 else Cnt end as CumulativeSum,
case when Cnt >= 10 then 'Y' else 'N' end as hit
from NumberedRows
where rn = 1
union all
select n.Id, n.Cnt, n.rn,
case when (n.Cnt + r.CumulativeSum) >= 10 then 0 else n.Cnt + r.CumulativeSum end as CumulativeSum,
case when (n.Cnt + r.CumulativeSum) >= 10 then 'Y' else 'N' end as hit
from RecursiveCTE r
join NumberedRows n
on n.rn = r.rn + 1
)
select Id, Cnt, hit
from RecursiveCTE
order by Id
option (maxrecursion 0)
How about this using Running Totals:
DECLARE @Data TABLE(
  Id INT
  ,SubTotal INT
)
INSERT INTO @Data VALUES(1, 5)
INSERT INTO @Data VALUES(2, 3)
INSERT INTO @Data VALUES(3, 4)
INSERT INTO @Data VALUES(4, 4)
INSERT INTO @Data VALUES(5, 7)
DECLARE @RunningTotal INT = 0
DECLARE @HitCount INT = 0
SELECT
  @RunningTotal = CASE WHEN @RunningTotal < 10 THEN @RunningTotal + SubTotal ELSE SubTotal END
  ,@HitCount = @HitCount + CASE WHEN @RunningTotal >= 10 THEN 1 ELSE 0 END
FROM @Data ORDER BY Id
SELECT @HitCount -- Outputs 2
Having re-read the question, I see this does not meet the required output. I'll leave the answer, as it may be of some use to someone looking for an example of a running-total solution to this type of problem that doesn't need each row tagged with a Y or an N.

How to find the average of multiple columns with different specific weights?

I need to calculate the average of Value_A, Value_B, and Value_C in MSSQL.
My problem is that all the information I need is in one row.
Every value has its own weight, and the result is (sum of value * weight) / (sum of weights).
Every column can be null. If there is a value but no weight, the weight is 100;
if there is a weight but no value, then that value is not considered, of course.
e.g.
1st row:
(2*100 + 1*80) / (100 + 80) = 1.56 ≈ 1.6
2nd row:
(1*100+2*80)/(100+80)
+------+---------+---------+---------+----------+----------+----------+-----+
| ID | VALUE_A | VALUE_B | VALUE_C | Weight_A | Weight_B | Weight_C | AVG |
+------+---------+---------+---------+----------+----------+----------+-----+
| 1111 | 2       | 1       | null    | 100      | 80       | 60       | 1.6 |
+------+---------+---------+---------+----------+----------+----------+-----+
| 2222 | 1 | 2 | null | 100 | 80 | 60 | |
+------+---------+---------+---------+----------+----------+----------+-----+
I got this far to get the AVG values without weights
select ID, VALUE_A, VALUE_B, VALUE_C, Weight_A, Weight_B, Weight_C,
(SELECT AVG(Cast(c as decimal(18,1)))
FROM (VALUES(VALUE_A),
(VALUE_B),
(VALUE_C)) T (c)) AS [Average]
FROM table
My second try was selecting the sum of the values multiplied by their weights, to be divided by the sum of the weights. The sum of the weights is still missing; I can't figure out how to add it:
select *,
(SELECT SUM(Cast(c as decimal(18,1)))
FROM (VALUES(VALUE_A* ISNULL(Weight_A,100)),
(VALUE_B* ISNULL(Weight_B,100)),
(VALUE_C* ISNULL(Weight_C,100))
) T (c)) AS [Average]
FROM table
Is this what you are looking for?
SELECT SUM(val * COALESCE(w, 100)) / SUM(COALESCE(w, 100)) as weighted_average, -- default missing weights to 100 in the denominator too
       SUM(val * COALESCE(w, 100)) as weighted_sum
FROM table t CROSS APPLY
     (VALUES (t.VALUE_A, t.Weight_A),
             (t.VALUE_B, t.Weight_B),
             (t.VALUE_C, t.Weight_C)
     ) a(val, w)
WHERE a.val IS NOT NULL;
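Note the query above aggregates over the whole table. To get one average per row as in the question's sample output, group by ID; a sketch under the same assumptions, with a 1.0 factor to avoid integer division:
SELECT t.ID,
       SUM(a.val * COALESCE(a.w, 100)) * 1.0
         / SUM(COALESCE(a.w, 100)) AS weighted_average
FROM table t CROSS APPLY
     (VALUES (t.VALUE_A, t.Weight_A),
             (t.VALUE_B, t.Weight_B),
             (t.VALUE_C, t.Weight_C)
     ) a(val, w)
WHERE a.val IS NOT NULL
GROUP BY t.ID;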
This is how Average could be calculated:
SELECT *
  ,CASE
     WHEN (W.Weight_A + W.Weight_B + W.Weight_C) = 0
     THEN 0
     ELSE ((ISNULL(VALUE_A, 0) * W.Weight_A)
         + (ISNULL(VALUE_B, 0) * W.Weight_B)
         + (ISNULL(VALUE_C, 0) * W.Weight_C))
        / (W.Weight_A + W.Weight_B + W.Weight_C)
   END Average
FROM TABLE t
CROSS APPLY (SELECT CASE WHEN VALUE_A IS NULL THEN 0 ELSE ISNULL(Weight_A, 100) END [Weight_A]
                   ,CASE WHEN VALUE_B IS NULL THEN 0 ELSE ISNULL(Weight_B, 100) END [Weight_B]
                   ,CASE WHEN VALUE_C IS NULL THEN 0 ELSE ISNULL(Weight_C, 100) END [Weight_C]) W

SQL: Arrange numbers on a scale 1 to 10

I have a table in a SQL Server 2008 database with a number column that I want to arrange on a scale 1 to 10.
Here is an example where the column (Scale) is what I want to accomplish with SQL
Name Count (Scale)
----------------------
A 19 2
B 1 1
C 25 3
D 100 10
E 29 3
F 60 7
In my example above the min and max counts are 1 and 100 (this could differ from day to day).
I want to get the number each record belongs to:
1 = 0-9
2 = 10-19
3 = 20-29 and so on...
It has to be dynamic because this data changes every day, so I cannot use static ranges like WHEN Count BETWEEN 0 AND 10...
Try this, though note that technically the value 100 doesn't fall in the range 90-99 and therefore should arguably be classed as 11, which is why the value 60 comes out with a scale of 6 rather than your 7:
create table #scale
(
Name Varchar(10),
[Count] INT
)
INSERT INTO #scale
VALUES
('A', 19),
('B', 1),
('C', 25),
('D', 100),
('E', 29),
('F', 60)
SELECT name, [COUNT],
CEILING([COUNT] * 10.0 / (SELECT MAX([Count]) - MIN([Count]) + 1 FROM #Scale)) AS [Scale]
FROM #scale
Results:
| NAME | COUNT | SCALE |
|------|-------|-------|
| A | 19 | 2 |
| B | 1 | 1 |
| C | 25 | 3 |
| D | 100 | 10 |
| E | 29 | 3 |
| F | 60 | 6 |
This version gets you your answer where 60 becomes 7, and hence 100 becomes 11:
SELECT name, [COUNT],
CEILING([COUNT] * 10.0 / (SELECT MAX([Count]) - MIN([Count]) FROM #Scale)) AS [Scale]
FROM #scale
Alternatively, scale linearly between the min and max:
WITH MinMax(Min, Max) AS (SELECT MIN(Count), MAX(Count) FROM Table1)
SELECT Name, Count, 1 + 9 * (Count - Min) / (Max - Min) AS Scale
FROM Table1, MinMax
You can make the Scale column a PERSISTED computed column:
alter table test drop column Scale
ALTER TABLE test ADD
Scale AS (case when Count between 0 and 9 then 1
when Count between 10 and 19 then 2
when Count between 20 and 29 then 3
when Count between 30 and 39 then 4
when Count between 40 and 49 then 5
when Count between 50 and 59 then 6
when Count between 60 and 69 then 7
when Count between 70 and 79 then 8
when Count between 80 and 89 then 9
when Count between 90 and 100 then 10
end
)PERSISTED
GO
Or use NTILE, which distributes the rows into ten equal-sized buckets by rank rather than by value range:
select ntile(10) over (order by [count])
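A runnable sketch against the #scale table from the answer above (an assumption; the original snippet gave only the bare expression):
SELECT Name, [Count],
       NTILE(10) OVER (ORDER BY [Count]) AS Scale
FROM #scale
-- With only 6 rows, NTILE(10) hands out buckets 1..6 in rank order:
-- B=1, A=2, C=3, E=4, F=5, D=6 (rank-based, not the value-based buckets requested)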

Oracle sql split up rows according to max value in column

I have a select result like this:
ID | Amount
---------------
xx1 | 105
xx2 | 70
I would like to split up the row into multiple rows if the amount is greater than 50 resulting in:
ID | Amount
---------------
xx1 | 50
xx1 | 50
xx1 | 5
xx2 | 50
xx2 | 20
A recursive solution:
WITH t(id, amount) AS
(SELECT id, amount
FROM mytable
UNION ALL
SELECT id, amount - 50
FROM t
WHERE amount - 50 > 0)
SELECT id
,least(amount, 50) amount
FROM t
ORDER BY id
,least(amount, 50) DESC
Following Frank Schmitt's comment, here is a MODEL solution which should work in Oracle 10g:
SELECT id
, least(amt, 50) amount
FROM
(SELECT id
, amt
FROM mytable t
MODEL
PARTITION BY (id)
DIMENSION BY (0 d)
MEASURES (amount amt)
RULES ITERATE (1024) UNTIL (amt[ITERATION_NUMBER] < 50)
( amt[ITERATION_NUMBER+1] = amt[ITERATION_NUMBER] - 50 ))
WHERE amt > 0
ORDER BY id
, amt DESC
You should ensure that 1024*50 >= max(amount), or adjust the maximum number of iterations accordingly.

Statistical Mode with postgres

I have a table that has this schema:
create table mytable (creation_date timestamp,
value int,
category int);
I want the most frequent value for each hour per category, only on weekdays. I have made some progress; my query now looks like this:
select category,foo.h as h,value, count(value) from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value;
As a result I got something like this:
category | h | value | count
---------+----------+---------+-------
1 | 00:00:00 | 2 | 1
1 | 01:00:00 | 2 | 1
1 | 02:00:00 | 2 | 6
1 | 03:00:00 | 2 | 31
1 | 03:00:00 | 3 | 11
1 | 04:00:00 | 2 | 21
1 | 04:00:00 | 3 | 9
1 | 13:00:00 | 1 | 14
1 | 14:00:00 | 1 | 10
1 | 14:00:00 | 2 | 7
1 | 15:00:00 | 1 | 52
For example, at 04:00 I have two values, 2 and 3, with counts of 21 and 9 respectively. I only need the value with the highest count, which would be the statistical mode.
BTW I have more than 2M records
This can be simpler:
SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
category
, extract(hour FROM creation_date)::int AS h
, count(*)::int AS max_ct
, value
FROM mytable
WHERE extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP BY 1,2,4
ORDER BY 1,2,3 DESC;
Basically these are the steps:
Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h) with the according category.
I am able to do this in one query level, because DISTINCT is applied after the aggregate function.
The result will hold no rows for any (category, h) without any entries at all. If you need to fill in the blanks, LEFT JOIN to this:
SELECT c.category, h.h
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
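A sketch of how the pieces fit together (assuming cat_tbl holds one row per category, as above; the subquery is the DISTINCT ON query verbatim):
SELECT c.category, h.h, m.value, m.max_ct
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
LEFT JOIN (
   SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
          category
        , extract(hour FROM creation_date)::int AS h
        , count(*)::int AS max_ct
        , value
   FROM mytable
   WHERE extract(isodow FROM creation_date) < 6
   GROUP BY 1, 2, 4
   ORDER BY 1, 2, 3 DESC
   ) m ON m.category = c.category AND m.h = h.h
ORDER BY c.category, h.h;
Hours with no data show NULL for value and max_ct.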
Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.
Assuming you called the temporary table "summary_table", the following query should do it.
select
  category, h, value, count
from
  summary_table s1
where
  not exists
  (select * from summary_table s2
   where s1.category = s2.category and
         s1.h = s2.h and
         (s1.count < s2.count
          OR (s1.count = s2.count and s1.value > s2.value)));
If you don't want to create a table, you could use a WITH clause to attach your query to this one.
with summary_table as (
  select category, foo.h as h, value, count(value) as count from mytable, (
    select date_trunc('hour',
      '2000-01-01 00:00:00'::timestamp + generate_series(0,23) * '1 hour'::interval)::time as h) AS foo
  where date_part('hour', creation_date) = date_part('hour', foo.h) and
        date_part('dow', creation_date) > 0 and date_part('dow', creation_date) < 6
  group by category, h, value)
select
  category, h, value, count
from
  summary_table s1
where
  not exists
  (select * from summary_table s2
   where s1.category = s2.category and
         s1.h = s2.h and
         (s1.count < s2.count
          OR (s1.count = s2.count and s1.value > s2.value)));