Group rows using the cumulative sum of a third column - sql

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is as large as possible while staying strictly below 100.
Find below an example of my desired result.
Thanks
sort_column value_column desired_result
1 53 1
2 25 1
3 33 2
4 25 2
5 10 2
6 46 3
7 9 3
8 49 4
9 48 4
10 53 5
11 33 5
12 52 6
13 29 6
14 16 6
15 66 7
16 1 7
17 62 8
18 57 9
19 47 10
20 12 10

OK, so after a few lengthy attempts, I came to the conclusion that the task cannot be done with plain (non-recursive) SQL. A given value of the desired column depends on previous values of that same column, in a way that cannot be derived from the first two columns alone. The problem therefore calls for a recursive CTE, which BigQuery did not support when this was written.
I solved the issue by writing a JavaScript UDF for the task. It seems to be working fine and produces the expected results.
Many thanks everyone!
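For reference, the greedy rule described in the question can be sketched procedurally. This is a minimal Python sketch, not the JavaScript UDF mentioned above (which is not shown); the name assign_groups and its limit parameter are hypothetical, and it assumes that a single value of 100 or more would simply occupy a group on its own.

```python
def assign_groups(values, limit=100):
    """Label contiguous groups so that each group's sum stays below `limit`.

    `values` must already be ordered by sort_column.
    Assumption: a single value >= limit gets a group of its own.
    """
    groups = []
    group_id = 1
    running_total = 0
    for value in values:
        # Open a new group when adding this value would reach the limit.
        if running_total > 0 and running_total + value >= limit:
            group_id += 1
            running_total = 0
        running_total += value
        groups.append(group_id)
    return groups


# value_column from the example table, ordered by sort_column
values = [53, 25, 33, 25, 10, 46, 9, 49, 48, 53,
          33, 52, 29, 16, 66, 1, 62, 57, 47, 12]
groups = assign_groups(values)
print(groups)
```

Each label depends on where the previous group ended, which is why a plain window function over the two input columns is not enough.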

Related

How to Create a CDF out of a PDF in SQL

So I have a data table that looks something like the following. ID represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id bin percent
2 8 0.20030698388
2 16 0.14504988488
2 24 0.12356101304
2 32 0.09976976208
2 40 0.09056024558
2 48 0.07137375287
2 56 0.04067536454
2 64 0.03914044512
2 72 0.02916346891
2 80 0.16039907904
3 8 0.36316695352
3 16 0.03958691910
3 24 0.11876075731
3 32 0.13253012048
3 40 0.03098106712
3 48 0.07228915662
3 56 0.07745266781
3 64 0.02581755593
3 72 0.02065404475
3 80 0.11876075731
I am looking for a function to turn this dataset into a CDF, partitioned by id. I have tried cume_dist and percent_rank, but they do not appear to work.
I am facing a similar problem and found this great tutorial for doing exactly that:
https://dwaincsql.com/2015/05/14/excel-in-t-sql-part-2-the-normal-distribution-norm-dist-density-functions/
It tries to rebuild the Excel function NORM.DIST, which gives you the PDF if you set the cumulative flag to FALSE and the CDF if you set it to TRUE. I assumed that CUME_DIST would do the exact same thing in SQL. However, it turns out that CUME_DIST builds its distribution by counting rows, whereas Excel uses the relative differences in the values.
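For the binned table in the question, one reading is that the wanted CDF is simply the running total of percent within each id, ordered by bin (roughly what SUM(percent) OVER (PARTITION BY id ORDER BY bin) would compute in SQL). Here is a minimal Python sketch contrasting that with CUME_DIST-style row counting. The function names are hypothetical, only a few rows of the table are used, and bins are assumed to be distinct within an id.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (id, bin, percent): a few rows copied from the question's table
rows = [
    (2, 8, 0.20030698388),
    (2, 16, 0.14504988488),
    (2, 24, 0.12356101304),
    (3, 8, 0.36316695352),
    (3, 16, 0.03958691910),
]


def groups_by_id(rows):
    """Yield (id, rows of that id ordered by bin)."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, list(group)


def running_cdf(rows):
    """(id, bin, cdf) where cdf is the running sum of percent within each id."""
    result = []
    for key, group in groups_by_id(rows):
        totals = accumulate(percent for _, _, percent in group)
        result.extend((key, bin_, total)
                      for (_, bin_, _), total in zip(group, totals))
    return result


def row_share(rows):
    """CUME_DIST-style counting: share of an id's rows with bin <= current bin."""
    result = []
    for key, group in groups_by_id(rows):
        result.extend((key, bin_, position / len(group))
                      for position, (_, bin_, _) in enumerate(group, start=1))
    return result


for cdf_row, count_row in zip(running_cdf(rows), row_share(rows)):
    print(cdf_row, count_row[2])
```

The second function ignores percent entirely, which would explain why cume_dist does not give the expected result here.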

Auctions System Logical Condition

I am trying to make an auction system but cannot figure out the logical conditions for doing so.
Let's say that I have 10 credits:
$credit
I have already bet 5 credits on another auction, so I owe 5 of the 10: $owe
I thus have 5 available: $available = $credit - $owe (= 5)
I bet 3 of the available credits (on a different item).
I now wish to raise that bet to 4 (cancel the 3, update to 4), but the available credit is now $available - 3 (= 2).
I can't find a logical solution, written in code.
What is the condition for setting a bet?
I made up a matrix with the dependence between the variables:
bet available owe lastbet
1 10 10 0
2 9 11 1
3 7 13 2
4 4 16 3
5 0 20 4
6 -5 25 5
7 -11 31 6
8 -18 38 7
9 -26 46 8
10 -35 55 9
11 -45 65 10
I need to translate it into a condition statement (the next row would not meet the condition).
The condition should fail on the 11th row.
Based on the matrix, I found out that the condition is:
if ($bet <= (($owe + $available) / 2)) {}
Not very intuitive......
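A possibly more intuitive way to phrase the check is to give back the previous bet on the same item before comparing, since updating a bet cancels the old one. This is only a sketch in Python, not a translation of the condition above. The function name and parameters are hypothetical, and it assumes that owe counts everything currently committed, including the last bet on this item.

```python
def can_place_bet(credit, owe, last_bet, new_bet):
    """Return True if `new_bet` may replace `last_bet` on the same item.

    Assumption: `owe` is the total currently committed across all auctions,
    including `last_bet` on this item.
    """
    available = credit - owe
    # Cancelling the previous bet on this item frees its credits first.
    return 0 < new_bet <= available + last_bet


# Example from the question: 10 credits, 5 owed elsewhere plus 3 on this item.
print(can_place_bet(credit=10, owe=8, last_bet=3, new_bet=4))
print(can_place_bet(credit=10, owe=8, last_bet=3, new_bet=6))
```

With these numbers the first call is allowed (2 available plus the 3 given back covers a bet of 4) and the second is not.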

Multiple cumulative sums [duplicate]

Hopefully the title is explicit enough.
I have a table that looks like this:
classes id value
a 1 10
a 2 15
a 3 12
b 1 5
b 2 9
b 3 7
c 1 6
c 2 14
c 3 6
and here is what I would like:
classes id value cumsum
a 1 10 10
a 2 15 25
a 3 12 37
b 1 5 5
b 2 9 14
b 3 7 21
c 1 6 6
c 2 14 20
c 3 6 26
I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes :
id value cumsum
1 10 10
2 15 25
3 12 37
It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.
However, when I apply the exact same code to the dataset I'm working on now (which looks like the first table of this question, i.e. with multiple classes), without subsetting it by a, b, c, it takes ages: it has been running for 4 hours now, and the dataset has 40,000 rows.
Is there an issue with the code from the linked answer when it is used in this context? I have trouble wrapping my head around the triangular join, but I suspect the size of the join grows quickly with the number of rows and slows the whole thing down, and that having multiple "classes" on which to compute the cumulative sums may make it worse.
Is there any way this could be done faster? I'm using SQL in R through the sqldf package. A solution in either R code (with or without a common external package) or SQL code will do.
Thanks
In SQL, you can do a cumulative sum using the ANSI standard sum() over () functionality:
select classes, id, value,
sum(value) over (partition by classes order by id) as cumesum
from t;
Or you can use by from the base package:
df$cumsum <- unlist(by(df$value, df$classes, cumsum))
# classes id value cumsum
#1 a 1 10 10
#2 a 2 15 25
#3 a 3 12 37
#4 b 1 5 5
#5 b 2 9 14
#6 b 3 7 21
#7 c 1 6 6
#8 c 2 14 20
#9 c 3 6 26
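To illustrate why the triangular join can get slow, here is a minimal Python sketch that emulates both approaches on the table from the question. The function names are hypothetical. It assumes the engine evaluates the self-join row by row, which depends on indexes and the query plan. Under that assumption the work grows with the square of the row count (about 1.6 billion comparisons for 40,000 rows), while a running total touches each row once.

```python
from itertools import groupby
from operator import itemgetter

# (classes, id, value) rows from the question, sorted by classes, then id
table = [
    ("a", 1, 10), ("a", 2, 15), ("a", 3, 12),
    ("b", 1, 5), ("b", 2, 9), ("b", 3, 7),
    ("c", 1, 6), ("c", 2, 14), ("c", 3, 6),
]


def cumsum_triangular(rows):
    """Emulate the self-join: every row is compared with every row."""
    return [
        sum(v2 for c2, i2, v2 in rows if c2 == c1 and i2 <= i1)
        for c1, i1, _ in rows
    ]


def cumsum_single_pass(rows):
    """Keep one running total per class; assumes rows sorted by class, then id."""
    result = []
    for _, group in groupby(rows, key=itemgetter(0)):
        total = 0
        for _, _, value in group:
            total += value
            result.append(total)
    return result


print(cumsum_triangular(table))
print(cumsum_single_pass(table))
```

Both functions return the cumsum column shown in the question; only the amount of work differs.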

Excluding rows dynamically

Let's assume we have the following:
A
1 10
2 20
3 30
4 20
5 10
6 30
7 20
8
9
10 =AVERAGE(A1:A7)
11 4
12 6
I would like to calculate the average of A1:A7 in cell A10 while excluding the row range defined in A11 and A12. That is, according to the above setup the result should be 20:
((10 + 20 + 30 + 20) / 4) = 20
because if rows 4, 5 and 6 are excluded, rows 1, 2, 3 and 7 are left to be averaged.
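Two other options (note that both exclude only the rows listed in A11:A12, here rows 4 and 6, rather than the whole range between them):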
=AVERAGE(FILTER(A1:A7,ISNA(MATCH(ROW(A1:A7),A11:A12,0))))
=ArrayFormula(AVERAGEIF(MATCH(ROW(A1:A7),A11:A12,0),NA(),A1:A7))
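This seems to meet your requirement, though it is not flexible (it excludes exactly the two rows named in A11 and A12):
=(sum(A1:A7)-indirect("A"&A11)-indirect("A"&A12))/(count(A1:A7)-2)
Adjusted after I misunderstood the requirement, so that the whole range of rows from A11 to A12 is excluded: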
=(SUM(A1:A7)-SUM(INDIRECT("A"&A11&":A"&A12)))/(COUNT(A1:A7)-A12+A11-1)
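As a cross-check of the expected result, the range exclusion can be sketched outside the spreadsheet. This is a minimal Python sketch; the function name is hypothetical, rows are numbered from 1 as in the sheet, and it assumes at least one row remains after the exclusion.

```python
def average_excluding(values, first_row, last_row):
    """Average `values`, skipping 1-based rows first_row..last_row inclusive."""
    kept = [value for row, value in enumerate(values, start=1)
            if not first_row <= row <= last_row]
    return sum(kept) / len(kept)


column_a = [10, 20, 30, 20, 10, 30, 20]  # A1:A7
result = average_excluding(column_a, 4, 6)  # A11 = 4, A12 = 6
print(result)  # 20.0
```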

Need query to find rows that may have 4, 5 or 6 consecutive numbers

Trying to come up with a query that will find rows that contain 4, 5 or 6 consecutive numbers.
For example: Table MyNumbers contains 6 columns of number combinations from 1 to 52.
Column names are: nbr1 nbr2 nbr3 nbr4 nbr5 nbr6
Row one contains: 1 5 43 50 51 52
Row two contains: 41 42 43 44 45 52 <----- five consecutive numbers
Row three contains: 8 14 38 39 42 50
Row four contains: 1 2 3 4 15 29 <----- four consecutive numbers
Row five contains: 8 14 24 36 48 51
Row six contains: 1 2 3 4 5 6 <----- six consecutive numbers
I need a query that finds rows 2, 4 and 6, i.e. the rows that contain 4 or more consecutive numbers.
I created a database that contains all possible combinations of 6 numbers out of 52 (1 to 52). What I would like to do is eliminate rows that have four or more consecutive numbers, so I am not sure the above would do the trick. For those who asked, I am using SQL Server 2008 R2.
Assuming the numbers in each row are always increasing and never repeat:
select *
from mynumbers
where nbr4 - nbr1 = 3
or nbr5 - nbr2 = 3
or nbr6 - nbr3 = 3
I took the liberty of simplifying: any series of 5 or 6 consecutive numbers already contains a series of 4 consecutive numbers, so checking for runs of 4 is enough.
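The same test can be checked against the six example rows from the question. This is a minimal Python sketch of the condition in the query above; the function name is hypothetical, and it relies on the same assumption that each row is strictly increasing.

```python
def has_run_of_four(row):
    """True if a strictly increasing 6-number row contains 4 consecutive numbers.

    Mirrors the query: nbr4 - nbr1 = 3 or nbr5 - nbr2 = 3 or nbr6 - nbr3 = 3.
    """
    return any(row[i + 3] - row[i] == 3 for i in range(3))


# Rows one to six from the question
my_numbers = [
    (1, 5, 43, 50, 51, 52),
    (41, 42, 43, 44, 45, 52),
    (8, 14, 38, 39, 42, 50),
    (1, 2, 3, 4, 15, 29),
    (8, 14, 24, 36, 48, 51),
    (1, 2, 3, 4, 5, 6),
]

matching = [n for n, row in enumerate(my_numbers, start=1) if has_run_of_four(row)]
kept = [n for n, row in enumerate(my_numbers, start=1) if not has_run_of_four(row)]
print(matching)  # [2, 4, 6]
print(kept)      # [1, 3, 5]
```

To eliminate such rows instead of finding them, negate the condition, as the second list does.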