Multiple cumulative sums [duplicate] - sql

Hopefully the title is explicit enough.
I have a table that looks like this:
classes id value
a 1 10
a 2 15
a 3 12
b 1 5
b 2 9
b 3 7
c 1 6
c 2 14
c 3 6
and here is what I would like:
classes id value cumsum
a 1 10 10
a 2 15 25
a 3 12 37
b 1 5 5
b 2 9 14
b 3 7 21
c 1 6 6
c 2 14 20
c 3 6 26
I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes:
id value cumsum
1 10 10
2 15 25
3 12 37
It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.
However, when I apply the exact same code to the dataset I'm working on now (which looks like the first table in this question, i.e. with multiple classes) without subsetting it by a, b, c, it takes ages: it has been running for 4 hours now, and the dataset has only 40,000 rows.
Is there an issue with the code from the linked answer when it is used in this context? I have trouble wrapping my head around the triangular join, but I suspect the size of the join grows quickly as the number of rows increases, slowing the whole thing down a lot, and that having multiple "classes" on which to do the cumulative sums makes it even worse.
Is there any way this could be done faster? I'm using SQL in R through the sqldf package. A solution in either R code (with or without a common external package) or SQL code will do.
Thanks

In SQL, you can do a cumulative sum using the ANSI standard sum() over () functionality:
select classes, id, value,
sum(value) over (partition by classes order by id) as cumesum
from t;
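As a rough way to try the windowed query outside R, here is a minimal sketch that runs it against an in-memory SQLite database from Python. It assumes the bundled SQLite is version 3.25 or later (the first release with window functions); the table name t comes from the answer, while the Python variable names are only illustrative.

```python
import sqlite3

# Sample rows copied from the question: (classes, id, value).
rows = [
    ("a", 1, 10), ("a", 2, 15), ("a", 3, 12),
    ("b", 1, 5), ("b", 2, 9), ("b", 3, 7),
    ("c", 1, 6), ("c", 2, 14), ("c", 3, 6),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (classes TEXT, id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# PARTITION BY restarts the running total for every class;
# ORDER BY id defines the order in which values are accumulated.
query = """
    SELECT classes, id, value,
           SUM(value) OVER (PARTITION BY classes ORDER BY id) AS cumesum
    FROM t
    ORDER BY classes, id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because each partition is scanned once, this avoids the row-by-row self-join that makes the triangular-join approach slow down as the table grows.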

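Or you can use by from the base package (this assumes df is already sorted by classes, because unlist concatenates the per-class results one class after another):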
df$cumsum <- unlist(by(df$value, df$classes, cumsum))
# classes id value cumsum
#1 a 1 10 10
#2 a 2 15 25
#3 a 3 12 37
#4 b 1 5 5
#5 b 2 9 14
#6 b 3 7 21
#7 c 1 6 6
#8 c 2 14 20
#9 c 3 6 26

Related

SQLite: How to create a combination of unrelated elements of two queries?

I have one table that I need to get some metrics from.
For example I have the following table:
meas_count skippings links extra
10 8 4.2 some
10 9 5.8 some
10 9 5.8 some_2
11 8 4.2 some
11 8 5.8 some
11 9 5.9 some
I need to get a view of an existing table in the following form for further work:
meas_count skippings links_min links_max
10 8 0 4
10 8 4 5
10 8 5 6
10 9 0 4
10 9 4 5
10 9 5 6
11 8 0 4
11 8 4 5
11 8 5 6
11 9 0 4
11 9 4 5
11 9 5 6
At the moment I have 2 queries, the results of which I need to combine to get the result I need.
First request:
SELECT meas_count,skippings FROM current_stats GROUP BY meas_count,skippings
Creates the following:
meas_count skippings
10 8
10 9
11 8
11 9
Second request:
SELECT
LAG(rounded) OVER (ORDER BY rounded) as links_min,
rounded as links_max FROM
(SELECT * FROM
(SELECT ROUND(links, 1) as rounded FROM current_stats)
GROUP BY rounded ORDER BY rounded)
Creates the following:
links_min links_max
NULL 4
4 5
5 6
I need something like the Cartesian product of these two result sets.
What query should be executed to get the table of the view I need as a result?
I also have an additional question: is the execution of the second query slowed down due to several SELECTs inside?
You can do that with an INNER JOIN of the two subqueries without specifying a join condition; in SQLite this behaves like a CROSS JOIN. That will give you every combination of the two sets of rows.
SELECT * FROM
  (SELECT meas_count, skippings
   FROM current_stats
   GROUP BY meas_count, skippings) AS one
INNER JOIN
  (SELECT LAG(rounded) OVER (ORDER BY rounded) AS links_min,
          rounded AS links_max
   FROM (SELECT * FROM
           (SELECT ROUND(links, 1) AS rounded FROM current_stats)
         GROUP BY rounded
         ORDER BY rounded)
  ) AS two;
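To see the combination step in isolation, here is a small sketch that loads the sample rows from the question into an in-memory SQLite database and runs an equivalent query. Two things in it are assumptions rather than part of the answer: SELECT DISTINCT replaces the GROUP BY subqueries, and COALESCE(..., 0) turns the first NULL lower bound into 0 as in the desired view. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# Sample rows from the question: (meas_count, skippings, links, extra).
stats = [
    (10, 8, 4.2, "some"), (10, 9, 5.8, "some"), (10, 9, 5.8, "some_2"),
    (11, 8, 4.2, "some"), (11, 8, 5.8, "some"), (11, 9, 5.9, "some"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE current_stats "
    "(meas_count INTEGER, skippings INTEGER, links REAL, extra TEXT)"
)
conn.executemany("INSERT INTO current_stats VALUES (?, ?, ?, ?)", stats)

# CROSS JOIN pairs every (meas_count, skippings) row with every
# (links_min, links_max) interval; no join condition is needed.
query = """
    SELECT one.meas_count, one.skippings,
           COALESCE(two.links_min, 0) AS links_min,
           two.links_max
    FROM (SELECT DISTINCT meas_count, skippings FROM current_stats) AS one
    CROSS JOIN (
        SELECT LAG(rounded) OVER (ORDER BY rounded) AS links_min,
               rounded AS links_max
        FROM (SELECT DISTINCT ROUND(links, 1) AS rounded FROM current_stats)
    ) AS two
    ORDER BY one.meas_count, one.skippings, two.links_max
"""
combos = conn.execute(query).fetchall()
for row in combos:
    print(row)
```

With four distinct (meas_count, skippings) pairs and three distinct rounded values, the result has twelve rows, one per combination.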
As for performance, that's really only an issue if there is a better way to do it. Of course nested SELECTs take time, but the query optimizers in today's SQL engines are pretty good at determining what you MEANT from what you SAID.

Group rows using the cumulative sum of a third column

I have a table with two columns:
sort_column = A column I use for sorting
value_column = My metric of interest (a positive integer)
Using SQL, I need to create contiguous groups of rows, ordered by sort_column, such that the sum of value_column within each group is the largest possible but staying below 100 (100 not included).
Find below an example of my desired result.
Thanks
sort_column value_column desired_result
1 53 1
2 25 1
3 33 2
4 25 2
5 10 2
6 46 3
7 9 3
8 49 4
9 48 4
10 53 5
11 33 5
12 52 6
13 29 6
14 16 6
15 66 7
16 1 7
17 62 8
18 57 9
19 47 10
20 12 10
OK, so after a few lengthy attempts, I came to the conclusion that the task is impossible with pure SQL. A given value of the desired column depends on previous values of that same column, in a way that cannot be derived from the first two columns alone, so the problem cannot be tackled without a recursive CTE, which BigQuery did not support at the time of writing.
I solved the issue by writing a javascript UDF for the task. It seems to be working fine and produces the expected results.
Many thanks everyone!
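The UDF itself is not shown in the answer, so the following is only a sketch of the kind of sequential logic such a function would need, written in Python rather than JavaScript. The function name assign_groups and its limit parameter are hypothetical; the greedy rule (start a new group as soon as adding the next value would reach the limit) is an assumption that happens to reproduce the desired_result column from the question.

```python
def assign_groups(values, limit=100):
    """Label consecutive values so that each group's sum stays below limit."""
    labels = []
    group_id = 1
    running = 0
    for value in values:
        # Open a new group when the current one is non-empty and adding
        # this value would reach or exceed the limit (limit is excluded).
        if running > 0 and running + value >= limit:
            group_id += 1
            running = 0
        running += value
        labels.append(group_id)
    return labels


# value_column from the question, already ordered by sort_column.
values = [53, 25, 33, 25, 10, 46, 9, 49, 48, 53,
          33, 52, 29, 16, 66, 1, 62, 57, 47, 12]
labels = assign_groups(values)
print(labels)
```

The group label of each row depends on where the previous group ended, which is exactly the dependency on earlier output values that a single window function cannot express.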

Create new ID based on cumulative sum in excel vba

I need to create a new transport ID based on the cumulative sum of the volume being transported. Let's say that originally everything was transported in truck A with a capacity of 25. Now I want to assign these items to shipments with truck B (capacity 15).
The only real constraint is that the amount shipped cannot exceed capacity.
I can't post a picture because of the restrictions, but the overall setup would be like this:
Old Trans # Volume New Trans # Cumulative Volume for Trans
1 1
1 9
1 3
1 7
1 4
2 9
2 10
3 8
3 5
3 9
4 4
4 6
4 8
5 9
5 1
5 5
5 8
6 3
6 4
6 3
6 4
6 4
6 7
7 7
7 10
7 4
8 10
8 6
8 7
9 4
9 9
9 6
10 7
10 4
10 1
10 1
10 5
10 2
11 9
11 3
11 9
12 8
12 5
12 9
13 9
The expected output is that the first three entries get a new shipment ID of 1, the next two entries get a new shipment ID of 2, and so on. I've tried everything I know (excluding VBA): INDEX/LOOKUP/IF functions. My VBA skills are very limited, though. Any tips? Thanks!
I think I see what you're trying to do here, and you can do it with just an IF formula (and a new column to keep track of the running volume):
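In columns C and D, insert these formulas in row 3 and copy down (changing 15 to whatever your new volume capacity is; <= is used so that a load exactly equal to the capacity still fits, since the only constraint is that capacity must not be exceeded):
Column C: =IF(B3+C2<=15,B3+C2,B3)
Column D: =IF(B3+C2<=15,D2,D2+1)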
And for the cells C2 and D2:
C2: = B2
D2: = A2
Is this what you're looking to do?
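For readers who want to check the running-total logic outside Excel, here is a short sketch of the same idea in Python. The function name new_shipment_ids is hypothetical, and it follows the question's rule that a shipment may not exceed the capacity, so a new ID starts only when the running volume would go above 15.

```python
CAPACITY = 15  # capacity of truck B, taken from the question


def new_shipment_ids(volumes, capacity=CAPACITY):
    """Return (volume, new_id, cumulative_volume) for each row, in order."""
    rows = []
    cumulative = 0
    shipment_id = 1
    for index, volume in enumerate(volumes):
        if index > 0 and cumulative + volume > capacity:
            # Does not fit: start a new shipment with this volume.
            shipment_id += 1
            cumulative = volume
        else:
            # Fits: keep the current shipment and extend its total.
            cumulative += volume
        rows.append((volume, shipment_id, cumulative))
    return rows


# First seven volumes from the question's table.
volumes = [1, 9, 3, 7, 4, 9, 10]
assigned = new_shipment_ids(volumes)
for row in assigned:
    print(row)
```

The second and third tuple elements correspond to the shipment ID column and the running-volume helper column, respectively.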
A simple formula could be written that 'floats' the range totals for each successive load ID.
In the following, I've typed 25 and 15 in D1:E1 and used a custom number format of I\D 0. In this way, the column is identified and the cell can be referenced as a true number load limit. You can hard-code the limits into the formula if you prefer by overwriting D$1, but you will not have a one-size-fits-all formula that can be copied right for alternate load limits as I have in my example.
The formula in D2 is,
=IF(ROW()=2, 1, (SUM(INDEX($B:$B, MATCH(D1, D1:D$1, 0)):$B2)>D$1)+ D1)
Fill right to E2 then down as necessary.

Excel VBA function to solve impossible round-robin tournament roster with venue constraint

I am really having difficulty generating a round-robin tournament roster with the following conditions:
10 Teams (Teams 1 - 10)
5 Fields (Field A - E)
9 Rounds (Round 1 - 9)
Each team must play every other team exactly once.
Only two teams can play on a field at any one time. (i.e. all 5 fields always in use)
No team is allowed to play on any particular field more than twice. <- This is the problem!
I have been trying on and off for many years to solve this problem on paper without success. So once and for all, I would like to generate a function in Excel VBA to test every combination to prove it is impossible.
I started creating a very messy piece of code that generates an array using nested if/while loops, but I can already see it's just not going to work.
Is there anyone out there with a juicy piece of code that can solve this?
Edit: Thanks to Brian Camire's method below, I've been able to include further desirable constraints and still get a solution:
No team plays the same field twice in a row
A team should play on all the fields once before repeating
The solution is below. I should have asked years ago! Thanks again Brian - you are a genius!
Round 1 2 3 4 5 6 7 8 9
Field A 5v10 1v9 2v4 6v8 3v7 4v10 3v9 7v8 1v2
Field B 1v7 8v10 3v6 2v9 4v5 6v7 1v8 9v10 3v5
Field C 2v6 3v4 1v10 5v7 8v9 1v3 2v5 4v6 7v10
Field D 4v9 2v7 5v8 3v10 1v6 2v8 4v7 1v5 6v9
Field E 3v8 5v6 7v9 1v4 2v10 5v9 6v10 2v3 4v8
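The three original constraints can be checked mechanically. The sketch below re-enters the roster from the table above and tests it; the function name check_roster and the dictionary layout are illustrative choices, and the two extra constraints from the edit (no field twice in a row, all fields once before repeating) are not checked here.

```python
from collections import Counter
from itertools import combinations

# Roster from the table above: one string per field, one match per round.
ROSTER = {
    "A": "5v10 1v9 2v4 6v8 3v7 4v10 3v9 7v8 1v2",
    "B": "1v7 8v10 3v6 2v9 4v5 6v7 1v8 9v10 3v5",
    "C": "2v6 3v4 1v10 5v7 8v9 1v3 2v5 4v6 7v10",
    "D": "4v9 2v7 5v8 3v10 1v6 2v8 4v7 1v5 6v9",
    "E": "3v8 5v6 7v9 1v4 2v10 5v9 6v10 2v3 4v8",
}


def check_roster(roster, teams=10, max_per_field=2):
    """Return three booleans, one per constraint from the question."""
    pair_counts = Counter()    # how often each pair of teams meets
    field_counts = Counter()   # how often each team plays on each field
    round_teams = {}           # teams appearing in each round
    for field, line in roster.items():
        for rnd, match in enumerate(line.split(), start=1):
            a, b = sorted(int(t) for t in match.split("v"))
            pair_counts[(a, b)] += 1
            field_counts[(a, field)] += 1
            field_counts[(b, field)] += 1
            round_teams.setdefault(rnd, []).extend([a, b])

    all_pairs = set(combinations(range(1, teams + 1), 2))
    every_pair_once = (set(pair_counts) == all_pairs
                       and set(pair_counts.values()) == {1})
    everyone_plays_each_round = all(
        sorted(ts) == list(range(1, teams + 1)) for ts in round_teams.values()
    )
    field_limit_ok = max(field_counts.values()) <= max_per_field
    return every_pair_once, everyone_plays_each_round, field_limit_ok


print(check_roster(ROSTER))
```

All three checks pass for this roster, so the original problem is solvable after all.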
I think I've found at least one solution to the problem:
Round Field Team 1 Team 2
1 A 3 10
1 B 7 8
1 C 1 9
1 D 2 4
1 E 5 6
2 A 8 10
2 B 1 5
2 C 2 6
2 D 3 7
2 E 4 9
3 A 1 4
3 B 2 3
3 C 8 9
3 D 5 7
3 E 6 10
4 A 6 7
4 B 4 10
4 C 2 8
4 D 5 9
4 E 1 3
5 A 2 9
5 B 3 8
5 C 4 7
5 D 1 6
5 E 5 10
6 A 3 9
6 B 4 5
6 C 7 10
6 D 6 8
6 E 1 2
7 A 5 8
7 B 6 9
7 C 1 10
7 D 3 4
7 E 2 7
8 A 4 6
8 B 2 10
8 C 3 5
8 D 1 8
8 E 7 9
9 A 2 5
9 B 1 7
9 C 3 6
9 D 9 10
9 E 4 8
I found it using the OpenSolver add-in for Excel (as the problem was too large for the built-in Solver feature). The steps were something like this:
1. Set up a table with 2025 rows representing the possible matches -- that is, possible combinations of round, field, and pair of teams (with columns like the table above), plus one extra column that will be a binary (0 or 1) decision variable indicating if the match is to be selected.
2. Set up formulas to use the decision variables to calculate: a) the number of matches at each field in each round, b) the number of matches between each pair of teams, c) the number of matches played by each team in each round, and d) the number of matches played by each team at each field.
3. Set up a formula to use the decision variables to calculate the total number of matches.
4. Use OpenSolver to solve a model whose objective is to maximize the result of the formula from Step 3 by changing the decision variables from Step 1, subject to the constraints that the decision variables must be binary, the results of the formulas from Steps 2.a) through c) must equal 1, and the results of the formulas from Step 2.d) must be less than or equal to 2.
The details are as follows...
For Step 1, I set up my table so that columns A, B, C, and D represented the Round, Field, Team 1, and Team 2, respectively, and column E represented the decision variable. Row 1 contained the column headings, and rows 2 through 2026 each represented one possible match.
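The 2025 figure is 9 rounds times 5 fields times 45 unordered pairs of teams. As a quick sanity check, the candidate rows for Step 1 can be enumerated as in the sketch below; the variable names are illustrative, and the output could be pasted into columns A through D of the worksheet.

```python
from itertools import combinations, product

rounds = range(1, 10)                         # rounds 1..9
fields = "ABCDE"                              # fields A..E
pairs = list(combinations(range(1, 11), 2))   # 45 unordered pairs of teams

# One candidate row per (round, field, team 1, team 2) combination.
candidates = [
    (rnd, field, team1, team2)
    for rnd, field, (team1, team2) in product(rounds, fields, pairs)
]

print(len(candidates))   # number of candidate matches
print(candidates[0])     # first candidate row
```

Each candidate row then gets its own binary decision variable in column E.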
For Step 2.a), I set up a vertical list of rounds 1 through 9 in cells I2 through I10, a horizontal list of fields A through E in cells J1 through N1, and a series of formulas to calculate the number of matches in each field in each round in cells J2 through N10 by starting with =SUMIFS($E$2:$E$2026,$A$2:$A$2026,$I2,$B$2:$B$2026,J$1) in cell J2 and then copying and pasting.
For Step 2.b), I set up a vertical list of teams 1 through 9 in cells I13 through I21, a horizontal list of opposing teams 2 through 10 in cells J12 through R12, and a series of formulas to calculate the number of matches between each pair of teams in the "upper right triangular half" of cells J13 through R21 (including the diagonal) by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I13,$D$2:$D$2026,J$12) in cell J13 and then copying and pasting.
For Step 2.c), I set up a vertical list of teams 1 through 10 in cells I24 through I33, a horizontal list of rounds 1 through 9 in cells J23 through R23, and a series of formulas to calculate the number of matches played by each team in each round in cells J24 through R33 by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I24,$A$2:$A$2026,J$23)+SUMIFS($E$2:$E$2026,$D$2:$D$2026,$I24,$A$2:$A$2026,J$23) in cell J24 and then copying and pasting.
For Step 2.d), I set up a vertical list of teams 1 through 10 in cells I36 through I45, a horizontal list of fields A through E in cells J35 through N35, and a series of formulas to calculate the number of matches played by each team at each field in cells J36 through N45 by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I36,$B$2:$B$2026,J$35)+SUMIFS($E$2:$E$2026,$D$2:$D$2026,$I36,$B$2:$B$2026,J$35) in cell J36 and then copying and pasting.
For Step 3, I set up a formula to calculate the total number of matches in cell G2 as =SUM($E$2:$E$2026).
For Step 4, in the OpenSolver Model dialog (available from Data, OpenSolver, Model) I set the Objective Cell to $G$2, the Variable Cells to $E$2:$E$2026, and added constraints as described above and detailed below (sorry that the constraints are not listed in the order that I described them):
Note that, for the constraints described in Step 2.b), I needed to add the constraints separately for each row, since OpenSolver raised an error message if the constraints included the blank cells in the "lower left triangular half".
After setting up the model, OpenSolver highlighted the objective, variable, and constraint cells as shown below:
I then solved the problem using OpenSolver (via Data, OpenSolver, Solve). The selected matches are the ones with a 1 in column E. You might get a different solution than I did, as there might be many feasible ones.
come on ... that's an easy one for manual solution ;-)
T1 T2 VE
1 2 A
1 3 A
1 4 B
1 5 B
1 6 C
1 7 C
1 8 D
1 9 D
1 10 E
2 3 A
2 4 B
2 5 B
2 6 C
2 7 C
2 8 D
2 9 D
2 10 E
3 4 C
3 5 C
3 6 D
3 7 D
3 8 E
3 9 E
3 10 B
4 5 C
4 6 D
4 7 D
4 8 E
4 9 E
4 10 A
5 6 E
5 7 E
5 8 A
5 9 A
5 10 D
6 7 E
6 8 A
6 9 A
6 10 B
7 8 B
7 9 B
7 10 A
8 9 B
8 10 C
9 10 C
As far as I have checked, no team plays more than twice on the same venue. Please double check.
Dividing it into rounds should be an easy one.
Edit: this time with only 5 venues :-)
Edit 2: now also with allocated rounds :-)
Edit 3: deleted the round allocation again because it was wrong.

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category __ Average
A ________ a
B ________ b
C ________ c
etc...
I now want to take these averages and add them as another column to my original data.
So, I want it to look something like this:
Categories _____ Averages
B _____________ b
A______________a
B______________b
C______________c
B______________b
C______________c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
> data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
$A
[1] 2
$B
[1] 5.5
$C
[1] 8.5
> data$Averages <- avg[data$Category]
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
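Note that this stores Averages as a list column, because lapply returns a list; base R's ave(data$Value, data$Category) returns the per-category means as a plain numeric vector in one step. For larger datasets, packages such as plyr or data.table can do this more efficiently.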
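The underlying pattern is language-independent: compute one mean per category, then look that mean up for every original row. A minimal sketch of the same split, mean, and lookup steps in Python, using the toy data from the answer (the variable names are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Same toy data as the R example: three A rows, four B rows, two C rows.
categories = ["A"] * 3 + ["B"] * 4 + ["C"] * 2
values = list(range(1, 10))

# Split: collect the values belonging to each category.
by_category = defaultdict(list)
for category, value in zip(categories, values):
    by_category[category].append(value)

# Mean: one average per category (the small aggregated table).
averages = {category: mean(vals) for category, vals in by_category.items()}

# Lookup: one average per original row, in the original row order.
row_averages = [averages[category] for category in categories]
print(row_averages)
```

The lookup step is what makes the size difference between the aggregated table and the original data a non-issue: the small table is indexed once per row rather than joined by position.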