SQL Max Consecutive Values in a number set using recursion - sql

The following SQL query is supposed to return the max consecutive numbers in a set.
WITH RECURSIVE Mystery(X,Y) AS (SELECT A AS X, A AS Y FROM R)
UNION (SELECT m1.X, m2.Y
FROM Mystery m1, Mystery m2
WHERE m2.X = m1.Y + 1)
SELECT MAX(Y-X) + 1 FROM Mystery;
This query on the set {7, 9, 10, 14, 15, 16, 18} returns 3, because {14 15 16} is the longest chain of consecutive numbers and there are three numbers in that chain. But when I try to work through this manually I don't see how it arrives at that result.
For example, given the number set above I could create two columns:
m1.x  m2.y
7     7
9     9
10    10
14    14
15    15
16    16
18    18
If we are working on rows and columns, not the actual data, as I understand it WHERE m2.X = m1.Y + 1 takes the value from the next row in Y and puts it in the current row of X, like so
m1.X  m2.Y
9     7
10    9
14    10
15    14
16    15
18    16
18    Null?
The main part on which I am uncertain is where in the SQL recursion actually happens. According to Denis Lukichev recursion is the R part - or in this case the RECURSIVE Mystery(X,Y) - and stops when the table is empty. But if the above is true, how would the table ever empty?
Since I don't know how to proceed with the above, let me try a different direction. If WHERE m2.X = m1.Y + 1 is actually a comparison, the result should be:
m1.X  m2.Y
14    14
15    15
16    16
But at this point, it seems that it should continue recursively on this until only two rows are left (nothing else to compare). If it stops here to get the correct count of 3 rows (2 + 1), what is actually stopping the recursion?
I understand that for the above example the MAX(Y-X) + 1 effectively returns the actual number of recursion steps and adds 1.
But if I have 7 consecutive numbers and the recursion flows down to 2 rows, should this not end up with an incorrect 3 as the result? I understand recursion in C++ and other languages, but this is confusing to me.
Full disclosure, yes it appears this is a common university question, but I am retired, discovered this while researching recursion for my use, and need to understand how it works to use similar recursion in my projects.

Based on this db<>fiddle shared previously, you may find it instructive to alter the CTE to include an iteration number as follows, and then to show the content of the CTE rather than the output of final SELECT. Here's an amended CTE and its content after the recursion is complete:
Amended CTE
WITH RECURSIVE Mystery(X,Y,Z) AS ((SELECT A AS X, A AS Y, 1 AS Z FROM R)
UNION (SELECT m1.X, m2.A, Z+1
FROM Mystery m1
JOIN R m2 ON m2.A = m1.Y + 1))
SELECT * FROM Mystery;
CTE Content
x   y   z
7   7   1
9   9   1
10  10  1
14  14  1
15  15  1
16  16  1
18  18  1
9   10  2
14  15  2
15  16  2
14  16  3
The Z field holds the iteration count. Where Z = 1 we've simply got the rows from the table R. The values X and Y are both from the field A. In terms of what we are attempting to achieve, these represent sequences of consecutive numbers, which start at X and continue to (at least) Y.
Where Z = 2, the second iteration, we find all the rows of the first iteration where there is a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That becomes the new highest number, and we add one to the number of iterations. As only three numbers in our original data set have successors within the set, there are only three rows output in the second iteration.
Where Z = 3, the third iteration, we find all the rows of the second iteration (note we are not considering all the rows of the first iteration again), where there is, again, a value in R which is one higher than our Y value, or one higher than the last member of our sequence of consecutive numbers. That, again, becomes the new highest number, and we add one to the number of iterations.
The process will attempt a fourth iteration, but as there are no rows in R where the value is one more than the Y values from our third iteration, no extra data gets added to the CTE and recursion ends.
Going back to the original db<>fiddle, the process then searches our CTE content to output MAX(Y-X) + 1, which is the maximum difference between the first and last values in any consecutive sequence, plus one. This finds its value from the record produced in the third iteration, using ((16-14) + 1) which has a value of 3.
For this specific piece of code, the output is always equivalent to the value in the Z field as every addition of a row through the recursion adds one to Z and adds one to Y.
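To replay this walk-through outside the fiddle, here is a minimal sketch in Python using the standard sqlite3 module. It assumes SQLite as the engine (which is probably not the one behind the fiddle), so the CTE is written without parentheses around each SELECT, and Z is listed explicitly in the column list. The table R and column A come from the question; everything else is only illustrative.

```python
import sqlite3

# In-memory database holding the number set from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.executemany(
    "INSERT INTO R (A) VALUES (?)",
    [(v,) for v in (7, 9, 10, 14, 15, 16, 18)],
)

# Amended CTE: Z counts the iteration that produced each row.
cte = """
WITH RECURSIVE Mystery(X, Y, Z) AS (
    SELECT A, A, 1 FROM R
    UNION
    SELECT m1.X, m2.A, m1.Z + 1
    FROM Mystery m1
    JOIN R m2 ON m2.A = m1.Y + 1
)
"""

# Full content of the CTE once no iteration adds new rows.
rows = conn.execute(cte + "SELECT X, Y, Z FROM Mystery ORDER BY Z, X").fetchall()

# The final SELECT of the original query.
longest = conn.execute(cte + "SELECT MAX(Y - X) + 1 FROM Mystery").fetchone()[0]

for row in rows:
    print(row)
print("longest run:", longest)
```

The recursion stops on its own: the step that would build Z = 4 finds no value 17 in R, so it contributes nothing and the engine has no new rows to feed into another pass.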

Related

Grouping rows so a column sums to no more than 10 per group

I have a table that looks like:
col1
------
2
2
3
4
5
6
7
with values sorted in ascending order.
I want to assign each row to groups with labels 0,1,...,n so that each group has a total of no more than 10. So in the above example it would look like this:
col1 |label
------------
2 0
2 0
3 0
4 1
5 1
6 2
7 3
I tried using this:
floor(sum(col1) OVER (ORDER BY col1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / 10)
But this doesn't work correctly because it performs the operations as:
floor(2/10) = 0
floor([2+2]/10) = 0
floor([2+2+3]/10) = 0
floor([2+2+3+4]/10) = 1
floor([2+2+3+4+5]/10) = 1
floor([2+2+3+4+5+6]/10) = 2
floor([2+2+3+4+5+6+7]/10) = 2
It's all coincidentally correct until the last calculation, because even though
[2+2+3+4+5+6+7] / 10 = 2.9
and
floor(2.9) = 2
what it should do is realise that 6+7 is > 10, so the last row (value 7) needs to be in its own group: increment the group number by 1 and allocate this row to the new group.
What I really want it to do is when it encounters a sum > 10 then set group number = group number + 1, allocate the CURRENT ROW into this new group, and then finally set the new start row to be the CURRENT ROW.
This is too long for a comment.
Solving this problem requires scanning the table, row-by-row. In SQL, this would be through a recursive CTE (or hierarchical query). Hive supports neither of these.
The issue is that each time a group is defined, the difference between 10 and the sum is "forgotten". That is, when you are further down in the list, what happens earlier on is not a simple accumulation of the available data. You need to know how it was split into groups.
A related problem is solvable. The related problem would assign all rows to groups of size 10, splitting rows between two groups. Then you would know what group a later row is in based only on the cumulative sum of the previous rows.
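To make the "row-by-row scan" concrete, here is a small sketch of that logic in Python rather than SQL. The function name assign_groups and the limit parameter are made up for illustration; the point is only that the running total is reset whenever a new group starts, which is exactly the state a plain cumulative sum cannot carry.

```python
def assign_groups(values, limit=10):
    """Label each value with a group number so that no group sums above limit.

    Values are scanned in the given order; a new group starts as soon as
    adding the current value would push the running total over the limit.
    """
    labels = []
    label = 0
    running = 0  # sum of the values already placed in the current group
    for value in values:
        if labels and running + value > limit:
            label += 1   # open a new group ...
            running = 0  # ... and forget the previous group's total
        running += value
        labels.append(label)
    return labels


col1 = [2, 2, 3, 4, 5, 6, 7]
labels = assign_groups(col1)
print(list(zip(col1, labels)))
```

On the sample column this gives the labels 0, 0, 0, 1, 1, 2, 3 that the question asks for.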

Set outcome of formula to working days

I would like to change the outcome of a SQL statement formula to 1, 2, 3, 4 or 5 (these are working days).
Example 1: when I have day 1, minus 2 days the outcome should be 4.
Example 2: when I have day 4, plus 2 days the outcome should be 1.
Example 3: when I have day 5, minus 20 days, the outcome should be 5
At the moment I'm using a table as shown below (I have the input and days-back and the output is what i want to see):
Input  days-back  output
1      0          1
1      1          5
1      2          4
2      4          3
P.s. I do not have a date, only day numbers.
I hope you understand what I'm looking for :)
If you want to have "days-back" greater than 5 you need to use the following formula:
((Input + ((5*days-back)-1) - days-back) % 5) + 1
How this works - If you look at the prior formula you can see I'm adding 5 to input to make sure we are always positive before I subtract one and the days back. I then mod by 5 and add the one back in so that we go from 1 to 5 instead of 0 to 4
Since I don't know how large days-back is going to be I need something larger, but I also need it to not affect the mod 5 calculation, so I just multiply it by 5. I then subtract one (so I can add it later and offset 0 to 4 to 1 to 5) and we are done.
prior answer below
I note I missed the 5 case -- here is the formula that works for that:
((Input + 4 - days-back) % 5) + 1
original answer
You need to use modulus math. The formula you want is
(Input + 5 - days-back) % 5
Where % means modulus. In SQL Server you can use %, in Oracle it is MOD, etc. It depends on the platform.
For those that care here is my DB2 test code:
WITH TEST_TABLE(input, days_back) AS
(
VALUES
(1,0),
(1,1),
(1,2),
(2,4)
)
SELECT TEST_TABLE.*,
MOD(INPUT+4-DAYS_BACK,5)+1 AS OUTPUT
FROM TEST_TABLE
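As a quick cross-check, the formula given at the top of this answer can also be tried in Python. The function name shift_workday is hypothetical; the expression inside it is copied term by term from the formula above.

```python
def shift_workday(day, days_back):
    """((Input + ((5*days-back)-1) - days-back) % 5) + 1, with day in 1..5."""
    return ((day + ((5 * days_back) - 1) - days_back) % 5) + 1


# The rows of the question's table: (Input, days-back).
table = [(1, 0), (1, 1), (1, 2), (2, 4)]
results = [shift_workday(day, back) for day, back in table]
print(results)

# Example 3 from the question: day 5 minus 20 days.
print(shift_workday(5, 20))
```

One caveat: Python's % never returns a negative result for a positive divisor, so a negative days_back ("plus 2 days") also works here. In SQL dialects where the remainder takes the sign of the dividend that is not guaranteed, which is why the formula works so hard to keep the left-hand side positive.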

How do I remove contiguous sequences of almost identical records from database

I have a SQL Server database containing real-time stock quotes.
There is a Quotes table containing what you would expect-- a sequence number, ticker symbol, time, price, bid, bid size, ask, ask size, etc.
The sequence number corresponds to a message that was received containing data for a set of ticker symbols being tracked. A new message (with a new, incrementing sequence number) is received whenever anything changes for any of the symbols being tracked. The message contains data for all symbols (even for those where nothing changed).
When the data was put into the database, a record was inserted for every symbol in each message, even for symbols where nothing changed since the prior message. So a lot of records contain redundant information (only the sequence number changed) and I want to remove these redundant records.
This is not the same as removing all but one record from the entire database for a combination of identical columns (already answered). Rather, I want to compress each contiguous block of identical records (identical except for sequence number) into a single record. When finished, there may be duplicate records but with differing records between them.
My approach was to find contiguous ranges of records (for a ticker symbol) where everything is the same except the sequence number.
In the following sample data I simplify things by showing only Sequence, Symbol, and Price. The compound primary key would be Sequence+Symbol (each symbol appears only once in a message). I want to remove records where Price is the same as the prior record (for a given ticker symbol). For ticker X it means I want to remove the range [1, 6], and for ticker Y I want to remove the ranges [1, 2], [4, 5] and [7, 7]:
Before:
Sequence Symbol Price
0 X $10
0 Y $ 5
1 X $10
1 Y $ 5
2 X $10
2 Y $ 5
3 X $10
3 Y $ 6
4 X $10
4 Y $ 6
5 X $10
5 Y $ 6
6 X $10
6 Y $ 5
7 X $11
7 Y $ 5
After:
Sequence Symbol Price
0 X $10
0 Y $ 5
3 Y $ 6
6 Y $ 5
7 X $11
Note that (Y, $5) appears twice but with (Y, $6) between.
The following generates the ranges I need. The left outer join ensures I select the first group of records (where there is no earlier record that is different), and the BETWEEN is intended to reduce the number of records that need to be searched to find the next-earlier different record (the results are the same without the BETWEEN, but slower). I would need only to add something like "DELETE FROM Quotes WHERE Sequence BETWEEN StartOfRange AND EndOfRange".
SELECT
    GroupsOfIdenticalRecords.Symbol,
    MIN(GroupsOfIdenticalRecords.Sequence)+1 AS StartOfRange,
    MAX(GroupsOfIdenticalRecords.Sequence) AS EndOfRange
FROM
    (
    SELECT
        Q1.Symbol,
        Q1.Sequence,
        MAX(Q2.Sequence) AS ClosestEarlierDifferentRecord
    FROM
        Quotes AS Q1
        LEFT OUTER JOIN
        Quotes AS Q2
        ON
            Q2.Sequence BETWEEN Q1.Sequence-100 AND Q1.Sequence-1
            AND Q2.Symbol=Q1.Symbol
            AND Q2.Price<>Q1.Price
    GROUP BY
        Q1.Sequence,
        Q1.Symbol
    ) AS GroupsOfIdenticalRecords
GROUP BY
    GroupsOfIdenticalRecords.Symbol,
    GroupsOfIdenticalRecords.ClosestEarlierDifferentRecord
The problem is that this is way too slow and runs out of memory (crashing SSMS- remarkably) for the 2+ million records in the database. Even if I change "-100" to "-2" it is still slow and runs out of memory. I expected the "ON" clause of the LEFT OUTER JOIN to limit the processing and memory usage (2 million iterations, processing about 100 records each, which should be tractable), but it seems like SQL Server may first be generating all combinations of the 2 instances of the table, Q1 and Q2 (about 4e12 combinations) before selecting based on the criteria specified in the ON clause.
If I run the query on a smaller subset of the data (for example, by using "(SELECT TOP 100000 * FROM Quotes) AS Q1", and similar for Q2), it completes in a reasonable amount of time. I was trying to figure out how to automatically run this 20 or so times using "WHERE Sequence BETWEEN 0 AND 99999", then "...BETWEEN 100000 AND 199999", etc. (actually I would use overlapping ranges such as [0,99999], [99900, 199999], etc. to remove ranges that span boundaries).
The following generates sets of ranges to split the data into 100000 record blocks ([0,99999], [100000, 199999], etc). But how do I apply the above query repeatedly (once for each range)? I keep getting stuck because you can't group these using "BETWEEN" without applying an aggregate function. So instead of selecting blocks of records, I only know how to get MIN(), MAX(), etc. (single values), which does not work with the above query (as Q1 and Q2). Is there a way to do this? Is there a totally different (and better) approach to the problem?
SELECT
    CONVERT(INTEGER, Sequence / 100000)*100000 AS BlockStart,
    MIN(((1+CONVERT(INTEGER, Sequence / 100000))*100000)-1) AS BlockEnd
FROM
    Quotes
GROUP BY
    CONVERT(INTEGER, Sequence / 100000)*100000
You can do this with a nice little trick. The groups that you want can be defined as the difference between two sequences of numbers. One is assigned for each symbol in order by sequence. The other is assigned for each symbol and price. This is what it looks like for your data:
Sequence Symbol Price seq1 seq2 diff
0 X $10 1 1 0
0 Y $ 5 1 1 0
1 X $10 2 2 0
1 Y $ 5 2 2 0
2 X $10 3 3 0
2 Y $ 5 3 3 0
3 X $10 4 4 0
3 Y $ 6 4 1 3
4 X $10 5 5 0
4 Y $ 6 5 2 3
5 X $10 6 6 0
5 Y $ 6 6 3 3
6 X $10 7 7 0
6 Y $ 5 7 4 3
7 X $11 8 1 7
7 Y $ 5 8 5 3
You can stare at this and figure out that the combination of symbol, diff, and price defines each group.
The following puts this into a SQL query to return the data you want:
select min(q.sequence) as sequence, symbol, price
from (select q.*,
             (row_number() over (partition by symbol order by sequence) -
              row_number() over (partition by symbol, price order by sequence)
             ) as grp
      from quotes q
     ) q
group by symbol, grp, price;
If you want to replace the data in the original table, I would suggest that you store the results of the query in a temporary table, truncate the original table, and then re-insert the values from the temporary table.
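For anyone who wants to see the trick without a database at hand, here is a sketch that recomputes seq1, seq2 and their difference in plain Python on the sample data from the question. The function name first_of_each_block is made up; the two counters play the role of the two row_number() calls, and keeping the first sequence per (symbol, diff, price) stands in for min(sequence) with the GROUP BY.

```python
from collections import defaultdict

# (sequence, symbol, price): the "Before" data from the question.
quotes = [
    (0, "X", 10), (0, "Y", 5), (1, "X", 10), (1, "Y", 5),
    (2, "X", 10), (2, "Y", 5), (3, "X", 10), (3, "Y", 6),
    (4, "X", 10), (4, "Y", 6), (5, "X", 10), (5, "Y", 6),
    (6, "X", 10), (6, "Y", 5), (7, "X", 11), (7, "Y", 5),
]


def first_of_each_block(rows):
    seq1 = defaultdict(int)  # row number per symbol
    seq2 = defaultdict(int)  # row number per (symbol, price)
    first = {}               # (symbol, diff, price) -> first sequence seen
    for sequence, symbol, price in sorted(rows):
        seq1[symbol] += 1
        seq2[(symbol, price)] += 1
        diff = seq1[symbol] - seq2[(symbol, price)]
        # Rows arrive in sequence order, so the first hit is the minimum.
        first.setdefault((symbol, diff, price), sequence)
    return sorted((seq, sym, price) for (sym, _, price), seq in first.items())


kept = first_of_each_block(quotes)
for row in kept:
    print(row)
```

The rows printed are the five records of the "After" table in the question.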
Answering my own question. I want to add some additional comments to complement the excellent answer by Gordon Linoff.
You're right. It is a nice little trick. I had to stare at it for a while to understand how it works. Here are my thoughts for the benefit of others.
The numbering by Sequence/Symbol (seq1) always increases, whereas the numbering by Symbol/Price (seq2) only increases sometimes (within each group, only when a record for Symbol contains the group's Price). Therefore seq1 either remains in lock step with seq2 (i.e., diff remains constant, until either Symbol or Price changes), or seq1 "runs away" from seq2 (while it is busy "counting" other Prices and other Symbols-- which increases the difference between seq1 and seq2 for a given Symbol and Price). Once seq2 falls behind, it can never "catch up" to seq1, so a given value of diff is never seen again once diff moves to the next larger value (for a given Price). By taking the minimum value within each Symbol/Price group, you get the first record in each contiguous block, which is exactly what I needed.
I don't use SQL a lot, so I wasn't familiar with the OVER clause. I just took it on faith that the first clause generates seq1 and the second generates seq2. I can kind of see how it works, but that's not the interesting part.
My data contained more than just Price. It was a simple thing to add the other fields (Bid, Ask, etc.) to the second OVER clause and the final GROUP BY:
row_number() over (partition by Symbol, Price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time order by Sequence)
group by Symbol, grp, price, Bid, BidSize, Ask, AskSize, Change, Volume, DayLow, DayHigh, Time
Also, I was able to use >MIN(...) and <=MAX(...) to define ranges of records to delete.

Circle Summation (30 Points) InterviewStreet Puzzle

The following is a problem from Interviewstreet. I am not getting any help from their site, so I am asking here. I am not interested in an algorithm/solution, but I do not understand the sample output they give for their second input. Can anybody please help me understand the second input and output as specified in the problem statement?
Circle Summation (30 Points)
There are N children sitting along a circle, numbered 1,2,...,N clockwise. The ith child has a piece of paper with number ai written on it. They play the following game:
In the first round, the child numbered x adds to his number the sum of the numbers of his neighbors.
In the second round, the child next in clockwise order adds to his number the sum of the numbers of his neighbors, and so on.
The game ends after M rounds have been played.
Input:
The first line contains T, the number of test cases. T cases follow. The first line for a test case contains two space separated integers N and M. The next line contains N integers, the ith number being ai.
Output:
For each test case, output N lines each having N integers. The jth integer on the ith line contains the number that the jth child ends up with if the game starts with child i playing the first round. Output a blank line after each test case except the last one. Since the numbers can be really huge, output them modulo 1000000007.
Constraints:
1 <= T <= 15
3 <= N <= 50
1 <= M <= 10^9
1 <= ai <= 10^9
Sample Input:
2
5 1
10 20 30 40 50
3 4
1 2 1
Sample Output:
80 20 30 40 50
10 60 30 40 50
10 20 90 40 50
10 20 30 120 50
10 20 30 40 100
23 7 12
11 21 6
7 13 24
Here is an explanation of the second test case. I will use a notation (a, b, c) meaning that child one has number a, child two has number b and child three has number c. In the beginning, the position is always (1,2,1).
If the first child is the first to sum its neighbours, the table goes through the following situations (I will put an asterisk in front of the child that just added its two neighbouring numbers):
(1,2,1)->(*4,2,1)->(4,*7,1)->(4,7,*12)->(*23,7,12)
If the second child is the first to move:
(1,2,1)->(1,*4,1)->(1,4,*6)->(*11,4,6)->(11,*21,6)
And last if the third child is first to move:
(1,2,1)->(1,2,*4)->(*7,2,4)->(7,*13,4)->(7,13,*24)
And as you notice the output to the second case are exactly the three triples computed that way.
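If it helps, the same hand simulation can be written as a few lines of Python. This is only a direct replay of the rules for checking the samples (the function name play is made up), not a solution that would cope with M up to 10^9.

```python
MOD = 1_000_000_007


def play(values, start, rounds):
    """Replay the game; start is the 0-based index of the first child to move."""
    nums = list(values)
    n = len(nums)
    for r in range(rounds):
        i = (start + r) % n  # children take turns in clockwise order
        # nums[i - 1] wraps around to the last child when i == 0.
        nums[i] = (nums[i] + nums[i - 1] + nums[(i + 1) % n]) % MOD
    return nums


# Second sample: N = 3, M = 4, starting position (1, 2, 1).
second_case = [play([1, 2, 1], start, 4) for start in range(3)]
for line in second_case:
    print(*line)
```

Each printed line corresponds to one of the three chains above, in the same order.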
Hope that helps.

How to generate sequences with distinct subsums?

I'm looking for a way to generate some (6 by default) equations where all subsums are unique.
For example,
a+b+c=50
d+e+f=50
g+h+i=50
a, d and g have to be distinct.
a+b and d+e have to be distinct.
e+f and h+i have to be distinct.
a+c and d+f have to be distinct.
But a+b and e+f can be the same. So I only care about the subsums of aligned parameters.
I could only find ways to check whether some sequence is subsum-distinct, but I found nothing on how to generate such a sequence.
You didn't state whether you need it to be a random sequence, so suppose that this is not required.
One simple approach is this:
1 + 2 + 47 = 50
3 + 4 + 43 = 50
5 + 6 + 39 = 50
7 + 8 + 35 = 50
9 + 10 + 31 = 50
11 + 12 + 27 = 50
First two numbers are 2 smallest available numbers, the third number is final sum - those numbers.
a and b are always increasing, c is always decreasing
a + b is always increasing, b + c and a + c are always decreasing
You can generate it this way in a loop.
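A minimal sketch of that loop in Python, with made-up names (generate_triples, total, count) and a small checker for the aligned subsums:

```python
def generate_triples(total=50, count=6):
    """Use the two smallest unused numbers, then whatever is left of total."""
    triples = []
    for k in range(count):
        a = 2 * k + 1
        b = 2 * k + 2
        c = total - a - b
        if c <= 0:
            raise ValueError("total is too small for this many triples")
        triples.append((a, b, c))
    return triples


def aligned_subsums_distinct(triples):
    """Check that a, a+b, b+c and a+c are each distinct across all triples."""
    columns = [
        [a for a, b, c in triples],
        [a + b for a, b, c in triples],
        [b + c for a, b, c in triples],
        [a + c for a, b, c in triples],
    ]
    return all(len(set(col)) == len(col) for col in columns)


triples = generate_triples()
for a, b, c in triples:
    print(f"{a} + {b} + {c} = {a + b + c}")
print(aligned_subsums_distinct(triples))
```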
EDIT after comment that it has to be a random sequence:
Possibly you could create several sets (some sort of hashset/hashmap would be the most appropriate)
set of first summands
set of sums of first and second summands
set of sums of second and third summands
set of sums of first and third summands
set of previously generated triples
You would generate random triples this way:
1. If the requested number of triples has not been reached yet, generate a random triple; otherwise finish.
2. Check whether the triple was generated before. If it was, go back to step 1; if not, proceed with step 3.
3. Conduct checks against the first four sets. If none of the triple's sums is contained in those sets, add the triple (and its sums to the sets). Either way, proceed with step 1.
However, I am not sure if this approach guarantees that you will get results (especially in small final sums).
So I would add a counter: if too many consecutive attempts are unsuccessful, I would switch to a brute force approach (which should not be a problem if final sums are small, and on the other hand is very unlikely to be needed if the final sum is large).
Overall, performance should be good.
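A rough sketch of the random variant in Python, under a few assumptions of mine: all summands are positive integers, the names (random_triples, max_failures, seed) are invented, and instead of switching to brute force the function simply gives up with an error once too many consecutive attempts fail.

```python
import random


def random_triples(total=50, count=6, max_failures=10_000, seed=None):
    rng = random.Random(seed)
    firsts, ab_sums, bc_sums, ac_sums = set(), set(), set(), set()
    seen = set()      # every triple tried so far
    triples = []
    failures = 0      # consecutive unsuccessful attempts
    while len(triples) < count:
        if failures >= max_failures:
            raise RuntimeError("too many failed attempts, try another strategy")
        # Pick a and b so that c = total - a - b stays at least 1.
        a = rng.randint(1, total - 2)
        b = rng.randint(1, total - a - 1)
        c = total - a - b
        triple = (a, b, c)
        if triple in seen:
            failures += 1
            continue
        seen.add(triple)
        if (a in firsts or a + b in ab_sums
                or b + c in bc_sums or a + c in ac_sums):
            failures += 1
            continue
        firsts.add(a)
        ab_sums.add(a + b)
        bc_sums.add(b + c)
        ac_sums.add(a + c)
        triples.append(triple)
        failures = 0
    return triples


result = random_triples(seed=1)
for a, b, c in result:
    print(f"{a} + {b} + {c} = {a + b + c}")
```

Because every triple has the same total, requiring distinct a+b, b+c and a+c amounts to requiring distinct values of c, a and b respectively, so for a total of 50 there is plenty of room for six triples.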