Custom Rolling Computation - google-bigquery

Assume I have a model that has A(t) and B(t) governed by the following equations:
A(t) = {
WHEN B(t-1) < 10 : B(t-1)
WHEN B(t-1) >=10 : B(t-1) / 6
}
B(t) = A(t) * 2
The following table is provided as input.
SELECT * FROM model ORDER BY t;
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | null | null |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
I.e. we know the values of A(t=0) and B(t=0).
For each row, we want to calculate the value of A & B using the equations above.
The final table should be:
| t | A | B |
|---|---|----|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | 3 | 6 |
| 3 | 6 | 12 |
| 4 | 2 | 4 |
We've tried using lag, but because of the models' recursive-like nature, we end up only getting A & B at (t=1)
CREATE TEMPORARY FUNCTION A_fn(b_prev FLOAT64) AS (
CASE
WHEN b_prev < 10 THEN b_prev
ELSE b_prev / 6.0
END
);
SELECT
t,
CASE WHEN t = 0 THEN A ELSE A_fn(LAG(B) OVER (ORDER BY t)) END AS A,
CASE WHEN t = 0 THEN B ELSE A_fn(LAG(B) OVER (ORDER BY t)) * 2 END AS B
FROM model
ORDER BY t;
Produces:
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
Each row is dependent on the row above it. It seems it should be possible to compute a single row at a time, while iterating through the rows? Or does BigQuery not support this type of windowing?
If it is not possible, what do you recommend?

Round #1 - starting point
Below is for BigQuery Standard SQL and works (for me) with up to 3M rows
#standardSQL
CREATE TEMP FUNCTION x(v FLOAT64, t INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
var i, result = [];
for (i = 1; i <= t; i++) {
if (v < 10) {v = 2 * v}
else {v = v / 3};
result.push({t:i, v});
};
return result
""";
SELECT 0 AS t, 0 AS A, 9 AS B UNION ALL
SELECT line.t, line.v / 2, line.v FROM UNNEST(x(9, 3000000)) line
Going above 3M rows produces Resources exceeded during query execution: UDF out of memory.
To overcome this - i think you should just implement it on the client - so no JS UDF Limits are applied. I think it is reasonable "workaround" because looks like anyway you have no really data in BQ and just one starting value (9 in this example). But even if you do have other valuable columns in the table - you can then JOIN produced result back to table ON t value - so should be Ok!
Round #2 - It could be billions ... - so let's take care of scale, parallelization
Below is a little trick to avoid JS UDFs Resource and/or Memory error
So, I was able to run it for 2B rows in one shot!
#standardSQL
CREATE TEMP FUNCTION anchor(seed FLOAT64, len INT64, batch INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>> LANGUAGE js AS """
var i, result = [], v = seed;
for (i = 0; i <= len; i++) {
if (v < 10) {v = 2 * v} else {v = v / 3};
if (i % batch == 0) {result.push({t:i + 1, v})};
}; return result
""";
CREATE TEMP FUNCTION x(value FLOAT64, start INT64, len INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
var i, result = []; result.push({t:0, v:value});
for (i = 1; i < len; i++) {
if (value < 10) {value = 2 * value} else {value = value / 3};
result.push({t:i, v:value});
}; return result
""";
CREATE OR REPLACE TABLE `project.dataset.result` AS
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
anchors AS (SELECT line.* FROM settings, UNNEST(anchor(init, len, batch)) line)
SELECT 0 AS t, 0 AS A, init AS B FROM settings UNION ALL
SELECT a.t + line.t, line.v / 2, line.v
FROM settings, anchors a, UNNEST(x(v, t, batch)) line
In above query - you "control" initial values in below line
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
in above example, 9 is initial value, 2,000,000,000 is number of rows to be calculated and 1000 is a batch to process with (this is important one to keep BQ Engine out of throwing Resource and/or Memory error - you cannot make it too big or too small - i feel I got some sense of what it needs to be - but not enough for trying to formulate it)
Some stats (settings - execution time):
1M: SELECT 9 init, 1000000 len, 1000 batch - 0 min 9 sec
10M: SELECT 9 init, 10000000 len, 1000 batch - 0 min 50 sec
100M: SELECT 9 init, 100000000 len, 600 batch - 3 min 4 sec
100M: SELECT 9 init, 100000000 len, 40 batch - 2 min 56 sec
1B: SELECT 9 init, 1000000000 len, 10000 batch - 29 min 39 sec
1B: SELECT 9 init, 1000000000 len, 1000 batch - 27 min 50 sec
2B: SELECT 9 init, 2000000000 len, 1000 batch - 48 min 27 sec
Round #3 - some thoughts and comments
Obviously, as I mentioned in #1 above - this type of calculation is more suited for being implemented on client of your choice - so it is hard for me to judge practical value of above - but I really had fun playing with it! In reality, I had few more cool ideas in mind and also implemented and played with them - but above (in #2) was the most practical/scalable one
Note: The most interesting part of above solution is anchors table. It is very cheap to generate and allows to set anchors in batch-size interval - so having this you can for example calculate value of row = 2,000,035 or 1,123,456,789 (for example) without actually processing all previous rows - and this will take fraction of sec. Or you can parallelize calculation of all rows by starting several threads/calculations using respective anchors, etc. Quite a number of opportunities.
Finally, it really depends on your specific use-case which way to go further - so I am leaving it up to you

It seems it should be possible to compute a single row at a time, while iterating through the rows
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now.
So, conceptually your process could look like below script:
DECLARE b_prev FLOAT64 DEFAULT NULL;
DECLARE t INT64 DEFAULT 0;
DECLARE arr ARRAY<STRUCT<t INT64, a FLOAT64, b FLOAT64>> DEFAULT [STRUCT(0, 0.0, 9.0)];
SET b_prev = 9.0 / 2;
LOOP
SET (t, b_prev) = (t + 1, 2 * b_prev);
IF t >= 100 THEN LEAVE;
ELSE
SET b_prev = CASE WHEN b_prev < 10 THEN b_prev ELSE b_prev / 6.0 END;
SET arr = (SELECT ARRAY_CONCAT(arr, [(t, b_prev, 2 * b_prev)]));
END IF;
END LOOP;
SELECT * FROM UNNEST(arr);
Even though above script is simpler and more directly represents logic for non-technical personal and easier to manage - it does not fit in scenarios were you need to loop through more than 100 or more iterations. For example above script took close to 2 min while my original solution for same 100 rows took just 2 sec
But still great for simple / smaller cases

Related

Creating bins in presto sql - programmatically

I am new to Presto SQL syntax and and wondering if a function exists that will bin rows into n bins in a certain range.
For example, I have a a table with 1m different integers that range from 1 - 100. What can I do to create 20 bins between 1 and 100 (a bin for 1-5, 6-10, 11-15 ... etc. ) without using 20 separate CASE WHEN statements ? Are there any standard SQL functions that do will perform the binning function?
Any advice would be appreciated!
You can use the standard SQL function width_bucket. For example:
WITH data(value) AS (
SELECT rand(100)+1 FROM UNNEST(sequence(1,10000))
)
SELECT value, width_bucket(value, 1, 101, 20) bucket
FROM data
produces:
value | bucket
-------+--------
100 | 20
98 | 20
38 | 8
42 | 9
67 | 14
74 | 15
6 | 2
...
You can just use integer division:
select (intcol - 1) / 5 as bin
Presto does integer division, so you shouldn't have to worry about the remainder.

How to set manually split long rows size on Octave's Terminal Output?

How to set manually slipt long rows size on Octave's Terminal Output?
I am using Octave through Sublime Text output build panel, and octave cannot recognize correctly how many rows it should use to split/to fill up the screen.
Example, It is currently filling the screen like this:
octave:13> rand (2,10)
ans =
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
0.75697 0.51942 0.40031 0.61784 0.92309 0.40201
Columns 7 through 10:
0.90174 0.11854 0.72313 0.73326
0.44672 0.94303 0.56564 0.82150
But I want to set 10 columns (Columns 1 through 10) instead of Columns 1 through 6.
If I disable the split_long_rows, never splits.
Query or set the internal variable that controls whether rows of a
matrix may be split when displayed to a terminal window.
If the rows are split, Octave will display the matrix in a series of
smaller pieces, each of which can fit within the limits of your
terminal width and each set of rows is labeled so that you can easily
see which columns are currently being displayed.
https://www.gnu.org/software/octave/doc/v4.0.1/Matrices.html#XREFsplit_005flong_005frows
You cannot to split them like that. The Octave output is just a simple and fast way to debug your program. To print things beautifully as you want to, just to create a function for it and it to print them.
This is a similar example, where a table is printed:
...
for i = 2 : 7
...
% https://www.gnu.org/software/octave/doc/v4.0.1/Basic-Usage-of-Cell-Arrays.html
results(end+1).vector = { m, gaussLegendreIntegral________, gaussLegendreIntegralErroExato___ };
end
printf( "%20s | %30s | %30s\n", "m", "Gm", "Erro Exato Gm = |Gm - Ie |" )
printf( "%20s | %30s | %30s\n", "--------------------", "------------------------------", "------------------------------" )
numberToStringPrecision = 15;
for i = 1 : numel( results )
# https://www.gnu.org/software/octave/doc/v4.0.0/Processing-Data-in-Cell-Arrays.html
# https://www.gnu.org/software/octave/doc/v4.0.1/Converting-Numerical-Data-to-Strings.html#XREFnum2str
printf( "%20s | ", num2str( cell2mat( results(i).vector(1) ), numberToStringPrecision ) )
printf( "%30s | ", num2str( cell2mat( results(i).vector(2) ), numberToStringPrecision ) )
printf( "%30s\n" , num2str( cell2mat( results(i).vector(3) ), numberToStringPrecision ) )
end
It would generate a output like this:
m | Gm | Erro Exato Gm = |Gm - Ie |
-------------------- | ------------------------------ | ------------------------------
2 | -0.895879734614027 | 0.104120265385973
3 | -0.947672383858322 | 0.0523276161416784
4 | -0.968535977854582 | 0.0314640221454183
5 | -0.979000992287376 | 0.0209990077126242
6 | -0.984991210262343 | 0.0150087897376568
7 | -0.988738923004894 | 0.0112610769951058

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
Docs
Not sure what exactly you mean saying "it'd be better to have something that scales with the size of the array". Of course you can not have extra columns added to resultset as the inner array size grows, because postgresql must know exact colunms of a query before its execution (so before it begins to read the string).
But I would like to propose converting the string into normal relational representation of matrix:
select i, j, arr[i][j] a_i_j from (
select i, generate_subscripts(arr,2) as j, arr from (
select generate_subscripts(arr,1) as i, arr
from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be rather usable in further data processing, I think.
Of course, such a query can handle only array with predefined number of dimensions, but all array sizes for all of its dimensions can be changed without rewriting the query, so this is a bit more flexible approach.
ADDITION: Yes, using with recursive one can build resembling query, capable of handling array with arbitrary dimensions. None the less, there is no way to overcome the limitation coming from relational data model - exact set of columns must be defined at query parse time, and no way to delay this until execution time. So, we are forced to store all indices in one column, using another array.
Here is the query that extracts all elements from arbitrary multi-dimensional array along with their zero-based indices (stored in another one-dimensional array):
with recursive extract_index(k,idx,elem,arr,n) as (
select (row_number() over())-1 k, idx, elem, arr, n from (
select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
) plain_indexed
union all
select k/array_length(arr,n)::bigint k, array_prepend(k%array_length(arr,2),idx) idx, elem, arr, n-1 n
from extract_index
where n!=1
)
select array_prepend(k,idx) idx, elem from extract_index where n=1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what a real practical use one could make out of it :)

Extract a grid from a one-dimentional array

I have an array with 12 values and I need to display these in a grid like this (always 4 columns, 3 rows):
1 | 2 | 3 | 4
----------------------
5 | 6 | 7 | 8
----------------------
9 | 10 | 11 | 12
I am looping through the grid and I have two coordinates: column and row.
How do I know which index belongs to which row and column? I tried several things, but they are not working:
objectAtIndex: (row + 1) * (column + 1) - 1
objectAtIndex: row + column
etc...
Row and column indexes start with 0.
forward conversion: objectAtPosition(x,y) = array[columns*y + x]
provided x<columns && y<rows
backward conversion: positionAtIndex(i) = (row=(i div columns), col=(i mod columns))
note that div and mod correspond to integer operators / and % in C languages.

Check if a number is divisible by 3 [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
Write code to determine if a number is divisible by 3. The input to the function is a single bit, 0 or 1, and the output should be 1 if the number received so far is the binary representation of a number divisible by 3, otherwise zero.
Examples:
input "0": (0) output 1
inputs "1,0,0": (4) output 0
inputs "1,1,0,0": (6) output 1
This is based on an interview question. I ask for a drawing of logic gates but since this is stackoverflow I'll accept any coding language. Bonus points for a hardware implementation (verilog etc).
Part a (easy): First input is the MSB.
Part b (a little harder): First input is the LSB.
Part c (difficult): Which one is faster and smaller, (a) or (b)? (Not theoretically in the Big-O sense, but practically faster/smaller.) Now take the slower/bigger one and make it as fast/small as the faster/smaller one.
There's a fairly well-known trick for determining whether a number is a multiple of 11, by alternately adding and subtracting its decimal digits. If the number you get at the end is a multiple of 11, then the number you started out with is also a multiple of 11:
47278 4 - 7 + 2 - 7 + 8 = 0, multiple of 11 (47278 = 11 * 4298)
52214 5 - 2 + 2 - 1 + 4 = 8, not multiple of 11 (52214 = 11 * 4746 + 8)
We can apply the same trick to binary numbers. A binary number is a multiple of 3 if and only if the alternating sum of its bits is also a multiple of 3:
4 = 100 1 - 0 + 0 = 1, not multiple of 3
6 = 110 1 - 1 + 0 = 0, multiple of 3
78 = 1001110 1 - 0 + 0 - 1 + 1 - 1 + 0 = 0, multiple of 3
109 = 1101101 1 - 1 + 0 - 1 + 1 - 0 + 1 = 1, not multiple of 3
It makes no difference whether you start with the MSB or the LSB, so the following Python function works equally well in both cases. It takes an iterator that returns the bits one at a time. multiplier alternates between 1 and 2 instead of 1 and -1 to avoid taking the modulo of a negative number.
def divisibleBy3(iterator):
multiplier = 1
accumulator = 0
for bit in iterator:
accumulator = (accumulator + bit * multiplier) % 3
multiplier = 3 - multiplier
return accumulator == 0
Here... something new... how to check if a binary number of any length (even thousands of digits) is divisible by 3.
-->((0))<---1--->()<---0--->(1) ASCII representation of graph
From the picture.
You start in the double circle.
When you get a one or a zero, if the digit is inside the circle, then you stay in that circle. However if the digit is on a line, then you travel across the line.
Repeat step two until all digits are comsumed.
If you finally end up in the double circle then the binary number is divisible by 3.
You can also use this for generating numbers divisible by 3. And I wouldn't image it would be hard to convert this into a circuit.
1 example using the graph...
11000000000001011111111111101 is divisible by 3 (ends up in the double circle again)
Try it for yourself.
You can also do similar tricks for performing MOD 10, for when converting binary numbers into base 10 numbers. (10 circles, each doubled circled and represent the values 0 to 9 resulting from the modulo)
EDIT: This is for digits running left to right, it's not hard to modify the finite state machine to accept the reverse language though.
NOTE: In the ASCII representation of the graph () denotes a single circle and (()) denotes a double circle. In finite state machines these are called states, and the double circle is the accept state (the state that means its eventually divisible by 3)
Heh
State table for LSB:
S I S' O
0 0 0 1
0 1 1 0
1 0 2 0
1 1 0 1
2 0 1 0
2 1 2 0
Explanation: 0 is divisible by three. 0 << 1 + 0 = 0. Repeat using S = (S << 1 + I) % 3 and O = 1 if S == 0.
State table for MSB:
S I S' O
0 0 0 1
0 1 2 0
1 0 1 0
1 1 0 1
2 0 2 0
2 1 1 0
Explanation: 0 is divisible by three. 0 >> 1 + 0 = 0. Repeat using S = (S >> 1 + I) % 3 and O = 1 if S == 0.
S' is different from above, but O works the same, since S' is 0 for the same cases (00 and 11). Since O is the same in both cases, O_LSB = O_MSB, so to make MSB as short as LSB, or vice-versa, just use the shortest of both.
Here is an simple way to do it by hand.
Since 1 = 22 mod 3, we get 1 = 22n mod 3 for every positive integer.
Furthermore 2 = 22n+1 mod 3. Hence one can determine if an integer is divisible by 3 by counting the 1 bits at odd bit positions, multiply this number by 2, add the number of 1-bits at even bit posistions add them to the result and check if the result is divisible by 3.
Example: 5710=1110012.
There are 2 bits at odd positions, and 2 bits at even positions. 2*2 + 2 = 6 is divisible by 3. Hence 57 is divisible by 3.
Here is also a thought towards solving question c). If one inverts the bit order of a binary integer then all the bits remain at even/odd positions or all bits change. Hence inverting the order of the bits of an integer n results is an integer that is divisible by 3 if and only if n is divisible by 3. Hence any solution for question a) works without changes for question b) and vice versa. Hmm, maybe this could help to figure out which approach is faster...
You need to do all calculations using arithmetic modulo 3. This is the way
MSB:
number=0
while(!eof)
n=input()
number=(number *2 + n) mod 3
if(number == 0)
print divisible
LSB:
number = 0;
multiplier = 1;
while(!eof)
n=input()
number = (number + multiplier * n) mod 3
multiplier = (multiplier * 2) mod 3
if(number == 0)
print divisible
This is general idea...
Now, your part is to understand why this is correct.
And yes, do homework yourself ;)
The idea is that the number can grow arbitrarily long, which means you can't use mod 3 here, since your number will grow beyond the capacity of your integer class.
The idea is to notice what happens to the number. If you're adding bits to the right, what you're actually doing is shifting left one bit and adding the new bit.
Shift-left is the same as multiplying by 2, and adding the new bit is either adding 0 or 1. Assuming we started from 0, we can do this recursively based on the modulo-3 of the last number.
last | input || next | example
------------------------------------
0 | 0 || 0 | 0 * 2 + 0 = 0
0 | 1 || 1 | 0 * 2 + 1 = 1
1 | 0 || 2 | 1 * 2 + 0 = 2
1 | 1 || 0 | 1 * 2 + 1 = 0 (= 3 mod 3)
2 | 0 || 1 | 2 * 2 + 0 = 1 (= 4 mod 3)
2 | 1 || 2 | 2 * 2 + 1 = 2 (= 5 mod 3)
Now let's see what happens when you add a bit to the left. First, notice that:
22n mod 3 = 1
and
22n+1 mod 3 = 2
So now we have to either add 1 or 2 to the mod based on if the current iteration is odd or even.
last | n is even? | input || next | example
-------------------------------------------
d/c | don't care | 0 || last | last + 0*2^n = last
0 | yes | 1 || 0 | 0 + 1*2^n = 1 (= 2^n mod 3)
0 | no | 1 || 0 | 0 + 1*2^n = 2 (= 2^n mod 3)
1 | yes | 1 || 0 | 1 + 1*2^n = 2
1 | no | 1 || 0 | 1 + 1*2^n = 0 (= 3 mod 3)
1 | yes | 1 || 0 | 2 + 1*2^n = 0
1 | no | 1 || 0 | 2 + 1*2^n = 1
input "0": (0) output 1
inputs "1,0,0": (4) output 0
inputs "1,1,0,0": (6) output 1
shouldn't this last input be 12, or am i misunderstanding the question?
Actually the LSB method would actually make this easier. In C:
MSB method:
/*
Returns 1 if divisble by 3, otherwise 0
Note: It is assumed 'input' format is valid
*/
int is_divisible_by_3_msb(char *input) {
unsigned value = 0;
char *p = input;
if (*p == '1') {
value &= 1;
}
p++;
while (*p) {
if (*p != ',') {
value <<= 1;
if (*p == '1') {
ret &= 1;
}
}
p++;
}
return (value % 3 == 0) ? 1 : 0;
}
LSB method:
/*
Returns 1 if divisble by 3, otherwise 0
Note: It is assumed 'input' format is valid
*/
int is_divisible_by_3_lsb(char *input) {
unsigned value = 0;
unsigned mask = 1;
char *p = input;
while (*p) {
if (*p != ',') {
if (*p == '1') {
value &= mask;
}
mask <<= 1;
}
p++;
}
return (value % 3 == 0) ? 1 : 0;
}
Personally I have a hard time believing one of these will be significantly different to the other.
I think Nathan Fellman is on the right track for part a and b (except b needs an extra piece of state: you need to keep track of if your digit position is odd or even).
I think the trick for part C is negating the last value at each step. I.e. 0 goes to 0, 1 goes to 2 and 2 goes to 1.
A number is divisible by 3 if the sum of it's digits is divisible by 3.
So you can add the digits and get the sum:
if the sum is greater or equal to 10 use the same method
if it's 3, 6, 9 then it is divisible
if the sum is different than 3, 6, 9 then it's not divisible