Long execution time of a simple SQL script - google-bigquery

When I run the script below, it runs for about 30 seconds, but when it finishes, the job information reports an elapsed time of 0.1 seconds and a compute time of 2 ms.
Could you tell me why this query runs for 30 seconds even though I do not use any table?
declare heads bool;
declare heads_in_a_row int64 default 0; # number of heads in a row
declare nb_of_throws int64 default 0;   # number of throws
# How many throws do I need to get 8 heads in a row?
while heads_in_a_row < 8 do
    set heads = RAND() < 0.5;
    set nb_of_throws = nb_of_throws + 1;
    if heads then
        set heads_in_a_row = heads_in_a_row + 1;
    else
        set heads_in_a_row = 0;
    end if;
end while;
select nb_of_throws;

This will sometimes be fast and sometimes slow; it depends on random chance how long it takes to reach 8 heads in a row. Try 50 heads in a row and you'll see it's much slower; 2 heads in a row will be faster. The 0.1 s elapsed time you are seeing is the elapsed time for only the final section of the script, the SELECT statement, which just pulls your stored variable value and is nearly instant.

What is your answer? I have run it a few times and got results in the hundreds and thousands of 'throws'. The odds of getting 8 heads in a row on any given run of 8 flips are about 0.39%, so not very likely. BQ (and databases in general) are optimized for set-based and column-based computation, not loop-based operations. Using BQ for this problem is likely not your best option; Python, for example, would be much faster.
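To illustrate that last point, here is a minimal Python sketch of the same experiment (an assumption on my part: random.random() stands in for RAND() < 0.5, and the function name is my own):

import random

def throws_until_streak(streak_len=8):
    """Count coin flips until we see `streak_len` heads in a row."""
    heads_in_a_row = 0
    nb_of_throws = 0
    while heads_in_a_row < streak_len:
        nb_of_throws += 1
        if random.random() < 0.5:   # heads
            heads_in_a_row += 1
        else:
            heads_in_a_row = 0
    return nb_of_throws

print(throws_until_streak(8))  # typically a few hundred throws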

Related

How is the complexity of the following code O(n log n)?

for (i = 1; i <= n; i = i * 2)
{
    for (j = 1; j <= i; j++)
    {
    }
}
Time complexity in terms of what? If you want to know how many inner-loop operations the algorithm performs, it is not O(n log n). If you also want to take the arithmetic operations into account, see further below. If you were literally to plug that code into a compiler, chances are it would notice that the code does nothing and optimise the loops away, resulting in constant O(1) time complexity. Based only on what you've given us, though, I would interpret the question as asking about time complexity in terms of whatever might be inside the inner loop, not counting the arithmetic operations of the loops themselves. If so:
Consider an iteration of your inner loop a constant-time operation; then we just need to count how many iterations the inner loop makes.
You will find that it makes
1 + 2 + 4 + 8 + ... + n
iterations, if n is a power of two. If it is not, the loop stops a bit sooner, but this gives us our upper limit.
We can write this more generally as
the sum of 2^i, where i ranges from 0 to log2(n).
Now, if you do the math, e.g. using the formula for geometric sums, you will find that this sum equals
2n - 1.
So we have a time complexity of O(2n - 1) = O(n), if we don't take the arithmetic operations of the loops into account.
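For reference, here is the geometric-sum step written out (assuming n is a power of two, so that log2(n) is an integer):

\sum_{i=0}^{\log_2 n} 2^i = 2^{\log_2 n + 1} - 1 = 2 \cdot 2^{\log_2 n} - 1 = 2n - 1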
If you wish to verify this experimentally, the best way is to write code that counts how many times the inner loop runs. In JavaScript, you could write it like this:
function f(n) {
    let c = 0;
    for (let i = 1; i <= n; i = i * 2) {
        for (let j = 1; j <= i; j++) {
            ++c;
        }
    }
    console.log(c);
}
f(2);
f(4);
f(32);
f(1024);
f(1 << 20);
If you do want to take the arithmetic operations into account, then it depends a bit on your assumptions, but you can indeed pick up some logarithmic factors. It depends on how you formulate the question and how you define an operation.
First, we need to estimate the number of high-level operations executed for different n. In this case the inner loop body is the operation you want to count, if I understood the question correctly.
If that is difficult to do by hand, you can automate it. I used Matlab for the example code since no specific language was tagged. The testing code looks like this:
% Reasonable number of input elements placed in an array, change it to fit your needs
x = 1:1:100;
% Plot linear function
plot(x, x, 'DisplayName', 'O(n)', 'LineWidth', 2);
hold on;
% Plot n*log(n) function
plot(x, x.*log(x), 'DisplayName', 'O(nln(n))', 'LineWidth', 2);
hold on;
% Apply our function to each element of x
measured = arrayfun(@(v) test(v), x);
% Plot the number of high-level operations performed by our function for each element of x
plot(x, measured, 'DisplayName', 'Measured', 'LineWidth', 2);
legend
% Our function
function k = test(n)
    % Counter for operations
    k = 0;
    % Outer loop, same as for(i=1;i<=n;i=i*2)
    i = 1;
    while i <= n
        % Inner loop
        for j = 1:1:i
            % Count operations
            k = k + 1;
        end
        i = i * 2;
    end
end
And the result will look like this (plot omitted: the measured operation counts fall between the O(n) and O(n*ln(n)) reference curves).
Our complexity is worse than linear but not worse than O(nlogn), so we choose O(nlogn) as an upper bound.
Furthermore, the upper bound should be:
O(n * log2(n))
The worst case is n being a power of two, i.e. n = 2^x for some natural number x.
The inner loop is evaluated up to n times, the outer loop log2(n) (logarithm base 2) times.

How to calculate the sum of all the odd numbers less than 1000 using a for loop in Xcode (Objective-C)

I'm very new to programming and I have no idea where to start; I'm just looking for some help with this. I know it's very simple, but I'm clueless. Thanks for the help!
So this is the code I have:
NSInteger sum = 0;
for (int a = 1; a < 500; a++) {
    sum += (a * 2 - 1);
}
NSLog(@"The sum of all the odd numbers within the range = %ld", (long)sum);
but I'm getting an answer of 249,001 when it should be 250,000.
Appreciate the help!
Your immediate problem is that you're missing a term in the sum: your output differs from the actual answer by exactly 999, the largest odd number below 1000 (the term contributed by a = 500). This ought to motivate you to write a <= 500 as the stopping condition in the for loop instead.
But, in reality, you would not use a for loop for this as there is an alternative that's much cheaper computationally speaking.
Note that this is an arithmetic progression and there is therefore a closed-form solution to this. That is, you can get the answer out in O(1) rather than by using a loop which would be O(n); i.e. the compute time grows linearly with the number of terms that you want.
Recognising that there are 500 odd numbers in your range, you can use
n * (2 * a + (n - 1) * d) / 2
to compute this. In your case, n is 500, d (the difference between consecutive terms) is 2, and a (the first term) is 1, giving 500 * (2 + 499 * 2) / 2 = 250,000.
See https://en.wikipedia.org/wiki/Arithmetic_progression
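To convince yourself that the loop fix and the closed form agree, here is a minimal Python sketch (the translation to Python is mine, not the poster's code):

# Corrected loop: a must run up to and including 500
loop_sum = sum(a * 2 - 1 for a in range(1, 501))

# Closed-form arithmetic-progression sum: n*(2*a + (n-1)*d)/2
n, a, d = 500, 1, 2
closed_form = n * (2 * a + (n - 1) * d) // 2

print(loop_sum, closed_form)  # both print 250000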

Calculating a recursive sequence iteratively - code optimization

I have to calculate first 3000 items of a sequence given as follows:
a_1 = 1,
a_(n+1) = the smallest integer > a_n such that, for every (not necessarily distinct) 1 <= i, j, k <= n+1, a_i + a_j ≠ 3*a_k
I have written code (in Magma) that works correctly, but its time complexity is obviously way too large. I am asking whether there is a way to reduce the time complexity. I had an idea to move the inner for loop (the one causing havoc) out by building an array of all the sums, but I can't get it to work right. My code is attached below:
S := [1];
for n := 2 to 3000 do
    new := S[n-1];
    repeat
        flag := 0;
        new +:= 1;
        for i, j in S do
            if (i + j eq 3*new) or (i + new eq 3*j) then
                flag := 1;
                break i;
            end if;
        end for;
    until flag eq 0;
    S[n] := new;
end for;
print S[2015];
P.S.: If it helps, I also know Python, Pascal and C if you prefer any of those languages.
I copied your program into MAGMA; the run time for n = 2978 was 4712.766 seconds. I changed your program as follows and the result was amazing: the run time of the changed version for n = 3000 was 41.250 seconds.
S := [1];
for n := 2 to 3000 do
    new := S[n-1];
    repeat
        flag := 0;
        new +:= 1;
        for i in S do
            if ((3*new - i) in S) or ((i + new)/3 in S) then
                flag := 1;
                break i;
            end if;
        end for;
    until flag eq 0;
    S[n] := new;
end for;
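The same trick carries over to the languages the poster mentions. Here is a minimal Python sketch (the function name and the use of a Python set are my own; it mirrors the membership tests 3*new - i and (i + new)/3 from the Magma version, with an explicit divisibility check since Python's // is integer division):

def build_sequence(count):
    S = [1]          # the sequence a_1, a_2, ...
    members = {1}    # set mirror of S for O(1) membership tests
    for _ in range(2, count + 1):
        new = S[-1]
        while True:
            new += 1
            ok = True
            for i in S:
                # i + j eq 3*new  becomes:  (3*new - i) in members
                # i + new eq 3*j  becomes:  (i + new)/3 in members
                if (3 * new - i) in members or \
                   ((i + new) % 3 == 0 and (i + new) // 3 in members):
                    ok = False
                    break
            if ok:
                break
        S.append(new)
        members.add(new)
    return S

print(build_sequence(3000)[2014])  # S[2015] in 1-based indexing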

C++/CLI DateTime calculation error

I am trying to predict the estimated completion time of a simulation. I take startTime at the start of the simulation. At the end of each cycle, I take timeNow. The time elapsed timeLapsed is calculated by subtracting these two values. The average cycle time (varies per cycle) is calculated by dividing the elapsed time by the cycle number at that time, i.e. number of cycles run until then. Then I calculate the estimated completion time estimEndTime by adding the number of cycles still to go multiplied by the average cycle time to timeNow.
I think something goes wrong in the data conversion, as the estimEndTime calculation is incorrect: its prediction is far too soon. The average cycle time avgCycleTime comes out at around 30-50 seconds, which looks correct. The trial number of cycles is 20.
I get one warning about the conversion of the cycle number (int i) from int64 to long with possible loss of data, but since avgCycleTime seems OK, this does not seem to be the cause of the error.
Why doesn't this work?
Code essentials:
long avgCycleTime;
DateTime startTime = DateTime::Now;
f1->textBox9->Text = startTime.ToString("dd/MM/yy HH:mm:ss");
f1->textBox9->Update();
i = 0; // cycle counter
while (i < nCycl)
{
    // this is where the simulation occurs
    i++;
    DateTime timeNow = DateTime::Now;
    TimeSpan timeLapsed = timeNow.Subtract(startTime);
    avgCycleTime = (timeLapsed.Ticks / i);
    DateTime estimEndTime = timeNow.AddTicks(avgCycleTime * (nCycl - i));
    f1->textBox10->Text = Convert::ToString(avgCycleTime / 10000000); // cycle time in seconds (10,000,000 ticks per second)
    f1->textBox11->Text = estimEndTime.ToString("dd/MM/yy HH:mm:ss");
    f1->Refresh();
}
The problem is that you declared avgCycleTime as long, which in C++/CLI is effectively Int32.
Let's assume that one cycle takes 50 seconds. In ticks that would be 50 * 10,000,000 = 500,000,000, which still fits in an Int32. But then you calculate avgCycleTime * (nCycl - i), and this multiplication overflows (the result of multiplying two Int32 values is an Int32), so you get an invalid estimEndTime. You have to declare avgCycleTime as long long or Int64 instead.
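To see the size of the error concretely, here is a small Python sketch that simulates the signed 32-bit wraparound, using the answer's example figures (~50 s per cycle, 19 cycles remaining):

ticks_per_cycle = 500_000_000           # 50 s at 10,000,000 ticks/s
remaining_cycles = 19

true_product = ticks_per_cycle * remaining_cycles
wrapped = (true_product + 2**31) % 2**32 - 2**31  # signed 32-bit wraparound

print(true_product)  # 9500000000 ticks, ~950 s: the correct offset
print(wrapped)       # 910065408 ticks, ~91 s: the ETA lands far too soon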

Efficiently finding the distance between 2 lat/longs in SQL

I'm working with billions of rows of data, and each row has an associated start latitude/longitude, and end latitude/longitude. I need to calculate the distance between each start/end point - but it is taking an extremely long time.
I really need to make what I'm doing more efficient.
Currently I use a function (below) to calculate the hypotenuse between points. Is there some way to make this more efficient?
I should say that I have already tried casting the lat/longs as spatial geographies and using SQL's built-in STDistance() function (not indexed), but this was even slower.
Any help would be much appreciated. I'm hoping there is some way to speed up the function, even if it degrades accuracy a little (nearest 100m is probably ok).
Thanks in advance!
DECLARE @l_distance_m FLOAT
      , @l_long_start FLOAT
      , @l_long_end FLOAT
      , @l_lat_start FLOAT
      , @l_lat_end FLOAT
      , @l_x_diff FLOAT
      , @l_y_diff FLOAT
SET @l_lat_start = @lat_start
SET @l_long_start = @long_start
SET @l_lat_end = @lat_end
SET @l_long_end = @long_end
-- NOTE: 2 x PI() x (radius of earth in km) / 360 = 111
SET @l_y_diff = 111 * (@l_lat_end - @l_lat_start)
SET @l_x_diff = 111 * (@l_long_end - @l_long_start) * COS(RADIANS((@l_lat_end + @l_lat_start) / 2))
SET @l_distance_m = 1000 * SQRT(@l_x_diff * @l_x_diff + @l_y_diff * @l_y_diff)
RETURN @l_distance_m
I haven't done any SQL programming since around 1994; however, I'd make the following observations.

The formula that you're using works as long as the distances between your coordinates don't get too big. It will have big errors when working out the distance between, e.g., New York and Singapore, but for the distance between New York and Boston it should be fine to within 100 m.

I don't think there's any approximation formula that would be faster, but I can see some minor implementation improvements that might speed it up (a sketch follows this list):
1. Why do you bother to assign @l_lat_start from @lat_start? Can't you just use @lat_start directly (and the same for @long_start, @lat_end and @long_end)?
2. Instead of having 111 in the formulas for @l_y_diff and @l_x_diff, you could drop it there, saving a multiplication, and replace the 1000 in the formula for @l_distance_m with 111000.
3. Using COS(RADIANS(@l_lat_end)) or COS(RADIANS(@l_lat_start)) won't degrade the accuracy as long as the points aren't too far apart; if the points are all within the same city, you could just use the cosine of any point in the city.

Apart from that, I think you'd need to look at other ideas, such as creating a table with the results and, whenever points are added to or deleted from the table, updating the results table at that time.
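Pulling those suggestions together, here is a minimal Python sketch of the streamlined formula (the function name and the optional fixed reference latitude are my own; the 111,000 constant folds the answer's 111 km per degree and the 1000 m/km factor together):

from math import cos, radians, sqrt

M_PER_DEG = 111_000.0  # ~ 2*pi*(earth radius)/360, in metres per degree

def approx_distance_m(lat1, lon1, lat2, lon2, ref_lat=None):
    # Equirectangular approximation; fine for short distances
    if ref_lat is None:
        ref_lat = (lat1 + lat2) / 2  # or a fixed city latitude, per point (3)
    dy = M_PER_DEG * (lat2 - lat1)
    dx = M_PER_DEG * (lon2 - lon1) * cos(radians(ref_lat))
    return sqrt(dx * dx + dy * dy)

# Example: two points at the same latitude, one degree of longitude apart
print(round(approx_distance_m(40.7, -74.0, 40.7, -73.0)))  # roughly 84 km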