binary search middle value calculation

The following is the pseudocode I got from a TopCoder tutorial about binary search
binary_search(A, target):
   lo = 1, hi = size(A)
   while lo <= hi:
      mid = lo + (hi-lo)/2
      if A[mid] == target:
         return mid
      else if A[mid] < target:
         lo = mid+1
      else:
         hi = mid-1
   // target was not found
Why do we calculate the middle value as mid = lo + (hi - lo) / 2? What's wrong with (hi + lo) / 2?
I have a slight idea that it might be to prevent overflow, but I'm not sure. Perhaps someone can explain it to me, and whether there are other reasons behind this.

Although this question is 5 years old, there is a great article on googleblog which explains the problem and the solution in detail, and it is worth sharing.
It's worth mentioning that the current implementation of binary search in Java does not use the mid = lo + (hi - lo) / 2 calculation; instead it uses a faster and clearer alternative built on the zero-fill (unsigned) right-shift operator:
int mid = (low + high) >>> 1;
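To see why that works, here is a minimal Java sketch (the index values are made up for illustration): even when low + high wraps around to a negative int, >>> 1 reinterprets the wrapped bit pattern as an unsigned 32-bit value, so halving it still yields the correct midpoint.

public class MidpointDemo {
    public static void main(String[] args) {
        int low = 0x7FFFFFF0;              // both indices near Integer.MAX_VALUE
        int high = Integer.MAX_VALUE - 1;

        int naive = (low + high) / 2;      // sum wraps around: negative garbage
        int shifted = (low + high) >>> 1;  // wrapped bits read as unsigned: correct
        int safe = low + (high - low) / 2; // never overflows: correct

        System.out.println(naive);   // -9
        System.out.println(shifted); // 2147483639
        System.out.println(safe);    // 2147483639
    }
}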

Yes, (hi + lo) / 2 may overflow. This was an actual bug in the Java binary search implementation.
No, there are no other reasons for this.

From later on in the same tutorial:
"You may also wonder as to why mid is calculated using mid = lo + (hi-lo)/2 instead of the usual mid = (lo+hi)/2. This is to avoid another potential rounding bug: in the first case, we want the division to always round down, towards the lower bound. But division truncates, so when lo+hi would be negative, it would start rounding towards the higher bound. Coding the calculation this way ensures that the number divided is always positive and hence always rounds as we want it to. Although the bug doesn't surface when the search space consists only of positive integers or real numbers, I've decided to code it this way throughout the article for consistency."

It is indeed possible for (hi + lo) to overflow an int. In the improved version it may seem pointless to subtract lo from hi and then add it back, but there is a reason: performing this operation cannot overflow, and hi - lo has the same parity as hi + lo, so (hi - lo) / 2 loses exactly the same remainder as (hi + lo) / 2 would. lo can then be safely added after the division to reach the same result.
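For example, with lo = 5 and hi = 8: hi + lo = 13 and hi - lo = 3 have the same parity, and lo + (hi - lo) / 2 = 5 + 3 / 2 = 5 + 1 = 6 matches (hi + lo) / 2 = 13 / 2 = 6 exactly under truncating integer division.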

Let us assume that the array we're searching in is of length INT_MAX.
Hence initially:
high = INT_MAX
low = 0
In the first iteration, mid is about INT_MAX / 2. We notice that the target element is greater than the middle element, so we shift the start index up:
low = mid + 1
In the next iteration, mid is again calculated as (high + low) / 2, which here translates to
(INT_MAX + (INT_MAX / 2 + 1)) / 2
The first part of this operation, (high + low), leads to an overflow, since the sum goes beyond the maximum int value, INT_MAX.

Because an unsigned right-shift operator (like Java's >>>) is not present in Go, to avoid integer overflow while calculating the middle value in Go we can write it like this:
mid := int(uint(lo+hi) >> 1)

The "why" question is answered above, but it is not easy to understand why the solution works.
So let's assume high is 10 and low is 5, and assume 10 is the highest value the integer type can hold (10 + 1 would cause overflow).
Instead of doing (10 + 5)/2 = 7 (because 10 + anything would overflow),
we do 5 + (10 - 5)/2 = 5 + 2 = 7, and no intermediate value ever exceeds 10.

Kotlin: Why do these two implementations of log base 10 give different results on specific inputs?

println(log(it.toDouble(), 10.0).toInt()+1) // n1
println(log10(it.toDouble()).toInt() + 1) // n2
I had to count the "length" of a number in base n for needs unrelated to the question, and stumbled upon a bug (or rather unexpected behavior): for it == 1000 these two functions give different results.
n1(1000) = 3,
n2(1000) = 4.
Checking values before conversion to int resulted in:
n1_double(1000) = 3.9999999999999996,
n2_double(1000) = 4.0
I understand that some floating-point arithmetic magic is involved, but what is especially weird to me is that for 100, 10000, and other inputs that I checked, n1 == n2.
What is special about it == 1000? How do I ensure that log gives me the intended result (4, not 3.99...)? Right now I can't even figure out which cases I need to double-check, since it is not just powers of 10; it is 1000 (and probably some other numbers) specifically.
I looked into the implementations of log() and log10(), and log is implemented as
if (base <= 0.0 || base == 1.0) return Double.NaN
return nativeMath.log(x) / nativeMath.log(base) //log() here is a natural logarithm
while log10 is implemented as
return nativeMath.log10(x)
I suspect this division in the first case is the reason of an error, but I can't figure out why it causes an error only in specific cases.
I also found this question:
Python math.log and math.log10 giving different results
But I already know that one is more precise than the other. However, there is no log10 analogue for an arbitrary base n, so I'm curious about the reason WHY it is specifically 1000 that goes wrong.
PS: I understand there are methods of calculating the length of a number without fp arithmetic and base-n logs, but at this point it is scientific curiosity.
but I can't figure out why it causes an error only in specific cases.
return nativeMath.log(x) / nativeMath.log(base)
//log() here is a natural logarithm
Consider x = 1000 and nativeMath.log(x). The natural logarithm is not exactly representable. It is near
6.90775527898213_681... (Double answer)
6.90775527898213_705... (closer answer)
Consider base = 10 and nativeMath.log(base). The natural logarithm is not exactly representable. It is near
2.302585092994045_901... (Double)
2.302585092994045_684... (closer answer)
The only exactly correct nativeMath.log(x) for a finite x is when x == 1.0.
The quotient of the division of 6.90775527898213681... / 2.302585092994045901... is not exactly representable. It is near 2.9999999999999995559...
The conversion of the quotient to text is not exact.
So we have 4 computation errors with the system giving us a close (rounded) result instead at each step.
Sometimes these rounding errors cancel out in a way we find acceptable and the value of "3.0" is reported. Sometimes not.
Performed with higher-precision math, it is easy to see that log(1000) rounded below the higher-precision answer while log(10) rounded above it. These two round-off errors in opposite directions compounded through the division, leaving the quotient 1 ULP lower than hoped.
When log(x, 10) is computed for another x that is a power of 10, and log(x) happens to round slightly above the higher-precision answer, I'd expect the quotient to suffer a 1 ULP error less often. Perhaps it is roughly 50/50 across all powers of 10.
log10(x) is designed to compute the logarithm in a different fashion, exploiting that the base is 10.0 and certainly exact for powers-of-10.
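On the JVM, Kotlin's nativeMath delegates to java.lang.Math (an assumption worth checking on other Kotlin targets), so the effect can be reproduced in plain Java; this minimal sketch shows the 1 ULP discrepancy described above.

public class LogDemo {
    public static void main(String[] args) {
        // Quotient of two rounded natural logs: lands 1 ULP below 3.0.
        double viaDivision = Math.log(1000.0) / Math.log(10.0);
        // log10 exploits the base directly and is exact for powers of ten.
        double direct = Math.log10(1000.0);

        System.out.println(viaDivision);           // 2.9999999999999996
        System.out.println(direct);                // 3.0
        System.out.println((int) viaDivision + 1); // 3  (the surprising n1)
        System.out.println((int) direct + 1);      // 4  (the expected n2)
    }
}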

Why do we use mid = low + (high - low)/2 but not mid = (low/2) + (high/2)?

In binary search we use mid = low + (high - low)/2 instead of (low + high)/2 to avoid overflow. However, can't we calculate low/2 and high/2 separately and then sum them up, rather than low + ((high - low)/2)?
P.S. If low + (high - low)/2 is more efficient, then why is it so?
Let's say both low and high are 3; then middle = 3/2 + 3/2 = 1 + 1 = 2 (each truncating division throws away its half), which actually is quite bad. :-)
The reason we don't use middle = high/2 + low/2 is that it gives us the wrong result.
Can't we calculate low/2 and high/2 separately and then sum them up rather than using low+((high-low)/2)?
Sure.
If low+(high-low)/2 is more efficient, then why is it so?
On a lot of hardware, dividing is slower than adding and subtracting, so dividing twice might be slower than the method that divides once and uses an extra addition and subtraction.
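A quick Java check of both answers' points (values made up): splitting the division loses a remainder from each half, while the subtract-first form divides only once.

public class MidVariants {
    public static void main(String[] args) {
        int low = 3, high = 3;

        // Truncating division drops each half's remainder separately.
        int separate = low / 2 + high / 2;          // 1 + 1 = 2 (wrong)
        int subtractFirst = low + (high - low) / 2; // 3 + 0 = 3 (right)

        System.out.println(separate);      // 2
        System.out.println(subtractFirst); // 3
    }
}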

Precise Multiplication

First post!
I'm writing a program for a numerical simulation, and I have a problem with the multiplication. Basically, I am trying to calculate:
result1 = (a + b)*c
and this loops thousands of times. I need to expand this code to be
result2 = a*c + b*c
However, when I do that I start to get significant errors in my results. I used a high-precision library, which did improve things, but the simulation ran horribly slowly (it took 50 times longer), so that really isn't a practical solution. From this I realised that it isn't really the precision of the variables a, b, and c that is hurting me, but something in the way the multiplication is done.
My question is: how can I multiply out these brackets in way so that result1 = result2?
Thanks.
SOLVED!!!!!!!!!
It was a problem with the addition. So I reordered the terms and applied Kahan addition by writing the following piece of code:
double Modelsimple::sum(double a, double b, double c, double d) {
    // Reorder the variables from smallest to greatest with a min/max network.
    double tempone = (a < b ? a : b);
    double temptwo = (c < d ? c : d);
    double tempthree = (a > b ? a : b);
    double tempfour = (c > d ? c : d);
    double one = (tempone < temptwo ? tempone : temptwo);
    double four = (tempthree > tempfour ? tempthree : tempfour);
    double tempfive = (tempone > temptwo ? tempone : temptwo);
    double tempsix = (tempthree < tempfour ? tempthree : tempfour);
    double two = (tempfive < tempsix ? tempfive : tempsix);
    double three = (tempfive > tempsix ? tempfive : tempsix);

    // Kahan addition: carry each step's rounding error into the next term.
    double total = one;
    double tempsum = one + two;
    double error = (tempsum - one) - two;
    total = tempsum;
    // first iteration complete
    double tempadd = three - error;
    tempsum = total + tempadd;
    error = (tempsum - total) - tempadd;
    total = tempsum;
    // second iteration complete
    tempadd = four - error;
    total += tempadd;
    return total;
}
This gives me results that are as close to the precise answer as makes no difference. However, in a fictitious simulation of a mine collapse, the code with the Kahan addition takes 2 minutes, whereas the high-precision library takes over a day to finish!!
Thanks for all the help here. This problem was really a pain in the a$$.
I am presuming your numbers are all floating point values.
You should not expect result1 to equal result2 due to limitations in the scale of the numbers and precision in the calculations. Which one to use will depend upon the numbers you are dealing with. More important than result1 and result2 being the same is that they are close enough to the real answer (eg that you would have calculated by hand) for your application.
Imagine that a and b are both very large, and c much less than 1. (a + b) might overflow so that result1 will be incorrect. result2 would not overflow because it scales everything down before adding.
There are also problems with loss of precision when combining numbers of widely differing size, as the smaller number has significant digits reduced when it is converted to use the same exponent as the larger number it is added to.
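A tiny sketch of that absorption effect (my example, not from the original post): a double has a 53-bit significand, so adding 1.0 to 1e16 is lost completely in the rounding.

public class AbsorptionDemo {
    public static void main(String[] args) {
        double big = 1.0e16;
        // 1.0 is below the spacing between adjacent doubles at this
        // magnitude (2.0), so the sum rounds straight back to big.
        double sum = big + 1.0;
        System.out.println(sum == big); // true: the 1.0 was absorbed
    }
}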
If you give some specific examples of a, b and c which are causing you issues it might be possible to suggest further improvements.
I have been using the following program as a test, using values for a and b between 10^5 and 10^10, and c around 10^-5, but so far cannot find any differences.
Thinking about the storage of 10^5 vs 10^10: they need about 17 bits vs 34 bits, so you may lose about 17 bits of precision when you add a and b together in result1.
But multiplying them by the same value c essentially adjusts the exponent and leaves the significand the same, so result2 should also lose about 17 bits of precision.
A double significand usually stores 53 bits, so I suspect your results will still retain about 36 bits, or roughly 10 to 11 decimal digits of precision.
#include <stdio.h>

int main(void)
{
    double a = 13584.9484893449;
    double b = 43719848748.3911;
    double c = 0.00001483394434;
    double result1 = (a + b) * c;
    double result2 = a * c + b * c;
    double diff = result1 - result2;
    printf("size of double is %zu\n", sizeof(double)); /* %zu: sizeof yields size_t */
    printf("a=%f\nb=%f\nc=%f\nr1=%f\nr2=%f\ndiff=%f\n", a, b, c, result1, result2, diff);
    return 0;
}
However I do find a difference if I change all the doubles to float and use c=0.00001083394434. Are you sure that you are using 64 (or 80) bit doubles when doing your calculations?
Usually "loss of precision" in these kinds of calculations can be traced to "poorly formulated problem". For example, when you have to add a series of numbers of very different sizes, you will get a different answer depending on the order in which you sum them. The problem is even more acute when you subtract numbers.
The best approach in your case above is to look not simply at this one line, but at the way that result1 is used in your subsequent calculations. In principle, an engineering calculation should not require precision in the final result beyond about three significant figures; but in many instances (for example, finite element methods) you end up subtracting two numbers that are very similar in magnitude - in which case you may lose many significant figures and get a meaningless answer. Given that you are talking about "materials properties" and "strain", I am suspecting that is actually at the heart of your problem.
One approach is to look at places where you compute a difference and see if you can reformulate the problem (for example, if you can differentiate your function, you can replace Y(x + dx) - Y(x) with dx * Y'(x)).
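A small sketch of that reformulation (example values are made up): the direct difference subtracts two nearly equal sines and loses most of its digits to cancellation, while the derivative form avoids the subtraction entirely.

public class CancellationDemo {
    public static void main(String[] args) {
        double x = 1.0;
        double dx = 1e-12;

        // Two nearly equal values: the leading digits cancel, leaving
        // only a few noisy digits of the true difference.
        double direct = Math.sin(x + dx) - Math.sin(x);

        // Reformulated via the derivative: no subtraction of close values.
        double viaDerivative = dx * Math.cos(x);

        System.out.println(direct);        // only a few correct digits
        System.out.println(viaDerivative); // ~5.403023058681398E-13
    }
}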
There are many excellent references on the subject of numerical stability. It is a complicated subject. Just "throwing more significant figures at the problem" is almost never the best solution.

How can I calculate pi (π) in VB?

Does anyone know how I can calculate pi (π) in VB?
System.Math.PI
Assuming you actually want to compute pi instead of just using the built-in constant, there are a bunch of ways you can do it. Here are a few links that could be useful:
http://www.codeproject.com/KB/recipes/CRHpi.aspx
http://en.wikipedia.org/wiki/Pi#Computation_in_the_computer_age
http://en.wikipedia.org/wiki/Machin-like_formula
If you mean VB6, it doesn't have a pi constant. You can use:
Dim pi as Double
pi = 4 * Atn(1)
If the OP is asking about algorithms as a learning experience, good for him/her.
If the OP wanted help finding the built-in value, s/he has it now.
But if the goal is a good value of higher precision than the built-in value with a minimum of effort, here's pi to one million digits:
http://www.eveandersson.com/pi/digits/1000000
That should be enough.
I hope the OP isn't asking how to recalculate the value of Pi each and every time it's used. That would be madness.
Meh, such efficient, accurate and most of all boring approximations... Try this instead! Pseudocode ensues:
initialize inside and total as 0
repeat an insane amount of times:
   assign both x and y random values between (and including) 0 and +1
   assign distance as the square root of (x^2 + y^2)
   if distance ≤ 1, add 1 to inside
   add 1 to total
assign pi as inside / total * 4
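Here is that Monte Carlo idea as a runnable Java sketch (class name and iteration count are mine); note the square root isn't actually needed, since distance ≤ 1 exactly when x^2 + y^2 ≤ 1.

import java.util.Random;

public class MonteCarloPi {
    public static void main(String[] args) {
        Random rng = new Random();
        long inside = 0;
        long total = 10_000_000; // "an insane amount of times"

        for (long i = 0; i < total; i++) {
            double x = rng.nextDouble(); // uniform in [0, 1)
            double y = rng.nextDouble();
            // Count points falling inside the unit quarter circle.
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }

        // The quarter circle covers pi/4 of the unit square.
        double pi = 4.0 * inside / total;
        System.out.println(pi); // roughly 3.14; converges very slowly
    }
}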
If you don't want to use the built-in values in the .NET math library...
22 / 7

Determining longest repeating cycle in a decimal expansion

Today I encountered this article about decimal expansion, and I was instantly inspired to rework my solution to Project Euler Problem 26 to include this new knowledge of math for a more efficient solution (no brute forcing). In short, the problem is to find the value of d, ranging 1-1000, that maximizes the length of the repeating cycle in the expression 1/d.
Without making any further assumptions about the problem that could improve the efficiency of solving it, I decided to stick with
10^s ≡ 10^(s+t) (mod n)
which allows me, for any value of D, to find the longest repeating cycle (t) and the starting point of the cycle (s).
The problem is the exponential part of the equation, since it generates extremely large values before they're reduced by the modulus. No integral type can handle such large values, and the floating-point data types seem to calculate them incorrectly.
I'm using this code currently:
Private Function solveDiscreteLogarithm(ByVal D As Integer) As Integer
    Dim NumberToIndex As New Dictionary(Of Long, Long)()
    Dim maxCheck As Integer = 1000
    For index As Integer = 1 To maxCheck
        If (Not NumberToIndex.ContainsKey((10 ^ index) Mod D)) Then
            NumberToIndex.Add((10 ^ index) Mod D, index)
        Else
            Return index - NumberToIndex((10 ^ index) Mod D)
        End If
    Next
    Return -1
End Function
which at some point will compute "(10^47) mod 983" resulting in 783 which is not the correct result. The correct result should have been 732. I'm assuming it's because I'm using integral data types and it's causing overflow. I tried using double instead, but that gave even stranger results.
So what are my options?
Instead of using ^ to compute your powers, I would use a loop that multiplies by 10 each time and takes the result Mod D as you go along (equivalently, a conditional that reduces the running value whenever it exceeds the modulus). This keeps the numbers small and well within the range of your integer type.
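In other words, modular exponentiation. A minimal Java sketch of the idea (class and method names are mine): reducing mod d after every multiplication keeps the running value below 10 * d, so a long never overflows.

public class PowMod {
    // (10^exponent) mod d, without ever forming the huge power.
    static long powerOfTenMod(int exponent, long d) {
        long result = 1;
        for (int i = 0; i < exponent; i++) {
            result = (result * 10) % d; // stays below 10 * d
        }
        return result;
    }

    public static void main(String[] args) {
        // The failing case from the question: 10^47 mod 983.
        System.out.println(powerOfTenMod(47, 983)); // 732
    }
}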
I'll give you a hint from my own solution to this.
With each step of the decimal expansion of the fraction you end up with a remainder which, multiplied by the current decimal place, is an integer. Since this remainder is all you need to determine the next part of the expansion, you can use it to make predictions about the subsequent expansion.
See my post on this other question, getting the nth digit of a fraction; you may find some useful leads on what to try. (Methinks the answer is the largest prime less than 1000.) (Correction: the largest prime or Carmichael number less than 1000.)
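That hint turned into code: a minimal Java sketch (names are mine) that walks the long division of 1/d and records where each remainder first appears; the first repeated remainder closes exactly one full cycle.

import java.util.HashMap;
import java.util.Map;

public class CycleLength {
    // Length of the repeating cycle of 1/d via long division: each step
    // depends only on the current remainder, so the first repeated
    // remainder marks one complete cycle.
    static int cycleLength(int d) {
        Map<Integer, Integer> firstSeenAt = new HashMap<>();
        int remainder = 1 % d;
        int position = 0;
        while (remainder != 0 && !firstSeenAt.containsKey(remainder)) {
            firstSeenAt.put(remainder, position);
            remainder = (remainder * 10) % d;
            position++;
        }
        // remainder == 0 means the expansion terminates (no cycle at all).
        return remainder == 0 ? 0 : position - firstSeenAt.get(remainder);
    }

    public static void main(String[] args) {
        int bestD = 0, bestLen = 0;
        for (int d = 2; d < 1000; d++) {
            int len = cycleLength(d);
            if (len > bestLen) {
                bestLen = len;
                bestD = d;
            }
        }
        System.out.println("1/" + bestD + " repeats with cycle length " + bestLen);
    }
}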