how can we decide or find upper and lower bound in binary search questions of competitive programming? - binary-search

I recently saw the topcoder binary search tutorial in which i understand where we can apply binary search but while solving questions or in topcoder tutorial i didn't understood that how we can decide our lower and upper bound values and on what basis.

The search space has a lower and upper limit. After each query we either raise the lower bound or decrease the upper bound.
Let's say you are trying to guess somebody's number, in the range 0 to 100.
If you guess 50, in the middle, in binary search, and the person says 50 is too high, then you can decrease your upper bound to 50 and the search space becomes 0 to 50 instead.
There must be an evaluation function to determine whether a guess is too "high" or "low", and if the result is too "high", the upper bound is lowered, while if it is too "low", the lower bound is increased.
It may not be as straightforward as a number range, it could be indexes of a sorted array, or numbers put into evaluation functions to determine the difference. The lower bound is the smallest possible guess which could still be correct while the upper bound is the highest possible guess which could still be correct.
Each time, we remove all numbers we know are wrong for sure from the search space.

Related

Binary Search: Number of comparisons in the worst case

I am trying to figure out the number of comparisons that binary search does on an array of a given size in the worst case.
Let's say there is an array A with 123,456 elements (or any other number). Binary search is applied to find some element E. The comparison is to determine whether A[i] = E. How many times would this comparison be executed in the worst case?
According to this post, the number of worst case comparisons is 2logn+1.
Result: 50
According to this post, the max. number of binary search comparisons is log2(n+1).
Result: 25
According to this post, the number of comparisons is 2logn-1.
Result: 50
I am confused by the different answers. Can anyone tell me which one is correct and how I can determine the maximum number of comparisons in the worst case?
According to this Wiki page:
In the worst case, binary search makes floor(log2(n)+1) iterations of the comparison loop, where the floor notation denotes the floor function that yields the greatest integer less than or equal to the argument, and log2 is the binary logarithm. This is because the worst case is reached when the search reaches the deepest level of the tree, and there are always floor(log2(n)+1) levels in the tree for any binary search.
Also, it's not enough to consider only comparisons A[i] = E. The binary search also includes comparisons E <= A[mid], where the mid is the midpoint of the index interval.

Is there a more efficient search factor than midpoint for binary search?

The naive binary search is a very efficient algorithm: you take the midpoint of your high and low points in a sorted array and adjust your high or low point accordingly. Then you recalculate your endpoint and iterate until you find your target value (or you don't, of course.)
Now, quite clearly, if you don't use the midpoint, you introduce some risk to the system. Let's say you shift your search target away from the midpoint and you create two sides - I'll call them a big side and small side. (It doesn't matter whether the shift is toward high or low, because it would be symmetrical.) The risk is that if you miss, your search space is bigger than it would be: you've got to search the big side which is bigger. But the reward is that if you hit your search space is smaller.
It occurs to me that the number of spaces being risked vs rewarded is the same, and (without patterns, which I'm assuming there are none) the likelihood of an element being higher and lower than the midpoint is equal. So the risk is that it falls between the new target and the midpoint.
Now because the number of spaces affects the search space, and the search space is measured logrithmically, it seems to me if I used, let's say 1/4 and 3/4 for our search spaces, I've cut the log of the small space in half, where the large space has only gone up in by about .6 or .7.
So with all this in mind: is there a more efficient way of performing a binary search than just using the midpoint?
Let's agree that the search key is equally likely to be at position in the array—otherwise, we'd want to design an algorithm based on our special knowledge of the location. So all we can choose is where to split the array each time. If we choose a number 0 < x < 1 and split the array there, the chance that it's on the left is x and the chance that it's on the right is 1-x. In the first case we shorten the array by a factor of x and in the second by a factor of 1-x. If we did this many times we'd have a product of many of these factors, and so the 'right' average to use here is the geometric mean. In that sense, the average decrease per step is x with weight x and 1-x with weight 1-x, for a total of x^x * (1-x)^(1-x).
So when is this minimized? If this were the math stackexchange, we'd take derivatives (with the product rule, chain rule, and exponent rule), set them to zero, and solve. But this is stackoverflow, so instead we graph it:
You can see that the further you get from 1/2, the worse you get. For a better understanding I recommend information theory or calculus which have interesting and complementary perspectives on this.

Spike removal algorithm

I have an array of values ranging from 30 to 300. I want to somehow make an weighted average, where, if I have 5 values and one is a lot bigger than the rest(spike), it won't influence the average that much as it would if I simply make a arithmetic average: eg: (n1+n2+n3+n4+n5)/5.
Does anyone has an idea how to make an simple algorithm that does just that, or where to look?
Sounds like you're looking to discard data that falls outside some parameter range you've specified. You could do it by computing the median/mode and ignoring values outside of this range when computing your mean. You'll have to adjust the divisor accordingly, of course, to account for the number of discarded values. What this "tolerable" range should be is ultimately up to you to decide, and will likely depend on your specific application needs.
Alternatively, you could try something like eliminating items r% out of range of your total average. Something like this (in javascript):
function RangedAverage(arr, r)
{
x = Average(arr);
//now eliminate items r% out of range
for(var i=0; i<arr.length; i++)
if(arr[i] < (x/r) || arr[i]>(x*(1+r)))
arr.splice(i,1);
x = Average(arr); //compute new average
return x;
}
You could try a median filter rather than a mean filter. It's often used in image processing to mitigate spurious pixel values (as opposed to white noise).
As you have noticed the mean is susceptible to skewing by spikes. perhaps median or mode may be a better statistic as they tend to be less skewed?
this should be a comment but js seems to be broken for me atm: its not quite clear whether you are after a single number that is characteristic of your array (i.e. an average) or a new array with the spikes removed (median filter)
in response to that then i'd suggest you first look at if median or mode is more appropriate as a statistic. if not then apply a median filter (very good at removing spikes) then average
A Kalman filter is often used in similar applications. I don't know if it qualifies as "simple," but it's robust and well known.
Lots of ways of doing this: You could implement a low-pass digital filter.
Or, if you're just concerned about removing outliers from a statistical summary, you could just remove the highest and lowest N% of your data values from the dataset before averaging.
"Robust statistics" is the search term that will get you into the literature. An advantage of a Kalman filter is that you have a running estimate of the variability of the data, and this allows you eventually to "discard observations that are more than x% likely to be spurious given the whole set of observations so far".

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/ek! (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or 2 and is this max-load formula right and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = ((e/k)**k)/e
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
So if my hash function is working well on my set of data then I should expect the maxmium number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this. For example you could use multiple hash functions, but slows down hashing and I don't know much about this).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was confused how to accurately store such a number in an SQL database. I thought to store the number with a negative sign. No element can have a negative composition of course, but i can interpret this as less than the specified value. Or option is to add another column for each element!! The latter option i really don't like.
Any other ideas? It's a small issue if you think about it, but i think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
The when the percentage of Titanium is less than <0.0004 for example, the number is still important, only the formula will differ slightly in this case.
Hence the value cannot be stored as NULL, and i don't know the lower bound for all values.
Tricky one.
Another possibility i thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. Strings in most languages after all use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string functions will return -1 to indicate what you're looking for isn't there.
0 might work because if the element isn't there there shouldn't even be a record. You also face the problem that it might be mistaken for actually meaning 0. -1 is another option. It doesn't have that same problem obviously.
Another column to indicate if the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (eg <1%, <0.1%, <0.01%, etc). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULL's are ignored by the aggregation functions, so queries like these:
SELECT SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY
metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than say 0.01, can be presented to you application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints of the values are well defined (cannot have negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of such sentinel values are sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
I would have a table modeling the certificate, in a one to many relation with another table, storing the values for elements. Then, I would still have the elements table containing the value in one column and a flag (less than) as a separate column.
Draft:
create table CERTIFICATES
(
PK_ID integer,
NAME varchar(128)
)
create table ELEMENTS
(
ELEMENT_ID varchar(2),
CERTIFICATE_ID integer,
CONCENTRATION number,
MEASURABLE integer
)
Depending on the database engine you're using, the types of the columns may vary.
Why not add another column to store whether or not its a trace amount
This will allow you to to save the amount that the trace is less than too
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.