How can I test that my hash function is good in terms of max-load? - testing

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/ek! (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or 2 and is this max-load formula right and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.

That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.

Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.

After some more research and trial-and-error I think I can provide something part way to to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = ((e/k)**k)/e
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
So if my hash function is working well on my set of data then I should expect the maxmium number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this. For example you could use multiple hash functions, but slows down hashing and I don't know much about this).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.

Related

is SHA-512 collision resistant?

According to the books that i have read, it says that S.H.A(Secure Hash Algorithm) is collision resistant.But if the input space is a 1024 bit number and the output space is a 512 bit message digest then shouldn't it be colliding for
(2^1024)/(2^512) times? As the range is lesser than the domain being mapped there should have been collisions. please explain where i am going wrong.
The chance for a collision does not depend on the input size. The chance to a 512-bit hash collision is 1.4×10^77, see Probability table
Maybe your book has also mentioned the definition of collision resistance? It does not mean that no collisions are created (which is clearly not the case), but that given a hash you are not able to create a message easily that produces this hash.
a hash function H is collision resistant if it is hard to find two
inputs that hash to the same output; that is, two inputs a and b such
that H(a) = H(b), and a ≠ b
From Wikipedia
As you describe: Since the input space (arbitrary size) is larger than the output space (e.g. 512bit for sha512), there always exist collisions.
"Collision resistant" means, it is adequately unlikely for a collision to be found.
Your confusion is answered when considering how large the output space "512 bits" really is:
2^512 (the number of possible configurations of a 512 bit array) is of the order 10^154.
For comparison: The number of atoms in the visible universe is somewhere in the range of 10^80.
A million is 10^6.
So a million of our 'visible universes' has 10^86 atoms.
A million times a million universes has 10^92 atoms.
If you could store a single 512 bit value on a single atom, how many universes would you need to have all possible 512 bit has values stored?
Starting with a specific 512bit number (and assuming the has function is not broken), the probability p to obtain a collision is assuming you can produce new hashes with a rate R and have the total time of t to do this is:
p = R*t/(2^(512/2))
(The exponent is halved, see "birthday attach". The expected search space for a success is to find a collision in n bits is n/2.)
Let's plugin in some example numbers:
The has rate of the bitcoin network is currently about R = 200*10^15 / s (200 million terrahashes per second).
Consider the situation that since the beginning of the universe the bitcoin network's current hashing capacity would have been available for the sole purpose of finding a collision for a specific hash value, i.e. for an available time of t=13.787*10^9 years,
then the probability that a collision would have been found by now is about 7 × 10^-41 %
Again, it is hard to appreciate how small this number is.
Edit: A similar question with a good answer is found here: https://crypto.stackexchange.com/questions/89558/are-sha-256-and-sha-512-collision-resistant

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in n log n:
First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is n log n)
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the highest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the the last n numbers. To do each computation takes only constant time if you keep track of s1 and s2 running totals: for example, to get std dev of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes linear time.
So total time complexity n log n.
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having been working recently on a similar problem, both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

How many iterations of Rabin-Miller should I use for cryptographic safe primes?

I am generating a 2048-bit safe prime for a Diffie-Hellman-type key, p such that p and (p-1)/2 are both prime.
How few iterations of Rabin-Miller can I use on both p and (p-1)/2 and still be confident of a cryptographically strong key? In the research I've done I've heard everything from 6 to 64 iterations for 1024-bit ordinary primes, so I'm a little confused at this point. And once that's established, does the number change if you are generating a safe prime rather than an ordinary one?
Computation time is at a premium, so this is a practical question - I'm basically wondering how to find out the lowest possible number of tests I can get away with while at the same time maintaining pretty much guaranteed security.
Let's assume that you select a prime p by selecting random values until you hit one for which Miller-Rabin says: that one looks like a prime. You use n rounds at most for the Miller-Rabin test. (For a so-called "safe prime", things are are not changed, except that you run two nested tests.)
The probability that a random 1024-bit integer is prime is about 1/900. Now, you do not want to do anything stupid so you generate only odd values (an even 1024-bit integer is guaranteed non-prime), and, more generally, you run the Miller-Rabin test only if the value is not "obviously" non-prime, i.e. can be divided by a small prime. So you end up with trying about 300 values with Miller-Rabin before hitting a prime (on average). When the value is non-prime, Miller-Rabin will detect it with probability 3/4 at each round, so the number of Miller-Rabin rounds you will run on average for a single non-prime value is 1+(1/4)+(1/16)+... = 4/3. For the 300 values, this means about 400 rounds of Miller-Rabin, regardless of what you choose for n.
So if you select n to be, e.g., 40, then the cost implied by n is less than 10% of the total computational cost. The random prime selection process is dominated by the test on non-primes, which are not impacted by the value of n you choose. I talked here about 1024-bit integers; for bigger numbers the choice of n is even less important since primes become sparser as size increases (for 2048-bit integers, the "10%" above become "5%").
Hence you can choose n=40 and be happy with it (or at least know that reducing n will not buy you much anyway). On the other hand, using a n greater than 40 is meaningless, because this would get you to probabilities lower than the risk of a simple miscomputation. Computers are hardware, they can have random failures. For instance, a primality test function could return "true" for a non-prime value because a cosmic ray (a high-energy particle hurtling through the Universe at high speed) happens to hit just the right transistor at the right time, flipping the return value from 0 ("false") to 1 ("true"). This is very unlikely -- but no less likely than probability 2-80. See this stackoverflow answer for a few more details. The bottom line is that regardless of how you make sure that an integer is prime, you still have an unavoidable probabilistic element, and 40 rounds of Miller-Rabin already give you the best that you can hope for.
To sum up, use 40 rounds.
The paper Average case error estimates for the strong probable prime test by Damgard-Landrock-Pomerance points out that, if you randomly select k-bit odd number n and apply t independent Rabin-Miller tests in succession, the probability that n is a composite has much stronger bounds.
In fact for 3 <= t <= k/9 and k >= 21,
For a k=1024 bit prime, t=6 iterations give you an error rate less than 10^(-40).
Each iteration of Rabin-Miller reduces the odds that the number is composite by a factor of 1/4.
So after 64 iterations, there is only 1 chance in 2^128 that the number is composite.
Assuming you are using these for a public key algorithm (e.g. RSA), and assuming you are combining that with a symmetric algorithm using (say) 128-bit keys, an adversary can guess your key with that probability.
The bottom line is to choose the number of iterations to put that probability within the ballpark of the other sizes you are choosing for your algorithm.
[update, to elaborate]
The answer depends entirely on what algorithms you are going to use the numbers for, and what the best known attacks are against those algorithms.
For example, according to Wikipedia:
As of 2003 RSA Security claims that 1024-bit RSA keys are equivalent in strength to 80-bit symmetric keys, 2048-bit RSA keys to 112-bit symmetric keys and 3072-bit RSA keys to 128-bit symmetric keys.
So, if you are planning to use these primes to generate (say) a 1024-bit RSA key, then there is no reason to run more than 40 iterations or so of Rabin-Miller. Why? Because by the time you hit a failure, an attacker could crack one of your keys anyway.
Of course, there is no reason not to perform more iterations, time permitting. There just isn't much point to doing so.
On the other hand, if you are generating 2048-bit RSA keys, then 56 (or so) iterations of Rabin-Miller is more appropriate.
Cryptography is typically built as a composition of primitives, like prime generation, RSA, SHA-2, and AES. If you want to make one of those primitives 2^900 times stronger than the others, you can, but it is a little like putting a 10-foot-steel vault door on a log cabin.
There is no fixed answer to your question. It depends on the strength of the other pieces going into your cryptographic system.
All that said, 2^-128 is a ludicrously tiny probability, so I would probably just use 64 iterations :-).
From the libgcrypt source:
/* We use 64 Rabin-Miller rounds which is better and thus
sufficient. We do not have a Lucas test implementaion thus we
can't do it in the X9.31 preferred way of running a few
Rabin-Miller followed by one Lucas test. */
cipher/primegen.c line# 1295
I would run two or three iterations of Miller-Rabin (i.e., strong Fermat probable prime) tests, making sure that one of the bases is 2.
Then I would run a strong Lucas probable prime test, choosing D, P, and Q with the method described here:
https://en.wikipedia.org/wiki/Baillie%E2%80%93PSW_primality_test
There are no known composites that pass this combination of Fermat and Lucas tests.
This is much faster than doing 40 Rabin-Miller iterations. In addition, as was pointed out by Pomerance, Selfridge, and Wagstaff in https://math.dartmouth.edu/~carlp/PDF/paper25.pdf, there are diminishing returns with multiple Fermat tests: if N is a pseudoprime to one base, then it is more likely than the average number to be a pseudoprime to other bases. That's why, for example, we see so many psp's base 2 are also psp's base 3.
A smaller probability is usually better, but I would take the actual probability value with a grain of salt. Albrecht et al Prime and Prejudice: Primality Testing Under Adversarial Conditions break a number of prime-testing routines in cryptographic libraries. In one example, the published probability is 1/2^80, but the number they construct is declared prime 1 time out of 16.
In several other examples, their number passes 100% of the time.
Only 2 iterations, assuming 2^-80 as a negligibly probability.
From (Alfred. J. Menezes. et al. 1996) §4.4 p.148:
Does it matter? Why not run for 1000 iterations? When searching for primes, you stop applying the Rabin-Miller test anyway the first time it fails, so for the time it takes to find a prime it doesn't really matter what the upper bound on the number of iterations is. You could even run a deterministic primality checking algorithm after those 1000 iterations to be completely sure.
That said, the probability that a number is prime after n iterations is 4^-n.

How does rand() work? Does it have certain tendencies? Is there something better to use?

I have read that it has something to do with time, also you get from including time.h, so I assumed that much, but how does it work exactly? Also, does it have any tendencies towards odd or even numbers or something like that? And finally is there something with better distribution in the C standard library or the Foundation framework?
Briefly:
You use time.h to get a seed, which is an initial random number. C then does a bunch of operations on this number to get the next random number, then operations on that one to get the next, then... you get the picture.
rand() is able to touch on every possible integer. It will not prefer even or odd numbers regardless of the input seed, happily. Still, it has limits - it repeats itself relatively quickly, and in almost every implementation only gives numbers up to 32767.
C does not have another built-in random number generator. If you need a real tough one, there are many packages available online, but the Mersenne Twister algorithm is probably the most popular pick.
Now, if you are interested on the reasons why the above is true, here are the gory details on how rand() works:
rand() is what's called a "linear congruential generator." This means that it employs an equation of the form:
xn+1 = (*a****xn + ***b*) mod m
where xn is the nth random number, and a and b are some predetermined integers. The arithmetic is performed modulo m, with m usually 232 depending on the machine, so that only the lowest 32 bits are kept in the calculation of xn+1.
In English, then, the idea is this: To get the next random number, multiply the last random number by something, add a number to it, and then take the last few digits.
A few limitations are quickly apparent:
First, you need a starting random number. This is the "seed" of your random number generator, and this is where you've heard of time.h being used. Since we want a really random number, it is common practice to ask the system what time it is (in integer form) and use this as the first "random number." Also, this explains why using the same seed twice will always give exactly the same sequence of random numbers. This sounds bad, but is actually useful, since debugging is a lot easier when you control the inputs to your program
Second, a and b have to be chosen very, very carefully or you'll get some disastrous results. Fortunately, the equation for a linear congruential generator is simple enough that the math has been worked out in some detail. It turns out that choosing an a which satisfies *a***mod8 = 5 together with ***b* = 1 will insure that all m integers are equally likely, independent of choice of seed. You also want a value of a that is really big, so that every time you multiply it by xn you trigger a the modulo and chop off a lot of digits, or else many numbers in a row will just be multiples of each other. As a result, two common values of a (for example) are 1566083941 and 1812433253 according to Knuth. The GNU C library happens to use a=1103515245 and b=12345. A list of values for lots of implementations is available at the wikipedia page for LCGs.
Third, the linear congruential generator will actually repeat itself because of that modulo. This gets to be some pretty heady math, but the result of it all is happily very simple: The sequence will repeat itself after m numbers of have been generated. In most cases, this means that your random number generator will repeat every 232 cycles. That sounds like a lot, but it really isn't for many applications. If you are doing serious numerical work with Monte Carlo simulations, this number is hopelessly inadequate.
A fourth much less obvious problem is that the numbers are actually not really random. They have a funny sort of correlation. If you take three consecutive integers, (x, y, z), from an LCG with some value of a and m, those three points will always fall on the lattice of points generated by all linear combinations of the three points (1, a, a2), (0, m, 0), (0, 0, m). This is known as Marsaglia's Theorem, and if you don't understand it, that's okay. All it means is this: Triplets of random numbers from an LCG will show correlations at some deep, deep level. Usually it's too deep for you or I to notice, but its there. It's possible to even reconstruct the first number in a "random" sequence of three numbers if you are given the second and third! This is not good for cryptography at all.
The good part is that LCGs like rand() are very, very low footprint. It typically requires only 32 bits to retain state, which is really nice. It's also very fast, requiring very few operations. These make it good for noncritical embedded systems, video games, casual applications, stuff like that.
PRNGs are a fascinating topic. Wikipedia is always a good place to go if you are hungry to learn more on the history or the various implementations that are around today.
rand returns numbers generated by a pseudo-random number generator (PRNG). The sequence of numbers it returns is deterministic, based on the value with which the PRNG was initialized (by calling srand).
The numbers should be distributed such that they appear somewhat random, so, for example, odd and even numbers should be returned at roughly the same frequency. The actual implementation of the random number generator is left unspecified, so the actual behavior is specific to the implementation.
The important thing to remember is that rand does not return random numbers; it returns pseudo-random numbers, and the values it returns are determined by the seed value and the number of times rand has been called. This behavior is fine for many use cases, but is not appropriate for others (for example, rand would not be appropriate for use in many cryptographic applications).
How does rand() work?
http://en.wikipedia.org/wiki/Pseudorandom_number_generator
I have read that it has something to
do with time, also you get from
including time.h
rand() has nothing at all to do with the time. However, it's very common to use time() to obtain the "seed" for the PRNG so that you get different "random" numbers each time your program is run.
Also, does it have any tendencies
towards odd or even numbers or
something like that?
Depends on the exact method used. There's one popular implementation of rand() that alternates between odd and even numbers. So avoid writing code like rand() % 2 that depends on the lowest bit being random.

What is Big O notation? Do you use it? [duplicate]

This question already has answers here:
What is a plain English explanation of "Big O" notation?
(43 answers)
Closed 9 years ago.
What is Big O notation? Do you use it?
I missed this university class I guess :D
Does anyone use it and give some real life examples of where they used it?
See also:
Big-O for Eight Year Olds?
Big O, how do you calculate/approximate it?
Did you apply computational complexity theory in real life?
One important thing most people forget when talking about Big-O, thus I feel the need to mention that:
You cannot use Big-O to compare the speed of two algorithms. Big-O only says how much slower an algorithm will get (approximately) if you double the number of items processed, or how much faster it will get if you cut the number in half.
However, if you have two entirely different algorithms and one (A) is O(n^2) and the other one (B) is O(log n), it is not said that A is slower than B. Actually, with 100 items, A might be ten times faster than B. It only says that with 200 items, A will grow slower by the factor n^2 and B will grow slower by the factor log n. So, if you benchmark both and you know how much time A takes to process 100 items, and how much time B needs for the same 100 items, and A is faster than B, you can calculate at what amount of items B will overtake A in speed (as the speed of B decreases much slower than the one of A, it will overtake A sooner or later—this is for sure).
Big O notation denotes the limiting factor of an algorithm. Its a simplified expression of how run time of an algorithm scales with relation to the input.
For example (in Java):
/** Takes an array of strings and concatenates them
* This is a silly way of doing things but it gets the
* point across hopefully
* #param strings the array of strings to concatenate
* #returns a string that is a result of the concatenation of all the strings
* in the array
*/
public static String badConcat(String[] Strings){
String totalString = "";
for(String s : strings) {
for(int i = 0; i < s.length(); i++){
totalString += s.charAt(i);
}
}
return totalString;
}
Now think about what this is actually doing. It is going through every character of input and adding them together. This seems straightforward. The problem is that String is immutable. So every time you add a letter onto the string you have to create a new String. To do this you have to copy the values from the old string into the new string and add the new character.
This means you will be copying the first letter n times where n is the number of characters in the input. You will be copying the character n-1 times, so in total there will be (n-1)(n/2) copies.
This is (n^2-n)/2 and for Big O notation we use only the highest magnitude factor (usually) and drop any constants that are multiplied by it and we end up with O(n^2).
Using something like a StringBuilder will be along the lines of O(nLog(n)). If you calculate the number of characters at the beginning and set the capacity of the StringBuilder you can get it to be O(n).
So if we had 1000 characters of input, the first example would perform roughly a million operations, StringBuilder would perform 10,000, and the StringBuilder with setCapacity would perform 1000 operations to do the same thing. This is rough estimate, but O(n) notation is about orders of magnitudes, not exact runtime.
It's not something I use per say on a regular basis. It is, however, constantly in the back of my mind when trying to figure out the best algorithm for doing something.
A very similar question has already been asked at Big-O for Eight Year Olds?. Hopefully the answers there will answer your question although the question asker there did have a bit of mathematical knowledge about it all which you may not have so clarify if you need a fuller explanation.
Every programmer should be aware of what Big O notation is, how it applies for actions with common data structures and algorithms (and thus pick the correct DS and algorithm for the problem they are solving), and how to calculate it for their own algorithms.
1) It's an order of measurement of the efficiency of an algorithm when working on a data structure.
2) Actions like 'add' / 'sort' / 'remove' can take different amounts of time with different data structures (and algorithms), for example 'add' and 'find' are O(1) for a hashmap, but O(log n) for a binary tree. Sort is O(nlog n) for QuickSort, but O(n^2) for BubbleSort, when dealing with a plain array.
3) Calculations can be done by looking at the loop depth of your algorithm generally. No loops, O(1), loops iterating over all the set (even if they break out at some point) O(n). If the loop halves the search space on each iteration? O(log n). Take the highest O() for a sequence of loops, and multiply the O() when you nest loops.
Yeah, it's more complex than that. If you're really interested get a textbook.
'Big-O' notation is used to compare the growth rates of two functions of a variable (say n) as n gets very large. If function f grows much more quickly than function g we say that g = O(f) to imply that for large enough n, f will always be larger than g up to a scaling factor.
It turns out that this is a very useful idea in computer science and particularly in the analysis of algorithms, because we are often precisely concerned with the growth rates of functions which represent, for example, the time taken by two different algorithms. Very coarsely, we can determine that an algorithm with run-time t1(n) is more efficient than an algorithm with run-time t2(n) if t1 = O(t2) for large enough n which is typically the 'size' of the problem - like the length of the array or number of nodes in the graph or whatever.
This stipulation, that n gets large enough, allows us to pull a lot of useful tricks. Perhaps the most often used one is that you can simplify functions down to their fastest growing terms. For example n^2 + n = O(n^2) because as n gets large enough, the n^2 term gets so much larger than n that the n term is practically insignificant. So we can drop it from consideration.
However, it does mean that big-O notation is less useful for small n, because the slower growing terms that we've forgotten about are still significant enough to affect the run-time.
What we now have is a tool for comparing the costs of two different algorithms, and a shorthand for saying that one is quicker or slower than the other. Big-O notation can be abused which is a shame as it is imprecise enough already! There are equivalent terms for saying that a function grows less quickly than another, and that two functions grow at the same rate.
Oh, and do I use it? Yes, all the time - when I'm figuring out how efficient my code is it gives a great 'back-of-the-envelope- approximation to the cost.
The "Intuitition" behind Big-O
Imagine a "competition" between two functions over x, as x approaches infinity: f(x) and g(x).
Now, if from some point on (some x) one function always has a higher value then the other, then let's call this function "faster" than the other.
So, for example, if for every x > 100 you see that f(x) > g(x), then f(x) is "faster" than g(x).
In this case we would say g(x) = O(f(x)). f(x) poses a sort of "speed limit" of sorts for g(x), since eventually it passes it and leaves it behind for good.
This isn't exactly the definition of big-O notation, which also states that f(x) only has to be larger than C*g(x) for some constant C (which is just another way of saying that you can't help g(x) win the competition by multiplying it by a constant factor - f(x) will always win in the end). The formal definition also uses absolute values. But I hope I managed to make it intuitive.
It may also be worth considering that the complexity of many algorithms is based on more than one variable, particularly in multi-dimensional problems. For example, I recently had to write an algorithm for the following. Given a set of n points, and m polygons, extract all the points that lie in any of the polygons. The complexity is based around two known variables, n and m, and the unknown of how many points are in each polygon. The big O notation here is quite a bit more involved than O(f(n)) or even O(f(n) + g(m)).
Big O is good when you are dealing with large numbers of homogenous items, but don't expect this to always be the case.
It is also worth noting that the actual number of iterations over the data is often dependent on the data. Quicksort is usually quick, but give it presorted data and it slows down. My points and polygons alogorithm ended up quite fast, close to O(n + (m log(m)), based on prior knowledge of how the data was likely to be organised and the relative sizes of n and m. It would fall down badly on randomly organised data of different relative sizes.
A final thing to consider is that there is often a direct trade off between the speed of an algorithm and the amount of space it uses. Pigeon hole sorting is a pretty good example of this. Going back to my points and polygons, lets say that all my polygons were simple and quick to draw, and I could draw them filled on screen, say in blue, in a fixed amount of time each. So if I draw my m polygons on a black screen it would take O(m) time. To check if any of my n points was in a polygon, I simply check whether the pixel at that point is green or black. So the check is O(n), and the total analysis is O(m + n). Downside of course is that I need near infinite storage if I'm dealing with real world coordinates to millimeter accuracy.... ...ho hum.
It may also be worth considering amortized time, rather than just worst case. This means, for example, that if you run the algorithm n times, it will be O(1) on average, but it might be worse sometimes.
A good example is a dynamic table, which is basically an array that expands as you add elements to it. A naïve implementation would increase the array's size by 1 for each element added, meaning that all the elements need to be copied every time a new one is added. This would result in a O(n2) algorithm if you were concatenating a series of arrays using this method. An alternative is to double the capacity of the array every time you need more storage. Even though appending is an O(n) operation sometimes, you will only need to copy O(n) elements for every n elements added, so the operation is O(1) on average. This is how things like StringBuilder or std::vector are implemented.
What is Big O notation?
Big O notation is a method of expressing the relationship between many steps an algorithm will require related to the size of the input data. This is referred to as the algorithmic complexity. For example sorting a list of size N using Bubble Sort takes O(N^2) steps.
Do I use Big O notation?
I do use Big O notation on occasion to convey algorithmic complexity to fellow programmers. I use the underlying theory (e.g. Big O analysis techniques) all of the time when I think about what algorithms to use.
Concrete Examples?
I have used the theory of complexity analysis to create algorithms for efficient stack data structures which require no memory reallocation, and which support average time of O(N) for indexing. I have used Big O notation to explain the algorithm to other people. I have also used complexity analysis to understand when linear time sorting O(N) is possible.
From Wikipedia.....
Big O notation is useful when analyzing algorithms for efficiency. For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n² − 2n + 2.
As n grows large, the n² term will come to dominate, so that all other terms can be neglected — for instance when n = 500, the term 4n² is 1000 times as large as the 2n term. Ignoring the latter would have negligible effect on the expression's value for most purposes.
Obviously I have never used it..
You should be able to evaluate an algorithm's complexity. This combined with a knowledge of how many elements it will take can help you to determine if it is ill suited for its task.
It says how many iterations an algorithm has in the worst case.
to search for an item in an list, you can traverse the list until you got the item. In the worst case, the item is in the last place.
Lets say there are n items in the list. In the worst case you take n iterations. In the Big O notiation it is O(n).
It says factualy how efficient an algorithm is.