What is Big O notation? Do you use it? [duplicate] - optimization

This question already has answers here:
What is a plain English explanation of "Big O" notation?
What is Big O notation? Do you use it?
I missed this university class I guess :D
Does anyone use it and give some real life examples of where they used it?
See also:
Big-O for Eight Year Olds?
Big O, how do you calculate/approximate it?
Did you apply computational complexity theory in real life?

One important thing most people forget when talking about Big-O, thus I feel the need to mention that:
You cannot use Big-O to compare the speed of two algorithms. Big-O only says how much slower an algorithm will get (approximately) if you double the number of items processed, or how much faster it will get if you cut the number in half.
However, if you have two entirely different algorithms and one (A) is O(n^2) while the other (B) is O(log n), that does not mean A is slower than B. With 100 items, A might actually be ten times faster than B. Big-O only tells you that as the input grows, A's running time grows roughly like n^2 while B's grows roughly like log n. So if you benchmark both and you know how much time A takes to process 100 items, and how much time B needs for the same 100 items, and A is faster than B, you can calculate at what number of items B will overtake A in speed: since B's running time grows much more slowly than A's, B is certain to overtake A sooner or later.
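As a toy illustration of that crossover calculation (the 1 ms and 10 ms timings below are made-up benchmark numbers, purely for the sketch):

// Hypothetical benchmark results: A is O(n^2) and took 1 ms for 100 items,
// B is O(log n) and took 10 ms for the same 100 items.
static void findCrossover() {
    double cA = 1.0 / (100.0 * 100.0);    // constant for A, from t_A(100) = 1 ms
    double cB = 10.0 / Math.log(100.0);   // constant for B, from t_B(100) = 10 ms
    for (long n = 100; n < 1_000_000; n *= 2) {
        double tA = cA * n * n;           // extrapolated time for A
        double tB = cB * Math.log(n);     // extrapolated time for B
        if (tB < tA) {
            System.out.println("B overtakes A somewhere before n = " + n);
            return;
        }
    }
}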

Big O notation denotes the limiting factor of an algorithm. It's a simplified expression of how the run time of an algorithm scales in relation to the input.
For example (in Java):
/** Takes an array of strings and concatenates them.
 * This is a silly way of doing things, but it gets the
 * point across hopefully.
 * @param strings the array of strings to concatenate
 * @return a string that is the result of concatenating all the strings
 *         in the array
 */
public static String badConcat(String[] strings) {
    String totalString = "";
    for (String s : strings) {
        for (int i = 0; i < s.length(); i++) {
            totalString += s.charAt(i);
        }
    }
    return totalString;
}
Now think about what this is actually doing. It is going through every character of input and adding them together. This seems straightforward. The problem is that String is immutable. So every time you add a letter onto the string you have to create a new String. To do this you have to copy the values from the old string into the new string and add the new character.
This means the first character gets copied roughly n-1 times (once for every later append), where n is the number of characters in the input, the second character n-2 times, and so on, for a total of about n(n-1)/2 copies.
This is (n^2-n)/2, and for Big O notation we use only the highest-magnitude factor (usually) and drop any constants that multiply it, so we end up with O(n^2).
Using something like a StringBuilder will be along the lines of O(n log(n)). If you calculate the number of characters at the beginning and set the capacity of the StringBuilder accordingly, you can get it down to O(n).
So if we had 1000 characters of input, the first example would perform roughly a million operations, the StringBuilder roughly 10,000, and the StringBuilder with a preset capacity about 1000 operations to do the same thing. This is a rough estimate, but Big O notation is about orders of magnitude, not exact runtime.
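For comparison, a minimal sketch of the linear-time version (the method name betterConcat is just illustrative):

// Preset the capacity so the internal buffer never has to grow: roughly O(n) overall.
public static String betterConcat(String[] strings) {
    int totalLength = 0;
    for (String s : strings) {
        totalLength += s.length();
    }
    StringBuilder sb = new StringBuilder(totalLength);
    for (String s : strings) {
        sb.append(s);
    }
    return sb.toString();
}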
It's not something I use per se on a regular basis. It is, however, constantly in the back of my mind when trying to figure out the best algorithm for doing something.

A very similar question has already been asked at Big-O for Eight Year Olds?. Hopefully the answers there will answer your question, although the asker there did have a bit of mathematical background, so clarify if you need a fuller explanation.

Every programmer should be aware of what Big O notation is, how it applies for actions with common data structures and algorithms (and thus pick the correct DS and algorithm for the problem they are solving), and how to calculate it for their own algorithms.
1) It's an order of measurement of the efficiency of an algorithm when working on a data structure.
2) Actions like 'add' / 'sort' / 'remove' can take different amounts of time with different data structures (and algorithms); for example, 'add' and 'find' are O(1) for a hashmap, but O(log n) for a binary tree. Sort is O(n log n) for QuickSort, but O(n^2) for BubbleSort, when dealing with a plain array.
3) Calculations can generally be done by looking at the loop depth of your algorithm. No loops: O(1). Loops iterating over the whole set (even if they break out at some point): O(n). A loop that halves the search space on each iteration: O(log n). Take the highest O() for a sequence of loops, and multiply the O()s when you nest loops (see the sketch below).
Yeah, it's more complex than that. If you're really interested get a textbook.
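As a rough sketch of rule 3 (the method names here are just made up for illustration):

// No loop: O(1)
static int first(int[] a) {
    return a[0];
}

// One loop over the whole input: O(n)
static int sum(int[] a) {
    int total = 0;
    for (int x : a) {
        total += x;
    }
    return total;
}

// The loop halves the search space each iteration: O(log n)
static int binarySearch(int[] sorted, int key) {
    int lo = 0, hi = sorted.length - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (sorted[mid] == key) return mid;
        if (sorted[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

// Nested loops over the input: O(n) * O(n) = O(n^2)
static boolean hasDuplicate(int[] a) {
    for (int i = 0; i < a.length; i++) {
        for (int j = i + 1; j < a.length; j++) {
            if (a[i] == a[j]) return true;
        }
    }
    return false;
}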

'Big-O' notation is used to compare the growth rates of two functions of a variable (say n) as n gets very large. If function f grows much more quickly than function g we say that g = O(f) to imply that for large enough n, f will always be larger than g up to a scaling factor.
It turns out that this is a very useful idea in computer science and particularly in the analysis of algorithms, because we are often precisely concerned with the growth rates of functions which represent, for example, the time taken by two different algorithms. Very coarsely, we can determine that an algorithm with run-time t1(n) is more efficient than an algorithm with run-time t2(n) if t1 = O(t2) for large enough n which is typically the 'size' of the problem - like the length of the array or number of nodes in the graph or whatever.
This stipulation, that n gets large enough, allows us to pull a lot of useful tricks. Perhaps the most often used one is that you can simplify functions down to their fastest growing terms. For example n^2 + n = O(n^2) because as n gets large enough, the n^2 term gets so much larger than n that the n term is practically insignificant. So we can drop it from consideration.
However, it does mean that big-O notation is less useful for small n, because the slower growing terms that we've forgotten about are still significant enough to affect the run-time.
What we now have is a tool for comparing the costs of two different algorithms, and a shorthand for saying that one is quicker or slower than the other. Big-O notation can be abused which is a shame as it is imprecise enough already! There are equivalent terms for saying that a function grows less quickly than another, and that two functions grow at the same rate.
Oh, and do I use it? Yes, all the time - when I'm figuring out how efficient my code is, it gives a great 'back-of-the-envelope' approximation to the cost.

The "Intuitition" behind Big-O
Imagine a "competition" between two functions over x, as x approaches infinity: f(x) and g(x).
Now, if from some point on (some x) one function always has a higher value than the other, then let's call this function "faster" than the other.
So, for example, if for every x > 100 you see that f(x) > g(x), then f(x) is "faster" than g(x).
In this case we would say g(x) = O(f(x)). f(x) poses a sort of "speed limit" for g(x), since eventually it passes it and leaves it behind for good.
This isn't exactly the definition of big-O notation, which also states that f(x) only has to be larger than C*g(x) for some constant C (which is just another way of saying that you can't help g(x) win the competition by multiplying it by a constant factor - f(x) will always win in the end). The formal definition also uses absolute values. But I hope I managed to make it intuitive.

It may also be worth considering that the complexity of many algorithms is based on more than one variable, particularly in multi-dimensional problems. For example, I recently had to write an algorithm for the following. Given a set of n points, and m polygons, extract all the points that lie in any of the polygons. The complexity is based around two known variables, n and m, and the unknown of how many points are in each polygon. The big O notation here is quite a bit more involved than O(f(n)) or even O(f(n) + g(m)).
Big O is good when you are dealing with large numbers of homogenous items, but don't expect this to always be the case.
It is also worth noting that the actual number of iterations over the data is often dependent on the data. Quicksort is usually quick, but give it presorted data and it slows down. My points-and-polygons algorithm ended up quite fast, close to O(n + m log(m)), based on prior knowledge of how the data was likely to be organised and the relative sizes of n and m. It would fall down badly on randomly organised data of different relative sizes.
A final thing to consider is that there is often a direct trade-off between the speed of an algorithm and the amount of space it uses. Pigeonhole sort is a pretty good example of this. Going back to my points and polygons, let's say that all my polygons were simple and quick to draw, and I could draw them filled on screen, say in blue, in a fixed amount of time each. So if I draw my m polygons on a black screen it would take O(m) time. To check if any of my n points was in a polygon, I simply check whether the pixel at that point is blue or black. So the check is O(n), and the total analysis is O(m + n). The downside of course is that I need near-infinite storage if I'm dealing with real-world coordinates to millimeter accuracy... ho hum.

It may also be worth considering amortized time, rather than just worst case. This means, for example, that if you run the algorithm n times, it will be O(1) on average, but it might be worse sometimes.
A good example is a dynamic table, which is basically an array that expands as you add elements to it. A naïve implementation would increase the array's size by 1 for each element added, meaning that all the elements need to be copied every time a new one is added. This would result in an O(n^2) algorithm if you were concatenating a series of arrays using this method. An alternative is to double the capacity of the array every time you need more storage. Even though appending is an O(n) operation sometimes, you will only need to copy O(n) elements for every n elements added, so the operation is O(1) on average. This is how things like StringBuilder or std::vector are implemented.
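A minimal sketch of that doubling strategy (the class is illustrative, not how any particular library actually writes it):

class DynamicIntArray {
    private int[] data = new int[4];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            // Occasionally O(n): copy everything into a buffer twice as large.
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;  // Usually O(1); O(1) amortized overall.
    }
}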

What is Big O notation?
Big O notation is a method of expressing the relationship between the number of steps an algorithm will require and the size of the input data. This is referred to as the algorithmic complexity. For example, sorting a list of size N using Bubble Sort takes O(N^2) steps.
Do I use Big O notation?
I do use Big O notation on occasion to convey algorithmic complexity to fellow programmers. I use the underlying theory (e.g. Big O analysis techniques) all of the time when I think about what algorithms to use.
Concrete Examples?
I have used the theory of complexity analysis to create algorithms for efficient stack data structures which require no memory reallocation, and which support average time of O(N) for indexing. I have used Big O notation to explain the algorithm to other people. I have also used complexity analysis to understand when linear time sorting O(N) is possible.

From Wikipedia.....
Big O notation is useful when analyzing algorithms for efficiency. For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n² − 2n + 2.
As n grows large, the n² term will come to dominate, so that all other terms can be neglected — for instance when n = 500, the term 4n² is 1000 times as large as the 2n term. Ignoring the latter would have negligible effect on the expression's value for most purposes.
Obviously I have never used it..

You should be able to evaluate an algorithm's complexity. This combined with a knowledge of how many elements it will take can help you to determine if it is ill suited for its task.

It says how many iterations an algorithm takes in the worst case.
To search for an item in a list, you can traverse the list until you find the item. In the worst case, the item is in the last position.
Let's say there are n items in the list. In the worst case you take n iterations. In Big O notation that is O(n).
In effect, it tells you how efficient an algorithm is.
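A minimal sketch of that search (illustrative method name):

// Worst case: the item is at the last position (or not present at all),
// so the loop runs n times -> O(n).
static int indexOf(int[] list, int item) {
    for (int i = 0; i < list.length; i++) {
        if (list[i] == item) {
            return i;
        }
    }
    return -1;
}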

Related

Does it make sense to use big-O to describe the best case for a function?

I have an extremely pedantic question on big-O notation that I would like some opinions on. One of my uni subjects states “Best O(1) if their first element is the same” for a question on checking if two lists have a common element.
My qualm with this is that it does not describe the function on the entire domain of large inputs, but rather on the restricted domain of large inputs consisting of two lists with the same first element. Does it make sense to describe a function by only talking about a subset of that function's domain? Of course, when restricted to that domain, the time complexity is Omega(1), O(1) and therefore Theta(1), but this isn't describing the original function. From my understanding it would be more correct to say the entire function is bounded below by Omega(1) (and above by O(m*n), where m and n are the sizes of the two input lists).
What do all of you think?
It is perfectly correct to discuss cases (as you correctly point out, a case is a subset of the function's domain) and bounds on the runtime of algorithms in those cases (Omega, Oh or Theta). Whether or not it's useful is a harder question and one that is very situation-dependent. I'd generally think that Omega-bounds on the best case, Oh-bounds on the worst case and Theta bounds (on the "universal" case of all inputs, when such a bound exists) are the most "useful". But calling the subset of inputs where the first elements of each collection are the same, the "best case", seems like reasonable usage. The "best case" for bubble sort is the subset of inputs which are pre-sorted arrays, and is bound by O(n), better than unmodified merge sort's best-case bound.
Fundamentally, big-O notation is a way of talking about how some quantity scales. In CS we so often see it used for talking about algorithm runtimes that we forget that all the following are perfectly legitimate use cases for it:
The area of a circle of radius r is O(r^2).
The volume of a sphere of radius r is O(r^3).
The expected number of 2's showing after rolling n dice is O(n).
The minimum number of 2's showing after rolling n dice is O(1).
The number of strings made of n pairs of balanced parentheses is O(4^n).
With that in mind, it’s perfectly reasonable to use big-O notation to talk about how the behavior of an algorithm in a specific family of cases will scale. For example, we could say the following:
The time required to sort a sequence of n already-sorted elements with insertion sort is O(n). (There are other, non-sorted sequences where it takes longer.)
The time required to run a breadth-first search of a tree with n nodes is O(n). (In a general graph, this could be larger if the number of edges were larger.)
The time required to insert n items in sorted order into an initially empty binary heap is O(n). (This can be Θ(n log n) for a sequence of elements that is reverse-sorted.)
In short, it's perfectly fine to use asymptotic notation to talk about restricted subcases of an algorithm. More generally, it's fine to use asymptotic notation to describe how things grow as long as you're precise about what you're quantifying. See @Patrick87's answer for more about whether to use O, Θ, Ω, etc. when doing so.

Time complexity - understanding big-Theta

I'm currently taking algorithms and data structure. After nearly two months of studying, I still find time complexity extremely confusing.
I was told (by my professor) that if the big-omega and big-O of some program aren't equal, big-theta doesn't exist.
I now literally question everything I've learned so far. I'll take BubbleSort as an example, with big-omega(n), big-theta(n^2) and big-O(n^2). Big-theta indeed does exist (and it makes sense when I analyze it).
Can anyone explain to me whether my professor is wrong or perhaps I misunderstood something?
There exists big O, Big ϴ, and Big Ω.
O is the upper bound
Ω is the lower bound
ϴ exists if and only if O = Ω
In essence, Big-O is the most useful in that it tells us the WORST a function can behave.
Big-Ω indicates the BEST it can behave.
When WORST = BEST, you get ALWAYS. That's Big-ϴ. When a function always behaves the same.
Example (optimized bubble sort, which keeps a boolean flag recording whether a swap occurred):
bubbleSort() ∈ Ω(n)
bubbleSort() ∈ O(n^2)
This means the best bubble sort can do is linear time. But in the worst case it DEGRADES to quadratic time. The reason: on a pre-sorted list, bubble sort behaves quite well, as it does one loop through the list (n iterations) and exits. Whereas on a list in descending order, it would do roughly n(n-1)/2 iterations, which is proportional to n^2. Depending on the input, bubble sort behaves DRASTICALLY (a different order of magnitude) differently.
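A sketch of that optimized bubble sort (the swapped flag is what makes the pre-sorted case Ω(n)):

static void bubbleSort(int[] a) {
    boolean swapped = true;
    for (int pass = 0; swapped && pass < a.length - 1; pass++) {
        swapped = false;
        for (int i = 0; i < a.length - 1 - pass; i++) {
            if (a[i] > a[i + 1]) {
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
                swapped = true;  // at least one swap this pass, keep going
            }
        }
    }
    // Pre-sorted input: the first pass makes no swaps, so the outer loop exits after
    // one linear pass. Reverse-sorted input: roughly n(n-1)/2 comparisons (quadratic).
}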
Whereas:
mergeSort() ∈ Ω(n * log n)
mergeSort() ∈ O(n * log n)
This means that in the best case, merge sort takes n * log n time, and the same in the worst case. It is ALWAYS n * log n. That is because, no matter what the input is, merge sort will always recursively divide the list into half-size sub-arrays, and put them back together once each sub-array is of size 1. However, you can only break something in half so many times: log2(n). The merging work done by merge(), which totals O(n) per level of the recursion, then happens across log2(n) levels. So for ALL executions of mergeSort() you get n * log2(n).
Therefore we can make a STRONGER statement and say that:
mergeSort() ∈ ϴ(n * log n)
We can only make such definitive statements (use big theta) if the runtime of a function is bounded above and below by functions of the exact same order of magnitude.
How I remember it:
ϴ is an end all be all, whereas O, and Ω are simply limits.
Let's take this problem in two ways:
First Way:
We say that O(bubble sort) is O(n^2) and Ω(bubble sort) is Ω(n).
These statements hold true because, when the elements are laid out in the array in a sorted or almost-sorted manner, bubble sort's running time has the form a*n + b, where a and b are constants (a is positive),
but when the elements are laid out in the array in a not-so-sorted or very unsorted manner, the running time has the form a*n^2 + b*n + c, where a, b and c are constants (a is positive).
Thus, bubble sort ∈ O(n^2) and bubble sort ∈ Ω(n).
PS: distinguishing what counts as almost sorted from what counts as not so sorted is a talk for another day, but you can google it and find it easily.
Now, the second way:
Rather than taking the Big-O and Big-Ω of bubble sort as a whole, we split bubble sort into its worst case and its best case.
Now, if we ask what the Big-O of the worst case of bubble sort is, we get O(n^2), i.e.,
worst case of bubble sort ∈ O(n^2).
But since we have now restricted ourselves to the worst case only,
the Big-Ω of the worst case of bubble sort will also be Ω(n^2).
Thus in this case ϴ(n^2) exists, because O = Ω.
Using a similar approach, we can see that the best case of bubble sort will have ϴ(n).
Note: the worst case has the form a*n^2 + b*n + c,
and the best case has the form a*n + b.
-------------------------------------------------------
In the second way we separated the worst case and the best case, and only then were we able to reach ϴ notations for both cases.
Bubble sort as a whole has running-time polynomials of different degrees depending on the input, which is why it does not have a ϴ notation.

When writing big O notation can unknown variables be used?

I do not know if the language I am using in the title is correct, but here is an example that illustrates what I am asking.
What would the time complexity for this non-optimal algorithm that removes character pairs from a string?
The function will loop through a string. When it finds two identical characters next to each other it will return a string without the found pair. It then recursively calls itself until no pair is found.
Example (each line is the return string from one recursive function call):
iabccba
iabba
iaa
i
Would it be fair to describe the time complexity as O(|Characters| * |Pairs|)?
What about O(|Characters|^2)? Can pairs be used to describe the time complexity even though the number of pairs is not knowable at the initial function call?
It was argued to me that this algorithm was O(n^2) because the number of pairs is not known.
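For reference, a literal (deliberately non-optimal) Java sketch of that algorithm, with an illustrative method name:

// Each call scans the string (O(|Characters|)) and removes at most one pair,
// so there are at most |Pairs| + 1 calls in total.
static String removePairs(String s) {
    for (int i = 0; i + 1 < s.length(); i++) {
        if (s.charAt(i) == s.charAt(i + 1)) {
            // Drop the pair and recurse on the shorter string.
            return removePairs(s.substring(0, i) + s.substring(i + 2));
        }
    }
    return s;  // no pair found
}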
You're right that this is, strictly speaking, O(|Characters| * |Pairs|).
However, in the worst case the number of pairs can be the same as the number of characters (or the same order of magnitude), for example in the string 'abcdeedcba'.
So it also makes sense to describe it as O(n^2) worst-case.
I think this largely depends on the problem you mean to solve and its definition.
For graph algorithms, for example, everyone is comfortable writing the complexity as O(|V| + |E|), although in the worst case of a dense graph |E| = |V|^2. In other problems we just look at the worst possible case and write O(n^2), without breaking it into more specific variables.
I'd say that if there's no special convention, or no special data in the problem regarding the number of pairs, O(...) implies worst-case performance and hence O(n^2) would be more appropriate.

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has exactly k balls is <= 1/(e*k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right? And how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
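For what it's worth, a minimal sketch of that empirical test, using java.util.Random as a stand-in for a well-behaved hash function (the class name is just illustrative):

import java.util.Random;

public class MaxLoadTest {
    public static void main(String[] args) {
        Random rng = new Random();
        for (int n : new int[] {100, 1000, 10000, 100000, 1000000}) {
            int trials = 40;
            long totalMax = 0;
            for (int t = 0; t < trials; t++) {
                int[] bins = new int[n];                 // n bins
                for (int ball = 0; ball < n; ball++) {   // n balls
                    bins[rng.nextInt(n)]++;
                }
                int max = 0;
                for (int b : bins) {
                    max = Math.max(max, b);
                }
                totalMax += max;
            }
            double predicted = Math.log(n) / Math.log(Math.log(n));  // ln(n) / ln(ln(n))
            System.out.printf("n=%d  average max-load=%.2f  ln(n)/ln(ln(n))=%.2f%n",
                    n, totalMax / (double) trials, predicted);
        }
    }
}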
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial and error I think I can provide something partway to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability that a bin gets exactly k identical hash values
Pr[ exactly k ] = 1/(e*k!)
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
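A small sketch of that accumulation (the method name is just illustrative):

// Accumulate Pr[exactly k] = 1/(e * k!) until the total reaches 1 - 1/n;
// the k at which that happens is the max-load with that probability.
static int expectedMaxLoad(int n) {
    double target = 1.0 - 1.0 / n;
    double cumulative = 0.0;
    double factorial = 1.0;
    for (int k = 0; ; k++) {
        if (k > 0) {
            factorial *= k;
        }
        cumulative += 1.0 / (Math.E * factorial);
        if (cumulative >= target) {
            return k;   // e.g. returns 5 for n = 1000
        }
    }
}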
So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.

Time Complexity confusion

I've always been a bit confused by this, possibly due to my lack of understanding of compilers. But let's use Python as an example. If we had some large list of numbers called numlist and wanted to get rid of any duplicates, we could use a set operation on the list, for example set(numlist). In return we would have a set of our numbers. This operation, to the best of my knowledge, will be done in O(n) time. Though if I were to create my own algorithm to handle this operation, the absolute best I could ever hope for is O(n^2).
What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?
You can do this in Θ(n) average time using a hash table. Lookup and insertion in a hash table are Θ(1) on average. Thus, you just run through the n items and, for each one, check whether it is already in the hash table and, if not, insert it.
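A minimal Java sketch of that idea (LinkedHashSet just to keep the original order; the method name is illustrative):

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// One pass over the n items; each add() does an average O(1) membership check
// and insert, so the whole thing is O(n) on average.
static Set<Integer> removeDuplicates(List<Integer> numbers) {
    Set<Integer> seen = new LinkedHashSet<>();
    for (Integer x : numbers) {
        seen.add(x);
    }
    return seen;
}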
What I don't get is: what allows an internal operation like set() to be so much faster than an algorithm external to the language? The checking still needs to be done, doesn't it?
The asymptotic complexity of an algorithm does not change depending on whether it is implemented by the language implementers or by a user of the language. As long as both are implemented in a Turing-complete language with a random-access memory model, they have the same capabilities, and algorithms implemented in each will have the same asymptotic complexity. If an algorithm is theoretically O(f(n)), it does not matter whether it is implemented in assembly language, C#, or Python, it will still be O(f(n)).
You can do this in O(n) in any language, basically as:
# Get min and max values O(n).
min = oldList[0]
max = oldList[0]
for i = 1 to oldList.size() - 1:
    if oldList[i] < min:
        min = oldList[i]
    if oldList[i] > max:
        max = oldList[i]

# Initialise boolean list O(n) (indices are offset by min).
isInList = new boolean[max - min + 1]
for i = min to max:
    isInList[i - min] = false

# Set booleans for values in the old list O(n).
for i = 0 to oldList.size() - 1:
    isInList[oldList[i] - min] = true

# Create new list from booleans O(n) (or O(1) based on the integer range).
newList = []
for i = min to max:
    if isInList[i - min]:
        newList.append(i)
I'm assuming here that append is an O(1) operation, which it should be unless the implementer was brain-dead. So with k steps each O(n), you still have an O(n) operation.
Whether the steps are explicitly done in your code or whether they're done under the covers of a language is irrelevant. Otherwise you could claim that the C qsort was one operation and you now have the holy grail of an O(1) sort routine :-)
As many people have discovered, you can often trade off space complexity for time complexity. For example, the above only works because we're allowed to introduce the isInList and newList variables (and because the values are integers in a known range). If this were not allowed, the next best solution might be sorting the list (probably no better than O(n log n)) followed by an O(n) (I think) operation to remove the duplicates.
As an extreme example, you can use that same extra-space method to sort an arbitrary number of 32-bit integers (say, with each value occurring 255 or fewer times) in O(n) time, provided you can allocate about four billion bytes for storing the counts.
Simply initialise all the counts to zero and run through each position in your list, incrementing the count based on the number at that position. That's O(n).
Then start at the beginning of the list and run through the count array, placing that many of the correct value in the list. That's O(1), with the 1 being about four billion of course but still constant time :-)
That's also O(1) space complexity but a very big "1". Typically trade-offs aren't quite that severe.
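A minimal sketch of that counting idea, shrunk to a small illustrative range (values 0..255) so the count array stays tiny:

// O(n) to tally, then O(range) to write the sorted values back out;
// the range is a constant that doesn't depend on n.
static void countingSort(int[] values) {   // assumes every value is in 0..255
    int[] counts = new int[256];
    for (int v : values) {
        counts[v]++;
    }
    int out = 0;
    for (int v = 0; v < counts.length; v++) {
        for (int c = 0; c < counts[v]; c++) {
            values[out++] = v;
        }
    }
}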
The complexity bound of an algorithm is completely unrelated to whether it is implemented 'internally' or 'externally'
Taking a list and turning it into a set through set() is O(n).
This is because set is implemented as a hash set. That means that to check if something is in the set or to add something to the set only takes O(1), constant time. Thus, to make a set from an iterable (like a list for example), you just start with an empty set and add the elements of the iterable one by one. Since there are n elements and each insertion takes O(1), the total time of converting an iterable to a set is O(n).
To understand how the hash implementation works, see the Wikipedia article on hash tables.
Offhand I can't think of how to do this in O(n), but here is the cool thing:
The difference between n^2 and n is so massive that the difference between you implementing it and Python implementing it is tiny compared to the algorithm used to implement it. n^2 is always worse than O(n), even if the n^2 one is in C and the O(n) one is in Python. You should never think that kind of difference comes from the fact that you're not writing in a low-level language.
That said, if you want to implement your own, you can do a sort and then remove the duplicates: the sort is O(n log n) and removing the duplicates is O(n)...
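A sketch of that sort-then-remove-duplicates approach (illustrative method name):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// O(n log n) for the sort, then a single O(n) pass that skips adjacent duplicates.
static List<Integer> sortedDedupe(int[] numbers) {
    int[] copy = Arrays.copyOf(numbers, numbers.length);
    Arrays.sort(copy);
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < copy.length; i++) {
        if (i == 0 || copy[i] != copy[i - 1]) {
            result.add(copy[i]);
        }
    }
    return result;
}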
There are two issues here.
Time complexity (which is expressed in big O notation) is a formal measure of how long an algorithm takes to run for a given set size. It's more about how well an algorithm scales than about the absolute speed.
The actual speed (say, in milliseconds) of an algorithm is the time complexity multiplied by a constant (in an ideal world).
Two people could implement the same removal of duplicates algorithm with O(log(n)*n) complexity, but if one writes it in Python and the other writes it in optimised C, then the C program will be faster.