Why do SSE integer averaging instructions (PAVGB/PAVGW) add 1 to temporary sum before calculating final result? - optimization

I have been working on SSE optimization for a video processing algorithm recently. I need to write exactly the same algorithm in C code to cross-check the correctness of the algorithm. I have forgotten about this fact several times, which made the results of the two implementations differ.
I can modify the C implementation to make them match, since this difference doesn't matter. But why are these instructions designed like this? Is there any mathematical reason behind it?
The Intel Instructions Reference only mentions this behavior and doesn't explain why. I also tried googling, but couldn't find anything about it.
UPDATE:
Thanks to Paul's answer, I now realize this is a rounding/truncation problem. But since both operands are integers, the only possible fraction is 0.5, which has two "nearest" integers. AFAIK there are several rounding methods for this situation. Why do the instructions use rounding up specifically? Do most related applications need rounding up?

It's to give correct rounding, i.e. round to nearest rather than truncation. In general when you divide by N with integer values you need to do this to get correct rounding:
y = (x + N / 2) / N;
If you just do:
y = x / N;
then you will get a truncated (round to zero) result.
Round to nearest is generally preferred for image processing and DSP type applications.
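In C terms, the instruction's behavior corresponds to adding 1 before the shift. A minimal scalar sketch contrasting the two (one byte pair at a time, not the actual SIMD form):
#include <stdint.h>

/* rounded average, matching what PAVGB does per byte: add 1 before the shift */
uint8_t avg_round(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint16_t)a + b + 1) >> 1);
}

/* truncating average for comparison: avg_trunc(9, 10) == 9, while avg_round(9, 10) == 10 */
uint8_t avg_trunc(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint16_t)a + b) >> 1);
}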

Related

Exponents in Genetic Programming

I want to have real-valued exponents (not just integers) for the terminal variables.
For example, let's say I want to evolve the function y = x^3.5 + x^2.2 + 6. How should I proceed? I haven't seen any GP implementations which can do this.
I tried using the power function, but sometimes the initial solutions have so many exponents that the evaluated value exceeds 'double' bounds!
Any suggestion would be appreciated. Thanks in advance.
DEAP (in Python) implements it; in fact, there is an example for that. By adding Python's math.pow to the primitive set you can achieve what you want.
pset.addPrimitive(math.pow, 2)
But using the pow operator you risk getting something like x^(x^(x^(x))), which is probably not desired. You should add a restriction (I am not sure by what means) on where in your tree pow is allowed (just before a leaf, or something like that).
OpenBeagle (in C++) also allows it, but you will need to develop your own primitive using pow from <math.h>; you can use the Sin or Cos primitive as an example.
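Whichever toolkit you use, one common way to keep an evolved pow from overflowing is a "protected" primitive, analogous to GP's protected division. A minimal C sketch; the fabs guard and the 1e6 clamp threshold are arbitrary illustrative choices, not something DEAP or OpenBeagle prescribes:
#include <math.h>

/* "protected" power: never returns inf/NaN, so evolved expressions can't blow up */
double protected_pow(double base, double exponent) {
    double r = pow(fabs(base), exponent); /* fabs avoids NaN for negative bases */
    if (!isfinite(r) || r > 1e6)          /* clamp overflow to an arbitrary ceiling */
        return 1e6;
    return r;
}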
If only some of the initial population are suffering from the overflow problem then just penalise them with a poor fitness score and they will probably be removed from the population within a few generations.
But, if the problem is that virtually all individuals suffer from this problem, then you will have to add some constraints. The simplest thing to do would be to constrain the exponent child of the power function to be a real literal - which would mean powers would not be allowed to be nested. It depends on whether this is sufficient for your needs though. There are a few ways to add constraints like these (or more complex ones) - try looking in to Constrained Syntactic Structures and grammar guided GP.
A few other simple thoughts: can you use a data-type with a larger range? Also, you could reduce the maximum depth parameter, so that there will be less room for nested exponents. Of course that's only possible to an extent, and it depends on the complexity of the function.
Integers have a different binary representation than reals, so you have to use a slightly different bitstring representation and recombination/mutation operator.
For an excellent demonstration, see slide 24 of www.cs.vu.nl/~gusz/ecbook/slides/Genetic_Algorithms.ppt or check out the Eiben/Smith book "Introduction to Evolutionary Computing Genetic Algorithms." This describes how to map a bit string to a real number. You can then create a representation where x only lies within an interval [y,z]. In this case, choose y and z to be of smaller magnitude than the capacity of the data type you are using (e.g. around 10^308 for a double) so you don't run into the overflow issue you describe.
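As a sketch of the mapping described above (assuming the chromosome is stored in the low L bits of an unsigned integer), the usual binary-coded decoding onto an interval [lo, hi] looks like this:
#include <stdint.h>

/* decode an L-bit chromosome onto the real interval [lo, hi] */
double decode(uint32_t bits, int L, double lo, double hi) {
    uint32_t max = (L >= 32) ? 0xFFFFFFFFu : ((1u << L) - 1u);
    return lo + (hi - lo) * ((double)bits / (double)max);
}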
You have to consider that with real-valued exponents and a negative base you will not obtain a real, but a complex number. For example, the Math.Pow implementation in .NET says that you get NaN if you attempt to calculate the power of a negative base to a non-integer exponent. You have to make sure all your x values are positive. I think that's the problem that you're seeing when you "exceed double bounds".
Btw, you can try the HeuristicLab GP implementation. It is very flexible with a configurable grammar.

Estimating the square root

I am writing an iPhone app that needs to calculate the square root of a number about 2000 times every 1/30th of a second. sqrt() works fine on a computer, but the frame rate drops to around 10 FPS on an iPhone or iPad, and I have already optimized the rest of the code. I have heard that this can be sped up dramatically by estimating the square root, but I cannot find any code to do this. I only need one or two decimal places of precision. Any suggestions on how to do this, or other ways to speed things up, would be appreciated.
Thanks!
Unless you actually need the square root, compare the squared values rather than the raw values and the square root.
Squaring is much faster (and more accurate) than taking a square root, if you only need comparisons. This is the way most games do it.
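For example, a distance check can compare squared values directly (a sketch, assuming the radius is nonnegative):
/* instead of: sqrtf(dx*dx + dy*dy) < radius -- same result, no square root */
int within_radius(float dx, float dy, float radius) {
    return dx * dx + dy * dy < radius * radius;
}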
Do you know the range of values that you are trying to find the square root of? Say you have values ranging from 0 to 10. You can then precalculate an array:
sqrt_val[0] = 0;
sqrt_val[1] = 1;
sqrt_val[2] = // the sqrt of 2
...
sqrt_val[10] = // the sqrt of 10
Then during runtime you take the number that you want the sqrt of, convert that to an integer (so for example 3.123 becomes 3) and use that as an index (3) to look up the precalculated value.
Of course if you want finer resolution you can just increase the number of items in your array.
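A runnable version of this idea, with the range and resolution as illustrative assumptions (inputs in [0, 10], table step of 0.01):
#include <math.h>

#define SQRT_TABLE_MAX   10    /* largest expected input (assumption) */
#define SQRT_TABLE_STEPS 100   /* entries per unit, i.e. 0.01 resolution */

static float sqrt_table[SQRT_TABLE_MAX * SQRT_TABLE_STEPS + 1];

/* fill the table once at startup */
void init_sqrt_table(void) {
    for (int i = 0; i <= SQRT_TABLE_MAX * SQRT_TABLE_STEPS; i++)
        sqrt_table[i] = sqrtf((float)i / SQRT_TABLE_STEPS);
}

/* approximate sqrt for x in [0, SQRT_TABLE_MAX] */
float table_sqrt(float x) {
    return sqrt_table[(int)(x * SQRT_TABLE_STEPS)];
}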
First off, are you certain that square root is actually the bottleneck? Have you profiled? 2000 square roots every 1/30th of a second actually isn't all that many, even on a cell phone. The ARM documentation quotes 33 cycles for a single-precision square root and 60 cycles for double-precision; a 600 MHz processor can do 10 million square roots per second (more if the instruction is pipelined at all).
If you have profiled, and square root really is the bottleneck, you will want to use the NEON vrsqrte.f32 instruction. This instruction is quite fast and gives you the approximate reciprocal square roots of four floating-point numbers simultaneously. You can then use the vmul.f32 instruction to get approximate square roots (though for many uses the reciprocal is more useful than the square root itself).
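A minimal sketch of that combination using the NEON intrinsics from <arm_neon.h>. Note that an input of exactly 0 produces NaN with this trick, and accuracy can be improved with a vrsqrtsq_f32 Newton-Raphson refinement step:
#include <arm_neon.h>

/* approximate square roots of four floats at once: sqrt(x) ~= x * (1/sqrt(x)) */
static inline float32x4_t approx_sqrt4(float32x4_t x) {
    float32x4_t rsqrt = vrsqrteq_f32(x); /* fast estimate of 1/sqrt(x) */
    return vmulq_f32(x, rsqrt);          /* x * 1/sqrt(x) ~= sqrt(x); x == 0 gives NaN */
}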
How accurate do you want your estimate to be? If you know how close you want your estimate to be to the real square root, then Newton's Method is your friend.
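A minimal sketch of a Newton iteration for sqrt, where the tolerance parameter is whatever accuracy you decide you need:
#include <math.h>

/* Newton's method: repeat x = (x + s/x)/2 until successive estimates agree */
float newton_sqrt(float s, float tolerance) {
    if (s <= 0.0f) return 0.0f;
    float x = (s > 1.0f) ? s * 0.5f : 1.0f; /* rough starting guess */
    float prev;
    do {
        prev = x;
        x = 0.5f * (x + s / x);
    } while (fabsf(x - prev) > tolerance);
    return x;
}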
Do you know the range of values that are passed to sqrt? If so, you can build a lookup table that is precomputed at startup (or even read from disk at startup, depending on what turns out to be faster). Find the closest entry in the table to your input and you have your estimate.
Maybe this is for you:
Fast inverse square root
If this method doesn't provide the accuracy you need, there are also a lot of other iterative methods where you can trade speed against accuracy:
Methods of computing square roots
The easiest change you can make on an iPhone is to use sqrtf() instead of sqrt(). Single precision float math is much faster than double precision, especially on devices of 3GS vintage and newer.
If you need the square root to calculate a Pythagoras triangle (sqrt(x*x + y*y)), and both x and y are nonnegative, then a very fast approximation to that is
max(x,y) + min(x,y)*0.333
This has a maximum error of 5.7%. Watch out for branch misprediction in min() and max() though.
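A branch-light version of that approximation (a sketch; fmaxf/fminf typically compile to branchless instructions):
#include <math.h>

/* approximate sqrt(x*x + y*y) for nonnegative x, y; max error about 5.7% */
float approx_hypot(float x, float y) {
    float hi = fmaxf(x, y);
    float lo = fminf(x, y);
    return hi + lo * 0.333f;
}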
A quick Google search turns up all sorts of links.
http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Approximations_that_depend_on_IEEE_representation
http://www.azillionmonkeys.com/qed/sqroot.html
If you have a "normal" positive float or double, not an int, and want to use a table look-up method, you can do two separate table look ups, one for the exponent (re-biased), and one for a few bits of the mantissa (shift and mask bitfield extraction), and then multiply the two table look up results together.

Why do I see -0.000000000000001 in an Access query?

I have this SQL:
SELECT Sum(Field1), Sum(Field2), Sum(Field1)+Sum(Field2)
FROM Table
GROUP BY DateField
HAVING Sum(Field1)+Sum(Field2)<>0;
The problem is that sometimes the sum of Field1 and Field2 is something like 9.5 - 10.3, and the result is -0.800000000000001. Could anybody explain why this happens and how to solve it?
The problem is that sometimes the sum of Field1 and Field2 is something like 9.5 - 10.3, and the result is -0.800000000000001. Could anybody explain why this happens and how to solve it?
Why this happens
The float and double types store numbers in base 2, not in base 10. Sometimes, a number can be exactly represented in a finite number of bits.
9.5 → 1001.1
And sometimes it can't.
10.3 → 1010.0 1001 1001 1001 1001 1001 1001 1001 1001...
In the latter case, the number will get rounded to the closest value that can be represented as a double:
1010.0100110011001100110011001100110011001100110011010 base 2
= 10.300000000000000710542735760100185871124267578125 base 10
When the subtraction is done in binary, you get:
-0.11001100110011001100110011001100110011001100110100000
= -0.800000000000000710542735760100185871124267578125
Output routines will usually hide most of the "noise" digits.
Python 3.1 rounds it to -0.8000000000000007
SQLite 3.6 rounds it to -0.800000000000001.
printf %g rounds it to -0.8.
Note that, even on systems that display the value as -0.8, it's not the same as the best double approximation of -0.8, which is:
- 0.11001100110011001100110011001100110011001100110011010
= -0.8000000000000000444089209850062616169452667236328125
So, in any programming language using double, the expression 9.5 - 10.3 == -0.8 will be false.
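The same thing is easy to demonstrate outside of Access; for instance, in C:
#include <stdio.h>

int main(void) {
    double d = 9.5 - 10.3;
    printf("%.17g\n", d);      /* prints -0.80000000000000071 */
    printf("%d\n", d == -0.8); /* prints 0: the two doubles are not equal */
    return 0;
}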
The decimal non-solution
With questions like these, the most common answer is "use decimal arithmetic". This does indeed get better output in this particular example. Using Python's decimal.Decimal class:
>>> Decimal('9.5') - Decimal('10.3')
Decimal('-0.8')
However, you'll still have to deal with
>>> Decimal(1) / 3 * 3
Decimal('0.9999999999999999999999999999')
>>> Decimal(2).sqrt() ** 2
Decimal('1.999999999999999999999999999')
These may be more familiar rounding errors than the ones binary numbers have, but that doesn't make them less important.
In fact, binary fractions are more accurate than decimal fractions with the same number of bits, because of a combination of:
The hidden bit unique to base 2, and
The suboptimal radix economy of decimal.
It's also much faster (on PCs) because it has dedicated hardware.
There is nothing special about base ten. It's just an arbitrary choice based on the number of fingers we have.
It would be just as accurate to say that a newborn baby weighs 0x7.5 lb (in more familiar terms, 7 lb 5 oz) as to say that it weighs 7.3 lb. (Yes, there's a 0.2 oz difference between the two, but it's within tolerance.) In general, decimal provides no advantage in representing physical measurements.
Money is different
Unlike physical quantities which are measured to a certain level of precision, money is counted and thus an exact quantity. The quirk is that it's counted in multiples of 0.01 instead of multiples of 1 like most other discrete quantities.
If your "10.3" really means $10.30, then you should use a decimal number type to represent the value exactly.
(Unless you're working with historical stock prices from the days when they were in 1/16ths of a dollar, in which case binary is adequate anyway ;-) )
Otherwise, it's just a display issue.
You got an answer correct to 15 significant digits. That's correct for all practical purposes. If you just want to hide the "noise", use the SQL ROUND function.
I'm certain it is because the float data type (aka Double or Single in MS Access) is inexact. It is not like decimal, which is a simple value scaled by a power of 10. Float values are fractions with power-of-two denominators, which means that they don't always convert back to base 10 exactly.
The cure is to change Field1 and Field2 from float/single/double to decimal or currency. If you give examples of the smallest and largest values you need to store, including the smallest and largest fractions needed such as 0.0001 or 0.9999, we can possibly advise you better.
Be aware that versions of Access before 2007 can have problems with ORDER BY on decimal values. Please read the comments on this post for some more perspective on this. In many cases, this would not be an issue for people, but in other cases it might be.
In general, float should be used for values that can end up being extremely small or large (smaller or larger than a decimal can hold). You need to understand that float trades some precision for a much larger range. That is, a decimal will overflow or underflow where a float can just keep on going. But the float only has a limited number of significant digits, whereas a decimal's digits are all significant.
If you can't change the column types, then in the meantime you can work around the problem by rounding your final calculation. Don't round until the very last possible moment.
Update
A criticism has been leveled at my recommendation to use decimal: not the point about unexpected ORDER BY results, but the claim that float is overall more accurate with the same number of bits.
No contest to this fact. However, I think it is more common for people to be working with values that are in fact counted or are expected to be expressed in base ten. I see questions over and over in forums about what's wrong with their floating-point data types, and I don't see these same questions about decimal. That means to me that people should start off with decimal, and when they're ready for the leap to how and when to use float they can study up on it and start using it when they're competent.
In the meantime, while it may be a tad frustrating to have people always recommending decimal when you know it's not as accurate, don't let yourself get divorced from the real world where having more familiar rounding errors at the expense of very slightly reduced accuracy is of value.
Let me point out to my detractors that the example
Decimal(1) / 3 * 3 yielding 1.999999999999999999999999999
is, in what should be familiar words, "correct to 27 significant digits" which is "correct for all practical purposes."
So if we have two ways of doing what is practically speaking the same thing, and both of them can represent numbers very precisely out to a ludicrous number of significant digits, and both require rounding but one of them has markedly more familiar rounding errors than the other, I can't accept that recommending the more familiar one is in any way bad. What is a beginner to make of a system that can perform a - a and not get 0 as an answer? He's going to get confusion, and be stopped in his work while he tries to fathom it. Then he'll go ask for help on a message board, and get told the pat answer "use decimal". Then he'll be just fine for five more years, until he has grown enough to get curious one day and finally studies and really grasps what float is doing and becomes able to use it properly.
That said, in the final analysis I have to say that slamming me for recommending decimal seems just a little bit off in outer space.
Last, I would like to point out that the following statement is not strictly true, since it overgeneralizes:
The float and double types store numbers in base 2, not in base 10.
To be accurate, most modern systems store floating-point data types with a base of 2. But not all! Some use or have used base 10. For all I know, there are systems which use base 3 which is closer to e and thus has a more optimal radix economy than base 2 representations (as if that really mattered to 99.999% of all computer users). Additionally, saying "float and double types" could be a little misleading, since double IS float, but float isn't double. Float is short for floating-point, but Single and Double are float(ing point) subtypes which connote the total precision available. There are also the Single-Extended and Double-Extended floating point data types.
It is probably an effect of floating point number implementations. Sometimes numbers cannot be exactly represented, and sometimes the result of operations is slightly off what we may expect for the same reason.
The fix would be to use a rounding function on the values to cut off the extraneous digits. Like this (I've simply rounded to 4 digits after the decimal point, but of course you should use whatever precision is appropriate for your data):
SELECT Sum(Field1), Sum(Field2), Round(Sum(Field1)+Sum(Field2), 4)
FROM Table
GROUP BY DateField
HAVING Round(Sum(Field1)+Sum(Field2), 4)<>0;

Objective C Math Formula Fail

noob here wants to calculate compound interest on iPhone.
float principal;
float rate;
int compoundPerYear;
int years;
float amount;
formula should be: amount = principal*(1+rate/compoundPerYear)^(compoundPerYear*years)
I get a slightly incorrect answer with:
amount = principal*pow((1+(rate/compoundPerYear)), (compoundPerYear*years));
I'm testing it with a rate of .1, but the debugger reports .100000001.
Am I doing it wrong? Should I use doubles or a special class (e.g., NSNumber)?
Thanks for any other ideas!
After further research it seems that the NSDecimalNumber class may be just what I need. Now I just have to figure out how to use this bad boy.
double will get you closer, but you can't represent 1/10 exactly in binary (using IEEE floating point notation, anyway).
If you're really interested, you can look at What Every Computer Scientist Should Know About Floating-Point Arithmetic. Link shamefully stolen from another SO thread.
The quick and dirty explanation is that floating point is stored in binary with bits that represents fractional powers of 2 (1/2, 1/4, 1/8, ...). There is simply no mathematical way to add up these fractions to exactly 1/10, thus 0.1 is not able to be exactly represented in IEEE floating point notation.
double extends the accuracy of the number by giving you more numerals before/after the radix, but it does not change the format of the binary in a way that can compensate for this. You'll just get the extra bit somewhere later down the line, most likely.
See also:
Why can’t decimal numbers be represented exactly in binary?
What’s wrong with using == to compare floats in Java?
and other similar threads.
Further expansion that I mulled over on the drive home from work: one way you could conceivably handle this is by just representing all of the monetary values in cents (as an int), then converting to a dollars.cents format when displaying the data. This is actually pretty easy, too, since you can take advantage of integer division's truncating when you convert:
int interest, dollars, cents;
interest = 16034; //$160.34, in cents
dollars = interest / 100; //The 34 gets truncated: dollars == 160
cents = interest % 100; //cents == 34
printf("Interest earned to date: $%d.%02d\n", dollars, cents); //%02d keeps the cents two digits wide
I don't know Objective-C, but hopefully this C example makes sense, too. Again, this is just one way to handle it. It would also be improved by having a function that does the string formatting whenever you need to show the data.
You can obviously come up with your own (even better!) way to do it, but maybe this will help get you started. If anyone else has suggestions on this one, I'd like to hear them, too!
Short answer: Never use floating point numbers for money.
The easy way that works across most platforms is to represent money as integer amounts of its smallest unit. The smallest unit is often something like a cent, although often 1/10 or 1/100 of a cent are the real base units.
On many platforms, there are also number types available that can represent fixed-point decimals.
Be sure to get the rounding right. Financial bookkeeping often uses banker's rounding.
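As a small illustration of both points (integer cents plus round-half-to-even), assuming the default IEEE rounding mode is in effect; note that dollars * 100.0 can itself carry a tiny representation error, so this is a sketch rather than bulletproof money handling:
#include <math.h>
#include <stdio.h>

/* convert dollars to whole cents; nearbyint() uses the current rounding mode,
   which by default is round-half-to-even (banker's rounding) */
long to_cents(double dollars) {
    return (long)nearbyint(dollars * 100.0);
}

int main(void) {
    printf("%ld\n", to_cents(0.125));  /* 12.5 cents -> 12, the even choice */
    printf("%ld\n", to_cents(0.375));  /* 37.5 cents -> 38, the even choice */
    printf("%ld\n", to_cents(160.34)); /* 16034 */
    return 0;
}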

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/(e*k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right, and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, "with high probability" seems to mean 1 - 1/n.
That is a fascinating paper/lecture; it makes me wish I had taken a formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
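A rough C sketch of the empirical test described above; the my_hash function here is just a hypothetical stand-in (substitute the hash under test), and random integers stand in for your real keys:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* hypothetical hash under test -- replace with your own */
static unsigned my_hash(unsigned key) {
    key ^= key >> 16;
    key *= 0x45d9f3bu;
    key ^= key >> 16;
    return key;
}

/* one run: throw n keys into n bins, return the largest bin count (the max-load) */
static int max_load(int n) {
    int *bins = calloc(n, sizeof *bins);
    int max = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)(my_hash((unsigned)rand()) % (unsigned)n);
        if (++bins[b] > max) max = bins[b];
    }
    free(bins);
    return max;
}

int main(void) {
    srand((unsigned)time(NULL));
    /* increase n logarithmically and average over many runs, as suggested above */
    for (int n = 100; n <= 1000000; n *= 10) {
        double avg = 0.0;
        for (int run = 0; run < 40; run++) avg += max_load(n);
        printf("n = %7d  average max-load = %.2f\n", n, avg / 40.0);
    }
    return 0;
}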
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial and error I think I can provide something part way to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = 1/(e * k!)
then accumulating the probabilities for exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
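A short C sketch of that accumulation, for computing the threshold at other table sizes (it assumes the formula above, i.e. n values hashed into n slots):
#include <math.h>
#include <stdio.h>

/* accumulate Pr[exactly k] = 1/(e * k!) until the total reaches 1 - 1/n */
int max_load_bound(int n) {
    double target = 1.0 - 1.0 / n;
    double prob = exp(-1.0); /* Pr[0] = 1/e */
    double total = prob;
    int k = 0;
    while (total < target) {
        k++;
        prob /= k;           /* Pr[k] = Pr[k-1] / k */
        total += prob;
    }
    return k;
}

int main(void) {
    printf("%d\n", max_load_bound(1000)); /* prints 5, matching the worked example */
    return 0;
}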
So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least-used slot from a choice of 2 locations. This more than halves the number of collisions, though it means that adding to and searching the hash table is a little slower.
I hope this helps.