Tweaking MIT's bitcount algorithm to count words in parallel? - optimization

I want to use a version of the well known MIT bitcount algorithm to count neighbors in Conway's game of life using SSE2 instructions.
Here's the MIT bitcount in c, extended to count bitcounts > 63 bits.
int bitCount(unsigned long long n)
{
unsigned long long uCount;
uCount = n – ((n >> 1) & 0×7777777777777777)
- ((n >> 2) & 0×3333333333333333)
- ((n >> 3) & 0×1111111111111111);
return ((uCount + (uCount >> 4))
& 0x0F0F0F0F0F0F0F0F) % 255;
}
Here's a version in Pascal
function bitcount(n: uint64): cardinal;
var ucount: uint64;
begin
ucount:= n - ((n shr 1) and $7777777777777777)
- ((n shr 2) and $3333333333333333)
- ((n shr 3) and $1111111111111111);
Result:= ((ucount + (count shr 4))
and $0F0F0F0F0F0F0F0F) mod 255;
end;
I'm looking to count the bits in this structure in parallel.
32-bit word where the pixels are laid out as follows.
lo-byte lo-byte neighbor
0 4 8 C 048C 0 4 8 C
+---------------+
1|5 9 D 159D 1|5 9 D
| |
2|6 A E 26AE 2|6 A E
+---------------+
3 7 B F 37BF 3 7 B F
|-------------| << slice A
|---------------| << slice B
|---------------| << slice C
Notice how this structure has 16 bits in the middle that need to be looked up.
I want to calculate neighbor counts for each of the 16 bits in the middle using SSE2.
In order to do this I put slice A in XMM0 low-dword, slice B in XXM0-dword1 etc.
I copy XMM0 to XMM1 and I mask off bits 012-456-89A for bit 5 in the low word of XMM0, do the same for word1 of XMM0, etc. using different slices and masks to make sure each word in XMM0 and XMM1 holds the neighbors for a different pixel.
Question
How do I tweak the MIT-bitcount to end up with a bitcount per word/pixel in each XMM word?
Remarks
I don't want to use a lookup table, because I already have that approach and I want to
test to see if SSE2 will speed up the process by not requiring memory accesses to the lookup table.
An answer using SSE assembly would be optimal, because I'm programming this in Delphi and I'm thus using x86+SSE2 assembly code.

The MIT algorithm would be tough to implement in SSE2, since there is no integer modulus instruction which could be used for the final ... % 255 expression. Of the various popcnt methods out there, the one that lends itself to SSE most easily and efficiently is probably the first one in Chapter 5 of "Hackers Delight" by Henry S. Warren, which I have implemented here in C using SSE intrinsics:
#include <stdio.h>
#include <emmintrin.h>
__m128i _mm_popcnt_epi16(__m128i v)
{
v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x5555)), _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi16(0x5555)));
v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x3333)), _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi16(0x3333)));
v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x0f0f)), _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi16(0x0f0f)));
v = _mm_add_epi16(_mm_and_si128(v, _mm_set1_epi16(0x00ff)), _mm_and_si128(_mm_srli_epi16(v, 8), _mm_set1_epi16(0x00ff)));
return v;
}
int main(void)
{
__m128i v0 = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
__m128i v1;
v1 = _mm_popcnt_epi16(v0);
printf("v0 = %vhd\n", v0);
printf("v1 = %vhd\n", v1);
return 0;
}
Compile and test as follows:
$ gcc -Wall -msse2 _mm_popcnt_epi16.c -o _mm_popcnt_epi16
$ ./_mm_popcnt_epi16
v0 = 0 1 2 3 4 5 6 7
v1 = 0 1 1 2 1 2 2 3
$
It looks like around 16 arithmetic/logical instructions so it should run at around 16 / 8 = 2 clocks per point.
You can easily convert this to raw assembler if you need to - each intrinsic maps to a single instruction.

Related

Where the Hamming Distance Constants Came From

The function:
function popcount (x, n) {
if (n !== undefined) {
x &= (1 << n) - 1
}
x -= x >> 1 & 0x55555555
x = (x & 0x33333333) + (x >> 2 & 0x33333333)
x = x + (x >> 4) & 0x0f0f0f0f
x += x >> 8
x += x >> 16
return x & 0x7f
}
Is for calculating Hamming Weight. I am wondering where these constants come from and generally how this method was discovered. Wondering if anyone knows the resource that describes it.
There masks select the even numbered k-bit parts, k=1 gives 0x55555555, k=2 gives 0x33333333, k=4 gives 0x0f0f0f0f.
In binary the masks look like:
0x55555555 = 01010101010101010101010101010101
0x33333333 = 00110011001100110011001100110011
0x0f0f0f0f = 00001111000011110000111100001111
They are also the result of 0xffffffff / 3, 0xffffffff / 5 and 0xffffffff / 17 but this arithmetic insight is probably not useful in this context.
Overall this method of computing the Hamming weight has the form of a tree where first adjacent bits are summed into a 2-bit number, then adjacent 2-bit numbers are summed into 4-bit numbers, and so on.
All the steps could have this form:
x = (x & m[k]) + ((x >> k) & m[k])
where m[k] is a mask selecting the even-numbered k-bit parts.
But many steps have short-cuts available for them. For example, to sum adjacent bits, there are only 4 cases to consider:
00 -> 00
01 -> 01
10 -> 01
11 -> 10
This could be done by extracting both bits and summing them, but x -= x >> 1 & 0x55555555 also works. This subtracts the top bit from the 2-bit part, so
00 -> 00 - 0 = 00
01 -> 01 - 0 = 01
10 -> 10 - 1 = 01
11 -> 11 - 1 = 10
Maybe this could be discovered through "cleverness and insight", whatever those are.
In the step x = (x + (x >> 4)) & 0x0f0f0f0f (extra parentheses added for clarity), a couple of properties are used. The results from the previous steps are the Hamming weights of 4-bit strings stored in 4 bits each, so they are at most 0100. That means two of them can be added in-place without carrying into the next higher part, because their sum will be at most 1000 which still fits. So instead of masking twice before the sum, it is enough to mask once after the sum, this mask effectively zero-extends the even numbered 4-bit parts into 8-bit parts. This could be discovered by considering the maximum values at each step.
The step x += x >> 8 has similar reasoning but it works out even better, even masking after the sum is not needed, this leaves some "stray bits" in the second byte from the bottom and in the top byte, but that is not damaging to the next step: the >> 16 throws away the second byte from the bottom, in the end all the stray bits are removed with x & 0x7f.

Eigen: Replicate (broadcast) by rows

I'd like to replicate each row of a matrix M without any copy occurring (i.e. by creating a view):
0 1 0 1
2 3 -> 0 1
2 3
2 3
M.rowwise().replicate(n) is a shorcut for M.replicate(1,n) which seems kind of useless.
The following snippet does a copy, and cannot work if M is an expression.
Eigen::Index rowFactor = 2;
Eigen::MatrixXi M2 = Eigen::Map(M.data(), 1, M.size()).replicate(rowFactor, 1);
M2.resize(M.rows()*rowFactor, M.cols()) ;
In some situation, I may use the intermediate view Eigen::Map<Eigen::MatrixXi>(M.data(), 1, M.size()).replicate(rowFactor, 1) by reshaping the other operands, but that's not very satisfying.
Is there a proper way to achieve this broadcast view?
What you want is essentially a Kronecker product with a matrix of ones. You can use the (unsupported) KroneckerProduct module for that:
#include <iostream>
#include <unsupported/Eigen/KroneckerProduct>
int main() {
Eigen::Matrix2i M; M << 0, 1, 2, 3;
std::cout << Eigen::kroneckerProduct(M, Eigen::Vector2i::Ones()) << '\n';
}
Being 'unsupported' means that the API of the module is not guaranteed to be stable (though this module has not changed since its introduction, I think).

VB.NET doesn't round numbers correctly?

I'm testing the speed of some functions so I made a test to run the functions over and over again and I stored the results in an array. I needed them to be sorted by the size of the array I randomly generated. I generate 100 elements. Merge sort to the rescue! I used this link to get me started.
The section of code I'm focusing on:
private void mergesort(int low, int high) {
// check if low is smaller then high, if not then the array is sorted
if (low < high) {
// Get the index of the element which is in the middle
int middle = low + (high - low) / 2;
// Sort the left side of the array
mergesort(low, middle);
// Sort the right side of the array
mergesort(middle + 1, high);
// Combine them both
merge(low, middle, high);
}
}
which translated to VB.NET is
private sub mergesort(low as integer, high as integer)
' check if low is smaller then high, if not then the array is sorted
if (low < high)
' Get the index of the element which is in the middle
dim middle as integer = low + (high - low) / 2
' Sort the left side of the array
mergesort(low, middle)
' Sort the right side of the array
mergesort(middle + 1, high)
' Combine them both
merge(low, middle, high)
end if
end sub
Of more importance the LOC that only matters to this question is
dim middle as integer = low + (high - low) / 2
In case you wanna see how merge sort is gonna run this baby
high low high low
100 0 10 0
50 0 6 4
25 0 5 4
12 0 12 7
6 0 10 7
3 0 8 7
2 0 :stackoverflow error:
The error comes from the fact 7 + (8 - 7) / 2 = 8. You'll see 7 and 8 get passed in to mergesort(low, middle) and then we infinite loop. Now earlier in the sort you see a comparison like this again. At 5 and 4. 4 + (5 - 4) / 2 = 4. So essentially for 5 and 4 it becomes 4 + (1) / 2 = 4.5 = 4. For 8 and 7 though it's 7 + (1) / 2 = 7.5 = 8. Remember the numbers are typecasted to an int.
Maybe I'm just using a bad implementation of it or my typecasting is wrong, but my question is: Shouldn't this be a red flag signaling something isn't right with the rounding that's occuring?
Without understanding the whole algorithm, note that VB.NET / is different than C# /. The latter has integer division by default, if you want to truncate decimal places also in VB.NET you have to use \.
Read: \ Operator
So i think that this is what you want:
Dim middle as Int32 = low + (high - low) \ 2
You are correct in your diagnosis: there's something inconsistent with the rounding that's occurring, but this is entirely expected if you know where to look.
From the VB.NET documentation on the / operator:
Divides two numbers and returns a floating-point result.
This documentation explicitly states that , if x and y are integral types, x / y returns a Double. So, 5 / 2 in VB.NET would be expected to be 2.5.
From the C# documentation on the / operator:
All numeric types have predefined division operators.
And further down the page:
When you divide two integers, the result is always an integer.
In the case of C#, if x and y are integers, x / y returns an integer (rounded down). 5 / 2 in C# is expected to return 2.

Find Maximum sum in a path in a 2D matrix with positive integers

I would like to know where I can read about algorithms for solving this problem efficiently:
Four directions allowed: up, down, left, right
Cells containing zero can't be visited.
Visiting the same cell twice is illegal.
Moves wraps around the edges:
(first row is connected with last row)
(first col is connected with last col)
Example, 5x5 and 5 steps:
9 1 3 1 9
6 3 2 4 1
0 7 * 7 7
5 4 9 4 9
7 9 1 5 5
Starting point: *
Solution: down,left,down,left,down. That is 9 + 4 + 9 + 7 + 9 = 38
[9] 1 3 1 9
6 3 2 4 1
0 7 * 7 7
5 [4][9] 4 9
[7][9] 1 5 5
This problem is probably not related to:
Finding the maximum sub matrix
Dynamic programming
You specified in comments that you wanted a sub-second way of finding the best value 20-step path out of a 5x5 matrix. I've implemented a basic recursive search tree that does this. Ultimately, the difficulty of this is still O(3^k), but highly saturated cases like yours (21 out of 24 allowed nodes visited) will solve much faster because the problem simplifies to "skip the n*n-z-k-1 lowest value cells" (in this case, n=5, z=1 and k+1 = 21; the winning path skips three 1's).
The problem instance in your question solves in 0.231seconds on a 3 year old i5 laptop and about half a second on ideone.com. I've provided code here http://ideone.com/5kOyxq (note that 'up' and 'down' are reversed because of the way I input the data).
For less saturated problems you may need to add a Bound/Cut method. You can generate a Bound as follows:
First, run over the NxN matrix and collect the K highest value elements (can be done in N² log K) and sort them by max-first. Then, cumulatively calculate the value UB[t] = SUM[i::0->t] SortedElements[i]. So, any t-length path has a UB of UB[t] (max t elements).
At step T, the current Branch's UB is UB[t]. If ValueSoFar[T] + UB[K-T] <= BestPathValue, you can stop that branch.
There may be better ways, but this should be sufficient for reasonably sized matrices and path lengths.
Game or puzzle. Given a matrix, number of steps and a sum, find the path.
Would be nice if there is a real world application for this, but i haven't found it.
Games tend to "burn in" knowledge in young brains, so why not burn in something useful, like addition?
#include<iostream>
#include<climits>
#define R 3
#define C 3
int MAX(int x, int y, int z);
int Max_Cost(int cost[R][C], int m, int n)
{
if (n < 0 || m < 0)
return INT_MIN;
else if (m == 0 && n == 0)
return cost[m][n];
else
return cost[m][n] + MIN( Max_Cost(cost, m-1, n-1),
Max_Cost(cost, m-1, n),
Max_Cost(cost, m, n-1)
);
}
int MAX(int x, int y, int z)
{
return max(max(x, y), z);
}
int main()
{
int cost[R][C] = { {3, 2, 5},
{5, 8, 2},
{9, 7, 1}
};
cout<<Max_Cost(cost, 2, 1);
return 0;
}

What's the fastest way to divide an integer by 3?

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
This is the fastest as the compiler will optimize it if it can depending on the output processor.
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values, for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it to the left by a power of 2 to that factor divided by 3.
eg.
Range 0 -> 768
you could use shifting of 10 bits, which multiplying by 1024, you want to divide by 3 so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10
(Make sure the shift is a signed shift if using signed integers), also make sure the shift is an actually shift and not a bit ROLL
This will effectively divide the value 3, and will run at about 1.6 times the speed as a natural divide by 3 on a standard x86 / x64 CPU.
Of course the only reason you can make this optimization when the compiler cant is because the compiler does not know the maximum range of X and therefore cannot make this determination, but you as the programmer can.
Sometime it may even be more beneficial to move the value into a larger value and then do the same thing, ie. if you have an int of full range you could make it an 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing, i needed to find the average of 3 color channels, each color channel with a byte range (0 - 255). red green and blue.
At first i just simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds, its amazing what can be done with a little ingenuity.
This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C int has no known size (it can vary by platform and compiler!), so you are better using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that you C compiler knows about 64 bit integer numbers (NOTE: Even on a 32 bit CPU architecture most C compilers can handle 64 bit integers just fine):
static inline uint32_t divby3 (
uint32_t divideMe
) {
return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, but the method above indeed does divide by 3. All it needs for doing so is a single 64 bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64 bit application this code will be a lot faster than in a 32 bit application (in a 32 bit application multiplying two 64 bit numbers take 3 multiplications and 3 additions on 32 bit values) - however, it might be still faster than a division on a 32 bit machine.
On the other hand, if your compiler is a very good one and knows the trick how to optimize integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely or expect that it will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
Problem is that it only works for constant numbers, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases) and both is different depending on the number you want to divide by and both take too much CPU time to calculate them on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where one second more or less compile time plays hardly a role).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
return x*12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example if you run it on for example 11, it returns 6148914691236517209. This looks like a garbage but it's in fact the correct answer: multiply it by 3 and you get back the 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64 bit unsigned arithmetic is a modulo 2^64 arithmetic.
This means for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse which you can use to multiply with instead of division. This magic number can be obtained by solving the 3*x + 2^64*y = 1 equation using the Extended Euclidean Algorithm.
What if you really don't want to multiply or divide? Here is is an approximation I just invented. It works because (x/3) = (x/4) + (x/12). But since (x/12) = (x/4) / 3 we just have to repeat the process until its good enough.
#include <stdio.h>
void main()
{
int n = 1000;
int a,b;
a = n >> 2;
b = (a >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
printf("a=%d\n", a);
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster but if you want to use a bitwise operator to perform binary division you can use the shift and subtract method described at this page:
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatentate 1 to the right hand end of the quotient
Else concatentate 0 to the right hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
For really large integer division (e.g. numbers bigger than 64bit) you can represent your number as an int[] and perform division quite fast by taking two digits at a time and divide them by 3. The remainder will be part of the next two digits and so forth.
eg. 11004 / 3 you say
11/3 = 3, remaineder = 2 (from 11-3*3)
20/3 = 6, remainder = 2 (from 20-6*3)
20/3 = 6, remainder = 2 (from 20-6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
int remainder = 0;
var res = new List<int>();
for (int i = 0; i < a.Length; i++)
{
var val = remainder + a[i];
var div = val/3;
remainder = 10*(val%3);
if (div > 9)
{
res.Add(div/10);
res.Add(div%10);
}
else
res.Add(div);
}
if (res[0] == 0) res.RemoveAt(0);
return res;
}
If you really want to see this article on integer division, but it only has academic merit ... it would be an interesting application that actually needed to perform that benefited from that kind of trick.
Easy computation ... at most n iterations where n is your number of bits:
uint8_t divideby3(uint8_t x)
{
uint8_t answer =0;
do
{
x>>=1;
answer+=x;
x=-x;
}while(x);
return answer;
}
A lookup table approach would also be faster in some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
uint8_t ai8Div3 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, ....];
return ai8Div3[u8Operand];
}