Non deterministic behavior of native_ functions in OpenCL kernel - gpu

TL;DR: Using native_log2() produces non-deterministic behavior in an OpenCL kernel, while using log2() produces deterministic behavior. Why is this happening?
I have the function below acting as a helper function for an OpenCL kernel, and I was using the native_ version of log2 (native_log2) to improve performance.
When I compared the results produced by the kernel with those of the original program, I realized that in most cases the kernel produces the right values, but sometimes it produces an incorrect value (around 30 incorrect values in 500k function calls). VERY IMPORTANT: The errors are not always on the same computations. I am processing multiple input files, and the errors seem to occur randomly in different sets of files on different runs. That is, the results are non-deterministic.
After some tests I narrowed the problem down to the function below and found that replacing native_log2 with log2 produces the correct value 100% of the time. All those typecasts look ugly, but the log2() and floor() functions are only compatible with double/float, while my input/output must be integers.
My device is an NVIDIA 940MX GPU and only supports OpenCL 1.2. The OpenCL 1.2 documentation states that
A subset of functions from table 6.8 that are defined with the native_ prefix. These functions may map to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the native__ prefix) described in table 6.8. The accuracy (and in some cases the input range(s)) of these functions is implementation-defined.
Clearly I am supposed to expect some errors when using native_ functions, but the documentation is not clear about the determinism of the errors I may be encountering.
Can someone give me directions on why I am facing this strange behavior?
int xGetExpGolombNumberOfBits(int value){
    unsigned int uiLength2 = 1;
    unsigned int uiTemp2 = select((unsigned int)( value << 1 ), ( (unsigned int)( -value ) << 1 ) + 1, value <= 0);
    // These magic numbers (7 and 128) are substituting two constants for the sake of clarity
    while( uiTemp2 > 128 )
    {
        uiLength2 += ( 7 << 1 );
        uiTemp2 >>= 7;
    }
    return uiLength2 + (((int)floor(native_log2((float)uiTemp2))) << 1);
}
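Since only floor(log2(x)) of a small integer is needed here, one way to sidestep floating-point accuracy altogether is to compute it with integer operations. The following is a minimal sketch in plain C rather than OpenCL C; floor_log2_u32 and exp_golomb_bits are hypothetical names, and the ternary stands in for select(). Inside a kernel, 31 - clz(x) should give the same value for a nonzero 32-bit x.

```c
#include <assert.h>

/* floor(log2(x)) for x >= 1, using only integer shifts (no rounding involved). */
static unsigned int floor_log2_u32(unsigned int x)
{
    unsigned int result = 0;
    assert(x != 0u); /* log2(0) is undefined */
    while (x >>= 1) {
        ++result;
    }
    return result;
}

/* Same arithmetic as the helper in the question, minus the float round trip. */
int exp_golomb_bits(int value)
{
    unsigned int length = 1;
    /* Non-positive values map to odd codes, positive values to even codes. */
    unsigned int temp = (value <= 0) ? ((0u - (unsigned int)value) << 1) + 1u
                                     : ((unsigned int)value << 1);
    while (temp > 128u) {
        length += 7u << 1;
        temp >>= 7;
    }
    return (int)(length + (floor_log2_u32(temp) << 1));
}
```

Because every step is an integer operation, the result cannot vary between runs or devices the way an implementation-defined approximation might.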

Related

Approximation using gmp mpf_class

I am writing a UnitTest using Catch2.
I want to check if two vectors are equal. They look like the following using gmplib:
std::vector<mpf_class> result
Due to me 'faking' the expected_result vector, I get the following message after a failed test:
unittests/test.cpp:01: FAILED:
REQUIRE( actual_result == expected_result )
with expansion:
{ 0.5, 0.166667, 0.166667, 0.166667 }
==
{ 0.5, 0.166667, 0.166667, 0.166667 }
So I was looking for a function that could do an approximation for me.
I just wasn't successful in finding a solution that worked out for me.
I found some Comparison Functions but they do not work on my project.
EDIT:
The "minimal, reproducible example" would simply be:
TEST_CASE("DemoTest") {
    // simplified:
    mpf_class a = 1;
    mpf_class b = 6;
    mpf_class actual_result = a / b;
    mpf_class expected_result = 0.16666666667;
    REQUIRE(actual_result == expected_result);
}
The "only" difference to my real application is that the results are stored in vectors. But because I am only "faking" the result by saying it is "0.1666666667" it probably doesn't fit the == anymore. So I need a function that takes an approximation and compares the range like epsilon = +-0.001.
Edit:
After implementing the solution @Arc suggested, it worked well until I had some values that were not completely "even".
So I have a failure with the following values:
actual 0.16666666666666666666700000000000000000000000000000
expected 0.16666666666666665741500000000000000000000000000000
Even though my "expected" value looks like this:
mpf_class expected = 0.16666666666666666666700000000000000000000000000000
Getting back to my original question: is there a way to compare the numbers approximately, with an epsilon of something like +-0.0001, or what would be the best way to fix this issue?
First, we need to see some Minimal, Reproducible Example to be sure of what is happening. You can for example cut down some code from your test.cpp until you are left with just a few lines of code, but the issue still happens. Also, please provide compilation and running instructions. Frequently, a little bit of explanation on what your goals are may also help. As Catch2 is available on GitHub you don't need to provide it.
Without seeing the code, my best guess is that your code is trying to compare the mpf_t values inside the mpf_class objects using the == operator, which I'm afraid has not been overloaded (see here). You should compare mpf_t values with the cmp function, since the C type mpf_t is actually a struct containing the pointer to the actual significand limbs. Check some usage examples in the tests/cxx/ directory of GMP (like here).
I note you are using GNU MP 4.1 version which is very old, you probably want to move to the 6.2.1 latest version if possible. Also, for using floats it's recommended that you use the GNU MPFR library instead of GMP floats.
EDIT: I did not yet manage to run Catch2, but the issue with your code is the expected_result is actually not equal to the actual_result. In GMP mpf_t variables are created with a 64-bit significand precision (on 64-bit machines), so that the division a / b actually results in a binary that prints 0.166666666666666666667 (that's 19 sixes after the digit 1). Try printing the result with gmp_printf("%.50Ff\n", actual_result);, because the standard cout output will only give you the value rounded to 6 digits: 0.166667.
But the problem is that you can't just assign this with expected_result = 0.166666666666666666667, because in C/C++ floating-point constants are parsed as double, so you have to use the string assignment overload to get more precision.
You also can't easily (or, in general, justifiably) craft a decimal string that converts to exactly the same binary value as a / b, because decimal-to-float conversion has subtleties; see for example here and here.
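To see the double-parsing effect in isolation, here is a small sketch in plain C with no GMP involved. The helper name prints_as is hypothetical, and it assumes IEEE 754 doubles and a printf implementation that prints exact digits (as glibc does). The 21-digit literal collapses to the nearest double, which is the same double as 1.0 / 6.0, so the extra digits typed in the source are lost before any library sees them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `value`, printed with 20 decimal places, equals `expected_text`. */
int prints_as(double value, const char *expected_text)
{
    char buffer[64];
    assert(expected_text != NULL);
    snprintf(buffer, sizeof buffer, "%.20f", value);
    return strcmp(buffer, expected_text) == 0;
}
```

With those assumptions, prints_as(0.166666666666666666667, "0.16666666666666665741") holds, which is the same digit pattern that shows up in the failing "expected" value.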
So, it all depends on your application and the kind of numerical validation you aim to do. If you know that your decimal validation values are correct to some known precision, and if you set the mpf_t variables to a sufficient precision (using for example mpf_set_prec), then you can use a tolerance comparison.
In C++ (without Catch2), it works like this:
#include <iostream>
#include <gmpxx.h>
using namespace std;
int main (void)
{
    mpf_class a = 1;
    mpf_class b = 6;
    mpf_class actual = a / b;
    mpf_class expected;
    mpf_class tol;
    expected = "0.166666666666666666666666666666667";
    tol = "1e-30";
    cout << "actual " << actual << "\n";
    cout << "expected " << expected << "\n";
    gmp_printf("actual %.50Ff\n", actual);
    gmp_printf("expected %.50Ff\n", expected);
    gmp_printf("tol %.50Ff\n", tol);
    mpf_class diff = expected - actual;
    gmp_printf("diff %.50Ff\n", diff);
    if (abs(actual - expected) < tol)
        cout << "ok\n";
    else
        cout << "nop\n";
    return 0;
}
And compile with -lgmpxx -lgmp options.
It produces the output:
actual 0.166667
expected 0.166667
actual 0.16666666666666666666700000000000000000000000000000
expected 0.16666666666666666666700000000000000000000000000000
tol 0.00000000000000000000000000000100000000000000000000
diff 0.00000000000000000000000000000000033333529249058470
ok
If I understand Catch2 correctly, it should work if you assign expected_result from a string and then compare with REQUIRE(abs(actual - expected) < tol).
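Since the real test stores its results in vectors, the same tolerance check can be applied element by element. The sketch below shows only the shape of that loop in plain C, with double standing in for mpf_class and a hypothetical helper name all_within_tolerance; with GMP the subtraction and comparison would use the mpf_class operators instead.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every |actual[i] - expected[i]| is strictly below tol, else 0. */
int all_within_tolerance(const double *actual, const double *expected,
                         size_t count, double tol)
{
    assert(tol > 0.0);
    for (size_t i = 0; i < count; ++i) {
        double diff = actual[i] - expected[i];
        if (diff < 0.0) {
            diff = -diff; /* absolute value without pulling in the math library */
        }
        if (diff >= tol) {
            return 0;
        }
    }
    return 1;
}
```

The tolerance has to be chosen larger than the error of the "faked" expected values: 0.16666666667 is only good to about 1e-11, so a tolerance of 1e-9 passes while 1e-15 fails.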

What happens when adding or multiplying an integer exceeds its limit

What will happen when the integer crosses its limit? The output is 3595; how does that come about? And is int a 2-byte type here?
#include<stdio.h>
#include<conio.h>
void main()
{
    int n=12,res=1;
    clrscr();
    while(n>3)
    {
        n+=3;
        res*=3;
    }
    printf("%d",n*res);
    getch();
}
The program will have undefined behavior.
The loop condition you gave never becomes false in a well-defined way: n only grows, so the loop can end only after n has overflowed.
You keep adding to n and multiplying res until they overflow. The loop stops only if the overflow happens to leave n negative or <= 3, and by that time res has overflowed as well. As a result, we can't be sure how this program behaves or what the result will be.
The behaviour is undefined, so you should not rely on anything specific. Common manifestations of int overflow are:
Wraparound such that 1 + INT_MAX becomes INT_MIN. This is what every Windows PC I have encountered does. The bit pattern produced by the operation matches the unsigned cousin exactly.
Clamping such that 1 + INT_MAX becomes INT_MAX. I last observed this on a machine (with signed magnitude int) running a variant of UNIX in the 1990s.
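For what it's worth, the 3595 in the question is consistent with the wraparound case on a 2-byte int. The sketch below emulates that in portable C by doing the arithmetic in uint16_t and reinterpreting the bits as signed. It assumes a 16-bit two's-complement int that wraps silently, which the C standard does not guarantee, and LoopResult, as_signed16 and simulate_16bit_loop are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int16_t n;        /* value of n when the loop exits */
    int16_t res;      /* value of res when the loop exits */
    int16_t product;  /* n * res, truncated to 16 bits */
    long iterations;  /* how many times the loop body ran */
} LoopResult;

/* Reinterprets 16 bits as a two's-complement signed value. */
static int16_t as_signed16(uint16_t bits)
{
    return (int16_t)(bits >= 0x8000u ? (int32_t)bits - 65536 : (int32_t)bits);
}

/* Replays the loop from the question with explicit 16-bit wraparound. */
LoopResult simulate_16bit_loop(void)
{
    uint16_t n = 12, res = 1;
    LoopResult out;
    out.iterations = 0;
    while (as_signed16(n) > 3) {
        n = (uint16_t)(n + 3u);     /* wraps modulo 65536 */
        res = (uint16_t)(res * 3u); /* wraps modulo 65536 */
        out.iterations++;
    }
    out.n = as_signed16(n);
    out.res = as_signed16(res);
    out.product = as_signed16((uint16_t)((uint32_t)n * res));
    assert(out.n <= 3); /* the only way the loop can exit */
    return out;
}
```

Under that assumption the loop runs 10919 times, n wraps from 32766 to -32767, and the truncated product is 3595. A different compiler or optimization level is free to do something else entirely.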

OpenCL Alternative Modulo Uses, Advice

There is this simple function which I have used with C++ in the past to simulate simple forms of tessellation. The function takes a number and a divisor. The divisor must be (a power of two - 1) and n should be between 0 and divisor. It returns a modulus result of n % (d+1) using bitwise &.
Fairly sure the function goes like:
unsigned int BitwiseMod(unsigned int n, unsigned int d){ return n & d; }
I want to use this effectively in OpenCL and am wondering if it will work as I imagine it to. In my mind, modulus is a very expensive operation on the GPU, but I am familiar with using it to form magnitude spaces and other techniques to travel through data.
More often, I would be more likely to simply write this assuming functions have some overhead.
x[i] = 8*(i&d)+offset[i]; //OR in other contexts,...
num = (i&d)+offset[i]; // parentheses needed: + binds tighter than &
x[num] = data;
The question is: Will this be useful or get in the way, if useful can you give me some examples where I might try to apply it.
On NVidia's architectures, GT200 and up, Modulo isn't particularly slow, not slower than a normal integer divide. See this paper for details.
However, using a bitwise AND is still quite a lot faster. As function calls are expensive on GPUs, OpenCL compilers aggressively use inlining to improve performance by default. You should be fine with a function call, as it will be inlined.
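The identity behind the trick only holds when d + 1 is a power of two, so it can be worth checking the mask once. A minimal sketch in plain C (the same expressions should be valid in an OpenCL kernel); is_valid_mask and bitwise_mod are hypothetical names, not from any library.

```c
#include <assert.h>

/* A mask of the form 2^k - 1 has no bits in common with mask + 1. */
static int is_valid_mask(unsigned int mask)
{
    return (mask & (mask + 1u)) == 0u;
}

/* Computes n % (mask + 1) for masks of the form 2^k - 1. */
static unsigned int bitwise_mod(unsigned int n, unsigned int mask)
{
    assert(is_valid_mask(mask));
    return n & mask;
}
```

For example, bitwise_mod(13, 7) gives 5, the same as 13 % 8, whereas a mask such as 6 would silently give wrong answers without the check.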

Reducing Number of Registers Used in CUDA Kernel

I have a kernel which uses 17 registers; reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, short of completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?
Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)
Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.
In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't randomly access local arrays (which would result in non-coalesced accesses).
To answer your specific question about reducing registers, post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
It's really hard to say, nvcc compiler is not very smart in my opinion.
You can try obvious things, for example using short instead of int, passing and using variables by reference (e.g. &variable), unrolling loops, using templates (as in C++). If you have divisions or transcendental functions being applied in sequence, try to turn them into a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
If you post some code, maybe you will get specific answers.
Using shared memory as a cache may lead to lower register usage and prevent register spilling to local memory.
Think that the kernel calculates some values and these calculated values are used by all of the threads,
__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;
    int reg = id0 * ...;
    int reg0 = reg * a / x + y;
    ...
    int val = reg + reg0 + 2 * idx;
    output[idx] = val > 10;
}
So, instead of keeping reg and reg0 in registers and risking that they spill out to local memory (which physically resides in global memory), we may use shared memory.
__global__ void kernel(...) {
    __shared__ int cache[10];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadIdx.x == 0) {
        int id0 = blockDim.x * blockIdx.x;
        cache[0] = id0 * ...;
        cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();
    ...
    int val = cache[0] + cache[1] + 2 * idx;
    output[idx] = val > 10;
}
Take a look at this paper for further information.
It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes into account lots of factors, including register usage.
How does it work when reducing registers caused slower speed?
Most probably the compiler had to spill the register data that no longer fit into "local" memory, which is essentially the same as global memory, and thus very slow.
For optimization purposes I would recommend using keywords like const, volatile and so on where necessary, to help the compiler in the optimization phase.
Anyway, it is usually not tiny issues like registers that make CUDA kernels run slowly. I'd recommend optimizing the work with global memory: the access pattern, caching in texture memory if possible, and transactions over PCIe.
The increase in instruction count when lowering register usage has a simple explanation. The compiler may be using registers to store the results of operations that are used more than once in your code, in order to avoid recalculating those values. When forced to use fewer registers, the compiler decides to recalculate the values that would otherwise be stored in registers.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
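As a concrete illustration of the read/write/execute example, here is a minimal sketch in C. The PERM_* constants and the helper names are hypothetical, chosen to mirror the values in the paragraph above.

```c
#include <assert.h>

/* One bit per privilege, matching read = 4, write = 2, execute = 1. */
enum {
    PERM_EXECUTE = 1 << 0,
    PERM_WRITE   = 1 << 1,
    PERM_READ    = 1 << 2
};

/* Returns 1 if every bit of `flag` is set in `mask`. */
static int has_permission(unsigned int mask, unsigned int flag)
{
    assert(flag != 0u);
    return (mask & flag) == flag;
}

/* Returns `mask` with the bits of `flag` cleared. */
static unsigned int revoke_permission(unsigned int mask, unsigned int flag)
{
    return mask & ~flag;
}
```

Combining all three with bitwise OR gives 7; OR is safer than addition because adding the same flag twice would corrupt the value.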
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid (offset is unsigned, so only the upper bound matters;
    // this form also rejects every offset when num_bytes is 0)
    if (offset >= (num_bytes << 3)) { return -1; }
    // set the right bit
    bytes[offset >> 3] |= (1 << (offset & 0x7));
    return 0; // success
}
// gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
    // make sure offset is valid
    if (offset >= (num_bytes << 3)) { return -1; }
    // get the right bit
    return (bytes[offset >> 3] & (1 << (offset & 0x7)));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. Think of it like an "array of booleans"
(journalling masks in flash memory, if you want to know).
Many compilers know how to do this for you. Add a bit of OO code to have types that operate sensibly, and then your code starts looking like its intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1; 64 is only 2^6. So yes, there is a limit, but if you need more than 64 bits' worth of flags, I'd be very interested to know what they were all doing :)
How many states do you potentially need to think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of 64-bit bit vectors would suffice (2 x 64 bits).
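A rough sketch of what such a pair of 64-bit words could look like in C, with the word index and bit index derived from the flag number. FlagSet and the flag_* helpers are hypothetical names; the same layout extends to any multiple of 64 by changing FLAG_COUNT.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_COUNT 128u

/* 128 flags stored as two 64-bit words. */
typedef struct {
    uint64_t words[FLAG_COUNT / 64u];
} FlagSet;

/* Turns flag `index` on. */
void flag_set(FlagSet *set, unsigned int index)
{
    assert(index < FLAG_COUNT);
    set->words[index / 64u] |= (uint64_t)1 << (index % 64u);
}

/* Turns flag `index` off. */
void flag_clear(FlagSet *set, unsigned int index)
{
    assert(index < FLAG_COUNT);
    set->words[index / 64u] &= ~((uint64_t)1 << (index % 64u));
}

/* Returns 1 if flag `index` is on, 0 otherwise. */
int flag_test(const FlagSet *set, unsigned int index)
{
    assert(index < FLAG_COUNT);
    return (int)((set->words[index / 64u] >> (index % 64u)) & 1u);
}
```

Bitwise comparisons between two such sets still work, but they have to be done word by word, for example by ANDing words[0] with words[0] and words[1] with words[1].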
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages (I believe Perl does, not sure) permit bitwise arithmetic on strings, giving you a much greater effective range (strlen * 8-bit chars combinations).
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
(PHP uses this system to control its error messages, and I have already found that it's a bit over-the-top when you have to define values where PHP's constants are not resident and you have to generate the integer by hand. To be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it, because I can never remember the magic numbers.)
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or a BLOB) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.