I have some software written in VB.NET that performs a lot of calculations, mostly extracting jpegs to bitmaps and computing calculations on the pixels like convolutions and matrix multiplication. Different computers are giving me different results despite having identical inputs. What might be the reason?
Edit: I can't provide the algorithm because it's proprietary but I can provide all the relevant operations:
ULong \ ULong (Turuncating division)
Bitmap.Load("filename.bmp') (Load a bitmap into memory)
Bitmap.GetPixel(Integer, Integer) (Get a pixel's brightness)
Double + Double
Double * Double
Math.Sqrt(Double)
Math.PI
Math.Cos(Double)
ULong - ULong
ULong * ULong
ULong << ULong
List.OrderBy(Of Double)(Func)
Hmm... Is it possible that OrderBy is using a non-stable QuickSort and that QuickSort is using a random pivot? Edit: Just tested, nope. The sort is stable.
Turns out that Bitmap.Load("filename.jpeg") doesn't always produce the same bitmap on each computer. I still don't know why that is, however.
one or more bugs in the software (e.g uninitialised variables) ?
old Intel CPU floating point division bug ?
numerically unstable algorithm ?
Screen Drivers - Each Driver will GUI the values differently. While the pixel count is same the color depth may differ via the screen drivers. Now setup into an array and compare that array on those machines you may see a difference of several bytes.
I would print$ the totals and see what they add up to
Related
To all homomorphic encryption experts out there:
I'm using the PALISADE library:
int plaintextModulus = 65537;
float sigma = 3.2;
SecurityLevel securityLevel = HEStd_128_classic;
uint32_t depth = 2;
//Instantiate the crypto context
CryptoContext<DCRTPoly> cc = CryptoContextFactory<DCRTPoly>::genCryptoContextBFVrns(
plaintextModulus, securityLevel, sigma, 0, depth, 0, OPTIMIZED);
could you please explain (all) the parameters especially intrested in ptm, depth and sigma.
Secondly I am trying to make a Packed Plaintext with the cc above.
cc->MakePackedPlaintext(array);
What is the maximum size of the array? On my local machine (8GB RAM) when the array is larger than ~8000 int64 I get an free(): invalid next size (normal) error
Thank you for asking the question.
Plaintext modulus t (denoted as t here) is a critical parameter for BFV as all operations are performed mod t. In other words, when you choose t, you have to make sure that all computations do not wrap around, i.e., do not exceed t. Otherwise you will get an incorrect answer unless your goal is to compute something mod t.
sigma is the distribution parameter (used for the underlying Learning with Errors problem). You can just set to 3.2. No need to change it.
Depth is the multiplicative depth of the circuit you are trying to compute. It has nothing to with the size of vectors. Basically, if you have AxBxCxD, you have a depth 3 with a naive approach. BFV also supports more efficient binary tree evaluation, i.e., (AxB)x(CxD) - this option will reduce the depth to 2.
BFV is a scheme that supports packing. By default, the size of packed ciphertext is equal to the ring dimension (something like 8192 for the example you mentioned). This means you can pack up to 8192 integers in your case. To support larger arrays/vectors, you would need to break them into batches of 8192 each and encrypt each one separately.
Regarding your application, the CKKS scheme would probably be a much better option (I will respond on the application in more detail in the other thread).
I have some experience with the SEAL library which also uses the BFV encryption scheme. The BFV scheme uses modular arithmetic and is able to encrypt integers (not real numbers).
For the parameters you're asking about:
The Plaintext Modulus is an upper bound for the input integers. If this parameter is too low, it might cause your integers to overflow (depending on how large they are of course)
The Sigma is the distribution parameter for Gaussian noise generation
The Depth is the circuit depth which is the maximum number of multiplications on a path
Also for the Packed Plaintext, you should use vectors not arrays. Maybe that will fix your problem. If not, try lowering the size and make several vectors if necessary.
You can determine the ring dimension (generated by the crypto context based on your parameter settings) by using cc->GetRingDimension() as shown in line 113 of https://gitlab.com/palisade/palisade-development/blob/master/src/pke/examples/simple-real-numbers.cpp
Do modern GPUs optimize multiplication by powers of 2 by doing a bit shift? For example suppose I do the following in a shader:
float t = 0;
t *= 16;
t *= 17;
Is it possible the first multiplication will run faster than the second?
Floating point multiplication cannot be done by bit shift. Howerver, in theory floating point multiplication by power of 2 constants can be optimized. Floating point value is normally stored in the form of S * M * 2 ^ E, where S is a sign, M is mantissa and E is exponent. Multiplying by a power of 2 constant can be done by adding/substracting to the exponent part of the float, without modifying the other parts. But in practice, I would bet that on GPUs a generic multiply instruction is always used.
I had an interesting observation regarding the power of 2 constants while studying the disassembly output of the PVRShaderEditor (PowerVR GPUs). I have noticed that a certain range of power of 2 constants ([2^(-16), 2^10] in my case), use special notation, e.g. C65, implying that they are predefined. Whereas arbitrary constants, such as 3.0 or 2.3, use shared register notation (e.g. SH12), which implies they are stored as a uniform and probably incur some setup cost. Thus using power of 2 constants may yield some optimizational benefit at least on some hardware.
From this previous post: strategy-for-doing-final-reduction, I would like to know the last functionalities offered by OpenCL 2.x (not 1.x which is the subject of this previous post above), especially about the atomic functions which allow to perform reductions of a array (in my case a sum reduction).
One told me that performances of OpenCL 1.x atomic functions (atom_add) were bad and I could check it, so I am looking for a way to get the best performances for a final reduction function (i.e the sum of each computed sum corresponding to each work-group).
I recall the typical kind of kernel code that I am using for the moment :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you tell me please how to perform a second and final reduction of this array, with the best performances possibles of functions availables with OpenCL 2.x . A classic solution is to perform this final reduction with CPU but ideally, I would like to do it directly with kernel code.
A suggestion of code snippet is welcome.
A last point, I am working on MacOS High Sierra 10.13.5 with the following model :
Can OpenCL 2.x be installed on my hardware MacOS model ?
Atomic functions should be avoided because they do harm performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do a single invocation for large amounts of data as there is no way to synchronize between work groups.
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocation together is through OpenCL events.
What is the cost of casting a variable to a different type in OpenCL?
Example: I want to take dot product of 2 int3 vectors (AFAIK dot() isn't overloaded for int3s), so instead of implementing dot() by myself in unvectorized way, I want to vectorize the code by using the native dot() for float3. First I convert the 2 vectors to float3s and then I cast the result to int.
Which of the two functions, foo and bar, is less time consuming (and why)?
inline int foo(int3 a, int3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
inline int bar(int3 a, int3 b) {
return (int)dot(convert_float3(a), convert_float3(b));
}
As has been suggested in the comments, measuring is going to be the most useful tool in practice, and the cost of individual instructions is heavily dependent on hardware architecture, but also the compiler.
Nevertheless, a comparison to other operations is useful, and at least AMD publishes a list of the instruction throughput for their devices in this section of their OpenCL optimisation guide, and this includes float-to-int and int-to-float conversion.
In your particular case, I strongly suspect your "vectorising" attempts will have detrimental effects. Most modern GPUs aren't SIMD processors in the CPU SIMD sense. The threads run in lock-step, but each thread operates on scalars. A "horizontal" operation like a dot product may not be particularly efficient even if the GPU does use per-thread SIMD.
If you can limit the range of each of your integers to 24 bits, a series of mad24() and mul24() calls will most likely be fastest. But again - measure. Try the different options on a range of hardware, and run them lots of times, applying basic stats to make sure you aren't just seeing random variation/overhead.
A separate thing to note with regard to integer-to-float conversions is that such conversions are often "free" when you sample as floats from an image object containing integers.
I was working with PIC18f4550 and the program is critical to speed. when I multiply two floating variables it tooks the PIC about 140 cycles to perform the multiplication. I am measuring it with PIC18f4550 timer1.
variable_1 = variable_2 * variable_3; // took 140 cycles to implement
On the the other hand when I add the same two variables the PIC tooks 280 cycles to perfom the addition.
variable_1 = variable_2 + variable_3; // took 280 cycles to implement
I have seen that the number of cycles vary if the variables changed depend on their exponents.
What is the reason of those more cycles? though I was thinking the addition is more simple than multiplication.
Is there any solution?
For floating point addition, the operands need to be adjusted so that they have the same exponent before the add, and that involves shifting one of the mantissas across byte boundaries, whereas a multiply is basically multiplying the mantissas and adding the exponents.
Since the PIC apparently has a small hardware multiplier, it may not be surprising that sometimes the multiply can be faster than doing a multi-byte shift (especially if the PIC only has single bit shift instructions).
Unless a processor has direct support for it, floating point is always slow, and you should certainly consider arranging your code to use fixed point if at all possible. Getting rid of the floating point library would probably free up a lot of code space as well.