Why isn't NVRTC optimizing out my integer division and modulo operations? - optimization

I compiled a kernel in NVRTC:
__global__ void kernel_A(/* args */) {
unsigned short idx = threadIdx.x;
unsigned char warp_id = idx / 32;
unsigned char lane_id = idx % 32;
/* ... */
}
I know integer division and modulo are very costly on CUDA GPUs. However I thought this kind of division-by-power-of-2 should be optimized into bit operations, until I found it isn't:
__global__ void kernel_B(/* args */) {
unsigned short idx = threadIdx.x;
unsigned char warp_id = idx >> 5;
unsigned char lane_id = idx & 31;
/* ... */
}
it seems kernel_B just runs faster. When omitting all other codes in kernel, launching with 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2us in average, while kernel_B runs 7.4us in average. I speculate NVRTC did not optimize out the integer division and modulo.
The result is obtained on a GeForce 750 Ti, CUDA 8.0, averaged from 100 calls. The compiler options given to nvrtcCompileProgram() is -arch compute_50.
Is this expected?

Did a thorough bugsweep in the codebase. Turns out my app was built in DEBUG mode. This causes additional flags -G and -lineinfo passed to nvrtcCompileProgram()
From nvcc man page:
--device-debug (-G)
Generate debug information for device code. Turns off all optimizations.
Don't use for profiling; use -lineinfo instead.

Related

Parallel Brute Force Algorithm GPU

I have implemented a parallel BF Generator in Python like in this Post!
Parallelize brute force generation.
I want to implement this parallel technique on a GPU. Should be like a parallel BF Generator on a GPU.
Can someone help me out with some code examples for a parallel BF Generator on a GPU?
Couldn't find any examples online which made me suspicious.
Look at this implementation - i did the distribution on the GPU with this code:
void IncBruteGPU( unsigned char* theBrute, unsigned int charSetLen, unsigned int bruteLength, unsigned int incNr){
unsigned int i = 0;
while(incrementBy > 0 && i < bruteLength){
int add = incrementBy + ourBrute[i];
ourBrute[i] = add % charSetLen;
incrementBy = add / charSetLen;
i++;
}
}
call it like this:
// the Thread index number
int idx = get_global_id(0);
// the length of your charset "abcdefghi......"
unsigned int charSetLen = 26;
// the length of the word you want to brute
unsigned int bruteLength = 6;
// theBrute keeps the single start numbers of the alphabeth
unsigned char theBrute[MAX_BRUTE_LENGTH];
IncrementBruteGPU(theBrute, charSetLen, bruteLength, idx);
Good Luck!

sem_open - valgrind complains about uninitialised bytes

I have a trivial program:
int main(void)
{
const char sname[]="xxx";
sem_t *pSemaphor;
if ((pSemaphor = sem_open(sname, O_CREAT, 0644, 0)) == SEM_FAILED) {
perror("semaphore initilization");
exit(1);
}
sem_unlink(sname);
sem_close(pSemaphor);
}
When I run it under valgrind, I get the following error:
==12702== Syscall param write(buf) points to uninitialised byte(s)
==12702== at 0x4E457A0: __write_nocancel (syscall-template.S:81)
==12702== by 0x4E446FC: sem_open (sem_open.c:245)
==12702== by 0x4007D0: main (test.cpp:15)
==12702== Address 0xfff00023c is on thread 1's stack
==12702== in frame #1, created by sem_open (sem_open.c:139)
The code was extracted from a bigger project where it ran successfully for years, but now it is causing segmentation fault.
The valgrind error from my example is the same as seen in the bigger project, but there it causes a crash, which my small example doesn't.
I see this with glibc 2.27-5 on Debian. In my case I only open the semaphores right at the start of a long-running program and it seems harmless so far - just annoying.
Looking at the code for sem_open.c which is available at:
https://code.woboq.org/userspace/glibc/nptl/sem_open.c.html
It seems that valgrind is complaining about the line (270 as I look now):
if (TEMP_FAILURE_RETRY (__libc_write (fd, &sem.initsem, sizeof (sem_t)))
== sizeof (sem_t)
However sem.initsem is properly initialised earlier in a fairly baroque manner, firstly by explicitly setting fields in the sem.newsem (part of the union), and then once that is done by a call to memset (L226-228):
/* Initialize the remaining bytes as well. */
memset ((char *) &sem.initsem + sizeof (struct new_sem), '\0',
sizeof (sem_t) - sizeof (struct new_sem));
I think that this particular shenanigans is all quite optimal, but we need to make sure that all of the fields of new_sem have actually been initialised... we find the definition in https://code.woboq.org/userspace/glibc/sysdeps/nptl/internaltypes.h.html and it is this wonderful creation:
struct new_sem
{
#if __HAVE_64B_ATOMICS
/* The data field holds both value (in the least-significant 32 bytes) and
nwaiters. */
# if __BYTE_ORDER == __LITTLE_ENDIAN
# define SEM_VALUE_OFFSET 0
# elif __BYTE_ORDER == __BIG_ENDIAN
# define SEM_VALUE_OFFSET 1
# else
# error Unsupported byte order.
# endif
# define SEM_NWAITERS_SHIFT 32
# define SEM_VALUE_MASK (~(unsigned int)0)
uint64_t data;
int private;
int pad;
#else
# define SEM_VALUE_SHIFT 1
# define SEM_NWAITERS_MASK ((unsigned int)1)
unsigned int value;
int private;
int pad;
unsigned int nwaiters;
#endif
};
So if we __HAVE_64B_ATOMICS then the structure has a data field which contains both the value and the nwaiters, otherwise these are separate fields.
In the initialisation of sem.newsem we can see that these are initialised correctly, as follows:
#if __HAVE_64B_ATOMICS
sem.newsem.data = value;
#else
sem.newsem.value = value << SEM_VALUE_SHIFT;
sem.newsem.nwaiters = 0;
#endif
/* pad is used as a mutex on pre-v9 sparc and ignored otherwise. */
sem.newsem.pad = 0;
/* This always is a shared semaphore. */
sem.newsem.private = FUTEX_SHARED;
I'm doing all of this on a 64-bit system, so I think that valgrind is complaining about the initialisation of the 64-bit sem.newsem.data with a 32-bit value since from:
value = va_arg (ap, unsigned int);
We can see that value is defined simply as an unsigned int which will usually still be 32 bits even on a 64-bit system (see What should be the sizeof(int) on a 64-bit machine?), but that should just be an implicit cast to 64-bits when it is assigned.
So I think this is not a bug - just valgrind getting confused.

cudaMallocHost vs malloc for better performance shows no difference

I have gone through this site. From here I got that pinned memory using cudamallocHost gives better performance than cudamalloc. Then I use two different simple program and tested the execution time as
using cudaMallocHost
#include <stdio.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
cudaMallocHost((void **) &a_h, size);
//a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
cudaFreeHost(a_h);
cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
using malloc
#include <stdio.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
here during execution of both program, the execution time was almost similar.
Is there anything wrong in the implementation?? what is the exact difference in execution in cudamalloc and cudamallochost??
and also with each run the execution time decreases
If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying mememory is pinned. And make your copy operation large enough/long enough so that you are well above the granularity of whatever timing mechanism you are using. The various profilers such as the visual profiler and nvprof can help here.
The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.
So what's the point?
You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
Pinned memory is required to overlap a data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
I too found that just declaring cudaHostAlloc / cudaMallocHost on a piece of memory doesn't do much.
To be sure, do a nvprof with --print-gpu-trace and see whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCI2.0, you should get around 6-8gbps.
However, pinned memory is a perquisite for cudaMemcpyAsync.
After I called cudaMemcpyAsync, I shifted whatever computations I had on the host right after it. In this way you can "layer" the asynchronous memcpys with the host computations.
I was surprised that I was able to save quite a lot of time this way, it's worth a try.

ARM: saturate signed int to unsigned byte

saturating instructions saturate unsigned to unsigned or signed to signed int.
What's the best way to saturate signed 16-bit ints to unsigned byte?
In short, here's the logic
uint8_t usat8(uint8_t u8, int16_t s16)
{
s16 += u8;
if(s16 <= 0) {
return 0;
} else if(s16 >=255){
return 255;
}else{
return (uint8_t)s16;
}
}
void add_row(uint8_t * dst, uint8_t * u8, int16_t * s16)
{
for(int i=0; i<XXX; ++i)
{
dst[i] = usat8(u8[i] + s16[i]);
}
}
values of s16 are usually not much off from the [0, 255] range, e.g. it's safe to assume that abs(s16[x]) < 1000.
EDIT: I just realized that USAT16 actually saturates signed 16-bit int to unsigned integer. Simple USAT16 is the solution to the problem.
After 5 mins of thinking I have this idea (pseudo arm-asm):
sadd16 sum, s16, u8 # do two additions in parallel
orr signs, 0x1001, sum, lsr #15 # extract signs of the two 16 bit results
usat16 sum, sum, #8 # saturate both of the 16-bit sums to unsigned 8-byte range
uadd16 sum, sum, signs
this way, if sign bit was set for any of the sums the resulting sum will become 256, or 0x100. When writing back the data the shifted out 0x1 will be discarded.
Any comments, does that seem like the optimal approach, is there any better alternative?
PS. I do it for an armv6 device, no NEON or armv6t2

CUDA program causes nvidia driver to crash

My monte carlo pi calculation CUDA program is causing my nvidia driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function.Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int incircle, k;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
free(sumHost);
cudaFree(sumDev);
cudaFree(devStates);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
EDIT: According to the CUDA 4.0 curand docs(page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
x = curand_uniform(&state);
y = curand_uniform(&state);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs, there are several useful guidelines.
For those of you having a geforce GPU which does not support TCC driver there is another solution based on:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
start regedit,
navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
create new DWORD key called TdrLevel, set value to 0,
restart PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.