CUDA crashes for big data set

My computer crashes (I have to manually reset it) when I run my kernel function in a loop 600+ times (it does not crash with around 50 iterations), and I'm not sure what's causing the crash.
My main is as follows:
int main()
{
int *seam = new int [image->height];
int width = image->width;
int height = image->height;
int *fMC = (int*)malloc(width*height*sizeof(int*));
int *fNew = (int*)malloc(width*height*sizeof(int*));
for(int i=0;i<numOfSeams;i++)
{
seam = cpufindSeamV2(fMC,width,height,1);
fMC = kernel_shiftSeam(fMC,fNew,seam,width,height,nWidth,1);
for(int k=0;k<height;k++)
{
fMC[(nWidth-1)+width*k] = INT_MAX;
}
}
and my kernel wrapper is:
int* kernel_shiftSeam(int *MCEnergyMat, int *newE, int *seam, int width, int height, int x, int direction)
{
//time measurement
float elapsed_time_ms = 0;
cudaEvent_t start, stop;
//threads per block
dim3 threads(16,16);
//blocks
dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);
//MCEnergy and Seam arrays on device
int *device_MC, *device_new, *device_Seam;
//MCEnergy and Seam arrays on host
int *host_MC, *host_new, *host_Seam;
//total number of bytes in array
int size = width*height*sizeof(int);
int seamSize;
if(direction == 1)
{
seamSize = height*sizeof(int);
host_Seam = (int*)malloc(seamSize);
for(int i=0;i<height;i++)
host_Seam[i] = seam[i];
}
else
{
seamSize = width*sizeof(int);
host_Seam = (int*)malloc(seamSize);
for(int i=0;i<width;i++)
host_Seam[i] = seam[i];
}
cudaMallocHost((void**)&host_MC, size );
cudaMallocHost((void**)&host_new, size );
host_MC = MCEnergyMat;
host_new = newE;
//allocate 1D flat array on device
cudaMalloc((void**)&device_MC, size);
cudaMalloc((void**)&device_new, size);
cudaMalloc((void**)&device_Seam, seamSize);
//copy host array to device
cudaMemcpy(device_MC, host_MC, size, cudaMemcpyHostToDevice);
cudaMemcpy(device_new, host_new, size, cudaMemcpyHostToDevice);
cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);
//measure start time for cpu calculations
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
//perform gpu calculations
if(direction == 1)
{
gpu_shiftSeam<<< blocks,threads >>>(device_MC, device_new, device_Seam, width, height, x);
}
//measure end time for cpu calculations
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop );
execTime += elapsed_time_ms;
//copy out the results back to host
cudaMemcpy(newE, device_new, size, cudaMemcpyDeviceToHost);
//free memory
free(host_Seam);
cudaFree(host_MC); cudaFree(host_new);
cudaFree(device_MC); cudaFree(device_new); cudaFree(device_Seam);
//destroy event objects
cudaEventDestroy(start); cudaEventDestroy(stop);
return newE;
}
So my program crashes when I call "kernel_shiftSeam" many times. I also free the memory using cudaFree, so I don't know whether or not it's a memory leak problem. It would be great if someone could point me in the right direction.

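Could be heap problems. Try reordering the cudaFree statements in your wrapper function to be LIFO. Check the release notes of newer CUDA drivers for heap/leak fixes. On Windows, try installing Process Explorer 15.12 or newer, which shows GPU memory usage; a leaky heap is easy to spot there.
Note also that the code overwrites the pointers returned by cudaMallocHost (host_MC = MCEnergyMat; host_new = newE;), so those two pinned buffers are never released, and cudaFree is later called on ordinary host pointers. Pinned allocations are released with cudaFreeHost; leaking two of them per call is a likely cause of a failure that only appears after several hundred iterations.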

Related

STM32 Crash on Flash Sector Erase

I'm trying to write 4 uint32's of data into the flash memory of my STM32F767ZI so I've looked at some examples and in the reference manual but still I cannot do it. My goal is to write 4 uint32's into the flash and read them back and compare with the original data, and light different leds depending on the success of the comparison.
My code is as follows:
void flash_write(uint32_t offset, uint32_t *data, uint32_t size) {
FLASH_EraseInitTypeDef EraseInitStruct = {0};
uint32_t SectorError = 0;
HAL_FLASH_Unlock();
EraseInitStruct.TypeErase = FLASH_TYPEERASE_SECTORS;
EraseInitStruct.VoltageRange = FLASH_VOLTAGE_RANGE_3;
EraseInitStruct.Sector = FLASH_SECTOR_11;
EraseInitStruct.NbSectors = 1;
//EraseInitStruct.Banks = FLASH_BANK_1; // or FLASH_BANK_2 or FLASH_BANK_BOTH
st = HAL_FLASHEx_Erase(&EraseInitStruct, &SectorError);
if (st == HAL_OK) {
for (int i = 0; i < size; i += 4) {
st = HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, FLASH_USER_START_ADDR + offset + i, *(data + i)); //This is what's giving me trouble
if (st != HAL_OK) {
// handle the error
break;
}
}
}else {
// handle the error
}
HAL_FLASH_Lock();
}
void flash_read(uint32_t offset, uint32_t *data, uint32_t size) {
for (int i = 0; i < size; i += 4) {
*(data + i) = *(__IO uint32_t*)(FLASH_USER_START_ADDR + offset + i);
}
}
int main(void) {
uint32_t data[] = {'a', 'b', 'c', 'd'};
uint32_t read_data[] = {0, 0, 0, 0};
HAL_Init();
SystemClock_Config();
MX_GPIO_Init();
flash_write(0, data, sizeof(data));
flash_read(0, read_data, sizeof(read_data));
if (compareArrays(data,read_data,4))
{
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_7,SET);
}
else
{
HAL_GPIO_WritePin(GPIOB, GPIO_PIN_14,SET);
}
return 0;
}
The problem is that I must erase a sector before writing data, and when I do so with HAL_FLASHEx_Erase(&EraseInitStruct, &SectorError), the program always crashes and sometimes even corrupts my code space, forcing me to update the firmware.
I've selected the sector farthest from the code space, but it still crashes when I try to erase it.
I've read in the reference manual that
Any attempt to read the Flash memory while it is being written or erased, causes the bus to
stall. Read operations are processed correctly once the program operation has completed.
This means that code or data fetches cannot be performed while a write/erase operation is
ongoing.
which I believe means the code should ideally be run from RAM while we operate on the flash, but I've seen other people online not have this issue so I'm wondering if that's the only problem I have. With that in mind I wanted to confirm if this is my only issue, or if I'm doing something wrong?
In your loop, you are adding multiples of 4 to i, but then you are adding i to data. When you add to a pointer it is automatically multiplied by the size of the pointed type, so you are adding multiples of 16 bytes and reading past the end of your input buffer.
Also, make sure you initialize all members of EraseInitStruct. Uncomment that line and set the correct value!
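The pointer-arithmetic point can be checked on a host PC. The sketch below is plain C with no HAL calls; byte_offset_of and copy_words are hypothetical helpers introduced only for illustration. It counts source elements by word while advancing the destination by 4 bytes per word, which is presumably what the flash loop intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte distance between data and data + i. Pointer arithmetic scales by
 * sizeof(uint32_t), so i = 4 means 16 bytes, not 4.
 * i must not exceed the number of elements in the array. */
static size_t byte_offset_of(const uint32_t *data, size_t i)
{
    return (size_t)((const unsigned char *)(data + i) -
                    (const unsigned char *)data);
}

/* Copy size_bytes bytes one 32-bit word at a time.
 * The word index w counts elements; the byte offset is w * 4. */
static void copy_words(unsigned char *dst, const uint32_t *src, size_t size_bytes)
{
    assert(size_bytes % sizeof(uint32_t) == 0);
    size_t words = size_bytes / sizeof(uint32_t);
    for (size_t w = 0; w < words; w++) {
        /* This is the amount a flash address would advance by. */
        size_t byte_off = w * sizeof(uint32_t);
        memcpy(dst + byte_off, &src[w], sizeof(uint32_t));
    }
}
```

Applied to the question's code, this corresponds to indexing the source as data[i / 4] (or looping over a word index) while keeping the byte offset i for the flash address.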

cudaMallocHost vs malloc for better performance shows no difference

I have gone through this site, and from it I understood that pinned memory allocated with cudaMallocHost gives better performance than memory allocated with malloc. I then wrote two simple programs and measured their execution time:
using cudaMallocHost
#include <stdio.h>
#include <time.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
cudaMallocHost((void **) &a_h, size);
//a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
cudaFreeHost(a_h);
cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
using malloc
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
When I run both programs, the execution times are almost the same.
Is there anything wrong in the implementation? What exactly is the difference in execution between malloc and cudaMallocHost?
Also, the execution time decreases with each run.
If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying memory is pinned. Make your copy operation large enough and long enough that you are well above the granularity of whatever timing mechanism you are using. The various profilers, such as the Visual Profiler and nvprof, can help here.
The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.
So what's the point?
You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
Pinned memory is also required to overlap data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
I too found that just declaring cudaHostAlloc / cudaMallocHost on a piece of memory doesn't do much.
To be sure, run nvprof with --print-gpu-trace and check whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.
However, pinned memory is a prerequisite for cudaMemcpyAsync to actually run asynchronously.
After I called cudaMemcpyAsync, I shifted whatever computations I had on the host right after it. In this way you can "layer" the asynchronous memcpys with the host computations.
I was surprised at how much time I was able to save this way; it's worth a try.

Floodfill memory leak iPhone

I'm implementing a floodfill function in C for the iPhone.
The fill works, although I'm having 2 issues.
1. The phone gives a memory warning after a few executions of the code below. Most likely a memory leak. Also note that the unsigned char *data (the image data) is being free()'d at the end of the floodfill.
2. (Lesser issue) If I attempt to write RGB colors to pixels that are greater than approximately (r:200,g:200,b:200,a:200), I get weird artifacting. A workaround for this was to simply limit the values.
I suspect there may be a correlation between both of these problems.
The code below describes my flood-fill algorithm, using a stack:
.h:
typedef struct {
int red;
int green;
int blue;
int alpha;
} GUIColor;
struct pixel_st {
int x;
int y;
struct pixel_st *nextPixel;
};
typedef struct pixel_st pixel;
.m:
void floodFill(CGPoint location, GUIColor tc, GUIColor rc, size_t width, size_t height, unsigned char *data){
if (isGUIColorEqual(tc, rc)) return;
pixel* aPixel = (pixel *) malloc(sizeof (struct pixel_st));
NSLog(@"sizeof aPixel : %i",(int)sizeof(aPixel));
(*aPixel).x = location.x;
(*aPixel).y = location.y;
(*aPixel).nextPixel = NULL;
int i = 0;
NSLog(@"Replacement color A%i, R%i, G%i, B%i",rc.alpha,rc.red,rc.green, rc.blue);
while (aPixel != NULL){
pixel *oldPixel_p = aPixel;
pixel currentPixel = *aPixel;
aPixel = currentPixel.nextPixel;
//Now we do some boundary checks
if (!isOutOfBounds(currentPixel.x, currentPixel.y, width, height)){
//Grab the current Pixel color
GUIColor currentColor = getGUIColorFromPixelAtLocation(CGPointMake(currentPixel.x, currentPixel.y), width, height, data);
if (isGUIColorSimilar(currentColor, tc)){
//Colors are similar, lets continue the spread
setGUIColorToPixelAtLocation(CGPointMake(currentPixel.x, currentPixel.y), rc, width,height, data);
pixel *newPixel;
if ((newPixel = (pixel*) malloc(sizeof(struct pixel_st))) != NULL) {
(*newPixel).x = currentPixel.x;
(*newPixel).y = currentPixel.y-1;
(*newPixel).nextPixel = aPixel;
aPixel = newPixel;
}
if ((newPixel = (pixel*) malloc(sizeof(struct pixel_st))) != NULL) {
(*newPixel).x = currentPixel.x;
(*newPixel).y = currentPixel.y+1;
(*newPixel).nextPixel = aPixel;
aPixel = newPixel;
}
if ((newPixel = (pixel*) malloc(sizeof(struct pixel_st))) != NULL) {
(*newPixel).x = currentPixel.x+1;
(*newPixel).y = currentPixel.y;
(*newPixel).nextPixel = aPixel;
aPixel = newPixel;
}
if ((newPixel = (pixel*) malloc(sizeof(struct pixel_st))) != NULL) {
(*newPixel).x = currentPixel.x-1;
(*newPixel).y = currentPixel.y;
(*newPixel).nextPixel = aPixel;
aPixel = newPixel;
}
free(oldPixel_p);
i ++;
if (i == width * height * 4 * 5) break;
}
}
}
free(aPixel);
}
This implementation of the stack is based on the ObjFloodFill found here:
https://github.com/OgreSwamp/ObjFloodFill/blob/master/src/FloodFill.m
First of all, each if ((newPixel = (pixel*) malloc(... inside the loop allocates a new memory block, so you have 4 allocations inside the loop and only 1 deallocation.
Second, I can't understand why you don't simply use objects on the stack. Do you really need to allocate newPixel, oldPixel and so on on the heap? Review the implementation; there may be a much simpler way to do the same thing without any manual memory management.
You need to move the deallocation of oldPixel_p to outside the if blocks, because it is "consumed" always.
Also, the final free only frees the first element in the list. The list may have more than one element. You need to step through the list and free all remaining elements.
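The last two fixes (free the popped node on every path, then release whatever is still queued) can be sketched with a bare linked-list stack. This is not the poster's flood fill: node, push, pop, drain, visit_grid and the live_nodes counter are hypothetical names, and the counter exists only to show that allocations and frees balance.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int x, y;
    struct node *next;
} node;

static int live_nodes = 0;   /* allocated but not yet freed (illustration only) */

/* Push a coordinate; returns the new top, or the old top if malloc fails. */
static node *push(node *top, int x, int y)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) return top;
    n->x = x; n->y = y; n->next = top;
    live_nodes++;
    return n;
}

/* Copy the top node's coordinates out, free it unconditionally,
 * and return the new top. */
static node *pop(node *top, int *x, int *y)
{
    node *next = top->next;
    *x = top->x; *y = top->y;
    free(top);
    live_nodes--;
    return next;
}

/* Free every node still on the stack, e.g. after an early break. */
static void drain(node *top)
{
    while (top != NULL) {
        node *next = top->next;
        free(top);
        live_nodes--;
        top = next;
    }
}

/* Visit each cell of a w x h grid reachable from (sx, sy), stopping after
 * max_steps pops. Returns the number of cells visited. */
static int visit_grid(int w, int h, int sx, int sy, int max_steps)
{
    unsigned char *seen = calloc((size_t)w * (size_t)h, 1);
    if (seen == NULL) return 0;
    int visited = 0, steps = 0;
    node *top = push(NULL, sx, sy);
    while (top != NULL && steps < max_steps) {
        int x, y;
        top = pop(top, &x, &y);   /* freed here, whatever happens next */
        steps++;
        if (x < 0 || y < 0 || x >= w || y >= h) continue;
        if (seen[y * w + x]) continue;
        seen[y * w + x] = 1;
        visited++;
        top = push(top, x, y - 1);
        top = push(top, x, y + 1);
        top = push(top, x + 1, y);
        top = push(top, x - 1, y);
    }
    drain(top);                   /* release anything left after an early stop */
    free(seen);
    return visited;
}
```

Because pop frees the node before any bounds or color check runs, out-of-bounds and non-matching pixels no longer leak, and drain covers the case where the loop exits on the iteration limit.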

Using memcpy and malloc resulting in corrupted data stream

The code below attempts to save a data stream to a file using fwrite. The version that uses malloc works, but with the version that uses memcpy the data stream is 70% corrupted. Can someone explain why the memcpy version is corrupted and how I can remedy it?
short int fwBuffer[1000000];
// short int *fwBuffer[1000000];
unsigned long fwSize[1000000];
// Not Working *********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int tmpbuffer[length*inchannels];
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
memcpy(&fwBuffer[saveBufferCount], tmpbuffer, sizeof(tmpbuffer));
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Working ***********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int *tmpbuffer = (short int*)malloc(size);
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
fwBuffer[saveBufferCount] = tmpbuffer;
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Write to file ***********
for (int i = 0; i < saveBufferCount; i++) {
if (isRecording && outFile != NULL) {
// fwrite(fwBuffer[i], 1, fwSize[i],outFile);
fwrite(&fwBuffer[i], 1, fwSize[i],outFile);
if (fwBuffer[i] != NULL) {
// free(fwBuffer[i]);
}
fwBuffer[i] = NULL;
}
}
You initialize your size as
size = sizeof(short int) * length * inchannels;
then you declare an array of size
short int tmpbuffer[size];
This is already highly suspect. Why did you include sizeof(short int) into the size and then declare an array of short int elements with that size? The byte size of your array in this case is
sizeof(short int) * sizeof(short int) * length * inchannels
i.e. the sizeof(short int) is factored in twice.
Later you initialize only length * inchannels elements of the array, which is not the entire array, for the reasons described above. But the memcpy that follows still copies the entire array
memcpy(&fwBuffer[saveBufferCount], &tmpbuffer, sizeof (tmpbuffer));
(Tail portion of the copied data is garbage). I'd suspect that you are copying sizeof(short int) times more data than was intended. The recipient memory overflows and gets corrupted.
The version based on malloc does not suffer from this problem since malloc-ed memory size is specified in bytes, not in short int-s.
If you want to simulate the malloc behavior in the upper version of the code, you need to declare your tmpbuffer as an array of char elements, not of short int elements.
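The double counting described above can be seen directly with sizeof. This is a minimal sketch, assuming a C99 compiler with variable-length array support; bytes_if_sized_by_bytes and bytes_if_sized_by_elements are hypothetical helper names, and the inputs should stay small because the arrays live on the stack.

```c
#include <assert.h>
#include <stddef.h>

/* Array length given as a BYTE count: sizeof(short int) is applied twice. */
static size_t bytes_if_sized_by_bytes(size_t length, size_t inchannels)
{
    assert(length > 0 && inchannels > 0);
    size_t size = sizeof(short int) * length * inchannels;  /* bytes */
    short int tmpbuffer[size];                              /* size ELEMENTS */
    return sizeof(tmpbuffer);
}

/* Array length given as an ELEMENT count: the byte size is what was intended. */
static size_t bytes_if_sized_by_elements(size_t length, size_t inchannels)
{
    assert(length > 0 && inchannels > 0);
    size_t count = length * inchannels;                     /* elements */
    short int tmpbuffer[count];
    return sizeof(tmpbuffer);
}
```

With length = 8 and inchannels = 2, the first function reports sizeof(short int) times more bytes than the second, which is the amount by which a memcpy using sizeof(tmpbuffer) would overrun.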
This has a very good chance of crashing:
short int tmpbuffer[(short int)(size)];
First, size could be too big; truncating it and using whatever value results is probably not what you want.
Edit: Try to write the whole code without a single cast. Only then does the compiler have a chance to tell you if something is wrong.

CUDA program causes nvidia driver to crash

My Monte Carlo pi calculation CUDA program causes my NVIDIA driver to crash when I exceed around 500 trials and 256 full blocks. It seems to happen in the monteCarlo kernel function. Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int incircle, k;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
free(sumHost);
cudaFree(sumDev);
cudaFree(devStates);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
EDIT: According to the CUDA 4.0 cuRAND docs (page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
x = curand_uniform(&state);
y = curand_uniform(&state);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs; there are several useful guidelines.
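The earlier suggestion to run fewer trials per launch and launch more often amounts to accumulating partial hit counts across short batches. The sketch below is host-only C, not CUDA: run_batch is a hypothetical stand-in for one short kernel launch, and the 64-bit LCG is an assumed substitute for cuRAND, used only to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 64-bit LCG (Knuth's MMIX constants).
 * Returns a float in [0, 1) built from the top 24 bits of the state. */
static float next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (float)(*state >> 40) / 16777216.0f;   /* divide by 2^24 */
}

/* Stand-in for one short kernel launch: draw `trials` points and
 * return how many fall inside the quarter circle. */
static unsigned long run_batch(unsigned long trials, uint64_t *state)
{
    unsigned long incircle = 0;
    for (unsigned long k = 0; k < trials; k++) {
        float x = next_uniform(state);
        float y = next_uniform(state);
        if (x * x + y * y <= 1.0f) incircle++;
    }
    return incircle;
}

/* Split total_trials into batches of at most batch_size and sum the hits.
 * The generator state is carried from one batch to the next. */
static double estimate_pi(unsigned long total_trials,
                          unsigned long batch_size, uint64_t seed)
{
    assert(total_trials > 0 && batch_size > 0);
    uint64_t state = seed;
    unsigned long hits = 0, done = 0;
    while (done < total_trials) {
        unsigned long n = total_trials - done;
        if (n > batch_size) n = batch_size;
        hits += run_batch(n, &state);   /* on a GPU: one short launch */
        done += n;
    }
    return 4.0 * (double)hits / (double)total_trials;
}
```

Since the state is carried across batches, the estimate does not depend on the batch size; only the duration of each individual call does, which is what matters for the watchdog.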
For those of you with a GeForce GPU that does not support the TCC driver, there is another solution based on:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
Start regedit.
Navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers.
Create a new DWORD value called TdrLevel and set it to 0.
Restart the PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.