I am running Ubuntu 18.04 in an Oracle VirtualBox VM on an HP machine. I installed OpenCL and tried to run the code below, but OpenCL returns the errors shown at the end. The program simply adds sin^2(i) and cos^2(i) for each i and averages the results, so the answer should be 1.000. Because of some problem with the installation or the machine, I get a series of errors and a result of 0.
I have tried installing and removing beignet, but that did not resolve the issue.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>
// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource = "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable \n" \
"__kernel void vecAdd( __global double *a, \n" \
" __global double *b, \n" \
" __global double *c, \n" \
" const unsigned int n) \n" \
"{ \n" \
" //Get our global thread ID \n" \
" int id = get_global_id(0); \n" \
" \n" \
" //Make sure we do not go out of bounds \n" \
" if (id < n) \n" \
" c[id] = a[id] + b[id]; \n" \
"} \n" \
"\n" ;
int main( int argc, char* argv[] )
{
    // Length of vectors
    unsigned int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;
    cl_platform_id cpPlatform; // OpenCL platform
    cl_device_id device_id; // device ID
    cl_context context; // context
    cl_command_queue queue; // command queue
    cl_program program; // program
    cl_kernel kernel; // kernel
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);
    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }
    size_t globalSize, localSize;
    cl_int err;
    // Number of work items in each local work group
    localSize = 64;
    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;
    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);
    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);
    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                                        (const char **) & kernelSource, NULL, &err);
    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);
    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                bytes, h_b, 0, NULL, NULL);
    // Set the arguments to our compute kernel
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);
    // Execute the kernel over the entire range of the data set
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);
    // Wait for the command queue to get serviced before reading back results
    clFinish(queue);
    // Read the results from the device
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                        bytes, h_c, 0, NULL, NULL );
    //Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);
    // release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    //release host memory
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}
These are the error messages I got:
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [22]
param: 4, val: 0
DRM_IOCTL_I915_GEM_APERTURE failed: Invalid argument
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [22]
param: 4, val: 0
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
final result: 0.000000


Are the compressed bytes inside GZIP and PKZIP files compatible?

This question is a follow-up to "How are zlib, gzip and zip related? What do they have in common and how are they different?" The answers are very detailed but they never quite answer my specific question.
Given a valid GZIP file, should I always be able to extract the deflate-bytes inside and use those bytes to construct a valid PKZIP file with the same contents, without decompressing and recompressing that byte stream?
For example, imagine I have a collection of GZIP files. Could I write a program that quickly (by avoiding deflate/inflate) constructs an equivalent PKZIP file from those files by cutting the GZIP headers off the source files and building a PKZIP structure around the byte streams? (And the same in reverse: taking any valid PKZIP file and quickly converting it into many GZIP files?)
Both file formats appear to use the same "deflate" algorithm, but is it exactly the same deflate algorithm?
Yes. It is exactly the same deflate format.
(The deflate algorithm can be, and in fact often is different, producing different deflate streams. However that is irrelevant to your application. The format is compatible, and any compliant inflator will be able to decompress the gzip deflate data transplanted into a zip file.)
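To make the "transplant" idea concrete, here is a minimal sketch, separate from the converter that follows, of how one might locate the raw deflate bytes inside a gzip member that is already in memory. The function name `gzip_deflate_offset` and the `GZ_*` macro names are my own for illustration; the header layout follows RFC 1952. It does not validate the deflate data or handle multi-member files.

```c
#include <stddef.h>

/* Flag bits in byte 3 of a gzip header (RFC 1952). Macro names are
   illustrative, not taken from any library. */
#define GZ_FHCRC    0x02
#define GZ_FEXTRA   0x04
#define GZ_FNAME    0x08
#define GZ_FCOMMENT 0x10

/* Return the offset of the raw deflate stream inside the gzip member held in
   buf[0..len-1], or 0 if the header is invalid or truncated. The deflate
   data is followed by the 8-byte gzip trailer (CRC-32, then length). */
size_t gzip_deflate_offset(const unsigned char *buf, size_t len) {
    /* magic bytes, compression method 8 (deflate), no reserved flag bits */
    if (len < 10 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8 ||
        (buf[3] & 0xe0))
        return 0;
    size_t pos = 10;                       /* fixed part of the header */
    if (buf[3] & GZ_FEXTRA) {              /* 2-byte length, then the data */
        if (len - pos < 2)
            return 0;
        size_t extra = buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len - pos < extra)
            return 0;
        pos += extra;
    }
    if (buf[3] & GZ_FNAME) {               /* zero-terminated file name */
        while (pos < len && buf[pos] != 0)
            pos++;
        if (pos == len)
            return 0;
        pos++;
    }
    if (buf[3] & GZ_FCOMMENT) {            /* zero-terminated comment */
        while (pos < len && buf[pos] != 0)
            pos++;
        if (pos == len)
            return 0;
        pos++;
    }
    if (buf[3] & GZ_FHCRC) {               /* 2-byte header CRC */
        if (len - pos < 2)
            return 0;
        pos += 2;
    }
    return pos < len ? pos : 0;
}
```

Everything from the returned offset up to the last eight bytes of the member is the deflate stream; those bytes can be copied unchanged into a zip entry that uses compression method 8.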
I forgot why I wrote this, but the C code below will convert a gzip file to a single-entry zip file, with some constraints on the gzip file.
/*
  gz2zip.c version 1.0, 31 July 2018

  Copyright (C) 2018 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty. In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
 */
// Convert gzip (.gz) file to a single entry zip file. See the comments before
// gz2zip() for more details and caveats.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#if defined(MSDOS) || defined(OS2) || defined(WIN32) || defined(__CYGWIN__)
# include <fcntl.h>
# include <io.h>
# define SET_BINARY_MODE(file) setmode(fileno(file), O_BINARY)
#else
# define SET_BINARY_MODE(file)
#endif
#define local static
// Exit on error.
local void bail(char *why) {
    fprintf(stderr, "gz2zip abort: %s\n", why);
    exit(1);
}
// Type to track number of bytes written.
typedef struct {
    FILE *out;
    off_t off;
} tally_t;
// Write len bytes at dat to t.
local void put(tally_t *t, void const *dat, size_t len) {
    size_t ret = fwrite(dat, 1, len, t->out);
    if (ret != len)
        bail("write error");
    t->off += len;
}
// Write 16-bit integer n in little-endian order to t.
local void put2(tally_t *t, unsigned n) {
    unsigned char dat[2];
    dat[0] = n;
    dat[1] = n >> 8;
    put(t, dat, 2);
}
// Write 32-bit integer n in little-endian order to t.
local void put4(tally_t *t, unsigned long n) {
    put2(t, n);
    put2(t, n >> 16);
}
// Write n zeros to t.
local void putz(tally_t *t, unsigned n) {
    unsigned char const buf[1] = {0};
    while (n--)
        put(t, buf, 1);
}
// Convert the Unix time unix to DOS time in the four bytes at *dos. If there
// is a conversion error for any reason, store the current time in DOS format
// at *dos. The Unix time in seconds is rounded up to an even number of
// seconds, since the DOS time can only represent even seconds. If the Unix
// time is before 1980, the minimum DOS time of Jan 1, 1980 is used.
local void unix2dos(unsigned char *dos, time_t unix) {
    unix += unix & 1;
    struct tm *s = localtime(&unix);
    if (s == NULL) {
        unix = time(NULL); // on error, use current time
        unix += unix & 1;
        s = localtime(&unix);
        if (s == NULL)
            bail("internal error"); // shouldn't happen
    }
    if (s->tm_year < 80) { // no DOS time before 1980
        dos[0] = 0; dos[1] = 0; // use midnight,
        dos[2] = (1 << 5) + 1; dos[3] = 0; // Jan 1, 1980
    }
    else {
        dos[0] = (s->tm_min << 5) + (s->tm_sec >> 1);
        dos[1] = (s->tm_hour << 3) + (s->tm_min >> 3);
        dos[2] = ((s->tm_mon + 1) << 5) + s->tm_mday;
        dos[3] = ((s->tm_year - 80) << 1) + ((s->tm_mon + 1) >> 3);
    }
}
// Chunk size for reading and writing raw deflate data.
#define CHUNK 16384
// Read the gzip file from in and write it as a single-entry zip file to out.
// This assumes that the gzip file has a single member, that it has no junk
// after the gzip trailer, and that it contains less than 4GB of uncompressed
// data. The gzip file is not decompressed or validated, other than checking
// for the proper header format. The modification time from the gzip header is
// used for the zip entry, unless it is not present, in which case the current
// local time is used for the zip entry. The file name from the gzip header is
// used for the zip entry, unless it is not present, in which case "-" is used.
// This does not use the Zip64 format, so the offsets in the resulting zip file
// must be less than 4GB. If name is not NULL, then the zero-terminated string
// at name is used as the file name for the single entry. Whether the file name
// comes from the gzip header or from name, it is truncated to 64K-1 characters
// if necessary.
// It is recommended that unzip -t be used on the resulting file to verify its
// integrity. If the gzip files do not obey the constraints above, then the zip
// file will not be valid.
local void gz2zip(FILE *in, FILE *out, char *name) {
// zip file constant headers for local, central, and end record
unsigned char const loc[] = {'P', 'K', 3, 4, 20, 0, 8, 0, 8, 0};
unsigned char const cen[] = {'P', 'K', 1, 2, 20, 0, 20, 0, 8, 0, 8, 0};
unsigned char const end[] = {'P', 'K', 5, 6, 0, 0, 0, 0, 1, 0, 1, 0};
// gzip header
unsigned char head[10];
// zip file modification date, CRC, and sizes -- initialize to zero for the
// local header (the actual CRC and sizes follow the compressed data)
unsigned char desc[16] = {0};
// name from gzip header to use for the zip entry (the maximum size of the
// name is 64K-1 -- if the gzip name is longer, then it is truncated)
unsigned name_len;
char save[65535];
// read and interpret the gzip header, bailing if it is invalid or has an
// unknown compression method or flag bits set
size_t got = fread(head, 1, sizeof(head), in);
if (got < sizeof(head) ||
head[0] != 0x1f || head[1] != 0x8b || head[2] != 8 || (head[3] & 0xe0))
bail("input not gzip");
if (head[3] & 4) { // extra field (ignore)
    unsigned extra = getc(in);
    int high = getc(in);
    if (high == EOF)
        bail("premature end of gzip input");
    extra += (unsigned)high << 8;
    // skip the extra field, using save as scratch space (name may be NULL)
    fread(save, 1, extra, in);
}
if (head[3] & 8) { // file name (save)
    name_len = 0;
    int ch;
    while ((ch = getc(in)) != 0 && ch != EOF)
        if (name_len < sizeof(save))
            save[name_len++] = ch;
}
else { // no file name
    name_len = 1;
    save[0] = '-';
}
if (head[3] & 16) { // comment (ignore)
    int ch;
    while ((ch = getc(in)) != 0 && ch != EOF)
        ;
}
if (head[3] & 2) { // header crc (ignore)
    getc(in);
    getc(in);
}
// use name from argument if present, otherwise from gzip header
if (name == NULL)
    name = save;
else {
    name_len = strlen(name);
    if (name_len > 65535)
        name_len = 65535;
}
// set modification time and date in descriptor from gzip header
time_t mod = head[4] + (head[5] << 8) + ((time_t)(head[6]) << 16) +
((time_t)(head[7]) << 24);
unix2dos(desc, mod ? mod : time(NULL));
// initialize tally of output bytes
tally_t zip = {out, 0};
// write zip local header
off_t locoff = zip.off;
put(&zip, loc, sizeof(loc));
put(&zip, desc, sizeof(desc));
put2(&zip, name_len);
putz(&zip, 2);
put(&zip, name, name_len);
// copy raw deflate stream, saving eight-byte gzip trailer
unsigned char buf[CHUNK + 8];
if (fread(buf, 1, 8, in) != 8)
bail("premature end of gzip input");
off_t comp = 0;
while ((got = fread(buf + 8, 1, CHUNK, in)) != 0) {
    put(&zip, buf, got);
    comp += got;
    memmove(buf, buf + got, 8);
}
// write descriptor based on gzip trailer and compressed count
memcpy(desc + 4, buf, 4);
desc[8] = comp;
desc[9] = comp >> 8;
desc[10] = comp >> 16;
desc[11] = comp >> 24;
memcpy(desc + 12, buf + 4, 4);
put(&zip, desc + 4, sizeof(desc) - 4);
// write zip central directory
off_t cenoff = zip.off;
put(&zip, cen, sizeof(cen));
put(&zip, desc, sizeof(desc));
put2(&zip, name_len);
putz(&zip, 12);
put4(&zip, locoff);
put(&zip, name, name_len);
// write zip end-of-central-directory record
off_t endoff = zip.off;
put(&zip, end, sizeof(end));
put4(&zip, endoff - cenoff);
put4(&zip, cenoff);
putz(&zip, 2);
}
// Convert the gzip file on stdin to a zip file on stdout. If present, the
// first argument is used as the file name in the zip entry.
int main(int argc, char **argv) {
    // avoid end-of-line conversions on evil operating systems
    SET_BINARY_MODE(stdin);
    SET_BINARY_MODE(stdout);
    // convert .gz on stdin to .zip on stdout -- error returns use exit()
    gz2zip(stdin, stdout, argc > 1 ? argv[1] : NULL);
    return 0;
}

SMHasher setup?

The SMHasher test suite for hash functions is touted as the best of the lot. But the latest version I've got (from rurban) gives absolutely no clue on how to check your proposed hash function (it does include an impressive battery of hash functions, but some of interest --if only for historic value-- are missing). Add that I'm a complete CMake newbie.
It's actually quite simple. You just need to install CMake.
Building SMHasher
To build SMHasher on a Linux/Unix machine:
git clone
cd smhasher/
git submodule init
git submodule update
cmake .
Adding a new hash function
To add a new function, you can edit just three files: Hashes.cpp, Hashes.h and main.cpp.
For example, I will add the ElfHash:
unsigned long ElfHash(const unsigned char *s)
{
    unsigned long h = 0, high;
    while (*s)
    {
        h = (h << 4) + *s++;
        if ((high = h & 0xF0000000))
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
First, modify it slightly to take a seed and a length:
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
    unsigned long h = seed, high;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++)
    {
        h = (h << 4) + *data++;
        if ((high = h & 0xF0000000))
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
Add this function definition to Hashes.cpp. Also add the following to Hashes.h:
uint32_t ElfHash(const void *key, int len, uint32_t seed);
inline void ElfHash_test(const void *key, int len, uint32_t seed, void *out) {
    *(uint32_t *) out = ElfHash(key, len, seed);
}
In file main.cpp add the following line into array g_hashes:
{ ElfHash_test, 32, 0x0, "ElfHash", "ElfHash 32-bit", POOR, {0x0} },
(The third value is the expected self-verification value. You will only learn the correct value after running the test once.)
Finally, rebuild and run the test:
./SMHasher ElfHash
It will show you all the tests that this hash function fails. (It is very bad.)
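If you want to sanity-check the seeded variant outside SMHasher first, a standalone sketch like the following can help. The name `elf_hash_seeded` is only for illustration, and it uses `uint32_t` for the running state instead of `unsigned long`, which yields the same low 32 bits.

```c
#include <stdint.h>

/* Seeded, length-based ElfHash variant (illustrative name). The state is
   kept in 32 bits; the top nibble is folded back in and then cleared on
   every step, so the result always has its upper four bits equal to zero
   (for a seed whose upper four bits are zero). */
uint32_t elf_hash_seeded(const void *key, int len, uint32_t seed) {
    const uint8_t *data = (const uint8_t *)key;
    uint32_t h = seed;
    for (int i = 0; i < len; i++) {
        h = (h << 4) + data[i];
        uint32_t high = h & 0xF0000000u;
        if (high != 0)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
```

For example, hashing the single byte "a" with seed 0 gives 0x61, and an empty key returns the seed unchanged, which is one reason the function does so poorly in the tests.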

Questions about this serial communication code? [Cortex-M4]

I'm looking at the following code from STMicroelectronics on implementing USART communication with interrupts
#include <stm32f10x_lib.h> // STM32F10x Library Definitions
#include <stdio.h>
#include "STM32_Init.h" // STM32 Initialization
/*
  The length of the receive and transmit buffers must be a power of 2.
  Each buffer has a next_in and a next_out index.
  If next_in = next_out, the buffer is empty.
  (next_in - next_out) % buffer_size = the number of characters in the buffer.
*/
#define TBUF_SIZE 256 /*** Must be a power of 2 (2,4,8,16,32,64,128,256,512,...) ***/
#define RBUF_SIZE 256 /*** Must be a power of 2 (2,4,8,16,32,64,128,256,512,...) ***/
#if TBUF_SIZE < 2
#error TBUF_SIZE is too small. It must be larger than 1.
#elif ((TBUF_SIZE & (TBUF_SIZE-1)) != 0)
#error TBUF_SIZE must be a power of 2.
#endif
#if RBUF_SIZE < 2
#error RBUF_SIZE is too small. It must be larger than 1.
#elif ((RBUF_SIZE & (RBUF_SIZE-1)) != 0)
#error RBUF_SIZE must be a power of 2.
#endif
struct buf_st {
  unsigned int in; // Next In Index
  unsigned int out; // Next Out Index
  char buf [RBUF_SIZE]; // Buffer
};
static struct buf_st rbuf = { 0, 0, };
#define SIO_RBUFLEN ((unsigned short)(rbuf.in - rbuf.out))
static struct buf_st tbuf = { 0, 0, };
#define SIO_TBUFLEN ((unsigned short)(tbuf.in - tbuf.out))
static unsigned int tx_restart = 1; // NZ if TX restart is required
/* Handles USART1 global interrupt request. */
void USART1_IRQHandler (void) {
  volatile unsigned int IIR;
  struct buf_st *p;
  IIR = USART1->SR;
  if (IIR & USART_FLAG_RXNE) { // read interrupt
    USART1->SR &= ~USART_FLAG_RXNE; // clear interrupt
    p = &rbuf;
    if (((p->in - p->out) & ~(RBUF_SIZE-1)) == 0) {
      p->buf [p->in & (RBUF_SIZE-1)] = (USART1->DR & 0x1FF);
      p->in++;
    }
  }
  if (IIR & USART_FLAG_TXE) {
    USART1->SR &= ~USART_FLAG_TXE; // clear interrupt
    p = &tbuf;
    if (p->in != p->out) {
      USART1->DR = (p->buf [p->out & (TBUF_SIZE-1)] & 0x1FF);
      p->out++;
      tx_restart = 0;
    }
    else {
      tx_restart = 1;
      USART1->CR1 &= ~USART_FLAG_TXE; // disable TX interrupt if nothing to send
    }
  }
}
/* initialize the buffers */
void buffer_Init (void) {
  tbuf.in = 0; // Clear com buffer indexes
  tbuf.out = 0;
  tx_restart = 1;
  rbuf.in = 0;
  rbuf.out = 0;
}
/* transmit a character */
int SendChar (int c) {
  struct buf_st *p = &tbuf;
  // If the buffer is full, return an error value
  if (SIO_TBUFLEN >= TBUF_SIZE)
    return (-1);
  p->buf [p->in & (TBUF_SIZE - 1)] = c; // Add data to the transmit buffer.
  p->in++;
  if (tx_restart) { // If transmit interrupt is disabled, enable it
    tx_restart = 0;
    USART1->CR1 |= USART_FLAG_TXE; // enable TX interrupt
  }
  return (0);
}
/* receive a character */
int GetKey (void) {
  struct buf_st *p = &rbuf;
  if (SIO_RBUFLEN == 0)
    return (-1);
  return (p->buf [(p->out++) & (RBUF_SIZE - 1)]);
}
/* MAIN function */
int main (void) {
buffer_Init(); // init RX / TX buffers
stm32_Init (); // STM32 setup
printf ("Interrupt driven Serial I/O Example\r\n\r\n");
while (1) { // Loop forever
unsigned char c;
printf ("Press a key. ");
c = getchar ();
printf ("\r\n");
printf ("You pressed '%c'.\r\n\r\n", c);
} // end while
} // end main
My questions are the following:
1. In the handler function, when does the statement ((p->in - p->out) & ~(RBUF_SIZE-1)) ever evaluate to a value other than zero? If RBUF_SIZE is a power of 2 as indicated, then ~(RBUF_SIZE-1) should always be zero. Is it checking if p->in > p->out? Even if this isn't true, the conditional should evaluate to zero anyway, right?
2. In the line following, the statement p->buf [p->in & (RBUF_SIZE-1)] = (USART1->DR & 0x1FF); is made. Why does the code AND p->in with RBUF_SIZE-1?
3. What kind of buffer are we using in this code? FIFO?
1. Not so. For example, assuming 32-bit arithmetic, if RBUF_SIZE == 0x00000100 then RBUF_SIZE-1 == 0x000000FF and ~(RBUF_SIZE-1) == 0xFFFFFF00 (it's a bitwise NOT, not a logical NOT). The check you refer to is therefore effectively the same as (p->in - p->out) < RBUF_SIZE, and it's not clear why it is superior. ARM GCC 7.2.1 produces identical-length code for the two (-O1).
2. p->in & (RBUF_SIZE-1) is the same as p->in % RBUF_SIZE when p->in is unsigned. Again, I'm not sure why the former would be used when the latter is clearer; sure, it effectively forces the compiler to compute the modulo using an AND operation, but given that RBUF_SIZE is known at compile time to be a power of two, my guess is that most compilers could figure this out (again, ARM GCC 7.2.1 certainly can, I've just tried it - it produces the same instructions either way).
3. Looks like it. FIFO implemented as a circular buffer.
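To see the three points together, here is a small host-side sketch of the same scheme, with no hardware involved. The names `ring`, `ring_put`, `ring_get`, `ring_len` and the size of 8 are made up for illustration; only the index arithmetic mirrors the code in the question.

```c
#define RING_SIZE 8u   /* hypothetical size; must be a power of 2 */

struct ring {
    unsigned int in;        /* free-running count of bytes written */
    unsigned int out;       /* free-running count of bytes read */
    char buf[RING_SIZE];
};

/* Number of bytes stored. Unsigned subtraction wraps modulo 2^N, so this
   stays correct even after the counters overflow. */
unsigned int ring_len(const struct ring *r) {
    return r->in - r->out;
}

/* Append c. Returns 0 on success, -1 if the buffer is full. */
int ring_put(struct ring *r, char c) {
    /* Same test as in the handler: the masked value is zero exactly when
       (in - out) < RING_SIZE, i.e. when there is still room. */
    if (((r->in - r->out) & ~(RING_SIZE - 1)) != 0)
        return -1;
    /* in & (RING_SIZE - 1) equals in % RING_SIZE for a power-of-2 size. */
    r->buf[r->in & (RING_SIZE - 1)] = c;
    r->in++;
    return 0;
}

/* Remove and return the oldest byte (FIFO order), or -1 if empty. */
int ring_get(struct ring *r) {
    if (r->in == r->out)
        return -1;
    return (unsigned char)r->buf[(r->out++) & (RING_SIZE - 1)];
}
```

The indices are never reduced when they are stored, only when they are used to address the array. That is why the full test can compare the difference against the buffer size and why the mask is needed on every access.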

OpenCL, simple vector addition but wrong output for large input

So, after spending hours reading and understanding I have finally made my first OpenCL program that actually does something, which is it adds two vectors and outputs to a file.
#include <iostream>
#include <vector>
#include <cstdlib>
#include <string>
#include <fstream>
#include <CL/cl.hpp>
int main(int argc, char *argv[])
// get platforms, devices and display their info.
std::vector<cl::Platform> platforms;
std::vector<cl::Platform>::iterator i=platforms.begin();
std::cout<<"OpenCL \tPlatform : "<<i->getInfo<CL_PLATFORM_NAME>()<<std::endl;
std::cout<<"\tVendor: "<<i->getInfo<CL_PLATFORM_VENDOR>()<<std::endl;
std::cout<<"\tVersion : "<<i->getInfo<CL_PLATFORM_VERSION>()<<std::endl;
std::cout<<"\tExtensions : "<<i->getInfo<CL_PLATFORM_EXTENSIONS>()<<std::endl;
// get devices
std::vector<cl::Device> devices;
int o=99;
// iterate over available devices
for(std::vector<cl::Device>::iterator j=devices.begin(); j!=devices.end(); j++)
{
    std::cout<<"\tOpenCL\tDevice : " << j->getInfo<CL_DEVICE_NAME>()<<std::endl;
    std::cout<<"\t\t Type : " << j->getInfo<CL_DEVICE_TYPE>()<<std::endl;
    std::cout<<"\t\t Vendor : " << j->getInfo<CL_DEVICE_VENDOR>()<<std::endl;
    std::cout<<"\t\t Driver : " << j->getInfo<CL_DRIVER_VERSION>()<<std::endl;
    std::cout<<"\t\t Global Mem : " << j->getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/(1024*1024)<<" MBytes"<<std::endl;
    std::cout<<"\t\t Local Mem : " << j->getInfo<CL_DEVICE_LOCAL_MEM_SIZE>()/1024<<" KBytes"<<std::endl;
    std::cout<<"\t\t Compute Unit : " << j->getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()<<std::endl;
    std::cout<<"\t\t Clock Rate : " << j->getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>()<<" MHz"<<std::endl;
}
//get Kernel
std::ifstream ifs("");
std::string kernelSource((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
//Create context, select device and command queue.
cl::Context context(devices);
cl::Device &device=devices.front();
cl::CommandQueue cmdqueue(context,device);
// Generate Source vector and push the kernel source in it.
cl::Program::Sources sourceCode;
sourceCode.push_back(std::make_pair(kernelSource.c_str(), kernelSource.size()));
//Generate program using sourceCode
cl::Program program=cl::Program(context, sourceCode);
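//Build program..
try
{
    program.build(devices);
}
catch(cl::Error &err)
{
    std::cerr<<"Building failed, "<<err.what()<<"("<<err.err()<<")"
    <<"\nRetrieving build log"
    <<"\n Build Log Follows \n"
    <<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)<<std::endl;
}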
//Declare and initialize vectors
cl_int N=A.size();
//Declare and intialize proper work group size and global size. Global size raised to the nearest multiple of workGroupSize.
int workGroupSize=128;
int GlobalSize;
if(N%workGroupSize) GlobalSize=N - N%workGroupSize + workGroupSize;
else GlobalSize=N;
//Declare buffers.
cl::Buffer vecA(context, CL_MEM_READ_WRITE, sizeof(cl_float)*N);
cl::Buffer vecB(context, CL_MEM_READ_ONLY , (B.size())*sizeof(cl_float));
cl::Buffer vecC(context, CL_MEM_READ_ONLY , (C.size())*sizeof(cl_float));
//Write vectors into buffers
cmdqueue.enqueueWriteBuffer(vecB, 0, 0, (B.size())*sizeof(cl_float), &B[0] );
cmdqueue.enqueueWriteBuffer(vecC, 0, 0, (C.size())*sizeof(cl_float), &C[0] );
//Executing kernel
cl::Kernel kernel(program, "vector_add");
cl::KernelFunctor kernel_func=kernel.bind(cmdqueue, cl::NDRange(GlobalSize), cl::NDRange(workGroupSize));
kernel_func(vecA, vecB, vecC, N);
//Reading back values into vector A
cmdqueue.enqueueReadBuffer(vecA,true,0,N*sizeof(cl_float), &A[0]);
//Saving into file.
std::ofstream output("vectorAdd.txt");
for(int i=0;i<N;i++) output<<A[i]<<"\n";
catch(cl::Error& err)
{
    std::cerr << "OpenCL error: " << err.what() << "(" << err.err() <<
    ")" << std::endl;
}
The problem is that for smaller values of N I get the correct result, which is 2.6.
But for larger values, like the one in the code above (993448), I get garbage output varying between 1 and 2.4.
Here is the Kernel code :
__kernel void vector_add(__global float *A, __global float *B, __global float *C, int N) {
    // Get the index of the current element
    int i = get_global_id(0);
    //Do the operation
    if(i<N) A[i] = C[i] + B[i];
}
UPDATE : Ok it seems the code is working now. I have fixed a few minor mistakes in my code above
1) The part where GlobalSize is initialized has been fixed.
2) Stupid mistake in enqueueWriteBuffer (wrong parameters given).
It is now outputting the correct result for large values of N.
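For anyone hitting the same issue, the GlobalSize rounding from the first fix can be isolated into a tiny helper. This is just a sketch in plain C; the function name `round_up_global_size` is invented here, and it assumes the work-group size is nonzero.

```c
#include <stddef.h>

/* Round n up to the nearest multiple of work_group_size, as required for
   the global size when an explicit local size is passed to the kernel
   launch. Assumes work_group_size > 0. */
size_t round_up_global_size(size_t n, size_t work_group_size) {
    size_t remainder = n % work_group_size;
    return remainder ? n - remainder + work_group_size : n;
}
```

With N = 993448 and a work-group size of 128 this gives 993536, so 88 work items fall past the end of the data. The `if(i<N)` guard in the kernel is what keeps those extra work items from touching memory outside the buffers.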
Try to change the data type from float to double etc.

CUDA program causes nvidia driver to crash

My Monte Carlo pi calculation CUDA program causes my NVIDIA driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function. Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
// Function to sum an array
__global__ void reduce0(float *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_odata[i];
    __syncthreads(); // all loads must finish before the reduction starts
    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
        if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads(); // finish this step before starting the next
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
    // unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int incircle, k;
    float x, y, z;
    incircle = 0;
    curand_init(1234, i, 0, &states[i]);
    for(k = 0; k < trials; k++){
        x = curand_uniform(&states[i]);
        y = curand_uniform(&states[i]);
        z =(x*x + y*y);
        if (z <= 1.0f) incircle++;
    }
    g_odata[i] = incircle;
}
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
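The "run the kernel multiple times" suggestion boils down to the host-side loop sketched below. This is plain C with no CUDA in it: `run_batch` stands in for one short kernel launch, and a small xorshift generator stands in for cuRAND. All names and the seed are illustrative assumptions, not part of the original program.

```c
#include <stdint.h>

/* Small xorshift32 generator standing in for cuRAND in this host sketch. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* One "launch": draw `trials` points in the unit square and return how many
   fall inside the quarter circle. The generator state persists across
   launches through *state, just as the curandState array would. */
unsigned long run_batch(unsigned long trials, uint32_t *state) {
    unsigned long incircle = 0;
    for (unsigned long k = 0; k < trials; k++) {
        double x = (double)xorshift32(state) / 4294967296.0;
        double y = (double)xorshift32(state) / 4294967296.0;
        if (x * x + y * y <= 1.0)
            incircle++;
    }
    return incircle;
}

/* Split total_trials into launches of at most batch_size trials each and
   accumulate the counts on the host. Assumes batch_size > 0. The number of
   launches performed is stored in *launches. */
double estimate_pi(unsigned long total_trials, unsigned long batch_size,
                   unsigned long *launches) {
    uint32_t state = 1234u;            /* arbitrary nonzero seed */
    unsigned long incircle = 0, done = 0;
    *launches = 0;
    while (done < total_trials) {
        unsigned long todo = total_trials - done;
        if (todo > batch_size)
            todo = batch_size;
        incircle += run_batch(todo, &state);
        done += todo;
        (*launches)++;
    }
    return 4.0 * (double)incircle / (double)total_trials;
}
```

Each launch does a bounded amount of work, so no single launch runs long enough to trip the watchdog, while the accumulated count is the same as for one long run because the generator state carries over between launches.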
EDIT: According to the CUDA 4.0 cuRAND docs (page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
    x = curand_uniform(&state);
    y = curand_uniform(&state);
    z =(x*x + y*y);
    if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs; there are several useful guidelines.
For those of you with a GeForce GPU that does not support the TCC driver, there is another solution:
start regedit,
navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
create new DWORD key called TdrLevel, set value to 0,
restart PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.