OpenACC function equivalent to CUDA's cudaMemGetInfo()

I would like to know the size of the available memory at a certain point of the program at runtime. I was wondering if OpenACC has any function equivalent to CUDA's cudaMemGetInfo().

The OpenACC standard doesn't include this, but PGI does have an OpenACC extension API you can use: "acc_get_free_memory" returns the amount of free memory on the device, while "acc_get_memory" returns the total memory. Include "accel.h", which is where PGI has the prototypes for its OpenACC extensions. Both return an unsigned long.
While I haven't tried it myself, you might be able to call "cudaMemGetInfo" directly as well.

Related

Query size of local memory accessible to subgroups in Vulkan

Is there a way to know how much local memory every compute unit has access to? For example in OpenCL I could call
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
Vulkan should have something equivalent to this.
The GLSL compute shader abstraction's equivalent to OpenCL local memory is shared memory: memory accessible to all invocations in a work group (declared with shared-qualified variables). In OpenGL you can query GL_MAX_COMPUTE_SHARED_MEMORY_SIZE to get the amount of shared memory; in Vulkan the same limit is reported as maxComputeSharedMemorySize in VkPhysicalDeviceLimits, obtained via vkGetPhysicalDeviceProperties().

Why is my pcl cuda code running in CPU instead of GPU?

I have a code where I use the pcl/gpu namespace:
pcl::gpu::Octree::PointCloud clusterCloud;
clusterCloud.upload(cloud_filtered->points);
pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree);
octree_device->setCloud(clusterCloud);
octree_device->build();
/*tree->setCloud (clusterCloud);*/
// Create the cluster extractor object for the planar model and set all the parameters
std::vector<pcl::PointIndices> cluster_indices;
pcl::gpu::EuclideanClusterExtraction ec;
ec.setClusterTolerance (0.1);
ec.setMinClusterSize (2000);
ec.setMaxClusterSize (250000);
ec.setSearchMethod (octree_device);
ec.setHostCloud (cloud_filtered);
ec.extract (cluster_indices);
I have installed CUDA and included the needed pcl/gpu ".hpp" headers to do this. It compiles (I have a catkin workspace with ROS), but when I run it, it works really slowly. Using nvidia-smi I can see that my code is running only on the CPU, and I don't know why or how to solve it.
This code is an implementation of the gpu/segmentation example here:
pcl/seg.cpp
(Making this an answer since it's too long for a comment.)
I don't know pcl, but maybe it's because you pass a host-side std::vector rather than data that's on the device side.
... what is "host side" and "device side", you ask? And what's std?
Well, std is just a namespace used by the C++ standard library. std::vector is a (templated) class in the C++ standard library, which dynamically allocates memory for the elements you put in it.
The thing is, the memory std::vector uses is your main system memory (RAM) which doesn't have anything to do with the GPU. But it's likely that your pcl library requires that you pass data that's in GPU memory - which can't be the data in an std::vector. You would need to allocate device-side memory and copy your data there from the host side memory.
See also:
Why we do not have access to device memory on host side?
and consult the CUDA programming guide regarding how to perform this allocation and copying (at least, how to perform it at the lowest possible level; your "pcl" may have its own facilities for this.)

malloc with aligned memory in newLib

I'm currently working on a project using an Atmel board (SAM4C, ARM Cortex-M4), and I noticed that when I set the "trap unaligned word accesses" bit, I always got an "Unaligned Access Usage Fault".
After some investigation, I realized that malloc returns blocks of memory that are unaligned. So, I was wondering if there is a way to configure malloc so that it allocates memory at an aligned address? I know that memalign can do the trick, but since there are already too many places where I use malloc, it would be simpler if I could keep using malloc instead.
I'm using the library "newLib".
The ISO C spec states that malloc() always returns an address suitably aligned for any object that fits within the size allocated. In practice, this generally means it will be aligned on an 8-byte boundary.
If it isn't, then it's a non-conformant implementation and should be spanked.
That being said, I'd be really, really, really, surprised if newLib wasn't conformant.

Does gcc optimize the kernel code?

I have added a variable in the struct thread_info to count certain event.
This is done in a guest OS.
During the execution of Virtual machine I read these variables from my HOST every now and then.
I have observed that sometimes I get the expected value, but sometimes I read junk values. I presume that GCC is optimizing my variable, and that the memory I am reading is in a garbage state.
I want to know of a possible way to prevent this.
Turning off GCC optimization for the kernel is out of the question, because my objective is to speed up the virtual machine based on the events I have counted.
#pragma optimize("",off)
makes it less efficient, because then I would have to break my event-counting code (which is just 2 lines) into a separate function. And the event I am counting occurs very often.
Is there a #pragma technique I can use?
Will making my variable volatile help the cause?
Thanks
Making the variables volatile will prevent GCC from optimizing them out. You don't need to disable optimization altogether.
However, you might need to deal with the race condition that results by you trying to read from the struct while the kernel is possibly still updating it. I don't know how you'd do that in a VM context though. Maybe there's some special mechanism for guest-host communication provided by the hypervisor you're using. VMware for example has VMCI.

NVIDIA CUDA 4.0, page-locking a memory with runtime API

NVIDIA CUDA 4.0 (RC2 is assumed here) offers the nice feature of page-locking a memory range that was allocated before via the "normal" malloc function. This can be done using the driver API function:
CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags);
Now, the development of the project was done so far using the runtime API. Unfortunately it seems that the runtime API does not offer a function like cuMemHostRegister. I really would like to avoid mixing driver and runtime API calls.
Does anyone know how to page-lock memory that was previously allocated using standard malloc? Standard libc functions should not be used, since the page-locking is carried out to stage the memory for a fast transfer to the GPU, so I really want to stick to the "CUDA" way.
Frank
The 4.0 runtime API offers cudaHostRegister(), which does exactly what you are asking about. Be aware that the memory allocation you lock must be host page aligned, so you probably should use either mmap() or posix_memalign() (or one of its relatives) to allocate the memory. Passing cudaHostRegister() an allocation of arbitrary size from standard malloc() will probably fail with an invalid argument error.