NVIDIA CUDA 4.0, page-locking memory with the runtime API

NVIDIA CUDA 4.0 (RC2 is assumed here) offers the nice feature of page-locking a memory range that was previously allocated via the "normal" malloc function. This can be done using the driver API function:
CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags);
So far, the project has been developed using the runtime API. Unfortunately, it seems that the runtime API does not offer a function like cuMemHostRegister, and I really would like to avoid mixing driver and runtime API calls.
Does anyone know how to page-lock memory that was previously allocated using standard malloc? Standard libc functions should not be used, since the page-locking is done to stage the memory for fast transfers to the GPU, so I really want to stick to the "CUDA" way.
Frank

The 4.0 runtime API offers cudaHostRegister(), which does exactly what you are asking about. Be aware that the memory allocation you lock must be host page aligned, so you probably should use either mmap() or posix_memalign() (or one of its relatives) to allocate the memory. Passing cudaHostRegister() an allocation of arbitrary size from standard malloc() will probably fail with an invalid argument error.
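For reference, a minimal sketch of that path on a POSIX system could look like the following; the 1 MiB size and the cudaHostRegisterPortable flag are just illustrative choices, not requirements:
// Hedged sketch: pin a posix_memalign()'d buffer with the CUDA 4.0 runtime API.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    const size_t bytes = 1 << 20;                        // 1 MiB, a multiple of the page size
    const size_t page  = (size_t)sysconf(_SC_PAGESIZE);  // host page size

    void* p = NULL;
    if (posix_memalign(&p, page, bytes) != 0)
        return 1;

    // Pin the range so subsequent cudaMemcpy() calls can use fast DMA transfers.
    cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterPortable);
    if (err != cudaSuccess) {
        printf("cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        free(p);
        return 1;
    }

    // ... transfers to/from p are now page-locked transfers ...

    cudaHostUnregister(p);
    free(p);
    return 0;
}
Note that cudaHostUnregister() should be called before the memory is freed.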

Related

Vulkan compute shader direct access to CPU-allocated memory

Is there any way in a Vulkan compute shader to bind a specific location in CPU memory, so that I can access it directly in the shader language?
For example, if I have a variable declaration int a[] = {contents........};, can I bind the address of a to, say, binding location 0 and then access it in GLSL with something like this:
layout(std430, binding = 0) buffer Data {
    int a[];
};
I want to do this because I don't want to spend time writing to and reading from a buffer.
Generally, you cannot make the GPU access memory that Vulkan did not allocate itself for the GPU. The exceptions to this are external allocations made by other APIs that are themselves allocating GPU-accessible memory.
Just taking a random stack or global pointer and shoving it at Vulkan isn't going to work.
I want something like cudaHostGetDevicePointer in CUDA
What you're asking for here is not what that function does. That function takes a CPU pointer into pinned, CPU-accessible memory which CUDA allocated (or registered) for you and mapped into the device's address space, and it returns the corresponding device pointer. The pointer you give it must lie within such a mapped allocation.
You can't just shove a stack/global variable at it and expect it to work. The variable would have to be within the mapped allocation, and a global or stack variable can't be within such an allocation.
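To make the contrast concrete, here is a minimal CUDA runtime sketch of the kind of allocation that function operates on; the size and variable names are illustrative:
// Hedged sketch: cudaHostGetDevicePointer() only works on mapped, pinned
// host memory that CUDA itself allocated (or registered).
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const size_t bytes = 1024 * sizeof(int);

    // Enable mapped pinned allocations before any other runtime call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int* hostPtr = NULL;
    cudaHostAlloc((void**)&hostPtr, bytes, cudaHostAllocMapped);

    // Retrieve the device-side alias of that same mapped allocation.
    int* devPtr = NULL;
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

    // hostPtr and devPtr refer to the same physical pages; a stack or global
    // array can never be passed here because it lies outside the mapping.
    printf("host %p -> device %p\n", (void*)hostPtr, (void*)devPtr);

    cudaFreeHost(hostPtr);
    return 0;
}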
Vulkan doesn't have a way to reverse-engineer a pointer into a mapped range of device memory back to the VkDeviceMemory object it was mapped from. This is in part because Vulkan doesn't have pointers to allocations; you have to use a VkDeviceMemory object, which you create and manage yourself. But if you need to know where a CPU-accessible pointer was mapped from, you can keep track of that yourself.
I want to do this because I don't want to spend time writing to and reading from a buffer.
Vulkan is exactly for people who do want to spend time managing how the data flows. You might want to consider a rapid prototyping framework or a math library instead.
Is there any way in a Vulkan compute shader to bind a specific location in CPU memory
Yes, but it won't save you any time.
First, Vulkan does allow allocation of CPU-accessible memory via VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. So you could allocate your data in such a VkDeviceMemory, map it, do your CPU-side work in that address space, and then use the same memory on the GPU.
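A rough sketch of that first option, assuming the instance, physical device and logical device already exist and a HOST_VISIBLE | HOST_COHERENT memoryTypeIndex has been picked from vkGetPhysicalDeviceMemoryProperties (error handling is deliberately minimal):
// Hedged sketch: allocate host-visible, host-coherent memory and map it.
#include <vulkan/vulkan.h>

void* allocateAndMap(VkDevice device, uint32_t memoryTypeIndex,
                     VkDeviceSize size, VkDeviceMemory* outMemory) {
    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.allocationSize  = size;
    allocInfo.memoryTypeIndex = memoryTypeIndex; // must be HOST_VISIBLE | HOST_COHERENT

    if (vkAllocateMemory(device, &allocInfo, nullptr, outMemory) != VK_SUCCESS)
        return nullptr;

    void* mapped = nullptr;
    if (vkMapMemory(device, *outMemory, 0, size, 0, &mapped) != VK_SUCCESS) {
        vkFreeMemory(device, *outMemory, nullptr);
        return nullptr;
    }

    // The CPU can now read and write through `mapped`; bind *outMemory to a
    // buffer (vkBindBufferMemory) to see the same bytes from a compute shader.
    return mapped;
}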
The second way is to use the VK_EXT_external_memory_host extension, which allows you to import your pointer into Vulkan as VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT. But it is involved in its own way, and the driver might simply say "nope", so you could be back to square one.
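For completeness, the import path looks roughly like this; it assumes the extension has been enabled on the device, that hostPtr satisfies minImportedHostPointerAlignment, and that memoryTypeIndex was obtained from vkGetMemoryHostPointerPropertiesEXT:
// Hedged sketch: import an existing host allocation via VK_EXT_external_memory_host.
#include <vulkan/vulkan.h>

VkResult importHostPointer(VkDevice device, void* hostPtr, VkDeviceSize size,
                           uint32_t memoryTypeIndex, VkDeviceMemory* outMemory) {
    VkImportMemoryHostPointerInfoEXT importInfo{};
    importInfo.sType        = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;
    importInfo.handleType   = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    importInfo.pHostPointer = hostPtr;

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext           = &importInfo;
    allocInfo.allocationSize  = size;            // must be a multiple of the required alignment
    allocInfo.memoryTypeIndex = memoryTypeIndex; // from vkGetMemoryHostPointerPropertiesEXT

    return vkAllocateMemory(device, &allocInfo, nullptr, outMemory);
}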

Why does a GraalVM (SubstrateVM) native image use so much less memory at runtime than a corresponding JIT build?

I'm wondering why building a GraalVM (SubstrateVM) native image of a Java application makes it consume much less memory at runtime, while running the same application normally consumes a lot more memory.
And why can't the normal JIT be made to similarly consume a small amount of memory?
GraalVM native images don't include the JIT compiler or its related infrastructure, so there is no need to allocate memory for the JIT, for the internal representation of the program being compiled (for example, a control-flow graph), for some of the class metadata, and so on.
So it's unlikely that a JIT which actually does useful work can be implemented with the same zero overhead.
It could be possible to create an economical implementation of the virtual machine that would perhaps use less memory than HotSpot, especially if you only measure the default configuration rather than comparing setups where you control how much memory the JVM is allowed to use. However, one needs to realize that it would be either an incremental improvement on the existing implementations or a different choice of trade-offs, because the existing JVM implementations are actually really, really good.

Why is my PCL CUDA code running on the CPU instead of the GPU?

I have some code where I use the pcl::gpu namespace:
pcl::gpu::Octree::PointCloud clusterCloud;
clusterCloud.upload(cloud_filtered->points);
pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree);
octree_device->setCloud(clusterCloud);
octree_device->build();
/*tree->setCloud (clusterCloud);*/
// Create the cluster extractor object for the planar model and set all the parameters
std::vector<pcl::PointIndices> cluster_indices;
pcl::gpu::EuclideanClusterExtraction ec;
ec.setClusterTolerance (0.1);
ec.setMinClusterSize (2000);
ec.setMaxClusterSize (250000);
ec.setSearchMethod (octree_device);
ec.setHostCloud (cloud_filtered);
ec.extract (cluster_indices);
I have installed CUDA and included the needed pcl/gpu ".hpp" headers to do this. It compiles (I have a catkin workspace with ROS), but when I run it, it is really slow. I checked with nvidia-smi and my code is only running on the CPU, and I don't know why or how to solve it.
This code is an implementation of the gpu/segmentation example here:
pcl/seg.cpp
(Making this an answer since it's too long for a comment.)
I don't know pcl, but maybe it's because you pass a host-side std::vector rather than data that's on the device side.
... what is "host side" and "device side", you ask? And what's std?
Well, std is just a namespace used by the C++ standard library. std::vector is a (templated) class in the C++ standard library, which dynamically allocates memory for the elements you put in it.
The thing is, the memory std::vector uses is your main system memory (RAM) which doesn't have anything to do with the GPU. But it's likely that your pcl library requires that you pass data that's in GPU memory - which can't be the data in an std::vector. You would need to allocate device-side memory and copy your data there from the host side memory.
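To illustrate the distinction, here is a minimal sketch of the generic host-to-device pattern with the CUDA runtime API; the sizes are arbitrary, and PCL's upload() helpers wrap something comparable internally:
// Hedged sketch: device-side allocation plus an explicit host -> device copy.
#include <cuda_runtime.h>
#include <vector>

int main() {
    std::vector<float> host_points(3 * 10000, 0.0f);   // host-side data in system RAM

    float* device_points = nullptr;
    const size_t bytes = host_points.size() * sizeof(float);

    cudaMalloc((void**)&device_points, bytes);          // device-side allocation
    cudaMemcpy(device_points, host_points.data(), bytes,
               cudaMemcpyHostToDevice);                 // explicit copy to GPU memory

    // ... launch kernels that read device_points ...

    cudaFree(device_points);
    return 0;
}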
See also:
Why we do not have access to device memory on host side?
and consult the CUDA Programming Guide on how to perform this allocation and copying (at least, how to perform it at the lowest level; your PCL library may have its own facilities for this).

malloc with aligned memory in newlib

I'm currently working on a project using an Atmel board (SAM4C, ARM Cortex-M4), and I noticed that when I set the "trap unaligned word accesses" bit, I always get an "Unaligned Access" usage fault.
After some investigation, I realized that malloc returns blocks of memory that are unaligned. So I was wondering whether there is a way to configure malloc so that it allocates memory at an aligned address. I know that memalign can do the trick, but since there are already too many places where I use malloc, it would be simpler if I could keep using malloc instead.
I'm using the "newlib" C library.
The ISO C spec states that malloc() always returns a memory address that is suitably aligned for a pointer to any object that fits within the size specified. In practice, this generally means it should be aligned on an 8-byte boundary.
If it isn't, then it's a non-conformant implementation and should be spanked.
That being said, I'd be really, really, really, surprised if newLib wasn't conformant.
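If you want to double-check on your target, a small sketch along these lines can confirm the alignment before you enable the trap bit; it assumes your newlib build provides memalign() in <malloc.h> (posix_memalign() is an alternative where available):
// Hedged sketch: request explicitly aligned memory and verify the alignment.
#include <malloc.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    // Ask for a block aligned to 8 bytes, which is what a conforming malloc()
    // would normally guarantee anyway for an allocation of this size.
    void* p = memalign(8, 256);
    if (p == NULL)
        return 1;

    // Verify the alignment before enabling the unaligned-access trap.
    printf("aligned to 8: %s\n", ((uintptr_t)p % 8 == 0) ? "yes" : "no");

    free(p);
    return 0;
}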

GPU Processing in VB.NET

I've got a program that takes about 24 hours to run. It's all written in VB.NET and it's about 2000 lines long. It's already multi-threaded, and this works perfectly (after some sweat and tears). I typically run the process with 10 threads, but I'd like to increase that to reduce processing time, which is where using the GPU comes in. I've searched Google for everything related that I can think of, but no luck.
What I'm hoping for is a basic example of a VB.NET project that does some general operations and then sends some work to the GPU for processing. Ideally I don't want to have to pay for it. So something like:
'Do some initial processing, e.g.
Dim x As Integer
Dim y As Integer
Dim z As Integer
x = CInt(TextBox1.Text)
y = CInt(TextBox2.Text)
z = x * y
'Do some multi-threaded operations on the GPU, e.g.
'Show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in C++ and other languages, but I'm rubbish at understanding other languages!
Thanks all!
Fraser
The VB.NET compiler does not compile for the GPU; it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code, which is in turn compiled with NVCC to generate code that the GPU can execute. Note that this means you are limited to NVIDIA GPUs, as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version. The translation itself is handled by the CUDAfy Translator:
CUDAfy Translator
Using SharpDevelop's decompiler ILSpy as a basis, the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others, but these seem to be the main ones. If you decide to use CUDAfy, you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to the GPU's data-parallel model, unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead here depends not only on the size but also on the type of the data and whether it can be marshaled without conversion. This is in addition to the cost of moving data from the CPU to the GPU, etc.