Why is my pcl cuda code running on the CPU instead of the GPU? - cmake

I have a code where I use the pcl/gpu namespace:
pcl::gpu::Octree::PointCloud clusterCloud;
clusterCloud.upload(cloud_filtered->points);
pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree);
octree_device->setCloud(clusterCloud);
octree_device->build();
/*tree->setCloud (clusterCloud);*/
// Create the cluster extractor object for the planar model and set all the parameters
std::vector<pcl::PointIndices> cluster_indices;
pcl::gpu::EuclideanClusterExtraction ec;
ec.setClusterTolerance (0.1);
ec.setMinClusterSize (2000);
ec.setMaxClusterSize (250000);
ec.setSearchMethod (octree_device);
ec.setHostCloud (cloud_filtered);
ec.extract (cluster_indices);
I have installed CUDA and included the needed pcl/gpu ".hpp" headers to do this. It compiles (I have a catkin workspace with ROS), but when I run it, it is really slow. I checked with nvidia-smi and my code is only running on the CPU, and I don't know why or how to solve it.
This code is an implementation of the gpu/segmentation example here:
pcl/seg.cpp

(Making this an answer since it's too long for a comment.)
I don't know pcl, but maybe it's because you pass a host-side std::vector rather than data that's on the device side.
... what is "host side" and "device side", you ask? And what's std?
Well, std is just a namespace used by the C++ standard library. std::vector is a (templated) class in the C++ standard library, which dynamically allocates memory for the elements you put in it.
The thing is, the memory std::vector uses is your main system memory (RAM) which doesn't have anything to do with the GPU. But it's likely that your pcl library requires that you pass data that's in GPU memory - which can't be the data in an std::vector. You would need to allocate device-side memory and copy your data there from the host side memory.
See also:
Why we do not have access to device memory on host side?
and consult the CUDA programming guide regarding how to perform this allocation and copying (at least, how to perform it at the lowest possible level; your "pcl" may have its own facilities for this.)
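To make the host/device distinction concrete, here is a minimal sketch using the plain CUDA runtime API (illustrative only; your library may well wrap this kind of copy for you):
#include <cuda_runtime.h>
#include <vector>

int main() {
    std::vector<float> host_points(3000);        // host-side memory (system RAM)

    float* device_points = nullptr;              // will point into GPU memory
    cudaMalloc(&device_points, host_points.size() * sizeof(float));

    // Explicit copy from host memory to device memory -- the GPU cannot
    // simply dereference the std::vector's storage.
    cudaMemcpy(device_points, host_points.data(),
               host_points.size() * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(device_points);
    return 0;
}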

Related

vulkan compute shader direct access to CPU allocated memory

Is there any way in a Vulkan compute shader to bind a specific location in CPU memory, so that I can directly access it in the shader language?
For example, if I have a variable declaration int a[]={contents........};, can I bind the address of a to, say, binding location 0 and then access it in GLSL with something like this:
layout(std430, binding = 0) buffer Data {
    int a[];
};
I want to do this because I don't want to spend time on writing to and reading from a buffer.
Generally, you cannot make the GPU access memory that Vulkan did not allocate itself for the GPU. The exceptions to this are external allocations made by other APIs that themselves allocate GPU-accessible memory.
Just taking a random stack or global pointer and shoving it at Vulkan isn't going to work.
I want something like cudaHostGetDevicePointer in CUDA
What you're asking for here is not what that function does. That function takes a CPU pointer to CPU-accessible memory which CUDA allocated for you and which you previously mapped into a CPU address range. The pointer you give it must be within a mapped region of GPU memory.
You can't just shove a stack/global variable at it and expect it to work. The variable would have to be within the mapped allocation, and a global or stack variable can't be within such an allocation.
Vulkan doesn't have a way to reverse-engineer a pointer into a mapped range of device memory back to the VkDeviceMemory object it was mapped from. This is in part because Vulkan doesn't have pointers to allocations; you have to use VkDeviceMemory objects, which you create and manage yourself. But if you need to know where a CPU-accessible pointer was mapped from, you can keep track of that yourself.
I want to do this because I don't want to spend time on writing to and reading from a buffer.
Vulkan is exactly for people that do want to spend time managing how the data flows. You might want to consider some rapid prototyping framework or math library instead.
Is there any way in a Vulkan compute shader to bind a specific location in CPU memory
Yes, but it won't save you any time.
Firstly, Vulkan does allow allocation of CPU-visible memory via VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. So you could allocate your stuff in that VkDeviceMemory, map it, do your CPU work in that address space, and then use it on the GPU.
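A minimal sketch of that first option (assuming a VkDevice and a memory type index with the HOST_VISIBLE and HOST_COHERENT bits were already obtained during normal setup, which is omitted here):
#include <vulkan/vulkan.h>

// Allocate host-visible device memory and map it for direct CPU writes.
int* allocateCpuVisibleArray(VkDevice device, uint32_t memoryTypeIndex,
                             VkDeviceSize bytes, VkDeviceMemory* outMemory) {
    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.allocationSize = bytes;
    allocInfo.memoryTypeIndex = memoryTypeIndex;   // HOST_VISIBLE | HOST_COHERENT type
    vkAllocateMemory(device, &allocInfo, nullptr, outMemory);

    // Map it so the CPU can write directly into memory the GPU can also see.
    void* mapped = nullptr;
    vkMapMemory(device, *outMemory, 0, VK_WHOLE_SIZE, 0, &mapped);
    return static_cast<int*>(mapped);   // fill this on the CPU, then bind the memory
                                        // to a VkBuffer used as the std430 buffer at binding 0
}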
The second way is via the VK_EXT_external_memory_host extension, which allows you to import your pointer into Vulkan as VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT. But it is involved in its own way, and the driver might say "nope", so you are back to square one.
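And a hedged sketch of that import path (assumptions: the device was created with VK_EXT_external_memory_host enabled, the pointer and size respect minImportedHostPointerAlignment, and the memory type index comes from vkGetMemoryHostPointerPropertiesEXT):
#include <vulkan/vulkan.h>

// Import an existing, suitably aligned host allocation as VkDeviceMemory.
VkResult importHostAllocation(VkDevice device, void* hostPtr, VkDeviceSize bytes,
                              uint32_t memoryTypeIndex, VkDeviceMemory* outMemory) {
    VkImportMemoryHostPointerInfoEXT importInfo{};
    importInfo.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;
    importInfo.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
    importInfo.pHostPointer = hostPtr;

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext = &importInfo;              // chain the import request
    allocInfo.allocationSize = bytes;
    allocInfo.memoryTypeIndex = memoryTypeIndex;

    // The driver is allowed to refuse; be ready to fall back to a staging copy.
    return vkAllocateMemory(device, &allocInfo, nullptr, outMemory);
}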

Why does a GraalVM (SubstrateVM) native image use so much less memory at runtime than a corresponding JIT build?

I'm wondering why a GraalVM (SubstrateVM) native image of a Java application consumes so much less memory at runtime, while the same application run normally consumes a lot more memory?
And why can't the normal JIT be made to similarly consume a small amount of memory?
GraalVM native images don't include the JIT compiler or the related infrastructure -- so there's no need to allocate memory for the JIT, for the internal representation of the program being JIT-compiled (for example a control flow graph), no need to store some of the class metadata, etc.
So it's unlikely that a JIT which actually does useful work can be implemented with the same zero overhead.
It could be possible to create an economical implementation of the virtual machine that would perhaps use less memory than HotSpot, especially if you only measure the default configuration without comparing setups where you control the amount of memory the JVM is allowed to use. However, one needs to realize that it'll either be an incremental improvement on the existing implementations or a different choice in some trade-off, because the existing JVM implementations are actually really, really good.

How do GPUs handle indirect branches

My understanding of GPUs is that they handle branches by executing all paths while suspending instances that are not supposed to execute a given path. This works well for if/then/else kinds of constructs and for loops (an instance that terminated the loop can be suspended until all instances are suspended).
This flat out does not work if the branch is indirect. But modern GPUs (Fermi and beyond for NVIDIA; not sure when it appeared for AMD, R600?) claim to support indirect branches (function pointers, virtual dispatch, ...).
The question is, what kind of magic is going on in the chip to make this happen?
According to the CUDA programming guide there are some strong restrictions on virtual functions and dynamic dispatching.
See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions for more information. Another interesting article about how code is mapped to the GPU hardware is http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .
GPUs are like CPUs in regards to indirect branches. They both have an IP (instruction pointer) that points to physical memory. This IP is incremented for each hardware instruction that gets executed. An indirect branch just sets the IP to the new location. How this is done is a little bit more complicated. I will use PTX for Nvidia and GCN Assembly for AMD.
An AMD GCN GPU can have its IP simply set from a register. Example: "s_setpc_b64 s[8:9]". The IP can be set to any value. In fact, on an AMD GPU, it's possible to write to program memory from a kernel and then set the IP to execute it (self-modifying code).
On NVidia's PTX there is no indirect jump. I have been waiting for real hardware indirect branch support since 2009. The most current version of the PTX ISA 4.3 still does not have indirect branching. In the current PTX ISA manual, http://docs.nvidia.com/cuda/parallel-thread-execution, it still reads that "Indirect branch is currently unimplemented".
However, "indirect calls" are supported via jump tables. These are slightly different then indirect branches but do the same thing. I did some testing with jump tables in the past and the performance was not great. I believe the way this works is that the kernel is lunched with a table of already known call locations. Then when it runs across a "call %r10(params)" (something like that) it saves the current IP and then references the jump table by an index and then sets the IP to that address. I'm not 100% sure but its something like that.
Like you said, besides branching, both AMD and NVIDIA GPUs also allow instructions to be executed but ignored: they are executed, but the output is not written. This is another way of handling an if/then/else, as some cores are ignored while others run. It does not really have much to do with branching; it is a trick to avoid time-consuming branches. Some CPUs, like the Intel Itanium, also do this.
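For instance (an illustrative sketch, not from the answer), a short divergent if/else in a CUDA kernel can be handled this way -- both assignments are issued, and lanes whose predicate is false simply don't write their result:
// A short divergent branch that the hardware can handle by masking/predication
// rather than a real jump.
__global__ void clampSign(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] >= 0.0f)
        out[i] = 1.0f;    // lanes where the condition holds write 1
    else
        out[i] = -1.0f;   // the remaining lanes write -1
}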
You can also try searching under these other names: indirect calls, indirect branches, dynamic branching, virtual functions, function pointers or jump tables.
Hope this helps. Sorry I went on so long.

NVIDIA CUDA 4.0, page-locking a memory with runtime API

NVIDIA CUDA 4.0 (RC2 is assumed here) offers the nice feature of page-locking a memory range that was allocated before via the "normal" malloc function. This can be done using the driver API function:
CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags);
Now, the development of the project was done so far using the runtime API. Unfortunately it seems that the runtime API does not offer a function like cuMemHostRegister. I really would like to avoid mixing driver and runtime API calls.
Does anyone know how to page-lock memory that was previously allocated using standard malloc? Standard libc functions should not be used, since the page-locking is carried out to stage the memory for a fast transfer to the GPU, so I really want to stick to the "CUDA" way.
Frank
The 4.0 runtime API offers cudaHostRegister(), which does exactly what you are asking about. Be aware that the memory allocation you lock must be host page aligned, so you probably should use either mmap() or posix_memalign() (or one of its relatives) to allocate the memory. Passing cudaHostRegister() an allocation of arbitrary size from standard malloc() will probably fail with an invalid argument error.
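A minimal sketch of that recipe (sizes, flags, and error handling are illustrative only):
#include <cuda_runtime.h>
#include <cstdlib>
#include <unistd.h>

int main() {
    const size_t bytes = 1 << 20;                   // 1 MiB, for illustration
    const size_t page  = sysconf(_SC_PAGESIZE);

    void* h_buf = nullptr;
    posix_memalign(&h_buf, page, bytes);            // page-aligned "normal" allocation

    cudaHostRegister(h_buf, bytes, 0);              // page-lock (pin) it; no special flags

    void* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // now a pinned, fast copy

    cudaFree(d_buf);
    cudaHostUnregister(h_buf);                      // unpin before freeing
    free(h_buf);
    return 0;
}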

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried it), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research I heard of one person (no source available) who used JS: read the machine code, output JS source and use eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform's machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write; they can do this at load time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The code being emulated or virtualized is replaced with native code. The more the code is run, the more of it is replaced.
I think you need to do a few things. First, decide if you are talking about emulation or a virtual machine like VMware or VirtualBox. In an emulation, the processor and hardware are emulated using software: the next instruction is read by the emulator, the opcode is pulled apart by code, and you determine what to do with it. I have been doing some 6502 emulation and static binary translation, which is dynamic recompilation but pre-processed instead of real time.
So your emulator may hit an LDA #10, load A with immediate. The emulator sees the load A immediate instruction, knows it has to read the next byte (which is the immediate), has a variable in its code for the A register, and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags; in this case the Zero flag is clear, the N flag is clear, C and V are untouched. But what if the next instruction is a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute that load A instruction you may figure out that you don't have to compute the flags because they will be destroyed anyway: it is dead code in the emulation.
You can continue with this kind of thinking. Say you see code that copies the X register to the A register, pushes the A register on the stack, then copies the Y register to the A register and pushes it on the stack: you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carries chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you are looking for operations that the processor being emulated couldn't do directly but that are easy to do in the emulation.
Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, that is, before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
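As a toy sketch of the interpreter loop described above (my own illustrative code, handling only the two immediate loads used in the example):
#include <cstdint>
#include <vector>

// Minimal 6502-style core: just enough state for the LDA/LDX example above.
struct Cpu6502 {
    uint8_t a = 0, x = 0;       // A and X registers live in plain host variables
    bool z = false, n = false;  // Zero and Negative flags
    uint16_t pc = 0;            // emulated instruction pointer
};

void step(Cpu6502& cpu, const std::vector<uint8_t>& mem) {
    uint8_t opcode = mem[cpu.pc++];          // fetch
    switch (opcode) {                        // decode
    case 0xA9:                               // LDA #imm
        cpu.a = mem[cpu.pc++];               // execute: next byte is the immediate
        cpu.z = (cpu.a == 0);                // update Z flag
        cpu.n = (cpu.a & 0x80) != 0;         // update N flag
        break;
    case 0xA2:                               // LDX #imm -- also rewrites Z and N,
        cpu.x = mem[cpu.pc++];               // which is why a preceding LDA's flag
        cpu.z = (cpu.x == 0);                // update can turn out to be dead code
        cpu.n = (cpu.x & 0x80) != 0;
        break;
    default:
        break;                               // all other opcodes omitted here
    }
}

// Usage: std::vector<uint8_t> mem = {0xA9, 0x10, 0xA2, 0x00}; Cpu6502 cpu;
// step(cpu, mem); step(cpu, mem);   // LDA #$10 then LDX #$00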
Once the concepts of emulation and translation are understood, you can try to do it dynamically; it is certainly not trivial. I would suggest first doing a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against a/the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, so x86 on x86 for example. The beauty here is that, using non-old x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things, so loading values in AX and adding BX, etc., all happen in real time on the processor. When the code wants to read or write memory, it depends on your trap mechanism: if the addresses are within the virtual machine's RAM space, there are no traps. But say the program writes to an address which is the virtualized UART; you have the processor trap, and then VMware or whatever decodes that write and emulates it talking to a real serial port. That one instruction, though, wasn't real time; it took quite a while to execute.
What you could do, if you chose to, is replace that instruction or set of instructions that write a value to the virtualized serial port and have them write to a different address that could be the real serial port, or some other location that is not going to cause a fault forcing the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs the write to the UART without a trap, and have that code branch to this UART-write routine instead. The next time you hit that chunk of code it now runs in real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .NET). The byte code contains enough "high level" structures (high level in the sense of higher level than machine code) that the VM can take chunks out of the byte code and replace them with a compiled memory block. The VM typically decides which part gets compiled by counting how many times the code was already interpreted, since the compilation itself is a complex and time-consuming process. So it is useful to only compile the parts which get executed many times.
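A toy sketch of that counting idea (the interpret() and compileToNative() helpers are hypothetical stand-ins for a real interpreter and JIT back end):
#include <functional>
#include <unordered_map>

struct Function { int id; /* byte code omitted */ };

// Hypothetical helpers: a real VM would run the byte code / emit machine code here.
void interpret(const Function&) { /* ... interpreter loop ... */ }
std::function<void()> compileToNative(const Function&) { return [] { /* compiled code */ }; }

std::unordered_map<int, int> hitCount;                     // how often each function ran
std::unordered_map<int, std::function<void()>> compiled;   // already-compiled functions

void execute(const Function& f) {
    if (auto it = compiled.find(f.id); it != compiled.end()) {
        it->second();                        // fast path: run the compiled version
        return;
    }
    interpret(f);                            // slow path: interpret the byte code
    if (++hitCount[f.id] >= 1000)            // executed often enough -> worth compiling
        compiled[f.id] = compileToNative(f);
}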
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C?
This process is an implementation detail of the VM; typically there is a compiler embedded which is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: you don't. The code in the VM can't interact with real hardware. The VM interacts with the OS, and transfers OS events to the code by jumping to or calling specific parts inside the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is the translation of 16-bit real mode code to 32- or 64-bit protected mode code, so as not to be forced to emulate a CPU in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before it jumps to the real code destination. But I doubt whether the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine that a user changes some files. Then IIS will recompile the assemblies in the following steps:
Recompile (all requests are handled by the old code)
Copy the new assemblies (all requests are handled by the old code)
All new requests are handled by the new code; requests already in progress - by the old.
I hope this is helpful.
A virtual machine loads "byte code" or "intermediate language" and not machine code; therefore, I suppose, it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation