CUDA kernel not executing or returning an error - GPU

I have some CUDA code running through some FFTs and other math operations, which works on blocks of 2^n as requested by the user. The code works well when first run, but after running long enough it starts to fail. Eventually it gets to the point where, if I run any block size larger than 2^11, I get no data back (all zeros). I've done some testing by modifying the kernel code, and from what I can tell the kernel is not executing. I'm trying to figure out why my code stops producing data after multiple iterations on large block sizes.
The issue looks at first glance to be a memory leak. I know I have to run multiple iterations of the processing to cause an error. At first only large block sizes stop working, but as I run more iterations smaller block sizes start to fail as well. The reason I'm not certain the issue is memory is that my code will work for a block size lower than 2^11 regardless of how many iterations I run. If this were a simple memory leak, I would have expected the symptoms to get progressively worse until I couldn't access any memory on the card.
I've also noticed that larger block sizes (roughly equivalent to the amount of memory each thread uses) tend to cause the program to fail sooner. Increasing the number of blocks processed (i.e. the number of CUDA threads) doesn't seem to have an effect on when the code starts to fail.
As far as I can tell no error code is being returned, and the kernel doesn't appear to execute at all.
Can anyone suggest what may be causing this issue? I would settle for any insight into how to debug code running on the GPU or how to monitor GPU memory availability.

If you need more computation done, bump up your grid size, not your thread block size. To quote the CUDA programming guide 3.0, pg. 8: "On current GPUs, a thread block may contain up to 512 threads."
This means that blockDim.x * blockDim.y * blockDim.z <= 512 at all times. If you maintain that invariant, do things work?
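For reference, here is a minimal sketch (the kernel name, device pointers, and sizes are illustrative, not taken from the asker's code) of a launch that scales the grid rather than the block and then checks for launch and execution errors, which is also a cheap way to see whether the kernel actually ran:

    // Hypothetical launch: grow the grid with the problem size, keep the block
    // size well under the 512-thread limit, and check for errors afterwards.
    dim3 block(256);
    dim3 grid((numElements + block.x - 1) / block.x);

    myKernel<<<grid, block>>>(d_in, d_out, numElements);

    cudaError_t err = cudaGetLastError();      // catches invalid launch configurations
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();             // catches errors raised while the kernel ran
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));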


Should the number of resource allocations be "per swap chain image"?

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and am a bit confused about the size of uniformBuffers and uniformBuffersMemory. In the tutorial it is said that:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand, the "per swap chain image" approach is the more optimal one. Please prove me wrong if I am mistaken. But why do we need it to be of size swapChainImages.size()? Isn't MAX_FRAMES_IN_FLIGHT just enough, because we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation, then our single uniform buffer will always be available and not in use by the CPU/GPU, so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device because you're about to destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is "Do I need n buffers, or is m enough?".
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.
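To make the fence-based approach concrete, here is a minimal sketch of a frame loop with per-frame-in-flight resources, assuming device, swapChain, inFlightFences, imageAvailableSemaphores, and updateUniformBuffer exist roughly as in the linked tutorial; command buffer recording and presentation are elided:

    const int MAX_FRAMES_IN_FLIGHT = 2;
    uint32_t currentFrame = 0;

    void drawFrame() {
        // Block only on the fence guarding this frame's resources, not on
        // vkDeviceWaitIdle and not on the acquire call.
        vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
        vkResetFences(device, 1, &inFlightFences[currentFrame]);

        uint32_t imageIndex;
        vkAcquireNextImageKHR(device, swapChain, UINT64_MAX,
                              imageAvailableSemaphores[currentFrame], VK_NULL_HANDLE, &imageIndex);

        // The fence proved the GPU finished the submission that last read this
        // frame's uniform buffer, so it is safe to overwrite it now.
        updateUniformBuffer(currentFrame);

        // ... record and submit the command buffer, signalling inFlightFences[currentFrame] ...
        // ... present imageIndex ...

        currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
    }

Note that the number of uniform buffers only has to match the number of frames you actually allow in flight, not the number of swap-chain images.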

How to determine GTK3 memory leaks on OS X

I've written a small, single-line-oriented, UDP-based display service to support the Raspberry Pi projects I frequently work on, where it would be nice to see the results of the sensor data being captured. This is a rewrite of that program using GTK3 v3.18.9 and GLib2 v2.46.2. I'm developing on OS X El Capitan.
It seems to double in real memory size every 30 minutes or so based on traffic, so I'm presuming I have a memory leak somewhere. But for the life of me I can't see where in the code it could possibly be. Valgrind did not initially work for me, so I've got some studying to do to resolve whatever issue that is.
Meanwhile I was hoping that different eyes might be able to suggest a coding cause for a traffic-based (at least I think so) memory leak.
It starts up using about 10 MB of real memory, jumps to 14 MB once it has received 64 total messages, then slowly grows from there. At 64 messages I start deleting the 65th message off the end of the list, presuming I should be saving memory, as this program might run for weeks.
Here is the code for the test client and the display service:
https://gist.github.com/skoona/c1218919cf393c98af474a4bf868009f
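The actual code is in the gist above, but as a hedged illustration of the trimming step described in the question: if the last 64 messages are kept in a GList of heap-allocated strings, each dropped row's payload has to be freed along with its link, otherwise every message past the 64th leaks its string. A minimal sketch with hypothetical names, not taken from the gist:

    /* Hypothetical trimming step: keep at most `limit` messages, freeing the
     * payload of each dropped row, not just the list link. */
    static GList *trim_to_limit(GList *messages, guint limit)
    {
        while (g_list_length(messages) > limit) {
            GList *last = g_list_last(messages);
            g_free(last->data);                           /* free the g_strdup'ed text */
            messages = g_list_delete_link(messages, last);
        }
        return messages;
    }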

Java 8: Why does Metaspace size increase but number of loaded classes stay the same?

In our app running on Jdk 8 we use VisualVM to track the usage of loaded classes and the usage of the metaspace.
At some point in time while our app is running, we see that the number of loaded classes doesn't increase any more, but the Metaspace still increases in size. So what else, apart from classes, is stored in Metaspace that could cause that?
While your program is running, HotSpot's JIT compiler identifies some parts of your code as "hot" and compiles them to native code, possibly inlining other code into them. The compiled native code itself lands in the code cache, but the JIT also accumulates per-method profiling and compiler metadata, and that bookkeeping lives alongside the rest of the class metadata in Metaspace.
This explains the continuous growth you're seeing: hot spots are identified over time using a simple metric of how many times a piece of code has been executed. As the program keeps running, more and more pieces of code hit the threshold set by -XX:CompileThreshold (which defaults to 10000) and get compiled.
I am not sure, but here (http://java.dzone.com/articles/java-8-permgen-metaspace) I found this:
Garbage collection of the dead classes and classloaders is triggered once the class metadata usage reaches the “MaxMetaspaceSize”.
Maybe this is the cause of the increasing Metaspace size.

What are some factors that could affect program runtime?

I'm doing some work on profiling the behavior of programs. One thing I would like to do is get the amount of time that a process has run on the CPU. I am accomplishing this by reading the sum_exec_runtime field in the Linux kernel's sched_entity data structure.
After testing this with some fairly simple programs that simply execute a loop and then exit, I am running into a peculiar issue: the program does not finish with the same runtime each time it is executed. Since sum_exec_runtime is a value represented in nanoseconds, I would expect the value to differ by a few microseconds. However, I am seeing variations of several milliseconds.
My initial reaction was that this could be due to I/O waiting times, however it is my understanding that the process should give up the CPU while waiting for I/O. Furthermore, my test programs are simply executing loops, so there should be very little to no I/O.
I am seeking any advice on the following:
Is sum_exec_runtime not the actual time that a process has had control of the CPU?
Does the process not actually give up the CPU while waiting for I/O?
Are there other factors that could affect the actual runtime of a process (besides I/O)?
Keep in mind, I am only trying to find the actual time that the process spent executing on the CPU. I do not care about the total execution time including sleeping or waiting to run.
Edit: I also want to make clear that there are no branches in my test program aside from the loop, which simply loops for a constant number of iterations.
Thanks.
Your question is really broad, but you can incur context switches for various reasons. Calling most system calls involves at least one context switch. Page faults cause context switches. Exceeding your time slice causes a context switch.
sum_exec_runtime is equal to utime + stime from /proc/$PID/stat, but sum_exec_runtime is measured in nanoseconds. It sounds like you only care about utime which is the time your process has been scheduled in user mode. See proc(5) for more details.
You can look at nr_switches, both voluntary and involuntary, which are also part of sched_entity. That will probably account for most of the variation, but I would not expect successive runs to be identical. The exact time that you get for each run will be affected by all of the other processes running on the system.
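If it helps, the scheduler's view of a process can be read directly from /proc/<pid>/sched on kernels built with CONFIG_SCHED_DEBUG; a small illustrative sketch that dumps the accumulated CPU time and the context-switch counters for the current process:

    // Print se.sum_exec_runtime and the voluntary/involuntary context-switch
    // counters as reported by the scheduler for this process.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream sched("/proc/self/sched");
        std::string line;
        while (std::getline(sched, line)) {
            if (line.find("se.sum_exec_runtime") != std::string::npos ||
                line.find("nr_voluntary_switches") != std::string::npos ||
                line.find("nr_involuntary_switches") != std::string::npos)
                std::cout << line << '\n';   // values formatted by the kernel
        }
    }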
You'll also be affected by the amount of file system cache used on your system and how many file system cache hits you get in successive runs if you are doing any IO at all.
To give a very concrete and obvious example of how other processes can affect the run time of the current process, consider what happens if you are exceeding your physical RAM constraints. If your program asks for more RAM, then the kernel is going to spend more time swapping. That time spent swapping will be accounted in stime, but it will vary depending on how much RAM you need and how much RAM is available. There are lots of other ways that other processes can affect your process's run time. This is just one example.
To answer your 3 points:
sum_exec_runtime is the actual time the scheduler ran the process, including system time.
If you count switching into the kernel as the process giving up the CPU, then yes; but that does not necessarily mean a different user process runs - your process may get the CPU back once the kernel is done.
I think I've already answered this: there are lots of factors.

CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.
How I have parallelised this problem is by launching 320 blocks (probably not the best number) so that each block will perform the device function relevant to its element in parallel.
The CPU handles the entire problem serially: it takes the elements one by one and calls each function one after the other, whereas the GPU allocates an element to each block so that all 320 blocks can run the relevant device functions and calculate simultaneously.
In theory, for a large number of elements the GPU should be faster - at least I thought so, but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.
This is acceptable, BUT what I want to know is how I can cut down the GPU execution time at least a little. Currently the CPU takes 2.5 milliseconds and the GPU around 12 milliseconds.
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start?
First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't that clear and I'll expand on it.
Thanks in advance.
Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still execute a full warp of 32 - the other 31 threads will simply be inactive. This is potentially an occupancy issue, though you may not observe it on your device and with your problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM. Hence, if you have a device with 8 SMs, you can simultaneously run only 64 threads with a block size of 1. You can use the CUDA Occupancy Calculator to estimate a better block size, or you can use the Visual Profiler.
This can only be decided by profiling. There are too many unknowns - e.g. the ratio of memory accesses to actual computation, how parallelizable your task is, and so on.
I would really recommend starting with the CUDA Best Practices Guide.
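As a concrete illustration of the block-size point above, here is a minimal sketch of how the dispatch described in the question might be launched with several warps per block and the grid sized from the element count; the device functions and all names are made up for the example:

    // Each element's integer code selects a device function; one thread per element.
    __device__ float doFive(float x)  { return x * 5.0f; }
    __device__ float doSeven(float x) { return x * 7.0f; }

    __global__ void dispatch(const int *codes, const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                 // guard threads past the end of the array

        switch (codes[i]) {                 // pick the device function for this element
            case 5:  out[i] = doFive(in[i]);  break;
            case 7:  out[i] = doSeven(in[i]); break;
            default: out[i] = in[i];          break;
        }
    }

    // Host side: 128 threads per block keeps whole warps occupied; the grid covers all n elements.
    // dim3 block(128);
    // dim3 grid((n + block.x - 1) / block.x);
    // dispatch<<<grid, block>>>(d_codes, d_in, d_out, n);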