How to leverage blocks/grid and threads/block? - optimization

I'm trying to accelerate this database search application with CUDA, and I'm working on running a core algorithm in parallel with CUDA.
In one test, I ran the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block, and got a run time of roughly 500 ms.
Then I increased the size of the digital sequence to 8192 with 128 blocks per grid and 64 threads per block, and somehow the algorithm ran in roughly 350 ms.
This suggests that the number of blocks and threads used, and how they relate to each other, does impact performance.
My question is how to decide the number of blocks/grid and threads/block?
Below I have my GPU specs from a standard device query program:

You should test it, because it depends on your particular kernel. One thing you should always do is make the number of threads per block a multiple of the warp size. After that you can aim for high occupancy of each SM, but that is not always synonymous with higher performance; it has been shown that lower occupancy can sometimes give better performance. Memory-bound kernels usually benefit more from higher occupancy, since it helps hide memory latency; compute-bound kernels, not so much. Testing the various configurations is your best bet.
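As a concrete starting point, recent CUDA toolkits can suggest a block size for a given kernel through the occupancy API. A minimal sketch, assuming a newer toolkit and a hypothetical kernel myKernel (not the asker's actual code):

    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to illustrate the occupancy API.
    __global__ void myKernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    void chooseLaunchConfig(const float* dIn, float* dOut, int n)
    {
        int minGridSize = 0;  // minimum grid size needed to reach full occupancy
        int blockSize   = 0;  // suggested block size (a multiple of the warp size)
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

        int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
        myKernel<<<gridSize, blockSize>>>(dIn, dOut, n);
    }

Treat the suggested value as a starting point and still benchmark a few multiples of 32 around it, since the heuristic only maximizes theoretical occupancy, not measured run time.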

Related

Is 250,000 the maximum number of iterations in the optimization experiment in AnyLogic?

I want to run the Optimization Experiment until the maximum available memory on my computer is reached (I have 2 TB free). Therefore, I do not specify the number of iterations, and I set the maximum available memory to 1.5 TB (1,500,000 MB).
However, when I run the model, the experiment keeps running and stops automatically at iteration number 250,000 (Status: Finished).
Do you think the 250,000 iterations consume the 1.5 TB (I already disabled logging of model execution)? Or is this number (250,000) set by AnyLogic? If so, based on what, since the solution space may have millions or billions of solutions (each iteration is a solution)?
I tried the Activity Based Costing Analysis model, the model stops at 250K iterations too.
I don't think there is a max limit. The "Activity-based costing" example goes well beyond 250k iterations on my machine.
Maybe you have a limited OptQuest license; do you use the Academic AL version? (I have the Professional version.)
However, you are working with a big misunderstanding. The memory you assign is your RAM. Typically, computers have 16-64 GB of RAM (not terabytes). What you are talking about is your HDD space, which AnyLogic neither uses nor cares about.
So setting such a high value has no impact beyond AnyLogic using all of your RAM. This may cause the stop, but it may also be a fixed upper limit.
Try running a simple experiment with 4 GB of memory and see if it also stops at 250k.

Optimizing Tensorflow for a 32-cores computer

I'm running TensorFlow code on an Intel Xeon machine with 2 physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite new to TensorFlow and haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them and that the bottleneck is somewhere else (for example, slow access to the hard disk)?
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is the slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
Each individual session.run() call may do only a small amount of work, so you end up spending much of the time going back and forth between Python and the execution engine.
You can find useful suggestions here
Use the timeline to see what is executed when

Do gpu cores switch tasks when they're done with one?

I'm experimenting with C++ AMP; one thing that's unclear from the MS documentation is this:
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do. Then there was some vague stuff on warps and tiles out of which I got this impression:
Regardless of how the threads are tiled together (or not at all), the whole group must finish before taking on new tasks so if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true? Also, how do I find out what this internal groups size is?
During my experimentation, this certainly appears to have some truth to it, because assigning more evenly sized amounts of work to the threads seems to speed things up, even if there is slightly more work overall.
The reason I'm trying to figure it out is because I'm deciding whether or not to engage in another lengthy experiment that would be based on threads getting uneven amounts of work (sometimes by the factor of 10x) but all the threads would be independent so data wise, the cores would be free to pick up another thread.
In practice, the underlying execution model of AMP on GPU is the same as CUDA, OpenCL, Compute Shaders, etc. The only thing that changes is the naming of each concept. So if you feel that the AMP documentation is lacking, consider reading up on CUDA or OpenCL. Those are significantly more mature APIs and the knowledge you gain from them applies as well to AMP.
If I dispatch a parallel_for_each with an extent of say 1000, then that would mean that it spawns 1000 threads. If the gpu is unable to take on those 1000 threads at the same time, it completes them 300 at a time or 400 or whatever number it can do.
Maybe. From the high-level view of parallel_for_each, you don't have to care about this. The threads may as well be executed sequentially, one at a time.
If you launch 1000 threads without specifying a tile size, the AMP runtime will choose a tile size for you, based on the underlying hardware. If you specify a tile size, then AMP will use that one.
GPUs are made of multiprocessors (in CUDA parlance, or compute units in OpenCL), each composed of a number of cores.
Tiles are assigned per multiprocessor: all threads within the same tile will be run by the same multiprocessor, until all threads within that tile run to completion. Then, the multiprocessor will pick another available tile (if any) and run it, until all tiles are executed. Multiprocessors can execute multiple tiles simultaneously.
if the internally assigned group has the size of 128 and 30 of them finish, the 30 cores will idle until the other 98 are done too. Is that true?
Not necessarily. As mentioned earlier, a multiprocessor may have multiple active tiles, so it can schedule threads from other tiles to stay busy.
Important note: On GPU, threads are not executed on a granularity of 1. For example, NVIDIA hardware executes 32 threads at once.
To keep this answer from getting needlessly long, I encourage you to read up on the concept of a warp.
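To make the grouping concrete, here is roughly what the same 1000-thread dispatch looks like in CUDA terms (a tile corresponds to a thread block, and on NVIDIA hardware each block executes as warps of 32 threads). This is a hedged sketch with a made-up kernel, not your AMP code:

    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for the body of the parallel_for_each.
    __global__ void body(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard: 1000 is not a multiple of 128
            data[i] += 1.0f;
    }

    void dispatch(float* d_data)
    {
        const int n        = 1000;                           // the "extent" of the dispatch
        const int tileSize = 128;                            // one tile/block = 128 threads = 4 warps
        const int numTiles = (n + tileSize - 1) / tileSize;  // 8 tiles; the last one is partly idle
        body<<<numTiles, tileSize>>>(d_data, n);
    }

Each of those 8 blocks is handed to some multiprocessor as a unit; the scheduler is free to keep several of them resident at once and to interleave their warps, which is how the idle-core scenario you describe is largely avoided.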
The GPU certainly won't run 1000 threads at the same time, but it also won't complete them 300 at a time.
It uses multithreading, which means that just like in a CPU, it will share run time among the 1000 threads allowing them to complete seemingly at the same time.
Keep in mind that creating a lot of threads may not be worthwhile, for several reasons. For instance, if you must complete all 1000 tasks in step 1 before doing step 2, you might as well distribute them over a number of threads equal to the number of cores in your GPU and no more than that.
Using more threads than the number of cores only makes sense if you want to dispatch tasks that are not being waited on, or because writing your code this way is easier. But keep in mind that thread management costs time too and may drag down your performance.

CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.
How I have parallelised this problem is by launching 320 blocks (probably not the best number) so that each block will perform the device function relevant to its element in parallel.
The CPU would handle the entire problem serially, taking the elements one by one and calling each function in turn, whereas the GPU allocates an element to each block so that all 320 blocks can call the relevant device functions and calculate simultaneously.
In theory, for a large number of elements the GPU should be faster - at least I thought so - but in my case it isn't. My assumption is that since 300 elements is a small number, the CPU will always be faster than the GPU.
This is acceptable BUT what I want to know is how I can cut down the GPU execution time at least by a little. Currently, the CPU takes 2.5 milliseconds and the GPU around 12 ms.
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start?
First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't that clear and I'll expand on it.
Thanks in advance.
Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block, the device will still execute a warp of 32 - the other 31 threads will simply be idle. This is potentially an occupancy issue, though you may not observe it on your device and with your problem size. There is also a limit on the number of blocks a given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM, so if you have a device with 8 SMs you can only run 64 threads simultaneously with a block size of 1. You can use the CUDA Occupancy Calculator to estimate a better block size, or you can use the Visual Profiler.
This can only be decided by profiling. There are too many unknowns - e.g. the ratio of memory accesses to actual computation, how parallelizable your task is, etc.
I would really recommend starting with the CUDA Best Practices Guide.
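To illustrate the block/thread tweak from Question 1, the launch could assign one thread per element and group threads into warp-sized multiples, rather than one block per element. A minimal sketch under assumed names (handleFive, handleOther and processElements are hypothetical stand-ins, not your actual functions):

    #include <cuda_runtime.h>

    // Hypothetical stand-ins for the per-integer device functions.
    __device__ float handleFive(int v)  { return v * 2.0f; }
    __device__ float handleOther(int v) { return v + 1.0f; }

    // One thread per element instead of one block per element.
    __global__ void processElements(const int* a, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        switch (a[i])                                // dispatch on the value of this element
        {
            case 5:  out[i] = handleFive(a[i]);  break;
            default: out[i] = handleOther(a[i]); break;
        }
    }

    void launch(const int* dA, float* dOut, int n)   // n = 300 in the question
    {
        const int threadsPerBlock = 128;                                 // a multiple of the warp size
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 3 blocks for n = 300
        processElements<<<blocks, threadsPerBlock>>>(dA, dOut, n);
    }

Note that threads in the same warp which hit different cases will serialize, so with heavily value-dependent work it can pay to group equal values together; and with only 300 elements, fixed overheads such as kernel launch and host-device transfers are likely to dominate the GPU time you measure.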

Optimizing CUDA kernels regarding registers

I'm using the CUDA Occupancy Calculator to try to optimize my CUDA kernel. Currently I'm using 34 registers and zero shared memory, so the maximum occupancy is 63% for 310 threads per block. If I could somehow reduce the register count (e.g. by passing kernel parameters via shared memory) to 20 or below, I could get an occupancy of 100%. Is this a good way to do it, or would you advise another optimization path?
I'm also wondering whether there's a newer version of the occupancy calculator for Compute Capability 2.1.
Some points to consider:
320 threads per block will give the same occupancy as 310, because occupancy is defined as active warps/maximum warps per SM, and the warp size is always 32 threads. You should never use a block size which is not a round multiple of 32. That just wastes cores and cycles.
Kernel parameters are passed in constant memory on your compute 2.1 device, and they have no effect on occupancy or register usage.
The GPU design has a pipeline latency of about 21 cycles. So for a Fermi GPU, you need about 43% occupancy to cover all of the internal scheduling latency. Once that is done, you may find that there is relatively little benefit in trying to achieve higher occupancy.
Striving for 100% occupancy is rarely a constructive optimization goal. If you have not done so, I highly recommend looking over Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy", where he shows all sorts of surprising results, like code hitting 85% of peak memory bandwidth at 8% occupancy.
The newest occupancy calculator doesn't cover compute 2.1, but the effective occupancy rules for compute 2.0 apply to 2.1 devices too. The extra cores in the compute 2.1 multiprocessor come into play via instruction-level parallelism and something approaching out-of-order execution. That doesn't really change the occupancy characteristics of the device compared to compute 2.0 devices.
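If you do decide to experiment with register pressure, the two usual knobs are the __launch_bounds__ qualifier and the -maxrregcount compiler flag. A hedged sketch with a placeholder kernel (the bounds shown are examples, not recommendations for your code):

    // Hint to the compiler: this kernel is launched with at most 320 threads per
    // block, and we would like at least 4 resident blocks per SM. Asking for more
    // resident blocks pressures the compiler into using fewer registers per thread,
    // spilling to local memory if it has to.
    __global__ void __launch_bounds__(320, 4)
    scaleKernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];
    }

    // Alternatively, cap registers for the whole compilation unit:
    //   nvcc -maxrregcount=20 kernel.cu
    // Register spills go to local memory, which can cost more than the extra
    // occupancy buys back, so profile before and after.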
talonmies is correct, occupancy is overrated.
Vasily Volkov had a great presentation at GTC2010 on this topic: "Better Performance at Lower Occupancy."
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf