Max size of thread local memory of a GPU thread (C++ AMP) - gpu

I'd like to create an integer array of 100 and an another one of ~10-100 integers (varies by user input) on every thread. I will reuse the data in the array_views several times on a thread so I want to copy the content of the aray view as local data to enhance memory access time. (Every thread is responsible for its "own" 100 elements of the array_view, creating one thread for every element is not possible with my algorythm) If it is not possible, tile static memory will do the trick too, but the thread local one would be better.
My question is, how many bytes can I allocate on a thread as local variable/array(a minimum amount which will work on most GPU s)?
Also, with which software can I query the capabilites of my GPU (Number of registers per thread, size of static memory per tile, etc.) The CUDA SDK has an utility app which queries the capabilities of the GPU, but I have an AMD one, Radeon HD 5770, and it won't work with my GPU if I am correct.

Opencl api can query gpu or cpu devices for capabilities of opencl programs but results should be similar for any natively optimized structure. But if your C++ AMP is based on HLSL or similar, you may not be able to use LDS.
32kB LDS and 24kB constant cache per compute unit means you can have 1kB LDS + 0.75kB per thread when you choose 32 threads per compute unit. But drivers may use some of it for other purposes, you can always test for different sizes. Look at constant cache bandwidth, its 2x performance of LDS bw.
If you are using those arrays without sharing with oter threads (or without any synchronization), you can use 256kB register space per compute unit(8kB per thread (setup of 32 threads per cu)) with six times wider bandwidth than LDS. But there are always some limits so actual usable value may be half of this.
taken from appendix - d of http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Related

Performance Counter for DRAM Per-Rank Memory Access

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. I need to retrieve the number of accesses to each DRAM rank, over time, to estimate its power consumption. Based on page 261 of the chipset documentation (i.e., Datasheet, volume 2 (M- and H-processor lines)), I could use the 32-bit value in register, RAM—DRAM_ENERGY_STATUS, as a DRAM energy estimation. But I need rank-level energy estimates. I could also use core and offcore DRAM access performance counters to estimate power consumption, but, as mentioned before, I need per-rank statistics. Besides that, they report whole-system stats, while energy is calculated per-rank. They also do not report many DRAM accesses.
Therefore, IMC counters (which are uncore counters) should be the ideal choice. Perf does not support per-rank counters. I tried to use PCM-Memory to access IMC counter information. But /sys/bus/event_source/devices/uncore_imc is not mounted by the kernel (the version is 5.0.0-37-generic) and the tool does not detect the CPU. I tried to access uncore performance counters, manually. Whole-system DRAM access counters are documented, here (They were not documented in the above-mentioned chipset manual). I can retrieve total DRAM read and write accesses using these counters. But, there is no information about channel or rank-level access stats. How can I find the offset associated with these counters? Should I use trial and error?
P.S.: This question is also asked at Intel Software Tuning, Performance Optimization & Platform Monitoring Forum.
The MSR_DRAM_ENERGY_STATUS always reports an estimate of the energy consume by all memory channels. There is no easy way to break it into per-rank energy. This register reports a highly accurate estimate on Haswell.
The 5.0.0-37-generic kernel is an Ubuntu kernel and does support the uncore_imc/data_reads/ and uncore_imc/data_writes/ events on Haswell, which represent a data read CAS command and a data write CAS command from the IMC, respectively. A full cache-line read and a full cache-line write transactions cause a single bursty 64-byte transaction on the memory bus to a single rank. A partial read is also executed as a single full-line read on the bus, but a partial write may require a full line read followed by a full line write due to restrictions in the protocol. Partial writes are generally negligible.
The uncore_imc/data_reads/ and uncore_imc/data_writes/ events occur for requests targeting DRAM memory generated by any unit, not just cores. These names are given by perf and they correspond to UNC_IMC_DRAM_DATA_READS and UNC_IMC_DRAM_DATA_WRITES, respectively, which are mentioned in the Intel article you've cited. The other three events mentioned there allow you count requests (not CAS commands!) for each of the three possible sources separately (GT, IA, and IO). You won't find them listed under /sys/bus/event_source/devices/uncore_imc/events on your old kernel. They are supported in perf starting with mainstream kernel v5.9-rc2.
By the way, PCM does support these events as well, which it uses to report read and write bandwdith over all channels, but you should use the tool pcm.x, not pcm-memory.x, which only works on server processors.
A Haswell H-processor line processor has a single on-die memory controller with two DDR3L 64-bit channels. Each channel can contain zero, one, or two DIMMs with a total capacity of up to 32 GBs over all channels. Moreover, each DIMM can contain up to two ranks, so a single channel can contain anywhere between zero and four ranks. The i7-4720HQ is a high-end mobile processor. You're probably on a laptop with 8 GBs or 16 GBs of memory. If the memory topology was not changed since purchase, it probably has only two 4GB or 8GB DIMMs, one in each channel, with one remaining free slot per channel available for expansion if desired by the user. This means that there are either one or two ranks per channel.
You can approximate the number of accesses to each rank given the knowledge of how physical addresses are mapped to ranks. If each channel is populated with a single rank DIMM of the same capacity, the mapping is simple on your processor. Bit 6 of the physical address (i.e., the seventh bit) determines which channel, and therefore which rank, a request is mapped to. You can collect a set of samples of physical addresses of requests at the IMC by running perf record on MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM with the --phys-data option. Obviously this set of samples may only be representative of core-originated retired loads that reach the IMC, which are a small subset of all requests at the IMC.
It appears to me that you want to measure the number of memory accesses per rank in order to estimate the the per-rank energy from the total DRAM energy, but this is not trivial at all for the following reasons:
Not all CAS commands are of the same energy cost. Precharge and activate commands are not counted by any event and may consume significant energy, especially with high row buffer miss rates.
Even if there are zero requests in the IMC, as long as there is at least one active core, the memory channels are powered and do consume energy.
The amount of time it takes to process a request of the same type and to the same address may vary depending surrounding requests due to timing delays required by rank-to-rank turnarounds and read-write switching.
Despite of all of that, I imagine it may be possible to build a good model of upper and lower bounds on per-rank energy given a representative estimate of the number of requests to each rank (as discussed above).
The bottom line is that there is no easy way to get the luxury of per-rank counting like on server processors.

GPU execution "flow" vs. CPU

On a general purpose CPU parallel processing is performed splitting calculation / problem into sub-problems, distributing them and running them in parallel on a number of cores on one or several sockets / servers.
What is the execution "flow" on a GPU from loading data to sending back results to the CPU ? What are key differences between execution on a GPU and execution on a CPU ?
Should we see a GPU as a "kind of CPU with a higher (huge) number smaller cores" or are there additionnal differences in nature ?
The fundamental difference in parallel processing between a CPU and a GPU is that CPUs are MIMD (Multi-Instruction-Multi-Data), while GPUs are SIMD (Single-Instruction-Multi-Data). In a multicore CPU, each core fetches its instructions and data independently of the others, whereas in a GPU there is only one instruction stream for a group of cores (typically 32 or 64). While there is only one instruction stream for the 32/64 cores, each of them is working on different data elements (typically located together in memory; more below). Such SIMD execution means that the GPU cores operate in a lock-stepped fashion.
For the above mentioned reason, a GPU can't be viewed as a "kind of CPU with a higher (huge) number smaller cores".
In order to support SIMD execution (also sometimes called wide-execution), we need wide fetch of input data. For a 32-wide execution, we fetch a contiguous 4B x 32 block = 128B that is consumed (typically) entirely by a 32-wide pipe. Contrast this to a MIMD multicore, where each of 32 CPU cores would fetch a separate instruction and then load from 32 different cachelines. This SIMD nature of (wide-) instruction/data fetch results in huge power savings compared to MIMD. As a result, for the same power budget, we can put more cores on a GPU (=> more HW parallelism) than a multicore-CPU.
The SIMD nature of GPUs is driven by applications that do exactly the same operation over very many input elements (e.g.; Image processing where we apply a filter on every pixel of say a 1024x768 image) so that wide instruction/data fetch works well. At the same time, applications where each core's computation is different (e.g., take if() when input data is zero, or else() if input data is 1) or each core needs to fetch data from a different page fail to take advantage of the SIMD nature of GPUs.
A partially related fact is that GPUs support applications (e.g., images/videos) that are streaming (almost zero data reuse) and have massive data-parallelism. Streaming means that we don't need huge caches like CPUs, and massive data-parallelism almost entirely cuts the need for HW coherence mechanisms.

Query amount of VRAM or GPU clock speed

I am writing code to pick a physical device, but I want to put in some logic to prefer newer devices (more VRAM or higher clock speed) in case multiple ones fit my minimum feature requirements.
Is this possible?
Vulkan has no specific API calls to get such GPU details, for that you'd need to go with vendor specific APIs like NVAPI. The only hint may be deviceType member of VkPhysicalDeviceProperties that returns whether it's an integraed, discrete or virtual GPU.
The VRAM size though can be determined by finding the memory heap with the DEVICE_LOCAL bit set using vkGetPhysicalDeviceMemoryProperties. The VkPhysicalDeviceMemoryProperties returned by that function contains all available memory heaps in the memoryHeaps member. The configuration differs esp. between discrete and integrated GPUs, so this may not always be what you're looking for, e.g. on integrated GPUs with shared memory.
Heaps for a discrete GPU: http://vulkan.gpuinfo.org/displayreport.php?id=1432#memoryheaps
Heaps for an integrated GPU: http://vulkan.gpuinfo.org/displayreport.php?id=1200#memoryheaps

CUDA optimisation - kernel launch conditions

I am fairly new to CUDA and would like to find out more about optimising kernel launch conditions to speed up my code. This is quite a specific scenario but I'll try to generalise it as much as possible so anyone else with a similar question can gain from this in the future.
Assume I've got an array of 300 elements (Array A) that is sent to the kernel as an input. This array is made of a few repeating integers with each integer having a device function specific to it. For example, every time 5 appears in Array A, the kernel performs the function specific to 5. These functions are device functions.
How I have parallelised this problem is by launching 320 blocks (probably not the best number) so that each block will perform the device function relevant to its element in parallel.
The CPU would handle the entire problem in a serial fashion where it will take element by element and call each function one after the other whereas the GPU would allocate an element to each block so that all 320 blocks can access the relevant device functions and calculate simultaneously.
In theory for a large number of elements the GPU should be faster - at least I though so but in my case it isn't. My assumption is that since 300 elements is a small number the CPU will always be faster than the GPU.
This is acceptable BUT what I want to know is how I can cut down the GPU execution time at least by a little. Currently, the CPU takes 2.5 milliseconds and the GPU around 12 ms.
Question 1 - How can I choose the optimum number of blocks/threads to launch at the start?
First I tried 320 blocks with 1 thread per block. Then 1 block with 320 threads. No real change in execution time. Will tweaking the number of blocks/threads improve the speed?
Question 2 - If 300 elements is too small, why is that, and roughly how many elements do I need to see the GPU outperforming the CPU?
Question 3 - What optimisation techniques should I look into?
Please let me know if any of this isn't that clear and I'll expand on it.
Thanks in advance.
Internally, CUDA manages threads in groups of 32 (so-called warps). If you have 1 thread per block device will still execute 32 of those - 31 thread will simply be in divergent state. This is potentially an occupancy issue though you may not observe it on your device and with your problem size. There is also limit on number of blocks given multiprocessor (SM) can execute. AFAIR, GeForce 4x can run up to 8 blocks on one SM. Hence if you have a device with 8 SMs you can simultaneously run 64 threads if you have block size of 1. You can use a tool called occupancy calculator to estimate a better block size - or you can use a visual profiler.
This can only be decided by profiling. There are too many unknowns - e.g. what is your ratio of memory accesses to actual computations, how parallelizable your task is, etc.
I would really recommend you to start with best practices guide.

How to leverage blocks/grid and threads/block?

I'm trying to accelerate this database search application with CUDA, and I'm working on running a core algorithm in parallel with CUDA.
In one test, I run the algorithm in parallel across a digital sequence of size 5000 with 500 blocks per grid and 100 threads per block and came back with a runt time of roughly 500 ms.
Then I increased the size of the digital sequence to 8192 with 128 blocks per grid and 64 threads per block and somehow came back with a result of 350 ms to run the algorithm.
This would indicate that how many blocks and threads used and how they're related does impact performance.
My question is how to decide the number of blocks/grid and threads/block?
Below I have my GPU specs from a standard device query program:
You should test it because it depends on your particular kernel. One thing you must aim to do is to make the number of threads per block a multiple of the number of threads in a warp. After that you can aim for high occupancy of each SM but that is not always synonymous with higher performance. It was been shown that sometimes lower occupancy can give better performance. Memory bound kernels usually benefit more from higher occupancy to hide memory latency. Compute bound kernels not so much. Testing the various configurations is your best bet.