GPU shared memory size is very small - what can I do about it?

The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most NVIDIA GPUs available today.
I have an application in which I need to create an array of 10,000 integers, so the amount of memory I will need to fit 10,000 integers = 10,000 * 4 B = 40 KB.
How can I work around this?
Is there any GPU that has more than 16 KiB of shared memory?

Think of shared memory as explicitly managed cache. You will need to store your array in global memory and cache parts of it in shared memory as needed, either by making multiple passes or some other scheme which minimises the number of loads and stores to/from global memory.
How you implement this will depend on your algorithm - if you can give some details of what it is exactly that you are trying to implement you may get some more concrete suggestions.
One last point - be aware that shared memory is shared between all threads in a block, so you have far less than 16 KB per thread unless you have a single data structure which is common to all the threads in a block.
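To make that concrete, here is a minimal sketch (hypothetical kernel and parameter names) of the multi-pass idea: the full 40 KB array stays in global memory and the block streams it through a small shared-memory tile, one chunk per iteration.

// Sketch only: the block loops over the global array in TILE-sized chunks.
#define TILE 256                               // 256 ints = 1 KB of shared memory

__global__ void process_in_tiles(const int *g_in, int *g_out, int n)
{
    __shared__ int tile[TILE];                 // well under the 16 KB limit
    for (int base = 0; base < n; base += TILE) {
        int idx = base + threadIdx.x;
        if (idx < n)
            tile[threadIdx.x] = g_in[idx];     // stage one chunk into shared memory
        __syncthreads();                       // whole block now sees the chunk
        // ... operate on tile[] here ...
        if (idx < n)
            g_out[idx] = tile[threadIdx.x];    // write the processed chunk back
        __syncthreads();                       // don't refill the tile while it's in use
    }
}

Launched as, say, process_in_tiles<<<1, TILE>>>(d_in, d_out, 10000); adjust the tile size and launch configuration to whatever your real access pattern needs.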

All compute capability 2.0 and greater devices (most GPUs from the last year or two) have 48 KB of available shared memory per multiprocessor. That being said, Paul's answer is correct in that you likely will not want to load all 10K integers into a single multiprocessor.
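You can also just ask the device at runtime how much shared memory it offers per block; a quick sketch using the CUDA Runtime API (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                      // query device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}

On compute capability 2.0 and newer parts this typically reports 49152 bytes (48 KB) rather than 16 KB.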

You can try using the cudaFuncSetCacheConfig(nameOfKernel, cudaFuncCachePrefer{Shared, L1}) function.
If you prefer L1 to Shared, then 48KB will go to L1 and 16KB will go to Shared.
If you prefer Shared to L1, then 48KB will go to Shared and 16KB will go to L1.
Usage:
cudaFuncSetCacheConfig(matrix_multiplication, cudaFuncCachePreferShared);
matrix_multiplication<<<bla, bla>>>(bla, bla, bla);

Related

Exiting after N threads in a compute shader

So I have a compute shader kernel with the following logic:
[numthreads(64,1,1)]
void CVProjectOX(uint3 t : SV_DispatchThreadID){
if(t.x >= TotalN)
return;
uint compt = DbMap[t.x];
....
I understand that it's not ideal to have if/else branching in compute shaders. If so, what is the best way to limit thread work when the total number of expected threads doesn't exactly match a multiple of the kernel's numthreads?
For instance, in my example the kernel group is 64 threads. Say I expect 961 threads in total (it could be anything really): if I dispatch 960, one db slot won't be processed; if I dispatch 1024, there will be 63 threads doing unnecessary work, or maybe work pointing to a non-existing db slot (the number of db slots will vary).
Is the if (t.x >= TotalN) / return check fine and the right approach here?
Should I just clamp, tx = min(t.x, TotalN), and keep writing to the final db slot?
Should I just use modulo, tx = t.x % TotalN, and rewrite the first db slots?
What other solutions are there?
Limiting the number of threads this way is fine, yes. But be aware that an early return like this doesn't actually save as much work as you'd expect:
The hardware executes threads in SIMD-like groups (called waves in Direct3D; warps on NVIDIA, wavefronts on AMD). Depending on the hardware, the size of such a wave is usually 4 (Intel iGPUs), 32 (NVIDIA and most AMD GPUs) or 64 (a few AMD GPUs). Due to the nature of SIMD, all threads in such a wave always do exactly the same work; you can only "mask out" some of them (meaning their writes will be ignored and they are fine reading out-of-bounds memory).
This means that, in the worst case (when the wave size is 64), when you need to execute 961 threads and therefore dispatch 1024, there will still be 63 threads executing the code; they just behave as if they didn't exist. If the wave size is smaller, the hardware can at least exit early on some whole waves, so in those cases the early return does save some work.
So it would be best if you never actually needed a number of threads that is not a multiple of your group size (which, in turn, is hopefully a multiple of the hardware's wave size). But if that's not possible, limiting the number of threads this way is the next best option, especially because all the threads that do hit the early return are next to each other, which maximizes the chance that a whole wave can exit early.
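For reference, the same pattern expressed in CUDA terms (purely illustrative, hypothetical names): round the launch up to a whole number of groups and keep the guard in the kernel.

__global__ void project(const unsigned *db_map, unsigned total_n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total_n)                     // extra lanes in the last group are masked out
        return;
    unsigned compt = db_map[i];
    // ... use compt ...
}

// host side: ceil(total_n / 64) groups, e.g. 961 threads -> 16 groups of 64
unsigned groups = (total_n + 63) / 64;
project<<<groups, 64>>>(d_db_map, total_n);

The HLSL version is the same idea: Dispatch((TotalN + 63) / 64, 1, 1) and keep the if (t.x >= TotalN) return; guard.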

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128 B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32 B transactions (under reasonable assumptions); we thus have to handle the first elements, up to the first multiple of 32 B, and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128 B (or is it just 32 B?); we thus have to handle the first elements, up to the first multiple of this number, and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
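To have something concrete to measure the factors above against, a naive baseline (hypothetical name, CUDA) that handles only the coalescing factor would look like this: each of the 32 lanes copies every 32nd element, so consecutive lanes touch consecutive addresses on every iteration.

template <typename T>
__device__ T* naive_warp_copy(T* __restrict__ destination,
                              const T* __restrict__ source,
                              size_t length)
{
    unsigned lane = threadIdx.x % 32;      // assumes a warp size of 32
    for (size_t i = lane; i < length; i += 32)
        destination[i] = source[i];        // coalesced: lane k touches element base+k
    return destination + length;
}

Most of the factors listed above (packing sub-4-byte elements, the alignment slack at both ends, wider per-lane accesses) are refinements on top of this loop.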

How to read data from GPU memory, not using memcpy?

In the Vulkan API, how can we read data from GPU memory, like some data that was calculated by a compute shader?
First wait on the fence related to the compute invocation. Then map the memory you wrote the result into, and if the memory is not host-coherent you need to invalidate the mapped range.
Read the data out of the pointer you got from the mapping operation.
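Put together, the readback path looks roughly like this (a sketch against the Vulkan C API; device, fence, memory and the buffer layout are assumed to already exist in your code):

vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);    // wait for the compute work

void *mapped = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

// Only needed if the memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT:
VkMappedMemoryRange range = {};
range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range.memory = memory;
range.offset = 0;
range.size   = VK_WHOLE_SIZE;
vkInvalidateMappedMemoryRanges(device, 1, &range);

// read the results directly through 'mapped', then:
vkUnmapMemory(device, memory);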
I've just gone through the same issue. I think ratchet freak's comment got to the point. In my case, I was trying to transfer data from a texture (VkImage) to host memory. I used a linear buffer (VkBuffer) as the staging buffer. I originally used
VkMemoryPropertyFlags flag = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
and found memcpy() very slow. Then I added VK_MEMORY_PROPERTY_HOST_CACHED_BIT and the speed became about 10x faster.
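If it helps, explicitly picking such a memory type looks roughly like this (a sketch; physicalDevice and the buffer's memoryTypeBits come from your own setup):

VkPhysicalDeviceMemoryProperties memProps;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);

VkMemoryPropertyFlags wanted = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                               VK_MEMORY_PROPERTY_HOST_CACHED_BIT;
uint32_t typeIndex = UINT32_MAX;                 // UINT32_MAX = not found
for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i) {
    if ((memoryTypeBits & (1u << i)) &&
        (memProps.memoryTypes[i].propertyFlags & wanted) == wanted) {
        typeIndex = i;                           // host-visible and host-cached
        break;
    }
}
// if typeIndex is still UINT32_MAX, fall back to a plain HOST_VISIBLE type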

Number of GCs performed for various generations from a dump file

Is there any way to get information about how many garbage collections have been performed for the different generations from a dump file? When I try to run some PSSCOR4 commands I get the following.
0:003> !GCUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
0:003> !CLRUsage
The garbage collector data structures are not in a valid state for traversal.
It is either in the "plan phase," where objects are being moved around, or
we are at the initialization or shutdown of the gc heap. Commands related to
displaying, finding or traversing objects as well as gc heap segments may not
work properly. !dumpheap and !verifyheap may incorrectly complain of heap
consistency errors.
Error: Requesting GC Heap data
I can get output from !EEHeap though, but it does not give me what I am looking for.
0:003> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000000002c81030
generation 1 starts at 0x0000000002c81018
generation 2 starts at 0x0000000002c81000
ephemeral segment allocation context: none
segment begin allocated size
0000000002c80000 0000000002c81000 0000000002c87fe8 0x6fe8(28648)
Large object heap starts at 0x0000000012c81000
segment begin allocated size
0000000012c80000 0000000012c81000 0000000012c9e358 0x1d358(119640)
Total Size: Size: 0x24340 (148288) bytes.
------------------------------
GC Heap Size: Size: 0x24340 (148288) bytes.
Dumps
You can see the number of garbage collections in performance monitor. However, the way performance counters work makes me believe that this information is not available in a dump file and probably even not available during live debugging.
Think of Debug.WriteLine(): once the text was written to the debug output, it is gone. If you didn't have DebugView running at the time, the information is lost. And that's good, otherwise it would look like a memory leak.
Performance counters (as I understand them) work in a similar fashion. Various "pings" are sent out for someone else (the performance monitor) to record. If no one does, the ping with all its information is gone.
Live debugging
As already mentioned, you can try performance monitor. If you prefer WinDbg, you can use sxe clrn to see garbage collections happen.
PSSCOR
The commands you mentioned do not show information about the garbage collection count:
0:016> !gcusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
0:016> !clrusage
Number of GC Heaps: 1
------------------------------
GC Heap Size 0x36d498(3,593,368)
Total Commit Size 0000000000384000 (3 MB)
Total Reserved Size 0000000017c7c000 (380 MB)
Note: I'm using PSSCOR2 here, since I have the same .NET 4.5 issue on this machine. But I expect the output of PSSCOR4 to be similar.

NASM Prefetching

I ran across the below instructions in the NASM documentation, but I can't quite make heads or tails of them. Sadly, the Intel documentation on these instructions is also somewhat lacking.
PREFETCHNTA m8 ; 0F 18 /0 [KATMAI]
PREFETCHT0 m8 ; 0F 18 /1 [KATMAI]
PREFETCHT1 m8 ; 0F 18 /2 [KATMAI]
PREFETCHT2 m8 ; 0F 18 /3 [KATMAI]
Could anyone possibly provide a concise example of the instructions, say to cache 256 bytes at a given address? Thanks in advance!
These instructions are hints used to suggest that the CPU try to prefetch a cache line into the cache. Because they're hints, a CPU can ignore them completely.
If the CPU does support them, then the CPU will try to prefetch but will give up (and won't prefetch) if a TLB miss would be involved. This is where most people get it wrong (e.g. fail to do "preloading", where you insert a dummy read to force a TLB load so that prefetching isn't prevented from working).
The amount of data prefetched is 32 bytes or more, depending on the CPU, etc. You can use CPUID to determine the actual size (CPUID leaf 0x00000004, the "System Coherency Line Size" returned in EBX bits 0 to 11; add 1 to get the size in bytes).
If you prefetch too late it doesn't help, and if you prefetch too early the data can be evicted from the cache before it's used (which also doesn't help). There's an appendix in Intel's "IA-32 Intel Architecture Optimisation Reference Manual" that describes how to calculate when to prefetch, called "Mathematics of Prefetch Scheduling Distance" that you should probably read.
Also don't forget that prefetching can decrease performance (e.g. cause data that is needed to be evicted to make room) and that if you don't prefetch anything the CPU has a hardware prefetcher that will probably do it for you anyway. You should probably also read about how this hardware prefetcher works (and when it doesn't). For example, for sequential reads (e.g. memcmp()) the hardware prefetcher does it for you and using explicit prefetches is mostly a waste of time. It's probably only worth bothering with explicit prefetches for "random" (non-sequential) accesses that the CPU's hardware prefetcher can't/won't predict.
After sifting through some examples of heavily-optimized memcmp functions and the like, I've figured out how to use these instructions (somewhat) effectively.
These instructions imply a cache "line" of 32 bytes, something I missed originally. Thus, to prefetch a 256-byte buffer with the t1 hint, the following sequence could be used:
prefetcht1 [buffer]
prefetcht1 [buffer+32]
prefetcht1 [buffer+64]
prefetcht1 [buffer+96]
prefetcht1 [buffer+128]
prefetcht1 [buffer+160]
prefetcht1 [buffer+192]
prefetcht1 [buffer+224]
The t0 suffix instructs the CPU to prefetch the line into every level of the cache hierarchy.
t1 instructs that the data be cached into L2 and higher levels (skipping L1).
t2 continues this trend, prefetching into L3 and higher where available.
The "nta" suffix is a bit different: it marks the data as non-temporal, so the CPU fetches it while minimizing cache pollution (for example by bypassing lower cache levels or marking the line for early eviction). This can be quite useful for incredibly large data structures that are only touched once, since more relevant data can stay cached.