Exiting after N threads in a compute shader

Exiting after N threads in a compute shader - gpu

So I have a compute shader kernel with the following logic:
[numthreads(64,1,1)]
void CVProjectOX(uint3 t : SV_DispatchThreadID){
if(t.x >= TotalN)
return;
uint compt = DbMap[t.x];
....
I do understand that it's not ideal to have ifs elses/branching in compute shaders? if so, what is the best way to limit thread work if number of total expected threads aren't expected to match exactly the kernel's numthreads?
For instance in my example, the kernel group of 64 threads, let's say I expect total 961 threads (it could be anything really), if, I dispatch 960, 1 db slot won't be processed, if I dispatch 1024, there will be 63 unnecessary work or maybe work pointing to non-existing db slot. (db slots number will vary).
Is if(t.x > TotalN)/return fine and the right approach here?
Should I just do min, tx = min(t.x, TotalN) and keep writing on the final db slot?
Should I just modulo? tx = t.x % TotalN and rewrite the first db slots?
What other solutions?

Limiting the number of threads this way is fine, yes. But, be aware that an early return like this doesn't actually save (as much) work as you'd expect:
The hardware utilizes SIMD like thread collections (called wavefonts in directX). Depending on the hardware, the usual size of such a wavefont is usually 4 (Intel iGPUs), 32 (NVidia and most AMD GPUs) or 64 (a few AMD GPUs). Due to the nature of SIMD, all threads in such a wavefont always do exactly the same work, you can only "mask out" some of them (meaning, their writes will be ignored and they are fine reading out-of-bounds memory).
This means that, in the worst case (when the wavefont size is 64), when you need to execute 961 threads and are therefore dispatching 1024, there will still be 63 threads executing the code, they just behave like they wouldn't exist. If the wave size is smaller, the hardware might at least early out on some wavefonts, so in these cases the early return does actually save some work.
So it would be the best if you'd never actually need a number of threads that is not a multiple of your group size (which, in turn, is hopefully a multiple of the hardwares wavefont size). But, if that's not possible, limiting the number of threads in that way is the next best option, especially because all threads that do reach the early return are next to each other, which maximizes the chance that a whole wavefont can early out.

Related

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.

Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Software memory testing for bus failures

I have a board with quite a few flash chips, some of them are showing intermittent failures. Standard memory tests are not showing any specific problem addresses, other than certain chips are failing intermittently under mechanical and thermal stress.
Suspecting the actual connections and not the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.
There are some memory tests but they apply better to RAM rather than flash memory (http://www.ganssle.com/testingram.htm). Specifically, the parallel flash has a sequence of bus writes to write to each value; a write/verify failure could easily be the write operation which could be any pin on the bus.
Ideas welcome...

The typical memory tests are there to do that. I prefer a pseudo randomizer (deterministic using an lfsr) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as data bus test in two passes (repeat inverted). I say typical in the sense of wiggle the data bits and address bits both states each and vary the states of signals and their neighbors. The pounding on a ram to create thermal or other stresses, well you cant write very fast to a flash so you cant really do fast write/read cycles.
Flash creates another problem and that is writing then reading back isnt that interesting, you want to write the read back later, hours, days, weeks to determine if the part is actually holding data.
When you say thermal or stress do you mean only during the time it is above X degrees it fails, or do you mean that due to thermal stress it is broken all the time after the event. Likewise with mechanical, while vibrating or under mechanical stress the part fails, but when relieved of that stress it is okay, or the mechanical stress has done permanent damage that can be detected under stress or not.
Now although you cant do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems by constant reading of one block or location. Not necessarily something you have time to do for every location, but you might fill the ram with a pseudo random pattern and concentrate on one location for a while, (minutes, tens of minutes), if you have a part that you know is bad see if this accelerates the detection of the problem and if any location will work or only certain ones. then another thing is to read all the locations repetitively for hours/days or leave it sit for hours/days/weeks and then do a read pass without an erase or write and see if it has lost anything.
unfortunately as you probably know each new failure case takes its own research project and development of a new test.

First step to test a memory is data bus test0 0 0 0 0 0 0 • In this test, data bus wiring is properly tested to0 0 0 0 0 0 0 confirm that the value placed on data bus by processor0 0 0 0 0 0 0 is correctly received by memory device at the other end0 0 0 0 0 0 00 0 0 0 0 0 0 • An obvious way to test is to write all possible0 0 0 0 0 0 0 data values and verify 0 0 0 0 0 0 0 • Each bit can be tested independently• To perform walking 1s test, write the first data value given in the table, verify by reading it back, write the second value, verify and so on. • When you reach the end of the table, the test is complete

In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."
Since reading should be isolated from writing, testing the flash is easier. Perform the writing portion of the tests while the system is not under stress. Then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
If the system never fails when doing the above, you can then perform the whole tests while under stress. Any errors that appear are most likely write errors.

I've decided to design a memory pattern that I think I can deduce both data and address errors from. The concept is to use values significantly different as key indicators of possible read errors. The concept is also to detect a failure on one pin at a time.
The test will read alternately from only bottom and top addresses (0x000000 and 0x3FFFFF - my chip has 22 address lines). In those locations I will put 0xFF and 0x00 respectively (byte wide). The idea is to flip all address and data lines and see what happens. (All other values in the flash have at least 3 bits different from 0x00 and 0xFF)
There are 44 addresses that a single pin failure could send me to in error. In each address put one of 22 values to represent which of the 22 address pin was flipped. Each are 2 bits different from each other, and 3 bits different from 00 and FF. (I tried for 3 bits different from each other but 8 bits could only get 14 values)
07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
2F,34,38,3D,3E,43,49,4A,4F,52,58
The remaining addresses I put a nice pattern of six values 33,55,66,99,AA,CC. (3 bits different from all other values) value(address) = nicePattern[ sum of bits set in address % 6];
I tested this and have statistically collected 100s of intermittent failure incidents synchronized to the mechanical stress.
single bit errors detectable
double bit errors deducible (Explainable by a combination of frequent single bit errors)
3 or more bit errors (generally inconclusive)
Even though some of the chips had 3 failing pins, 70% of the incidents were single bit (they usually didn't fail at the same time)
The testing group is now using this to identify which specific connections are failing.

Neon VLD consuming more cycles than what is expected?

I have a simple asm code which loads 12 quad registers of NEON, and have paralleled pairwise add instruction along with the load instruction ( to exploit the dual issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see, the code is taking around 13 cycles. But when I load the code on the board, the load instructions seems to take more than one cycle per load, I verified and found out that the VPADAL is taking 1 cycle as stated, but VLD1 is taking more than one cycle. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
Have provided the alignment hint in the instruction vld1.64 {d0, d1} [r0,:128]!
Tried preload instruction pld [r0, #192], at places but that seems to add to the cycles instead of actually reducing the latency.
Can someone tell me what am I doing wrong, why this latency?
Other Details:
With reference to cortex-a8
arm-2009q1 cross compiler tool chain
coding in assembly

Your code is executing much slower than expected because as it's currently written, it's causing the perfect storm of pipeline stalls. On any modern CPU with a pipelined architecture, instructions can execute in one cycle under ideal conditions. The ideal conditions are that the instruction is not waiting for memory and doesn't have any register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory and making the next instruction dependent on the results of the read. This is causing the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12 # clear accumulated sum
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).

Advice for bit level manipulation

I'm currently working on a project that involves a lot of bit level manipulation of data such as comparison, masking and shifting. Essentially I need to search through chunks of bitstreams between 8kbytes - 32kbytes long for bit patterns between 20 - 40bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?

There has been a least a couple of questions on SO on how to do text searches with CUDA. That is, finding instances of short byte-strings in long byte-strings. That is similar to what you want to do. That is, a byte-string search is much like a bit-string search where the number of bits in the byte-string can only be a multiple of 8, and the algorithm only checks for matches every 8 bits. Search on SO for CUDA string searching or matching, and see if you can find them.
I don't know of any general resources for this, but I would try something like this:
Start by preparing 8 versions of each of the search bit-strings. Each bit-string shifted a different number of bits. Also prepare start and end masks:
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads that each checks a different version of the 8 shifted bit-strings against the long bit-string (which you now treat like a byte-string). In each block, launch enough threads to check, for instance, 32 bytes, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, so that you get good performance.

cuda filter with ouput of this block is the input of the next block

Working on a filter following, I am having a problem of doing these pieces of codes for processing an image in GPU:
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
If I define:
dim3 threads_perblock(32, 32)
then each block I have: 32 threads can be communicated. The threads of this block can not communicate with the threads from other blocks.
Within a thread_block, I can translate that pieces of code using shared_memory however, for edge (I would say): image[0,31] and image[0,32] in different threadblocks. The image[0,31] should get value from image[0,32] to calculate its value. But they are in different threadblocks.
so that is the problem.
How would I solve this?
Thanks in advance.

If image is in global memory then there is no problem - you don't need to use shared memory and you can just access pixels directly from image without any problem.
However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.

You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute and the thread processing a[31] do
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas