I have two pieces of code: one written in C and the corresponding operation written in CUDA.
Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
C program:
{
    for (i = 1; i < 10000; i++)
    {
        t = a[i] + b[i];
        a[i-1] = t;
    }
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;
    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid-1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel Dimensions:
dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread. The same then happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks of 32 threads each, how does the dependency get respected?
As per my understanding, __syncthreads() ensures synchronization of
threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
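For instance, a block-local reversal through shared memory needs a barrier between the write phase and the read phase. A minimal sketch (the kernel name is made up, and it assumes a launch with blockDim.x == 256):

```cuda
__global__ void reverse_block(const int *in, int *out)
{
    __shared__ int tile[256];                 // one element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];                // stage 1: every thread writes
    __syncthreads();                          // wait until the whole tile is filled
    // stage 2: now it is safe to read what *other* threads of this block wrote
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Without the barrier, a thread might read a tile slot that its neighbor has not written yet.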
Given the nature of problem, which has dependency of a[] onto itself,
a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.
Just imagine thread 2 reaching the if statement: since it matches the condition, it enters the body. That thread then does the following:
private_t=a[2]+b[2];
a[1]=private_t;
Which is equivalent to:
a[1]=a[2]+b[2];
As you pointed out, there is a data dependency on array a. Since you can't control the order of execution of the warps, at some point you'll be using an already-updated version of the array. In my opinion, you need to add an extra __syncthreads() statement:
if (tid < 10000)
    private_t = a[tid] + b[tid];
__syncthreads();   // barriers must be reached by every thread of the block,
                   // so they are hoisted out of the divergent if
if (tid > 0 && tid < 10000)
    a[tid-1] = private_t;
__syncthreads();
if (tid == 9999)
    *t = private_t;
In this way, every thread computes its own private_t value using the original array a, and then the array is updated in parallel.
About the *t value:
If you're only looking at the value of *t, you'll not notice the effect of this random scheduling, whatever the launching parameters. That's because the thread with tid==9999 could be in the last warp along with the thread tid==9998; since the two array positions needed to create its private_t value are read before the barrier, and you already had that synchronization barrier, the answer should be right.
We know that there are techniques that make virtual calls less expensive in the JVM, such as the Inline Cache or the Polymorphic Inline Cache.
Let's consider the following situation:
Base is an interface.
public void f(Base[] b) {
for(int i = 0; i < b.length; i++) {
b[i].m();
}
}
I see from my profiler that calling virtual (interface) method m is relatively expensive.
f is on the hot path and was compiled to machine code (C2), but I see that the call to m is a real virtual call, which means it was not optimised by the JVM.
The question is, how do I deal with such a situation? Obviously, I cannot make the method m non-virtual here, because that would require a serious redesign.
Can I do anything, or do I have to accept it? I was thinking about how to "force" or "convince" the JVM to:
use a polymorphic inline cache here - the number of different types in b is quite low, between 4 and 5 types;
unroll this loop - the length of b is also relatively small. After an unroll it is possible that an inline cache will be helpful here.
Thanks in advance for any advice.
Regards,
HotSpot JVM can inline up to two different targets of a virtual call, for more receivers there will be a call via vtable/itable [1].
To force inlining of more receivers, you may try to devirtualize the call manually, e.g.
Base e = b[i];
if (e.getClass() == X.class) {
    ((X) e).m();
} else if (e.getClass() == Y.class) {
    ((Y) e).m();
} ...
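Put together as a tiny self-contained sketch (the classes X, Y and the call helper are invented for illustration; whether C2 actually inlines each branch is JVM-dependent):

```java
interface Base { int m(); }
final class X implements Base { public int m() { return 1; } }
final class Y implements Base { public int m() { return 2; } }

class Devirt {
    // Manual devirtualization: an exact class check plus a cast gives the
    // JIT monomorphic call sites it can inline, instead of one megamorphic one.
    static int call(Base e) {
        if (e.getClass() == X.class) return ((X) e).m();
        if (e.getClass() == Y.class) return ((Y) e).m();
        return e.m(); // fallback: ordinary virtual dispatch
    }

    public static void main(String[] args) {
        Base[] b = { new X(), new Y(), new X() };
        int sum = 0;
        for (Base e : b) sum += call(e);
        System.out.println(sum); // 1 + 2 + 1 = 4
    }
}
```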
During execution of profiled code (in the interpreter or C1), the JVM collects receiver type statistics per call site. These statistics are then used in the optimizing compiler (C2). There is just one call site in your example, so the statistics will be aggregated throughout the entire execution.
However, for example, if b[0] always has just two receivers X or Y, and b[1] always has another two receivers Z or W, JIT compiler may benefit from splitting the code into multiple call sites, i.e. manual unrolling:
int len = b.length;
if (len > 0) b[0].m();
if (len > 1) b[1].m();
if (len > 2) b[2].m();
...
This will split the type profile, so that b[0].m() and b[1].m() can be optimized individually.
These are low-level tricks relying on a particular JVM implementation. In general, I would not recommend them for production code, since these optimizations are fragile and they definitely make the source code harder to read. After all, megamorphic calls are not that bad [2].
[1] https://shipilev.net/blog/2015/black-magic-method-dispatch/
[2] https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/
I have recently started to work with CUDA; I have multithreaded, multiprocess coding experience in C++, Java and Python.
With PyCuda I see example code like this:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")
It seems the thread id itself partakes in the logic of the code. The question, then, is: will there be enough thread ids to cover my entire array (whose indexing I apparently need to reach all of its elements), and what happens if I change the size of the array? Will the indexing always be between 0 and N?
In CUDA the thread id is only unique per so-called thread block, meaning that your example kernel only does the right thing with a single block doing the work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread-block also limits you in the number of threads and therefore in the problem-size, if your kernel doesn't contain a loop so every thread can compute more than one element.
Kernel execution is organized strictly hierarchically: restricting ourselves to one-dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads numbered per block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a unique ID for a thread (in a fashion that ideally uses the hardware to load elements from an array), you have to take blockIdx.x * blockDim.x + threadIdx.x. If more than one element shall be computed by every thread, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
/* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the number of all threads working on the kernel. Different strides (especially having a thread working on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.
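The index arithmetic can be checked without a GPU. Here is a small Python model (purely illustrative; the function name is made up) showing that a grid-stride loop makes the threads collectively visit every index in [0, N) exactly once:

```python
def grid_stride_indices(grid_dim, block_dim, n):
    """Simulate which indices each CUDA thread would touch in a grid-stride loop."""
    total = grid_dim * block_dim                    # gridDim.x * blockDim.x
    covered = []
    for block in range(grid_dim):
        for thread in range(block_dim):
            i = block * block_dim + thread          # blockIdx.x * blockDim.x + threadIdx.x
            while i < n:                            # same condition as the CUDA loop
                covered.append(i)
                i += total                          # stride by the whole grid
    return covered

# 4 blocks of 8 threads over 100 elements: every index is visited exactly once.
assert sorted(grid_stride_indices(4, 8, 100)) == list(range(100))
```

The same holds when the grid is larger than the problem, e.g. 313 blocks of 32 threads over 10000 elements: the surplus threads simply never enter the loop body.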
I need to do many comparisons in an OpenCL program. Right now I do it like this:
int memcmp(__global unsigned char* a, __global unsigned char* b, int size) {
    for (int i = 0; i < size; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4 or something else? Thanks!
I guess that your kernel compares "size" elements per thread. I think your code can improve if the accesses are more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can imply a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers hold 512 uchars. In your case, thread #0 accesses a[0] and b[0], but it brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, etc. After thread #3, the whole buffer is in the cache. This is not a problem here, taking into account the small size of this domain.
However, if consecutive threads access consecutive elements (thread #0 reads a[0], thread #1 reads a[1], and so on), only one "cache line" is needed at a time for your 4-thread execution (the accesses are coalesced). The behavior will be better when more threads per block are considered. Please try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old but their concepts are not so old.
PS: By the way, after this I would try to use uchar4 as you mentioned, and also loop unrolling.
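A possible uchar4 variant might look like this (a sketch only, untested; it assumes the buffer size is a multiple of 4 and relies on OpenCL's any() builtin, which tests whether any component of the vector comparison is true):

```opencl
int memcmp4(__global const uchar4* a, __global const uchar4* b, int size4) {
    // size4 = size / 4: compare four bytes per iteration
    for (int i = 0; i < size4; i++) {
        if (any(a[i] != b[i]))   // component-wise !=, reduced with any()
            return 0;
    }
    return 1;
}
```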
EDIT: I realized that I unfortunately overlooked a semicolon at the end of the while statement in the first example code and misinterpreted it myself. So there is in fact an empty loop for threads with threadIdx.x != s, a convergence point after that loop, and a thread waiting at this point for all the others without incrementing the s variable. I am leaving the original (uncorrected) question below for anyone interested in it. Be aware that there is a semicolon missing at the end of the second line in the first example, and thus s++ has nothing in common with the loop body.
--
We were studying serialization in our CUDA lesson and our teacher told us that a code like this:
__shared__ int s = 0;
while (s != threadIdx.x)
    s++; // serialized code
would end up with a HW deadlock because the nvcc compiler puts a reconvergence point between the while (s != threadIdx.x) and s++ statements. If I understand it correctly, this means that once the reconvergence point is reached by a thread, this thread stops execution and waits for the other threads until they reach the point too. In this example, however, this never happens, because thread #0 enters the body of the while loop, reaches the reconvergence point without incrementing the s variable and other threads get stuck in an endless loop.
A working solution should be the following:
__shared__ int s = 0;
while (s < blockDim.x)
    if (threadIdx.x == s)
        s++; // serialized code
Here, all threads within a block enter the body of the loop, all evaluate the condition and only thread #0 increments the s variable in the first iteration (and loop goes on).
My question is, why does the second example work if the first hangs? To be more specific, the if statement is just another point of divergence and, in terms of assembly language, should be compiled into the same conditional jump instruction as the condition in the loop. So why isn't there any reconvergence point before s++ in the second example, or has it in fact been placed immediately after the statement?
In other sources I have only found that divergent code is computed independently for every branch - e.g. in an if/else statement, first the if branch is computed with all else-branch threads masked within the same warp, and then the remaining threads compute the else branch while the first ones wait. There's a reconvergence point after the if/else statement. Why then does the first example freeze, not having the loop split into two branches (a true branch for one thread and a waiting false branch for all the others in a warp)?
Thank you.
It does not make sense to put the reconvergence point between while (s != threadIdx.x) and s++;. It disrupts the program flow, since the reconvergence point for a piece of code should be reachable by all threads at compile time. The picture below shows the flowchart of your first piece of code and the possible and impossible points of reconvergence.
Regarding this answer about recording the convergence point via the SSY instruction, I created the simple kernel below, resembling your first piece of code:
__global__ void kernel_1() {
__shared__ int s;
if(threadIdx.x==0)
s = 0;
__syncthreads();
while (s == threadIdx.x)
s++; // serialized code
}
and compiled it for CC=3.5 with -O3. Below is the result of running the cuobjdump binary tool on the output to observe the CUDA assembly. The result is:
I'm not an expert in reading CUDA assembly, but I can see the while loop condition checks at lines 0038 and 00a0. At line 00a8, it branches to 0x80 if the while loop condition is satisfied and executes the code block again. The reconvergence point is introduced at line 0058, designating line 0xb8 as the reconvergence point, which is after the loop condition check near the exit.
Overall, it is not clear what you're trying to achieve with this piece of code. Also, in the second piece of code, the reconvergence point should again be after the while loop code block (I don't mean between the while and the if).
The reason why it "hangs" is neither a HW deadlock nor branching, at least not directly. You produce an endless loop for one or multiple threads (as already suspected).
In your example, there isn't really a convergence point. Since you do not use any synchronization, there aren't any threads that actually wait. What happens here with the while-loop is pretty much a busy-wait.
A kernel only finishes if all threads return. Since you have one (or multiple) endless loops (by accident maybe even none - this is unlikely however) the kernel will never finish.
You declared a shared variable s. This variable is known to all threads within a block.
With your while-statement you basically say (to each thread): increment s until it reaches the value of your (local) thread id. Since all threads are incrementing s in parallel, you introduce race conditions.
Example:
Thread 5 is looping and checking for s to become 5
s is 4
Two threads increment s, it becomes 6
At the same time thread 5 only reached the end of its loop.
Now it reaches the next loop iteration and checks for s and it's not 5.
Thread 5 will never be able to finish since you check via == and the value of s already exceeded the value of the thread id.
Also, your solution is quite confusing, because each thread executes the serialized code consecutively (which probably was the intention after all - even though that is actually strange):
Thread 0 will execute the serialized code
After that, thread 1 will execute the serialized code
and so on
Most examples show a program where each thread works on some code, then all threads are synchronized and only single thread executes some more code (maybe it needed the results of all threads).
So, your second example "works" because no thread is stuck in an endless loop. However, I can't think of a reason why anyone would use such code, since it is confusing and, well, not parallel at all.
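The serialization in that second example can be modeled outside CUDA. A small Python sketch (illustrative only; it assumes idealized lock-step execution, where all threads evaluate the loop condition before anyone increments s):

```python
def serialized_order(block_dim):
    """Model of: while (s < blockDim.x) if (threadIdx.x == s) s++;"""
    s = 0                   # the __shared__ counter
    order = []              # order in which the "serialized code" runs
    while s < block_dim:
        # all threads evaluate the condition in lock-step...
        active = [tid for tid in range(block_dim) if tid == s]
        for tid in active:  # ...and only the matching thread increments s
            order.append(tid)
            s += 1
    return order

# Each thread runs the serialized code exactly once, in thread-id order.
assert serialized_order(8) == list(range(8))
```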
I'm quite new to CUDA programming and need help to proceed.
Right now I'm working on a CUDA project where I need to split an application between the CPU and GPU and measure performance.
For example, for a matrix addition program with an array size of, say, 1000, I want to split the array into halves, give the first half (500 jobs) to the CPU and the second half (500 jobs) to the GPU, and combine the final output. I'm not quite sure how to go about it.
I saw another link where they suggested using thread pools and a queue. I started to use pthreads to do the same, but I was clueless about how to use them for the GPU part of the code, as internally I have to invoke a kernel. I actually created two pthreads, one intended for the CPU and another for the GPU:
for (int i = 0; i < NUMTHRDS; i++) {
    pthread_create(&thds[i], &attr, decideFunc, (void*)i);
}
and decideFunc is as follows. When I executed this, my pthread at times went into the CPU branch twice. I want to execute the first half on the CPU and the second half on the GPU and combine them appropriately.
How can I do that?
void* decideFunc(void *arg) {
    int id, first, last;
    id = (long)arg;
    first = id * TBSIZE/NUMTHRDS;
    last = (id + 1) * TBSIZE/NUMTHRDS;
    printf("id:%d\n\n", id);
    if (id == 0) {
        printf("In CPU thread");
        matrixAddCPU(first, last);   // **Can I invoke another function here?**
        print_result(P, first, last); // P is the array storing my result
    } else {
        printf("In GPU thread");
        matrixAddGPU(first, last);   // This didn't seem to be correct
    }
    pthread_exit((void*)0);
}
and my matrixAddCPU()
void matrixAddCPU(int f, int l) {
    printf("\nInvoked f:%d l:%d\n", f, l);
    int row, col;
    for (row = f; row < l; row++) {
        for (col = f; col < l; col++) {
            P[row*WIDTH+col] = N[row*WIDTH+col] + M[row*WIDTH+col];
            printf("P[%d*%d+%d]:%f\t", row, WIDTH, col, P[row*WIDTH+col]);
        }
        printf("\n");
    }
}
I couldn't find any tutorials that do this kind of operation. I don't want to use CUDA streams, as I don't intend to create multiple kernels. All I want is to split a simple application between the CPU and GPU and watch their performance in a profiler tool.
Is using pthreads the right way to go? If so, I would appreciate any further details, guidance, or a tutorial that will help.