I'm quite new to CUDA programming and need help to proceed.
Right now I'm working on a CUDA project where I need to split an application between the CPU and GPU and measure the performance of each.
For example, for a matrix addition program with an array size of, say, 1000, I want to split the array in half, give the first half (500 jobs) to the CPU and the other half (500 jobs) to the GPU, and combine the final output. I'm not quite sure how to go about it.
I saw another post where thread pools and a queue were suggested. I started using pthreads to do this, but I was clueless how to handle the GPU part of the code, since inside that thread I have to invoke a kernel. I created two pthreads, one intended for the CPU and the other for the GPU:
for (int i = 0; i < NUMTHRDS; i++) {
    pthread_create(&thds[i], &attr, decideFunc, (void *)(long)i);
}
decideFunc is shown below. When I run this, both of my pthreads sometimes take the CPU path. I want the first half executed on the CPU and the second half on the GPU, with the results combined appropriately.
How can I do that?
void* decideFunc(void *arg) {
    int id, first, last;
    id = (long)arg;
    first = id * TBSIZE / NUMTHRDS;
    last  = (id + 1) * TBSIZE / NUMTHRDS;
    printf("id:%d\n\n", id);
    if (id == 0) {
        printf("In CPU thread");
        matrixAddCPU(first, last);     // **Can I invoke another function here?**
        print_result(P, first, last);  // P is the array that stores my result
    } else {
        printf("In GPU thread");
        matrixAddGPU(first, last);     // This didn't seem to be correct
    }
    pthread_exit((void*)0);
}
and my matrixAddCPU():
void matrixAddCPU(int f, int l) {
    printf("\nInvoked f:%d l:%d\n", f, l);
    int row, col;
    for (row = f; row < l; row++) {
        for (col = f; col < l; col++) {
            P[row*WIDTH+col] = N[row*WIDTH+col] + M[row*WIDTH+col];
            printf("P[%d*%d+%d]:%f\t", row, WIDTH, col, P[row*WIDTH+col]);
        }
        printf("\n");
    }
}
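For the GPU half, I imagine matrixAddGPU(first, last) has to copy that slice of the inputs to the device, launch a kernel, and copy the result back. Something like the sketch below is what I had in mind (the kernel, the device pointers and the block size are just my guesses, not code I'm confident in):
__global__ void matrixAddKernel(float *P, const float *N, const float *M, int rows, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < width)
        P[row*width+col] = N[row*width+col] + M[row*width+col];
}

void matrixAddGPU(int first, int last)
{
    int rows = last - first;                        // treating first/last as row bounds
    size_t sz = (size_t)rows * WIDTH * sizeof(float);
    float *dN, *dM, *dP;

    cudaMalloc(&dN, sz);
    cudaMalloc(&dM, sz);
    cudaMalloc(&dP, sz);
    cudaMemcpy(dN, N + first*WIDTH, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(dM, M + first*WIDTH, sz, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    matrixAddKernel<<<grid, block>>>(dP, dN, dM, rows, WIDTH);

    cudaMemcpy(P + first*WIDTH, dP, sz, cudaMemcpyDeviceToHost);  // copy the GPU half back into P
    cudaFree(dN); cudaFree(dM); cudaFree(dP);
}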
I couldn't find any tutorials that do this kind of operation. I don't want to use CUDA streams, as I don't intend to launch multiple kernels. All I want is to split a simple application between the CPU and GPU and watch their performance in a profiler tool.
Is using pthreads the right way to go? If so, I would appreciate any further details, guidance, or a tutorial that would help me.
Related
I have recently started to work with CUDA. I have multithreaded, multiprocess coding experience in C++, Java and Python.
With PyCUDA I see example code like this:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar * vec[i];
}
""")
It seems the thread ID itself takes part in the logic of the code. The question, then, is whether there will be enough thread IDs to cover my entire array (whose indexing I apparently need in order to reach all of its elements), and what happens if I change the size of the array.
Will the indexing always be between 0 and N?
In CUDA the thread ID is only unique per so-called thread block, meaning that your example kernel only does the right thing with a single block doing the work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread block also limits you in the number of threads, and therefore in the problem size, unless your kernel contains a loop so that every thread can compute more than one element.
Kernel execution is organized strictly hierarchically: restricting ourselves to one-dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads numbered within their block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a globally unique ID for a thread (in a fashion that ideally uses the hardware when loading elements from an array), you take blockIdx.x * blockDim.x + threadIdx.x. If every thread is to compute more than one element, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
/* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the total number of threads working on the kernel. Different strides (especially having a thread work on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.
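Applied to the scalar-multiply example above, a grid-stride version could look like the sketch below (the extra n parameter and the loop body are my additions, not part of the original example):
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec, int n)
{
    // globally unique index of this thread and total number of threads in the grid
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // each thread handles elements idx, idx + stride, idx + 2*stride, ...
    for (int i = idx; i < n; i += stride)
        outvec[i] = scalar * vec[i];
}
Launched with more than one block (e.g. <<<128, 256>>>), this covers any array length n independently of the grid size.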
I am using a Raspberry Pi 3 and basically stumbled over a little tripwire.
I have a very big and complicated program that uses a lot of memory and puts a big load on the CPU. I thought it was normal that if I started the same program again while the first instance was still running, it would take the same amount of additional memory and, in particular, double the CPU load. I found out that it doesn't take more memory and does not affect the CPU load.
To find out whether this behavior came from my program, I wrote a tiny C++ program with extremely high memory usage; here it is:
#include <iostream>
using namespace std;

int main()
{
    for (int i = 0; i < 100; i++) {
        float a[100][100][100];
        for (int i2 = 0; i2 < 99; ++i2) {
            for (int i3 = 0; i3 < 99; ++i3) {
                for (int i4 = 0; i4 < 99; ++i4) {
                    a[i2][i3][i4] = i2*i3*i4;
                    cout << a[i2][i3][i4] << endl;
                }
            }
        }
    }
    return 0;
}
The CPU load was at about 30% of the maximum when I started the code in one terminal. Strangely, when I started it in another terminal at the same time, it didn't affect the CPU load. I concluded that this behaviour couldn't come from my program.
Now I want to know:
Is there a "lock" that ensures that a certain type of process does not grill your cores?
Why don't two identical processes double the CPU load?
Well, I found out that there is a "lock" that makes sure a single process doesn't take away all the memory or push the CPU load up to 100%. It seems that the more processes there are, the higher the CPU load gets, but not in a linear way.
Additionally, the code I wrote to investigate the behaviour only has high memory usage; the 30% CPU level came from the cout calls to the standard library. Multiple processes can print at the same time without increasing the CPU load, but it does affect the speed of the printing.
When I found that out, I got suspicious about the program's speed. I used the analytics in my C++ IDE to measure the duration of my original program, and indeed it was a bit more than two times slower.
That seems to be the solution I was looking for, but I think this is not really applicable to a large audience since the structure of the Raspberry Pi is very particular. I don't know how this works for other systems.
BTW: I could have guessed that there is a lock. I mean, if you start 10 processes that each take 15% of the maximum CPU load, you would have 150% CPU usage. IMPOSSIBLE!
I need to do many comparisons in an OpenCL program. Right now I do it like this:
int memcmp(__global unsigned char* a, __global unsigned char* b, int size) {
    for (int i = 0; i < size; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4, or something else? Thanks!
I guess that your kernel computes "size" elements per thread. I think your code could improve if your accesses were more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can imply a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers have 512 uchars. In your case, thread #0 accesses a[0] and b[0], but it brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, etc. After thread #3, the whole buffer is in the cache. This is not a problem here, given the small size of this domain.
However, if consecutive threads access consecutive elements (thread #0 reads a[0], thread #1 reads a[1], and so on), only one "cache line" is needed at a time for your 4 threads (the accesses are coalesced). The behavior gets better as more threads per block are considered. Please try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old, but the concepts are not.
PS: After that, I would also try using uchar4, as you suggested, along with loop unrolling.
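As a rough sketch of that uchar4 idea, written here in CUDA syntax for illustration (OpenCL's uchar4 vector type is analogous); it assumes size is a multiple of 4 and that the buffers are 4-byte aligned:
__device__ int memcmp_uchar4(const unsigned char* a, const unsigned char* b, int size)
{
    const uchar4* a4 = reinterpret_cast<const uchar4*>(a);
    const uchar4* b4 = reinterpret_cast<const uchar4*>(b);

    for (int i = 0; i < size / 4; ++i) {
        uchar4 x = a4[i];   // one 4-byte load instead of four 1-byte loads
        uchar4 y = b4[i];
        if (x.x != y.x || x.y != y.y || x.z != y.z || x.w != y.w)
            return 0;
    }
    return 1;
}
The loop unrolling mentioned above can then be applied on top of this loop (e.g. with #pragma unroll).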
I have two pieces of code: one written in C and the corresponding operation written in CUDA.
Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
C program:
{
    for (i = 1; i < 10000; i++)
    {
        t = a[i] + b[i];
        a[i-1] = t;
    }
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;

    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid-1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel Dimensions:
dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] on itself, a[i] is loaded by thread ID i and written to a[i-1] by the same thread. The same then happens for thread ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks of 32 threads, how does the dependency get respected?
As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. It is therefore useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
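As a typical illustration (not taken from the code in the question, just a common pattern): a block stages data in shared memory, and __syncthreads() guarantees that every element has been written before any thread reads a slot filled by another thread.
__global__ void reverse_block(int *d, int n)
{
    __shared__ int s[256];     // assumes blockDim.x == 256 and n <= 256
    int t = threadIdx.x;

    if (t < n)
        s[t] = d[t];           // stage 1: each thread writes its own element

    __syncthreads();           // barrier: the whole block has finished writing s[]

    if (t < n)
        d[t] = s[n - 1 - t];   // stage 2: now it is safe to read another thread's element
}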
Given the nature of the problem, which has a dependency of a[] on itself, a[i] is loaded by thread ID i and written to a[i-1] by the same thread.
Just imagine thread 2 reaching the if statement; since it matches the condition, it enters the body. That thread then does the following:
private_t=a[2]+b[2];
a[1]=private_t;
Which is equivalent to:
a[1]=a[2]+b[2];
As you pointed out, there is a data dependency on array a. Since you can't control the order of execution of the warps, at some point you'll be using an updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:
if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];
    __syncthreads();
    a[tid-1] = private_t;
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}
In this way, every thread gets its own version of the private_t variable using the original array a, and then the array is updated in parallel.
About the *t value:
If you're only looking at the value of *t, you won't notice the effect of this random scheduling, regardless of the launch parameters. That's because the thread with tid==9999 could be in the last warp along with the thread with tid==9998; since the two array positions needed to create its private_t value are handled within that same warp, and you already had that synchronization barrier, the answer should be right.
I have two ways to program the same functionality.
Method 1:
void doTheWork(int action)
{
    for (int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}
Method 2:
void doTheWork(int action)
{
    switch (action)
    {
    case 1:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for (int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}
Let's assume that the function doAction(int action) and the function template<int Action> doAction() each consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
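To make that premise concrete, a hypothetical doAction pair (not the actual code, just an illustration of why the compile-time constant helps) might look like this:
extern long counter;  // hypothetical state touched by doAction

// runtime version: the compiler must keep the branch and the multiplication
inline void doAction(int action)
{
    if (action % 2 == 0)
        counter += action * 3;
    else
        counter -= action;
}

// templated version: with Action known at compile time, the branch is resolved
// during compilation and Action * 3 folds into an immediate constant
template <int Action>
inline void doAction()
{
    if (Action % 2 == 0)
        counter += Action * 3;
    else
        counter -= Action;
}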
So my question is: do all the millions of lines of code fill the CPU L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or do only the lines of the doAction<#>() instance inside the loop currently being run get cached?
It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.
However, Method 2 violently violates this decades-old rule of thumb: instructions are cheap, memory access is not.
Scalability limit
Your optimizations are usually linear - you might shave off 10, 20, maybe even 30% of the execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.
As soon as your code size significantly exceeds the 2nd/3rd-level cache's size, Method 2 will lose big time, as the following estimate for a high-end consumer system shows:
DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS
gives you 10667MB / 75000M = 0.14 bytes per instruction for break even - anything larger, and main memory can't keep up with the CPU.
Typical x86 instructions are 2..3 bytes in size and execute in 1..2 cycles (now, granted, these aren't necessarily the same instructions, as x86 instructions get split up. Still...)
Typical x64 instruction lengths are even larger.
How much does your cache help?
I found the following number (from a different source, so it's hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.
In addition, your L2 cache will kick in completely only if:
you have perfect prediction of the next instructions, or you have no first-run penalty and it fits into the cache completely
there's no significant amount of data being processed
there's no significant other code in your "inner loop"
there's no other thread executing on this core
Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
The L1 instruction cache will only contain instructions that were fetched recently or in anticipation of near-future execution. As such, the second method cannot fill the L1 cache simply because the code exists. Your execution path will cause the cache to hold the template-instantiated version for the loop currently being run. As you move on to the next loop, it will generally evict the least recently used (LRU) cache line and replace it with what you are executing next.
In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.