how would I do that and what would the impout thing be
I am using dynamic scheduling for the loop iterations. But when the work in each iteration is too small, or when there is a huge number of threads, some threads don't do any work. E.g., there are 100 iterations and 90 threads: I want every thread to do at least one iteration, and the remaining 10 iterations can be distributed to the threads that have finished their work. How can I do that?
You cannot force the OpenMP runtime to do this. However, you can give hints to the OpenMP runtime so that it will likely do so when it decides that it is possible, at the cost of a higher overhead.
One way is to specify the granularity of the dynamically scheduled loop.
Here is an example:
#pragma omp parallel for schedule(dynamic,1)
for(int i=0 ; i<100 ; ++i)
    compute(i);
With such code, the runtime is free to share the work evenly between threads (using a work-sharing scheduler) or to let threads steal work from a master thread that drives the parallel computation (using a work-stealing scheduler). In the second approach, although the granularity is 1 loop iteration, some threads could steal more work than they actually need (e.g. to improve performance in general). If the loop iterations are fast enough, the work will probably not be balanced between threads.
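To see how the runtime actually distributes the iterations on your machine, you can count the iterations executed by each thread. The following is only a minimal sketch: `MAX_THREADS`, `count_iterations_per_thread` and the body of `compute` are hypothetical names chosen for illustration, and the `_OPENMP` guards let the code fall back to a serial run (everything counted on thread 0) when it is compiled without OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 1024  /* hypothetical upper bound on the team size */

/* Hypothetical stand-in for the real per-iteration work. */
static int compute(int i) { return i * i; }

/* Runs n iterations with schedule(dynamic,1) and records in counts[t]
 * how many iterations thread t executed. Returns the total recorded. */
int count_iterations_per_thread(int n, int counts[MAX_THREADS])
{
    assert(n >= 0);
    for (int t = 0; t < MAX_THREADS; ++t)
        counts[t] = 0;

    #pragma omp parallel for schedule(dynamic,1)
    for (int i = 0; i < n; ++i) {
        int tid = 0;               /* serial fallback: thread 0 does it all */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        (void)compute(i);
        if (tid >= MAX_THREADS)    /* fold unexpected ids into the last slot */
            tid = MAX_THREADS - 1;
        #pragma omp atomic
        counts[tid]++;
    }

    int total = 0;
    for (int t = 0; t < MAX_THREADS; ++t)
        total += counts[t];
    return total;
}
```

Printing `counts` after a run with 100 iterations and 90 threads shows which threads received no iteration at all. The exact distribution depends on the runtime and the machine, so it can change from one run to the next.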
Creating 90 threads is costly, and sending work to 90 threads is also far from free, as it is mostly bounded by the relatively high latency of atomic operations, their scalability, and the latency of waking threads.
Moreover, while such an operation appears to be synchronous from the user's point of view, this is not the case in practice (especially with 90 threads and on multi-socket NUMA-based architectures).
As a result, some threads may finish computing one iteration of the loop while others may not yet be aware of the parallel computation, or may not even have been created yet.
The overhead of making threads aware of the computation to be done generally grows as the number of threads increases.
In some cases, this overhead can be higher than the actual computation, and it can be more efficient to use fewer threads.
OpenMP runtime developers must sometimes trade work balancing for smaller communication overheads. Thus those decisions can perform badly in your case but could improve the scalability of other kinds of applications. This is especially true of work-stealing schedulers (e.g. the Clang/ICC OpenMP runtime). Note that improving the scalability of OpenMP runtimes is an ongoing research field.
I advise you to try multiple OpenMP runtimes (including research ones that may or may not be good to use in production code).
You can also play with the OMP_WAIT_POLICY environment variable to reduce the overhead of waking threads.
You can also try to use OpenMP tasks to push the runtime a bit harder not to merge iterations.
I also advise you to profile your code to see what is going on and find potential software/hardware bottlenecks.
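The suggestion above about OpenMP tasks could be sketched as follows: one thread creates one task per iteration, and the other threads of the team pick the tasks up. This is only an illustration under assumptions: `run_with_tasks`, the `out` array and the body of `compute` are hypothetical, and the runtime still decides which thread runs which task, so it does not guarantee that every thread gets one.

```c
#include <assert.h>

/* Hypothetical stand-in for the real per-iteration work. */
static int compute(int i) { return 2 * i + 1; }

/* Creates one task per iteration and stores each result in out[i]. */
void run_with_tasks(int n, int out[])
{
    assert(n >= 0);

    #pragma omp parallel
    #pragma omp single
    {
        /* Only one thread runs this loop; it only creates the tasks. */
        for (int i = 0; i < n; ++i) {
            #pragma omp task firstprivate(i) shared(out)
            out[i] = compute(i);
        }
    }   /* implicit barrier: all tasks have completed past this point */
}
```

Compiled without OpenMP support, the pragmas are ignored and the loop simply runs serially. Task creation has its own overhead, so with very small iterations this can be slower than the `schedule(dynamic,1)` loop.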
Update
If you use more OpenMP threads than there are hardware threads on your machine, the processor cannot execute them simultaneously (it can only execute one OpenMP thread on each hardware thread). Consequently, the operating system on your machine schedules the OpenMP threads on the hardware threads so that they seem to be executed simultaneously from the user's point of view. However, they are not running simultaneously: they are executed in an interleaved way, each for a small quantum of time (e.g. 100 ms).
For example, if you have a processor with 8 hardware threads and you use 8 OpenMP threads, you can roughly assume that they will run simultaneously.
But if you use 16 OpenMP threads, your operating system can choose to schedule them in the following way:
the first 8 threads are executed for 100 ms;
the last 8 threads are executed for 100 ms;
the first 8 threads are executed again for 100 ms;
the last 8 threads are executed again for 100 ms;
etc.
If your computation lasts for less than 100 ms, the OpenMP dynamic/guided schedulers will move the work of the last 8 threads to the first 8 threads so that the overall execution time is shorter. Consequently, the first 8 threads can execute all the work, and the last 8 threads will have nothing left to do once they are scheduled. This is the cause of the work imbalance between threads.
Thus, if you want to measure the performance of an OpenMP program, you should NOT use more OpenMP threads than hardware threads (unless you know exactly what you are doing and you are fully aware of such effects).
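One way to follow this advice in code is to cap the requested thread count at what the runtime reports as available. This is a sketch under assumptions: `hardware_threads`, `capped_thread_count` and `use_capped_threads` are hypothetical helper names, and `omp_get_num_procs()` returns the number of processors available to the program, which may be lower than the machine total if an affinity mask is set.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of processors the OpenMP runtime reports as available
 * (falls back to 1 when compiled without OpenMP support). */
int hardware_threads(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

/* Never ask for more OpenMP threads than hardware threads. */
int capped_thread_count(int requested)
{
    assert(requested >= 1);
    int hw = hardware_threads();
    return requested < hw ? requested : hw;
}

/* Applies the capped count to subsequent parallel regions. */
void use_capped_threads(int requested)
{
#ifdef _OPENMP
    omp_set_num_threads(capped_thread_count(requested));
#else
    (void)requested;  /* nothing to configure in a serial build */
#endif
}
```

Calling `use_capped_threads(90)` on a machine with 8 hardware threads would then run the following parallel regions with 8 threads instead of 90.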
We saw a high CPU consumption issue in our production environment recently, and saw something strange while debugging it. When I did a "top -H" to see the CPU stats per thread ID, I found a thread X consuming high CPU. When I took the thread dumps, I saw that this thread X was in the BLOCKED state. What does this mean? Can a thread that is in the BLOCKED state consume high CPU? I think this might be a trivial question, but I am a novice in debugging performance issues and the JVM, and I am not sure what I might be missing here.
Entering and exiting a BLOCKED state can be expensive. If a thread stays BLOCKED for even a little while, this is not a problem, but if it blocks only briefly and repeatedly in a busy loop, it can appear blocked in a thread dump while in reality burning CPU.
I would look for multiple threads that repeatedly compete for a shared resource and enter BLOCKED very briefly.
@Peter has already made a good point about busy loops (which could be the JVM's internal adaptive optimization of spin locks in the case of synchronization, or a busy loop created by the application itself on some condition), which can burn CPU. There is another, indirect way in which CPU usage can go very high because of thread blocking. Typically, in a web server, if lots of threads are in a blocked state (not because of synchronization-lock blocking but, say, waiting for IO from a back-end datastore), this may put a lot of pressure on JVM garbage collection. These worker threads are supposed to finish their work quickly so that all the objects they created on the heap are quickly de-referenced and garbage collected. If lots of threads are in this state, the garbage collection threads have to work overtime, and they may end up taking a lot of CPU.
Let's say I have a single-threaded process and 2 CPUs, each with 2 cores.
How many processes can I run at any moment? 2 or 4? I couldn't find a clear answer for this.
Is the CPU bound to the process, so that a core is wasted and only 2 processes can run at the same time, or are there optimizations so that we can run 4 processes at the same time on the 4 cores even if we only have 2 CPUs?
There is no limit. The number of cores or CPUs has no connection whatsoever to the number of processes you can run.
I'm typing this answer to you on a machine with 8 cores that's currently executing 218 processes with a total of 524 threads.
Is the CPU bound to the process, so that a core is wasted and only 2 processes can run at the same time, or are there optimizations so that we can run 4 processes at the same time on the 4 cores even if we only have 2 CPUs?
A CPU has no idea what a process is and doesn't care whether a thread it's executing is associated with a process or not. Processes are OS concepts and CPUs don't know or care about them.
My understanding is that a warp is a group of threads that is defined at runtime by the task scheduler. One performance-critical part of CUDA is the divergence of threads within a warp. Is there a way to make a good guess of how the hardware will construct warps within a thread block?
For instance, if I start a kernel with 1024 threads in a thread block, how are the warps arranged? Can I tell that (or at least make a good guess) from the thread index?
By knowing this, one could minimize the divergence of threads within a given warp.
The thread arrangement inside a warp is implementation dependent, but so far I have always observed the same behavior:
A warp is composed of 32 threads, but the warp scheduler will issue 1 instruction for half a warp at a time (16 threads).
If you use 1D blocks (only the threadIdx.x dimension is valid), then the warp scheduler will issue 1 instruction for threadIdx.x = (0..15) (16..31) ... etc.
If you use 2D blocks (the threadIdx.x and threadIdx.y dimensions are valid), then the warp scheduler will try to issue instructions in this fashion:
threadIdx.y = 0 threadIdx.x = (0 ..15) (16..31) ... etc
So, threads with consecutive threadIdx.x components will execute the same instruction in groups of 16.
A warp consists of 32 threads that will be executed at the same time. At any given time a batch of 32 will be executing on the GPU, and this is called a warp.
I haven't found anything that states that you can control which warp is going to execute next; the only thing you know is that a warp consists of 32 threads and that the thread block size should always be a multiple of that number.
Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block.
There is also this, with regards to memory operations and latency:
When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
source
With regards to dividing up threadblocks into warps, I have found this:
if the block is 2D or 3D, the threads are ordered by first dimension, then second, then third – then split into warps of 32
source
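The ordering rule quoted above can be turned into a small calculation: flatten the thread index with x varying fastest, then divide by the warp size. The sketch below is plain C that mirrors what you would compute from `threadIdx` and `blockDim` inside a kernel. The `Dim3` struct and the function names are hypothetical, and the warp size of 32 is the value stated in the answers above.

```c
#include <assert.h>

#define WARP_SIZE 32

/* Hypothetical stand-in for CUDA's threadIdx / blockDim triples. */
typedef struct { int x, y, z; } Dim3;

/* Threads are ordered by first dimension, then second, then third. */
int linear_thread_id(Dim3 tid, Dim3 block_dim)
{
    assert(tid.x >= 0 && tid.x < block_dim.x);
    assert(tid.y >= 0 && tid.y < block_dim.y);
    assert(tid.z >= 0 && tid.z < block_dim.z);
    return tid.x + tid.y * block_dim.x + tid.z * block_dim.x * block_dim.y;
}

/* Index of the warp (within the block) that the thread belongs to. */
int warp_id(Dim3 tid, Dim3 block_dim)
{
    return linear_thread_id(tid, block_dim) / WARP_SIZE;
}

/* Position of the thread inside its warp (0..31). */
int lane_id(Dim3 tid, Dim3 block_dim)
{
    return linear_thread_id(tid, block_dim) % WARP_SIZE;
}
```

For example, in a 16x64 block (1024 threads) the rows y = 0 and y = 1 end up in the same warp, so a branch on `threadIdx.y` would diverge inside that warp, while a branch on `warp_id` would not.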
We're trying to figure out the optimum number of threads to use for our NServiceBus service. We're running it on a machine with 2 quad-core CPUs. We've been having problems with the queue backing up. We started with 100 threads, then bumped it to 200, and things got worse. We backed it down to 75, then 50, and it seemed even better. Is there some optimal number based on how many CPUs we have, or some rule of thumb that we should use to determine the number of threads to run?
Every thread you have running has an overhead attached to it. If you have 2 quad cores, then at most 8 threads can be running at any one time, each consuming a core.
If you have more than 8 threads, then there is a chance you will start to do LESS useful work, not more. This is because every time Windows decides to give a turn to one of the threads not currently consuming a core, it needs to store the state of one of the running threads, restore the saved state of the thread that is about to run, and then let that thread go at it. If you have a huge number of threads, you're going to spend a large amount of time just switching between threads and doing nothing useful.
If you have a bunch of threads that are blocked waiting for IO (for instance, for a message to finish writing to disk so it can be got at), then you might be able to run more threads than you have cores and still get something useful done, as a number of those threads will be sitting waiting for something else to complete. It's a complex subject and there is no real answer to 'how many threads should I use'. A good rule of thumb is to have one thread for every core and then play with it a bit if you want to achieve more throughput. Testing it under real conditions is the only real way to find the sweet spot. You might find that you only need one thread to process the messages and half the time that thread is blocked waiting for a message to come in....
Obviously, even what I've described is oversimplified. Windows needs access to the cores to do OSy things, so even if you have 8 cores, your 8 threads won't always all be running because the Windows threads are having a turn... then you have IO threads etc....
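As a starting point for the experiments described above, the "one thread per core, more if threads wait on IO" idea can be written as a small estimate. Note that the formula used here, cores x (1 + wait time / compute time), is a commonly cited heuristic and not something taken from this answer. The function name and parameters are hypothetical, and the result is only a first guess to be corrected by testing under real load.

```c
#include <assert.h>

/* Heuristic starting point (an assumption, not a measured optimum):
 *   threads = cores * (1 + wait_time / compute_time)
 * wait_time:    average time a thread spends blocked (e.g. on IO)
 * compute_time: average time it spends actually using a core
 * Both times must be expressed in the same unit. */
int suggested_threads(int cores, double wait_time, double compute_time)
{
    assert(cores >= 1);
    assert(wait_time >= 0.0);
    assert(compute_time > 0.0);
    double estimate = cores * (1.0 + wait_time / compute_time);
    return (int)(estimate + 0.5);  /* round to the nearest integer */
}
```

With 8 cores and threads that never block, this gives 8. If each thread waits about as long as it computes, it gives 16, which is still far below the 100 to 200 threads tried in the question.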