Python's support for multi-threading

I heard that Python still has this global interpreter lock (GIL) issue. As a result, thread execution in Python is not actually multi-threaded.
What are the possible solutions to overcome this problem?
I am using Python 2.7.3.

For understanding Python's GIL, I would recommend this link: http://www.dabeaz.com/python/UnderstandingGIL.pdf
From the Python wiki:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
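To make the wiki's point concrete, here is a minimal sketch (module and function names are invented for illustration) of a CPython 2.7 extension that releases the GIL around a CPU-bound C loop, so other Python threads can run in parallel with it while the C code executes:

/* busyloop.c -- minimal sketch of a CPython 2.7 extension that releases the
 * GIL around a CPU-bound loop. Build it the usual way as an extension module
 * (e.g. via distutils); names here are made up for illustration. */
#include <Python.h>

static PyObject *busy_sum(PyObject *self, PyObject *args)
{
    long n, i;
    double acc = 0.0;

    if (!PyArg_ParseTuple(args, "l", &n))
        return NULL;

    Py_BEGIN_ALLOW_THREADS      /* drop the GIL: pure C work below */
    for (i = 0; i < n; i++)
        acc += (double)i * 0.5;
    Py_END_ALLOW_THREADS        /* re-acquire the GIL before touching Python objects */

    return PyFloat_FromDouble(acc);
}

static PyMethodDef BusyMethods[] = {
    {"busy_sum", busy_sum, METH_VARARGS, "CPU-bound loop that releases the GIL."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initbusyloop(void)
{
    Py_InitModule("busyloop", BusyMethods);
}

A function like this, called from several Python threads, really does run in parallel; this is essentially what the wiki quote means by I/O and NumPy number crunching happening outside the GIL.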
There are discussions about eliminating the GIL, but I guess it hasn't been achieved yet. If you really want to achieve multi-threading for your custom code, you can also switch to Java.
Do see if that helps.

Related

Translate OpenCL code for CPU compilation

Sometimes I find myself writing OpenCL kernel code (using pyopencl), even for tasks of moderate computational complexity, because it is easier to develop than a chain of numpy operations (especially if no appropriate numpy function exists).
However, in those cases the transfer overhead/delay between host and device may exceed the time spent on the computation itself.
I was thinking about creating a Python tool which automatically translates the OpenCL code to e.g. Cython code (or similar) which, after compilation for the CPU, can work directly on the underlying memory of the numpy arrays, without the need to copy the data to the device. I know that the CPU is capable of executing OpenCL kernels with appropriate drivers. However, this still has the disadvantage of additional delay due to the to_device operation. A multicore CPU could also exploit the OpenCL programming model for parallel execution. Furthermore, this approach removes the need for special OpenCL drivers and just requires some build tools for C code compilation.
Is that a reasonable idea? I do not want to reinvent the wheel. Any hints for existing frameworks/tools which could achieve my goals are much appreciated.
While converting OpenCL code to parallel CPU-oriented code is probably possible, it is very hard (if not impossible) to generate efficient code this way.
Indeed, OpenCL encourages/forces programmers to perform big computational steps (kernels), often reading/writing a relatively big portion of memory. However, GPU memory bandwidth is generally much higher than that of CPUs (e.g. my Nvidia 1660S has a bandwidth of 336 GB/s while my i5-9600KF with 2 DDR4 channels reaches about 40 GB/s, even though they had a similar price). OpenCL compute kernels cannot be fully optimized for CPUs, whatever low-level transformation is applied to the code. The main problem lies in the OpenCL algorithms themselves as well as the programming model.
Rewriting OpenCL kernels as CPU code can often result in more efficient execution if the code is specifically optimized for such a platform. Low-level optimizations include working in caches using data chunks, using register blocking, and using the best SIMD instructions available. High-level optimizations consist in choosing the best algorithm and data structure for the target problem. The best sorting algorithm on a GPU is likely very different from the best one on a CPU. The same applies to other problems like computing a prefix sum, a partition/median or even string searching. Thus, one should keep in mind that different hardware requires different computing methods/algorithms.
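As a small illustration of the "working in caches using data chunks" point, here is a minimal sketch (plain C, with an arbitrary tile size) of a cache-blocked matrix multiplication, the kind of CPU-specific restructuring an automatic OpenCL-to-CPU translator would be unlikely to discover on its own:

/* Tiled matrix multiplication: work on TILE x TILE blocks that fit in cache.
 * The tile size is an arbitrary illustrative choice; real code would tune it
 * to the target cache. C must be zero-initialized by the caller. */
#include <stddef.h>

#define TILE 64

void matmul_tiled(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* inner loops stay inside one cache-sized block */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}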
A high-level algorithmic transformation could theoretically result in efficient code, but such a transformation is insanely complex to perform, if it is possible at all. Indeed, there are fundamental theoretical limitations that strongly restrict generalized advanced code analyses/transformations, from the halting problem up to high-level optimization.

Memory/Address Sanitizer vs Valgrind

I want a tool to diagnose use-after-free bugs and uninitialized-memory bugs. I am considering Sanitizers (Memory and/or Address) and Valgrind, but I have very little idea about their advantages and disadvantages. Can anyone tell me the main features, differences and pros/cons of Sanitizers and Valgrind?
Edit: I found some comparisons, such as: Valgrind uses DBI (dynamic binary instrumentation) whereas Sanitizers use CTI (compile-time instrumentation); Valgrind makes the program much slower (around 20x) whereas Sanitizers run much faster than Valgrind (around 2x slowdown). If anyone can give me some more important points to consider, it would be a great help.
I think you'll find this wiki useful.
TL;DR: the main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself; you also need to use a relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs
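For concreteness, here is a minimal sketch of what running the two tools on a deliberate use-after-free looks like; the compiler invocations in the comments are just the usual ones, not anything project-specific:

/* use_after_free.c -- a deliberately broken program to show what both tools catch.
 *
 *   ASan:     clang -g -fsanitize=address use_after_free.c && ./a.out
 *   Valgrind: clang -g use_after_free.c && valgrind ./a.out
 */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int *p = malloc(sizeof *p);
    *p = 42;
    free(p);
    printf("%d\n", *p);   /* use-after-free: both tools report this read */
    return 0;
}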
One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).

Could a GPU be Programmed and/or Modified to Carry Out CPU Instructions?

I was wondering if a GPU could behave like a CPU if modified or programmed to do so. If there is a way, I would also like to know how that could be done. The reason why is, well, sometimes I do that kind of stuff as experiments, just for fun. Plus, if it isn't a big hassle, then it would be much better than buying an expensive processor just to get better performance. I usually don't need my GPU, only because I use my computer for the simplest of things. My other computer, that's a slightly different story (because I use it for video playback), but you get the idea.
Yes, it's called GPGPU (general-purpose computing on GPUs), and with it you can run some CPU-like workloads on your GPU using languages like CUDA or OpenCL.
Of course this method doesn't work well for every workload: the CPU is still much better at single-threaded, hard-to-parallelize code, at code with complicated control flow (thanks to branch predictors), and at code that relies on memory locality (thanks to better caching and prefetching). GPGPUs are mostly better at very straightforward, highly parallel, vectorizable code.
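As a small illustration of the kind of straightforward, highly parallel work a GPU handles well, here is a minimal CUDA C sketch (file name and sizes are arbitrary) that adds two large vectors with one GPU thread per element:

/* vec_add.cu -- one GPU thread per element, no branches, no cross-thread
 * dependencies. Build (assumed typical): nvcc vec_add.cu -o vec_add */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    /* Unified (managed) memory keeps the host-side sketch short. */
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);   /* expect 3.0 */
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}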
In fact, this method of computation gained enough traction to create new lines of products (such as Xeon Phi, formerly Larrabee) and to drive enhancements in existing GPUs (e.g. Tesla/Fermi, and others).
EDIT
Having reread your question: if you mean running an actual CPU ISA on such a GPGPU, not just general CPU-style tasks, then the best bet is the Xeon Phi mentioned above. It is intended to be based on the same ISA as the CPU (it's the only x86 GPGPU I know of).

Risk Assessment: Using Pthreads (vs. GCD or NSThread)

A colleague suggested recently that I use pthreads instead of GCD because it's, "way faster." I don't disagree that it's faster, but what's the risk with pthreads?
My feeling is that they will ultimately not be anywhere nearly as idiot-proof as GCD (and my team of one is 50% idiots). Are pthreads hard to get right?
GCD and pthreads are both ways of doing work asynchronously, but they are significantly different. Most descriptions of GCD describe it in terms of threads and of thread pooling, but as DrPizza puts it
to concentrate on [threads and thread pools] is to miss the point. GCD’s value lies not in thread pooling, but in queuing.
(from "Grand Central Dispatch for Win32: why I want it")
GCD has some nice benefits over APIs like pthreads.
GCD does more to encourage and support "islands of serialization in a sea of parallelism." GCD makes it easy to avoid a lot of the locks and mutexes and condition variables that are the normal way of communicating between threads. This is because you decompose your program into tasks and GCD handles getting the task input and output to the appropriate thread behind the scenes. So programming with GCD allows you to pretty much write serially and not worry too much about the stuff people often worry about in threaded code. That makes the code simpler and less bug-prone.
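As a small sketch of that idea (C with libdispatch; the names are invented for illustration), here is a counter protected by a private serial queue instead of a mutex:

/* All mutations of the counter are funneled through one serial queue, so no
 * lock is needed. Build on macOS (assumed typical): clang counter.c -o counter */
#include <dispatch/dispatch.h>
#include <stdio.h>

static long counter = 0;
static dispatch_queue_t counter_queue;

void counter_increment(void)
{
    dispatch_async(counter_queue, ^{
        counter++;                       /* only ever touched on counter_queue */
    });
}

long counter_get(void)
{
    __block long value;
    dispatch_sync(counter_queue, ^{ value = counter; });   /* read on the same island */
    return value;
}

int main(void)
{
    counter_queue = dispatch_queue_create("com.example.counter", DISPATCH_QUEUE_SERIAL);
    for (int i = 0; i < 1000; i++)
        counter_increment();
    printf("%ld\n", counter_get());      /* prints 1000 */
    return 0;
}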
GCD can do the scaling for you, so the program uses as much parallelism as the dependencies between the tasks you've decomposed your program into and the hardware allow for. Of course designing the program to be scalable is generally the hard bit, but you still need something to actually take advantage of that work and run as much as possible in parallel. Work-stealing schedulers like GCD do that part.
GCD is composable. If you explicitly spawn threads for things you want to do asynchronously or in parallel you can run into a problem when libraries you use do the same thing. Say you decide you can run eight threads simultaneously because that's how many threads will be effective for your program given the machine it runs on. And then say a library you use on each thread does the same thing. Now you could have up to 64 threads running at once, which is more than you know is effective for your program.
Thread pooling solves this but everyone needs to use the same thread pool. GCD uses thread pooling internally and provides the same pool to everyone.
GCD provides a bunch of 'sources' and makes it easy to write an event driven program that depends on or takes input from the sources. For example you can very easily have a queue set up to launch a task every time data is available to read on a network socket, or when a timer fires, or whatever.
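A minimal sketch of such a source (C with libdispatch, blocks syntax; the function name is invented, and the descriptor is assumed to already be open and non-blocking) might look like this:

/* Run a block whenever a file descriptor has data to read, without managing
 * any thread yourself. */
#include <dispatch/dispatch.h>
#include <unistd.h>
#include <stdio.h>

void watch_fd(int fd)
{
    dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_source_t src = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, fd, 0, q);

    dispatch_source_set_event_handler(src, ^{
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof buf);   /* data is known to be available */
        printf("read %zd bytes\n", n);
    });
    dispatch_source_set_cancel_handler(src, ^{
        close(fd);
    });
    dispatch_resume(src);   /* sources start suspended */
}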
I don't think they're hard to get right, but having worked with many different approaches over the years (pthreads, GCD, NSThread, NSOperationQueue, etc.), I have no evidence to support an assertion like "pthreads are way faster." Even if they were faster (and I would expect the difference to be marginal at best), I always say, "use the highest-level abstraction that gets the job done." Also, avoid premature optimization.
Anecdotally speaking, GCD is pretty damn fast. As I see it, portability is the primary advantage of pthreads over GCD. If this is OSX/iOS-exclusive code, I see no advantage whatsoever to using pthreads, absent empirical evidence to the contrary.
Ignore the other well-thought-out technical reasons, because they aren't relevant. You are not writing software for a benchmark, are you? At some point, a user is going to sit in front of your device and try to use it. And do you know what happens if you use pthreads instead of GCD? What happens is that your software doesn't scale well in the presence of other software multitasking at the same time, because it fights for the CPU as if it were the only software running. Which is crazy. Nobody runs single-task OSes any more. Even single-task iOS runs plenty of things in the background.
Instead, if all the programs you were running used GCD, the OS could scale the number of concurrent tasks running on their queues and thus better match the number of actual processors, reducing task-switching overhead.
If your program doesn't require pseudo-real-time low latency, and thus a dedicated thread to process work as soon as it is available (maybe that is the definition of your colleague's "way faster"), chances are GCD will be better for the user because it will make better use of the resources available on their device. Even if GCD's API were horrible or slow, it would be worthwhile to use it over other solutions which don't scale across different processes.
NSThread is probably implemented on top of the pthreads library; the point is that the lower the level of a concept, the more useless and repetitive tasks you have to do yourself.
The pthreads library isn't so hard to learn; my professor at university taught it, and even the slowest learners were able to use it, sometimes just lazily copy-pasting code, but doing the job successfully.
So I definitely suggest you implement a pthread wrapper class; it's easy to do.
That way you eliminate the useless stuff. For example, you may be doing this thousands of times:
pthread_mutex_init(mutex_ptr, NULL);
So (if that's your case; it's just an example) you may always be passing NULL, and the same is valid for other functions.
Once you have implemented the class, it's not guaranteed to be faster than GCD.
GCD does some optimizations of its own; for example, two blocks may be run on the same thread.
So I suggest using your own class only if it's faster than GCD; test it with a time profiler.
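A minimal sketch of that wrapper idea in plain C (the answer says "class", but the point is the same; the names are invented for illustration):

/* Hide the arguments you always pass the same way behind a tiny wrapper. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
} Lock;

static inline void lock_init(Lock *l)
{
    /* The NULL attribute argument is what you'd otherwise repeat everywhere. */
    pthread_mutex_init(&l->mutex, NULL);
}

static inline void lock_acquire(Lock *l) { pthread_mutex_lock(&l->mutex); }
static inline void lock_release(Lock *l) { pthread_mutex_unlock(&l->mutex); }
static inline void lock_destroy(Lock *l) { pthread_mutex_destroy(&l->mutex); }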

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, but knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents; see the sketch after this list). If you have a large amount of data transfer, you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput)
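As an illustration of the manual cudaEvents approach mentioned above, here is a minimal sketch (kernel name and sizes are arbitrary) that times a host-to-device copy and a kernel separately:

/* time_kernel.cu -- measure transfer time vs. compute time with cudaEvents.
 * Build (assumed typical): nvcc time_kernel.cu -o time_kernel */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);
    float *h = (float *)calloc(n, sizeof(float));
    float *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, after_copy, after_kernel;
    cudaEventCreate(&start);
    cudaEventCreate(&after_copy);
    cudaEventCreate(&after_kernel);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   /* I/O */
    cudaEventRecord(after_copy);
    scale<<<(n + 255) / 256, 256>>>(d, n);             /* compute */
    cudaEventRecord(after_kernel);
    cudaEventSynchronize(after_kernel);

    float copy_ms, kernel_ms;
    cudaEventElapsedTime(&copy_ms, start, after_copy);
    cudaEventElapsedTime(&kernel_ms, after_copy, after_kernel);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copy_ms, kernel_ms);

    cudaFree(d);
    free(h);
    return 0;
}

If the copy time dominates the kernel time, that is the point at which overlapping transfers and compute (streams, async copies) starts to pay off.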
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Run it on the CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.