One thread's computationally heavy work disrupting another thread's interaction with the sound card in Python - python-multithreading

I'm writing a Python 3.6 application that makes heavy use of threads (as a design pattern: a directed graph of threads linked with queue.Queue instances as edges), and it works fine.
One thread uses the sound card through pyaudio and buffers the samples to output (at any time there are as many samples available as required). Another thread uses sklearn's IncrementalPCA and occasionally does heavy computational work (high CPU usage, ~30%, and high memory usage, ~8 MB).
While the heavy computation is running in the latter thread, I get a (snd_pcm_recover) underrun occurred message from the sound card, as if the first thread weren't buffering properly (but it is: the samples are ready to be sent when the underrun occurs).
There is no mutex or synchronized code between the two threads; they are two independent pieces of code. Everything works fine when I reduce the heavy computation by a factor of 2, which gives something like ~15% CPU and 4 MB of memory consumed by the thread.
Since sklearn's IncrementalPCA spends most of its time in numpy, and since I use Python 3.6, I don't think the problem is related to Python's GIL, but I may be misunderstanding something.
I've thought about throttling the heavy computation in the latter thread, but that seems unlikely to even be possible. I'm running short of ideas, so:
How can I make the first thread work alongside the latter thread without underruns?
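For reference, the thread-graph design described above looks roughly like the following minimal sketch. The stage function, the SENTINEL marker and the two toy transforms are hypothetical stand-ins for illustration only; the real application uses pyaudio and IncrementalPCA in those positions.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def stage(func, inbox, outbox):
    """Run one node of the graph: read from inbox, apply func, write to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next node
            break
        outbox.put(func(item))


def run_pipeline(items):
    # Three queues act as the edges: source -> double -> increment -> sink.
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(lambda x: x * 2, q_in, q_mid)),
        threading.Thread(target=stage, args=(lambda x: x + 1, q_mid, q_out)),
    ]
    for thread in threads:
        thread.start()
    for item in items:
        q_in.put(item)
    q_in.put(SENTINEL)

    results = []
    while True:
        out = q_out.get()
        if out is SENTINEL:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results
```

Each node only talks to its neighbours through its queues, which is why there is no explicit mutex between the audio thread and the PCA thread.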

Related

How can I speed up a Mac app processing 5000 independent tasks?

I have a long running (5-10 hours) Mac app that processes 5000 items. Each item is processed by performing a number of transforms (using Saxon), running a bunch of scripts (in Python and Racket), collecting data, and serializing it as a set of XML files, a SQLite database, and a CoreData database. Each item is completely independent from every other item.
In summary, it does a lot, takes a long time, and appears to be highly parallelizable.
After loading up all the items that need processing, the app uses GCD to parallelize the work, using dispatch_apply:
dispatch_apply(numberOfItems, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^(size_t i) {
    @autoreleasepool {
        ...
    }
});
I'm running the app on a Mac Pro with 12 cores (24 virtual). So I would expect to have 24 items being processed at all times. However, I found through logging that the number of items being processed varies between 8 and 24. This is literally adding hours to the run time (assuming it could work on 24 items at a time).
On the one hand, perhaps GCD is really, really smart and it is already giving me the maximum throughput. But I'm worried that, because much of the work happens in scripts that are spawned by this app, maybe GCD is reasoning from incomplete information and isn't making the best decisions.
Any ideas how to improve performance? After correctness, the number one desired attribute is shortening how long it takes this app to run. I don't care about power consumption, hogging the Mac Pro, or anything else.
UPDATE: In fact, this looks alarming in the docs: "The actual number of tasks executed by a concurrent queue at any given moment is variable and can change dynamically as conditions in your application change. Many factors affect the number of tasks executed by the concurrent queues, including the number of available cores, the amount of work being done by other processes, and the number and priority of tasks in other serial dispatch queues." (emphasis added) It looks like having other processes doing work will adversely affect scheduling in the app.
It'd be nice to be able to just say "run these blocks concurrently, one per core, don't try to do anything smarter".
If you are bound and determined, you can explicitly spawn 24 threads using the NSThread API, and have each of those threads pull from a synchronized queue of work items. I would bet money that performance would get noticeably worse.
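A minimal sketch of that "fixed number of threads pulling from a synchronized queue" structure, written in Python rather than with NSThread so it can be run anywhere. The names process_all, worker_count and handler are made up for illustration, and because of Python's GIL this shows only the shape of the design, not its performance characteristics.

```python
import queue
import threading


def process_all(items, worker_count, handler):
    """Process items with a fixed number of threads pulling from one queue."""
    work = queue.Queue()  # thread-safe ("synchronized") queue of work items
    for index, item in enumerate(items):
        work.put((index, item))

    results = [None] * len(items)

    def worker():
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is done
            results[index] = handler(item)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With worker_count set to 24 this gives exactly "24 workers at all times", but nothing in it knows whether those workers are blocked on subprocesses or on disk, which is the concern raised next.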
GCD works at its most efficient when the work items submitted to it never block. That said, the workload you're describing is rather complex and rife with opportunities for your threads to block. For starters, you're spawning a bunch of other processes. Right here, this means that you're already relying on the OS to divvy up time/resources between your master task and these slave tasks. Other than setting the OS priority of each subprocess, the OS scheduler has no way to know which processes are more important than others, and by default, your subprocesses are going to have the same priority as their parent. That said, it doesn't sound like you have anything to gain by tweaking process priorities. I'm assuming you're blocking the master task thread that's waiting for the slave tasks to complete. That is effectively parking that thread -- it can do no useful work. But like I said, I don't think there's much to be gained by tweaking the OS priorities of your slave tasks, because this really sounds like it's an I/O bound workflow...
You go on to describe three I/O-heavy operations ("serializing it as a set of XML files, a SQLite database, and a CoreData database.") So now you have all these different threads and processes vying for what is presumably a shared bulk storage device. (i.e. unless you're writing to 24 different databases, on 24 separate hard drives, one for each core, your process is ultimately going to be serialized at the disk accesses.) Even if you had 24 different hard drives, writing to a hard drive (even an SSD) is comparatively slow. Your threads are going to be taken off of the CPU they were running on (so that another thread that's waiting can run) for virtually any blocking disk write.
If you wanted to maximize the performance you're getting out of GCD, you would probably want to rewrite all the stuff you're doing in subtasks in C/C++/Objective-C, bringing them in-process, and then conducting all the associated I/O using dispatch_io primitives. For API where you don't control the low-level reads and writes, you would want to carefully manage and tune your workload to optimize it for the hardware you have. For instance, if you have a bunch of stuff to write to a single, shared SQLite database, there's no point in ever having more than one thread trying to write to that database at once. You'd be better off making one thread (or a serial GCD queue) to write to SQLite and submitting tasks to that after pre-processing is done.
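The single-writer idea can be sketched as follows, here in Python with the standard sqlite3 module instead of a serial GCD queue. This is only an illustration under assumptions: the table layout, the STOP marker and the function names are hypothetical, and the "pre-processing" is a placeholder.

```python
import queue
import sqlite3
import threading

STOP = object()  # hypothetical shutdown marker


def writer_loop(db_path, inbox, done):
    """Own the only SQLite connection and apply queued writes one at a time."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (item_id INTEGER, payload TEXT)")
    while True:
        row = inbox.get()
        if row is STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", row)
    conn.commit()
    done.append(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])
    conn.close()


def run_demo(item_count=20, worker_count=4, db_path=":memory:"):
    inbox = queue.Queue()
    done = []
    writer = threading.Thread(target=writer_loop, args=(db_path, inbox, done))
    writer.start()

    def preprocess(start):
        # Workers do the parallel-friendly part and only enqueue finished rows.
        for item_id in range(start, item_count, worker_count):
            inbox.put((item_id, "item-%d" % item_id))

    workers = [threading.Thread(target=preprocess, args=(i,)) for i in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    inbox.put(STOP)
    writer.join()
    return done[0]  # number of rows the single writer stored
```

The workers never touch the database, so there is no contention on it; the queue is the only shared object.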
I could go on for quite a while here, but the bottom line is that you've got a complex, seemingly I/O bound workflow here. At the highest-level, CPU utilization or "number of running threads" is going to be a particularly poor measure of performance for such a task. By using sub-processes (i.e. scripts), you're putting a lot of control into the hands of the OS, which knows effectively nothing about your workload a priori, and therefore can do nothing except use its general scheduler to divvy up resources. GCD's opaque thread pool management is really the least of your problems.
On a practical level, if you want to speed things up, go buy multiple, faster (i.e. SSD) hard drives, and rework your task/workflow to utilize them separately and in parallel. I suspect that would yield the biggest bang for your buck (for some equivalence relation of time == money == hardware.)

Risk Assessment: Using Pthreads (vs. GCD or NSThread)

A colleague suggested recently that I use pthreads instead of GCD because it's, "way faster." I don't disagree that it's faster, but what's the risk with pthreads?
My feeling is that they will ultimately not be anywhere nearly as idiot-proof as GCD (and my team of one is 50% idiots). Are pthreads hard to get right?
GCD and pthreads are both ways of doing work asynchronously, but they are significantly different. Most descriptions of GCD describe it in terms of threads and of thread pooling, but as DrPizza puts it
to concentrate on [threads and thread pools] is to miss the point. GCD’s value lies not in thread pooling, but in queuing.
(from "Grand Central Dispatch for Win32: why I want it")
GCD has some nice benefits over APIs like pthreads.
GCD does more to encourage and support "islands of serialization in a sea of parallelism." GCD makes it easy to avoid a lot of the locks, mutexes and condition variables that are the normal way of communicating between threads. This is because you decompose your program into tasks and GCD handles getting the task input and output to the appropriate thread behind the scenes. So programming with GCD allows you to pretty much write serially and not worry too much about stuff people often worry about in threaded code. That makes the code simpler and less bug-prone.
GCD can do scaling for you so the program uses as much parallelism as the dependencies between the tasks you've decomposed your program into and the hardware allow for. Of course designing the program to be scalable is generally the hard bit, but you'll still need something to actually take advantage of that work to run as much as possible in parallel. Work stealing schedulers like GCD do that part.
GCD is composable. If you explicitly spawn threads for things you want to do asynchronously or in parallel you can run into a problem when libraries you use do the same thing. Say you decide you can run eight threads simultaneously because that's how many threads will be effective for your program given the machine it runs on. And then say a library you use on each thread does the same thing. Now you could have up to 64 threads running at once, which is more than you know is effective for your program.
Thread pooling solves this but everyone needs to use the same thread pool. GCD uses thread pooling internally and provides the same pool to everyone.
GCD provides a bunch of 'sources' and makes it easy to write an event driven program that depends on or takes input from the sources. For example you can very easily have a queue set up to launch a task every time data is available to read on a network socket, or when a timer fires, or whatever.
I don't think they're hard to get right, but having worked with many different approaches over the years (pthreads, GCD, NSThread, NSOperationQueue, etc.) I have no evidence to support an assertion like "pthreads are way faster." Even if they were faster (and I would expect the difference to be marginal at best) I always say, "use the highest level abstraction that gets the job done." Also, avoid premature optimization.
Anecdotally speaking, GCD is pretty damn fast. How I see it, portability is the primary advantage of pthreads over GCD. If this is OSX/iOS exclusive code, I would see no advantage whatsoever to using pthreads, absent empirical evidence to the contrary.
Ignore the other well thought technical reasons, because they aren't relevant. You are not writing software for a benchmark, are you? At some point, a user is going to sit in front of your device and try to use it. And do you know what happens if you use pthreads instead of GCD? What happens is that your software doesn't scale well in the presence of other software multitasking at the same time because it is going to fight for the CPU presuming it is the only software running at the same time. Which is crazy. Nobody runs single task OSes any more. Even single task iOS runs much stuff in the background.
Instead, if all the programs you were running used GCD, the OS could scale the number of concurrent tasks running on their queues to better match the number of actual processors, reducing task switching overhead.
If your program doesn't require pseudo-real-time low latency, and thus a dedicated thread to process data as soon as it is available (maybe that is your colleague's definition of "way faster"), chances are GCD will be superior for the user because it will make better use of the resources available on their device. Even if GCD's API were horrible or slow, it would be worthwhile to use it over other solutions which don't scale across different processes.
NSThread is probably implemented on top of the pthreads library. The point is that the lower-level a concept is, the more useless and repetitive work you have to do yourself.
The pthreads library isn't that hard to learn. My professor at university taught it, and even the slowest learners were able to use the library, perhaps by lazily copying and pasting code, but getting the job done.
So I definitely suggest implementing a pthread wrapper class; it's easy to do.
This way you eliminate the repetitive stuff. For example, you may be doing this thousands of times:
pthread_mutex_init( mutex_ptr, NULL);
If that's your case (it's just an example), you may always be passing NULL, and the same goes for other functions.
Once you have implemented the class, there is no guarantee that it is faster than GCD.
GCD does some optimizations; for example, two blocks may be run on the same thread.
So I suggest using your own class only if it's faster than GCD, which you can test with Time Profiler.

Grand Central Dispatch vs NSThreads?

I searched a variety of sources but don't really understand the difference between using NSThreads and GCD. I'm completely new to the OS X platform so I might be completely misinterpreting this.
From what I read online, GCD seems to do the exact same thing as basic threads (POSIX, NSThreads etc.) while adding much more technical jargon ("blocks"). It seems to just overcomplicate the basic thread creation system (create thread, run function).
What exactly is GCD and why would it ever be preferred over traditional threading? When should traditional threads be used rather than GCD? And finally is there a reason for GCD's strange syntax? ("blocks" instead of simply calling functions).
I am on Mac OS X 10.6.8 Snow Leopard and I am not programming for iOS - I am programming for Macs. I am using Xcode 3.6.8 in Cocoa, creating a GUI application.
Advantages of Dispatch
The advantages of dispatch are mostly outlined here:
Migrating Away from Threads
The idea is that you eliminate work on your part, since the paradigm fits MOST code more easily.
It reduces the memory penalty your application pays for storing thread stacks in the application’s memory space.
It eliminates the code needed to create and configure your threads.
It eliminates the code needed to manage and schedule work on threads.
It simplifies the code you have to write.
Empirically, using GCD-type locking instead of @synchronized is about 80% faster or more, though micro-benchmarks may be deceiving. Read more here, though I think the advice to go async with writes does not apply in many cases, and it's slower (but it's asynchronous).
Advantages of Threads
Why would you continue to use Threads? From the same document:
It is important to remember that queues are not a panacea for
replacing threads. The asynchronous programming model offered by
queues is appropriate in situations where latency is not an issue.
Even though queues offer ways to configure the execution priority of
tasks in the queue, higher execution priorities do not guarantee the
execution of tasks at specific times. Therefore, threads are still a
more appropriate choice in cases where you need minimal latency, such
as in audio and video playback.
Another place where I haven't personally found an ideal solution using queues is daemon processes that need to be constantly rescheduled. Not that you cannot reschedule them, but looping within a NSThread method is simpler (I think). Edit: Now I'm convinced that even in this context, GCD-style locking would be faster, and you could also do a loop within a GCD-dispatched operation.
Blocks in Objective-C?
Blocks are really horrible in Objective-C due to the awful syntax (though Xcode can sometimes help with autocompletion, at least). If you look at blocks in Ruby (or any other language, pretty much) you'll see how simple and elegant they are for dispatching operations. I'd say that you'll get used to the Objective-C syntax, but I really think that you'll get used to copying from your examples a lot :)
You might find my examples from here to be helpful, or just distracting. Not sure.
While the answers so far are about threads vs GCD inside a single application and the differences for programming, the reason you should always prefer GCD is multitasking environments (since you are on Mac OS X and not iOS). Threads are OK if your application is running alone on your machine. Say you have a video editing program and want to apply some effect to the video. The render is going to take 10 minutes on a machine with eight cores. Fine.
Now, while the video app is churning in the background, you open an image editing program and play with some high-resolution image, decide to apply some special image filter, and your image application, being clever, detects you have eight cores and starts eight threads to process the image. Nice, isn't it? Except that's terrible for performance. The image editing app doesn't know anything about the video app (and vice versa), and therefore both will request their respective optimum number of threads. And there will be pain and blood while the cores try to switch from one thread to another, because to avoid starvation the CPU will eventually let all threads run, even though in this situation it would be better to run only 4 threads for the video app and 4 threads for the image app.
For a more detailed reference, take a look at http://deusty.blogspot.com/2010/11/introducing-gcd-based-cocoahttpserver.html where you can see a benchmark of an HTTP server using GCD versus thread, and see how it scales. Once you understand the problem threads have for multicore machines in multi-app environments, you will always want to use GCD, simply because threads are not always optimal, while GCD potentially can be since the OS can scale thread usage per app depending on load.
Please, remember we won't have more GHz in our machines any time soon. From now on we will only have more cores, so it's your duty to use the best tool for this environment, and that is GCD.
Blocks allow for passing a block of code to execute. Once you get past the "strange syntax", they are quite powerful.
GCD also uses queues which, if used properly, can help with lock-free concurrency as long as the code executing in the separate queues is isolated. It's a simpler way to offer background work and concurrency while minimizing the chance of deadlocks (if used right).
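To illustrate the "queue instead of lock" idea outside of Objective-C, here is a rough Python analogy in which a single-worker executor plays the role of a serial dispatch queue. This is a sketch of the pattern only, not GCD itself; the Counter class and hammer function are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """State that is only ever touched from one serial executor."""

    def __init__(self):
        self._value = 0
        # One worker thread: submitted tasks run one at a time, in FIFO order.
        self._serial = ThreadPoolExecutor(max_workers=1)

    def increment(self):
        # Callers never take a lock; they just enqueue the mutation.
        return self._serial.submit(self._increment)

    def _increment(self):
        self._value += 1
        return self._value

    def value(self):
        # Reads also go through the queue, so they see all earlier writes.
        return self._serial.submit(lambda: self._value).result()

    def close(self):
        self._serial.shutdown(wait=True)


def hammer(counter, callers=8, per_caller=500):
    """Call increment() from many threads at once and return the final value."""
    def burst(_caller_id):
        for _ in range(per_caller):
            counter.increment()

    with ThreadPoolExecutor(max_workers=callers) as pool:
        list(pool.map(burst, range(callers)))
    return counter.value()
```

Because every access to the counter runs on the same single-threaded queue, no increment can be lost even though the callers run concurrently.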
The "strange syntax" exists because they chose the caret (^), which was one of the few symbols that wasn't overloaded as an operator in C++.
See:
https://developer.apple.com/library/ios/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Although you might think rewriting your code for dispatch queues would
be difficult, it is often easier to write code for dispatch queues
than it is to write code for threads. The key to writing your code is
to design tasks that are self-contained and able to run
asynchronously. (This is actually true for both threads and dispatch
queues.)
...
Although you would be right to point out that two tasks running in a
serial queue do not run concurrently, you have to remember that if two
threads take a lock at the same time, any concurrency offered by the
threads is lost or significantly reduced. More importantly, the
threaded model requires the creation of two threads, which take up
both kernel and user-space memory. Dispatch queues do not pay the same
memory penalty for their threads, and the threads they do use are kept
busy and not blocked.
GCD (Grand Central Dispatch): GCD provides and manages FIFO queues to which your application can submit tasks in the form of block objects. Work submitted to dispatch queues is executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes. Why GCD over threads? GCD takes into account:
How much work your CPU cores are doing.
How many CPU cores you have.
How many threads should be spawned.
If it needs to, GCD can go down into the kernel and communicate about resources, which allows better scheduling.
There is less load on the kernel and better synchronization with the OS.
GCD reuses existing threads from the thread pool instead of creating and then destroying them.
It makes the best use of the system's hardware resources, while allowing the operating system to balance the load of all the programs currently running, along with considerations like heat and battery life.
I have shared my experience with threads, operating system and GCD AT http://iosdose.com

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded dual quad-core Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
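The 6.25% figure is just the share of one logical core on this machine, assuming 2 sockets x 4 cores x 2 hyper-threads = 16 logical cores. As a quick check (the function name is only for illustration):

```python
def single_core_share(sockets, cores_per_socket, threads_per_core):
    """Percentage of total CPU that one logical core represents."""
    logical_cores = sockets * cores_per_socket * threads_per_core
    return 100.0 / logical_cores


# Dual quad-core with hyper-threading: 16 logical cores.
print(single_core_share(2, 4, 2))  # 6.25
```

So 6-7% system time is about what one fully busy logical core would show in top.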
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one per CPU core), each run by a single thread, you will not have this problem. See HTTP server example 2 on the Boost.Asio page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
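The shape of that "one reactor per thread" design can be sketched with a Python analogy, using one asyncio event loop per thread in place of one io_service per thread. This is not boost::asio code; LoopPool, handle and demo are hypothetical names, and asyncio.sleep(0) merely stands in for socket I/O.

```python
import asyncio
import itertools
import threading


class LoopPool:
    """N event loops, each driven by exactly one thread."""

    def __init__(self, size):
        self._loops = [asyncio.new_event_loop() for _ in range(size)]
        self._threads = [
            threading.Thread(target=loop.run_forever, daemon=True)
            for loop in self._loops
        ]
        for thread in self._threads:
            thread.start()
        self._next = itertools.cycle(self._loops)

    def submit(self, coro):
        # Round-robin: each new connection is pinned to a single loop.
        return asyncio.run_coroutine_threadsafe(coro, next(self._next))

    def close(self):
        for loop in self._loops:
            loop.call_soon_threadsafe(loop.stop)
        for thread in self._threads:
            thread.join()
        for loop in self._loops:
            loop.close()


async def handle(connection_id):
    await asyncio.sleep(0)  # stand-in for reading from a socket
    return connection_id, threading.get_ident()


def demo(loop_count=4, connections=16):
    pool = LoopPool(loop_count)
    futures = [pool.submit(handle(i)) for i in range(connections)]
    results = [future.result() for future in futures]
    pool.close()
    return results
```

Since each loop is only ever entered by its own thread, the loops share no reactor-level lock with each other, which is the property the multiple-io_service layout is after.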
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.

Is it safe to access the hard drive via many different GCD queues?

Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the # of units of work that are enqueued in the global asynchronous queues at any one time. A GCD semaphore makes this quite easy; you have a single serial queue into which all units of work are enqueued. Every time you dequeue a unit of work, you increment the semaphore. Every time a unit of work is completed, you decrement the semaphore. As long as the semaphore is below some maximum value (say, 4), then you enqueue a new unit of work.
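A minimal sketch of that throttling scheme, in Python with threading.BoundedSemaphore rather than a GCD semaphore. Note the counting convention is inverted compared to the description above: here the semaphore counts free slots, so it is acquired before a unit of work is submitted and released when the work completes. The names run_limited, max_in_flight and pool_size are assumptions for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_limited(tasks, max_in_flight=4, pool_size=16):
    """Run tasks on a large pool while never letting more than max_in_flight run."""
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    running = 0
    peak = 0  # highest number of tasks observed running at once

    def wrapped(task):
        nonlocal running, peak
        try:
            with lock:
                running += 1
                peak = max(peak, running)
            return task()
        finally:
            with lock:
                running -= 1
            slots.release()  # free a slot once the unit of work completes

    futures = []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for task in tasks:
            slots.acquire()  # blocks the submitter while all slots are taken
            futures.append(pool.submit(wrapped, task))
    return [future.result() for future in futures], peak
```

The single submitting loop plays the role of the serial queue: it is the only place where new work is handed to the concurrent pool, and it stalls whenever the limit is reached.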
If you take something that is normally I/O-limited, such as tar, and run a bunch of copies in GCD:
It will run more slowly, because you are throwing more CPU at an I/O-bound task, meaning the I/O will be more scattered and there will be more of it at the same time.
No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have fewer than 10 threads.
Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD has now provided I/O primitives with the release of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples on how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks) with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.