How to avoid Boost ASIO reactor becoming constrained to a single core? - optimization

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a farily massively parallel application, running on a hyperthreaded-dual-quad-core-Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is on the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one, global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the evented-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?

The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.

I believe that if you use multiple io_service object (say for each cpu core), each run by a single thread, you will not have this problem. See the http server example 2 on the boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.

In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.

Related

Visual Basic .NET limit a process to a single thread

How do you keep a Visual Basic program (or, better, only a single portion of it) to a single thread? For example
[Hypothetical thread limiting command here]
my code and such
[End thread limitation]
Is there any way to do this? Sorry if I am being ambiguous.
Unless you are explicitly starting other threads, or are using a 3rd party library that uses threads, Visual Basic (or any other .NET language) will not run with multiple threads. You may see some debugging messages about threads starting and shutting down, but those are from the garbage collector running in the background and not your application.
Web projects are an exception to this, as each request will be handled on a separate thread.
UPDATE
What you are seeing is most likely a combination of multiple effects.
The most important being something called Hyper-threading. Despite it's name, it has nothing to do with threads at the application level. Most likely the CPU in your computer has only 2 physical cores that support Hyper-threading (HT). HT will make each physical core show up as 2 logical cores in the operating system. The CPU will decide to process instructions on one or both of these cores on its own. This isn't true multi-threading, and you will never have to worry about any effects from HT in your code.
Another cause for what you are seeing is called Processor affinity(PA), or more properly, the lack of PA. The operating system has the choice of which logical CPUs to execute your code on. It also has the option of moving that execution around to different CPUs to try and optimize its workload. It can do this whenever it wants and as often as it wants; sometimes it can happen very rapidly. Depending on your OS you can "pin" a program to a specific core (or set of cores) to stop this from happening, but again, it won't should not affect your program's execution in any way.
As far as the rise in CPU temperature you are seeing across all core, it can be explained pretty easily. The physical cores on your CPU are very close together, under a heat-sink. As one core heats up it will warm up the other core near it.

How does Redis achieve the high throughput and performance?

I know this is a very generic question. But, I wanted to understand what are the major architectural decision that allow Redis (or caches like MemCached, Cassandra) to work at amazing performance limits.
How are connections maintained?
Are connections TCP or HTTP?
I know that it is completely written in C. How is the memory managed?
What are the synchronization techniques used to achieve high throughput inspite
of competing read/writes?
Basically, what is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box? I also understand that the answer needs to be very huge and should include very complex details for completion. But, what I'm looking for are some general techniques used rather than all nuances.
There is a wealth of of information in the Redis documentation to understand how it works. Now, to answer specifically your questions:
1) How are connections maintained?
Connections are maintained and managed using the ae event loop (designed by the Redis author). All network I/O operations are non blocking. You can see ae as a minimalistic implementation using the best network I/O demultiplexing mechanism of the platform (epoll for Linux, kqueue for BSD, etc ...) just like libevent, libev, libuv, etc ...
2) Are connections TCP or HTTP?
Connections are TCP using the Redis protocol, which is a simple telnet compatible, text oriented protocol supporting binary data. This protocol is typically more efficient than HTTP.
3) How is the memory managed?
Memory is managed by relying on a general purpose memory allocator. On some platforms, this is actually the system memory allocator. On some other platforms (including Linux), jemalloc has been selected since it offers a good balance between CPU consumption, concurrency support, fragmentation and memory footprint. jemalloc source code is part of the Redis distribution.
Contrary to other products (such as memcached), there is no implementation of a slab allocator in Redis.
A number of optimized data structures have been implemented on top of the general purpose allocator to reduce the memory footprint.
4) What are the synchronization techniques used to achieve high throughput inspite of competing read/writes?
Redis is a single-threaded event loop, so there is no synchronization to be done since all commands are serialized. Now, some threads also run in the background for internal purposes. In the rare cases they access the data managed by the main thread, classical pthread synchronization primitives are used (mutexes for instance). But 100% of the data accesses made on behalf of multiple client connections do not require any synchronization.
You can find more information there:
Redis is single-threaded, then how does it do concurrent I/O?
What is the difference between a plain vanilla implementation of a machine with in memory cache and server that can respond to commands and a Redis box?
There is no difference. Redis is a plain vanilla implementation of a machine with in memory cache and server that can respond to commands. But it is an implementation which is done right:
using the single threaded event loop model
using simple and minimalistic data structures optimized for their corresponding use cases
offering a set of commands carefully chosen to balance minimalism and usefulness
constantly targeting the best raw performance
well adapted to modern OS mechanisms
providing multiple persistence mechanisms because the "one size does fit all" approach is only a dream.
providing the building blocks for HA mechanisms (replication system for instance)
avoiding stacking up useless abstraction layers like pancakes
resulting in a clean and understandable code base that any good C developer can be comfortable with

High CPU with ImageResizer DiskCache plugin

We are noticing occasional periods of high CPU on a web server that happens to use ImageResizer. Here are the surprising results of a trace performed with NewRelic's thread profiler during such a spike:
It would appear that the cleanup routine associated with ImageResizer's DiskCache plugin is responsible for a significant percentage of the high CPU consumption associated with this application. We have autoClean on, but otherwise we're configured to use the defaults, which I understand are optimal for most typical situations:
<diskCache autoClean="true" />
Armed with this information, is there anything I can do to relieve the CPU spikes? I'm open to disabling autoClean and setting up a simple nightly cleanup routine, but my understanding is that this plugin is built to be smart about how it uses resources. Has anyone experienced this and had any luck simply changing the default configuration?
This is an ASP.NET MVC application running on Windows Server 2008 R2 with ImageResizer.Plugins.DiskCache 3.4.3.
Sampling, or why the profiling is unhelpful
New Relic's thread profiler uses a technique called sampling - it does not instrument the calls - and therefore cannot know if CPU usage is actually occurring.
Looking at the provided screenshot, we can see that the backtrace of the cleanup thread (there is only ever one) is frequently found at the WaitHandle.WaitAny and WaitHandle.WaitOne calls. These methods are low-level synchronization constructs that do not spin or consume CPU resources, but rather efficiently return CPU time back to other threads, and resume on a signal.
Correct profilers should be able to detect idle or waiting threads and eliminate them from their statistical analysis. Because New Relic's profiler failed to do that, there is no useful way to interpret the data it's giving you.
If you have more than 7,000 files in /imagecache, here is one way to improve performance
By default, in V3, DiskCache uses 32 subfolders with 400 items per folder (1000 hard limit). Due to imperfect hash distribution, this means that you may start seeing cleanup occur at as few as 7,000 images, and you will start thrashing the disk at ~12,000 active cache files.
This is explained in the DiskCache documentation - see subfolders section.
I would suggest setting subfolders="8192" if you have a larger volume of images. A higher subfolder count increases overhead slightly, but also increases scalability.

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT : I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Isn't it all, or is there more to it?
Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt earlier. Around 2005 (don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end xeon processor than on an earlier Pentium II or III (again, my memory), they implemented it with a new cpu instruction sysenter (which had actually existed since Pentium Pro I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it) IIRC.
So, nowadays, a kernel trap is basically just a couple of cpu instructions, hence rather few cycles, compared to tenths or hundreds of thousands when using an interrupt (which is really slow on modern CPU's).
A context switch between processes is heavy. It means storing all processor state (registers, etc) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the cpu, and reading back the process state for the new process. It will (likely) have nothing still in the cpu cache from last time it ran, so each memory read will be a cache miss, and needed to be read from RAM. This is rather slow. When I was at the university, I "invented" (well, I did come up with the idea, knowing that there is plenty of dye in a CPU, but not enough cool if it's constantly powered) a cache that was infinite size although unpowered when unused (only used on context switches i.e.) in the CPU, and implemented this in Simics. Implemented support for this magic cache I called CARD (Context-switch Active, Run-time Drowsy) in Linux, and benchmarked rather heavily. I found that it could speed-up a Linux machine with lots of heavy processes sharing the same core with about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.

Grand Central Dispatch vs NSThreads?

I searched a variety of sources but don't really understand the difference between using NSThreads and GCD. I'm completely new to the OS X platform so I might be completely misinterpreting this.
From what I read online, GCD seems to do the exact same thing as basic threads (POSIX, NSThreads etc.) while adding much more technical jargon ("blocks"). It seems to just overcomplicate the basic thread creation system (create thread, run function).
What exactly is GCD and why would it ever be preferred over traditional threading? When should traditional threads be used rather than GCD? And finally is there a reason for GCD's strange syntax? ("blocks" instead of simply calling functions).
I am on Mac OS X 10.6.8 Snow Leopard and I am not programming for iOS - I am programming for Macs. I am using Xcode 3.6.8 in Cocoa, creating a GUI application.
Advantages of Dispatch
The advantages of dispatch are mostly outlined here:
Migrating Away from Threads
The idea is that you eliminate work on your part, since the paradigm fits MOST code more easily.
It reduces the memory penalty your application pays for storing thread stacks in the application’s memory space.
It eliminates the code needed to create and configure your threads.
It eliminates the code needed to manage and schedule work on threads.
It simplifies the code you have to write.
Empirically, using GCD-type locking instead of #synchronized is about 80% faster or more, though micro-benchmarks may be deceiving. Read more here, though I think the advice to go async with writes does not apply in many cases, and it's slower (but it's asynchronous).
Advantages of Threads
Why would you continue to use Threads? From the same document:
It is important to remember that queues are not a panacea for
replacing threads. The asynchronous programming model offered by
queues is appropriate in situations where latency is not an issue.
Even though queues offer ways to configure the execution priority of
tasks in the queue, higher execution priorities do not guarantee the
execution of tasks at specific times. Therefore, threads are still a
more appropriate choice in cases where you need minimal latency, such
as in audio and video playback.
Another place where I haven't personally found an ideal solution using queues is daemon processes that need to be constantly rescheduled. Not that you cannot reschedule them, but looping within a NSThread method is simpler (I think). Edit: Now I'm convinced that even in this context, GCD-style locking would be faster, and you could also do a loop within a GCD-dispatched operation.
Blocks in Objective-C?
Blocks are really horrible in Objective-C due to the awful syntax (though Xcode can sometimes help with autocompletion, at least). If you look at blocks in Ruby (or any other language, pretty much) you'll see how simple and elegant they are for dispatching operations. I'd say that you'll get used to the Objective-C syntax, but I really think that you'll get used to copying from your examples a lot :)
You might find my examples from here to be helpful, or just distracting. Not sure.
While the answers so far are about the context of threads vs GCD inside the domain of a single application and the differences it has for programming, the reason you should always prefer GCD is because of multitasking environments (since you are on MacOSX and not iOS). Threads are ok if your application is running alone on your machine. Say, you have a video edition program and want to apply some effect to the video. The render is going to take 10 minutes on a machine with eight cores. Fine.
Now, while the video app is churning in the background, you open an image edition program and play with some high resolution image, decide to apply some special image filter and your image application being clever detects you have eight cores and starts eight threads to process the image. Nice isn't it? Except that's terrible for performance. The image edition app doesn't know anything about the video app (and vice versa) and therefore both will request their respectively optimum number of threads. And there will be pain and blood while the cores try to switch from one thread to another, because to avoid starvation the CPU will eventually let all threads run, even though in this situation it would be more optimal to run only 4 threads for the video app and 4 threads for the image app.
For a more detailed reference, take a look at http://deusty.blogspot.com/2010/11/introducing-gcd-based-cocoahttpserver.html where you can see a benchmark of an HTTP server using GCD versus thread, and see how it scales. Once you understand the problem threads have for multicore machines in multi-app environments, you will always want to use GCD, simply because threads are not always optimal, while GCD potentially can be since the OS can scale thread usage per app depending on load.
Please, remember we won't have more GHz in our machines any time soon. From now on we will only have more cores, so it's your duty to use the best tool for this environment, and that is GCD.
Blocks allow for passing a block of code to execute. Once you get past the "strange syntax", they are quite powerful.
GCD also uses queues which if used properly can help with lock free concurrency if the code executing in the separate queues are isolated. It's a simpler way to offer background and concurrency while minimizing the chance for deadlocks (if used right).
The "strange syntax" is because they chose the caret (^) because it was one of the few symbols that wasn't overloaded as an operator in C++
See:
https://developer.apple.com/library/ios/#documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Although you might think rewriting your code for dispatch queues would
be difficult, it is often easier to write code for dispatch queues
than it is to write code for threads. The key to writing your code is
to design tasks that are self-contained and able to run
asynchronously. (This is actually true for both threads and dispatch
queues.)
...
Although you would be right to point out that two tasks running in a
serial queue do not run concurrently, you have to remember that if two
threads take a lock at the same time, any concurrency offered by the
threads is lost or significantly reduced. More importantly, the
threaded model requires the creation of two threads, which take up
both kernel and user-space memory. Dispatch queues do not pay the same
memory penalty for their threads, and the threads they do use are kept
busy and not blocked.
GCD (Grand Central Dispatch): GCD provides and manages FIFO queues to which your application can submit tasks in the form of block objects. Work submitted to dispatch queues are executed on a pool of threads fully managed by the system. No guarantee is made as to the thread on which a task executes. Why GCD over threads :
How much work your CPU cores are doing
How many CPU cores you have.
How much threads should be spawned.
If GCD needs it can go down into the kernel and communicate about resources, thus better scheduling.
Less load on kernel and better sync with OS
GCD uses existing threads from thread pool instead of creating and then destroying.
Best advantage of the system’s hardware resources, while allowing the operating system to balance the load of all the programs currently running along with considerations like heating and battery life.
I have shared my experience with threads, operating system and GCD AT http://iosdose.com