IOCP file limit on AIX 7.1 - aix

I am seeing EBADF returned from aio_read/aio_write operations against FD 192 when the application is accessing a large number of files through a common IOCP.
Is there a limit to the number of file descriptors that can be associated with a single I/O completion port under AIX 7.1? Can the limit be raised? Is there a reason to limit the number of file descriptors per IOCP?
ulimit reports an unlimited number of file descriptors, so that's not it. I'm hoping there is another tunable.
Thanks.

The errors turned out to arise from unrelated coding issues in a parallel set of potential changes (read: I'm an idiot).
For what it's worth, though, AIX does support an implementation of IOCP when you enable legacy AIO support.

Related

Resource usage of a static web server

I came across this question in a blog post. It was asked by Mozilla in their internship interview. (Blog Post)
You are running a HTTP server (nginx, Apache, etc) that is configured
to serve static files off the local filesystem of your modern,
multi-core server connected to a gigabit network. A handful of clients
start requesting the same 4kb static file as fast as they can. What
system resource do you think will be exhausted first?
a. CPU
b. Disk / I/O
c. Memory
d. Network
e. Other
In my view, none of these would be exhausted on a modern machine with Nginx/Apache. Won't the web server cache such a small file and just keep serving it? Also, for repeated requests it can simply send a 304 Not Modified response.
In case of Apache, I guess due to it handling multiple clients by spawning threads, CPU will be exhausted first, but for a "handful" of clients, that won't matter.
I wanted to know what others have to say about this question.
It really depends. 4 KB is a convenient size that fits into virtually all caches and buffers at their default settings, so it is easy (and fast) to pass around. Memory is not a limiting factor here, as web servers operate on file handles, not entire files. In this case I would assume they keep the file in memory, but that would be one copy per worker instance, which usually comes to 4 KB * (num_cores + 1) at most, which is not really an issue.
One could argue that either memory speed or disk speed is an issue. But the former is negligible when methods like sendfile are properly configured, enabling a zero-copy approach. The latter amortizes over time once a copy of the file has been loaded into memory.
Lastly, there's the interface and the CPU(s). Overall, CPU time tends to be a lot cheaper than network time, so I would expect the NIC to be the bottleneck long before the CPU - if at all.
The question is a bit unspecific on the location of the clients. If they are connected to the same GbE network, they could indeed have the power to saturate your NIC with their requests. If not, some intermediary could become the limiting factor.
Now let us assume those clients were in our network and we had a single-homed 10GbE NIC here, connected via 8 lanes (which is fairly standard IMHO): PCIe 3.0 x8 is specified at 7,877 MB/s. A Core i7 3770 has a bus speed of 5 GT/s, which translates to roughly 8 GB/s at 8 lanes. Assuming no other I/O-intensive workload, this CPU could easily saturate the NIC.
So in summary: Network/NIC saturation before CPU saturation before anything else.
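As a rough sanity check on the network claim, the request rate at which the link fills up can be estimated with simple arithmetic. The sketch below is only an illustration: the function name is made up, and the 400 bytes of per-response overhead (HTTP headers plus TCP/IP framing) is an assumed figure, not a measurement.

```c
/*
 * Upper bound on the number of responses per second that fit through a link.
 * overhead_bytes is an ASSUMED per-response cost for HTTP headers and
 * TCP/IP framing; measure it on a real setup before trusting the result.
 */
double max_responses_per_second(double link_bits_per_second,
                                double payload_bytes,
                                double overhead_bytes)
{
    double link_bytes_per_second = link_bits_per_second / 8.0;

    return link_bytes_per_second / (payload_bytes + overhead_bytes);
}
```

With a gigabit link, a 4096-byte file and the assumed 400 bytes of overhead, max_responses_per_second(1e9, 4096, 400) comes out at just under 28,000 responses per second, which is the point where the NIC is full regardless of how much CPU is left.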

High CPU with ImageResizer DiskCache plugin

We are noticing occasional periods of high CPU on a web server that happens to use ImageResizer. Here are the surprising results of a trace performed with NewRelic's thread profiler during such a spike:
It would appear that the cleanup routine associated with ImageResizer's DiskCache plugin is responsible for a significant percentage of the high CPU consumption associated with this application. We have autoClean on, but otherwise we're configured to use the defaults, which I understand are optimal for most typical situations:
<diskCache autoClean="true" />
Armed with this information, is there anything I can do to relieve the CPU spikes? I'm open to disabling autoClean and setting up a simple nightly cleanup routine, but my understanding is that this plugin is built to be smart about how it uses resources. Has anyone experienced this and had any luck simply changing the default configuration?
This is an ASP.NET MVC application running on Windows Server 2008 R2 with ImageResizer.Plugins.DiskCache 3.4.3.
Sampling, or why the profiling is unhelpful
New Relic's thread profiler uses a technique called sampling - it does not instrument the calls - and therefore cannot know if CPU usage is actually occurring.
Looking at the provided screenshot, we can see that the backtrace of the cleanup thread (there is only ever one) is frequently found at the WaitHandle.WaitAny and WaitHandle.WaitOne calls. These methods are low-level synchronization constructs that do not spin or consume CPU resources, but rather efficiently return CPU time back to other threads, and resume on a signal.
Correct profilers should be able to detect idle or waiting threads and eliminate them from their statistical analysis. Because New Relic's profiler failed to do that, there is no useful way to interpret the data it's giving you.
If you have more than 7,000 files in /imagecache, here is one way to improve performance
By default, in V3, DiskCache uses 32 subfolders with 400 items per folder (1000 hard limit). Due to imperfect hash distribution, this means that you may start seeing cleanup occur at as few as 7,000 images, and you will start thrashing the disk at ~12,000 active cache files.
This is explained in the DiskCache documentation - see subfolders section.
I would suggest setting subfolders="8192" if you have a larger volume of images. A higher subfolder count increases overhead slightly, but also increases scalability.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded dual quad-core Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of percent. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest as idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind-of understand why, because thread safety and lifetime of objects is extremely precarious when you use multiple threads that each can get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and scales beyond a single core. Not using boost::asio at that point -- it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say one for each CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the Boost ASIO page.
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
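To make the "one event loop per thread" idea concrete without depending on a particular Boost version, here is a minimal sketch in plain C on Linux using epoll and pthreads. It is not Boost.Asio code and not how io_service is implemented; it only shows the shape of the design, where every thread blocks on its own epoll instance so no reactor is shared between threads. The names struct loop, run_loops and MAX_LOOPS are made up for this illustration, and a pipe stands in for a client socket.

```c
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_LOOPS 16

/* One independent event loop: a private epoll instance fed by a private pipe. */
struct loop {
    int epfd;     /* epoll instance owned by exactly one thread */
    int rd, wr;   /* pipe ends: rd is watched by epfd, wr is the event source */
    int handled;  /* events processed by this loop */
};

/* Thread body: block on this loop's own epoll fd and consume one event. */
static void *loop_main(void *arg)
{
    struct loop *lp = arg;
    struct epoll_event ev;
    char byte;

    if (epoll_wait(lp->epfd, &ev, 1, -1) == 1 &&
        read(ev.data.fd, &byte, 1) == 1)
        lp->handled++;
    return NULL;
}

/* Start n loops, post one event to each, and return how many events were
 * handled in total (or -1 if anything failed). */
int run_loops(int n)
{
    struct loop loops[MAX_LOOPS];
    pthread_t threads[MAX_LOOPS];
    int started = 0, total = 0, ok = 1;

    if (n < 1 || n > MAX_LOOPS)
        return -1;
    for (int i = 0; i < n; i++) {
        loops[i].epfd = loops[i].rd = loops[i].wr = -1;
        loops[i].handled = 0;
    }

    /* Give every loop its own pipe and epoll instance, then start its thread. */
    for (int i = 0; i < n && ok; i++) {
        struct loop *lp = &loops[i];
        struct epoll_event ev = { .events = EPOLLIN };
        int fds[2];

        if (pipe(fds) != 0) {
            ok = 0;
            break;
        }
        lp->rd = fds[0];
        lp->wr = fds[1];
        ev.data.fd = lp->rd;
        lp->epfd = epoll_create1(0);
        if (lp->epfd < 0 ||
            epoll_ctl(lp->epfd, EPOLL_CTL_ADD, lp->rd, &ev) != 0 ||
            pthread_create(&threads[i], NULL, loop_main, lp) != 0)
            ok = 0;
        else
            started++;
    }

    /* Post one event per running loop. Closing the write end afterwards
     * guarantees the loop wakes up even if the write itself failed. */
    for (int i = 0; i < started; i++) {
        if (write(loops[i].wr, "x", 1) != 1)
            ok = 0;
        close(loops[i].wr);
        loops[i].wr = -1;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += loops[i].handled;
    }

    /* Release whatever descriptors are still open. */
    for (int i = 0; i < n; i++) {
        if (loops[i].epfd >= 0) close(loops[i].epfd);
        if (loops[i].rd >= 0) close(loops[i].rd);
        if (loops[i].wr >= 0) close(loops[i].wr);
    }
    return ok ? total : -1;
}
```

Build with -pthread. In a real server each loop would watch many sockets and keep running; new connections would be handed to the loops round-robin or by hash, so that a given descriptor is only ever touched by one thread.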
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking by the io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you, too, depending on your threading situation.

How to measure the memory usage per active Apache Connection?

I would like to measure the memory consumption of one active Apache connection (= thread) under Ubuntu.
Is there a monitoring tool which is capable of doing this?
If not, does anyone know roughly how much memory an Apache connection needs?
Activate the mod_status module and you'll get a report on the /server-status page; there is a more parseable version at /server-status?auto. If you enable ExtendedStatus On you will get a lot of information on processes and threads.
This is the page used by monitoring tools to track a lot of stats parameters, so you will certainly find the one you need (edit: unless it is memory...). Be careful with the security/access settings of this page; it's a nice tool to check how your server responds to a DoS :-)
About memory, you must note that Apache loves memory; how much memory each process uses depends on a lot of things (number of modules loaded - check that you need all the ones you have, number of VirtualHosts, etc). But on a stable configuration it does not move a lot (unless you use PHP scripts with a high memory limit...). If you find memory leaks, try to limit the number of requests per process with MaxRequestsPerChild (MaxConnectionsPerChild in Apache 2.4); Apache will kill the process after that many requests and start a new one.
edit: in fact there is not a lot of memory info in the server-status. About monitoring tools, any tool using SNMP MIB-II can track memory usage per process, with average/top/low values for the different children (Cacti, Nagios, Munin, etc) if you have an snmpd daemon running. Check this excellent Munin example. It does not track each Apache child, but it will give you an idea of what you can track with these tools. If you do not need a complete monitoring system such as Nagios or Centreon, with alerts, user management and big networks (and if you do not have a lot of days for reading books), Munin is, IMHO, a nice tool to get monitoring reports quite fast.
I'm not sure if there are any tools for doing this. But you could estimate it yourself. Start Apache and check how much memory it uses without any sessions. Then create a large number of sessions and check again how much memory it uses.
You could use JMeter to create different workloads.
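The estimate described above boils down to one subtraction and one division. As a small sketch (the function name is made up, and it assumes the two memory readings were taken the same way, for example by summing the RSS of all Apache processes):

```c
/*
 * Rough per-connection memory estimate:
 *   (memory under load - memory when idle) / number of open connections.
 * Returns -1 if the inputs make no sense.
 */
long long bytes_per_connection(long long idle_bytes,
                               long long loaded_bytes,
                               long long connections)
{
    if (connections <= 0 || loaded_bytes < idle_bytes)
        return -1;
    return (loaded_bytes - idle_bytes) / connections;
}
```

For example, if Apache uses 200 MiB when idle and 700 MiB with 250 connections open, the estimate is 2 MiB per connection. Shared pages are counted once per process when summing RSS, so treat the result as an upper bound.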

Keeping a file in the OS block buffer

I need to keep as much as I can of a large file in the operating system's block cache, even though it is bigger than what fits in RAM, while I'm continuously reading another, very large file. At the moment, large chunks of the important file are evicted from the system cache when I stream-read from the other file.
In a POSIX system like Linux or Solaris, try using posix_fadvise.
On the streaming file, do something like this:
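/* fd is open for reading; buffer holds at least 64 KiB */
off_t current_pos = 0;
ssize_t bytes;
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
while ((bytes = pread(fd, buffer, 64 * 1024, current_pos)) > 0) {
    /* process the data, then advance by what was actually read */
    current_pos += bytes;
    /* drop the pages read so far so they don't evict the other file */
    posix_fadvise(fd, 0, current_pos, POSIX_FADV_DONTNEED);
}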
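You can also apply POSIX_FADV_WILLNEED to your other file. This asks the kernel to read it into the page cache ahead of time, though it is only a hint and not a guarantee that the pages stay resident.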
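For the file you want to keep cached, the call looks much the same. The wrapper below is only a sketch; the name prefetch_range is made up, and since the file is larger than RAM you would presumably call it on the region you care about most rather than on the whole file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_fadvise() */
#endif
#include <fcntl.h>
#include <sys/types.h>

/*
 * Ask the kernel to start reading [offset, offset + len) of fd into the
 * page cache. A len of 0 means "up to the end of the file".
 * Returns 0 on success or an errno value such as EBADF on failure
 * (posix_fadvise reports errors through its return value, not errno).
 */
int prefetch_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}
```

The advice is not sticky, so if the pages get evicted anyway the call has to be repeated, for instance every so often from the loop that streams the other file.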
Now, I know that Windows Vista and Server 2008 can also do nifty tricks with memory priorities. Probably older versions like XP can do more basic tricks as well. But I don't know the functions off the top of my head and don't have time to look them up.
On Linux, you can mount a filesystem of type tmpfs, which uses available swap space as backing if needed. You should be able to create a filesystem larger than your memory size, and it will prioritize the contents of that filesystem in the system cache.
mount -t tmpfs none /mnt/point
See: http://lxr.linux.no/linux/Documentation/filesystems/tmpfs.txt
You may also benefit from the swappiness and drop_caches files in /proc/sys/vm.
If you're using Windows, consider opening the file you're scanning through with the flag
FILE_FLAG_SEQUENTIAL_SCAN
You could also use
FILE_FLAG_NO_BUFFERING
for that file, but it imposes some restrictions on your read size and buffer alignment.
Some operating systems have ramdisks that you can use to set aside a segment of ram for storage and then mounting it as a file system.
What I don't understand, though, is why you want to keep the operating system from caching the file. Your full question doesn't really make sense to me.
Buy more RAM (it's relatively cheap!) or let the OS do its thing. I think you'll find that circumventing the OS is going to be more trouble than it's worth. The OS will cache as much of the file as it can, until your application or another one needs the memory.
I guess you could minimize the number of processes, but it's probably quicker to buy more memory.
mlock() and mlockall() respectively lock part or all of the calling process’s virtual address space into RAM, preventing that memory from being paged to the swap area.
(copied from the MLOCK(2) Linux man page)