Resource usage of a static web server - Apache

I came across this question in a blog post. It was asked by Mozilla in their internship interview. (Blog Post)
You are running a HTTP server (nginx, Apache, etc) that is configured
to serve static files off the local filesystem of your modern,
multi-core server connected to a gigabit network. A handful of clients
start requesting the same 4kb static file as fast as they can. What
system resource do you think will be exhausted first?
a. CPU
b. Disk / I/O
c. Memory
d. Network
e. Other
In my opinion, none of these would be exhausted on a modern machine with Nginx/Apache. Won't the web server cache such a small file and just keep serving it from memory? Also, for repeated requests it can easily reply with 304 Not Modified.
In the case of Apache, I'd guess that because it handles multiple clients by spawning threads, the CPU would be exhausted first, but for a "handful" of clients that won't matter.
I wanted to know what others have to say about this question.

It reeeeeeeeally depends. 4k is that magical size that will fit into virtually all caches and buffers at their default settings, so it is easy (and fast) to pass around. Memory is not a limiting factor here, as web servers operate on file handles, not entire files. In this case I would assume they keep the file right in memory, but that would be one copy per worker instance, which would usually come down to 4 kB * (num_cores + 1) at most, which is not really an issue.
One could argue that either memory speed or disk speed could be an issue. The former is negligible when mechanisms like sendfile are properly configured, enabling a zero-copy approach. The latter amortizes over time once a copy of the file has been loaded into the page cache.
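For illustration, the zero-copy path looks roughly like this on Linux (a minimal sketch around the sendfile(2) syscall; error handling is trimmed and the function name is made up):

    // Minimal sketch: hand a whole file from the page cache to a socket
    // without copying it through user-space buffers (Linux sendfile(2)).
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t serve_static(int client_sock, const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }

        off_t offset = 0;
        // The kernel pushes st.st_size bytes straight to the socket.
        ssize_t sent = sendfile(client_sock, fd, &offset, st.st_size);

        close(fd);
        return sent;  // bytes queued on the socket, or -1 on error
    }

This is essentially what nginx's "sendfile on;" directive and Apache's EnableSendfile setting switch on: the file bytes travel from the page cache to the NIC without a round trip through user space.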
Lastly, there's the interface and the CPU(s). Overall, CPU time tends to be a lot cheaper than network time, so I would expect the NIC to be the bottleneck long before the CPU - if at all.
The question is a bit unspecific on the location of the clients. If they are connected to the same GbE network, they could indeed have the power to saturate your NIC with their requests. If not, some intermediary could become the limiting factor.
Now let us assume those clients are on our network and we have a single-homed 10GbE NIC here, connected via 8 lanes (which is fairly standard IMHO): PCIe 3.0 x8 is specified at roughly 7,877 MB/s. A Core i7-3770 has a bus speed of 5 GT/s, which translates to roughly 8 GB/s over 8 lanes. Assuming no other I/O-intensive workload, this CPU could easily saturate the NIC.
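To make the PCIe and NIC numbers explicit (my own back-of-the-envelope arithmetic, assuming PCIe 3.0's 8 GT/s per lane with 128b/130b encoding and a 10 Gbit/s line rate for the NIC):

    // Rough bandwidth comparison: a PCIe 3.0 x8 slot vs. a 10GbE link.
    #include <cstdio>

    int main() {
        // 8 GT/s per lane with 128b/130b encoding -> usable Gbit/s per lane
        const double pcie3_lane_gbit = 8.0 * 128.0 / 130.0;       // ~7.88 Gbit/s
        const double pcie3_x8_GBps   = pcie3_lane_gbit * 8 / 8.0; // ~7.88 GB/s over 8 lanes
        const double ten_gbe_GBps    = 10.0 / 8.0;                // 1.25 GB/s line rate

        std::printf("PCIe 3.0 x8 ~ %.2f GB/s, 10GbE ~ %.2f GB/s\n",
                    pcie3_x8_GBps, ten_gbe_GBps);
        return 0;
    }

So the x8 slot has roughly six times the bandwidth the 10GbE link can actually carry, which is why the NIC rather than the bus or CPU is the expected choke point.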
So in summary: Network/NIC saturation before CPU saturation before anything else.

Related

H/W requirements for Confluent single node

What are the hardware requirements for a single-node Confluent installation?
I checked their official site, but it has specifications for multi-node: https://docs.confluent.io/platform/current/installation/system-requirements.html
Unclear what all you're trying to run. Zookeeper and Kafka alone have been made to run with limited resources on a Raspberry Pi and definitely can run on any modern laptop or computer.
If you're running a single node, it's not considered "production grade", so with at least 5 services (ZK, broker, Schema Registry, REST proxy, ksqlDB) at 2 GB max heap each, that'd require 10 GB of RAM plus overhead for the OS, so call it 16 GB of memory to be conservative.
If you also want to (reasonably) run Control Center, it's suggested to have 6 GB for that, increasing your memory requirements to at least 24 GB on a single node if you want to leave headroom for your own Kafka client applications.
Of course, you can opt out of certain services and tune each JVM to how you want...
As far as disk space goes, it really depends on how much data you plan on having. 500 GB would be a good starting point, but a single disk wouldn't be fault tolerant.

What would be a good hardware configuration for a Redis dedicated server?

I am planning to configure Redis in Master/Slave configuration.
I have got three machines (8 GB RAM, 8 cores) and am planning to use one master and two slaves.
What would be the recommended hardware configuration for these machines?
Redis is not CPU intensive, so you should get at least 2 cores per server (one for Redis, one for backups, maybe one more for basic housekeeping on the server?); more is not really relevant, since Redis is single-threaded.
Get as much RAM as you can, as it defines the size of your store. Also, making a dump consumes RAM, so your true usable size is less than you might think. Monitor your RAM usage to prevent surprises.
As for the RAM type: if RAM fails, Redis fails, sometimes silently (consistency broken). If you need to be careful with your data, always use ECC RAM; it is expensive, but perhaps less expensive than corrupted data in RAM being served by Redis with unknown effects. Redis has no built-in checks against hardware errors in RAM, and even though such errors are quite rare (less likely than a broken hard drive), they do happen.

How to avoid Boost ASIO reactor becoming constrained to a single core?

TL;DR: Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
I have a fairly massively parallel application, running on a hyperthreaded dual quad-core Xeon machine with tons of RAM and a fast SSD RAID. This is developed using boost::asio.
This application accepts connections from about 1,000 other machines, reads data, decodes a simple protocol, and shuffles data into files mapped using mmap(). The application also pre-fetches "future" mmap pages using madvise(MADV_WILLNEED) so it's unlikely to be blocking on page faults, but just to be sure, I've tried spawning up to 300 threads.
This is running on Linux kernel 2.6.32-27-generic (Ubuntu Server x64 LTS 10.04). Gcc version is 4.4.3 and boost::asio version is 1.40 (both are stock Ubuntu LTS).
Running vmstat, iostat and top, I see that disk throughput (both in TPS and data volume) is in the single digits of %. Similarly, the disk queue length is always a lot smaller than the number of threads, so I don't think I'm I/O bound. Also, the RSS climbs but then stabilizes at a few gigs (as expected) and vmstat shows no paging, so I imagine I'm not memory bound. CPU is constant at 0-1% user, 6-7% system and the rest idle. Clue! One full "core" (remember hyper-threading) is 6.25% of the CPU.
I know the system is falling behind, because the client machines block on TCP send when more than 64kB is outstanding, and report the fact; they all keep reporting this fact, and throughput to the system is much less than desired, intended, and theoretically possible.
My guess is I'm contending on a lock of some sort. I use an application-level lock to guard a look-up table that may be mutated, so I sharded this into 256 top-level locks/tables to break that dependency. However, that didn't seem to help at all.
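Roughly, the sharding looks like this (a simplified sketch, not the real code; actual key and value types differ, and C++11 is used here for brevity):

    // 256-way sharded look-up table: each shard has its own mutex so
    // unrelated keys do not contend on a single global lock.
    #include <array>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    class ShardedTable {
    public:
        void put(const std::string& key, int value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mtx);
            s.map[key] = value;
        }

        bool get(const std::string& key, int& value) {
            Shard& s = shard_for(key);
            std::lock_guard<std::mutex> lock(s.mtx);
            auto it = s.map.find(key);
            if (it == s.map.end()) return false;
            value = it->second;
            return true;
        }

    private:
        struct Shard {
            std::mutex mtx;
            std::unordered_map<std::string, int> map;
        };

        Shard& shard_for(const std::string& key) {
            // Low byte of the hash selects one of the 256 shards.
            return shards_[std::hash<std::string>{}(key) & 0xFF];
        }

        std::array<Shard, 256> shards_;
    };

Even so, if every worker thread still funnels through the same io_service, sharding the table only removes application-level contention, not contention inside the reactor.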
All threads go through one global io_service instance. Running strace on the application shows that it spends most of its time dealing with futex calls, which I imagine have to do with the event-based implementation of the io_service reactor.
Is it possible that I am reactor throughput limited? How would I tell? How expensive and scalable (across threads) is the implementation of the io_service?
EDIT: I didn't initially find this other thread because it used a set of tags that didn't overlap mine :-/ It is quite possible my problem is excessive locking used in the implementation of the boost::asio reactor. See C++ Socket Server - Unable to saturate CPU
However, the question remains: How can I prove this? And how can I fix it?
The answer is indeed that even the latest boost::asio only calls into the epoll file descriptor from a single thread, not entering the kernel from more than one thread at a time. I can kind of understand why, because thread safety and object lifetimes are extremely precarious when multiple threads can each get notifications for the same file descriptor. When I code this up myself (using pthreads), it works, and it scales beyond a single core. Not using boost::asio at that point - it's a shame that an otherwise well designed and portable library should have this limitation.
I believe that if you use multiple io_service objects (say, one per CPU core), each run by a single thread, you will not have this problem. See the HTTP server example 2 on the Boost ASIO page.
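The pattern from that example looks roughly like this (a minimal sketch; connection handling and error checks are omitted, and C++11 lambdas are used for brevity):

    // One io_service per core, each driven by exactly one thread, so the
    // reactor never hands the same epoll set to multiple threads.
    #include <boost/asio.hpp>
    #include <boost/thread.hpp>
    #include <memory>
    #include <vector>

    int main() {
        const std::size_t n = boost::thread::hardware_concurrency();

        std::vector<std::shared_ptr<boost::asio::io_service>> io_services;
        std::vector<std::shared_ptr<boost::asio::io_service::work>> keep_alive;
        boost::thread_group threads;

        for (std::size_t i = 0; i < n; ++i) {
            auto ios = std::make_shared<boost::asio::io_service>();
            // work objects stop run() from returning while the reactor is idle
            keep_alive.push_back(
                std::make_shared<boost::asio::io_service::work>(*ios));
            io_services.push_back(ios);
            threads.create_thread([ios] { ios->run(); });
        }

        // Accepted sockets would be dealt out to the io_services
        // round-robin here, each connection staying on its own reactor.

        threads.join_all();
        return 0;
    }

Each connection then lives entirely on one reactor, so no locking is needed between them.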
I have done various benchmarks against the server example 2 and server example 3 and have found that the implementation I mentioned works the best.
In my single-threaded application, I found out from profiling that a large portion of the processor instructions was spent on locking and unlocking in io_service::poll(). I disabled the lock operations with the BOOST_ASIO_DISABLE_THREADS macro. It may make sense for you too, depending on your threading situation.
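For completeness, the macro just has to be visible before Asio is included (or be passed as -DBOOST_ASIO_DISABLE_THREADS on the compiler command line); it is only safe when nothing in the process touches Asio from a second thread:

    // Strip Asio's internal locking; only valid for single-threaded use.
    #define BOOST_ASIO_DISABLE_THREADS
    #include <boost/asio.hpp>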

Virtual Processors and Logical Partitions

I basically wanted to know what exactly a virtual processor is. At IBM's site they define it as:
"A virtual processor is a representation of a physical processor core to the operating system of a logical partition that uses shared processors. "
I understand that if there are x processors, each of which can simultaneously perform two operations, then the system can perform 2x operations simultaneously. But where does the virtual processor fit into this? I also tried looking up the difference between a logical partition and other partitions, such as a primary partition, but wasn't really sure.
I'd like to draw an analogy between virtual memory and virtual processors.
Start with expectations:
A user program is written against a set of expectations about what the memory looks like (a nice flat, large, contiguous memory model is best...)
An OS is written against a set of expectations of how the hardware behaves (what CPU protection modes are available, how interrupts arrive and are blocked and handled, how to talk to I/O devices, etc...)
Realize that these expectations can be met directly by the hardware, or by an abstraction layer
Virtual memory is a set of (specialized, not found in simple chips) hardware tools and OS services that fool a user program into thinking it has that nice, flat, large, contiguous memory space, even while the OS is busily dividing the real memory into little pieces, storing some of them on disk, bringing others back, and otherwise making a real hash of it. But your code doesn't care. Everything just works.
A virtual processor system is a set of (specialized, not found in consumer CPUs) hardware tools and hypervisor services that allow your OS to believe it has direct access to one or more processors with the expected protection modes, interrupts, etc., even though the hypervisor is busily swapping whole OS contexts onto and off of one or more real processors, starting and stopping access to I/O buses, and so on and so forth. But the OS doesn't care. Everything just works.
The hardware support to do this has only recently started to be available in "desktop" CPUs, but Big Iron has had it for ages. It is useful for a couple of reasons:
Protection. In a properly protected OS, it is tough for one process or user to spy on another. But since they can be resident in the same context, it may still be possible. Virtualizing OSes separates them by another, even thinner, channel and makes it that much harder for data to leak or for malicious things to be done.
Robustness. If you can swap OS contexts in and out, you can migrate them from one machine to another, and checkpoint and restart them. That allows for computers that detect failures on their own processors and recover gracefully.
These are the things (aside from millions of LOC of heavily debugged, mission critical code) that have kept people paying for Big Iron.

Hardware requirements for a Virtual Server

We have decided to go with a virtualization solution for a few of our development servers. I have an idea of what the hardware specs would be like if we bought separate physical servers, but I have no idea how to consolidate that information into the specification for a generalized virtual server.
I know intuitively that the specs are not additive - I shouldn't just add up all the RAM requirements from each machine to get the RAM required for the virtual server. I can't really treat them as parallel systems either because no matter how good the virtualization software is, it can't abstract away two servers trying to peg the CPU at the same time.
So my question is - is there a standard method to estimating the hardware requirements for a virtualized system given hardware requirement estimations for the underlying virtual machines? Is there a +C constant for VMWare/MS Virtual Server overhead (and if so, what is C?)?
P.S. I promise to move this over to serverfault once it goes into beta (Promise kept)
Yes: add 25% additional resources to manage the VMs. So if I need 4 servers equivalent to single-core 2 GHz machines with 2 GB of RAM each, that is 8 GHz and 8 GB plus 25%, i.e. 10 GHz of processing power and 10 GB of RAM. This will allow all systems to redline and still be OK.
In the real world this will never happen, though; all your servers will not be running flat out all the time. You can get a feel for usage by profiling your current servers to determine their actual requirements, and then adding an additional 25% in resources.
Check out this software for profiling utilization http://confluence.atlassian.com/display/JIRA/Profiling+Memory+and+CPU+usage+with+YourKit
The requirements are in fact additive. You should add up the memory requirements for each VM, and the disk requirements, and have at least one processor core per VM. Then add on whatever you need for the host system.
VMs can share a CPU, to some extent, if you have really low performance requirements, but they cannot share disk space or memory.
The answers above are far too high; the second one (1 core per VM) is closer. You can either 1) plan ahead and probably over-purchase, or 2) add capacity just in time. Do you have some reason you must know well ahead (yearly budget? your chosen host platform doesn't cluster hosts, so you can't add later)?
Unless you have an incredibly simple usage profile, it will be hard to predict in advance and you'll over-purchase. The answer above (+25%) would be several times more than you need for modern server virtualization software (VMware, Xen, etc.) that manages resources smartly. It's accurate only for desktop products like VPC. I chose to rough it out on a napkin and profile my first environment (set of machines) on the host. I'm happy.
Examples of things that will confound your estimation:
- Disk space: some systems (Lab Manager) use only the difference in space from the base template, so 10 deployed machines with 10 GB drives use about 10 GB (template) + 200 MB.
- Disk space: you'll then find you don't like the deltas in specific scenarios.
- CPU / Memory: this is a dev shop, so you'll have erratic load. Smart hosts don't reserve memory and CPU.
- CPU / Memory: but then you'll want to do perf testing, and want to reserve CPU cycles (not all hosts can do that).
We all virtualize for different reasons. Many of the guests in our environment don't have much work. We want them there to see how something behaves with a cluster of 3 servers of type X. Or we have a bundle of weird client desktops waiting around, being used one at a time by a tester. They rarely consume many host resources.
So, if you are using something that doesn't do delta disks, disk space might be somewhat calculable. If you use Lab Manager (delta disks), disk space is really hard to predict.
Memory and processor usage: you'll have to profile or over-purchase heavily. I have many more guest CPUs than host CPUs, and don't have perf problems - but that's because of the choppy usage in our QA environments.