Apache and the c10k - apache

How is Apache in respect to handling the c10k problem under normal conditions ?
Say while running very small scripts with little data, or do I need to scale out if I use Apache?
In the background heavy lifting is done by a few servers running specialized software that processes the requests but I'd like to use Apache as a front. Is this a viable plan?

I consider Apache to be more of an origin server - running something like mod_php or mod_perl to generate the content and being smart about routing to the appropriate system.
If you are getting thousands of concurrent hits to the front of your site, with a mix of types of data (static and dynamic) being returned, you may find it useful to put a more optimised system in front of it though.
The classic post-optimisation problem with Apache isn't generating the dynamic content (or at least, that can be optimised for early in the process), but simply waiting for a slow client to be able to receive the bytes that are being sent. It can therefore be a significant advantage to put a reverse proxy, in the form of Squid or Nginx, in front of the servers to take over the 'spoon-feeding' of the slow network clients, while allowing the content production to happen at full speed, and at local network speeds - 100Mb/sec or even gigabit speeds - if it even has to traverse a network at all.

I'm assuming you've probably seen this data, but if not, it might give you some idea.

Guys, imagine that you are running web server with 10K connections (simultaneous). How could it be?
You've got many many connections per second
Dynamic content
Are you sure that your CPU can handle that many PHP sessions for example? I guess no, so why are you thinking about C10K problem? :D
Static content - small files
And still soo many connections? On single server? Probably you've got problems with networking/throughput too or you are future competitor of Google. Use lighttpd which addresses C10K problem and is stable - fly light. Using Apache for only static files for large sites is obvious.
Your clients are downloading large files for a large time - static content
ISO images, archives etc
If you are doing it via web server - FTP may be more appropriate.
Video streaming
Use lighttpd or specialized software. And still... What about other resources?
I am using Linux Virtual Server as load balancer in front of apache servers (with specific patches for LVS-NAT) and I am happy :) This string is an answer you want to hear.

Related

The Node.js event loop - nginx/apache

Both nginx and Node.js have event loops to handle requests. I put nginx in front of Node.js as has been recommended here
Using Node.js only vs. using Node.js with Apache/Nginx
with the setup shown here
Node.js + Nginx - What now?
How do the two event loops play together? Is there any risk of conflicts between the two? I wonder because Nginx may not be able to handle as many events per second as Node.js or vice versa. For example, if Nginx can handle 1000 events per second but node.js only 500, won't that cause issues? (I have no idea if 1000,500 are reasonable orders of magnitude, you could correct me on that.)
What about putting Apache in front of Node.js? Apache has no event loop. Just threads. So won't putting Apache in front of Node.js defeat the purpose?
In this 2010 talk, Node.js creator Ryan Dahl had vision to get rid of nginx/apache/whatever entirely and make node talk directly to the internet. When do you think this will be reality?
Both nginx and Node use an asynchronous and event-driven approach. The communication between them will go more or less like this:
nginx receives a request
nginx forwards the request to the Node process and immediately goes back to wait for more requests
Node receives the request from nginx
Node handles the request with minimal CPU usage, until at some point it needs to issue one or more I/O requests (read from a database, write the response, etc). At this point it launches all these I/O requests and goes back to wait for more requests.
The above can repeat lots of times. You could have hundreds of thousands of requests all in a non-blocking wait state where nginx is waiting for Node and Node is waiting for I/O. And while this happens both nginx and Node are ready to accept even more requests!
Eventually async I/O started by the Node process will complete and a callback function will get invoked.
If there are still I/O requests that haven't completed for this request, then Node goes back to its loop one more time. It can also happen that once an I/O operation completes this data is consumed by the Node callback and then new I/O needs to happen, so Node can start more async I/O requests before going back to the loop.
Eventually all I/O operations started by Node for a particular request will be complete, including those that write the response back to nginx. So Node ends this request, and then as always goes back to its loop.
nginx receives an event indicating that response data has arrived for a request, so it takes that data and writes it back to the client, once again in a non-blocking fashion. When the response has been written to the client and event will trigger and nginx will then end the request.
You are asking about what would happen if nginx and Node can handle a different number of maximum connections. They really don't have a maximum, the maximum in general comes from operating system configuration, for example from the maximum number of open handles the system can have at a time or the CPU throughput. So your question does not really apply. If the system is configured correctly and all processes are I/O bound, neither nginx or Node will ever block.
Putting Apache in front of Node will only work well if you can guarantee that your Apache never blocks (i.e it never reaches its maximum connection limit). This is hard/impossible to achieve for large number of connections, because Apache uses an individual process or thread for each connection. nginx and Node scale really well, Apache does not.
Running Node without another server in front works fine and it should be okay for small/medium load sites. The reason putting a web server in front of it is preferred is that web servers like nginx come with features that Node does not have and you would need to implement yourself. Things like caching, load balancing, running multiple apps from the same server, etc.
I think your questions have been largely covered by some of the others answers, but there are a few pieces missing, and some that I disagree with, so here are mine:
The event loops are isolated from each other at the process level, but do interact. The issues you're most likely to encounter are around the configuration of nginx response buffers, chunked data, etc. but this is optimisation rather than error resolution.
As you point out, if you use Apache you're nullifying the benefit of using Node.js, i.e. massive concurrency and websockets. I wouldn't recommend doing that.
People are already using Node.js at the front of their stack. Searching for benchmarks returns some reasonable-looking results in Node's favour, so performance to my mind isn't an issue. However, there are still reasons to put Nginx in front of Node.
Security - Node has been given increasing scrutiny, but it's still young. You may not have problems here, but caution is often your friend.
Training - Ops staff that you hire will know how to manage Nginx, but the configuration and management of your custom Node app will only ever be understood by those people your developers successfully communicate it to. In some companies this is nobody.
Operational Flexibility - If you reach scale you might want to split out the serving of static content, purely to reduce the load on your app servers. You might want to split content amongst different domains and have it managed separately, or have different SSL or proxying behaviour for different domains or URL patterns. These are the things that are easy for Ops guys to configure in Nginx, but you'd have to code manually in a Node app.
The event loops are independent. Event loops are implemented at the application level, so neither cares what sort of architecture the other uses.
NodeJS is good at many things, but there are some places where it still falters. Once example is serving static files. At the moment, nodejs performs fairly poorly in this test, so having a dedicated web server for your static files greatly improves response time. Also, nodejs is still in its infancy, and has not been "tested and hardened" in the matters of security like Apache on nginX.
It'll take a long time for people to consider fronting nodejs all by itself. The cluster module is a step in the right direction, but it'll take a long time even after it reaches v1 before it happens.
Both event loops are unrelated. They don't play together.
Yes, it is pretty useless. Apache is not a load balancer.
What Ryan Dahl said may be applicable already. The limit of concurrent users is definitely higher than that of Apache. Before node.js websites with fair amount of concurrent users had to use nginx to balance the load. For small to medium sized businesses it can be done with node.js alone. But ruling out nginx completely will take time. Let node.js be stable before it can follow this ambitious dream.

Low latency web server/load balancer for the non-Twitters of the world

Apache httpd has done me well over the years, just rock solid and highly performant in a legacy custom LAMP stack application I've been maintaining (read: trying to escape from)
My LAMP stack days are now numbered and am moving on to the wonderful world of polyglot:
1) Scala REST framework on Jetty 8 (on the fence between Spray & Scalatra)
2) Load balancer/Static file server: Apache Httpd, Nginx, or ?
3) MySQL via ScalaQuery
4) Client-side: jQuery, Backbone, 320 & up or Twitter Bootstrap
Option #2 is the focus of this question. The benchmarks I have seen indicate that Nginx, Lighthttpd, G-WAN (in particular) and friends blow away Apache in terms of performance, but this blowing away appears to manifest more in high-load scenarios where the web server is handling many simultaneous connections. Given that our server does max 100gb bandwidth per month and average load is around 0.10, the high-load scenario is clearly not at play.
Basically I need the connection to the application server (Jetty) and static file delivery by the web server to be both reliable and fast. Finally, the web server should double duty as a load balancer for the application server (SSL not required, server lives behind an ASA). I am not sure how fast Apache Httpd is compared to the alternatives, but it's proven, road warrior tested software.
So, if I roll with Nginx or other Apache alternative, will there be any difference whatsoever in terms of visible performance? I assume not, but in the interest of achieving near instant page loads, putting the question out there ;-)
if I roll with Nginx or other Apache alternative, will there be any difference whatsoever in terms of visible performance?
Yes, mostly in terms of latency.
According to Google (who might know a thing or tow about latency), latency is important both for the user experience, high search-engine rankings, and to survive high loads (success, script kiddies, real attacks, etc.).
But scaling on multicore and/or using less RAM and CPU resources cannot hurt - and that's the purpose of these Web server alternatives.
The benchmarks I have seen indicate that Nginx, Lighthttpd, G-WAN (in particular) and friends blow away Apache in terms of performance, but this blowing away appears to manifest more in high-load scenarios where the web server is handling many simultaneous connections
The benchmarks show that even at low numbers of clients, some servers are faster than others: here are compared Apache 2.4, Nginx, Lighttpd, Varnish, Litespeed, Cherokee and G-WAN.
Since this test has been made by someone independent from the authors of those servers, these tests (made with virtualization and 1,2,4,8 CPU Cores) have clear value.
There will be a massive difference. Nginx wipes the floor with Apache for anything over zero concurrent users. That's assuming you properly configure everything. Check out the following links for some help diving into it.
http://wiki.nginx.org/Main
http://michael.lustfield.net/content/dummies-guide-nginx
http://blog.martinfjordvald.com/2010/07/nginx-primer/
You'll see improvements in terms of requests/second but you'll also see significantly less RAM and CPU usage. One thing I like is the greater control over what's going on with a more simple configuration.
Apache made a claim that apache 2.4 will offer performance as good or better than nginx. They made a bold claim calling out nginx and when they made that release it kinda bit them in the ass. They're closer, sure, but nginx still wipes the floor in almost every single benchmark.

Load balancer - how to write one for a custom application?

I've written a simple server application which will run distributed on several machines.
My question is how does a network load balancer works, in general?
I've heard of round-robin and other algorithms, but what I haven't got answer to is how does the process really goes? In socket terms.
The client connects to one of the load balancer machines, asks for a "free-to-connect-to" server and simply connects to it?
That's the simpliest way I can think of.
.. or, does it use the load balancer as a proxy (that implies that all the NBs must be always connected to the application servers, and data is transferred through them)?
It's more of a general question. How would you do this?
Thank you all!
There are several different ways to load balance an application. Some are physical devices that sit between your router and the servers; some are software based with a bit of code that runs on each of the load balanced devices.
Microsoft has load balancing built into Windows which is all software based. It's pretty good and easy to set up.
However, I'll cover the physical route.
There are several algorithms here, but the main one is Round Robin with an option for "sticky" sessions. Sticky in this case means that the load balancer will try to keep a history of clients and forward requests from the same client to the same machine. This means the load balancer needs to keep a list of clients and where it directed those clients. Depending on cache size, clients may fall off the list and on future requests they may be forwarded to a different server.
Round Robin is a pretty simple idea. For each request that comes in send it to the next server in the list. More complicated algorithms might take into account how many requests go to a particular server and how long are those requests taking; then try to rebalance new requests to favor faster servers. This part is complicated though.

Apache resource usage vs Mongoose or other lightweight web server

How much memory and/or other resources does Apache web server use?
How much more are lightweight servers efficient?
Say appache vs. Mongoose Web Server
Neil Butterworth you out there?
Thanks.
Yes, lightweight servers are more efficient with memory and resources, as the term 'lightweight' would indicate. nginx is a popular one.
Apache's memory and resource usage depends a lot on what you're doing with it - which modules are loaded, what your PHP etc. scripts are doing. There's no single answer.
You have to take into account your specific task, and also the fact that almost every web server has some sort of specialization (a niche).
Apache is configurable and stable.
nginx is extremely fast, but works only with static context.
lighttpd is small, fast and does both static and dynamic context.
Mongoose is embeddable, small and easy to use.
There are many more web servers, I won't go through the whole list here. You need to decide which features do you require for your task, and make a choice accordingly.
Apache Httpd is great if you need lots of flexibility that is provided via various mods. If you're looking for straight-up file serving or proxying, then some lightweight options might be better. I manage the Maven Central repo that gets millions of hits a day and I have some experience with Nginx.

What perfromance improvement will I get by moving to lighttpd from Apache?

I currently have a cluster of 4 Apache web servers which are used to serve up static files of up to 30Mb in size. Generally, I can expect up to 5000 concurrent connections to these servers. What performance improvement would I expect to get by moving this to lighttpd?
I would expect it to handle the concurrency with much more ease and less memory overhead. I've stopped deploying Apache pretty much everywhere I can.
You may also consider nginx for a comparison.
If you are using Apache with MPM with worker or event you probably won't see much of a difference. If you haven't moved to using them I would give that a try. There isn't really any problem with lighttpd though either. I think today it is just a matter of picking one and going with it.
If I where serving that type of file I would push it out to a CDN and not have to worry about it. There are plenty of cheap ones now like CacheFly and Amazon's Cloudfront.
From the top of my head:
Smaller memory footprint
Quicker file reads
Definitely check out the benchmark at their site, they provide a lot of information on this topic: http://www.lighttpd.net/benchmark