Improving a Web App's Performance - apache

My web app, an exploded WAR, is hosted by Apache (static content) and Tomcat (dynamic content) via mod_jk. There's also an optional ActiveMQ component in this system, but it's currently not being used.
As I understand, each HTTP request will hit Apache. If it's a dynamic content request, Apache will forward the request to Tomcat via mod_jk. To fulfill this request, Tomcat will start a new thread to do the work.
I'm running the app on a 6-core, 12 GB RAM machine.
Besides using the ActiveMQ component, how can I improve my system's performance? Also, please correct me if I'm misstating how Apache and Tomcat communicate.
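For reference, the typical mod_jk wiring for a setup like this looks roughly as follows; the worker name, host, and URL pattern are illustrative, not taken from the question:

    # httpd.conf (Apache side)
    LoadModule jk_module modules/mod_jk.so
    JkWorkersFile conf/workers.properties
    # Send servlet requests to Tomcat; Apache serves everything else
    JkMount /app/* tomcat1

    # workers.properties (Tomcat side, AJP connector on its default port)
    worker.list=tomcat1
    worker.tomcat1.type=ajp13
    worker.tomcat1.host=localhost
    worker.tomcat1.port=8009

One small correction to the description above: Tomcat normally serves such a request on a thread taken from its connector's thread pool, rather than spawning a brand-new thread for every request.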

while (unhappyWithSitePerformance) {
    executeLoadTest();
    identifyBiggestBottleneck(); // e.g. what breaks first
    fixIdentifiedBottleneck();
}
There is no blanket silver bullet to offer. Make sure your load test simulates realistic user behaviour, and define the number of (virtual) users your server should handle within a given response time. Then tune your server until that goal is met (a quick way to get started is sketched after the list below).
Common parameters to look at are:
memory consumption
CPU consumption (e.g. certain expensive algorithms)
I/O saturation, e.g. communication with the database, or general HTTP traffic saturating the network adapter
database or backend response time; sometimes you'll have to tune the backend, not the web server itself
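As a starting point, a quick-and-dirty load test can be run with ApacheBench (ab), which ships with Apache httpd; the URL and the numbers below are placeholders to adapt to your own pages and targets:

    # 10,000 requests, 100 concurrent, with HTTP keep-alive
    ab -k -n 10000 -c 100 http://localhost/app/somepage

Compare the requests-per-second and percentile latency figures it reports against your target response time, then iterate as in the loop above.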

Related

NGINX as a Web Server + Load Balancer with Caching Enabled

We currently run a SaaS application on Apache that serves ecommerce websites (it's a store builder). We host over 1000 clients on that application and are now running into scalability issues (CPU going over 90% even on a fairly large server with 20 cores, 80 GB RAM, and all-SSD disks).
We're looking for help from an nginx expert who can:
1. Explain the difference between running nginx as a web server vs. using it as a reverse proxy. What are the benefits?
2. We also want to use nginx as a load balancer (and have that already set up in testing), but we haven't enabled caching on the load balancer. So while it's helping distribute requests, it's not really serving any traffic directly; it simply passes everything through to one of the two Apache servers.
The question is: we have a lot of user-generated content coming from the Apache servers, so how do we invalidate the cache for only certain pages that nginx has cached? If we set up a cron job to clear this cache every minute or so, it wouldn't be that useful, as the cache would then be virtually nonexistent.
--
We also need an overall word on the best architecture to build for, given the above scenarios.
Is it:
NGINX Load Balancer + Caching ==> Nginx Web Server
NGINX Load Balancer ==> Nginx Web Server + Caching
NGINX Load Balancer + Caching ==> Apache Web Server
NGINX Load Balancer ==> Apache Web Server (unlikely)
Please help!
Scaling horizontally to support more clients is a good option. It's recommended to first evaluate what is causing the bottleneck: memory pressure within the application, long-running requests, etc.
Nginx vs. other web servers: nginx is an HTTP server, not a servlet engine. Given that, check whether it fits your needs.
It is a fast web server, and you need to weigh the benefits of using it as a single standalone web server against other web servers; speed and memory footprint are where it tends to win.
Nginx as a load balancer:
You can have multiple web server instances behind nginx.
It supports load-balancing algorithms such as round robin and weighted distribution, so load can be spread based on resource availability.
It lets you terminate SSL at nginx, filter requests, modify headers, compress responses, upgrade the application without downtime, serve cached content, and so on. This frees up resources on the server running the application, and gives you a clean separation of concerns.
This setup is a reverse proxy, with all the benefits that come with one; a minimal sketch follows.
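Assuming two backend Apache servers at made-up addresses, the reverse-proxy/load-balancer part looks roughly like this:

    upstream backends {
        # round robin by default; weight=N gives weighted distribution
        server 10.0.0.11:80;
        server 10.0.0.12:80 weight=2;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://backends;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }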
You can handle cache expiry with nginx; the nginx documentation has good details: http://nginx.com/resources/admin-guide/caching/
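On the cache-invalidation question above, one common approach is to cache upstream responses but let the application bypass or refresh individual pages instead of wiping the whole cache from cron. This is a sketch; the paths, zone name, and refresh header are made up, and true purging (proxy_cache_purge) requires NGINX Plus or a third-party module:

    proxy_cache_path /var/cache/nginx keys_zone=appcache:50m max_size=5g inactive=60m;

    server {
        listen 80;
        location / {
            proxy_pass http://backends;   # upstream from the previous sketch
            proxy_cache appcache;
            proxy_cache_key $scheme$host$request_uri;
            proxy_cache_valid 200 10m;
            # A request carrying this (secret) header skips the cache lookup;
            # the fresh response then replaces the stale cached copy:
            proxy_cache_bypass $http_x_refresh_cache;
        }
    }

Alternatively, have Apache send short Cache-Control lifetimes on user-generated pages; nginx honors upstream expiry headers by default, so frequently changing pages simply expire quickly.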

Is the RTC proxy server read-only?

In RTC, for the global software development scenario, there is the concept of caching proxies.
As I understand it, this is only a read-only proxy that helps when loading a component at the remote location (the SCM part).
For all commit and deliver actions, the changes are transmitted to the central server directly over the WAN, so these actions do not benefit from the proxy. Is this understanding correct?
Or does the proxy also help improve the performance of deliver/commit actions from the remote location?
Caching proxies are mentioned in:
"Does Rational Team Concert support MultiSite? "
"Using content caching proxies for Jazz Source Control"
We realize that there will still be cases where a WAN connection will not meet the 200ms guidance. In this case, we’ve leveraged the Web architecture of RTC to allow caching proxies to be used.
Based on standard Web caching technology (IBM HTTP Server, Apache HTTP Server, or Squid), a cache can be deployed in the location that has a poor connection to the central server.
This caching proxy will cache SCM contents that are fetched from the server, greatly improving access times for the RTC client, and reducing traffic on the network.
So in the case of RTC, this is targeted more at speeding up "Load" and "Accept" operations, rather than "Commit" and "Deliver".
If multiple developers all load from a specific stream, a caching proxy will help reduce the network traffic.
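As an illustration of such a caching proxy (not an official RTC configuration), here is a minimal Squid reverse-proxy sketch for the remote site; the hostname, ports, and sizes are placeholders, and directive spellings vary a little between Squid versions:

    # squid.conf: remote-site cache in front of the central Jazz server
    http_port 3128 accel defaultsite=jazz.example.com
    cache_peer jazz.example.com parent 443 0 no-query originserver ssl
    cache_dir ufs /var/spool/squid 20000 16 256
    # SCM content blobs can be large; allow them to be cached
    maximum_object_size 512 MB

Only fetches (the Load and Accept traffic) are served from this cache; Commit and Deliver still travel straight to the central server over the WAN, as the answer above explains.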

Should I run Tomcat by itself or Apache + Tomcat?

I was wondering if it's okay to run Tomcat as both the web server and the servlet container? On the other hand, it seems that the right way to scale a webapp is to have Apache HTTP Server listening on port 80 and connect it to Tomcat listening on another port?
Are both ways acceptable? What is being used nowadays? What's the main difference? How do most major websites go about this?
Thanks.
Placing an Apache (or any other webserver) in front of your application server(s) (Tomcat) is a good thing for a number of reasons.
First consideration is about static resources and caching.
Tomcat will probably also serve a lot of static content, and even for dynamic content it will send caching directives to browsers. However, each browser that hits your Tomcat for the first time will make Tomcat send the static file itself. Since processing a request is a bit more expensive in Tomcat than in Apache (Apache is super-optimized and exploits low-level facilities not always available to Tomcat, and Tomcat extracts much more information from the request than Apache needs to), it is better for static files to be served by Apache.
However, since configuring Apache to serve part of the content and Tomcat the rest of the URL space is a daunting task, it is usually easier to have Tomcat serve everything with the right cache headers and put Apache in front of it, capturing the content, serving it to the requesting browser, and caching it, so that other browsers hitting the same file get served directly by Apache without even disturbing Tomcat.
Besides static files, much dynamic content also doesn't need to be regenerated every millisecond. For example, a JSON document loaded by the homepage that tells the user how much stuff is in your database is an expensive query performed thousands of times, yet it can safely be refreshed every hour or so without making your users angry. So Tomcat can serve the JSON with a proper one-hour caching directive, and Apache will cache the JSON fragment and serve it to any browser requesting it for one hour. There are obviously plenty of other ways to implement this (a caching filter, a JPA cache for the query, etc.), but sending proper cache headers and using Apache as a reverse proxy is quite easy, REST-compliant, and scales well; a sketch follows.
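A sketch of that Apache-in-front arrangement using mod_proxy together with mod_cache (Apache 2.2-era module names; Tomcat's port and the cache path are illustrative):

    # Forward everything to Tomcat's HTTP connector
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/

    # Cache whatever Tomcat marks cacheable via its response headers
    # (needs mod_cache plus mod_disk_cache on 2.2 / mod_cache_disk on 2.4)
    CacheEnable disk /
    CacheRoot /var/cache/apache2/mod_cache

If Tomcat sends the JSON with Cache-Control: max-age=3600, Apache serves it from its own cache for the next hour without touching Tomcat.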
Another consideration is load balancing. Apache comes with a nice load-balancing module that can help you scale your application across a number of Tomcat instances, provided that your application can scale horizontally or run on a cluster; see the sketch below.
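The module in question is mod_proxy_balancer; a minimal sketch with two Tomcat instances reached over AJP (addresses and route names are made up):

    <Proxy balancer://tomcatcluster>
        BalancerMember ajp://10.0.0.21:8009 route=tomcat1
        BalancerMember ajp://10.0.0.22:8009 route=tomcat2
    </Proxy>
    # stickysession pins each user's session to the Tomcat that created it
    ProxyPass / balancer://tomcatcluster/ stickysession=JSESSIONID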
A third consideration is URLs, headers, and the like. From time to time you may need to change some URLs, or remove or override some headers. For example, before a major update you may want to disable caching in browsers for a few hours so they don't keep using stale data (much like lowering the DNS TTL before switching servers), or move the old application to another URL space, or rewrite old URLs to new ones where possible. While reconfiguring the servlets inside your web.xml is possible, and filters can do wonders, if you are using a framework that interprets the URLs you may need to do a lot of work on your sitemap files or similar.
Having Apache or another web server in front of Tomcat can make such changes a matter of editing only the Apache configuration, with modules like mod_rewrite.
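For instance, a couple of hedged examples (the paths are made up; the Header directive needs mod_headers):

    RewriteEngine On
    # Permanently redirect the old URL space to the new one
    RewriteRule ^/oldapp/(.*)$ /newapp/$1 [R=301,L]

    # Before a major update: tell browsers not to reuse stale copies
    Header set Cache-Control "no-cache, no-store, must-revalidate"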
So, I always recommend putting Apache httpd in front of Tomcat. The small overhead in connection handling is usually recovered through resource caching, and the additional configuration work pays for itself the first time you need to move URLs or fiddle with headers.
It depends on your network and how you wish to have security set up.
If you have a two-firewall DMZ, with applications deployed inside the second firewall, it makes sense to have an Apache or IIS instance in between the two firewalls to handle security and proxy calls into the app server. If it's acceptable to put the Tomcat instance in the DMZ you're free to do so. The only downside that I see is that you'll have to open a port in the second firewall to access a database inside. That might put the database at risk.
Another consideration is traffic. You don't say anything about traffic, sizing servers, and possible load balancing and clustering. A load balancer in front of a cluster of app servers is more likely to be kept inside the second firewall. The Tomcat instance is capable of handling traffic on its own, but there are always volume limitations depending on the hardware it's deployed on and what the application is doing with each request. It's almost impossible to give a yes or no answer without more detailed, application-specific information.
Search the site for "tomcat without apache" - it's been asked before. I voted to close before finding duplicates.

What's the most scalable and high performing Amazon Web Service (AWS) configuration for a RESTful web service?

I'm building an asynchronous RESTful web service and I'm trying to figure out what the most scalable and high performing solution is. Originally, I planned to use the FriendFeed configuration, using one machine running nginx to host static content, act as a load balancer, and act as a reverse proxy to four machines running the Tornado web server for dynamic content. It's recommended to run nginx on a quad-core machine and each Tornado server on a single core machine. Amazon Web Services (AWS) seems to be the most economical and flexible hosting provider, so here are my questions:
1a.) On AWS, I can only find c1.medium (dual core CPU and 1.7 GB memory) instance types. So does this mean I should have one nginx instance running on c1.medium and two Tornado servers on m1.small (single core CPU and 1.7 GB memory) instances?
1b.) If I needed to scale up, how would I chain these three instances to another three instances in the same configuration?
2a.) It makes more sense to host static content in an S3 bucket. Would nginx still be hosting these files?
2b.) If not, would performance suffer from not having nginx host them?
2c.) If nginx won't be hosting the static content, it's really only acting as a load balancer. There's a great paper here that compares the performance of different cloud configurations, and says this about load balancers: "Both HaProxy and Nginx forward traffic at layer 7, so they are less scalable because of SSL termination and SSL renegotiation. In comparison, Rock forwards traffic at layer 4 without the SSL processing overhead." Would you recommend replacing nginx as a load balancer by one that operates on layer 4, or is Amazon's Elastic Load Balancer sufficiently high performing?
1a) nginx is an asynchronous (event-based) server; even with a single worker it can handle lots of simultaneous connections (max_clients = worker_processes * worker_connections / 4; see the nginx docs) and still perform well. I myself have tested around 20K simultaneous connections on a c1.medium-class box (not on AWS). Here you'd set workers to two (one per CPU) and run four backends (you could even test with more to see where it breaks). Only if this gives you trouble should you add another identical setup and chain the two via an Elastic Load Balancer.
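A sketch of the nginx side of that layout on a dual-core frontend with four Tornado backends (addresses and ports are made up):

    worker_processes 2;              # one per CPU core

    events {
        worker_connections 4096;     # per worker
    }

    http {
        upstream tornado {
            server 10.0.0.1:8001;
            server 10.0.0.2:8001;
            server 10.0.0.3:8001;
            server 10.0.0.4:8001;
        }
        server {
            listen 80;
            location / {
                proxy_pass http://tornado;
            }
        }
    }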
1b) As said in (1a), use an Elastic Load Balancer. Somebody tested ELB at 20K reqs/sec, and that's not its limit; he gave up because they lost interest in pushing further.
2a) Host static content on CloudFront; it's a CDN meant for exactly this (cheaper and faster than S3, and it can pull content from an S3 bucket or your own server). It's highly scalable.
2b) With nginx serving the static files itself, it would have to handle more requests for the same number of users. Taking that load away reduces the work of accepting connections and sending the files across (less bandwidth usage, too).
2c) Avoiding nginx altogether looks like a good solution (one less middleman). An Elastic Load Balancer will handle SSL termination and reduce the SSL load on your backend servers (which improves backend performance). The experiments above showed around 20K reqs/sec, and since it's elastic it should stretch further than a software LB (see this nice document on how it works).

apache + lighttpd front-proxy concept

In order to lighten Apache's load people often suggest using lighttpd to serve up static content.
e.g. http://www.linux.com/feature/51673
In this setup Apache passes requests for static content back to lighttpd via mod_proxy, while serving dynamic requests itself.
My question is: how does this reduce the load on the server? Since you still have an Apache process spawned for every request that comes in, how does this positively impact the load? From what I can see, the Apache process proxying a request through lighttpd is as large as it would be if it were serving the file itself.
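For concreteness, the article's setup boils down to something like this on the Apache side, with lighttpd listening on a local port (the port and path are made up):

    # Hand image/CSS requests to a local lighttpd on port 8081
    ProxyPass        /static/ http://127.0.0.1:8081/static/
    ProxyPassReverse /static/ http://127.0.0.1:8081/static/

which is exactly why the question is fair: Apache still accepts, parses, and proxies every static request.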
Running lighttpd behind Apache to serve static files certainly seems braindead to me. Apache still has to unpack the HTTP packets and parse the request through its parse tree, then send the proxy request, and then lighttpd has to unpack it again, hit the filesystem, and send the file back through Apache. I've never heard of anyone using a setup like this in production.
What you will see is people using a lightweight web server like Nginx as a frontend to serve static files and proxy dynamic URLs to Apache. Or you can run Varnish or Squid as a caching reverse-proxy frontend, so that all your high-traffic static files (images, CSS, etc., plus any dynamic pages you're willing to send cache-friendly headers for) are served out of memory.
Apache can also be optimized to serve static files, so often when I hear people complain about Apache, they really don't know how to configure it. They've only ever used the prefork MPM (vs. threaded worker) and have all sorts of modules enabled (usually they're running a Linux distribution's kitchen-sink Apache package that builds everything as modules and enables 10-20 of them or more by default). Tune Apache by first turning off unneeded modules and expensive features like .htaccess support (which makes Apache scan the filesystem on every request!), as sketched below. (You can also run two instances of Apache, with a "light" Apache as a frontend that proxies to a "heavy" Apache for dynamic requests; maybe your frontend is threaded but your backend is prefork because you have to run thread-unsafe external modules like mod_php.)
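A few of those tuning knobs as they would appear in the configuration; this is a sketch of the ideas above, not a complete tuned config:

    # Stop the per-request .htaccess filesystem scans
    <Directory "/var/www">
        AllowOverride None
    </Directory>

    # Make sure per-request DNS lookups stay off
    HostnameLookups Off

    # Comment out modules you don't actually use, e.g.:
    # LoadModule php5_module modules/libphp5.so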
Re:
"Since you still have an apache process spawned for every request that comes in, how does this positively impact the load? From what I can see the size of the Apache process proxying its request through lighttpd is as large as it would be if it were serving the file itself."
If you're spawning processes on every request, then you're using the prefork MPM. Keep in mind that when the OS reports memory usage for each of these processes, not all of that memory is wired; a lot of those processes are idle. And when you're talking about speed, you're concerned more with request parsing and internal code branches for a given request (how much processing is the server doing?) than with the memory usage reported by the OS.
For example, if you enable something like mod_php, then each of those worker processes instantly grows by about 20-40 MB (depending on what's enabled in your PHP interpreter), but that doesn't mean Apache is using that memory on static requests. Of course, if you're optimizing your server for maximum concurrency on small static files, then enabling mod_php is still very bad: you won't be able to fit nearly as many prefork processes into RAM.
I probably could come up with a "nightmare configuration" for Apache that would make it actually slower serving static files than proxying those requests to a backend Lighttpd, but it would involve enabling expensive features like .htaccess in Apache that are disabled in Lighttpd, so it wouldn't really be fair.
If you still have the capacity to serve static and dynamic content from the same machine (as they do in the referenced article), then I really see no point in that setup.
Maybe it does reduce the load on Apache, because Apache no longer has to do disk I/O, but it increases the load of lighttpd on the same machine, thereby reducing the headroom left for Apache...
Maybe lighttpd's I/O handling is lighter than that of Apache 1.3, but why not just switch to Apache 2 or lighttpd completely? And if performance really starts to suck, host the static files on another machine (media.yourdomain.com).
A small introduction to building a performant setup can be found here:
Deploying Django -> scroll to "Scaling", a few pages before the end
I don't know much about the internal workings of Apache, but one explanation I've seen concerns memory pressure. In short, Apache tries to balance the memory it uses for caching and for dynamic pages, but usually ends up with too much cache and too little for apps. If you separate them into different processes, each one will optimize for its kind of load.
Currently, what I'm doing is using nginx as the front end. It's really fast and light, and specifically designed as a frontend proxy, but it also serves static files. In fact, since it can also call FastCGI processes, you could get rid of Apache entirely and still get the benefits of split file/app processes. (And there's some extra memcached magic that looks absolutely genius.)
(Yes, lighttpd can also be used as frontend to Apache and/or FastCGI)
You don't have an Apache process spawned for each request - static files (images and the like) are fetched directly by lighttpd.
Use the Apache worker MPM with FastCGI; this will lower your server's memory usage. The worker MPM serves static content better than prefork and is nearly on par with lighttpd when it comes to static content.
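Worker MPM settings of the kind implied here (Apache 2.2-style directives; the numbers are the stock defaults, starting points to tune rather than recommendations):

    <IfModule mpm_worker_module>
        StartServers          2
        MaxClients          150
        MinSpareThreads      25
        MaxSpareThreads      75
        ThreadsPerChild      25
        MaxRequestsPerChild   0
    </IfModule>

PHP would then run out of process via FastCGI (e.g. mod_fcgid), since mod_php is not thread-safe under the worker MPM.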