Varnish x Apache CDN & HTTPS

I'm currently trying to set up a DIY CDN using Varnish, Nginx, & Apache.
This is the setup I have planned.
The following assumes:
1. The Varnish Origin server is on the same server as the web server (Apache in this case)
2. You have 2 Varnish Cache Servers located in different countries, one in NA, & one in EU
Example of an NA client attempting to retrieve data:
NA Client --> Varnish Origin Server --> NA Varnish Cache Server --> If result NOT IN cache --> Return Apache Origin Server results --> Input request data into NA Varnish Cache Server
Example of an EU client attempting to retrieve data:
EU Client --> Varnish Origin Server --> EU Varnish Cache Server --> If result IN cache --> Return EU Varnish Cache results
Any suggestions and/or mistakes? Where would I insert Nginx/HAProxy in order to terminate SSL, since Varnish doesn't accept HTTPS?

What you're suggesting is perfectly possible and has become an increasingly popular use case for us at Varnish Software.
Geolocation
First things first: I'm assuming all users, regardless of their location, will use the same hostname to connect to the CDN. Let's say the hostname is www.example.com.
US users should automatically connect to a Varnish edge node in the US; EU users should be directed to an edge node in the EU.
This geographical targeting requires some sort of GeoDNS approach. If your DNS provider can do this for you, things will be a lot easier.
If not, I know for a fact that AWS Route 53 does this. If you want to use open source technology, you can host the DNS zone yourself using https://github.com/abh/geodns/.
So if a US user does a DNS lookup on www.example.com, it should resolve to us.example.com. For EU users this will be eu.example.com.
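Conceptually, the GeoDNS layer just decides which regional answer to hand out for www.example.com; an ordinary zone could not serve both answers at once, so read the sketch below as the two per-region answers rather than one zone file (names, TTLs, and addresses are placeholders):

; answer handed to clients geolocated in NA
www.example.com.   60  IN  CNAME  us.example.com.
; answer handed to clients geolocated in EU
www.example.com.   60  IN  CNAME  eu.example.com.
; the regional names themselves are ordinary A records
us.example.com.   300  IN  A      192.0.2.10      ; NA Varnish edge node
eu.example.com.   300  IN  A      198.51.100.10   ; EU Varnish edge node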
Topology
Your setup connects from a local Varnish server to a remote Varnish server. That is one hop too many. If the geolocation works properly, you'll end up directly on the Varnish server that is closest to your user.
We call these geolocated servers "edge nodes". They will connect back to the origin server(s) in case requested content is not available in cache.
It's up to you to decide if one origin Apache will do, or if you want to duplicate your Apache servers in the different geographical regions.
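Assuming a single Apache origin reachable as origin.example.com (the hostname and the health-check URL below are placeholders), a minimal sketch of the edge VCL could look like this:

vcl 4.1;

# Edge node sketch: every cache miss is fetched from the single Apache origin.
backend origin {
    .host = "origin.example.com";   # placeholder origin hostname
    .port = "80";
    .probe = {
        .url = "/healthcheck";      # hypothetical health-check endpoint
        .interval = 5s;
        .timeout = 2s;
        .window = 5;
        .threshold = 3;
    }
}

sub vcl_recv {
    set req.backend_hint = origin;
}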
SSL/TLS
My advice in terms of SSL/TLS termination: Use Hitch. It's a dedicated TLS Proxy that was developed by Varnish, to use with Varnish. It's open source.
You can install Hitch on each Varnish server and accept HTTPS there. The connection between Hitch and Varnish can be done over Unix Domain Sockets, which further reduces latency.
Our tests show you can easily process 100 Gbps on a single server using terminated TLS with Hitch.
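As a rough sketch of that pairing on each edge node (paths are placeholders, and the exact Unix-domain-socket syntax should be verified against the hitch.conf(5) and varnishd(1) documentation):

# /etc/hitch/hitch.conf -- TLS termination in front of Varnish
frontend = "[*]:443"
pem-file = "/etc/hitch/example.com.pem"   # combined key + certificate, placeholder path
backend = "/var/run/varnish.sock"         # hand off to Varnish over a Unix domain socket
write-proxy-v2 = on                       # pass the client IP via the PROXY protocol

# varnishd started with a UDS listener for Hitch plus a plain HTTP listener:
# varnishd -a /var/run/varnish.sock,PROXY -a :80 -f /etc/varnish/default.vcl -s malloc,4G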
Single tier Varnish or multi-tier Varnish
If your CDN requires a lot of storage, I'd advise you to set up a multi-tier Varnish configuration in each geographical location:
The edge tier will be RAM heavy and will cache the hot content in memory using the malloc stevedore in Varnish
The storage tier will be disk heavy and will cache long tail content on disk using the file stevedore in Varnish
Although the file stevedore is capable of caching terabytes of data, it is quite prone to disk fragmentation, which at very large scale will slow you down in the long run.
If you have tiered Varnish servers, you can tune each tier to its needs. Combined, the results will be quite good: although the file stevedore has its limitations, it will still be a lot faster than constantly accessing the origin when the cache of the edge servers is full.
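In terms of startup flags, the split could look roughly like this (cache sizes and paths are placeholders to adjust to your hardware):

# Edge tier: hot content in RAM (malloc stevedore)
varnishd -a :80 -f /etc/varnish/edge.vcl -s malloc,64G

# Storage tier: long-tail content on disk (file stevedore)
varnishd -a :80 -f /etc/varnish/storage.vcl -s file,/var/lib/varnish/storage.bin,2T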
Varnish Software's DIY CDN solution
Varnish Software, the company behind the Varnish Cache project, has done many CDN integration projects for some of the world's biggest web platforms.
Varnish Cache, the open source project, is the foundation of these CDN solutions. However, typical CDN clients have some extra requirements that are not covered by the open source solution.
That's why we developed Varnish Enterprise, to tackle these limitations.
Have a look at Varnish Software's DIY CDN solution to learn more. Please also have a look at the docs containing the extra features of the product.
If you want to play around with these features without buying a license up front, you can try the Varnish Enterprise images in the cloud.
We have an AWS image available on the AWS marketplace.
We have an Azure image available on the Azure marketplace.
We have a GCP image available on the GCP marketplace.
Our most significant CDN feature in Varnish Enterprise is the Massive Storage Engine. It was specifically built to counter the limitations of the file stevedore, which is prone to disk fragmentation and is not persistent.
There's a lot of other cool stuff in Varnish Enterprise for CDN as well, but you'll find that on the docs pages I referred to.

Related

1 server, multiple cPanel installs: how does Apache work?

I want to ask something about a dedicated server.
I have a dedicated server and a cPanel website with a heavy load. When I check the server load, none of the parameters go above 60% usage, but the Apache workload is high.
So I wonder if I can do this:
I buy a dedicated server (DS) and install 2 cPanel instances on the same DS. I know that cPanel needs an IP to bind the license, so I add 1 additional IP to my DS.
What I am trying to achieve here is to split the workload of the same website, and to split the traffic I use the load balancer from CF.
So I have abc.com with 2 different IPs and use the load balancer to split the load.
Here is why I need to do this:
Server load relatively low (under 80%)
Apache load relatively high, 3-10 req/s
There is a problem in your problem definition
What do you mean by "Apache work"?
If you want more threads and processes of Apache httpd on the same server, you don't need to install two cPanel instances; you can tune your Apache httpd worker configuration for better performance and resource utilization, as in the sketch below.
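For example, with the event MPM the directives that usually matter look like this (the file location depends on the distribution, and the numbers are only illustrative starting points, not recommendations):

<IfModule mpm_event_module>
    StartServers              4
    ServerLimit              16
    ThreadsPerChild          25
    MaxRequestWorkers       400      # ServerLimit * ThreadsPerChild
    MaxConnectionsPerChild 10000     # recycle processes periodically
</IfModule>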
You can even use the LiteSpeed or nginx web servers on cPanel.

NGINX as a Web Server + Load Balancer with Caching Enabled

We currently run a SaaS application on Apache which serves ecommerce websites (it's a store builder). We currently host over 1000 clients on that application and are now running into scalability issues (CPU going over 90% even on a fairly large server with 20 cores, 80 GB of RAM, and all-SSD disks).
We're looking for help from an nginx expert who can:
1. Explain the difference between running nginx as a web server vs. using it as a reverse proxy. What are the benefits?
2. We also want to use nginx as a load balancer (and already have that set up in testing), but we haven't enabled caching on the load balancer. So while it's helping redirect requests, it's not really serving any traffic directly; it simply passes everything through to one of the two Apache servers.
The question is: we have a lot of user-generated content coming from the Apache servers, so how do we invalidate the cache for only certain pages that are being cached by nginx? If we set up a cron job to clear this cache every minute or so, it wouldn't be very useful, as the cache would then be virtually nonexistent.
--
Also, we need an overall word on the best architecture to build for the scenarios above.
Is it
NGINX Load Balancer + Caching ==> Nginx Web Server
NGINX Load Balancer ==> Nginx Web Server + Caching
NGINX Load Balancer + Caching ==> Apache Web Server
NGINX Load Balancer ==> Apache Web Server (unlikely)
Please help!
Scaling horizontally to support more clients is a good option. It's recommended to first evaluate what is causing the bottleneck: memory within the application, long-running requests, etc.
Nginx vs. other web servers: nginx is an HTTP server, not a servlet engine. Given that, you can check whether it fits your needs.
It is a fast web server. You need to evaluate the benefits of using it as a single standalone web server against other web servers. Speed and memory footprint could help.
Nginx as a load balancer:
You can have multiple web server instances behind nginx.
It supports load balancing algorithms such as round robin and weighted round robin, so the load can be distributed based on resource availability.
It lets you terminate SSL at nginx, filter requests, modify headers, compress responses, upgrade the application without downtime, serve cached content, and so on. This frees up resources on the server running the application, and gives you separation of concerns.
This setup is a reverse proxy, with all the benefits that come with it.
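A minimal sketch of that reverse-proxy/load-balancer block, assuming the two Apache backends sit at placeholder addresses 10.0.0.11 and 10.0.0.12 (the server name, weights, and certificate paths are placeholders too):

upstream apache_backends {
    server 10.0.0.11:80 weight=2;   # weighted round robin
    server 10.0.0.12:80 weight=1;
}

server {
    listen 443 ssl;
    server_name shop.example.com;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;
    ssl_certificate_key /etc/nginx/tls/privkey.pem;

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://apache_backends;
    }
}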
You can handle cache expiry with nginx. The nginx documentation has good details: http://nginx.com/resources/admin-guide/caching/
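On the invalidation question: open-source nginx has no built-in purge command (targeted purging is part of NGINX Plus or third-party modules such as ngx_cache_purge), but short cache validities combined with a bypass header that the application sets on pages it knows just changed get you most of the way. A hedged sketch, reusing the upstream above (zone name, sizes, and times are placeholders):

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:100m
                 max_size=10g inactive=60m use_temp_path=off;

server {
    # listen / server_name / ssl settings as in the previous block

    location / {
        proxy_cache          app_cache;
        proxy_cache_valid    200 301 5m;        # short TTL keeps the stale window small
        proxy_cache_bypass   $http_x_refresh;   # app-set header forces a refetch of one page
        add_header           X-Cache-Status $upstream_cache_status;
        proxy_pass           http://apache_backends;
    }
}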

Is the RTC proxy server read-only?

In RTC, for the global software development scenario, there is the concept of caching proxies.
As I understand it, it is only a read-only proxy that helps while loading a component in the remote location (the SCM part).
For Commit and Deliver actions, where the changes are transmitted to the central server, the changes are sent directly over the WAN, so these actions do not benefit from the proxy. Is this understanding correct?
Or does the proxy help improve the performance of deliver/commit actions from the remote location?
Caching proxies are mentioned in:
"Does Rational Team Concert support MultiSite? "
"Using content caching proxies for Jazz Source Control"
We realize that there will still be cases where a WAN connection will not meet the 200ms guidance. In this case, we’ve leveraged the Web architecture of RTC to allow caching proxies to be used.
Based on standard Web caching technology (IBM HTTP Server, Apache HTTP Server, or Squid), a cache can be deployed in the location which has a poor connection to the central server.
This caching proxy will cache SCM contents that are fetched from the server, greatly improving access times for the RTC client, and reducing traffic on the network.
So in the case of RTC, it is targeted more at speeding up "Load" and "Accept" operations, rather than "Commit" and "Deliver".
If multiple developers all load from a specific stream, a caching proxy will help reduce the network traffic.
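With Apache HTTP Server, such a caching proxy at the remote site boils down to mod_proxy plus mod_cache pointed at the central server; the sketch below is only illustrative (the central hostname, port, certificate paths, cache location, and TTL are placeholders, and the jazz.net article has the authoritative settings):

LoadModule proxy_module        modules/mod_proxy.so
LoadModule proxy_http_module   modules/mod_proxy_http.so
LoadModule cache_module        modules/mod_cache.so
LoadModule cache_disk_module   modules/mod_cache_disk.so
LoadModule ssl_module          modules/mod_ssl.so

Listen 9443
<VirtualHost *:9443>
    SSLEngine on
    SSLCertificateFile    /etc/httpd/tls/proxy.crt
    SSLCertificateKeyFile /etc/httpd/tls/proxy.key

    SSLProxyEngine on
    ProxyPass        / https://central-rtc.example.com:9443/
    ProxyPassReverse / https://central-rtc.example.com:9443/

    CacheRoot          /var/cache/httpd/rtc
    CacheEnable        disk /
    CacheDefaultExpire 3600
</VirtualHost>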

Improving a Web App's Performance

My web app, an exploded WAR, is hosted by Apache (static content) and Tomcat (dynamic content) via mod_jk. Optionally, there's an ActiveMQ component of this system, but it's currently not being used.
As I understand, each HTTP request will hit Apache. If it's a dynamic content request, Apache will forward the request to Tomcat via mod_jk. To fulfill this request, Tomcat will start a new thread to do the work.
I'm running the app on a 6-core, 12 GB RAM machine.
Besides using the ActiveMQ component, how can I improve my system's performance? Also, please correct me if I'm misstating how Apache and Tomcat communicate.
while (unhappyWithSitePerformance) {
    executeLoadTest();
    identifyBiggestBottleneck(); // e.g. what breaks first
    fixIdentifiedBottleneck();
}
There is no single silver bullet to offer. You should make sure your load test simulates realistic user behaviour and define the number of (virtual) users you want your server to handle within a given response time. Then tune your server until that goal is met.
Common parameters to look for are
memory consumption
CPU consumption (e.g. certain algorithms)
I/O saturation - e.g. communication to the database, general HTTP traffic saturating the network adapter
Database or backend response time - e.g. sometimes you'll have to tune the backend, not the webserver itself.
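Concretely, on the Tomcat side of the mod_jk link, the first thing usually worth checking is the AJP connector's thread pool, sized together with Apache's own worker limits; a hedged sketch for conf/server.xml (the values are placeholders, not recommendations):

<!-- AJP connector used by mod_jk; maxThreads bounds the request threads Tomcat will run -->
<Connector port="8009" protocol="AJP/1.3"
           maxThreads="200"
           minSpareThreads="25"
           acceptCount="100"
           connectionTimeout="600000" />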

What's the most scalable and high performing Amazon Web Service (AWS) configuration for a RESTful web service?

I'm building an asynchronous RESTful web service and I'm trying to figure out what the most scalable and high performing solution is. Originally, I planned to use the FriendFeed configuration, using one machine running nginx to host static content, act as a load balancer, and act as a reverse proxy to four machines running the Tornado web server for dynamic content. It's recommended to run nginx on a quad-core machine and each Tornado server on a single core machine. Amazon Web Services (AWS) seems to be the most economical and flexible hosting provider, so here are my questions:
1a.) On AWS, I can only find c1.medium (dual core CPU and 1.7 GB memory) instance types. So does this mean I should have one nginx instance running on c1.medium and two Tornado servers on m1.small (single core CPU and 1.7 GB memory) instances?
1b.) If I needed to scale up, how would I chain these three instances to another three instances in the same configuration?
2a.) It makes more sense to host static content in an S3 bucket. Would nginx still be hosting these files?
2b.) If not, would performance suffer from not having nginx host them?
2c.) If nginx won't be hosting the static content, it's really only acting as a load balancer. There's a great paper here that compares the performance of different cloud configurations, and says this about load balancers: "Both HaProxy and Nginx forward traffic at layer 7, so they are less scalable because of SSL termination and SSL renegotiation. In comparison, Rock forwards traffic at layer 4 without the SSL processing overhead." Would you recommend replacing nginx as a load balancer by one that operates on layer 4, or is Amazon's Elastic Load Balancer sufficiently high performing?
1a) Nginx is an asynchronous (event-based) server; even with a single worker it can handle lots of simultaneous connections (max_clients = worker_processes * worker_connections/4, see ref) and still perform well. I myself have tested around 20K simultaneous connections on a c1.medium-class box (not on AWS). Here you would set the workers to two (one per CPU) and run 4 backends (you can even test with more to see where it breaks). Only if this gives you problems should you go for one more similar setup and chain them via an Elastic Load Balancer.
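A minimal sketch of that nginx configuration for a dual-core box fronting four Tornado backends (ports and connection limits are placeholders):

worker_processes  2;              # one worker per core on a dual-core instance

events {
    worker_connections  8192;     # per-worker connection ceiling
}

http {
    upstream tornado {
        server 127.0.0.1:8001;
        server 127.0.0.1:8002;
        server 127.0.0.1:8003;
        server 127.0.0.1:8004;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://tornado;
        }
    }
}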
1b) As said in (1a), use an Elastic Load Balancer. See: somebody tested ELB at 20K req/sec, and this is not the limit, as he gave up after losing interest.
2a) Host static content on CloudFront; it is a CDN and meant for exactly this (cheaper and faster than S3, and it can pull content from an S3 bucket or your own server). It's highly scalable.
2b) Obviously, with nginx serving static files it has to serve more requests for the same number of users. Taking that load away reduces the work of accepting connections and sending the files across (less bandwidth usage).
2c) Avoiding nginx altogether looks like a good solution (one less middleman). The Elastic Load Balancer handles SSL termination and reduces the SSL load on your backend servers (this improves backend performance). The experiments above showed around 20K req/sec, and since it is elastic it should stretch further than a software LB (see this nice document on how it works).