I am studying scalability design and I'm having a hard time thinking of ways to ensure a load balancer does not become a single point of failure. If a load balancer goes down, who makes the decision to route to a backup load balancer? What if that "decision maker" goes down too?
The way to avoid the load balancer becoming a single point of failure is to run the load balancer(s) in a high-availability cluster with hardware backup.
I believe the answer to this question is redundancy.
The load balancer, instead of being a single computer/service/module/whatever, should be several instances of that computer/service/whatever.
The clients should be aware of the options they have in case their favorite load balancer goes down.
If a client is timing out on its favorite load balancer, it already has the logic to reach the next one.
This is the most straightforward way I can think of to get rid of single points of failure, but I'm sure there are many others that have been researched.
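To make that client-side fallback concrete, here is a minimal sketch in Python. The load balancer hostnames and the timeout are placeholders for illustration, not a description of any particular setup:

    import urllib.request

    # Hypothetical load balancer endpoints, in order of preference.
    LOAD_BALANCERS = [
        "https://lb1.example.com",
        "https://lb2.example.com",
        "https://lb3.example.com",
    ]

    def fetch(path, timeout=2):
        """Try each load balancer in turn; fall back when one times out or refuses."""
        last_error = None
        for base in LOAD_BALANCERS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    return resp.read()
            except OSError as err:  # urllib timeouts and connection errors are OSErrors
                last_error = err
        raise RuntimeError("all load balancers are unreachable") from last_error

The same idea works a level lower too: publish several load balancer addresses under one DNS name and have clients walk through them on failure.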
Note that any system component can still fail, no matter how much redundancy you put in. The question is: "How sure do you want to be that it will not go down?"
If the probability of a single instance going down is p, then the probability of all n instances going down together (assuming they fail independently) is p^n. Pick how sure you want to be, or how many resources you can pay for, and solve for the other side of the equation.
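For a concrete feel of the numbers, here is that equation worked through with an assumed per-instance failure probability of 1% (the figure is made up for illustration):

    # Assume each instance is independently down 1% of the time.
    p = 0.01

    for n in (1, 2, 3):
        p_all_down = p ** n            # probability that every instance is down at once
        availability = 1 - p_all_down
        print(f"{n} instance(s): P(all down) = {p_all_down:.6f}, availability = {availability:.4%}")

    # 1 instance(s): P(all down) = 0.010000, availability = 99.0000%
    # 2 instance(s): P(all down) = 0.000100, availability = 99.9900%
    # 3 instance(s): P(all down) = 0.000001, availability = 99.9999%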
I was reading up on problems with server-based authentication. I need help elaborating on the following point.
Scalability: Since sessions are stored in memory, this provides problems with scalability. As our cloud providers start replicating servers to handle application load, having vital information in session memory will limit our ability to scale.
I don't understand why having vital information in session memory will limit our ability to scale. Is it just because the information is being replicated, so it's to do with redundancy? I don't think so. Anyway, would anyone be kind enough to explain this further? Much appreciated.
What's being referred to is the difference between stateless and stateful server-side ops. Stateful servers keep part of their resources (main memory, mostly) occupied for retaining state pertaining to some client, even when the server is actually doing nothing at all for the client and just waiting for the client to come back. Such systems' performance profile is "linear" only up to the point where all available memory has been filled with state, and beyond that point the server seems to essentially stall. Stateless servers only keep resources occupied when they're actually doing something, and once finished doing stuff, those resources are immediately freed and available for other clients. Such servers are essentially not capped by memory limits and therefore "scale easier".
Also, the explanation given seems to refer to scenarios where a set of distinct machines present themselves to the outside world as one machine, when actually they are not (this is often called a "cluster" of machines/servers). In such scenarios, if a client has connected to the "big single virtual machine", it is actually connected to just one of the real machines in the cluster. If state is kept there, subsequent visits by that same client must either be routed to the same physical machine, or that piece of state must be shipped around to whatever machine the next visit happens to land on. The former implies management functions that consume their own resources, plus limitations on the freedom the cluster has to distribute the load (the opposite of why you want to do clustering); the latter implies additional network traffic that will cap scalability in essentially the same way as available memory does.
Server-based authentication makes use of sessions, which in turn make use of a local session id. In the cloud, when servers are replicated to handle application load, it becomes difficult for one server to know which sessions are active on other servers. To overcome this, extra steps must be performed, for instance persisting the sessions to a shared database. However, as the servers are replicated further, this becomes more and more difficult to handle. That is why server-based or session-based authentication can be problematic for scalability.
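A common way around this is to move the session out of the individual server's memory into a shared store that every replica can reach, so that any server can handle any request. Below is a minimal sketch of that idea; the Redis hostname is a placeholder and the redis-py client is assumed to be available:

    import json
    import uuid

    import redis  # assumes the redis-py client is installed

    # A shared store reachable by every replica, so no single web server
    # has to keep the session in its own memory.
    store = redis.Redis(host="sessions.example.internal", port=6379)

    SESSION_TTL = 30 * 60  # expire idle sessions after 30 minutes

    def create_session(user_id):
        """Create a session that any replica can later look up."""
        session_id = uuid.uuid4().hex
        store.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
        return session_id

    def load_session(session_id):
        """Whichever replica handles the next request can resolve the session."""
        raw = store.get(f"session:{session_id}")
        return json.loads(raw) if raw else None

With state held this way (or in a signed token handed to the client), the web servers themselves stay stateless and can be replicated freely.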
I have integrated twemproxy into the web layer, and I have 6 ElastiCache nodes (1 master, 5 read replicas). The issue I am seeing is that all replicas hold the same keys and everything else is identical, yet the cache hits on one replica are far higher than on the others. I have run several load tests and every test gives the same result. A separate data engine writes to the master of this cluster, and the remaining 5 replicas sync from it, so I am using twemproxy only for reading data from ElastiCache, not for sharding. My simple question is: why am I getting 90% of the hits on a single read replica? Shouldn't the hits be distributed evenly among all read replicas?
Thank you in advance
Twemproxy hashes everything, as I recall. This means it tries to split keys among the servers you give it, and a given key always maps to the same server. If you have one master, it hashes everything to that one server; as far as twemproxy is concerned, that is the only server available for those queries. As such, it isn't helping you in this case.
If you want a single endpoint that distributes reads across a bank of identical slaves, you will need to put a TCP load balancer in front of the slaves and have your application talk to the IP:port of the load balancer. Common software options are Nginx and HAProxy; on AWS you could use their load balancer, though you may run into various resource limits outside your control there. Pretty much any hardware load balancer would work as well (though that is difficult, if not impossible, on AWS).
Which load balancer to use is dependent on your (or your personnel's) comfort and knowledge level with each option.
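If putting a load balancer in front of the replicas isn't an option, another way to get even reads is to spread them in the application itself. A rough sketch, with made-up replica hostnames and assuming the redis-py client:

    import itertools

    import redis  # assumes the redis-py client

    # Placeholder read replica endpoints; substitute your ElastiCache replicas.
    REPLICAS = [
        redis.Redis(host=f"replica-{i}.example.cache.amazonaws.com", port=6379)
        for i in range(1, 6)
    ]

    # Rotate through the replicas so reads spread evenly instead of
    # every request for a hot key landing on the same node.
    _cycle = itertools.cycle(REPLICAS)

    def read(key):
        """Send each read to the next replica in the rotation."""
        return next(_cycle).get(key)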
Currently, I am doing some research on load balancers.
On Wikipedia, refer to this link: http://en.wikipedia.org/wiki/Load_balancing_(computing)
It says: "Usually load balancers are implemented in high-availability pairs which may also replicate session persistence data if required by the specific application."
Besides that, I have also used search engines to find articles about the reasons for, and the cases in which, a system needs two load balancers, but I did not find any good information.
So I want to ask: why do we need two load balancers in most cases, and in which cases do we need two or more load balancers instead of one?
Nowadays applications need to be highly available, so the load balancer itself should run as a highly available pair.
If you use a single server/node as the load balancer, there is a chance it may go down or need to be taken offline for maintenance. That will cause application downtime, or force you to redirect all requests to a single server, which will hurt performance severely.
To avoid this, it is always recommended that load balancers be deployed in highly available pairs, so that load balancing remains continuously operational for as long as possible, ideally all the time.
For example, I execute "sudo named" several times, so there are several named processes running. When I use "pidof named", I get several pids.
I want to calculate the CPU usage rate of the BIND process, so I need to read some parameters from "/proc/<pid>/stat", which means I need the pid of the named process that is actually providing the domain resolution service.
What's the difference between the named process which is providing the service and the others? Could you give me a detailed explanation?
Thanks very much!
(It's my first time using Stack Overflow and asking questions in English, so please excuse any mistakes.)
There should be just one named running; the scripts that manage the service ensure that. You shouldn't start it by hand like that; use whatever your distribution uses to start it, probably something along the lines of service bind start (that is probably a RedHat-ism) or /etc/rc.d/bind start (for bog-standard SysVinit).
I was responsible for DNS for quite some time here. Some tips:
DNS is a very critical service; configure and monitor it with extreme care. Do read up on setting it up and managing it, and don't go ahead until you are absolutely clear.
Get somebody as a backup for the case that you aren't available, and make sure they understand the previous point.
DNS isn't CPU intensive (OK, with signed domains and that newfangled stuff, that might have changed); it is memory intensive (and network intensive, or at least sensitive to delays). Our main DNS server ran for months at a time and clocked up some half an hour of CPU time over that kind of period, IIRC.
Separate your master server (responsible for the domain(s)) from the servers queried by clients (caching servers). There have been vulnerabilities where malformed questions, or "answers" to questions that hadn't been asked, soiled the database.
The master server will have all the domain information in RAM, so make sure you have enough of it.
Make sure all machines under your jurisdiction use the same caching server. Running more than one makes little sense; it defeats the purpose of the cache.
The caching servers collect immense amounts of data over time. This data is rarely performance critical, so configure them with plenty of swap space to accommodate overflows.
BIND creates as many named worker threads as you have CPUs:
man named:
-n #cpus
Create #cpus worker threads to take advantage of multiple CPUs. If not specified, named will try to determine the number of CPUs present and create one thread per CPU. If it is unable to determine the number of CPUs, a single worker thread will be created.
External source:
https://unix.stackexchange.com/questions/140986/multiple-named-processes-for-bind9-in-debian
My company is about to write a new public-facing website in SharePoint (so Windows Server 2008 R2, SQL Server 2008 R2, etc.) and we're looking at using Amazon EC2 to host it. I've read and been told that instances can disappear (often through user error, but also in batches), so I'm skeptical that EC2 is the best idea for us.
I've done research on the Amazon AWS site, but must confess that most of the terminology used is confusing, and Googling my questions often brought me here, so I thought I'd ask my questions here too and see if people can advise me.
1) It's critical that our website be available to the public as much as possible (the usual 99.9% uptime requirements apply). The Amazon EC2 Service Level Agreement commits to 99.95% availability, which is fine, but what happens if we hit that 0.05% scenario? Would our EC2 instance be lost? Can it be recovered? If so, what would we need to do to ensure that we recover a not-too-old version of our site?
2) I've read about Amazon Elastic Block Store (EBS), and how it persists independently of the lifetime of the instance. If I understand correctly, EBS is like having a hard drive: if the instance is lost we can start a new instance using our EBS volume to recover the latest version, while the "local instance store" is lost along with the instance. Is that right?
3) Are 'reserved instances' a more stable option? i.e. are they less likely to disappear? If they do still disappear, what recovery benefits do they offer, if any?
I know these questions are kind of vague, but hopefully you'll be able to offer a newbie some basic info - enough to point me in the right direction for further, deeper research at least.
Many thanks.
Kevin
We rely on AWS for our webservers. I won't use anything else. They're highly scalable, easily configurable and have an absurd uptime. I've never experienced downtime with them. We've been with them for two years.
Reserved instances are cheaper. Get them if you're planning on having that instance for a while. It's simply a cost/budgeting issue.
Never heard of people losing an EC2 instance.
Not terribly knowledgeable about EBS, but S3 is a good way to back up data.
HTH
EDIT:
Came across some links that might be helpful. Cheers.
http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
One of the main design goals of AWS is to build fault-tolerant services, that is, services that can recover from failures. They design all of their services with the assumption that something will fail in some way at some point, and that there will be redundancies and other mechanisms in place to recover from those inevitable failures.
In the case of storage services like S3 and SimpleDB, this is achieved primarily by replicating your data across multiple nodes (machines) in multiple data centers. So when one node experiences a hardware failure or one data center experiences a power outage, there's no real downtime, because the replicas can still service requests. As a consumer, you aren't even aware of the down nodes or data centers.
EC2 is designed to work similarly, but it is not quite as encapsulated as S3 and SimpleDB, so you'll need to plan for a bit of the work yourself. For example, if you need a web service with guaranteed uptime and availability, you'll want to look into the AWS ELB (Elastic Load Balancing) service. That way, if an instance is down, requests will automatically be routed to other healthy instances. For your data, you can either store it in other AWS services (like S3, SimpleDB and EBS) which have built-in redundancy, or you can build your own solution using similar redundancy techniques.
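As a rough illustration of the ELB part (the load balancer name, region and instance IDs below are placeholders, and this uses the boto3 client, which is just one of several ways to drive the API):

    import boto3

    # Classic ELB client; region, name and instance IDs are assumptions.
    elb = boto3.client("elb", region_name="us-east-1")

    LB_NAME = "my-web-elb"

    # Health check: an instance failing 3 consecutive checks stops receiving traffic.
    elb.configure_health_check(
        LoadBalancerName=LB_NAME,
        HealthCheck={
            "Target": "HTTP:80/healthcheck",
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 3,
            "HealthyThreshold": 2,
        },
    )

    # Register the web instances; the ELB only routes to those passing the check.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=LB_NAME,
        Instances=[{"InstanceId": "i-0123456789abcdef0"},
                   {"InstanceId": "i-0fedcba9876543210"}],
    )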
The SLA amounts to nothing; we found out that:
Instances and EBS volumes DID get lost
It took Amazon more than 2 days to recover from a disaster, and even then not to the full extent
We were the lucky ones, that managed to get back on our feet in less than 2 days. Other companies got stuck with no recovery option.
And what does Amazon recommend? "Don't trust our reliability. Pay for 2 or 3 more copies of your system in different regions, and then you will be safe".
More information can be found here:
http://www.zdnet.com/blog/saas/lightning-strike-zaps-ec2-ireland/1382
tl;dr: AWS is very reliable if you know what you're doing, and a bad idea if you don't.
As you're unfamiliar with the terms, here's a very quick glossary:
AZ - Availability zone; there are several availability zones per region (e.g. 3 in Ireland). They are physically isolated datacentres with different power grids, flood plains, etc., but connected by fast internal network links. It's possible, even likely, that an AZ will become unavailable at some point; I don't think all AZs in a region have ever been down at once, though.
EBS/Instance Store - These are the two main types of storage available to an instance. Instance store is the equivalent of an HDD plugged into the motherboard via SATA: it's very fast. But what happens if you shut down your instance (or the motherboard fails) and want to start instantly on another board? (Amazon completely hides the physical hardware setup.) Obviously you aren't going to wait for an engineer to unplug a drive from one server and put it into another, so they don't even offer this. Instance store is fast but temporary and tied to the physical machine; DO NOT store anything important on it. EBS is the alternative: a very low latency network drive that any server can connect to as though it were local. You can shut down a server, change its size and restart it on a completely different machine on the other side of the datacentre (again, the physical layer is hidden); it doesn't matter, your EBS volume hasn't gone anywhere (by default it is also spread across multiple physical discs).
Commodity cloud hardware - My interpretation of all the "cloud hardware fails all the time, it's really risky and unreliable" talk is that yes, AWS hardware is not as reliable as enterprise-level components in a managed datacentre. This doesn't mean it's unreliable; it just means you should build failure in as an option in your design.
The first very important thing to note when talking about SLAs is that Amazon states very clearly that the SLA ONLY applies if one or more AZs go down. So if you don't understand how their service works and only build one server in one AZ, and a generator or router fails, it's your own fault.
As for recovery, that depends: is your entire application state stored on one server? If it is, don't bother with the cloud. If, however, you can cluster your state across multiple servers, store it in RDS or some other persistent DB, or, if your content changes infrequently enough, use periodic copies to S3 storage, you'll be fine. Your failure strategy (in order of preference) could be clustered, failover, or auto-repair. With the first, you have clustered servers sharing state; it doesn't matter if you lose a server or an AZ. With the second, you have only one live server, but if it goes down you have a failover standing by with the same content. Finally, with auto-repair there are two possible situations: if your data is only on one EBS drive, you can start another instance with the same drive and carry on; but if the EBS drive or the AZ fails, you will need to be ready with a snapshot in S3 that a completely fresh instance can copy and start up from.
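For the snapshot route, the periodic copies can be scripted. A minimal sketch with boto3, with a placeholder volume ID and region (snapshots are incremental and stored durably by AWS, so a fresh instance in any AZ of the region can later restore from them):

    from datetime import datetime, timezone

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption

    VOLUME_ID = "vol-0123456789abcdef0"  # placeholder EBS volume ID

    # Kick off a snapshot; run this periodically (e.g. from cron).
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
    snapshot = ec2.create_snapshot(
        VolumeId=VOLUME_ID,
        Description=f"periodic backup {stamp}",
    )
    print("started snapshot", snapshot["SnapshotId"])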
Reserved instances are no more reliable; they're the same hardware, you're just entering into a contract saying you'll have x machines for y years. That allows AWS to plan better, which is cheaper for you.