The ideal multi-server LAMP environment - apache

There's a lot of information out there on setting up LAMP stacks on a single box, or perhaps moving MySQL onto its own box, but growing beyond that doesn't seem to be very well documented.
My current web environment is having capacity issues, and so I'm looking for best-practices regarding configuration tuning, determining bottlenecks, security, etc.
I presently host around 400 sites, with a fair need for redundancy and security, and so I've grown beyond the single-box solution - but am not at the level of a full ISP or dedicated web-hosting company.
Can anyone point me in the direction of some good expertise on setting up a great apache web-farm with a view to security and future expansion?
My web environment consists of 2 redundant MySQL servers, 2 redundant web-content servers, 2 load balancing front-end apache servers that mount the content via nfs and share apache config and sessions directories between them, and a single "developer's" server which also mounts the web-content via nfs, and contains all the developer accounts.
I'm pretty happy with a lot of this setup, but it seems to be choking on the load prematurely.
Thanks!!
--UPDATE--
Turns out the "choking on the load" is related to mod_log_sql, which I use to send my Apache logs to a MySQL database. Reconfiguring the web servers to write their SQL statements to a disk file, with a separate process submitting them to the database, lets the web servers free up their threads much more quickly and handle a much greater load.
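A minimal sketch of the "separate process" half of that fix, for anyone who wants to try the same approach. It assumes the web servers append one complete SQL statement per line to a spool file; that file layout and the name `replay_spool` are illustrative assumptions, and SQLite stands in for MySQL so the example is self-contained.

```python
import sqlite3
from pathlib import Path


def replay_spool(spool_path, conn):
    """Execute each spooled statement, then empty the spool file.

    Assumes one complete SQL statement per line (hypothetical format).
    Returns the number of statements applied.
    """
    path = Path(spool_path)
    if not path.exists():
        return 0
    statements = [s.strip() for s in path.read_text().splitlines() if s.strip()]
    applied = 0
    with conn:  # commit the whole batch as one transaction
        for stmt in statements:
            conn.execute(stmt)
            applied += 1
    # Naive truncate: a real spooler should rotate the file first so that
    # lines appended by the web server during the replay are not lost.
    path.write_text("")
    return applied
```

Run from cron or a small loop, this keeps the database round trip out of the request path, which is the effect described in the update.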

You need to be able to identify bottlenecks and test improvements.
To identify bottlenecks, you need to use your system's reporting tools. Some examples:
MySQL has a slow query log.
Linux provides stats like load average, iostat, vmstat, netstat, etc.
Apache has the access log and the server-status page.
Programming languages have profilers, like Pear Benchmark.
Use these tools to identify the slowest/biggest offenders and concentrate on them. Try an improvement and measure to see if it actually improves performance.
This becomes a never ending loop for two reasons: there's always something in a complex system that can be faster and as your system grows, different functions will start slowing down.
Based on the description of your system, my first hunch would be disk io and network io on the NFS servers, then I'd look at MySQL query times. I'd also check the performance of the shared sessions.
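As a rough illustration of finding the biggest offenders, the sketch below ranks URLs by mean response time. It assumes the access log has the request duration in microseconds appended as the last field (Apache's `%D` in a custom LogFormat), which is not the default; `slowest_paths` is just an illustrative name.

```python
import re
from collections import defaultdict

# Quoted request line, then anything, then a trailing microsecond count (%D).
LINE_RE = re.compile(r'"\S+ (?P<path>\S+)[^"]*".*\s(?P<usec>\d+)\s*$')


def slowest_paths(lines, top=3):
    """Return [(path, mean_milliseconds)] for the slowest paths, worst first."""
    totals = defaultdict(lambda: [0, 0])  # path -> [request count, total usec]
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that do not match the assumed format
        entry = totals[m.group("path")]
        entry[0] += 1
        entry[1] += int(m.group("usec"))
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    return [(path, total / count / 1000.0) for path, (count, total) in ranked[:top]]
```

Feeding it the lines of an access log gives a short list to start measuring against before and after each change.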

The schoolbook way of doing it would be to identify the bottlenecks with real empirical data.
Is it the database, Apache, the network, CPU, memory, or I/O? Do you need more RAM or sharding? Is the limit disk I/O, NFS network load, or CPU spent on full table scans?
Once you find where the problem is, you may discover that scaling the infrastructure is not enough because of the way the code works, and you end up needing either to run more instances of your current setup or to change the code.

I would also recommend as a first step in terms of scalability, off-load your content to a CDN like Edgecast. Use your current two content servers as additional web servers.

Related

BIND9.7. When several named processes are running, how to judge which process is providing the service?

For example, I execute "sudo named" several times, so there are several named processes running. When I use "pidof named", I get several pids.
I want to calculate the CPU usage rate of the BIND process, so I need to get some parameters from "/proc/pid/stat", which means I need the pid of the named process that is really providing the domain resolution service.
What's the difference between the named process which is providing the service and the others? Could you give me a detailed explanation?
thanks very much~
(It's my first time using stackoverflow and asking questions in English, so please ignore any syntax errors.)
There should be just one named running, the scripts that manage the service ensure that. You shouldn't start it like that, you should use what your distribution uses to start it, probably something along the lines of service bind start (that is probably a RedHat-ism), or /etc/rc.d/bind start (for bog-standard SysVinit).
I was responsible for DNS for quite some time here. Some tips:
DNS is a very critical service, configure and monitor with extreme care. Do read up on setting up and managing this, don't go ahead until you are absolutely clear.
Get somebody as a backup for the case that you aren't available, and make sure they understand the previous point.
DNS isn't CPU intensive (OK, with signed domains and that newfangled stuff that might have changed), it is memory intensive (and network intensive, or at least sensitive to delays). Our main DNS server was running for months at a time, and clocked up some half hour of CPU time during that kind of period IIRC.
Separate your master server (responsible for the domain(s)) from the servers queried by clients (caching servers). There have been vulnerabilities where malformed questions or "answers" to questions that hadn't been asked soiled the database.
The master server will have all the domain information in RAM, so make sure you have got enough of it.
Make sure all machines under your jurisdiction use the same caching server. It makes no sense for more than one, that destroys the idea of cache.
The caching servers collect immense amounts of data over time. This data rarely is performance critical, so configure them with plenty of swap space to accommodate overflows.
BIND creates as many named worker threads as you have CPUs, which can look like several named processes:
man named:
-n #cpus
Create #cpus worker threads to take advantage of multiple CPUs. If not specified, named will try to determine the number of CPUs present and create one thread per CPU. If it is unable to determine the number of CPUs, a single worker thread will be created.
External source:
https://unix.stackexchange.com/questions/140986/multiple-named-processes-for-bind9-in-debian
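For the CPU-usage part of the question, here is a minimal sketch of reading the relevant fields from /proc/<pid>/stat once you know the right pid. Per the Linux proc(5) layout, utime and stime are fields 14 and 15, measured in clock ticks; the function names are illustrative, and `cpu_seconds` only works on Linux.

```python
import os


def cpu_ticks_from_stat(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split on the last ')' before reading the numeric fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]), int(rest[12])


def cpu_seconds(pid):
    """Total CPU time (user + system) consumed so far by pid, in seconds."""
    with open(f"/proc/{pid}/stat") as f:
        utime, stime = cpu_ticks_from_stat(f.read())
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def cpu_percent(ticks_before, ticks_after, interval_seconds, ticks_per_second):
    """CPU usage over an interval, given two samples of utime + stime."""
    return 100.0 * (ticks_after - ticks_before) / ticks_per_second / interval_seconds
```

Sampling `cpu_seconds` twice, a few seconds apart, and dividing the difference by the interval gives the usage rate for that window.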

Is Amazon EC2 recommended for a persistent public facing website?

My company is about to write a new public facing website in SharePoint (so Windows Server 2008 R2, SQL Server 2008 R2, etc.) and we're looking at using Amazon EC2 to host it. I've read and been told that instances can disappear (often through user error, but also in batches), so I'm skeptical that EC2 is the best idea for us.
I've done research on the Amazon AWS site, but must confess that most of the terminology used is confusing, and Googling my questions often brought me here, so I thought I'd ask my questions here too and see if people can advise me.
1) It's critical that our website be available to the public as much as possible (the usual 99.9% up times apply). The Amazon EC2 Service Level Agreement commitment is 99.95% availability, which is fine, but what happens if we hit that 0.05% scenario? Would our EC2 instance be lost? Can these be recovered? If so, what would we need to do to ensure that we recover to a not-too-old version of our site?
2) I've read about Amazon Elastic Block Store (EBS), and how it persists independently of the lifetime of the instance. If I understand right, EBS is like having a hard drive, so if the instance is lost we can start a new instance using our EBS volume to recover the latest version, while the 'local instance store' would be lost along with the instance. Is that right?
3) Are 'reserved instances' a more stable option? i.e. are they less likely to disappear? If they do still disappear, what recovery benefits do they offer, if any?
I know these questions are kinda vague, but hopefully you'll be able to offer a newbie some basic info - enough to point me in the right direction for further, deeper research at least.
Many thanks.
Kevin
We rely on AWS for our webservers. I won't use anything else. They're highly scalable, easily configurable and have an absurd uptime. I've never experienced downtime with them. We've been with them for two years.
Reserved instances are cheaper. Get them if you're planning on having that instance for a while. It's simply a cost/budgeting issue.
Never heard of people losing an EC2 instance.
Not terribly knowledgeable about EBS, but S3 is a good way to back up data.
HTH
EDIT:
Came across some links that might be helpful. Cheers.
http://techblog.netflix.com/2010/12/four-reasons-we-choose-amazons-cloud-as.html
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
One of the main design goals of AWS is to make fault tolerant services--that is services that can recover from failures. That is, they design all of their services with the assumption that something will fail in some way at some point, but that there will be redundancies and other mechanism in place to recover from those inevitable failures.
In the case of storage services like S3 and SimpleDB, this is achieved primarily by replicating your data across multiple nodes (machines) in multiple data centers. So when one node experiences a hardware failure or one data center experiences a power outage, there's no real down time as the replicas can still service the requests. As a consumer, you aren't even aware of the down nodes or data centers.
EC2 is designed to work similarly, but it is not quite as encapsulated as S3 and SimpleDB, so you'll need to plan for a bit of the work yourself. For example, if you need a web service with guaranteed uptime and availability, you'll want to look into the AWS ELB (Elastic Load Balancing) service. That way, if an instance is down, requests will automatically be routed to other healthy instances. For your data, you can either store it in other AWS services (like S3, SimpleDB and EBS), which have built-in redundancy, or you can build your own solution using similar redundancy techniques.
The SLA amounted to nothing once we found out that:
Instances and EBS volumes DID get lost
It takes Amazon more than 2 days to recover from a disaster, and even that not to the full extent
We were the lucky ones, that managed to get back on our feet in less than 2 days. Other companies got stuck with no recovery option.
And what does Amazon recommend? "Don't trust our reliability. Pay for 2 or 3 more copies of your system in different regions, and then you will be safe".
More information can be found here:
http://www.zdnet.com/blog/saas/lightning-strike-zaps-ec2-ireland/1382
tldr: AWS is very reliable if you know what you're doing, a bad idea if you don't.
As you're unfamiliar with the terms, here's a very quick glossary:
AZ - Availability zone. There are several availability zones per region (e.g. 3 in Ireland). They are physically isolated datacentres with different power grids, flood plains etc., but with connections between them at internal-network speed. It's possible, even likely, that an AZ will become unavailable at some point; I don't think all the AZs in a region have ever been down at once, though.
EBS/Instance Store - These are the two main types of storage available to an instance. The best way to describe instance store is as the equivalent of an HDD plugged into your motherboard via SATA: it's very fast. But what happens if you shut down your instance (or the motherboard fails) and want to start instantly on another board? Amazon completely hides the physical hardware setup, and obviously you aren't going to wait for an engineer to move a drive from one server to another, so they don't even offer this. Instance store is fast but temporary and tied to the physical machine; DO NOT store anything important on it. EBS is the alternative: a very low latency network drive that any server can connect to as though it were local. You can shut down a server, change its size and restart on a completely different server on the other side of the datacentre (again, the physical side is hidden); it doesn't matter, because your EBS volume hasn't gone anywhere (by default they're also on multiple physical discs).
Commodity cloud hardware - My interpretation of all the 'cloud hardware fails all the time - it's really risky and unreliable' talk is that yes, AWS hardware is not as reliable as enterprise-level components in a managed datacentre. This doesn't mean it's unreliable; it just means you should build failure as an option into your design.
The first very important thing to note when talking about SLAs is that Amazon state very clearly that the SLA ONLY applies if one or more AZs go down. So if you do not understand how their service works, build only one server in one AZ, and a generator or router fails, it's your own fault.
As for recovery, that depends: is your entire application state stored on one server? If it is, don't bother with the cloud. If, however, you can cluster your state on multiple servers, or store it in RDS or some other persistent DB, or if your content changes so infrequently that you can rely on periodic copies to S3 storage, you'll be fine. Your failure strategy (in order of preference) could be clustered, failover, or auto repair. For the first one you have clustered servers sharing state - it doesn't matter if you lose a server or an AZ. For the second you only have one live server, but if it goes down you have a failover standing by with the same content. Finally, with auto repair there are two possible situations: if your data is only on one EBS drive, you could start another instance with the same drive and carry on. But if the EBS drive or AZ fails, you will need to be ready with some snapshot in S3 that a completely fresh instance can copy and start up with.
Reserved instances are no more reliable: they're the same hardware; you're just entering into a contract that says "I'll have x machines for y years". That allows AWS to plan better, which makes it cheaper for you.
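To put the 99.95% and 99.9% figures from the question in perspective, here is the arithmetic as a small sketch. It assumes availability is measured over a full 365-day year, which is a simplification and not a statement of how Amazon actually measures its SLA.

```python
def allowed_downtime_hours(availability_pct, period_hours=365 * 24):
    """Hours of downtime permitted by an availability percentage over a period."""
    return (100.0 - availability_pct) / 100.0 * period_hours


for pct in (99.9, 99.95):
    print(f"{pct}% availability allows about "
          f"{allowed_downtime_hours(pct):.2f} hours of downtime per year")
```

That works out to roughly 8.8 hours a year at 99.9% and 4.4 hours at 99.95%, so a single multi-day outage like the one described above uses up many years' worth of the budget.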

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient and has not been a problem so far but we want to move to a more scalable system so we're looking in to EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and to provide low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight in to how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
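The "generate once per admin change, not once per request" idea can be sketched roughly as follows. `publish_if_changed` and the `upload` callable are hypothetical; `upload` stands in for whatever actually copies the file to S3, and JSON is only an assumed file format.

```python
import hashlib
import json


def publish_if_changed(records, last_digest, upload):
    """Serialise records and call upload(body) only when the content changed.

    `upload` is a hypothetical callable standing in for the copy to S3.
    Returns the digest of the current content so the caller can store it.
    """
    body = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    if digest != last_digest:
        upload(body)  # the expensive step happens once per real change
    return digest
```

The admin panel would call this after each save and keep the returned digest; user downloads then never touch the database.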
Cassandra is a non-relational database engine and if this is what you need, you should first evaluate Amazon's SimpleDB : a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware; if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you see if this is an appropriate solution to help:
WTF is a Super Column

Justifications for a test/development server

At my current workplace, the production SQL server and web servers are also used as development and test servers. I've asked for dedicated servers, but been refused as I can't justify it to satisfaction (the reasons against being cost of software, software licenses and hardware resources).
So, what justifications are there for a dedicated test/development server (a combined server at the moment - I don't want to push my luck and ask for 6 servers!)?
Summarised list
Resource usage
Prevention of errors
DR purposes
The list doesn't seem as extensive as I'd hoped.
Consider using Virtual Machines to reduce costs.
Well, for starters, the resources available to the production database are restricted.
Also, rogue or accidental developer SQL scripts could play havoc with the production data.
Could there be issues with production data sensitivity (e.g. personal data)?
just a few to get started :)
Try to calculate the cost of downtime if you take the production system down due to a mistake in development.
Try also to calculate the cost of slow response times in production if/when you are doing performance testing.
As a cost benefit the test/dev hardware can be used as a spare if something bad happens to the production hardware.
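A minimal sketch of that calculation, so the numbers can be put in front of management. The function names and every input figure are placeholders you would replace with your own estimates.

```python
def annual_downtime_cost(incidents_per_year, hours_per_incident, cost_per_hour):
    """Expected yearly cost of outages caused by dev/test work on production."""
    return incidents_per_year * hours_per_incident * cost_per_hour


def payback_years(server_cost, annual_cost_avoided):
    """Years until a dedicated test/dev server pays for itself."""
    if annual_cost_avoided <= 0:
        return float("inf")
    return server_cost / annual_cost_avoided


# Placeholder inputs: 2 incidents a year, 3 hours each, 1000 per hour lost.
loss = annual_downtime_cost(2, 3, 1000)
print(loss, payback_years(3000, loss))
```

Even rough inputs make the argument concrete: if the payback period is well under the server's lifetime, the request is easier to justify.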
Explain how often developers have fat-handed moments and hit enter too soon while editing statements starting...
drop table...
UPDATE veryImportantTable SET veryImportantField = '' WHERE 1 = 1 --TODO: make proper condition
This'd be reason enough for me. :)
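To make the risk concrete, here is a small sketch using in-memory SQLite (an assumption; the question is about SQL Server) with the table from the statement above. `guarded_update` is a hypothetical helper showing one possible guard, not something this answer proposes: it rolls back any statement that touches more rows than expected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE veryImportantTable (id INTEGER PRIMARY KEY, veryImportantField TEXT)"
)
conn.executemany(
    "INSERT INTO veryImportantTable (veryImportantField) VALUES (?)",
    [("a",), ("b",), ("c",)],
)
conn.commit()


def guarded_update(conn, sql, max_rows):
    """Run sql in a transaction; roll back if it touches more than max_rows rows."""
    cur = conn.execute(sql)  # sqlite3 opens a transaction implicitly for UPDATE
    if cur.rowcount > max_rows:
        conn.rollback()
        return False
    conn.commit()
    return True
```

With `WHERE 1 = 1` the update matches all three rows and is rolled back when `max_rows` is 1; on a shared production server without such a guard, the same slip blanks the whole column.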
I hope you have at least separate databases and are not developing on production data.
Check the data protection act, and also look into PCI-DSS if you want to be really secure (Payment Card Industry Data Security Standard).
I think it's livable to have a test database on the same physical machine as your production DB. Performance is often not an issue (assuming it's a multicore machine with plenty of memory, production will often not noticeably slow down even if you run a heavy query on test), and so long as the DB connections are separate, the chance of accidental damage is very, very low.
As for a web-server, almost any machine can run one of those (apache is free, and even IIS is free for 10 simultaneous connections or fewer) - you could install a test web server on any old machine, configure it to use your test DB, and have a decent, low-cost solution.
'course a separate machine is "cleaner" - but the difference isn't huge.
One strong argument is availability / reduce downtime / disaster recovery.
i.e. to have another machine on standby to replace the production machine should anything bad happen to it hardware-wise (e.g. disk controllers or motherboards or power supplies dying).
Ideally the additional machine should be identical to the production one so it can be swapped directly, or individual parts swapped in as required. They can also back each other up or have a local copy of their counterparts last backup so they can be restored from quickly.
Of course, how much value they'll see in this depends on how critical uptime is to the business. If you're able to roughly work out how much they'll lose in $ due to lost business with and without a 'hot spare' server and present your case from a $ saved viewpoint (hopefully a lot more than the cost of the server), they might go for it.

Index replication and Load balancing

I am using the Lucene API in my web portal, which is going to have 1000s of concurrent users.
Our web server will call the Lucene API, which will be sitting on an app server. We plan to use 2 app servers for load balancing.
Given this, what should be our strategy for replicating Lucene indexes on the 2nd app server? Any tips please?
You could use solr, which contains built in replication. This is possibly the best and easiest solution, since it probably would take quite a lot of work to implement your own replication scheme.
That said, I'm about to do exactly that myself, for a project I'm working on. The difference is that since we're using PHP for the frontend, we've implemented lucene in a socket server that accepts queries and returns a list of db primary keys. My plan is to push changes to the server and store them in a queue, where I'll first store them into the memory index, and then flush the memory index to disk when the load is low enough.
Still, it's a complex thing to do and I'm set on doing quite a lot of work before we have a stable final solution that's reliable enough.
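A rough sketch of that queue-then-flush plan, in Python rather than the actual socket server. `BufferedIndexWriter`, the `flush` callable and the load threshold are all assumptions made for illustration; the real flush would write the in-memory Lucene index to disk.

```python
import os
from collections import deque


class BufferedIndexWriter:
    """Queue index changes in memory and flush them only when load is low."""

    def __init__(self, flush, max_load=1.0, get_load=None):
        self.pending = deque()
        self.flush = flush        # hypothetical callable that writes a batch to disk
        self.max_load = max_load  # assumed threshold on the 1-minute load average
        self.get_load = get_load or (lambda: os.getloadavg()[0])

    def push(self, change):
        self.pending.append(change)

    def maybe_flush(self):
        """Flush all pending changes if the machine is idle enough."""
        if not self.pending or self.get_load() > self.max_load:
            return 0
        batch = list(self.pending)
        self.pending.clear()
        self.flush(batch)
        return len(batch)
```

Calling `maybe_flush` on a timer keeps disk writes away from peak query times, at the cost of losing the queued changes if the process dies before a flush.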
From experience, Lucene should have no problem scaling to thousands of users. That said, if you're only using your second app server for load balancing and not for failover situations, you should be fine hosting Lucene on only one of those servers and accessing it via NFS (if you have a unix environment) or a shared directory (in a windows environment) from the second server.
Again, this is dependent on your specific situation. If you're talking about having millions (5 or more) of documents in your index and needing your lucene index to be failoverable, you may want to look into Solr or Katta.
We are working on a similar implementation to what you are describing as a proof of concept. What we see as an end-product for us consists of three separate servers to accomplish this.
There is a "publication" server, that is responsible for generating the indices that will be used. There is a service implementation that handles the workflows used to build these indices, as well as being able to signal completion (a custom management API exposed via WCF web services).
There are two "site-facing" Lucene.NET servers. Access to the API is provided via WCF Services to the site. They sit behind a physical load balancer and will periodically "ping" the publication server to see if there is a more current set of indices than what is currently running. If there is, the server requests a lock from the publication server and updates the local indices by initiating a transfer to a local "incoming" folder. Once there, it is just a matter of suspending the searcher while the index is attached. It then releases its lock and the other server is available to do the same.
Like I said, we are only approaching the proof of concept stage with this, as a replacement for our current solution, which is a load balanced Endeca cluster. The size of the indices and the amount of time it will take to actually complete the tasks required are the larger questions that have yet to be proved out.
Just some random things that we are considering:
The downtime of a given server could be reduced if two local folders are used on each machine receiving data to achieve a "round-robin" approach.
We are looking to see if the load balancer allows programmatic access to have a node remove and add itself from the cluster. This would lessen the chance that a user experiences a hang if he/she accesses during an update.
We are looking at "request forwarding" in the event that cluster manipulation is not possible.
We looked at solr, too. While a lot of it just works out of the box, we have some bench time to explore this path as a learning exercise - learning things like Lucene.NET, improving our WF and WCF skills, and implementing ASP.NET MVC for a management front-end. Worst case scenario, we go with something like solr, but have gained experience in some skills we are looking to improve on.
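The two-local-folders idea from the list above could look roughly like this. It is a Python sketch rather than Lucene.NET, and `TwoSlotIndex`, the slot names and the flat file layout are assumptions for illustration only.

```python
import os


class TwoSlotIndex:
    """Alternate between two local folders so searches never read a half-copied index."""

    def __init__(self, base_dir):
        self.slots = [os.path.join(base_dir, "slot_a"), os.path.join(base_dir, "slot_b")]
        for slot in self.slots:
            os.makedirs(slot, exist_ok=True)
        self.active = 0  # index of the slot searchers should currently use

    @property
    def active_dir(self):
        return self.slots[self.active]

    @property
    def incoming_dir(self):
        return self.slots[1 - self.active]

    def receive(self, files):
        """Write new index files into the idle slot, then make it the active one."""
        target = self.incoming_dir
        for name in os.listdir(target):  # assumes a flat directory of files
            os.remove(os.path.join(target, name))
        for name, data in files.items():
            with open(os.path.join(target, name), "wb") as f:
                f.write(data)
        self.active = 1 - self.active  # searchers opened after this use the new slot
        return self.active_dir
```

The transfer lands in the idle folder while the active one keeps serving, so the only pause is the moment the searcher is reopened on the new folder.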
I'm creating the indices in the filesystem on the publishing backend machines and replicating them over to the marketing nodes.
That way every single load-balanced, failover-capable node has its own index without network latency.
Only drawback is, you shouldn't try to recreate the index within the replicated folder, as you'll have the lockfile lying around at every node, blocking the indexreader until your reindex finished.