Website Hardware Scaling - hardware

So I was listening to the latest Stackoverflow podcast (episode 19), and Jeff and Joel talked a bit about scaling server hardware as a website grows. From what Joel was saying, the first few steps are pretty standard:
One server running both the webserver and the database (the current Stackoverflow setup)
One webserver and one database server
Two load-balanced webservers and one database server
They didn't talk much about what comes next though. Do you add more webservers? Another database server? Replicate this three-machine cluster in a different datacenter for redundancy? Where does a web startup go from here in the hardware department?

A reasonable setup supporting an "average" web application might evolve as follows:
Single combined application/database server
Separate database on a different machine
Second application server with DNS round-robin (poor man's load balancing) or, e.g. Perlbal
Second, replicated database server (for read loads, requires some application logic changes so eligible database reads go to a slave)
At this point, evaluating the current state of affairs would help to determine a better scaling path. For example, if read load is high and content doesn't change too often, it might be better to emphasise caching and introduce dedicated front-end caches, e.g. Squid to avoid un-needed database reads, although you will need to consider how to maintain cache coherency, typically in the application.
On the other hand, if content changes reasonably often, then you will probably prefer a more spread-out solution; introduce a few more application servers and database slaves to help mitigate the effects, and use object caching, such as memcached to avoid hitting the database for the less volatile content.
For most sites, this is probably enough, although if you do become a global phenomenon, then you'll probably want to start considering having hardware in regional data centres, and using tricks such as geographic load balancing to direct visitors to the closest "cluster". By that point, you'll probably be in a position to hire engineers who can really fine-tune things.
Probably the most valuable scaling advice I can think of would be to avoid worrying about it all far too soon; concentrate on developing a service people are going to want to use, and making the application reasonably robust. Some easy early optimisations are to make sure your database design is fairly solid, and that indexes are set up so you're not doing anything painfully crazy; also, make sure the application emits cache-control headers that direct browsers on how to cache the data. Doing this sort of work early on in the design can yield benefits later, especially when you don't have to rework the entire thing to deal with cache coherency issues.
The second most valuable piece of advice I want to put across is that you shouldn't assume what works for some other web site will work for you; check your logs, run some analysis on your traffic and profile your application - see where your bottlenecks are and resolve them.

plenty of fish Architecture
some interesitng videos:
Youtube scalibility
Inteview with Dan Farino, System Architect at Myspace

Joel mentioned adding a second datacenter, with the same setup, and then assigning your users randomly to each. Changes to the data are logged and sent from one location to the other, so that both locations contain all the data.

The talk Scalable Web Architectures Common Patterns & Approaches from Cal Henderson (Yahoo) on Web 2.0 Expo was quite interesting. I thought there was an video, but I could not find it. But here are the slides:
http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches

A certain next step would be a cluster of webservers (a web farm) and a clustered system of database servers (replication or Oracle RAC etc. etc.)

If your interested in caching and using .Net, look into the application caching block in enterprise library (of course use this along with the other points above).

Related

Can you switch programming languages with the same database

Say I have a python/Django website fully built. I now want to re-create that website with Ruby on Rails or some other language. Is this possible to keep the same database? Or would I have to transfer data between the two databases?
Yes, it is possible to keep the same database in the new application that you used from the old. You can even have multiple applications use the same database as the same time.
However, you should not just port everything over query for query. There are likely subtleties in the old application that will be easy to miss... places where logic one would normally expect to live in the database instead lives in the application. There are also likely decisions regarding database structure that were made to accommodate quirks or abilities of the old environment that no longer make sense for the new.
The result is this a common point where you might also create a service layer. With a service layer, neither application talks to the database directly. Instead, they both talk to a service application that mediates access to the DB.
This new service layer helps make sure business logic is consistent across applications, without drifting between the two platforms. It helps avoid duplicating work. It helps with performance by creating an obvious place for things like a heroku caching layer, and by making it easier to scale data access across multiple servers.

I want a sandboxed test environment that is *always* an exact copy of Production

I'm having an issue with a web application I am responsible for maintaining.
The system experiences regular bugs, and our support vendors are always asking us to see if we can "replicate the error in UAT". This is obviously a reasonable request. A lot of the time, for various reasons (some of which are clear, some of which are not), these errors are not present in UAT. This lack of bug reproducability in a testing environment is adding huge amounts of friction to the bug resolution process.
There are 3 key pieces of our system architecture where these bugs are flaring (the CMS, the API layer, and the database). I am proposing we set up a system job that perpetually clones these 3 parts of the system in to a sandboxed test environment. This cloning would happen periodically (eg, once every 24 hours), and automatically.
Is there a technical term for this sort of environment? Is this an established method of helping diagnose system issues? Is there somewhere I can read up on the industry best practices for establishing something like this? Thanks.
The technical term for this kind of process is replication it is often done for some systems like databases, but normally not for testing purpose, but in order to increase available, so the replication is used as a failover spare.
An exact copy of a production system, with all the data is not you'll find often, due to the high demand on resources. Also at some points to two systems have to differ. Most systems (I know of) have tons of interfaces you just can't copy a complete system systems.
Also: you only need the copy of the production system when you actually debugging an issue. And if you are in the middle of that you probably don't want everything to go away and get replaced by a new copy.
So instead I would recommend to setup scripts that allows to obtain a copy of the relevant parts on demand.
Also you might want to consider how you might be able to modify your system to make it easier to setup a copy.
For example, when you have all the setup automated (with chef/docker or similar) you should be able to setup the same system again anywhere you want, so you now you just have to get the production data over.
Which is an interesting point. Production data often contains secret information (because it is vital to the business, or because it is personal data). You don't want this kind of stuff hang around in a test system everybody can access.

how can we integrate two rails applications deployed within an intranet

Is RESTful services the only route for integrating any application with a rails applications including any other rails applications irrespective of whether it is in same network or not?
For integrating two applications how heavy is a RESTful service compared to the RMI based integration available in other technologies like Java EE?
Is there way to integrate two rails applications using any natively understood binary format which can avoid transformation to a different format ex: HTTP request.
The REST approach means simply that application A will make requests of application B (and potentially the other way around) using the HTTP protocol. The data send can be in whatever format you like, although JSON is the default today (and XML was the default yesterday, and even ... SOAP -- gaq!).
These days, the vast majority of external APIs are implemented this way -- Amazon, Google Maps, Yelp, etc, etc, etc. Why? Because the HTTP (or HTTPS) protocol is well understood and widely deployed. No special configuration is required and the same protocol that serves the application to regular people on web browsers works for other applications. Rails makes this brilliantly easy (if you go with the flow).
Java's RMI is a specific protocol (just as HTTP is). The advantage is that objects defined in A are available as instances in B (after a great deal of work in both). This really makes sense when you have a set of applications all designed up front to work together and whose main requirement is to be distributed across locations, servers, etc. RMI creates a tight binding between applications -- a change in one typically requires a change in the other. It's right for some kinds of applications.
But if you have, for example, two departments in a company who talk to each other, but don't want to be "bound at the hip", a REST interface provides a great deal of flexibility.
Your second question ("how heavy") is very difficult to answer. A company I worked for in 2001 had hundreds of servers all running an instance of a "worker" process -- they were all designed to queue their results to a "controller" process which would process the output and forward to another set servers designed to process and manage the data. In 2001, this was the right architecture because it was completely designed to work together -- persistent socket connections on a single subnet of our intranet running on a room full of servers. Now in 2012, that room full of servers is replaced by a few high-powered processors running 64-bit OS and addressing massive amounts of memory -- it's a whole new world. A doubling of performance in 2001 could save potentially millions of dollars of hardware, operational support, space and so on. In 2012, the most expensive thing is good developers! So "heavy" is really kind of irrelevant in all but the most compute-intensive operations these days. An HTTP request is light and simple.
Final question: natively understood binary format. Sure, if needed. In the end, any binary format that is sent over the wire between two servers needs to be serialized and de-serialized as a stream, and this is work, both for programmers and for machines. JSON is a text format, but one natively understood by JavaScript (JavaScript Object Notation) and has the distinct advantage of being human-readable. Given that most servers are set up to compress output automatically whether something is text or binary becomes kind of less relevant, at least as far as I/O and payload goes. Of course you can come up with any mutually understood format and send it over HTTP, but again, this is something that mattered a decade ago, and today is usually not an issue worth considering. Processors have been getting faster and faster, and memory cheaper (and bigger) -- so (as always) I/O (whether network or disk) is the typical bottleneck in modern applications.
If I were to re-design the application I mentioned from 2001 where hundreds of (today's) servers needed to communicate with (many) peer servers very specifically designed to interoperate, I might work to make sure that the serialize/deserialize process was as lightweight as possible (but only if it turned out to be a bottleneck). For me, being bound to any given platform or language is a non-starter -- the computing world is moving way to fast.
But in almost all realistic business applications today, keeping things simple, standard, and straightforward has both present and future benefits that make the need to worry obsessively about performance a thing of the past.
Hope this helps :-)

wamp apache - polling server continuously

I've heard polling the server is not the best of ideas.
Let's say I make a client-server application.
A simple game for example.
Where each client polls the server every half a minute.
How many clients is it possible to have before it overloads a wamp server?
Basically how robust is Apache for this kind of stuff?
Getting a request, aggregating data from mysql server, and then returning the data in an xml format.
This is a really open ended question. It entirely depends on your configuration, how many apache services do you have running, how many physical servers do you have, how is your mysql server setup (is it on it's own machine)? You want to also keep in mind that by polling the server you have to initiate a connection each time and allocate resources for that communication (in the lower level networking and your program).
If at all possible it might be better for the server to push content to the client (assuming a push happens less frequently than polls happen).
My guess is you will maxout whatever the request is handling and the mysql database before Apache server becomes a problem.
However, going out on a limb, if proper caching is in place, and a sufficiently smart architecture and pragmatic design, and effectively only the 30 sec polling, you should be able to support a couple 1000 users.
But : mock it up, quick and dirty, do roughly what you think you will be doing, and hit it with JMeter (or similar) with 3, 10, 30, 100, 300, 1000, 3000, ... till you find the performance wall.
I am scared about the aggregating data part... if aggregation is really needed, carefully architect it so you do not need to go to the DB, because this will kill you in production, and you wont find it in development.
Can't really give you numbers. It depends on a bunch of factors (including hardware).
What's more important to you: How many concurrent users you can support or how real-time the results are?
If you looking for real-time results, you probably want to investigate something like Comet or long polling.
If you're looking for supporting a lot of users, the long polling approach probably isn't ideal and you'll probably want something more lightweight than Apache. Personally, I'm a fan of nginx.
EDIT: And if you're feeling really hip, your best bet for real-time results here is Web Sockets, but if you're a microsoft guy this isn't going to do you much good as IE doesn't support them.

PyAMF backend choices!

I've been using PyAMF to write a backend for a flex app that will request different groups of hundreds of different images depending on what the client needs. I have been using the "simple_server" WSGI server that PyAMF supplies while developing the flex code. Now I'm ready to write a robust backend that will be able to pull images from a mySQL database and send them as fast as possible and as efficiently as possible to many concurrent clients.
The PyAMF documentation is great because they supply many examples to follow, however I am confused about what kind of backend I am trying to create.
Do I want a SocketServer or a WSGI server or something like Twisted or web2py or Tornado? Are these even all different? :) Should I be using Apache modules instead (mod_wsgi or modjy or mod_python)?
I realize that this probably touches on many open debates, so maybe you could just point me to any good summaries of these debates?
Its great to have so many options, but how do I choose?
The short answer is, of course, that it depends on the requirements of your project.
How many concurrent connections is "a lot"?
How much programmer time can you throw at the problem?
How much hardware can you throw at the problem?
...etc...
If you plan to have lots of concurrent clients, it's hard to beat Twisted in the Python world. However, you'll have to deal with your database asynchronously to avoid blocking, and depending on how complex your database interactions are, this can be a bit of a pain. You're basically limited to either using twisted.enterprise.adbapi or coming up with your own twisted-ORM integration.
If you'd rather have "easy" database code (i.e. you want to use an ORM), you're better off going with a (TurboGears/Pylons/plain wsgi) project, probably hosted using Apache and mod_wsgi. This can be a pretty scalable solution, and you get a lot of stuff for free using these frameworks, but it may be more than you need.
I would avoid using one of the many plain python wsgi servers out there (wsgiref, paster, etc.) in production if you really want high performance.
Good Luck!