Do companies that provide APIs use a shim or proxy in front of their APIs? - api

I'm researching how large companies manage their public APIs. I'm thinking of companies with mature established APIs such as Google, Facebook, Twitter, and Amazon.
These companies have a number of different APIs that they expose to the public. Google, for example, has Plus, AdSense, AdWords etc. APIs that are publicly consumable. I'd like to understand if they use a cluster of reverse-proxy servers in front of those APIs to provide common functionality so that their specialist API servers don't need to implement that.
For example: Throttling and Authentication could be handled at this layer instead of implementing it in each API cluster.
The questions: Does anyone use a shim or reverse proxy in front of their APIs to handle common tasks? What are the use cases that make a reverse-proxy a good or bad idea for a cluster of API servers?

Most large companies explore a variety of things to handle the traffic and load on their servers. Roughly speaking:
A load balancer sits between the entry point and the actual client.
A reverse proxy often times sits between these to handle static files, pre-computed/rendered views, and other such largely static assets.
Any cast is used for DNS purposes, so that you are routed towards the nearest server that handles that URL.
Back pressure is employed in systems to limit the amount of requests feeding through a single pipeline and so that services don't tip over.
Memcached, Redis and the like are used as short term caches. That is, if it's going to roughly be the same result every 5 seconds, then that result can be cached in memory for faster delivery. Some proxies can be configured to read out of these.
If you're really interested, start reading some of the Netflix blog. Take a look at some of the open source they've used like Hystrix or Zuul. You can also take a look at some of their videos. They make heavy use of proxies and have built in some very advanced distributed behavior.
As far as a reverse proxy being a good idea, think in terms of failure. If your service calls out to another API by direct route and that service fails, then your service will fail and cascade upwards to the end user. On the other hand, if it's hitting a reverse proxy, then that proxy can be configured or even auto detect failures and divert traffic to back up servers.
As far as a reverse proxy being a good idea, think in terms of load. Sometimes servers can only handle a fraction of the traffic individually so that load must be shared on many servers. This is true not just of CPU capped but also IO capped resources (even if the return signal itself will not be the cause of the IO capping.)
Daisy chaining like this presents its own special little hell but it's sometimes unavoidable. On the downsides and what makes it a really bad choice if you can avoid it at all costs is a loss of deterministic behavior. Sometimes the stupidest things will bring your servers down. And by stupid, I mean, really, really dumb stuff that you never thought in a million years might bite you in the butt (think server clocks out of sync.) You have to start using rolling deploys of code, take down servers manually or forcefully if they stop responding, and keep those proxy configs in good order.
HTTP1.1 support can also be an issue. Not all reverse proxy adhere to the spec. In fact, some of them only cover ~50%. HAProxy does not do SSL. If you're only limited hardware then thread based proxy can unexpectedly swamp the system with threads.
Finally, adding in a proxy is one more thing that will break (not can, will.) You have to monitor them just like any piece of the platform, aggregate their logs, and run mock drills on them too.

Related

How Can I use Apache to load balance Marklogic Cluster

Hi I am new to Marklogic and Apache. I have been provided task to use apache as loadbalancer for our Marklogic cluster of 3 machines. Marklogic cluster is currently running on Linux servers.
How can we achieve this? Any information regarding this would be helpful.
You could use mod_proxy_balancer. How you configure it depends what MarkLogic client you would like to use. If you would like to use the Java Client API, please follow the second example here to allow apache to generate stickiness cookies. If you would like to use XCC, please configure it to use the ML-Server-generated or backend-generated "SessionID" cookie.
The difference here is that XCC uses sessions whereas the Java Client API builds on the REST API which is stateless, so there are no sessions. However, even in the Java Client API when you use multi-request transactions, that imposes state for the duration of that transaction so the load balancer needs a way to route requests during that transaction to the correct node in the MarkLogic cluster. The stickiness cookie will be resent by the Java Client API with every request that uses a Transaction so the load balancer can maintain that stickiness for requests related to that transaction.
As always, do some testing of your configuration to make sure you got it right. Properly configuring apache plugins is an advanced skill. Since you are new to apache, your best hope of ensuring you got it right is checking with an HTTP monitoring tool like WireShark to look at the HTTP traffic from your application to MarkLogic Server to make sure things are going to the correct node in the cluster as expected.
Note that even with the client APIs (Java, Node.js) its not always obvious or explicit at the language API layer what might cause a session to be created. Explicitly creating multi statement transactions definately will, but other operations may do so as well. If you are using the same connection for UI (browser) and API (REST or XCC) then the browser app is likely to be doing things that create session state.
The safest, but least flexable configuration is "TCP Session Affinity". If they are supported they will eliminate most concerns related to load balancing. Cookie Session Affinity relies on guarenteeing that the load balencer uses the correct cookie. Not all code is equal. I have had cases where it the load balancer didn't always use the cookie provided. Changing the configuration to "Load Balancer provided Cookie Affinity" fixed that.
None of this is needed if all your communications are stateless at the TCP layer, the HTTP layer and the app layer. The later cannot be inferred by the server.
Another conern is if your app or middle tier is co-resident with other apps or the same app connecting to the same load balancer and port. That can be difficult to make sure there are no 'crossed wires' . When ML gets a request it associates its identity with the client IP and port. Even without load balencers, most modern HTTP and TCP client libraries implement socket caching. A great perfomrance win, but a hidden source of subtle random severe errors if the library or app are sharing "cookie jars" (not uncomnon). A TCP and Cookie Jar cache used by different application contexts can end up sending state information from one unrelated app in the same process to another. Mostly this is in middle tier app servers that may simply pass on requests from the first tier without domain knowledge, presuming that relying on the low level TCP libraries to "do the right thing" ... They are doing the right thing -- for the use case the library programmers had in mind -- don't assume that your case is the one the library authors assumed. The symptoms tend to be very rare but catastrophic problems with transaction failures and possibly data corruption
and security problems (at an application layer) because the server cannot tell the difference between 2 connections from the same middle tier.
Sometimes a better strategy is to load balance between the first tier and the middle tier, and directly connect from the middle tier to MarkLogic.
Especially if caching is done at the load balancer. Its more common for caching to be useful between the middle tier and the client then the middle tier and the server. This is also more analogous to the classic 3 tier architecture used with RDBMS's .. where load balancing is between the client and business logic tiers not between business logic and database.

Designing scaling services with high available load-balancing

When running self-healing, scalable stateless services in frameworks such as marathon, the "affirmed" pattern is to have a tool for service discovery (e.g., bamboo) that feeds a load-balancer (e.g., HAProxy), preferrably with some automatic configuration, so that users can be proxied to services when hitting the load-balancer.
I don't seem to find much material about how to make the load-balancer itself highly available.
If the host that runs the load-balancer dies, I would like the services to still be accessed on the same URIs without downtimes.
What I desire can be achieved with Pacemaker/Corosync, but the fact that this specific point is often omitted in the various tutorials and blog posts, makes me think that maybe there is a simpler pattern or that I am overlooking the problem.
Do you have any suggestions?
One common approach is to run multiple machines with Bamboo/HAProxy on them pulling container locations from the Marathon/Mesos masters. In an AWS/Cloud-du-jour environment, these proxy machines can go behind an ELB (or even in an Autoscaling Group if your automation is good enough) to give you true HA functionality.

The Node.js event loop - nginx/apache

Both nginx and Node.js have event loops to handle requests. I put nginx in front of Node.js as has been recommended here
Using Node.js only vs. using Node.js with Apache/Nginx
with the setup shown here
Node.js + Nginx - What now?
How do the two event loops play together? Is there any risk of conflicts between the two? I wonder because Nginx may not be able to handle as many events per second as Node.js or vice versa. For example, if Nginx can handle 1000 events per second but node.js only 500, won't that cause issues? (I have no idea if 1000,500 are reasonable orders of magnitude, you could correct me on that.)
What about putting Apache in front of Node.js? Apache has no event loop. Just threads. So won't putting Apache in front of Node.js defeat the purpose?
In this 2010 talk, Node.js creator Ryan Dahl had vision to get rid of nginx/apache/whatever entirely and make node talk directly to the internet. When do you think this will be reality?
Both nginx and Node use an asynchronous and event-driven approach. The communication between them will go more or less like this:
nginx receives a request
nginx forwards the request to the Node process and immediately goes back to wait for more requests
Node receives the request from nginx
Node handles the request with minimal CPU usage, until at some point it needs to issue one or more I/O requests (read from a database, write the response, etc). At this point it launches all these I/O requests and goes back to wait for more requests.
The above can repeat lots of times. You could have hundreds of thousands of requests all in a non-blocking wait state where nginx is waiting for Node and Node is waiting for I/O. And while this happens both nginx and Node are ready to accept even more requests!
Eventually async I/O started by the Node process will complete and a callback function will get invoked.
If there are still I/O requests that haven't completed for this request, then Node goes back to its loop one more time. It can also happen that once an I/O operation completes this data is consumed by the Node callback and then new I/O needs to happen, so Node can start more async I/O requests before going back to the loop.
Eventually all I/O operations started by Node for a particular request will be complete, including those that write the response back to nginx. So Node ends this request, and then as always goes back to its loop.
nginx receives an event indicating that response data has arrived for a request, so it takes that data and writes it back to the client, once again in a non-blocking fashion. When the response has been written to the client and event will trigger and nginx will then end the request.
You are asking about what would happen if nginx and Node can handle a different number of maximum connections. They really don't have a maximum, the maximum in general comes from operating system configuration, for example from the maximum number of open handles the system can have at a time or the CPU throughput. So your question does not really apply. If the system is configured correctly and all processes are I/O bound, neither nginx or Node will ever block.
Putting Apache in front of Node will only work well if you can guarantee that your Apache never blocks (i.e it never reaches its maximum connection limit). This is hard/impossible to achieve for large number of connections, because Apache uses an individual process or thread for each connection. nginx and Node scale really well, Apache does not.
Running Node without another server in front works fine and it should be okay for small/medium load sites. The reason putting a web server in front of it is preferred is that web servers like nginx come with features that Node does not have and you would need to implement yourself. Things like caching, load balancing, running multiple apps from the same server, etc.
I think your questions have been largely covered by some of the others answers, but there are a few pieces missing, and some that I disagree with, so here are mine:
The event loops are isolated from each other at the process level, but do interact. The issues you're most likely to encounter are around the configuration of nginx response buffers, chunked data, etc. but this is optimisation rather than error resolution.
As you point out, if you use Apache you're nullifying the benefit of using Node.js, i.e. massive concurrency and websockets. I wouldn't recommend doing that.
People are already using Node.js at the front of their stack. Searching for benchmarks returns some reasonable-looking results in Node's favour, so performance to my mind isn't an issue. However, there are still reasons to put Nginx in front of Node.
Security - Node has been given increasing scrutiny, but it's still young. You may not have problems here, but caution is often your friend.
Training - Ops staff that you hire will know how to manage Nginx, but the configuration and management of your custom Node app will only ever be understood by those people your developers successfully communicate it to. In some companies this is nobody.
Operational Flexibility - If you reach scale you might want to split out the serving of static content, purely to reduce the load on your app servers. You might want to split content amongst different domains and have it managed separately, or have different SSL or proxying behaviour for different domains or URL patterns. These are the things that are easy for Ops guys to configure in Nginx, but you'd have to code manually in a Node app.
The event loops are independent. Event loops are implemented at the application level, so neither cares what sort of architecture the other uses.
NodeJS is good at many things, but there are some places where it still falters. Once example is serving static files. At the moment, nodejs performs fairly poorly in this test, so having a dedicated web server for your static files greatly improves response time. Also, nodejs is still in its infancy, and has not been "tested and hardened" in the matters of security like Apache on nginX.
It'll take a long time for people to consider fronting nodejs all by itself. The cluster module is a step in the right direction, but it'll take a long time even after it reaches v1 before it happens.
Both event loops are unrelated. They don't play together.
Yes, it is pretty useless. Apache is not a load balancer.
What Ryan Dahl said may be applicable already. The limit of concurrent users is definitely higher than that of Apache. Before node.js websites with fair amount of concurrent users had to use nginx to balance the load. For small to medium sized businesses it can be done with node.js alone. But ruling out nginx completely will take time. Let node.js be stable before it can follow this ambitious dream.

Load balancer - how to write one for a custom application?

I've written a simple server application which will run distributed on several machines.
My question is how does a network load balancer works, in general?
I've heard of round-robin and other algorithms, but what I haven't got answer to is how does the process really goes? In socket terms.
The client connects to one of the load balancer machines, asks for a "free-to-connect-to" server and simply connects to it?
That's the simpliest way I can think of.
.. or, does it use the load balancer as a proxy (that implies that all the NBs must be always connected to the application servers, and data is transferred through them)?
It's more of a general question. How would you do this?
Thank you all!
There are several different ways to load balance an application. Some are physical devices that sit between your router and the servers; some are software based with a bit of code that runs on each of the load balanced devices.
Microsoft has load balancing built into Windows which is all software based. It's pretty good and easy to set up.
However, I'll cover the physical route.
There are several algorithms here, but the main one is Round Robin with an option for "sticky" sessions. Sticky in this case means that the load balancer will try to keep a history of clients and forward requests from the same client to the same machine. This means the load balancer needs to keep a list of clients and where it directed those clients. Depending on cache size, clients may fall off the list and on future requests they may be forwarded to a different server.
Round Robin is a pretty simple idea. For each request that comes in send it to the next server in the list. More complicated algorithms might take into account how many requests go to a particular server and how long are those requests taking; then try to rebalance new requests to favor faster servers. This part is complicated though.

What would be the best approach to designing a highly available pool of web services?

I've heard a lot of people touting success using Linux based proxies to handle routing for high availability of web applications, but what are others doing with web services? I have a bank of WCF services that need to be moved to a high availability (failover) model, meaning that if a particular server hosting the WCF services goes down, the request is routed to another of the servers in the bank. I would rather stay away from implementing a Linux based solution, since there are no Linux knowledgeable people in the environment.
If you don't need durability, you can load balance WCF service requests just like normal web requests without doing anything special. If you need durability and want requests to survive being cut off mid-process, use the netMsmqBinding.
I would rather stay away from
implementing a Linux based solution,
since there are no Linux knowledgeable
people in the environment.
This is probably a strong enough reason to not use a Linux-based solution. Doing what you describe well requires reasonable expertise beyond a simple recipe approach, and substantial maintenance.