Performance Penalty of Multiple VHosts in RabbitMQ?

Is there a performance penalty to running many vhosts as opposed to many exchanges? I have to support thousands of different clients, and I am trying to decide whether each client should receive its own vhost, or whether I should have one vhost and give each client its own exchange. Which is the better choice vis-a-vis performance and resource utilization?

Received an answer from the RabbitMQ team: vhosts actually only exist as a security context, so they have almost no overhead; there should be very little cost to using many of them.
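To illustrate the two layouts, here is a minimal sketch using the RabbitMQ Java client; the client id "client-42", the exchange names, and the host are made up, and vhosts must be created up front (e.g. with rabbitmqctl add_vhost):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class VhostVsExchange {
    public static void main(String[] args) throws Exception {
        // Layout A: one vhost per client. The vhost is chosen at connection
        // time, so each client gets its own connection, and the vhost must
        // already exist (e.g. "rabbitmqctl add_vhost client-42").
        ConnectionFactory perClient = new ConnectionFactory();
        perClient.setHost("localhost");
        perClient.setVirtualHost("client-42"); // hypothetical client id
        try (Connection conn = perClient.newConnection();
             Channel ch = conn.createChannel()) {
            ch.exchangeDeclare("events", "topic", true);
        }

        // Layout B: one shared vhost, one exchange per client. A single
        // connection can serve many clients; isolation comes from naming
        // conventions and per-resource permissions, not from the vhost.
        ConnectionFactory shared = new ConnectionFactory();
        shared.setHost("localhost"); // default vhost "/"
        try (Connection conn = shared.newConnection();
             Channel ch = conn.createChannel()) {
            ch.exchangeDeclare("client-42.events", "topic", true);
        }
    }
}
```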

Related

Failover with Spring AMQP and RabbitMQ HA

There are multiple articles suggesting that a load balancer should be used in front of a RabbitMQ cluster.
However, there are also multiple references saying that Spring AMQP has some failover implementation of its own, such as re-establishing the connection when a broker comes back to life.
I have several questions regarding this topic (given that those articles are more or less old, and it's 2018 today):
When using Spring AMQP, is load balancing still required?
If load balancing is still suggested, how would I preserve the affinity of a primary queue to its node? There would be a lot of inter-node traffic within the cluster, because a round-robin load balancer would hit the correct cluster node only 1/n of the time (i.e., it would miss at a rate of 1-(1/n)).
Does Spring AMQP support some kind of topology awareness that would allow it to consume from the correct node?
There were some articles suggesting that clients should publish/consume to nodes respecting the locality of queues. Does this still apply? How does this all fit together given load balancing, Spring AMQP failover, and the CachingConnectionFactory?
Can anybody please provide answers to these topics, along with relevant references for further verification?
Thanks a lot
For each of your bullets:
A load balancer makes little sense with the default configuration of Spring AMQP, since it opens a single, long-lived connection that is shared across all consumers. In 2.0, you can configure the RabbitTemplate to use a separate connection; using a different connection for publishers and consumers is a recommended configuration, and it will be the default in 2.1.
It might make sense to use a load balancer if you configure the connection factory to cache connections (instead of just channels), since then each component gets its own connection.
See the next bullet.
See Queue Affinity and the LocalizedQueueConnectionFactory. It uses the management plugin to determine which node currently hosts the queue and connects to that node directly. It will not work through a load balancer, since it needs to connect to the actual node. (A configuration sketch for these bullets follows this list.)
It is my understanding from several discussions that queue affinity is only needed in the most extreme environments, and that in most environments the difference is immeasurable. However, environments and networks differ so much that YMMV; you may want to test. My general rule of thumb is to avoid premature optimization, since the added complexity of the configuration may simply not be worth the benefit (and you may not have a problem in the first place).
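As promised, a minimal sketch of this configuration against the Spring AMQP 2.0 API. The hosts, node names, and credentials are placeholders, and the LocalizedQueueConnectionFactory constructor signature varies a little between versions, so treat this as an outline rather than drop-in configuration:

```java
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.amqp.rabbit.connection.LocalizedQueueConnectionFactory;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class RabbitConfigSketch {

    public RabbitTemplate rabbitTemplate() {
        CachingConnectionFactory cf = new CachingConnectionFactory();
        cf.setAddresses("rabbit1:5672,rabbit2:5672,rabbit3:5672"); // placeholder nodes

        // Cache whole connections instead of channels on one shared
        // connection; in this mode a load balancer can make sense,
        // because each component gets its own connection.
        cf.setCacheMode(CachingConnectionFactory.CacheMode.CONNECTION);

        RabbitTemplate template = new RabbitTemplate(cf);
        // 2.0: publish on a connection separate from the consumers'
        // connection (the recommended setup, default from 2.1 on).
        template.setUsePublisherConnection(true);
        return template;
    }

    // Queue affinity: ask the management plugin which node hosts the queue
    // and connect to it directly, bypassing any load balancer. Constructor
    // arguments are abbreviated and approximate; check your version.
    public LocalizedQueueConnectionFactory lqcf(CachingConnectionFactory defaultCf) {
        return new LocalizedQueueConnectionFactory(defaultCf,
                new String[] { "rabbit1:5672", "rabbit2:5672", "rabbit3:5672" },
                new String[] { "http://rabbit1:15672/api/", "http://rabbit2:15672/api/",
                        "http://rabbit3:15672/api/" },
                new String[] { "rabbit@rabbit1", "rabbit@rabbit2", "rabbit@rabbit3" },
                "/", "guest", "guest", false, null);
    }
}
```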

Apache or Nginx? I'd like to understand the basic working of Nginx, its advantages and disadvantages

What are the pros and cons of Apache versus Nginx, and how do they work internally to maximize resource utilization?
Can I use Apache and Nginx together? If I use only Nginx, what problems might I face?
Apache has some disadvantages, especially when it is used with the PHP module.
Apache's traditional process model (the prefork MPM, which mod_php effectively requires) uses a separate process for each connection. Each process carries all the overhead of PHP and any other modules you may have loaded with it. An Apache process might run a PHP script or serve static content for one request. If the PHP has a memory leak (which does happen sometimes), the process continues to grow in size. Also, when KeepAlive is enabled, which is usually recommended, that process stays alive for a few seconds after the connection, consuming a "slot" that another client might be able to use and helping the server reach its MaxClients limit sooner.
Nginx is an alternative webserver that normally uses the Linux "epoll" API to process requests in a non-blocking mode. This means that one single process can handle many simultaneous connections. Epoll is an efficient way to tell the single process which connection(s) it needs to deal with and which can wait. Nginx has a goal of solving the "C10k" problem - how to have 10,000 concurrent connections.
This naturally goes hand in hand with php-fpm, the FastCGI Process Manager. Nginx itself does not have PHP built-in. When it receives a request for a PHP script, it makes a call out to php-fpm to run the script, which then returns the result to nginx, which returns it to the client.
This all uses a lot less memory than a similar Apache+mod_php configuration.
There are a couple more huge advantages of php-fpm over mod_php:
It uses different "pools", each of which can run as a separate Linux user. This provides a simple and effective way of isolating websites (for example, if they are run by different customers who should not read each other's code) without the overhead or nastiness of suexec or suphp.
It has a slow log feature where it can dump a PHP stack trace of any script that has been running for greater than X seconds. This can help diagnose slow code issues.
Php-fpm can be run with Apache, and in fact this allows you to take advantage of Apache's more efficient Worker MPM (or Event in Apache 2.4). However, my experience is that configuring it in Apache is significantly more complex than configuring it in nginx, and even with Worker, it is still not quite as efficient as nginx.
Disadvantages of moving to nginx - not many, but things to keep in mind:
It does not support .htaccess files. I think this is a good thing personally as .htaccess files must be parsed by Apache for every request, which can cause significant overhead.
Configuration files need to be re-written. If you have many complex site configurations, this could take some doing. For simple cases it is not usually a big deal.
Features of Nginx
Nginx is fast because it does not need to create a new process for each new request.
HTTP proxy and web server features
Ability to handle more than 10,000 simultaneous connections with a low memory footprint (~2.5 MB per 10k inactive HTTP keep-alive connections)
Handling of static files, index files, and auto-indexing
Reverse proxy with caching
Load balancing with in-band health checks
Fault tolerance
Very low memory use, especially for static web pages
FastCGI, SCGI, and uWSGI support with caching
Name- and IP address-based virtual servers
IPv6 compatibility
SPDY protocol support
FLV and MP4 streaming
Web page access authentication
gzip compression and decompression
URL rewriting with its own rewrite engine
Custom logging with on-the-fly gzip compression
Response rate and concurrent request limiting
Bandwidth throttling
Server Side Includes
IP address-based geolocation
User tracking
WebDAV
XSLT data processing
Embedded Perl scripting
Highly scalable, with good performance even on modest hardware
With only Nginx, you lose a whole bunch of Apache-specific features, such as all the mod_dav functionality; effectively, you lose access to Apache's large ecosystem of modules.
Conclusion
The best use for nginx is in front of Apache, if you need Apache modules. Use it as a load balancer, if you like, between multiple Apache instances, and you have a mixed setup that is rather flexible.

Do companies that provide APIs use a shim or proxy in front of their APIs?

I'm researching how large companies manage their public APIs. I'm thinking of companies with mature established APIs such as Google, Facebook, Twitter, and Amazon.
These companies have a number of different APIs that they expose to the public. Google, for example, has Plus, AdSense, AdWords etc. APIs that are publicly consumable. I'd like to understand if they use a cluster of reverse-proxy servers in front of those APIs to provide common functionality so that their specialist API servers don't need to implement that.
For example: Throttling and Authentication could be handled at this layer instead of implementing it in each API cluster.
The questions: Does anyone use a shim or reverse proxy in front of their APIs to handle common tasks? What are the use cases that make a reverse-proxy a good or bad idea for a cluster of API servers?
Most large companies explore a variety of things to handle the traffic and load on their servers. Roughly speaking:
A load balancer sits between the entry point and the actual servers.
A reverse proxy oftentimes sits between these to handle static files, pre-computed/rendered views, and other such largely static assets.
Anycast is used at the DNS level, so that you are routed towards the nearest server that handles that URL.
Back pressure is employed in systems to limit the number of requests feeding through a single pipeline, so that services don't tip over.
Memcached, Redis, and the like are used as short-term caches. That is, if a request will return roughly the same result every 5 seconds, then that result can be cached in memory for faster delivery. Some proxies can be configured to read out of these. (A minimal cache-aside sketch follows this list.)
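As a concrete illustration of that last point, here is a minimal cache-aside sketch in Java using the Jedis client; the host, the 5-second TTL, and computeExpensiveResult are all placeholders:

```java
import redis.clients.jedis.Jedis;

public class ShortTermCache {

    private final Jedis jedis = new Jedis("localhost", 6379); // placeholder host

    public String fetch(String key) {
        // Serve from the short-term cache when possible.
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;
        }
        // Otherwise compute (or call the backend) and cache the result
        // for 5 seconds, so repeated identical requests hit memory.
        String fresh = computeExpensiveResult(key);
        jedis.setex(key, 5, fresh);
        return fresh;
    }

    private String computeExpensiveResult(String key) {
        return "result-for-" + key; // stand-in for the real backend call
    }
}
```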
If you're really interested, start reading some of the Netflix tech blog. Take a look at some of the open-source tools they've released, like Hystrix or Zuul. You can also take a look at some of their videos. They make heavy use of proxies and have built some very advanced distributed behavior.
As far as a reverse proxy being a good idea, think in terms of failure. If your service calls out to another API directly and that service fails, then your service will fail and the failure will cascade up to the end user. On the other hand, if the call goes through a reverse proxy, then that proxy can be configured to handle, or even auto-detect, failures and divert traffic to backup servers.
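A toy Java sketch of that divert-on-failure behavior; the backend URLs are hypothetical, and a real proxy would add health checks, timeouts, and retry budgets:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class FailoverForwarder {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Primary first, then backups (hypothetical hosts).
    private final List<String> backends = List.of(
            "http://api-primary.internal", "http://api-backup.internal");

    public String forward(String path) throws InterruptedException {
        for (String backend : backends) {
            try {
                HttpRequest request = HttpRequest.newBuilder(
                        URI.create(backend + path)).build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() < 500) {
                    return response.body(); // a healthy backend answered
                }
            } catch (IOException e) {
                // Connection failure: divert to the next backend instead of
                // cascading the error up to the end user.
            }
        }
        throw new IllegalStateException("all backends failed");
    }
}
```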
As far as a reverse proxy being a good idea, think in terms of load. Sometimes an individual server can only handle a fraction of the traffic, so the load must be shared across many servers. This is true not just of CPU-capped but also IO-capped resources (even if the return traffic itself is not what causes the IO cap).
Daisy chaining like this presents its own special little hell but it's sometimes unavoidable. On the downsides and what makes it a really bad choice if you can avoid it at all costs is a loss of deterministic behavior. Sometimes the stupidest things will bring your servers down. And by stupid, I mean, really, really dumb stuff that you never thought in a million years might bite you in the butt (think server clocks out of sync.) You have to start using rolling deploys of code, take down servers manually or forcefully if they stop responding, and keep those proxy configs in good order.
HTTP/1.1 support can also be an issue: not all reverse proxies adhere to the spec; in fact, some of them only cover around 50% of it. HAProxy did not handle SSL at the time this was written. And if you're on limited hardware, a thread-based proxy can unexpectedly swamp the system with threads.
Finally, adding in a proxy is one more thing that will break (not can, will.) You have to monitor them just like any piece of the platform, aggregate their logs, and run mock drills on them too.

What specifically makes Node.js more scalable than Apache?

To be honest, I haven't understood it completely yet, even though I do understand how Node.js works: as a single thread using the event model. I just don't get how this is better than Apache, and how it scales horizontally if it's single-threaded.
I've found that this blog post by Tomislav Capan explains it very well:
Why The Hell Would I Use Node.js? A Case-by-Case Introduction
My interpretation of the gist of it, for Node 0.10, compared to Apache:
The good parts
Node.js does not spin up a thread for each request, nor does it need to pool requests across a set of threads like Apache does. Therefore it has less overhead when handling requests, and excels at responding quickly.
Node.js can delegate execution of a request to a separate component, and focus on new requests until the delegated component returns with the processed result. This is asynchronous code, made possible by the eventing model. Apache executes requests serially within a pool, and cannot reuse a thread when one of its modules is simply waiting for a task to complete; Apache will then queue requests until a thread in the pool becomes available again.
Node.js speaks JavaScript and is therefore very fast at passing through and manipulating JSON retrieved from external web API sources like MongoDB, reducing the time needed per request. Apache modules, like PHP, may need more time, because they cannot parse and manipulate JSON as efficiently: they need to marshal the data before processing it.
The bad parts
Note: most of the bad parts listed below should improve in the upcoming version 0.12, something to be aware of.
Node.js is bad at computationally intensive tasks, because whenever it does something long-running, it queues all other incoming requests, due to its single thread. Apache will generally have more threads available, and the OS will neatly and fairly schedule CPU time between those threads, still allowing new requests to be handled, albeit a bit slower. Except when all available threads in Apache are handling requests; then Apache will also start queueing requests.
Node.js doesn't fully utilize multi-core CPUs, unless you make a Node.js cluster or spin up child processes. Ironically, if you do either of those, you may add more orchestration overhead, the same issue Apache has. Logically you could also spin up multiple independent Node.js processes, but those are not managed by Node.js itself. You would have to test your code to see what works better: 1) clustering and child processes managed from within Node.js, or 2) multiple independent Node.js processes.
Mitigations
All server platforms have an upper limit. Node.js and Apache both will reach it at some point.
Node.js will reach it the fastest when you have heavy computational tasks.
Apache will reach it the fastest when you throw tons of small requests at it that require long serial execution.
Three things you could do to scale the throughput of Node.js:
Utilize multi-core CPUs, by setting up a cluster, using child processes, or using a multi-process orchestrator like Phusion Passenger.
Set up worker roles connected by a message queue. This will be the most effective solution against computationally intensive, long-running requests: off-load them to a worker farm. This splits your servers into two parts: 1) public-facing clerical servers that accept requests from users, and 2) private worker servers handling the long-running tasks. Both are connected by a message queue. The clerical servers add messages (incoming long-running requests) to the queue. The worker roles listen for incoming messages, handle them, and may return the result to the message queue. If request/response is needed, the clerical server can asynchronously wait for the response message to arrive. Examples of message queues are RabbitMQ and ZeroMQ (a minimal worker sketch follows this list).
Set up a load balancer and spin up more servers. Now that you use hardware efficiently and delegate long-running tasks, you can scale horizontally. If you have a load balancer, you can add more clerical servers. Using a message queue, you can add more worker servers. You could even set this up in the cloud, so that you can scale on demand.
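The queue-and-workers split in the second bullet is language-agnostic; here it is sketched in Java with the RabbitMQ client. The queue name "long.tasks", the broker host, and the task-handling logic are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Durable queue the clerical servers publish long-running tasks to.
        channel.queueDeclare("long.tasks", true, false, false, null);
        channel.basicQos(1); // at most one unacked task per worker

        DeliverCallback onTask = (consumerTag, delivery) -> {
            String task = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // ... do the long-running work here, off the web tier ...
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // autoAck=false: a task is only removed once a worker finishes it,
        // so a crashed worker's task is redelivered to another worker.
        channel.basicConsume("long.tasks", false, onTask, consumerTag -> { });
    }
}
```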
It depends on how you use it. Node.js is single-threaded by default, but using the (relatively) new cluster module you can scale across multiple processes, typically one per core.
Furthermore, your database needs will also dictate how effective scaling is with node. For example, using MySQL with node.js won't get you nearly as much benefit as using MongoDB, because of the event driven nature of both MongoDB and node.js.
The following link has a lot of nice benchmarks of systems with different setups:
http://www.techempower.com/benchmarks/
Node.js doesn't rank the highest but compared to other setups using nginx (no apache on their tables, but close enough) it does pretty well.
Again though, it highly depends on your needs. I believe that if you are simply serving static websites, it is recommended you stick with a more traditional stack. However, people have done some amazing things with Node.js for other needs: http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/ (C10k? Ha!)
Edit: It is worth mentioning that you really aren't replacing just Apache with Node.js. You would be replacing Apache AND PHP (in a typical LAMP stack).

Nginx v Apache for high traffic sites

Would nginx be a more suitable choice as a web server for high traffic websites?
The site we will be building is an e-commerce site, if that makes a difference.
I am really interested in the actual "why" from a technical point of view either way, i.e., why would Nginx be a better choice for this type of site from a technical standpoint, or why wouldn't it?
Martin,
In general, Nginx is better for high-traffic sites due to its event-driven architecture. Rather than handling each request in a distinct thread, it uses non-blocking I/O to service many requests in each thread.
The important aspect of this architecture is the reduced use of processes or threads. A thread can consume anywhere from 2MB to over 64MB of RAM. So when Apache serves a 10KB JPEG, it may actually be using a significant amount of RAM. It becomes worse if you have slow clients (e.g. smartphones) where the request may keep a thread busy for several seconds.
Many people find that running Nginx as a proxy in front of Apache is an ideal middle ground. Nginx talks to the slow clients and can do so using a very small amount of RAM. When requests are forwarded to Apache, the request speed is limited by your local connectivity, not by that of the remote user. This means that the network bottleneck will not keep the request (and its memory-hogging thread) alive any longer than necessary.
In short you get the low-resource benefits of Nginx coupled with the wide feature-set of Apache.