How stable is RabbitMQ in production (using DRBD and Pacemaker)?

Looking for experience with RabbitMQ, especially in an HA configuration using Pacemaker and DRBD as recommended here:
The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.

Works most of the time. However, you'll have to pay special attention to fencing (split brain) when dealing with DRBD. On a production system it's always a pain to have to fix this kind of issue manually.
We failed to run RabbitMQ in a master/slave (multi-state RA) setup. We thought it would enhance availability. We're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity, that would be great to share!
I find that the lack of tools for debugging Pacemaker when there are issues is a big hurdle to deploying to live systems... It's not always clear what Pacemaker is "thinking" or doing. hb_report is not sufficient, unfortunately.
Hope this helps,

We tried a master/slave configuration as well; however, it became difficult to keep all instances up to date with no downtime. And trust me, you want to update RabbitMQ. There are always bugs popping up either in RabbitMQ itself or in Erlang.
We are getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has a generic "error while starting" in it and that's pretty much it. Sometimes it won't start after the crash, and most of those times the only solution is to delete all the persistent messages from all instances, so that the queue state is synchronized across the cluster. Other times it crashes immediately after launching and only loads properly after multiple repeated attempts. Meaning there is no added reliability whatsoever when using master/slave. At least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the message somewhere in the logs or in the database, from where it can be quickly recovered after a major crash.


I want a sandboxed test environment that is *always* an exact copy of Production

I'm having an issue with a web application I am responsible for maintaining.
The system experiences regular bugs, and our support vendors are always asking us to see if we can "replicate the error in UAT". This is obviously a reasonable request. A lot of the time, for various reasons (some of which are clear, some of which are not), these errors are not present in UAT. This lack of bug reproducibility in a testing environment is adding huge amounts of friction to the bug resolution process.
There are 3 key pieces of our system architecture where these bugs are flaring (the CMS, the API layer, and the database). I am proposing we set up a system job that perpetually clones these 3 parts of the system into a sandboxed test environment. This cloning would happen periodically (e.g., once every 24 hours), and automatically.
Is there a technical term for this sort of environment? Is this an established method of helping diagnose system issues? Is there somewhere I can read up on the industry best practices for establishing something like this? Thanks.
The technical term for this kind of process is replication. It is often done for some systems like databases, but normally not for testing purposes; rather, it is done to increase availability, so the replica is used as a failover spare.
An exact copy of a production system, with all the data, is not something you'll find often, due to the high demand on resources. Also, at some points the two systems have to differ. Most systems (that I know of) have tons of interfaces; you just can't copy a complete system.
Also: you only need the copy of the production system when you are actually debugging an issue. And if you are in the middle of that, you probably don't want everything to go away and get replaced by a new copy.
So instead I would recommend setting up scripts that allow you to obtain a copy of the relevant parts on demand.
You might also want to consider how you could modify your system to make it easier to set up a copy.
For example, when you have all the setup automated (with chef/docker or similar), you should be able to set up the same system again anywhere you want, so now you just have to get the production data over.
Which is an interesting point. Production data often contains secret information (because it is vital to the business, or because it is personal data). You don't want this kind of stuff hanging around in a test system everybody can access.
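To make that last point concrete, here is a minimal sketch of masking sensitive fields while copying rows into a test environment. The field names (email, phone, full_name) and the salt are purely hypothetical, and a salted hash like this is only a starting point, not proper anonymization; it just keeps values stable so that joins and lookups still line up in the copy.

```python
import hashlib

# Hypothetical column names; replace with whatever your schema treats as sensitive.
SENSITIVE_FIELDS = {"email", "phone", "full_name"}


def mask_value(value: str, salt: str = "test-env-salt") -> str:
    """Replace a sensitive value with a stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "masked-" + digest[:12]


def mask_row(row: dict) -> dict:
    """Return a copy of the row with sensitive fields masked and the rest untouched."""
    masked = {}
    for key, value in row.items():
        if key in SENSITIVE_FIELDS and value is not None:
            masked[key] = mask_value(str(value))
        else:
            masked[key] = value
    return masked


if __name__ == "__main__":
    print(mask_row({"id": 7, "email": "a@example.com", "status": "active"}))
```

A script like this would run as part of the on-demand copy, between exporting from production and importing into the test system.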

IWantToRunWhenBusStartsAndStops not for production?

New to NServiceBus (4.7.5) and just implemented an NSB host.exe hosted service (implementing IWantToRunWhenBusStartsAndStops) that detects changes to database tables and notifies subscribing web apps by publishing events, e.g. "CustomerDataWasUpdatedEvent". In the future we will perform the actual update through messagehandlers receiving commands obviously, but at the moment this publishing service just polls the database etc.
It all works well; however, approaching production, I noticed that David Boike, in his latest edition of "Learning NServiceBus", states that classes implementing IWantToRunWhenBusStartsAndStops are really mostly for development and rarely used in production. I set up my database change detection in the Start method and it works nicely. Does anyone know why this is discouraged?
Here is the comment in the actual book:
The actual quote is: "it isn't common to have widespread use of them in a production system."
Uncommon is not the same thing as discouraged.
That said I do think there is intent here by the author to highlight the fact that further up the page they assert that this is not a good place to be doing lots of coding, as an unhandled exception can cause the whole process to fail.
The author actually does go on to mention a possible use case for when you may want to load a resource(s) to do work within the handler.
Ok, maybe it's just this scenario we have that is a bit uncommon
Agreed - there is nothing fundamentally wrong with your approach. I recently did the same thing as you for wiring up SqlDependency to listen for database events and then publish a message as a result. In these scenarios there is literally nothing else you can do other than to use IWantToRunAtStartup.
Also, David himself often trawls the nservicebus tag, maybe he'll provide a more definitive answer than mine.
I'll copy the answer I gave in the Particular Software Google Group...
I'll quote myself directly here:
An implementation of IWantToRunWhenBusStartsAndStops is a great place to create a quick interface in order to test messages during debugging by allowing you to send messages based on the console input. Apart from this, it isn't common to have widespread use of them in a production system. One possible production use case will be to provision a resource needed by the endpoint at startup and then tear it down when the endpoint stops.
I think if I could add a little bit of emphasis it would be to "widespread use". I'm not trying to say you won't/can't have an IWantToRunWhenBusStartsAndStops in production code or that avoiding them is a best practice. I am trying to say that having a ton of them is probably a code smell.
Above that paragraph in the book, I warn about IWantToRunWhenBusStartsAndStops not having any ambient transactions or try/catch stuff going on. THAT is really the key part. If you end up throwing an exception in an IWantToRunWhenBusStartsAndStops, you can run into big problems. If you use something like a .NET Timer and then throw an exception, you can crash your process!
Let me tell you how I screwed up on this in my first-ever NServiceBus system. The system (still in use today, from what I hear) is responsible for ingesting more than 3000 RSS feeds (probably a lot more than that now) into a CMS. So processing each feed, breaking it up into items, resizing images, encoding attached video for mobile ... all those things were handled in NServiceBus message handlers, which was scaled out to multiple servers, and that was all fantastic.
The problem was the scheduler. I implemented that as an IWantToRunWhenBusStartsAndStops (well, actually IWantToRunAtStartup at that time) and it quickly turned into a mess. I kept the whole table worth of feed information in memory so that I could calculate when to fire off the next ProcessFeed command. I was using the .NET Timer class, and IIRC, I eventually had to use threading primitives like ManualResetEvent in order to coordinate the activity. And because I was using .NET Timer, if the scheduler threw an exception, that endpoint failed and had to restart. Lots of weird edge cases and it was always a quagmire of bugs. Plus, this was now a singleton "commander app" so while the feed/item processors could be scaled out, the scheduler could not.
As I got more experienced with NServiceBus, I realized that each feed should have been a saga, starting from a FeedCreated event, controlled through PauseProcessing and ResumeProcessing commands, using timeouts to control the next processing time, and finally (perhaps) ended via a FeedRemoved event. This would have been MUCH more straightforward and everything would have executed inside transactionally-controlled message handlers.
That experience led me to be a little bit distrustful/skeptical of IWantToRunWhenBusStartsAndStops. Not saying it's bad, just something to be aware of. Always be prepared to consider if what you're trying to do couldn't be better accomplished in another way.

Best ESB/Message Queue for appharbor

I'm currently trying to find the best message queue solution for an appharbor application. Most of the ones I've looked at assume you have a Windows environment with MSMQ and DTC installed, which I don't believe the appharbor environment provides.
I would like something that works well with ravendb, as that is the database we are using. Something whose only dependency is on raven would be ideal, especially if it integrates with our existing unit of work. I.e., when save changes is called in our controller action, the messages are saved in the same transaction.
It would also need a host that works in a console application for background processing.
Ideally I would like something that "just works" in a development environment also. With raven, for example, we use the embedded mode while developing and I would like something that doesn't require installation.
I've looked at NServiceBus, which seems to fail these conditions because it needs a transport (MSMQ, SQL, etc.) and much of the documentation is out of date.
I also looked at Rhino Service Bus, but there is a distinct lack of documentation and community. I'm also not sure if it can depend entirely on ravendb.
The others I looked at all seemed quite heavyweight and required installation and configuration to run in a development environment.
Edit: the other option is to implement our own.
First of all, congratulations on being the 1000th NServiceBus question on StackOverflow!
Second, if you were to use SQL for persisting your business data, then you could run NServiceBus on top of that same SQL where all the messages go through tables (instead of queues) and then you wouldn't need the DTC.
Third, if you did want to go with RavenDB as your transport for NServiceBus, you would have to implement the ISendMessages and IReceiveMessages interfaces on top of it, but I believe that somebody in the community has already started working on that, so possibly you could join forces with them.
Finally, I wouldn't recommend writing your own ESB these days - not when there are so many good choices already out there. You mentioned the issues of community and documentation - those tend to be handled the worst when writing your own infrastructure.

How do you solve a problem that is unreproducible, random and changes are not immediately testable?

Thought I would throw this one out there and see what other people's experiences have been like.
I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.
In my experience so far, I cannot for the life of me figure out what is causing these stoppages. That, and I cannot reproduce the stoppage myself. The queue fails at all different intervals, sometimes running for a month straight, other times failing as close together as twice in 1 day. I have since involved two different vendors and various colleagues within the department and everyone is stumped, and has been for several months.
Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither has any idea what the problem is.
We've updated a few settings here and there and upgraded client and server pieces, but we have no idea if the things we are doing are contributing to an overall solution.
So I have a problem that appears to be unreproducible, random and untestable.
Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?
Any shared input or experiences would be great.
EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure proper anti-virus exclusions were done, and so far it has not jammed in over a month!
Use a logging framework that can be turned on in production. You might have too much logging initially, but it should help narrow down the problem, and as you get closer you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
In addition to the logging as pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. One option, if this is a Java application, is to use jconsole and connect to the JVM instance. jconsole has a detect deadlock option which can provide very valuable information when the hangup occurs.
If this is not a Java application and perhaps a .NET application you could make use of this technique.
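The same idea (capture what every thread is doing at the moment the queue jams) can be sketched in other runtimes too. The following is only an illustration in Python, assuming the worker calls a hypothetical beat() after each processed job; StallWatchdog and dump_all_thread_stacks are names made up for this sketch, not part of any framework mentioned above.

```python
import sys
import threading
import time
import traceback


def dump_all_thread_stacks() -> str:
    """Return the current stack of every live thread as one text report."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append(f"Thread {names.get(ident, 'unknown')} (id {ident}):\n")
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


class StallWatchdog:
    """Calls on_stall with a stack report when no progress was seen for too long."""

    def __init__(self, timeout_seconds: float, on_stall):
        self.timeout_seconds = timeout_seconds
        self.on_stall = on_stall
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        # The worker calls this after each job it finishes.
        self._last_beat = time.monotonic()

    def check(self, now=None) -> bool:
        # Call periodically (e.g. from a monitoring thread); returns True on a stall.
        now = time.monotonic() if now is None else now
        if now - self._last_beat > self.timeout_seconds:
            self.on_stall(dump_all_thread_stacks())
            return True
        return False
```

Writing the report to the log at the moment of the stall, before anyone restarts the services, is what turns an unreproducible jam into something that can be inspected afterwards.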

Why use AMQP/ZeroMQ/RabbitMQ

as opposed to writing your own library.
We're working on a project here that will be a self-dividing server pool, if one section grows too heavy, the manager would divide it and put it on another machine as a separate process. It would also alert all connected clients this affects to connect to the new server.
I am curious about using ZeroMQ for inter-server and inter-process communication. My partner would prefer to roll his own. I'm looking to the community to answer this question.
I'm a fairly novice programmer myself and just learned about messaging queues. As I've googled and read, it seems everyone is using messaging queues for all sorts of things, but why? What makes them better than writing your own library? Why are they so common and why are there so many?
what makes them better than writing your own library?
When rolling out the first version of your app, probably nothing: your needs are well defined and you will develop a messaging system that will fit your needs: small feature list, small source code etc.
Those tools are very useful after the first release, when you actually have to extend your application and add more features to it.
Let me give you a few use cases:
Your app will have to talk to a big endian machine (sparc/powerpc) from a little endian machine (x86, intel/amd). Your messaging system had some endian ordering assumption: go and fix it.
You designed your app so it is not a binary protocol/messaging system and now it is very slow because you spend most of your time parsing it (the number of messages increased and parsing became a bottleneck): adapt it so it can transport binary/fixed encoding.
At the beginning you had 3 machines inside a LAN, with no noticeable delays, and everything gets to every machine. Your client/boss/pointy-haired-devil-boss shows up and tells you that you will install the app on a WAN you do not manage, and then you start having connection failures, bad latency, etc. You need to store messages and retry sending them later on: go back to the code and plug this stuff in (and enjoy).
Messages sent need to have replies, but not all of them: you send some parameters in and expect a spreadsheet as a result instead of just an acknowledgement. Go back to the code and plug this stuff in (and enjoy).
Some messages are critical and their reception/sending needs proper backup/persistence. Why, you ask? Auditing purposes.
And many other use cases that I forgot ...
You can implement it yourself, but do not spend much time doing so: you will probably replace it later on anyway.
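As a small illustration of the endianness case above, here is a sketch in Python of a header codec that pins the byte order explicitly. The wire format (a 2-byte message type followed by a 4-byte payload length) is a hypothetical one invented for this example.

```python
import struct

# Hypothetical header: unsigned 16-bit message type, unsigned 32-bit payload length.
# "!" forces network (big-endian) byte order regardless of the host CPU.
HEADER_NETWORK = struct.Struct("!HI")
# The same fields read with a little-endian assumption, to show the failure mode.
HEADER_LITTLE = struct.Struct("<HI")


def encode_header(msg_type: int, length: int) -> bytes:
    """Pack the header in network byte order."""
    return HEADER_NETWORK.pack(msg_type, length)


def decode_header(data: bytes) -> tuple:
    """Unpack a header that was packed in network byte order."""
    return HEADER_NETWORK.unpack(data[:HEADER_NETWORK.size])


if __name__ == "__main__":
    wire = encode_header(1, 258)
    print(decode_header(wire))        # (1, 258)
    print(HEADER_LITTLE.unpack(wire)) # (256, 33619968): wrong byte order assumed
```

A home-grown protocol that packs with the host's native order works fine until the first machine with the other byte order joins, which is exactly the kind of detail an established messaging library has already dealt with.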
That's very much like asking: why use a database when you can write your own?
The answer is that using a tool that has been around for a while and is well understood in lots of different use cases, pays off more and more over time and as your requirements evolve. This is especially true if more than one developer is involved in a project. Do you want to become support staff for a queueing system if you change to a new project? Using a tool prevents that from happening. It becomes someone else's problem.
Case in point: persistence. Writing a tool to store one message on disk is easy. Writing a persistor that scales and performs well and stably, in many different use cases, and is manageable, and cheap to support, is hard. If you want to see someone complaining about how hard it is then look at this:
Anyway, I hope this helps. By all means write your own tool. Many many people have done so. Whatever solves your problem, is good.
I'm considering using ZeroMQ myself - hence I stumbled across this question.
Let's assume for the moment that you have the ability to implement a message queuing system that meets all of your requirements. Why would you adopt ZeroMQ (or other third party library) over the roll-your-own approach? Simple - cost.
Let's assume for a moment that ZeroMQ already meets all of your requirements. All that needs to be done is to integrate it into your build, read some doco and then start using it. That's got to be far less effort than rolling your own. Plus, the maintenance burden has been shifted to someone else. Since ZeroMQ is free, it's like you've just grown your development team to include (part of) the ZeroMQ team.
If you ran a Software Development business, then I think that you would balance the cost/risk of using third party libraries against rolling your own, and in this case, using ZeroMQ would win hands down.
Perhaps you (or rather, your partner) suffer, as so many developers do, from the "Not Invented Here" syndrome? If so, adjust your attitude and reassess the use of ZeroMQ. Personally, I much prefer the benefits of a Proudly Found Elsewhere attitude. I'm hoping I can be proud of finding ZeroMQ... time will tell.
EDIT: I came across this video from the ZeroMQ developers that talks about why you should use ZeroMQ.
what makes them better than writing your own library?
Message queuing systems are transactional, which is conceptually easy to use as a client, but hard to get right as an implementor, especially considering persistent queues. You might think you can get away with writing a quick messaging library, but without transactions and persistence, you'd not have the full benefits of a messaging system.
Persistence in this context means that the messaging middleware keeps unhandled messages in permanent storage (on disk) in case the server goes down; after a restart, the messages can be handled and no retransmit is necessary (the sender does not even know there was a problem). Transactional means that you can read messages from different queues and write messages to different queues in a transactional manner, meaning that either all reads and writes succeed or (if one or more fail) none succeeds. This is not really much different from the transactionality known from interfacing with databases and has the same benefits (it simplifies error handling; without transactions, you would have to assure that each individual read/write succeeds, and if one or more fail, you have to roll back those changes that did succeed).
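The all-or-nothing behaviour described above can be sketched with plain in-memory queues. This is only an illustration in Python of the semantics, not how any real broker implements it: there is no persistence here, and QueueTransaction is a hypothetical name made up for the example.

```python
from collections import deque


class QueueTransaction:
    """Stages reads and writes across queues; applies them only on commit."""

    def __init__(self):
        self._taken = []    # (queue, message) pairs removed so far
        self._pending = []  # (queue, message) pairs to publish on commit

    def receive(self, queue: deque):
        message = queue.popleft()
        self._taken.append((queue, message))
        return message

    def send(self, queue: deque, message) -> None:
        # Not visible to other readers until commit.
        self._pending.append((queue, message))

    def commit(self) -> None:
        for queue, message in self._pending:
            queue.append(message)
        self._taken.clear()
        self._pending.clear()

    def rollback(self) -> None:
        # Put received messages back in their original order; drop staged sends.
        for queue, message in reversed(self._taken):
            queue.appendleft(message)
        self._taken.clear()
        self._pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.commit()
        else:
            self.rollback()
        return False  # let the exception propagate to the caller
```

If the handler raises inside the `with` block, the message it read goes back to the front of its queue and nothing it tried to send is published, so the caller never has to undo individual reads and writes by hand.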
Before writing your own library, read the 0MQ Guide here:
Chances are that you will either decide to install RabbitMQ, or else you will make your library on top of ZeroMQ since they have already done all the hard parts.
If you have a little time, give it a try and roll out your own implementation! What you learn from this exercise will convince you of the wisdom of using an already tested library.