nservicebus workers detached after cluster failover - nservicebus

Background:
We are currently trying to move to the failover cluster distributor model and are seeing some issues when failing over the cluster from one distributor to another. In our lab environment we have set everything up as per David Boike's blog post. We have 3 servers for this exercise, 2 Dist boxes and 1 worker.
I see the worker initially register to the distributor and it processes messages fine until I initiate a failover test. Once I failover from one node to another, the worker appears to be detached and no longer processes messages. The messages still come in and sit in the clustered MSMQ queue.
I had to restart the nServiceBus Host windows service on the worker to get things going again.
Question:
Is there a setting that controls how often the worker subscribes to the distributor in case of cluster failover?
Thanks for the help

Related

RabbitMQ creates a number of strange processes

I happened to find a number of strange processes created by rabbitmq on my RabbitMQ server. I ran rabbitmq server in a docker container. I recreated the container and hours later those processes appeared again. There're some consumers connecting to it. Any idea about what those processes for? Thanks!

Multiple broker machine rabbitmq configuration, how does HA work?

I'm trying to figure out how HA works. (high availability queues)
The current configuration I have is: every machine has multiple celery workers and points to itself as broker. Each machine can do this rather than point at one broker machine because of HA; in this way, there is less load on any one machine, as all are brokers and have copies of the same queue.
My question is, is my above logic correct? Or do all workers need to point to one broker machine regardless of HA?
If you have looked at HA and clustering and have ensured that the queues mirror each other then what you are doing should be fine. But that may seem a tad inefficient to run it on every server where you run your workers.
The other option is to run your queues on a few servers for HA and have other servers running the workers to point to them. But since the celery worker config can only point to one broker url, you would need to work around that by possibly using a load balancer to which all workers will point to. This is to the best of what I've come to understand over the past few years on RabbitMQ HA for celery.

Celery workers missing heartbeats and getting substantial drift over Ec2

I am testing my celery implementation over 3 ec2 machines right now. I am pretty confident in my implementation now, but I am getting problems with the actual worker execution. My test structure is as follows:
1 ec2 machine is designated as the broker, also runs a celery worker
1 ec2 machine is designated as the client (runs the client celery script that enqueues all the tasks using .delay(), also runs a celery worker
1 ec2 machine is purely a worker.
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages : "missed heartbeat from celery#[other ec2 ip].
The machine would be doing very little work at this point, so my AutoScaling config in ec2 would shut down the instance automatically once it got to cpu utilization levels very low (<5%)
So to try to solve this problem, i attempted to sync all my machine's clocks (although I thought celery handled this) with this command, which was performed upon start up for all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my ec2 instances stalled and shut down. The weird thing is, the drift increased and then decreased sometimes.
Any idea on why this is happening?
I am using the newest version of celery (3.1) and rabbitmq
EDIT: It should be noted that I am utilizing us-west-1a and us-west-1c availability zones on ec2.
EDIT2: I am starting to think memory problems might be an issue. I am using a t2.micro instance, and running 3 celery workers on the same machine (only 1 instance) which is also the broker, still cause heartbeat misses and stalls.

What is the correct way to use the timeout manager with the distributor in NServiceBus 3+?

Version pre-3 the recommendation was to run a timeout manager as a standalone process on your cluster, beside the distributor. (As detailed here: http://support.nservicebus.com/customer/portal/articles/965131-deploying-nservicebus-in-a-windows-failover-cluster).
After the inclusion of the timeout manager as a satellite assembly, what is the correct way to use it when scaling out with the distributor?
Should each worker of Service A run with timeout manager enabled or should only the distributor process for service A be configured to run a timeout manager for service A?
If each worker runs it, do they share the same Raven instance for storing the timeouts? (And if so, how do you make sure that two or more workers don't pick up the same expired timeout at the same time?)
Allow me to answer this clearly myself.
After a lot of digging and with help from Andreas Öhlund on the NSB team(http://tech.groups.yahoo.com/group/nservicebus/message/17758), the correct anwer to this question is:
Like Udi Dahan mentioned, by design ONLY the distributor/master node should run a timeout manager in a scale out scenario.
Unfortunately in early versions of NServiceBus 3 this is not implemented as designed.
You have the following 3 issues:
1) Running with the Distributor profile does NOT start a timeout manager.
Workaround:
Start the timeout manager on the distributor yourself by including this code on the distributor:
class DistributorProfileHandler : IHandleProfile<Distributor>
{
public void ProfileActivated()
{
Configure.Instance.RunTimeoutManager();
}
}
If you run the Master profile this is not an issue as a timeout manager is started on the master node for you automatically.
2) Workers running with the Worker profile DO each start a local timeout manager.
This is not as designed and messes up the polling against the timeout store and dispatching of timeouts. All workers poll the timeout store with "give me the imminent timeouts for MASTERNODE". Notice they ask for timeouts of MASTERNODE, not for W1, W2 etc. So several workers can end up fetching the same timeouts from the timeout store concurrently, leading to conflicts against Raven when deleting the timeouts from it.
The dispatching always happens through the LOCAL .timouts/.timeoutsdispatcher queues, while it SHOULD be through the queues of the timeout manager on the MasterNode/Distributor.
Workaround, you'll need to do both:
a) Disable the timeout manager on the workers. Include this code on your workers
class WorkerProfileHandler:IHandleProfile<Worker>
{
public void ProfileActivated()
{
Configure.Instance.DisableTimeoutManager();
}
}
b) Reroute NServiceBus on the workers to use the .timeouts queue on the MasterNode/Distributor.
If you don't do this, any call to RequestTimeout or Defer on the worker will die with an exception saying that you have forgotten to configure a timeout manager. Include this in your worker config:
<UnicastBusConfig TimeoutManagerAddress="{endpointname}.Timeouts#{masternode}" />
3) Erroneous "Ready" messages back to the distributor.
Because the timeout manager dispatches the messages directly to the workers input queues without removing an entry from the available workers in the distributor storage queue, the workers send erroneous "Ready" messages back to the distributor after handling a timeout. This happens even if you have fixed 1 and 2, and it makes no difference if the timeout was fetched from a local timeout manager on the worker or on one running on the distributor/MasterNode. The consequence is a build up of an extra entry in the storage queue on the distributor for each timeout handled by a worker.
Workaround:
Use NServiceBus 3.3.15 or later.
In version 3+ we created the concept of a master node which hosts inside it all the satellites like the distributor, timeout manager, gateway, etc.
The master node is very simple to run - you just pass a /master flag to the NServiceBus.Host.exe process and it runs everything for you. So, from a deployment perspective, where you used to deploy the distributor, now you deploy the master node.

Worker node handles message from two distributor

I have same question asked in nServicebus group. I did not get firm answer of this feature is supported. I like to post it here to see SO community thoughts.
http://tech.groups.yahoo.com/group/nservicebus/message/16487
I already have windows processor worker nodes that handles the messages from a
Distributor. Now I like to extend this worker node to handle messages from
another distributor with different queue names. When I looked at the unicast bus
configuration, I found that only one distributor Control and data address can be
set. Is there a way to set up multiple distributor in NServiceBus Configuration?
If you also explain the pros and cons of using handling multiple distributor
that would help.
It sounds like you may be using NServiceBus 2.x, because in NServiceBus 3.0, the Distributor story is very much changed.
Under NServiceBus 2.x, you usually set up multiple endpoints all talking to the same distributor. These endpoints become worker nodes and the distributor divides up the work between them based on each worker node reporting when it has a free thread.
So, if you had the load of messages coming into Queue X handled by X.Worker#Server1 and X.Worker#Server2, it doesn't make sense to me why you would want one of the X.Worker instances to handle messages coming into queue Y?
Instead, you should (normally) set up one Distributor per logical service. This is akin to a Network Load Balancer for HTTP traffic. Then the endpoints behind it act as the worker nodes. You can set up a second distributor, with its own worker nodes, for another logical service.
Now, with all that said, in NServiceBus 3.x, the distributor is integrated with the endpoint. So you start off with one endpoint configured as a Master Node. Basically it functions as a distributor AND a worker. Then to scale out, you simply stand up more nodes in Worker role only, pointing at the Master Node to get their work.
In that scenario, there is (generally) no freestanding Distributor. This is why I'm guessing you're referring to V2.