Bull - Reached the max retries per request limit - express

I am using the npm package bull to queue jobs that send mail for my project. It ran without problems for a long time, but recently it started showing this error:
Error while handling task collect-metrics: Reached the max retries per request limit (which is 10). Refer to "maxRetriesPerRequest" option for details.
I also checked in redis-cli with keys *, and it didn't show any keys.
The bull module supports @bull-monitor/express to monitor jobs, but since the error appeared I couldn't access the monitor (the bull admin panel).
Here is my code:

I faced this problem as well when I deployed my application to production. It turns out that Bull.js doesn't automatically enable a Redis connection over TLS, even when the production environment requires it. What fixed it for me was setting tls to true and enableTLSForSentinelMode to false in the Redis options of my queue. Here's a sample:
const myQueue = new Queue('my_queue', YOUR_REDIS_URL, {
  redis: { tls: true, enableTLSForSentinelMode: false },
  // ...other queue options
})

Bull can't find Redis to connect to.
I was using bull in a local environment with no problems, but in the cloud bull showed me the same error.
In the local environment it connects to 127.0.0.1:6379, but in the cloud that default host and port aren't available, so you need to specify the Redis host, port, username and password explicitly, as in the sketch below.
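A minimal sketch of passing explicit connection options to Bull; the environment variable names are assumptions, so adapt them to your provider:

import Queue from 'bull';

// Hypothetical environment variables for the cloud Redis instance.
const mailQueue = new Queue('mail', {
  redis: {
    host: process.env.REDIS_HOST,
    port: Number(process.env.REDIS_PORT),
    username: process.env.REDIS_USERNAME, // requires Redis 6 ACLs; omit if you only use a password
    password: process.env.REDIS_PASSWORD,
  },
});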

Related

NestJS, Bull Queue and Redis in production

I have a project using NestJS version 9.2.0 with Bull Queue version 4.9.0, which connects to Redis to schedule jobs.
Locally this works as expected; however, when deploying to staging, Bull won't connect and fails with a maxRetriesPerRequest error.
My assumption is that on staging Bull is not able to connect over TLS, but I am unable to confirm that this is definitely the issue.
I have tried the following without any success:
set the Redis URL as rediss://.....
pass the parameter ?tls=true at the end of the Redis URL
pass an empty tls: {} object as part of the Redis configuration, which I then pass to the BullModule that I import via useFactory in the app.module file (see the sketch below)
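For reference, here is a minimal sketch of that third attempt, assuming @nestjs/bull with forRootAsync and a ConfigService; the variable names are placeholders, not taken from the project:

import { Module } from '@nestjs/common';
import { BullModule } from '@nestjs/bull';
import { ConfigModule, ConfigService } from '@nestjs/config';

@Module({
  imports: [
    BullModule.forRootAsync({
      imports: [ConfigModule],
      inject: [ConfigService],
      useFactory: (config: ConfigService) => ({
        redis: {
          host: config.get<string>('REDIS_HOST'),
          port: Number(config.get('REDIS_PORT')),
          password: config.get<string>('REDIS_PASSWORD'),
          tls: {}, // an empty object asks ioredis to open a TLS connection
        },
      }),
    }),
  ],
})
export class AppModule {}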
The reasons why I suspect it is a TLS-related issue:
the local connection works just fine
when I set an invalid URL locally, mocking a scenario where Bull cannot connect to Redis, I see the same behaviour in the console as on staging and production
I have ensured that the Redis instance on staging and production is up and running and that the URL I am using in the app is correct

Moleculer JS "Redis-pub client is disconnected" every 10 minutes

My application (Node.js) uses Moleculer for microservices and Redis as the transporter. However, the application logs Redis-pub client is disconnected every 10 minutes, then reconnects a few seconds later with the log Redis-pub client is connected. This is a problem because if a client sends a Moleculer action during that window, it fails.
Any idea what is causing this? Let me know if more information is needed.
Azure Cache for Redis currently has a 10-minute idle timeout for connections, so the idle timeout setting in your client application should be less than 10 minutes. Most common client libraries have a configuration setting that sends Redis PING commands to the server automatically and periodically. When using a client library without such a setting, the application itself is responsible for keeping the connection alive.
More info: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-connection#idle-timeout
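A minimal illustration of that keep-alive idea using ioredis directly; the connection details and the 5-minute interval are assumptions, and in a Moleculer setup the same idea would apply to the transporter's Redis clients:

import Redis from 'ioredis';

const client = new Redis({
  host: process.env.REDIS_HOST,
  port: 6380, // Azure Cache for Redis accepts TLS connections on port 6380
  password: process.env.REDIS_PASSWORD,
  tls: {},
});

// Ping well inside the 10-minute idle window so the connection never sits idle long enough to be dropped.
setInterval(() => {
  client.ping().catch((err) => console.error('keep-alive ping failed', err));
}, 5 * 60 * 1000);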

Network issues and Redis PubSub

I am using ServiceStack 5.0.2 and Redis 3.2.100 on Windows.
I have several nodes with active Pub/Sub subscriptions and a few publishes per second.
I noticed that if the Redis service restarts while there is no physical network connection (so one of the clients cannot reach the Redis service), that client stops receiving any messages after the network recovers. Let's call it a "zombie subscriber": it thinks it is still operational but never actually receives a message; the client believes it has a connection, while the same connection on the server is closed.
The problem is that no exception is thrown in RedisSubscription.SubscribeToChannels, so I am not able to detect the issue in order to resubscribe.
I have also analyzed RedisPubSubServer and I think I have discovered a problem: in the described case RedisPubSubServer tries to restart (sending the stop command CTRL), but the "zombie subscriber" does not receive it, so no resubscription is made.

What is the correct way to use the timeout manager with the distributor in NServiceBus 3+?

Pre-version 3, the recommendation was to run a timeout manager as a standalone process on your cluster, beside the distributor. (As detailed here: http://support.nservicebus.com/customer/portal/articles/965131-deploying-nservicebus-in-a-windows-failover-cluster).
After the inclusion of the timeout manager as a satellite assembly, what is the correct way to use it when scaling out with the distributor?
Should each worker of Service A run with timeout manager enabled or should only the distributor process for service A be configured to run a timeout manager for service A?
If each worker runs it, do they share the same Raven instance for storing the timeouts? (And if so, how do you make sure that two or more workers don't pick up the same expired timeout at the same time?)
Allow me to answer this clearly myself.
After a lot of digging, and with help from Andreas Öhlund on the NSB team (http://tech.groups.yahoo.com/group/nservicebus/message/17758), the correct answer to this question is:
Like Udi Dahan mentioned, by design ONLY the distributor/master node should run a timeout manager in a scale out scenario.
Unfortunately in early versions of NServiceBus 3 this is not implemented as designed.
You have the following 3 issues:
1) Running with the Distributor profile does NOT start a timeout manager.
Workaround:
Start the timeout manager on the distributor yourself by including this code on the distributor:
class DistributorProfileHandler : IHandleProfile<Distributor>
{
    public void ProfileActivated()
    {
        Configure.Instance.RunTimeoutManager();
    }
}
If you run the Master profile this is not an issue as a timeout manager is started on the master node for you automatically.
2) Workers running with the Worker profile DO each start a local timeout manager.
This is not as designed and messes up both the polling against the timeout store and the dispatching of timeouts. All workers poll the timeout store with "give me the imminent timeouts for MASTERNODE". Notice they ask for the timeouts of MASTERNODE, not of W1, W2, etc. So several workers can end up fetching the same timeouts from the timeout store concurrently, leading to conflicts against Raven when deleting the timeouts from it.
The dispatching always happens through the LOCAL .timeouts/.timeoutsdispatcher queues, while it SHOULD go through the queues of the timeout manager on the MasterNode/Distributor.
Workaround (you'll need to do both):
a) Disable the timeout manager on the workers. Include this code in your workers:
class WorkerProfileHandler : IHandleProfile<Worker>
{
    public void ProfileActivated()
    {
        Configure.Instance.DisableTimeoutManager();
    }
}
b) Reroute NServiceBus on the workers to use the .timeouts queue on the MasterNode/Distributor.
If you don't do this, any call to RequestTimeout or Defer on the worker will die with an exception saying that you have forgotten to configure a timeout manager. Include this in your worker config:
<UnicastBusConfig TimeoutManagerAddress="{endpointname}.Timeouts#{masternode}" />
3) Erroneous "Ready" messages back to the distributor.
Because the timeout manager dispatches the messages directly to the workers' input queues without removing an entry from the available workers in the distributor's storage queue, the workers send erroneous "Ready" messages back to the distributor after handling a timeout. This happens even if you have fixed 1 and 2, and it makes no difference whether the timeout was fetched by a local timeout manager on the worker or by one running on the distributor/MasterNode. The consequence is a build-up of an extra entry in the storage queue on the distributor for each timeout handled by a worker.
Workaround:
Use NServiceBus 3.3.15 or later.
In version 3+ we created the concept of a master node which hosts inside it all the satellites like the distributor, timeout manager, gateway, etc.
The master node is very simple to run - you just pass a /master flag to the NServiceBus.Host.exe process and it runs everything for you. So, from a deployment perspective, where you used to deploy the distributor, now you deploy the master node.

How to debug ActiveMQ client?

I'm a fairly new user of ActiveMQ and I'm looking for a way to get detailed debug information on the client side of a queue connection. My problem is this: I have a server that is sending a message through a queue to a client. Using the admin web page associated with the broker, I can verify the following: the queue was created, there is a consumer associated with the queue, the message has been enqueued, the message has been dispatched, the dispatched queue size is 1, the message has not been dequeued. This setup was working yesterday but mysteriously stopped working today even though I did a restart of the activemq service. The log file at /var/log/activemq.log does not contain any useful information.
At this point I'm stumped; I'm assuming that there is some sort of problem with the configuration, but it hasn't changed since yesterday. Does anybody have a suggestion about what my next step should be?
First of all, turn on debug (or even trace) logging in the broker, in conf/log4j.properties:
log4j.logger.org.apache.activemq=DEBUG
Then restart the broker and re-run your scenario. The logging will hopefully give you some useful information.
Jconsole is also a useful tool to monitor the running broker.
Does your client use any message filters?
You can also enable remote debugging and then connect with an IDE.
To start remote debugging, execute
$ ACTIVEMQ_DEBUG=true bin/activemq
and then attach a remote debugger to port 5005.