We are using:
MassTransit 3.5.7
RabbitMQ 3.6.5
Our environment runs ~2000 microservices.
We use a cluster.
We are experiencing a leak in the number of channels as well as in the number of Erlang processes being used.
In the image below you can see that we have ~46,000 channels.
If we look into the connections, we see there are many idle channels in each connection.
In addition, and possibly related, we can see that the number of Erlang processes is constantly increasing.
Can someone please share some information and assist with this behavior?
Yes, the Erlang process count is related to open channels. I simulated this by opening thousands of channels (without MassTransit, just a regular app) and deliberately not closing them, and the result looks similar to yours.
As for the issue itself, it is possibly related to:
https://github.com/MassTransit/MassTransit/issues/266
So you can try this:
it's necessary to set up a cleanup timer on the SendEndpointCache so that unused endpoints are shut down after a few minutes.
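To check which connections are holding the idle channels (before and after trying the change), you can group the channel list by connection. This is only a rough sketch: it assumes you have already fetched the JSON list from the management plugin's `GET /api/channels`, the field names (`connection_details.name`, `consumer_count`, `messages_unacknowledged`) are what I believe that endpoint returns, and "idle" here is my own definition (no consumers and nothing unacknowledged). The sample data is made up.

```python
from collections import Counter


def idle_channels_per_connection(channels):
    """Count channels with no consumers and no unacked messages, per connection.

    `channels` is a list of dicts shaped like the entries of the management
    API's channel listing (field names are assumptions, see the text above).
    """
    counts = Counter()
    for ch in channels:
        conn = ch.get("connection_details", {}).get("name", "<unknown>")
        has_consumers = ch.get("consumer_count", 0) > 0
        has_unacked = ch.get("messages_unacknowledged", 0) > 0
        if not has_consumers and not has_unacked:
            counts[conn] += 1
    return counts


# Made-up sample: two connections, three idle channels in total.
sample = [
    {"connection_details": {"name": "conn-a"}, "consumer_count": 0, "messages_unacknowledged": 0},
    {"connection_details": {"name": "conn-a"}, "consumer_count": 0, "messages_unacknowledged": 0},
    {"connection_details": {"name": "conn-a"}, "consumer_count": 1, "messages_unacknowledged": 0},
    {"connection_details": {"name": "conn-b"}, "consumer_count": 0, "messages_unacknowledged": 0},
]

# Print the connections with the most idle channels first.
for name, idle in idle_channels_per_connection(sample).most_common():
    print(name, idle)
```

Connections whose idle count keeps growing between runs are the ones worth looking at first.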
Hope it helps.
RabbitMQ introduced streams last year. They claim streams work with AMQP 0.9.1 and 1.0, as mentioned here. That is, theoretically we should be able to create a queue backed by a stream, connect as many workers as we need to fan out from the queue, and each worker should get every message delivered.
My question is, has anyone tried to use streams with Celery yet? If so, please share any info on how to configure streams in Celery and your experience with them so far. Unfortunately, I could find no blog posts or documentation on this topic. I am hoping this post brings all this information together in one place.
The big advantage of streams is they allow large fan-out using the existing infra of RabbitMQ + Celery.
As far as I am aware, there is no way for Celery to use streams. However, you can probably spin up a long-running Celery task that processes a particular stream. This is probably the reason why nobody has attempted this (or, better said, written it up in a blog post or something similar): why bother using Celery for something it is not made for?
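For reference, reading a stream over AMQP 0.9.1 without Celery could look roughly like the sketch below. It assumes the third-party `pika` client, and the helper names, default prefetch and default offset are my own choices, not anything Celery provides. The pieces that come from RabbitMQ itself are the `x-queue-type: stream` queue argument, the requirement that the queue is durable and consumed with a prefetch limit and manual acks, and the `x-stream-offset` consumer argument.

```python
def stream_queue_arguments(max_length_bytes=None):
    """Arguments for declaring a queue backed by a stream."""
    args = {"x-queue-type": "stream"}
    if max_length_bytes is not None:
        # Optional retention limit for the stream, in bytes.
        args["x-max-length-bytes"] = max_length_bytes
    return args


def stream_consumer_arguments(offset="first"):
    """Consumer arguments; offset may be e.g. "first", "last", "next" or a number."""
    return {"x-stream-offset": offset}


def consume_stream(url, queue, on_message, offset="first", prefetch=100):
    """Blocking consumer loop; every worker running this sees every message."""
    import pika  # third-party dependency, imported lazily

    connection = pika.BlockingConnection(pika.URLParameters(url))
    channel = connection.channel()
    # Streams must be declared durable.
    channel.queue_declare(queue=queue, durable=True,
                          arguments=stream_queue_arguments())
    # Stream consumers need a prefetch limit and cannot use automatic acks.
    channel.basic_qos(prefetch_count=prefetch)

    def _callback(ch, method, properties, body):
        on_message(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue, on_message_callback=_callback,
                          auto_ack=False,
                          arguments=stream_consumer_arguments(offset))
    channel.start_consuming()
```

A call such as `consume_stream("amqp://guest:guest@localhost:5672/%2F", "events", print)` (hypothetical URL and queue name) would then run inside whatever long-running process or task you choose.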
We’ve been using the Tokbox platform for several months now with a JavaScript web-client as well as an Android phone client, where sessions and connections are managed by a Python server. While integration and bring-up went well on both ends (client and server), we continue to encounter problems with the in-session audio and video experience.
Sessions are always routed and always between two participants only, with much use of a collaborative editor.
The in-session experience is like a coin toss: we never know how it’s going to go, and that’s becoming a business threat.
Web-Client: A/V Resources
The most common problem is the acquisition of audio and/or video: at the beginning of a session, one participant or the other may have problems hearing or seeing the other. Allocating a new connection to establish new streams does not fix that, nor does restarting the browser.
Question: What’s the recommended way to detect possible resource locks (e.g. does another application hog the camera/microphone)?
Web-Client: Network
Bandwidth and packet loss are a challenge, for example this inspector graph:
Audio and video of both participants are all over the place, and while we cannot control the network connections, the web-client should be able to reliably give useful information.
Question: Other than continuous connection monitoring with getStats() and maybe the experimental navigator.connection property, how can the web-client monitor network connectivity?
Pre-Call Test
We recommend that customers run a pre-call test and have implemented it on our site as well. However, the results of that test often do not reflect the in-session connectivity. Worse, a pre-call test may detect a low (no video) bandwidth while Skype works just fine.
Question: How can that be?
I'm a member of the TokBox development team. I remember you reported an issue with the Python SDK, thanks for that!
Web-Client: A/V Resources
Most acquisition issues are detected by the JS SDK and if they aren't then we'd really like to hear about it! Please report reproduction steps or affected session IDs to TokBox support (referencing this StackOverflow question): https://support.tokbox.com/hc/en-us/requests/new
Most acquisition errors appear as OT_HARDWARE_UNAVAILABLE or OT_MEDIA_ERR_ABORTED errors. Are you detecting and surfacing these errors to your users? There is also the special OT_CHROME_MICROPHONE_ACQUISITION_ERROR error which is due to a known issue with Chrome that has been mostly fixed since Chrome 63 (see https://bugs.chromium.org/p/webrtc/issues/detail?id=4799).
Web-Client: Network
Unfortunately this is one of the more difficult issues to troubleshoot. Yes, Subscriber#getStats() is the best tool we have at our disposal and is a wrapper around the native RTCPeerConnection#getStats() function. Unfortunately we don't have much control over the values returned by the native function and if you think our SDK is returning incorrect values when compared with values from RTCPeerConnection#getStats() then please let us know!
It would be worthwhile confirming whether the issue is reproducible in all browsers or only a particular one. If you have detailed data regarding the inaccuracy of the native RTCPeerConnection#getStats() function then we could work together to report it to the browser vendor(s).
Fortunately we have just released the new Publisher#getStats() function which lets you get the publisher side of the stats. This should help you narrow down the cause of a connectivity issue to either a publisher or subscriber side. Please let us know if this helps with tracking down these issues.
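As an illustration of the kind of continuous monitoring discussed here: the counters reported by getStats() are cumulative, so useful numbers come from differencing two successive samples. The sketch below shows only that arithmetic. It is written in Python for brevity (the same calculation would live in the web client), and the dictionary keys are hypothetical stand-ins for the timestamp and the packet and byte counters, not the SDK's actual field names.

```python
def interval_stats(prev, curr):
    """Packet-loss ratio and bitrate between two cumulative stats samples.

    Each sample is a dict with hypothetical keys: timestamp_ms,
    packets_lost, packets_received, bytes_received.
    """
    seconds = (curr["timestamp_ms"] - prev["timestamp_ms"]) / 1000.0
    if seconds <= 0:
        raise ValueError("samples must be in increasing time order")

    lost = curr["packets_lost"] - prev["packets_lost"]
    received = curr["packets_received"] - prev["packets_received"]
    expected = lost + received
    # With no traffic in the interval there is nothing to have been lost.
    loss_ratio = lost / expected if expected > 0 else 0.0

    bits = (curr["bytes_received"] - prev["bytes_received"]) * 8
    return {"loss_ratio": loss_ratio, "bitrate_kbps": bits / seconds / 1000.0}


earlier = {"timestamp_ms": 0, "packets_lost": 0,
           "packets_received": 0, "bytes_received": 0}
later = {"timestamp_ms": 1000, "packets_lost": 5,
         "packets_received": 95, "bytes_received": 125000}
print(interval_stats(earlier, later))
```

Running the same calculation on both the publisher and the subscriber samples is one way to see which side of the call the loss appears on.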
Pre-Call Test
Again, these tests are based on Subscriber#getStats() which in turn are based on RTCPeerConnection#getStats(), the accuracy of which is out of our hands, but we'd love any reproduction steps to either fix a bug in our client SDK or report a bug to the browser vendors.
Just to confirm though, when you say you've implemented a pre-call test in your site, did you use the official JavaScript network test module? https://github.com/opentok/opentok-network-test-js This is actually what's used by the TokBox pre-call test.
@Aiham, thanks for responding. I've been looking at the new Publisher#getStats() you linked to (thank you!), so that we too can give our users some way of visibly seeing the network conditions that might be affecting the quality of their call (and who's causing it). However, it seems as though bytes / packets sent goes up sharply as the number of subscribers increases, even though we're in a routed session.
Am I wrong to expect the Publisher#getStats() statistics to stay fairly stable regardless of the number of subscribers then receiving that stream in a routed session? I expected the nature of a routed call to mean it's sent once to the OpenTok Media Servers, and the statistics would end there.
I'm working on an application in which I have a listener on a RabbitMQ queue. Depending on the kind of message, the listener goes ahead and performs a task. My problem is that I need a way to spawn a new listener if a single listener isn't able to cope with the queue. As far as I can tell, I can use the RabbitMQ JSON API to find the length of the queue and take action based on that. So I would write a script that uses curl to check the queue length and spawns a new listener process. Am I on the right path here? Is there a better way to achieve this? I'm looking for a solution that scales with load, at least up to a certain limit.
Checking the RabbitMQ API to see the length of the queue is one way, and it would definitely work.
You should try to predict when the load is spiking so that you can increase the number of consumers gradually, rather than seeing a sudden spike of instances spawning. Having many instances spawn simultaneously could cause unnecessary load on your system.
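A minimal sketch of that idea, assuming the management plugin is enabled and reachable. The `messages` field of `GET /api/queues/{vhost}/{name}` gives the queue length; the thresholds, the function names and the one-consumer-per-check step limit are made-up values you would tune yourself.

```python
import base64
import json
import math
import urllib.parse
import urllib.request


def queue_length(base_url, vhost, queue, user, password):
    """Read the current message count from the RabbitMQ management API."""
    url = "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )
    request = urllib.request.Request(url)
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)["messages"]


def desired_consumers(length, current, per_consumer=100,
                      min_consumers=1, max_consumers=10, max_step=1):
    """Move at most `max_step` consumers per check towards the target count."""
    target = math.ceil(length / per_consumer) if length > 0 else min_consumers
    target = max(min_consumers, min(max_consumers, target))
    if target > current:
        return min(target, current + max_step)
    if target < current:
        return max(target, current - max_step)
    return current


# A backlog of 1000 messages with one consumer running: add one, not nine.
print(desired_consumers(1000, current=1))
```

Calling `desired_consumers(queue_length(...), current)` from a periodic job, and spawning or stopping listeners to match the result, keeps the ramp-up gradual and bounded by `max_consumers`.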
I'm setting up a web service with Pyramid. A typical request for a view will be very long, about 15 minutes to finish. So my idea was to queue jobs with Celery and a RabbitMQ broker.
I would like to know what would be the best way to ensure that bad things cannot happen.
Specifically, I would like to prevent the task queue from overflowing, for example.
A first measure will be defining quotas per IP, to limit the number of requests a given IP can submit per hour.
However I cannot predict the number of involved IPs, so this cannot solve everything.
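Such a per-IP quota could look roughly like the sliding-window sketch below. The class name and the limits are hypothetical, and it keeps its state in memory, so it only works as-is for a single server process; with several processes the counters would have to live in shared storage.

```python
import time
from collections import defaultdict, deque


class IpQuota:
    """Allow at most `limit` requests per IP within a sliding time window."""

    def __init__(self, limit, window_seconds=3600, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        now = self._clock()
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True


quota = IpQuota(limit=2, window_seconds=3600)
print([quota.allow("203.0.113.7") for _ in range(3)])
```

The view would call `allow()` with the client address before queuing the job and return an error response when it gets `False`.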
I have read that it's not possible to limit the queue size with Celery/RabbitMQ. I was thinking of retrieving the queue size before pushing a new item into it, but I'm not sure if it's a good idea.
I'm not familiar with good practices in messaging/job scheduling. Is there a recommended way to handle this kind of problem?
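For concreteness, the "check the size before pushing" idea would be something like the sketch below. The function names and the limit are hypothetical. The depth lookup assumes a `pika` channel, where a passive `queue_declare` reports the number of ready messages. Two caveats: the check and the publish are not atomic, so concurrent requests can overshoot the limit slightly, and messages already prefetched by workers are not counted.

```python
def try_enqueue(get_depth, publish, job, max_depth=1000):
    """Publish `job` only if the queue currently holds fewer than `max_depth` messages.

    `get_depth` and `publish` are callables supplied by the caller, so the
    admission rule stays independent of the broker client in use.
    """
    if get_depth() >= max_depth:
        return False  # caller should reject the request, e.g. with HTTP 503
    publish(job)
    return True


def pika_queue_depth(channel, queue):
    """Ready-message count of an existing queue, via a passive declare."""
    return channel.queue_declare(queue=queue, passive=True).method.message_count


# Stand-in for a real queue, to show the behaviour.
backlog = []
accepted = [try_enqueue(lambda: len(backlog), backlog.append, n, max_depth=2)
            for n in range(3)]
print(accepted, backlog)
```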
RabbitMQ has flow control built in. If RabbitMQ cannot keep up with the publishing rate, it applies TCP backpressure to slow down the publishers. If too many messages are sent to the server, it will also overflow them to disk. This allows your consumer to be a bit more naive, although if you restart the connection on every error and flood the connection you can still cause problems.
I've always decided to spend more time making sure the publishers/consumers could work with multiple queue servers instead of trying to make them more intelligent about a single queue server. The benefit is that if you are really overloading a single server you can just add another one (or another pair if using RabbitMQ HA). There is a useful video from PyCon about Messaging at Scale using Celery and RabbitMQ that should be of use.
What is "GridInterceptingMessageHandler"? I did a search and I can find no mention of this on nservicebus.com. Also, I see the samples have the line:
.LoadMessageHandlers(First<GridInterceptingMessageHandler>.Then<SagaMessageHandler>())
What does that do exactly?
If you look at the source and its documentation you'll see the following:
Intercepts all messages, not allowing any through if the endpoint has had its number of worker threads reduced to zero.
GridInterceptingMessageHandler
NSB allows you to dynamically tune the number of worker threads an endpoint is using to process messages. If the number of worker threads has been reduced to zero, the endpoint becomes disabled and will not continue to process messages. Tuning the threads is useful if you would like to increase the speed of message processing (assuming everything else will scale as well) without having to restart the endpoint.
This is especially helpful if you want to slowly drain the system of messages so that you can perform upgrades or other maintenance duties. By default this is wired up for you; you would only reference it if you decided to override how the message handlers are loaded (as in the example).
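To illustrate the ordering idea only: the sketch below is a language-neutral toy in Python, not NServiceBus code, and every name in it is made up. It shows why the intercepting handler has to come first in the chain. When the worker thread count is zero it stops the message before any later handler sees it.

```python
class GateHandler:
    """Stops all messages while the (hypothetical) worker thread count is zero."""

    def __init__(self, worker_threads):
        self.worker_threads = worker_threads

    def handle(self, message):
        # Returning False tells the dispatcher not to call later handlers.
        return self.worker_threads > 0


class RecordingHandler:
    """Stands in for the real business handlers further down the chain."""

    def __init__(self):
        self.seen = []

    def handle(self, message):
        self.seen.append(message)
        return True


def dispatch(message, handlers):
    """Run handlers in order and stop at the first one that returns False."""
    for handler in handlers:
        if not handler.handle(message):
            return False
    return True


gate, business = GateHandler(worker_threads=0), RecordingHandler()
dispatch("order-1", [gate, business])   # intercepted, nothing recorded
gate.worker_threads = 2
dispatch("order-2", [gate, business])   # passes through the gate
print(business.seen)
```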