Pika/RabbitMQ: Correct usage of add_backpressure_callback

I am new to using RabbitMQ and Pika, so please excuse me if the answer is obvious...
We are feeding in some data and passing the results into our RabbitMQ message queue. The queue is consumed by a process that writes the data into Elasticsearch.
The data is being produced faster than it can be fed into Elasticsearch, so the queue grows and almost never shrinks.
We are using pika and getting the warning:
UserWarning: Pika: Write buffer exceeded warning threshold at X bytes and an estimated X frames behind.
This continues for some time until Pika simply crashes with a strange error message:
NameError: global name 'log' is not defined
We are using the Pika BlockingConnection object (http://pika.github.com/connecting.html#blockingconnection).
My plan to fix this is to use the add_backpressure_callback function to register a function that calls time.sleep(0.5) every time we need to apply back-pressure (see the sketch below). However, this seems too simple a solution, and there must be a more appropriate way of dealing with something like this.
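Roughly what I have in mind (a sketch only; the connection parameters are placeholders, and since I'm not sure exactly what arguments Pika passes to the callback, I accept anything with *args):

```python
import time
import pika

def on_backpressure(*args):
    # Crude throttle: pause the producer briefly so Pika's write
    # buffer can drain. (*args because I'm unsure of the exact
    # callback signature in my Pika version.)
    time.sleep(0.5)

# Placeholder connection parameters.
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
connection.add_backpressure_callback(on_backpressure)

channel = connection.channel()
channel.basic_publish(exchange='', routing_key='my_queue', body='payload')
```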
I would guess it is a common situation for a queue to be populated faster than it is consumed. I am looking for an example, or even some advice, as to the best way of slowing down the producer.
Thanks!

Interesting problem, and as you rightly point out it is probably quite common. I saw another related question on Stack Overflow with some pointers:
Pika: Write buffer exceeded warning
Additionally, you may want to consider scaling up your Elasticsearch cluster, as that is perhaps the fundamental bottleneck you want to fix. A quick look at the elasticsearch.org website turned up:
"Distributed
One of the main features of Elastic Search is its distributed nature. Indices are broken down into shards, each shard with 0 or more replicas. Each data node within the cluster hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically and behind the scenes.
"
(...although I'm not sure whether insertion is also distributed and scalable)
After all, RabbitMQ is not supposed to grow queues infinitely. You may also want to look at scaling up RabbitMQ itself, for example by using things like per-queue processes in the RabbitMQ configuration.
Cheers!

Related

Tracing only error traces in Jaeger

I was actually trying to sample only the error traces in my application, but I already have a probabilistic sampler parameter set, which samples the span at the beginning itself; the rest of the spans then follow that same decision. I tried using the force-sampling option in Jaeger, but it doesn't seem to override the original decision, made at the initial span, about whether to sample or not. Kindly help me out here.
Jaeger clients implement so-called head-based sampling, where a sampling decision is made at the root of the call tree and propagated down the tree along with the trace context. This is done to guarantee consistent sampling of all spans of a given trace (or none of them), because we don't want to make the coin flip at every node and end up with partial/broken traces.
Implementing on-error sampling in a head-based sampling system is not really possible. Imagine that your service calls service A, which returns successfully, and then service B, which returns an error. Let's assume the root of the trace was not sampled (because otherwise you'd catch the error normally). That means by the time you know of the error from B, the whole sub-tree at A has already been executed and all of its spans discarded because of the earlier decision not to sample. The sub-tree at B has also finished executing. The only thing you can sample at this point is the spans in the current service.
You could also implement reverse propagation of the sampling decision via the response to your caller. So in the best case you could end up with a sub-branch of the whole trace sampled, plus possible future branches if the trace continues from above (e.g. via retries). But you can never capture the full trace, and sometimes the reason B failed was that A (successfully) returned some data that caused the error later.
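For context, here is roughly what the head-based setup looks like with the Python jaeger_client (a sketch; the service name and sampling rate are placeholders). The sampler is consulted once, when the root span is started, and the decision then travels with the trace context:

```python
from jaeger_client import Config

config = Config(
    config={
        # The decision is made here, once per trace, at the root span.
        'sampler': {'type': 'probabilistic', 'param': 0.01},  # ~1% of traces
    },
    service_name='my-service',  # placeholder
)
tracer = config.initialize_tracer()
```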
Note that reverse propagation is not supported by OpenTracing or OpenTelemetry today, but it has been discussed in recent meetings of the W3C Trace Context working group.
The alternative way to implement sampling is tail-based sampling, a technique employed by some of the commercial vendors today, such as Lightstep and DataDog. It is also on the roadmap for Jaeger (we're working on it right now at Uber). With tail-based sampling, 100% of spans are captured from the application, but they are only stored in memory in a collection tier until the full trace is gathered and a sampling decision is made. The decision-making code has a lot more information at that point, including errors, unusual latencies, etc. If we decide to sample the trace, only then does it go to disk storage; otherwise we evict it from memory, so that we only need to keep spans in memory for a few seconds on average. Tail-based sampling imposes a heavier performance penalty on the traced applications because 100% of traffic needs to be profiled by the tracing instrumentation.
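To illustrate the collector-side idea, here is a toy sketch (the span fields, the 5-second completion window, the latency threshold, and store() are all hypothetical):

```python
import time
from collections import defaultdict

buffered = defaultdict(list)  # trace_id -> spans held in memory
deadline = {}                 # trace_id -> time we assume the trace is complete

def on_span(span):
    # 100% of spans arrive here; nothing is written to storage yet.
    buffered[span.trace_id].append(span)
    deadline[span.trace_id] = time.time() + 5.0  # wait for stragglers

def flush():
    now = time.time()
    for trace_id in [t for t, d in deadline.items() if d < now]:
        spans = buffered.pop(trace_id)
        del deadline[trace_id]
        # Decide with full information: errors, unusual latencies, etc.
        if any(s.error for s in spans) or max(s.duration for s in spans) > 1.0:
            store(spans)  # hypothetical write to disk storage
        # otherwise the trace is simply evicted from memory
```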
You can read more about head-based and tail-based sampling either in Chapter 3 of my book (https://www.shkuro.com/books/2019-mastering-distributed-tracing/) or in the awesome paper "So, you want to trace your distributed system? Key design insights from years of practical experience" by Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, Gregory R. Ganger (http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf).

How does a proposer know its proposal was not approved by a quorum of acceptors?

I am reading the "Paxos" article on Wikipedia, and it reads:
"Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number."
But I don't understand how the proposer can tell the difference between its proposal not being approved and the responses simply taking more time to arrive.
One of the tricky parts to understanding Paxos is that the original paper and most others, including the wiki, do not describe a full protocol capable of real-world use. They only focus on the algorithmic necessities. For example, they say that a proposer must choose a number "n" higher than any previously used number. But they say nothing about how to actually go about doing that, the kinds of failures that can happen, or how to resolve the situation if two proposers simultaneously try to use the same proposal number (as in both choosing n=2). That actually completely breaks the protocol and would lead to incorrect results but I'm not sure I've ever seen that specifically called out. I guess it's just supposed to be "obvious".
Specifically to your question, there's no perfect way to tell the difference using the raw algorithm. Practical implementations typically go the extra mile by sending a Nack message to the Proposer rather than just silently ignoring it. There are plenty of other tricks that can be used but all of them, including the nacks, come with varying downsides. Which approach is best generally depends on both the kind of application employing Paxos and the environment it's intended to run in.
If you're interested, I put together a much longer-winded description of Paxos that includes many of the issues practical implementations must address in addition to the core components. It covers this issue along with several others.
Specific to your question, it isn't possible for a proposer to distinguish between lost messages, delayed messages, crashed acceptors, or stalled acceptors: in each case it gets no response. Typically an implementation will time out when it receives less than a quorum of responses and resend the proposal, on the assumption that messages were dropped or acceptors are rebooting.
Often implementations add "nack" messages as negative acknowledgements, as an optimisation to speed up recovery. The proposer then gets "nack" responses from reachable nodes that have made a higher promise. The "nack" can carry both the highest promise and the highest instance known to be fixed. How this helps is outlined below.
I wrote an implementation of Paxos called TRex that uses some of these techniques while sticking as closely as possible to the description of the algorithm in the paper Paxos Made Simple. I wrote up a description of the practical considerations of timeouts and nacks in a blog post.
One of the interesting techniques it uses is for a timed-out node to make its first proposal with a very low number. This will always draw "nack" messages. Why? Consider a three-node cluster where the network link breaks between a stable proposer and one other node. The other node will time out and issue a prepare. If it issues a high prepare, it will get a promise from the third node, which interrupts the stable leader. You then have a symmetric situation where the two nodes that cannot message one another fight over the leadership, which swaps back and forth with no forward progress.
To avoid this, a timed-out node can start with a low prepare. It can then inspect the "nack" messages to learn from the third node that there is a leader making progress: the highest instance known to be fixed in the nack will be greater than its local value. The timed-out node can then refrain from issuing a high prepare and instead ask the third node to send it the latest fixed and accepted values. With that enhancement, a timed-out node can distinguish between a stable proposer crashing and a connection failing. Such nack-based techniques don't affect the correctness of the implementation; they are only an optimisation to ensure fast failover and forward progress.
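To make the timeout-and-nack pattern concrete, here is a rough Python sketch of a single prepare round (poll_messages, the message fields, and the cluster size are all hypothetical; persistence, instance numbers, and the accept phase are omitted):

```python
import time

QUORUM = 2  # majority of a hypothetical three-node cluster

def prepare_round(acceptors, ballot, timeout=1.0):
    """Send prepare(ballot) and wait for a quorum of promises,
    using nacks to pick a better ballot on retry."""
    for acceptor in acceptors:
        acceptor.send_prepare(ballot)  # hypothetical messaging API

    promises, highest_nack = 0, 0
    end = time.time() + timeout
    while time.time() < end and promises < QUORUM:
        msg = poll_messages()  # hypothetical non-blocking receive
        if msg is None:
            continue
        if msg.kind == 'promise' and msg.ballot == ballot:
            promises += 1
        elif msg.kind == 'nack':
            # A reachable node has made a higher promise; remember it.
            highest_nack = max(highest_nack, msg.promised)

    if promises >= QUORUM:
        return ballot  # proceed to the accept phase

    # No quorum: lost messages, crashed acceptors, and slow acceptors
    # all look identical, so retry with a ballot above anything seen.
    return prepare_round(acceptors, max(ballot, highest_nack) + 1, timeout)
```

(A timed-out node could call this first with a deliberately low ballot purely to harvest nacks, as described above.)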

Optimization techniques

In the above diagram, we have a producer and a consumer. The producer takes about 1 unit of time to produce something, and the consumer takes about 9 units of time (4 to read and compute the data and 5 to write it back to the database). From a design standpoint, what are my options to ensure that the consumer does not start lagging behind? What can I do (like caching, or ensuring proper indexing in the DB) to make this better?
I don't know the hidden details of what your system is exactly like, but the suggestion that instantly popped into my mind is to create multiple threads for both the consumer and the producer, and use a thread pool to reuse the threads. You should create more consumer threads than producer threads, since the consumer is slow and the control flow is synchronous. Try tuning the ratio of consumer to producer threads so that some consumer threads are always available to consume the events created by the producer threads instantly.
Again, I don't know the exact requirements. For example, using multiple threads will affect the order in which the event streams are processed, which can result in inconsistency. So if you don't require the events to be processed and persisted in the exact order they arrive, you can certainly boost performance through parallelization (using a thread pool).
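For illustration, a minimal Python sketch of the thread-pool idea (handle() stands in for your real ~9-unit read/compute/write step, and the sizes are placeholders):

```python
import queue
from concurrent.futures import ThreadPoolExecutor

N_CONSUMERS = 9                    # roughly matching the 1:9 cost ratio
events = queue.Queue(maxsize=100)  # bounded queue gives natural backpressure

def produce(n_items):
    for i in range(n_items):
        events.put(f"event-{i}")   # stand-in for the 1-unit production step
    for _ in range(N_CONSUMERS):
        events.put(None)           # one shutdown sentinel per consumer

def consume():
    while True:
        event = events.get()
        if event is None:
            break
        handle(event)              # hypothetical 9-unit read/compute/write

with ThreadPoolExecutor(max_workers=1 + N_CONSUMERS) as pool:
    pool.submit(produce, 1000)
    for _ in range(N_CONSUMERS):
        pool.submit(consume)
```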
Good luck!

Improve MPI program

I wrote an MPI program that seems to run OK, but I wonder about performance. The master thread needs to call MPI_Send 10 or more times, and the worker receives data 10 or more times and sends it on. I wonder if this carries a performance penalty, and whether I could transfer everything in a single struct, or what other techniques I could benefit from.
As a more general question: once an MPI program more or less works, what are the best optimization techniques?
It's usually the case that sending 1 large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by a latency term (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc.) plus a bandwidth term (how much longer it takes to send each extra byte once the communication has started). By bundling messages up into one message, you only incur the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try. Note that MPI datatypes give you very powerful ways to describe the layout of your data in memory, so that you can take it almost directly from memory to the network without an intermediate copy into some buffer (so-called "marshalling" of the data).
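To illustrate the latency argument, here is a sketch in Python with mpi4py (your code may be C or Fortran, but the trade-off is the same; run with two ranks, e.g. mpiexec -n 2):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
chunks = [np.full(100, i, dtype='d') for i in range(10)]

if rank == 0:
    # Ten small messages: pays the latency cost ten times.
    for i, chunk in enumerate(chunks):
        comm.Send(chunk, dest=1, tag=i)
    # One bundled message: pays the latency cost once.
    comm.Send(np.concatenate(chunks), dest=1, tag=99)
elif rank == 1:
    small = np.empty(100, dtype='d')
    for i in range(10):
        comm.Recv(small, source=0, tag=i)
    big = np.empty(1000, dtype='d')
    comm.Recv(big, source=0, tag=99)
```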
As to more general optimization questions about MPI: without knowing more, all we can do is give advice so general as to not be very useful. Minimize the amount of communication that needs to be done, and wherever possible use built-in MPI tools (collectives, etc.) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions for mid-range applications (say, a few tens of thousands of lines of C or Fortran), and it can be paired with visualization tools that help you fully understand what is going on in your application and the actual performance trade-offs you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI

ZooKeeper and RabbitMQ/Qpid together - overkill or a good combination?

Greetings,
I'm evaluating some components for a multi-data center distributed system. We're going to be using message queues (via either RabbitMQ or Qpid) so agents can make asynchronous requests to other agents without worrying about addressing, routing, load balancing or retransmission.
In many cases, the agents will be interacting with components that were not designed for highly concurrent access, so locking and cross-agent coordination will be needed to avoid race conditions. Also, we'd like the system to automatically respond to agent or data center failures.
With the above use cases in mind, ZooKeeper seemed like it might be a good fit. But I'm wondering if trying to use both ZK and message queuing is overkill. It seems like what Zookeeper does could be accomplished by my own cluster manager using AMQP messaging, but that would be hard to get really right. On the other hand, I've seen some examples where ZooKeeper was used to implement message queuing, but I think RabbitMQ/Qpid are a more natural fit for that.
Has anyone out there used a combination like this?
Thanks in advance,
-Chris
Coming into this late, but maybe it will be of some use. The primary consideration should be the performance characteristics of your system. ZooKeeper, as you said, is more than capable of implementing a task-distribution system using a distributed queue, but ZooKeeper is currently more optimized for reads than for writes (this only comes into play in the thousands-of-operations-per-second range). If your throughput needs are below that, then using just ZooKeeper to implement your system would reduce the number of runtime components and keep it simpler. Of course, you should always run performance tests before deciding.
Distributed coordination is really hard to get right, so I would definitely recommend using ZooKeeper for that rather than rolling your own.
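For the locking use case specifically, the client recipes are short. A sketch with the Python kazoo library (the ensemble address, lock path, and critical section are placeholders):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')  # placeholder ensemble address
zk.start()

# Distributed lock around a component not designed for concurrent
# access; blocks until this agent holds the lock.
lock = zk.Lock('/locks/legacy-component', identifier='agent-1')
with lock:
    do_exclusive_work()  # placeholder critical section

zk.stop()
```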
I'm not quite sure what ZooKeeper exactly is, but I would guess that using a component from Apache (if it fits your needs well) is preferable to managing things like distributed synchronization and group services on your own. You could of course hire a team of developers especially for that purpose, but that doesn't guarantee you a better implementation.
I would guess it would be implemented as a separate component anyway, because the other way could add a lot of complexity and slow down the workflow; so the preference for ZooKeeper or something similar is kind of obvious (to me).
And surely, unless you're in the global optimization phase of your project, I would guess it's better to use RabbitMQ or the like (I would even stress that, because implementations of AMQP, especially commercial ones, are likely more reliable than anything you'd come up with yourself).
So I would go for both, carefully choosing the appropriate third-party products, but using only as much of them as is needed. That's just my opinion; thanks for reading :)