Handling of pubsub subscribers for distributed longrunning tasks - redis

I am evaluating the use of using pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2-10 minutes. Is pubsub a good approach for such a task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?

This does sound like a reasonable use case for pubsub. Specifically, if you use a pull subscriber, you can configure flow control settings to have at most one outstanding message to your server, and configure the max ack extension period (in java) to be a reasonable upper bound of your processing time. This api is described here http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if you use the same subscriber id for all jobs. If a server is added and backlog exists, it will receive a new entry. If a server is removed, it will no longer be sent messages. If it removed while processing or crashes, the message it was working on will be resent to another server.
One concern however is that pubsub has a limit of 10MB per message. You might consider instead putting the data itself in a google cloud storage bucket. Cloud storage can publish the file location to a pubsub topic when an upload is complete. https://cloud.google.com/storage/docs/pubsub-notifications

Related

Kafka Streams Fault Tolerance with Offset Management in Parellel

Description :
I have one Kafka Stream application which is consuming from a topic.
The events are coming at high volumes.
KafkaStream will consume the events as a terminal operation and club the events in a bunch say 1000 events and writes it to AWS S3.
I have threads that are writing to s3 in parallel after consuming events from Kafka topic.
Not using kafka-connector-s3 due to some business application logics and processings.
Problem ::
I want the application to be fault-tolerant don't want to loose messages.
--> CRASH SCENARIO
Suppose the application has 10 threads all are running and trying to put the events in S3, and a crash happens, in that case, since the KafkaStream has ( enable.auto.commit = false )and we cannot commit the offset manually and all the threads have consumed messages from Kafka topic.
In this case, KafkaStreams has already committed the offset after reading but it could not have processed the events to S3.
I need a mechanism so that I can be sure of that what was the last offset till the events get written to the S3 file successfully.
And In crash scenarios, how should I deal with this and how to manage the Kafka offsets in Kafka Streams as I am using say 10 threads. What if some failed to write to s3 and some are passed. How do I ensure the ordering of offset getting successfully processed to s3 or not?
Let me know if I am not clear to describe my problem statement.
Thanks!
I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams will automatically commit in more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you would at least get at-least-once guarantees and not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means, Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval elapsed -- it commits the offsets.
You say
Not using kafka-connector-s3 due to some business application logics and processings.
Could you put the business application logic in the Kafka Streams application, write the results to a Kafka topic with operator to(), and then use the kafka-connector-s3 to send the messages in that topic to S3?
I am not a connect expert, but I guess that would make sure that messages are not lost and would make your implementation simpler.
Using kafka-stream ,you could aggragate 5000 messages from source topic to one big message and send the big one to another topic like middle_topic. You need another proceccor source from the middle_topic and sink to s3 using s3-connector.

To be sure about concurrency, same group of works in multiple queues (FIFO)

I have a question about multi consumer concurrency.
I want to send works to rabbitmq that comes from web request to distributed queues.
I just want to be sure about order of works in multiple queues (FIFO).
Because this request comes from different users eech user requests/works must be ordered.
I have found this feature with different names on Azure ServiceBus and ActiveMQ message grouping.
Is there any way to do this in pretty RabbitMQ ?
I want to quaranty that customer's requests must be ordered each other.
Each customer may have multiple requests but those requests for that customer must be processed in order.
I desire to process quickly incoming requests with using multiple consumer on different nodes.
For example different customers 1 to 1000 send requests over 1 millions.
If I put this huge request in only one queue it takes a lot of time to consume. So I want to share this process load between n (5) node. For customer X 's requests must be in same sequence for processing
When working with event-based systems, and especially when using multiple producers and/or consumers, it is important to come to terms with the fact that there usually is no such thing as a guaranteed order of events. And to get a robust system, it is also wise to design the system so the message handlers are idempotent; they should tolerate to get the same message twice (or more).
There are way to many things that may (and actually should be allowed to) interfere with the order;
The producers may deliver the messages in a slightly different pace
One producer might miss an ack (due to a missed package) and will resend the message
One consumer may get and process a message, but the ack is lost on the way back, so the message is delivered twice (to another consumer).
Some other service that your handlers depend on might be down, so that you have to reject the message.
That being said, there is one pattern that servicebus-systems like NServicebus use to enforce the order messages are consumed. There are some requirements:
You will need a centralized storage (like a sql-server or document store) that allows for conditional updates; for instance you want to be able to store the sequence number of the last processed message (or how far you have come in the process), but only if the already stored sequence/progress is the right/expected one. Storing the user-id and the progress even for millions of customers should be a very easy operation for most databases.
You make sure the queue is configured with a dead-letter-queue/exchange for retries, and then set your original queue as a dead-letter-queue for that one again.
You set a TTL (for instance 30 seconds) on the retry/dead-letter-queue. This way the messages that appear on the dead-letter-queue will automatically be pushed back to your original queue after some timeout.
When processing your messages you check your storage/database if you are in the right state to handle the message (i.e. the needed previous steps are already done).
If you are ok to handle it you do and update the storage (conditionally!).
If not - you nack the message, so that it is thrown on the dead-letter queue. Basically you are saying "nah - I can't handle this message, there are probably some other message in the queue that should be handled first".
This way the happy-path is to process a great number of messages in the right order.
But if something happens and a you get a message out of band, you will throw it on the retry-queue (the dead-letter-queue) and Rabbit will make sure it will get back in the queue to be retried at a later stage. But only after a delay.
The beauty of this is that you are able to handle most of the situations that may interfere with processing the message (out of order messages, dependent services being down, your handler being shut down in the middle of handling the message) in exact the same way; by rejecting the message and letting your infrastructure (Rabbit) take care of it being retried after a while.
(Assuming the OP is asking about things like ActiveMQs "message grouping:)
This isn't currently built in to RabbitMQ AFAIK (it wasn't as of 2013 as per this answer) and I'm not aware of it now (though I haven't kept up lately).
However, RabbitMQ's model of exchanges and queues is very flexible - exchanges and queues can be easily created dynamically (this can be done in other messaging systems but, for example, if you read ActiveMQ documentation or Red Hat AMQ documentation you'll find all of the examples in the user guides are using pre-declared queues in configuration files loaded at system startup - except for RPC-like request/response communication).
Also it is very easy in RabbitMQ for a consumer (i.e., message consuming thread) to consume from multiple queues.
So you could build, on top of RabbitMQ, a system where you got your desired grouping semantics.
One way would be to create dynamic queues: The first time a customer order was seen or a new group of customer orders a queue would be created with a unique name for all messages for that group - that queue name would be communicated (via another queue) to a consumer who's sole purpose was to load-balance among other consumers that were responsible for handling customer order groups. I.e., the load-balancer would pull off of its queue a message saying "new group with queue name XYZ" and it would find in a pool of order group consumer a consumer which could take this load and pass it a message saying "start listening to XYZ".
Another way to do it is with pub/sub and topic routing - each customer order group would get a unique topic - and proceed as above.
RabbitMQ Consistent Hash Exchange Type
We are using RabbitMQ and we have found a plugin. It use Consistent Hashing algorithm to distribute messages in order to consistent keys.
For more information about Consistent Hashing ;
https://en.wikipedia.org/wiki/Consistent_hashing
https://www.youtube.com/watch?v=viaNG1zyx1g
You can find this plugin from rabbitmq web page
plugin : rabbitmq_consistent_hash_exchange
https://www.rabbitmq.com/plugins.html

How do I monitor RabbitMQ exchange lifecycle events

I'm working with a product suite which uses RabbitMQ as a back end for service bus messaging. Many of the clients use software (NeuronESB) which is supposed to automatically configure exchanges, queues and channels as needed. Somewhere in the system exchanges in Rabbit are being deleted and not re-created, resulting in unexpected issues. Because of the size of the system and closed source nature of at least one of the service bus clients, an audit of code has been unsuccessful in determining the source of the deletion of these exchanges.
I have tried using the firehose functionality of Rabbit, but that only provides the messages being sent through Rabbit, not the internal activities I need.
What methods are available for logging the creation and deletion of exchanges in RabbitMQ? Ideally I would like to know the date, time and client IP of the deleter, but even just getting the date and time would allow me to narrow my search of logs to help find the offender.
Try Events Exchange plugin that should do the trick.
If not working for some reason, the last resort I can think of:
Get a test environment with less clients/messages if you app is busy, then analyse your traffic with wireshark (it can understand amqp) to filter out requests to delete exchange.

How to load balancing ActiveMQ with persistent message

I have a middleware based on Apache Camel which does a transaction like this:
from("amq:job-input")
to("inOut:businessInvoker-one") // Into business processor
to("inOut:businessInvoker-two")
to("amq:job-out");
Currently it works perfectly. But I can't scale it up, let say from 100 TPS to 500 TPS. I already
Raised the concurrent consumers settings and used empty businessProcessor
Configured JAVA_XMX and PERMGEN
to speed up the transaction.
According to Active MQ web Console, there are so many messages waiting for being processed on scenario 500TPS. I guess, one of the solution is scale the ActiveMQ up. So I want to use multiple brokers in cluster.
According to http://fuse.fusesource.org/mq/docs/mq-fabric.html (Section "Topologies"), configuring ActiveMQ in clustering mode is suitable for non-persistent message. IMHO, it is true that it's not suitable, because all running brokers use the same store file. But, what about separating the store file? Now it's possible right?
Could anybody explain this? If it's not possible, what is the best way to load balance persistent message?
Thanks
You can share the load of persistent messages by creating 2 master/slave pairs. The master and slave share their state either though a database or a shared filesystem so you need to duplicate that setup.
Create 2 master slave pairs, and configure so called "network connectors" between the 2 pairs. This will double your performance without risk of loosing messages.
See http://activemq.apache.org/networks-of-brokers.html
This answer relates to an version of the question before the Camel details were added.
It is not immediately clear what exactly it is that you want to load balance and why. Messages across consumers? Producers across brokers? What sort of concern are you trying to address?
In general you should avoid using networks of brokers unless you are trying to address some sort of geographical use case, have too many connections for a signle broker to handle, or if a single broker (which could be a pair of brokers configured in HA) is not giving you the throughput that you require (in 90% of cases it will).
In a broker network, each node has its own store and passes messages around by way of a mechanism called store-and-forward. Have a read of Understanding broker networks for an explanation of how this works.
ActiveMQ already works as a kind of load balancer by distributing messages evenly in a round-robin fashion among the subscribers on a queue. So if you have 2 subscribers on a queue, and send it a stream of messages A,B,C,D; one subcriber will receive A & C, while the other receives B & D.
If you want to take this a step further and group related messages on a queue so that they are processed consistently by only one subscriber, you should consider Message Groups.
Adding consumers might help to a point (depends on the number of cores/cpus your server has). Adding threads beyond the point your "Camel server" is utilizing all available CPU for the business processing makes no sense and can be conter productive.
Adding more ActiveMQ machines is probably needed. You can use an ActiveMQ "network" to communicate between instances that has separated persistence files. It should be straight forward to add more brokers and put them into a network.
Make sure you performance test along the road to make sure what kind of load the broker can handle and what load the camel processor can handle (if at different machines).
When you do persistent messaging - you likely also want transactions. Make sure you are using them.
If all running brokers use the same store file or tx-supported database for persistence, then only the first broker to start will be active, while others are in standby mode until the first one loses its lock.
If you want to loadbalance your persistence, there were two way that we could try to do:
configure several brokers in network-bridge mode, then send messages
to any one and consumer messages from more than one of them. it can
loadbalance the brokers and loadbalance the persistences.
override the persistenceAdapter and use the database-sharding middleware
(such as tddl:https://github.com/alibaba/tb_tddl) to store the
messages by partitions.
Your first step is to increase the number of workers that are processing from ActiveMQ. The way to do this is to add the ?concurrentConsumers=10 attribute to the starting URI. The default behaviour is that only one thread consumes from that endpoint, leading to a pile up of messages in ActiveMQ. Adding more brokers won't help.
Secondly what you appear to be doing could benefit from a Staged Event-Driven Architecture (SEDA). In a SEDA, processing is broken down into a number of stages which can have different numbers of consumer on them to even out throughput. Your threads consuming from ActiveMQ only do one step of the process, hand off the Exchange to the next phase and go back to pulling messages from the input queue.
You route can therefore be rewritten as 2 smaller routes:
from("activemq:input?concurrentConsumers=10").id("FirstPhase")
.process(businessInvokerOne)
.to("seda:invokeSecondProcess");
from("seda:invokeSecondProcess?concurentConsumers=20").id("SecondPhase")
.process(businessInvokerTwo)
.to("activemq:output");
The two stages can have different numbers of concurrent consumers so that the rate of message consumption from the input queue matches the rate of output. This is useful if one of the invokers is much slower than another.
The seda: endpoint can be replaced with another intermediate activemq: endpoint if you want message persistence.
Finally to increase throughput, you can focus on making the processing itself faster, by profiling the invokers themselves and optimising that code.

How to handle long asynchronous requests with pyramid and celery?

I'm setting up a web service with pyramid. A typical request for a view will be very long, about 15 min to finish. So my idea was to queue jobs with celery and a rabbitmq broker.
I would like to know what would be the best way to ensure that bad things cannot happen.
Specifically I would like to prevent the task queue from overflow for example.
A first mesure will be defining quotas per IP, to limit the number of requests a given IP can submit per hour.
However I cannot predict the number of involved IPs, so this cannot solve everything.
I have read that it's not possible to limit the queue size with celery/rabbitmq. I was thinking of retrieving the queue size before pushing a new item into it but I'm not sure if it's a good idea.
I'm not used to good practices in messaging/job scheduling. Is there a recommended way to handle this kind of problems ?
RabbitMQ has flow control built into the QoS. If RabbitMQ cannot handle the publishing rate it will adjust the TCP window size to slow down the publishers. In the event of too many messages being sent to the server it will also overflow to disk. This will allow your consumer to be a bit more naive although if you restart the connection on error and flood the connection you can cause problems.
I've always decided to spend more time making sure the publishers/consumers could work with multiple queue servers instead of trying to make them more intelligent about a single queue server. The benefit is that if you are really overloading a single server you can just add another one (or another pair if using RabbitMQ HA. There is a useful video from Pycon about Messaging at Scale using Celery and RabbitMQ that should be of use.