I have streams of input data that I process. Each stream is sent in chunks of data. I can only process the N+1st chunk of data of a stream i after I finished processing the Nth chunk of data of the same stream i. Therefore, parallelization can happen by processing multiple streams at once, but I can never split one stream on multiple workers.
Chunks of one stream are added to the queue in order (although chunks from several streams can be added at the same time).
Most message queues, like RabbitMQ, guarantee ordered delivery when multiple workers operate on one queue. However, to achieve the behaviour I would like, I'd need to restrict the number of workers to 1 for each queue, so that the next chunk is always only processed when the previous chunk was finished. To parallelize, I could create a queue for each stream, or a queue for each worker, and have another process that redirects the streams to the worker queues. In fact, the one-queue-per-worker approach is what I do right now, using RabbitMQ's consistent-hashing and shovels. Of course, in terms of load balancing and dynamic scaling of the number of workers, that is far from ideal.
I've read a lot about Kafka, and how it is designed for time-series data (like logs). Yet, I couldn't figure out how I could apply Kafka - or any other message queue out there - to solve my problem.
I would greatly appreciate some hints on how to best use a message queue for my problem.
You could use Kafka, but you'd have to use some stream identification to hash messages on the Producer side, so that messages from one stream always go to the same partition.
Then, on the Consumer side, you'd have to use the low-level consumer to spawn as much consuming threads as you have partitions, where each thread would consume from a single partition.
That would mean that you always process messages in order within each of your streams.
I haven't yet checked out how Kafka 0.9 Producer works, but there were some changes, so you should probably look into those if you want to use the latest version.
Why don't you push the next chunk only after receiving the delivery acknowledgement of the former chunk to the worker? Or some kind of a flag that the former chuck is processed by the worker, flag is set to true & then push the next chunk.
If you need to parallelize work create several queues with unique routing keys, based on routing keys push the chunks to respective queues. And have separate flags for every routing key.
Related
Currently, I have a redis stream which have many entries produced. I want to process stream parallelly but make sure each entry is processed only once. I search the official document about redis stream. It seems that consumer group is one solution.
So I try to create one consumer group and let multi consumer in that group consume the same stream parallelly(maybe multi consumer instances on different servers or multi threads in the same server).
Can Redis consumer group guarantee that multi-consumers running parallelly consume different exclusive subset from the same stream to make sure each entry is processed only once?
If it can be guaranteed, for each consumer, when reading from stream, is xreadgroup group mygroup consumer1 [count 1000] streams mystream > enough?
Yes. Each consumer in the group reads a mutually-exclusive subset of the stream's entries. Each message will be handled by a single consumer - the one that had read it - unless it is XCLAIMed.
Only once is an entirely different matter. It is up to the consumer to make sure of that.
Description :
I have one Kafka Stream application which is consuming from a topic.
The events are coming at high volumes.
KafkaStream will consume the events as a terminal operation and club the events in a bunch say 1000 events and writes it to AWS S3.
I have threads that are writing to s3 in parallel after consuming events from Kafka topic.
Not using kafka-connector-s3 due to some business application logics and processings.
Problem ::
I want the application to be fault-tolerant don't want to loose messages.
--> CRASH SCENARIO
Suppose the application has 10 threads all are running and trying to put the events in S3, and a crash happens, in that case, since the KafkaStream has ( enable.auto.commit = false )and we cannot commit the offset manually and all the threads have consumed messages from Kafka topic.
In this case, KafkaStreams has already committed the offset after reading but it could not have processed the events to S3.
I need a mechanism so that I can be sure of that what was the last offset till the events get written to the S3 file successfully.
And In crash scenarios, how should I deal with this and how to manage the Kafka offsets in Kafka Streams as I am using say 10 threads. What if some failed to write to s3 and some are passed. How do I ensure the ordering of offset getting successfully processed to s3 or not?
Let me know if I am not clear to describe my problem statement.
Thanks!
I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams will automatically commit in more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you would at least get at-least-once guarantees and not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means, Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval elapsed -- it commits the offsets.
You say
Not using kafka-connector-s3 due to some business application logics and processings.
Could you put the business application logic in the Kafka Streams application, write the results to a Kafka topic with operator to(), and then use the kafka-connector-s3 to send the messages in that topic to S3?
I am not a connect expert, but I guess that would make sure that messages are not lost and would make your implementation simpler.
Using kafka-stream ,you could aggragate 5000 messages from source topic to one big message and send the big one to another topic like middle_topic. You need another proceccor source from the middle_topic and sink to s3 using s3-connector.
Is there any way to make a RabbitMQ queue behave as a Stack, i.e. the client gets the last message that was posted in the queue (LIFO) rather than the first one? Or maybe alternatively make it a priority queue using a timestamp which the client could set?
RabbitMQ does support priority queues but the priority it allows is just a number up to 255 (recommended to use up to 10).
What I want to achieve is that the latest messages are processed first because they contain the latest information about the source. I still want to process the old messages, but in situations when the client cannot keep up (or there was some downtime and the client is recovering) I want to process the latest state information first.
The only solution I came up with so far is to use a TTL on the messages of the main queue and have them go to a dead letter queue when they expire, which is also processed by the client. However this is not so clean, and if the source of the message takes longer than the TTL to send a new status update, the latest state will be stuck in queue behind the other older expired messages still to be processed.
If it is not possible to achieve with RabbitMQ, is there any other recommended messaging framework that supports this requirement?
Kafka Log Compaction was created for exactly the use case you describe:
Log compaction ensures that Kafka will always retain at least the last
known value for each message key within the log of data for a single
topic partition. It addresses use cases and scenarios such as
restoring state after application crashes or system failure, or
reloading caches after application restarts during operational
maintenance. Let's dive into these use cases in more detail and then
describe how compaction works.
So, RabbitMQ is a queue, not a stack. It is specifically designed NOT to do what you are asking (a queue is always a first-in, first-out data structure).
However, there are options:
Presumably some process (e.g. a web service) exists between the client and the message server. This process could save the data off to an additional storage location (e.g. memcached) for immediate access of the latest value, thus leaving the queue untouched.
You could configure a secondary queue/service combination. When messages are published, they can then be routed to both queues. The first queue is for your heavy processing, and the second queue would be a service whose only task is to update the latest value in memcached or some other fast storage/retrieval system. Thus, message lifetime in this queue would presumably be much shorter.
You could implement multiple processing steps. The first step would be to update the current state (presumably a quick operation), after which the message is then re-published to the longer processing step's queue.
I'm implementing a bolt in Storm that receives messages from a RabbitMQ spout (https://github.com/ppat/storm-rabbitmq).
Each event I have to process in Storm arrives as two messages from Rabbit so I have a fieldsGrouping on the bolt so that the two messages arrive in the same bolt.
My first approach I would:
Receive the first Tuple and save the message in Memory
Ack the first tuple
When second Tuple arrived fetch the first from Memory and emit a new tuple anchored to the second from the spout.
This worked but I could loose messages if a worker died because I would ack the first tuple before getting the second and processing.
I changed this to:
Receive the first Tuple and save it in Memory
When second Tuple arrived fetch the first from Memory, emit a new tuple anchored to both input tuples and ack both input tuples.
The in-memory cache is a Guava cache with time expiration and when a Tuple is evicted due to timeout I will fail() it in the topology so that it gets reprocessed latter.
This seemed to work but when I did some tests I got to a situation where the system would stop fetching messages from the Rabbit Queue.
The prefetch on the queue is set to 5, and spout with setMaxSpoutPending at 7. In the Rabbit interface I see 5 Unacked messages.
In the storm logs I see the same Tuples being evicted from the cache over and over again.
I understand that the problem is that spout will only fetch 5 messages that are all the first part of a pair. I can increase the prefetch but that is no warranty that this will not happen in production.
So my question is: How to implement a join while handling these problems in Storm?
Storm does not provide a good solution for this... What you would need is a reliable storage that buffers the first tuple (ie, a stateful operator). Thus, you could ack the first tuple immediately and recover the state after a failure.
As far as I know, Trident supports some state handling. But I never used it.
As a second alternative, you could use a distributed key-value store (like Casandra) as buffer. Of course, this would be a hand-written solution, ie, you need to code all Casandra interactions by yourself.
Last but not least, you could switch to a stream processing system that does support stateful operators like Apache Flink. (disclaimer: I am a committer at Flink)
I'm new to RabbitMQ and I'm wondering how to implement the following: producer creates tasks for multiple sites, there's a bunch of consumers that should process these tasks one by one, but only talking to 1 site with concurrency of 1, without starting a new task for this site before the previous one ended. This way slow site would be processed slowly, and the fast ones - fast (as opposed by slow sites taking up all the worker capacity).
Ideally a site would be processed only by one worker at a time, being replaced by another worker if it dies. This seems like a task for exclusive queues, but apparently there's no easy way to list and subscribe to new queues. What is the proper way to achieve such results with RabbitMQ?
I think you may have things the wrong way round. For workers you have 1 or more producers sending to 1 exchange. The exchange has 1 queue (you can send directly to the queue, but all that is really doing is going via a default exchange, I prefer to be explicit). All consumers connect to the single queue and read off tasks in turn. You should set the queue to require messages to be ACKed before removing them. That way if a process dies it should be returned to the queue and picked up by the next consumer/worker.