Kafka Streams Fault Tolerance with Offset Management in Parallel - amazon-s3

Description:
I have a Kafka Streams application that consumes from a topic.
The events arrive at high volume.
As a terminal operation, the application collects the events into batches of, say, 1000 events and writes each batch to AWS S3.
Several threads write to S3 in parallel after the events have been consumed from the Kafka topic.
I am not using kafka-connector-s3 because of some business application logic and processing.
Problem:
I want the application to be fault-tolerant; I don't want to lose messages.
--> CRASH SCENARIO
Suppose the application has 10 threads, all running and trying to put events into S3, and a crash happens. Kafka Streams has enable.auto.commit = false and we cannot commit offsets manually, yet all the threads have already consumed messages from the Kafka topic.
In this case, Kafka Streams has already committed the offsets after reading, but the events may not yet have been written to S3.
I need a mechanism that tells me the last offset up to which events have been written to the S3 file successfully.
In crash scenarios, how should I deal with this, and how do I manage the Kafka offsets in Kafka Streams when I am using, say, 10 threads? What if some threads fail to write to S3 and others succeed? How do I keep track, in order, of which offsets have been successfully written to S3?
Let me know if my problem statement is not clear.
Thanks!

I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams will automatically commit in more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you would at least get at-least-once guarantees and not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means, Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval elapsed -- it commits the offsets.
You say
Not using kafka-connector-s3 due to some business application logics and processings.
Could you put the business application logic in the Kafka Streams application, write the results to a Kafka topic with the to() operator, and then use kafka-connector-s3 to send the messages in that topic to S3?
I am not a Connect expert, but I guess that would make sure that messages are not lost and would make your implementation simpler.
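If the parallel S3 writers are kept, the question of "which offset is safe" comes down to tracking the highest offset below which every batch has been written. The following is a minimal Python sketch of that bookkeeping only; it is not Kafka API, and `OffsetTracker`, `mark_written`, and `committable` are hypothetical names chosen for illustration. It assumes batches cover contiguous offset ranges of a single partition.

```python
import threading


class OffsetTracker:
    """Tracks written batches and exposes the offset that is safe to commit.

    Hypothetical helper: assumes batches are contiguous offset ranges
    of one partition, e.g. 0-999, 1000-1999, ...
    """

    def __init__(self, start_offset=0):
        self._lock = threading.Lock()
        self._next_to_commit = start_offset  # first offset not yet safe
        self._done = {}  # first offset of a finished batch -> its last offset

    def mark_written(self, first, last):
        """Called by a writer thread after its upload succeeded."""
        with self._lock:
            self._done[first] = last
            # Advance over every finished batch that is contiguous
            # with what is already safe.
            while self._next_to_commit in self._done:
                self._next_to_commit = self._done.pop(self._next_to_commit) + 1

    def committable(self):
        """Offset to resume from after a restart."""
        with self._lock:
            return self._next_to_commit
```

A batch that finishes out of order does not move the safe offset until all earlier batches are also written. After a crash, batches beyond that offset are read again, so the same events may be written to S3 twice; that is what at-least-once means here.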

Using Kafka Streams, you could aggregate 5000 messages from the source topic into one big message and send it to another topic, say middle_topic. You then need another processor that reads from middle_topic and sinks to S3 using the S3 connector.

Related

RabbitMQ support for LIFO or time based priority queue

Is there any way to make a RabbitMQ queue behave as a Stack, i.e. the client gets the last message that was posted in the queue (LIFO) rather than the first one? Or maybe alternatively make it a priority queue using a timestamp which the client could set?
RabbitMQ does support priority queues but the priority it allows is just a number up to 255 (recommended to use up to 10).
What I want to achieve is that the latest messages are processed first because they contain the latest information about the source. I still want to process the old messages, but in situations when the client cannot keep up (or there was some downtime and the client is recovering) I want to process the latest state information first.
The only solution I came up with so far is to use a TTL on the messages of the main queue and have them go to a dead letter queue when they expire, which is also processed by the client. However this is not so clean, and if the source of the message takes longer than the TTL to send a new status update, the latest state will be stuck in queue behind the other older expired messages still to be processed.
If it is not possible to achieve with RabbitMQ, is there any other recommended messaging framework that supports this requirement?
Kafka Log Compaction was created for exactly the use case you describe:
Log compaction ensures that Kafka will always retain at least the last
known value for each message key within the log of data for a single
topic partition. It addresses use cases and scenarios such as
restoring state after application crashes or system failure, or
reloading caches after application restarts during operational
maintenance. Let's dive into these use cases in more detail and then
describe how compaction works.
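As a rough illustration of the end state that the quoted passage describes (only the last known value per key survives), here is a small Python sketch. It models the result only and is not how Kafka implements compaction; the function name `compact` and the `(key, value)` record shape are assumptions made for the example.

```python
def compact(log):
    """Return the log with only the last record per key, in original order.

    `log` is a list of (key, value) tuples, oldest first.
    """
    # Remember the position of the last record seen for each key.
    last_index = {key: i for i, (key, _) in enumerate(log)}
    # Keep a record only if it is the last one for its key.
    return [rec for i, rec in enumerate(log) if last_index[rec[0]] == i]
```

For example, compacting updates for two sources keeps one record per source, each holding the most recent value.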
So, RabbitMQ is a queue, not a stack. It is specifically designed NOT to do what you are asking (a queue is always a first-in, first-out data structure).
However, there are options:
- Presumably some process (e.g. a web service) exists between the client and the message server. This process could save the data off to an additional storage location (e.g. memcached) for immediate access of the latest value, thus leaving the queue untouched.
- You could configure a secondary queue/service combination. When messages are published, they can then be routed to both queues. The first queue is for your heavy processing, and the second queue would be a service whose only task is to update the latest value in memcached or some other fast storage/retrieval system. Thus, message lifetime in this queue would presumably be much shorter.
- You could implement multiple processing steps. The first step would be to update the current state (presumably a quick operation), after which the message is then re-published to the longer processing step's queue.
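The last option (quick state update first, then re-publish for the long step) could be sketched as follows. This is a Python illustration under stated assumptions: the dictionary stands in for memcached or another fast store, the deque stands in for the second queue, and `handle_incoming` is a hypothetical name.

```python
from collections import deque

latest_state = {}     # stand-in for memcached or a similar fast store
slow_queue = deque()  # stand-in for the long-processing queue


def handle_incoming(source_id, timestamp, payload):
    """Step 1: cheap update of the current state, then re-publish."""
    current = latest_state.get(source_id)
    # Only overwrite if this message is at least as new as what is stored.
    if current is None or timestamp >= current[0]:
        latest_state[source_id] = (timestamp, payload)
    # Step 2 happens elsewhere: every message still goes to the slow queue.
    slow_queue.append((source_id, timestamp, payload))
```

The latest state is available immediately, while old messages are still processed in FIFO order by whatever consumes the slow queue.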

Handling of pubsub subscribers for distributed longrunning tasks

I am evaluating pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2 and 10 minutes. Is pubsub a good approach for such a task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?
This does sound like a reasonable use case for pubsub. Specifically, if you use a pull subscriber, you can configure flow control settings to have at most one outstanding message per server, and configure the max ack extension period (in Java) to be a reasonable upper bound of your processing time. This API is described here: http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if all of them pull from the same subscription. If a server is added and a backlog exists, it will start receiving messages. If a server is removed, it will no longer be sent messages. If it is removed or crashes while processing, the message it was working on will be resent to another server.
One concern, however, is that pubsub has a limit of 10MB per message. You might consider instead putting the data itself in a Google Cloud Storage bucket. Cloud Storage can publish the file location to a pubsub topic when an upload is complete: https://cloud.google.com/storage/docs/pubsub-notifications
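The delivery behaviour described in this answer (at most one outstanding message per server, redelivery when a server disappears) can be modelled with a small toy in Python. This is not the Pub/Sub client API; the `Subscription` class and its methods are hypothetical and only illustrate the semantics.

```python
from collections import deque


class Subscription:
    """Toy model of one subscription shared by several workers."""

    def __init__(self, messages):
        self._backlog = deque(messages)
        self._outstanding = {}  # worker -> message it is processing

    def pull(self, worker):
        """Hand out a message only if the worker has none outstanding."""
        if worker in self._outstanding or not self._backlog:
            return None
        msg = self._backlog.popleft()
        self._outstanding[worker] = msg
        return msg

    def ack(self, worker):
        """Worker finished its message; it will not be redelivered."""
        self._outstanding.pop(worker, None)

    def worker_lost(self, worker):
        """Worker crashed or was removed: its message is redelivered."""
        msg = self._outstanding.pop(worker, None)
        if msg is not None:
            self._backlog.appendleft(msg)
```

In this model a busy worker gets nothing new until it acks, so long transcodes spread across whichever servers are free.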

Inter process(service) communication without message queue

We want to develop an application based on micro services architecture.
To communicate between various services asynchronously, we plan to use message queues (like RabbitMQ, ActiveMQ, JMS, etc.). Is there any approach other than message queues for inter-process communication?
Thanks.
You should use queues for tasks that do not need to be completed in real time.
Append the tasks to a queue; when there is capacity, a processor takes a task from the queue, handles it, and removes it from the queue.
Example:
Suppose your application deals with images and users upload a lot of them. Put image-compression tasks in a queue, and when a processor is free it will compress the queued images.
When you want to write logs for your system, hand them to a queue and let one process take the logs from the queue and write them to disk. That way the main process does not spend its time on I/O operations.
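The logging example can be sketched in Python with the standard library's `queue.Queue` and a worker thread. This is an in-process illustration only, not a replacement for a message broker; `start_log_writer` is a hypothetical name and the `sink` list stands in for the disk write.

```python
import queue
import threading


def start_log_writer(sink):
    """Start a worker that drains a queue into `sink` (anything with append)."""
    q = queue.Queue()

    def run():
        while True:
            line = q.get()
            if line is None:  # sentinel value: stop the worker
                break
            sink.append(line)  # stand-in for the slow disk write

    t = threading.Thread(target=run, daemon=True)
    t.start()
    return q, t
```

The main code only calls `q.put(line)` and continues; putting `None` on the queue tells the worker to stop.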
Suggestion:
If you want real-time responses, you should not use a queue. You would need to poll the queue constantly to read incoming messages, which is bad practice, and there is no guarantee that the queue will handle your tasks immediately.
So the solutions are:
Redis cache - You can put your messages into the cache and another process will read them. Redis is an in-memory data structure store. It is very fast and easy to use, and because it is open source there are many libraries and good resources available on the Internet. Here too you need to keep checking whether a message is available and, if so, read it, process it, and respond, but reading from Redis is not very costly. With Redis you do not need to worry about memory management; it is well maintained by the open source community.
Sockets - Sockets are very fast, and you can keep this lightweight (if you want) because it is event based. One process sends to a port and the other process listens and responds. But you need to manage memory: if the buffer fills up, you cannot put more messages in it. If many users are producing messages, you also need to manage whom you want to respond to.
So it depends on your requirements: do you want to read messages constantly? Do you want one-to-one or many-to-one communication?

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will be used as my data lake. Doing a PUT call for every event is too costly, so I want to append to a file and then do one PUT call per hour. What is a way of doing this without losing data in case a node in my message processing service goes down?
If my processing service ACKs the message, the message will no longer be in Google Pub/Sub but also not yet in Google Cloud Storage; if the processing node goes down at that moment, I lose the data.
My desired usage is similar to this post that talks about using AWS Kinesis Firehose to batch messages before PUTing into S3, but even Kinesis Firehose's max batch interval is only 900 seconds (or 128MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.
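Both suggestions rest on the same rule: ack a message only after the batch containing it has been written. A minimal Python sketch of that rule follows. It does not use the Pub/Sub client library; `AckAfterWriteBatcher`, `write_batch`, and the per-message `ack` callback are hypothetical names, and the sketch assumes the real client keeps extending the ack deadline while messages wait.

```python
class AckAfterWriteBatcher:
    """Collects messages and acks them only after a successful batch write."""

    def __init__(self, write_batch, batch_size):
        self._write_batch = write_batch  # callable(list of payloads); raises on failure
        self._batch_size = batch_size
        self._pending = []  # (payload, ack callback) pairs not yet written

    def on_message(self, payload, ack):
        self._pending.append((payload, ack))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch = self._pending
        # If the write raises, nothing is acked and the batch stays pending.
        self._write_batch([payload for payload, _ in batch])
        self._pending = []
        for _, ack in batch:
            ack()
```

If the node dies before `flush` completes, none of the pending messages were acked, so they are delivered again and nothing is lost; the cost is possible duplicates in storage.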

message queue for processing streams of data

I have streams of input data that I process. Each stream is sent in chunks of data. I can only process the N+1st chunk of data of a stream i after I finished processing the Nth chunk of data of the same stream i. Therefore, parallelization can happen by processing multiple streams at once, but I can never split one stream on multiple workers.
Chunks of one stream are added to the queue in order (although chunks from several streams can be added at the same time).
Most message queues, like RabbitMQ, guarantee ordered delivery when multiple workers operate on one queue. However, to achieve the behaviour I would like, I'd need to restrict the number of workers to 1 for each queue, so that the next chunk is always only processed when the previous chunk was finished. To parallelize, I could create a queue for each stream, or a queue for each worker, and have another process that redirects the streams to the worker queues. In fact, the one-queue-per-worker approach is what I do right now, using RabbitMQ's consistent-hashing and shovels. Of course, in terms of load balancing and dynamic scaling of the number of workers, that is far from ideal.
I've read a lot about Kafka, and how it is designed for time-series data (like logs). Yet, I couldn't figure out how I could apply Kafka - or any other message queue out there - to solve my problem.
I would greatly appreciate some hints on how to best use a message queue for my problem.
You could use Kafka, but you'd have to use some stream identification to hash messages on the Producer side, so that messages from one stream always go to the same partition.
Then, on the Consumer side, you'd have to use the low-level consumer to spawn as many consumer threads as you have partitions, where each thread would consume from a single partition.
That would mean that you always process messages in order within each of your streams.
I haven't yet checked out how Kafka 0.9 Producer works, but there were some changes, so you should probably look into those if you want to use the latest version.
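The producer-side hashing mentioned above can be illustrated in Python. This sketch is not Kafka client code: `partition_for` and `route` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever hash the real partitioner applies to the message key (Kafka's Java client uses murmur2 by default).

```python
import zlib


def partition_for(stream_id, num_partitions):
    """Map a stream id to a partition; the same id always maps the same way."""
    return zlib.crc32(stream_id.encode("utf-8")) % num_partitions


def route(chunks, num_partitions):
    """Distribute (stream_id, chunk) pairs into per-partition lists."""
    partitions = [[] for _ in range(num_partitions)]
    for stream_id, chunk in chunks:
        partitions[partition_for(stream_id, num_partitions)].append((stream_id, chunk))
    return partitions
```

Because all chunks of one stream land in the same partition and each partition is read by a single thread, chunks of a stream are processed in the order they were produced.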
Why don't you push the next chunk to the worker only after receiving the delivery acknowledgement for the previous chunk? Or use some kind of flag indicating that the previous chunk has been processed by the worker: once the flag is set to true, push the next chunk.
If you need to parallelize the work, create several queues with unique routing keys and push the chunks to the respective queues based on those keys, with a separate flag for every routing key.
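The "one flag per routing key" idea could look roughly like this in Python. It is a sketch under assumptions, not RabbitMQ client code: `OrderedDispatcher` is a hypothetical name, `send` stands in for publishing to a worker queue, and `on_ack` is assumed to be called when the worker confirms it has finished a chunk.

```python
from collections import defaultdict, deque


class OrderedDispatcher:
    """Sends at most one unacknowledged chunk per stream at a time."""

    def __init__(self, send):
        self._send = send                   # callable(stream_id, chunk)
        self._waiting = defaultdict(deque)  # chunks held back, per stream
        self._in_flight = set()             # streams with an unacked chunk

    def submit(self, stream_id, chunk):
        if stream_id in self._in_flight:
            # Previous chunk of this stream is not done yet: hold this one.
            self._waiting[stream_id].append(chunk)
        else:
            self._in_flight.add(stream_id)
            self._send(stream_id, chunk)

    def on_ack(self, stream_id):
        if self._waiting[stream_id]:
            # Release the next held chunk of the same stream.
            self._send(stream_id, self._waiting[stream_id].popleft())
        else:
            self._in_flight.discard(stream_id)
```

Different streams proceed independently, so any free worker can take the next released chunk, while chunks within one stream never overlap.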