Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will serve as my data lake. But doing a PUT call for every event is too costly, so I want to append events to a file and then do one PUT call per hour. How can I do this without losing data if a node in my message processing service goes down?
If my processing service ACKs a message, that message is no longer in Google Pub/Sub but is not yet in Google Cloud Storage; if the processing node goes down at that moment, the data is lost.
My desired usage is similar to this post about using AWS Kinesis Firehose to batch messages before PUTting them into S3, but even Kinesis Firehose's maximum batch interval is only 900 seconds (or 128 MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/

If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.
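A minimal sketch of the first approach above (keep the subscriber running and hold off acking until the hourly write succeeds), assuming the Java client libraries for Pub/Sub and Cloud Storage; the project, subscription, and bucket names are placeholders, and the exact Duration type accepted by setMaxAckExtensionPeriod varies by client library version. Messages are buffered in memory and only acked after the hourly object has been written, so a crash before the write simply means Pub/Sub redelivers them.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HourlyBatcher {
  // Messages and their ack handles, kept until the batch is safely in GCS.
  private static final List<PubsubMessage> buffer = new ArrayList<>();
  private static final List<AckReplyConsumer> pendingAcks = new ArrayList<>();

  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "iot-events-sub"); // placeholder names

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      synchronized (buffer) {
        buffer.add(message);
        pendingAcks.add(consumer); // do NOT ack yet
      }
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
        // let the client library keep extending the ack deadline in the background
        .setMaxAckExtensionPeriod(org.threeten.bp.Duration.ofHours(2))
        .build();
    subscriber.startAsync().awaitRunning();

    // Once an hour: write the batch to GCS, then ack everything that was written.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      List<PubsubMessage> batch;
      List<AckReplyConsumer> acks;
      synchronized (buffer) {
        batch = new ArrayList<>(buffer);
        acks = new ArrayList<>(pendingAcks);
        buffer.clear();
        pendingAcks.clear();
      }
      if (batch.isEmpty()) return;

      StringBuilder content = new StringBuilder();
      for (PubsubMessage m : batch) {
        content.append(m.getData().toStringUtf8()).append('\n');
      }
      BlobId blobId = BlobId.of("my-data-lake-bucket", "events/" + Instant.now() + ".ndjson");
      storage.create(BlobInfo.newBuilder(blobId).build(),
          content.toString().getBytes(StandardCharsets.UTF_8));

      // Only now is it safe to ack; if the node had crashed before this point,
      // Pub/Sub would have redelivered the messages.
      acks.forEach(AckReplyConsumer::ack);
    }, 1, 1, TimeUnit.HOURS);
  }
}
```

Note that buffering an hour of messages also means raising the subscriber's flow control limits (the defaults cap the number of outstanding messages), and that redelivery after a crash can produce duplicate records in the hourly file, which is the usual trade-off of at-least-once delivery.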

Related

Kafka Streams Fault Tolerance with Offset Management in Parallel

Description :
I have a Kafka Streams application consuming from a topic.
The events arrive at high volume.
The Kafka Streams application consumes the events in a terminal operation, groups them into batches of, say, 1000 events, and writes each batch to AWS S3.
I have threads that write to S3 in parallel after consuming events from the Kafka topic.
I am not using kafka-connector-s3 due to some business application logic and processing.
Problem:
I want the application to be fault-tolerant and don't want to lose messages.
Crash scenario:
Suppose the application has 10 threads, all running and trying to put events into S3, and a crash happens. Since Kafka Streams sets enable.auto.commit = false and we cannot commit the offsets manually, all the threads have already consumed messages from the Kafka topic.
In this case, Kafka Streams may have already committed the offsets after reading even though the events were never written to S3.
I need a mechanism to know the last offset up to which events were successfully written to the S3 file.
In crash scenarios, how should I deal with this, and how do I manage the Kafka offsets in Kafka Streams given that I am using, say, 10 threads? What if some threads fail to write to S3 while others succeed? How do I track which offsets were successfully processed to S3 and in what order?
Let me know if my problem statement is unclear.
Thanks!
I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams automatically commits at more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you get at-least-once guarantees and do not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval has elapsed -- it commits the offsets.
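To make that concrete, here is a minimal sketch (placeholder topic and bucket names, AWS SDK v2, one object per record instead of your batching) of what a blocking terminal operation could look like: because foreach() does not return until the S3 put succeeds, Kafka Streams only commits offsets for records that are already in S3.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.util.Properties;
import java.util.UUID;

public class BlockingS3Sink {
  public static void main(String[] args) {
    S3Client s3 = S3Client.create();

    StreamsBuilder builder = new StreamsBuilder();
    builder.<String, String>stream("events")              // source topic (placeholder)
        .foreach((key, value) -> {
          // Synchronous call: this thread blocks until S3 confirms the write.
          s3.putObject(PutObjectRequest.builder()
                  .bucket("my-bucket")                    // placeholder bucket
                  .key("events/" + UUID.randomUUID())
                  .build(),
              RequestBody.fromString(value));
          // Only after this returns does the record count as "processed",
          // so its offset becomes eligible for the next commit.
        });

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "s3-sink-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}
```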
You say
I am not using kafka-connector-s3 due to some business application logic and processing.
Could you put the business application logic in the Kafka Streams application, write the results to a Kafka topic with operator to(), and then use the kafka-connector-s3 to send the messages in that topic to S3?
I am not a Kafka Connect expert, but I guess that would make sure that messages are not lost, and it would make your implementation simpler.
Using Kafka Streams, you could aggregate 5000 messages from the source topic into one big message and send the big one to another topic, e.g. middle_topic. You then need another processor that sources from middle_topic and sinks to S3 using the s3-connector.
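A minimal sketch of the idea shared by the two suggestions above, assuming hypothetical topic names and a placeholder mapValues() standing in for the business logic: the Streams application only transforms and forwards records to an intermediate topic via to(), and the S3 sink connector (running separately in Kafka Connect) takes care of the S3 writes and offset management.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ToIntermediateTopic {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    builder.<String, String>stream("events")           // source topic (placeholder)
        .mapValues(ToIntermediateTopic::applyBusinessLogic) // your processing stays here
        .to("events-processed");                        // intermediate topic read by the S3 connector

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "business-logic-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }

  private static String applyBusinessLogic(String value) {
    return value; // placeholder for the actual business logic
  }
}
```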

Pushing messages to new subscribers

I am creating a bulk video processing system using spring-boot. The user provides all the video-related information through an xlsx sheet, and we process the videos in the backend. I am using RabbitMQ for queuing up the requests.
Let's say a user has uploaded a sheet with 100 rows; then there will be 100 messages in the RabbitMQ queue. In the backend, we auto-scale the subscribers (servers): we start with one subscriber and, based on the load (the number of messages in the queue), scale up to 15 subscribers.
But our producer is very fast and allocates all the messages to our first subscriber (before the other subscribers come up), so our new subscribers do not get any messages from the queue.
If all the subscribers are available before the producer starts pushing messages, the messages are distributed across all servers.
How can our new subscribers pull messages from the queue that were produced earlier?
You are probably being affected by the listener container prefetchCount property - it defaults to 250 with recent versions (since 2.0).
So the first consumer will get up to 250 messages when it starts.
It sounds like you should reduce it to a small number, even all the way down to 1 so only one message is outstanding at each consumer.
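With Spring Boot this can be a one-line property (spring.rabbitmq.listener.simple.prefetch=1) or, as a rough sketch assuming Spring AMQP's SimpleRabbitListenerContainerFactory, an explicit container factory bean:

```java
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitConfig {

  @Bean
  public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory(
      ConnectionFactory connectionFactory) {
    SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
    factory.setConnectionFactory(connectionFactory);
    // Only one unacknowledged message per consumer, so a fast producer cannot
    // hand the whole backlog to the first subscriber that comes up.
    factory.setPrefetchCount(1);
    return factory;
  }
}
```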

Handling of pubsub subscribers for distributed longrunning tasks

I am evaluating the use of using pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2-10 minutes. Is pubsub a good approach for such a task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?
This does sound like a reasonable use case for Pub/Sub. Specifically, if you use a pull subscriber, you can configure flow control settings to have at most one outstanding message per server, and configure the max ack extension period (in Java) to be a reasonable upper bound of your processing time. The API is described here: http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if all of them pull from the same subscription. If a server is added and a backlog exists, it will receive new entries. If a server is removed, it will no longer be sent messages. If it is removed while processing, or crashes, the message it was working on will be resent to another server.
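A minimal sketch of those settings with the Java client (placeholder project and subscription names; the Duration type accepted by setMaxAckExtensionPeriod varies by client library version): each worker takes at most one message at a time, acks only after the transcode succeeds, and nacks on failure so the job is redelivered to another worker.

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class TranscodeWorker {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "videos-sub"); // placeholders

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      try {
        transcode(message.getData().toStringUtf8()); // long-running work (2-10 minutes)
        consumer.ack();                              // ack only after success
      } catch (Exception e) {
        consumer.nack();                             // redeliver to another worker
      }
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(FlowControlSettings.newBuilder()
            .setMaxOutstandingElementCount(1L)       // one job at a time per server
            .build())
        .setMaxAckExtensionPeriod(org.threeten.bp.Duration.ofMinutes(15))
        .build();
    subscriber.startAsync().awaitRunning();
  }

  private static void transcode(String videoLocation) {
    // placeholder for the actual transcoding work
  }
}
```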
One concern however is that pubsub has a limit of 10MB per message. You might consider instead putting the data itself in a google cloud storage bucket. Cloud storage can publish the file location to a pubsub topic when an upload is complete. https://cloud.google.com/storage/docs/pubsub-notifications

Google Cloud Storage Notification with Pub/Sub and docs

In the docs about GCP Storage and Pub/Sub notification I find this sentence that is not really clear:
Cloud Pub/Sub also offers at-least-once delivery to the recipient [that's pretty clear],
which means that you could receive multiple messages, with multiple
IDs, that represent the same Cloud Storage event [why?]
Can anyone give a better explanation of this behavior?
Thanks!
Google Cloud Storage uses at-least-once delivery to deliver your notifications to Cloud Pub/Sub. In other words, GCS will publish at least one message into Cloud Pub/Sub for each event that occurs.
Next, a Cloud Pub/Sub subscription will deliver the message to you, the end user, at least once.
So, say that in some rare case, GCS publishes two messages about the same event to Cloud Pub/Sub. Now that one GCS event has two Pub/Sub message IDs. Next, to make it even more unlikely, Pub/Sub delivers each of those messages twice. Now you have received 4 messages, with 2 message IDs, about the same single GCS event.
The important takeaway of the warning is that you should not attempt to dedupe GCS events by Pub/Sub message ID.
An at-least-once delivery means that the service must receive confirmation from the recipient to ensure that the message was received. In this case, we need some sort of timeout period in order to re-send the message. It is possible, due to network latency or packet loss, etc, to have the recipient send a confirmation, but the sender to not receive the confirmation before the timeout period, and therefore the sender will send the message again.
This is a common problem in network communications and distributed systems, and there are different types of messaging to address this issue.
To answer the question of 'why'
'At least once' delivery just means messages will be retried via some retry mechanism until successfully delivered (i.e. acknowledged). So if there's a failure or timeout then there's a retry.
By its essence (a retrying mechanism), this means you might occasionally have duplicates / more-than-once delivery. It's the same whether it's Pub/Sub or GCS notifications delivering the message.
In the scenario you quote, you have:
- The publisher (the GCS notification) may send duplicates of GCS events to the Pub/Sub topic.
- The Pub/Sub topic messages may contain duplicates from the publisher:
  - no deduplication as messages come in
  - all messages are assigned a unique Pub/Sub message_id, even if they are duplicates of the same GCS event notification
- The Pub/Sub topic subscription(s) may also send duplicates of messages to subscribers.
With PubSub
Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.
A subscriber has a configurable, limited amount of time, or ackDeadline, to acknowledge the message. Once the deadline has passed, an outstanding message becomes unacknowledged.
Cloud Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged or that is not outstanding.
Source: https://cloud.google.com/pubsub/docs/subscriber#at-least-once-delivery
With Google Cloud Storage
GCS needs to do something similar internally to 'publish' the notification event to Pub/Sub, so the reason is essentially the same.
Why this matters
- You need to expect occasional duplicates originating from the GCS notifications as well as from the Pub/Sub subscriptions.
- The Pub/Sub message id can be used to detect duplicates from the Pub/Sub topic -> subscriber.
- You have to figure out your own idempotent id/token to handle duplicates from the 'publisher' (the GCS notification event); the generation, metageneration, etc. from the resource representation might help.
If you need to de-duplicate or achieve exactly once processing, you can then build your own solution utilising the idempotent ids/tokens or see if Cloud Dataflow can accommodate your needs.
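A minimal sketch of the "build your own idempotent id" idea, assuming the Java Pub/Sub client and using the objectId and objectGeneration attributes that GCS notifications attach to each message as the key; the in-memory set is only for illustration, and a real deployment would use a shared store (e.g. Redis or a database) with a TTL.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.pubsub.v1.PubsubMessage;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class GcsEventDeduper implements MessageReceiver {
  private final Set<String> seenEvents = ConcurrentHashMap.newKeySet();

  @Override
  public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
    // GCS notifications carry the object name and generation as message attributes.
    String eventKey = message.getAttributesOrDefault("objectId", "")
        + "#" + message.getAttributesOrDefault("objectGeneration", "");

    if (seenEvents.add(eventKey)) {
      process(message); // first time we see this GCS event
    }
    // Duplicates (same GCS event, possibly a different Pub/Sub message id) are just acked.
    consumer.ack();
  }

  private void process(PubsubMessage message) {
    // placeholder for the actual handling of the event
  }
}
```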
You can achieve exactly once processing of Cloud Pub/Sub message streams using Cloud Dataflow PubsubIO. PubsubIO de-duplicates messages on custom message identifiers or those assigned by Cloud Pub/Sub.
Source: https://cloud.google.com/pubsub/docs/faq#duplicates
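A minimal Beam/Dataflow sketch of that FAQ suggestion, assuming a hypothetical subscription name and deduplicating on the objectGeneration attribute carried by GCS notification messages; windowing and the actual processing are omitted, and the id-based deduplication is honored by the Dataflow runner.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DedupPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline.apply("ReadNotifications",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/my-project/subscriptions/gcs-notifications-sub")
            // Messages sharing this attribute value are treated as duplicates.
            .withIdAttribute("objectGeneration"));

    // ... downstream transforms and a sink would go here ...
    pipeline.run();
  }
}
```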
If you are interested in a more fundamental exploration of the 'why', see:
There is No Now - Problems with simultaneity in distributed systems

How to view Azure storage account queue

I have a queue in my Azure storage account. I want a tool to see what is inside that queue.
I have tried 'CloudBerry Explorer for Azure Blob Storage', but that does not let me see the contents of the queue.
I also tried 'Azure Storage Explorer', but I can only see the top 32 messages of the queue.
Can I see all the messages in the queue?
And is there a tool that allows me to change the order of messages in the queue?
Azure Queue storage does not guarantee message ordering and therefore does not also provide a way to reorder messages.
There are two ways to see the contents of existing messages: the Get Messages API and the Peek Messages API. Both retrieve at most 32 messages per call, so there is no way to view more than 32 messages without first dequeuing (making invisible) the first 32 with the Get Messages API.
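A rough sketch with the Azure Storage Queue SDK for Java (v12; connection string and queue name are placeholders, and accessor names vary slightly between SDK versions), showing the per-call cap: peeking never changes visibility but returns at most 32 messages, so looking deeper into the queue means receiving (dequeuing) messages page by page, which makes them temporarily invisible.

```java
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;
import com.azure.storage.queue.models.PeekedMessageItem;

public class QueueInspector {
  public static void main(String[] args) {
    QueueClient queue = new QueueClientBuilder()
        .connectionString("<storage-account-connection-string>") // placeholder
        .queueName("my-queue")                                   // placeholder
        .buildClient();

    // Peek does not change message visibility, but is capped at 32 messages per call.
    for (PeekedMessageItem msg : queue.peekMessages(32, null, null)) {
      // getBody() in recent SDK versions; older versions expose getMessageText() instead.
      System.out.println(msg.getMessageId() + ": " + msg.getBody());
    }
  }
}
```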