Google Cloud Storage Notification with Pub/Sub and docs

In the docs about GCP Storage and Pub/Sub notifications I found this sentence, which is not really clear to me:
Cloud Pub/Sub also offers at-least-once delivery to the recipient [that's pretty clear],
which means that you could receive multiple messages, with multiple
IDs, that represent the same Cloud Storage event [why?]
Can anyone give a better explanation of this behavior?
Thanks!

Google Cloud Storage uses at-least-once delivery to deliver your notifications to Cloud Pub/Sub. In other words, GCS will publish at least one message into Cloud Pub/Sub for each event that occurs.
Next, a Cloud Pub/Sub subscription will deliver the message to you, the end user, at least once.
So, say that in some rare case, GCS publishes two messages about the same event to Cloud Pub/Sub. Now that one GCS event has two Pub/Sub message IDs. Next, to make it even more unlikely, Pub/Sub delivers each of those messages twice. Now you have received 4 messages, with 2 message IDs, about the same single GCS event.
The important takeaway of the warning is that you should not attempt to dedupe GCS events by Pub/Sub message ID.
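For example, here is a minimal Python sketch (using the google-cloud-pubsub client) of a subscriber that dedupes on the Cloud Storage event attributes rather than the Pub/Sub message ID. The project/subscription names are placeholders, and the in-memory set stands in for whatever durable store you would really use:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "gcs-events-sub")

seen_events = set()  # in production, use a durable store instead


def handle_event(attributes, data):
    print("processing", dict(attributes))  # your real processing goes here


def callback(message):
    attrs = message.attributes
    # Key on the GCS event itself, NOT on message.message_id, which can differ
    # between duplicate deliveries of the same event.
    event_key = (attrs.get("bucketId"), attrs.get("objectId"),
                 attrs.get("objectGeneration"), attrs.get("eventType"))
    if event_key not in seen_events:
        seen_events.add(event_key)
        handle_event(attrs, message.data)
    message.ack()  # ack duplicates too, so they stop being redelivered


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block the main thread and keep pulling
```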

At-least-once delivery means that the sender must receive confirmation (an acknowledgement) from the recipient before it considers a message delivered, so it uses a timeout to decide when to re-send. Due to network latency or packet loss, the recipient may send a confirmation that the sender does not receive before the timeout expires, in which case the sender sends the message again and the recipient ends up with a duplicate.
This is a common problem in network communications and distributed systems, and there are different types of messaging to address this issue.

To answer the question of 'why'
'At least once' delivery just means messages will be retried via some retry mechanism until successfully delivered (i.e. acknowledged). So if there's a failure or timeout then there's a retry.
By its very nature (a retry mechanism) this means you might occasionally have duplicates, i.e. more-than-once delivery. It's the same whether it's PubSub or GCS notifications delivering the message.
In the scenario you quote, you have:
The publisher (the GCS notification) -- may send duplicates of GCS events to the PubSub topic
The PubSub topic messages -- may contain duplicates from the publisher
no deduplication as messages come in
all messages are assigned a unique PubSub message_id, even if they are duplicates of the same GCS event notification
The PubSub topic subscription(s) -- may also send duplicates of messages to subscribers
With PubSub
Once a message is sent to a subscriber, the subscriber must either acknowledge or drop the message. A message is considered outstanding once it has been sent out for delivery and before a subscriber acknowledges it.
A subscriber has a configurable, limited amount of time, or ackDeadline, to acknowledge the message. Once the deadline has passed, an outstanding message becomes unacknowledged.
Cloud Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged or that is not outstanding.
Source: https://cloud.google.com/pubsub/docs/subscriber#at-least-once-delivery
With Google Cloud Storage
GCS needs to do something similar internally to 'publish' the notification event to PubSub, so the reason is essentially the same.
Why this matters
You need to expect occasional duplicates originating from GCS notifications as well as from the PubSub subscriptions
The PubSub message id can be used to detect duplicates from the PubSub topic -> subscriber
You have to figure out your own idempotent id/token to handle duplicates from the 'publisher' (the GCS notification event)
generation, metageneration, etc. from the resource representation might help
If you need to de-duplicate or achieve exactly once processing, you can then build your own solution utilising the idempotent ids/tokens or see if Cloud Dataflow can accommodate your needs.
You can achieve exactly once processing of Cloud Pub/Sub message streams using Cloud Dataflow PubsubIO. PubsubIO de-duplicates messages on custom message identifiers or those assigned by Cloud Pub/Sub.
Source: https://cloud.google.com/pubsub/docs/faq#duplicates
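For the Dataflow route, here is a minimal Beam (Python) sketch of that PubsubIO deduplication, assuming a publisher you control stamps every message with a unique "event_id" attribute (the subscription name and attribute name are hypothetical, and attribute-based dedup is honoured by the Dataflow runner rather than the direct runner):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    events = (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/gcs-events-sub",
            id_label="event_id",     # Dataflow de-duplicates on this attribute
            with_attributes=True,
        )
        | "Process" >> beam.Map(lambda msg: msg.data)
    )
```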
If you are interested in a more fundamental exploration of the 'why', see:
There is No Now - Problems with simultaneity in distributed systems

Related

Message Delivery Guarantee for Multiple Consumers in Pub/Sub and Messaging Queues

Requirement
A system undergoes some state change, and multiple other parts of the system (let's call them observers) have to know about it so that they can perform some actions based on the current state. The actions of the observers are important: if some of the observers are not online (not listening currently due to some trouble, but will be back soon), the message should not be discarded until all the observers have received it.
Trying to accomplish this with the pub/sub model, here are my findings (please correct me if this understanding is wrong):
The publisher creates an event on a specific topic, and multiple subscribers can consume the same message. This model either provides no delivery guarantee (in Redis), or delivery is guaranteed only once (with messaging queues), i.e. when one of the consumers acknowledges a message, the message is discarded (RabbitMQ).
Example
A new Person Profile entity gets created in DB
Now,
A background verification service has to know this to trigger the verification process.
Subscriptions service has to know this to add default subscriptions to the user.
Now both the tasks are important, unrelated and can run in parallel.
Now, in the queue model, if the subscription service is down for some reason and the background verification process acknowledges the message, the message will be removed from the queue; and if it is fire-and-forget like most pub/sub, the delivery is not guaranteed for either service anyway.
One more point: both tasks are unrelated and need not be triggered one after the other.
In short, my need is to make sure all the consumers get the same message and can acknowledge it individually; the message should be evicted only after all the consumers have acknowledged it. Neither of the above approaches does this.
Anything I am missing here ? How should I approach this problem ?
This scenario is explicitly supported by RabbitMQ's model, which separates "exchanges" from "queues":
A publisher always sends a message to an "exchange", which is just a stateless routing address; it doesn't need to know what queue(s) the message should end up in
A consumer always reads messages from a "queue", which contains its own copy of messages, regardless of where they originated
Multiple consumers can subscribe to the same queue, and each message will be delivered to exactly one consumer
Crucially, an exchange can route the same message to multiple queues, and each will receive a copy of the message
The key thing to understand here is that while we talk about consumers "subscribing" to a queue, the "subscription" part of a "pub-sub" setup is actually the routing from the exchange to the queue.
So a RabbitMQ pub-sub system might look like this:
A new Person Profile entity gets created in DB
This event is published as a message to an "events" topic exchange with a routing key of "entity.profile.created"
The exchange routes copies of the message to multiple queues:
A "verification_service" queue has been bound to this exchange to receive a copy of all messages matching "entity.profile.#"
A "subscription_setup_service" queue has been bound to this exchange to receive a copy of all messages matching "entity.profile.created"
The consuming scripts don't know anything about this routing; they just know that messages will appear in the queue for events that are relevant to them:
The verification service picks up the copy of the message on the "verification_service" queue, processes, and acknowledges it
The subscription setup service picks up the copy of the message on the "subscription_setup_service" queue, processes, and acknowledges it
If there are multiple consuming scripts looking at the same queue, they'll share the messages on that queue between them, but still completely independently of any other queue.
This scenario can also be explored with an interactive visualisation tool.
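A minimal pika sketch of that topology, using the hypothetical exchange, queue, and routing-key names from the walkthrough above:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Stateless routing address: a topic exchange named "events".
channel.exchange_declare(exchange="events", exchange_type="topic", durable=True)

# Each service owns its own queue; each queue gets its own copy of matching messages.
channel.queue_declare(queue="verification_service", durable=True)
channel.queue_bind(queue="verification_service", exchange="events",
                   routing_key="entity.profile.#")

channel.queue_declare(queue="subscription_setup_service", durable=True)
channel.queue_bind(queue="subscription_setup_service", exchange="events",
                   routing_key="entity.profile.created")

# The publisher only knows the exchange and routing key, never the queues.
channel.basic_publish(exchange="events", routing_key="entity.profile.created",
                      body=b'{"profile_id": 42}')
```

Each consuming service then calls basic_consume on its own queue and acks its copy independently, so one service being down never affects the other's copy.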
As you mentioned, it is not something that you can control with the Redis Pub/Sub data structure.
But you can do it easily with Redis Streams.
Streams allow you to post messages using the XADD command and then control which consumers are dealing with a message and acknowledge that the message has been processed (see the sketch after the links below).
You can look at these sample applications that provide examples (in Java) of:
posting and consuming messages
creating multiple consumer groups
managing exceptions
Links:
Getting Started with Redis Streams and Java
Redis Streams in Action (a project that shows how to use ADD/ACK/PENDING/CLAIM and build an error-proof streaming application with Redis Streams and Spring Data)
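For a rough idea of the same add/read/ack flow in Python with redis-py (the stream, group, and consumer names below are made up):

```python
import redis

r = redis.Redis()

# Producer: append an event to the stream.
r.xadd("profile-events", {"type": "profile.created", "profile_id": "42"})

# One consumer group per observer service; each group sees every message.
for group in ("verification", "subscriptions"):
    try:
        r.xgroup_create("profile-events", group, id="0", mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # group already exists


def process(fields):
    print("handling", fields)  # your real handler


# A consumer in the "verification" group reads and acknowledges independently
# of the "subscriptions" group.
entries = r.xreadgroup("verification", "worker-1", {"profile-events": ">"},
                       count=10, block=5000)
for stream_name, messages in entries:
    for message_id, fields in messages:
        process(fields)
        r.xack("profile-events", "verification", message_id)
```

Unacknowledged entries stay in a group's pending list, so an observer that was offline can pick them up when it comes back.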

Azure Service Bus, AWS SNS, RabbitMQ -> All subscribers get the message?

While looking at the Pub/Sub pattern, I came across the following scenario:
Assume that you have a horizontally scaled app that has X instances. All of them subscribe to a topic that carries messages like "Transfer $10 from account A to account B". When someone publishes a message to that topic, will all subscribers get that message?
In the case above, clearly, the message should be taken by only 1 subscriber and handled only once.
How does one handle this scenario? Do you abandon pub/sub and start polling?
Let me explain a few things with an example so that you understand this completely. I have worked with Azure Service Bus, so I will explain in that context.
In pub/sub you have one topic and possibly multiple subscriptions. Let's say we have a topic called "Shopping-Topic" and 2 subscriptions called "Payment-Subscription" and "Cart-Subscription". Now we publish the message "Payment-processed" on the topic. It is at the discretion of each subscription whether it picks up that message, because a subscription has to specify which messages it wants to pick up.
In Azure Service Bus we have something called a rule (message label). The default rule is that a subscription listens to all messages, but we can override this behavior and say "I am only interested in a particular message". In the case above, a rule is added against "Payment-Subscription" to listen for the message "Payment-processed", so the message is added to the "Payment-Subscription" subscription for it to process. Even though "Cart-Subscription" is also subscribed to the same topic, it ignores this message, so it is not added to its subscription. This way any intended subscription can listen for particular messages, not necessarily all of them.
Now let's discuss an individual subscription. Say we have a message added to "Payment-Subscription", and this subscription has 2 instances/processes that are ready to process the message "Payment-processed". The first process to pick up the message will process it and remove it from the subscription.
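A hedged sketch of that rule/label setup with the azure-servicebus (v7) Python SDK, using the example names above; the connection string and rule name are placeholders:

```python
from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.servicebus.management import (
    CorrelationRuleFilter,
    ServiceBusAdministrationClient,
)

conn_str = "<namespace connection string>"  # placeholder

# Make "Payment-Subscription" listen only for "Payment-processed" messages:
# drop the catch-all default rule, then add a filter on the message label.
admin = ServiceBusAdministrationClient.from_connection_string(conn_str)
admin.delete_rule("Shopping-Topic", "Payment-Subscription", "$Default")
admin.create_rule(
    "Shopping-Topic", "Payment-Subscription", "payment-processed-only",
    filter=CorrelationRuleFilter(label="Payment-processed"),
)

# The publisher labels the message; only subscriptions whose rules match get a copy.
with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_topic_sender("Shopping-Topic") as sender:
        sender.send_messages(
            ServiceBusMessage("order 42 paid", subject="Payment-processed"))
```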
In RabbitMQ, normally, active consumers connected to the same queue receive messages from it in a round-robin fashion, so this ensures that each message is processed by only one consumer.
So in your case you should design a queue to which all the messages like
"Transfer $10 from account A to account B"
are routed, and have all the consumers register themselves on this one queue; this ensures that one message will go to only one subscriber.
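In pika terms that looks roughly like the sketch below (the queue name is hypothetical); every app instance runs the same consumer against the same queue, so each transfer message is delivered to only one of them:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="transfers", durable=True)

# Fair dispatch: give each worker at most one unacknowledged message at a time.
channel.basic_qos(prefetch_count=1)


def handle_transfer(ch, method, properties, body):
    print("processing", body)  # e.g. "Transfer $10 from account A to account B"
    ch.basic_ack(delivery_tag=method.delivery_tag)


# Run this same script on every instance: RabbitMQ round-robins the queue's
# messages across the connected consumers.
channel.basic_consume(queue="transfers", on_message_callback=handle_transfer)
channel.start_consuming()
```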
Another point, not related to your question but important to know, is that there is another concept called "Consumer Priorities", which allows you to ensure that high-priority consumers receive messages while they are active, with messages only going to lower-priority consumers when the high-priority consumers block.
More info can be found here

How would I use Redis + Azure Event Hubs to handle mobile push notification archiving for billions of topics?

I need to design a system that allows
Users to subscribe to any topic
No defined topic limit
Control over sending to one device, or all
Recovery when an offline client (or APNS) drops a notification; provide a way to catch up via REST
Discard all updates older than age T.
I studied many different solutions, such as Notification Hubs, Service Bus, Event Hub... and have now discovered Kafka, and I am not sure if that's a good fit.
Draft architecture
Use an Event Hub to listen for mobile deviceID registrations and for userID requests for topic subscriptions. Pass that to Redis, below.
If registering a phone/subscribing to a topic, save the deviceID and userID to the topic key.
If sending a message to a topic, query Redis for the topic key, and send that result to a FIFO queue for processing (see the sketch below).
Pipe the output of the previous query into the built-in Redis Pub/Sub features to alert worker roles that there is work pending.
While the workers send notices to Apple and Firebase, archive the sent notices to some in-memory store below.
An archive server maintains a history of sent events, so that out-of-sync devices can get the most up-to-date information, LIFO-queue style.
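In Redis terms, I imagine the subscribe/send bookkeeping looking roughly like this (key names are just placeholders I made up):

```python
import redis

r = redis.Redis()


# Registering a phone / subscribing to a topic: one Redis set per topic key.
def subscribe(topic: str, user_id: str, device_id: str) -> None:
    r.sadd(f"topic:{topic}", f"{user_id}:{device_id}")


# Sending to a topic: query the topic key, push work onto a FIFO queue
# (workers BRPOP from the other end), then alert workers via Redis Pub/Sub.
def send_to_topic(topic: str, payload: str) -> None:
    for member in r.smembers(f"topic:{topic}"):
        r.lpush("push:pending", f"{member.decode()}|{payload}")
    r.publish("push:wakeup", topic)
```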
Question
What are your thoughts on using this approach to solve the above needs?
What other things should I learn, research, or experiment (measure)?

Read all messages from the very beginning

Consider a group chat scenario where 4 clients connect to a topic on an exchange. These clients each send and receive messages to and from this topic.
Now imagine that a 5th client comes in and wants to read everything that was sent from the beginning of time (as in, since the topic was first created and connected to).
Is there a built-in functionality in RabbitMQ to support this?
Many thanks,
Edit:
For clarification, what I'm really asking is whether or not RabbitMQ supports SOW, since I was unable to find it in the documentation anywhere (http://devnull.crankuptheamps.com/documentation/html/develop/configuration/html/chapters/sow.html).
Specifically, the question is: is there a way for RabbitMQ to output all messages having been sent to a topic upon a new subscriber joining?
The short answer is no.
The long answer is maybe. If all potential "participants" are known up-front, the participant queues can be set up and configured in advance, subscribed to the topic, and will collect all messages published to the topic (matching the routing key) while the server is running. Additional server configurations can yield queues that persist across server reboots.
Note that the original question/feature request as-described is inconsistent with RabbitMQ's architecture. RabbitMQ is supposed to be a transient storage node, where clients connect and disconnect at random. Messages dumped into queues are intended to be processed by only one message consumer, and once processed, the message broker's job is to forget about the message.
One other way of implementing such a functionality is to have an audit queue, where all published messages are distributed to the queue, and a writer service writes them all to an audit log somewhere (usually in a persistent data store or text file). This would be something you would have to build, as there is currently no plug-in to automatically send messages out to a persistent storage (e.g. Couchbase, Elasticsearch).
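A rough pika sketch of such an audit consumer (exchange, queue, and file names are placeholders): it binds its own queue with a "#" wildcard so it gets a copy of every message, and appends each one to a log before acking:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="events", exchange_type="topic", durable=True)
channel.queue_declare(queue="audit", durable=True)
# "#" matches every routing key, so this queue receives a copy of all messages.
channel.queue_bind(queue="audit", exchange="events", routing_key="#")


def archive(ch, method, properties, body):
    with open("audit.log", "ab") as log:
        log.write(method.routing_key.encode() + b" " + body + b"\n")
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="audit", on_message_callback=archive)
channel.start_consuming()
```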
Alternatively, if used as a debug tool, there is the Firehose plug-in. This is satisfactory when you are able to manually enable/disable it, but is not a good long-term solution as it will turn itself off upon any interruption of the broker.
What you would like to do is not a correct usage of RabbitMQ. Message queues are not databases. They are not long-term persistence solutions, like an RDBMS is. You can mainly use RabbitMQ as a buffer for processing incoming messages, which, after the consumer handles them, get inserted into the database. When a new client connects to your service, the database will be read, not the message queue.
Relevant
Also, unless you are building a really big, highly scalable system, I doubt you actually need RabbitMQ.
Apache Kafka is the right solution for this use case. "Log compaction enabled topics", a.k.a. compacted topics, are specifically designed for this use case. But the catch is that your messages obviously have to be idempotent, with strictly no delta business, because Kafka will compact from time to time and may retain only the last message for a "key".
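A rough kafka-python sketch of that setup (topic name and broker address are placeholders): the topic is created with a compaction policy, messages are keyed and carry full state rather than deltas, and a client joining later reads from the earliest retained offset:

```python
from kafka import KafkaConsumer, KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# Create a log-compacted topic: Kafka keeps at least the latest record per key.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(
    name="chat-room-42",
    num_partitions=1,
    replication_factor=1,
    topic_configs={"cleanup.policy": "compact"},
)])

# Keyed, idempotent full-state messages (no deltas), because compaction may
# retain only the last message for a given key.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("chat-room-42", key=b"user-7",
              value=b'{"user": "user-7", "last_message": "hello"}')
producer.flush()

# A 5th client joining later replays whatever is still retained, from the start.
consumer = KafkaConsumer("chat-room-42",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.key, record.value)
```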

Is there a way to do hourly batched writes from Google Cloud Pub/Sub into Google Cloud Storage?

I want to store IoT event data in Google Cloud Storage, which will be used as my data lake. But doing a PUT call for every event is too costly, so I want to append events to a file and then do one PUT call per hour. What is a way of doing this without losing data in case a node in my message processing service goes down?
Because if my processing service ACKs the message, the message will no longer be in Google Pub/Sub but will not yet be in Google Cloud Storage, and if that processing node goes down at that moment, I would have lost the data.
My desired usage is similar to this post that talks about using AWS Kinesis Firehose to batch messages before PUTting them into S3, but even Kinesis Firehose's max batch interval is only 900 seconds (or 128 MB):
https://aws.amazon.com/blogs/big-data/persist-streaming-data-to-amazon-s3-using-amazon-kinesis-firehose-and-aws-lambda/
If you want to continuously receive messages from your subscription, then you would need to hold off acking the messages until you have successfully written them to Google Cloud Storage. The latest client libraries in Google Cloud Pub/Sub will automatically extend the ack deadline of messages for you in the background if you haven't acked them.
Alternatively, what if you just start your subscriber every hour for some portion of time? Every hour, you could start up your subscriber, receive messages, batch them together, do a single write to Cloud Storage, and ack all of the messages. To determine when to stop your subscriber for the current batch, you could either keep it up for a certain length of time or you could monitor the num_undelivered_messages attribute via Stackdriver to determine when you have consumed most of the outstanding messages.
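To make the hourly variant concrete, here is a hedged sketch using synchronous pull (the project, subscription, and bucket names are placeholders): pull a batch, write it as a single Cloud Storage object, and only then ack, so a crash before the write just means Pub/Sub redelivers the messages:

```python
import datetime

from google.cloud import pubsub_v1, storage

PROJECT = "my-project"            # placeholders
SUBSCRIPTION = "iot-events-sub"
BUCKET = "my-data-lake"

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)


def drain_batch(max_messages: int = 1000) -> None:
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": max_messages})
    if not response.received_messages:
        return

    # Write the whole batch as ONE object BEFORE acking anything.
    payload = b"\n".join(m.message.data for m in response.received_messages)
    blob_name = f"events/{datetime.datetime.utcnow():%Y%m%dT%H%M%S}.jsonl"
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(payload)

    # Only now is it safe to ack; a crash before this point means Pub/Sub
    # simply redelivers the messages later.
    subscriber.acknowledge(request={
        "subscription": sub_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    })
```

Calling drain_batch() in a loop from an hourly scheduler until num_undelivered_messages drops to near zero covers the second approach described above.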