Google PubSub : How to customize distribution of messages to consumers? - google-cloud-messaging

I have a scenario where we will be sending customer data to Pub/Sub and consuming it with Java subscribers. I have multiple subscribers subscribed to the same subscription. Is there a way to route all messages with the same customerID to the same subscriber?
I know Google Dataflow has session-based windowing. However, I wanted to know if we can achieve this using plain Java consumers.

Update June 2020: Filtering is now an available feature in Google Cloud Pub/Sub. When creating a subscription, one can specify a filter that looks at message attributes. If a message does not match the filter, the Pub/Sub service automatically acknowledges the message without delivering it to the subscriber.
In this case, you would need to have different subscriptions and each subscriber would consume messages from one of the subscriptions. Each subscription would have a filter set up to match the customer ID. If you know the list of customer IDs and it is short, you would set up an exact match filter for each customer ID, e.g.,
attributes.customerID = "customerID1"
If you have a lot of customer IDs and wanted to partition the set of IDs received by each subscriber, you could use the prefix operator to do so. For example, if the IDs are numbers, you could have filters such as:
hasPrefix(attributes.customerID, "0")
hasPrefix(attributes.customerID, "1")
hasPrefix(attributes.customerID, "2")
hasPrefix(attributes.customerID, "3")
...
hasPrefix(attributes.customerID, "9")
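A small sketch (Python; the subscription names are hypothetical) of how these ten prefix filters partition the ID space, so each subscriber's subscription only ever sees its own slice:

```python
# Sketch: ten hasPrefix filters, one per subscription. Each subscription
# keeps only messages whose customerID attribute starts with its digit;
# Pub/Sub acks non-matching messages without delivering them.

def matching_subscription(customer_id: str) -> str:
    """Return the (hypothetical) subscription whose prefix filter matches."""
    for digit in "0123456789":
        if customer_id.startswith(digit):  # hasPrefix(attributes.customerID, digit)
            return f"sub-prefix-{digit}"
    raise ValueError(f"no filter matches {customer_id!r}")

print(matching_subscription("42137"))  # sub-prefix-4
print(matching_subscription("0981"))   # sub-prefix-0
```

Since every ID has exactly one first digit, each message is delivered to exactly one subscriber, and all messages for a given customer always land on the same one.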
Previous answer:
At this time, Google Cloud Pub/Sub has no way to filter messages delivered to particular subscribers, no. If you know a priori the number of subscribers you have, you could do it yourself, though. You could create as many topics as you have subscribers and then bucket customer IDs into different topics, publishing messages to the right topic for each customer ID. You'd create a single subscription on each topic and each subscriber would receive messages from one of these subscriptions.
The disadvantage is that if you have any subscribers that want the data for all customer IDs, then you'll have to have an additional subscription on each topic and that subscriber will have to get messages from all of those subscriptions.
Keep in mind that you won't want to create more than 10,000 topics or else you may run up against quotas.
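The bucketing step above can be sketched like this (hash-modulo assignment is one possible scheme, not the only one; the topic names are made up):

```python
import hashlib

NUM_TOPICS = 4  # one topic per subscriber in this scheme

def topic_for_customer(customer_id: str) -> str:
    """Deterministically bucket a customer ID into one of NUM_TOPICS topics.

    A stable hash (not Python's process-randomized hash()) keeps the
    mapping consistent across publisher restarts.
    """
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_TOPICS
    return f"customer-topic-{bucket}"

# All messages for the same customer land on the same topic,
# hence on the same subscriber.
assert topic_for_customer("cust-123") == topic_for_customer("cust-123")
```

The publisher calls this before each publish to pick the topic; adding subscribers later means changing `NUM_TOPICS`, which reshuffles customers across topics, so pick the bucket count with some headroom.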

Related

Many to many filtering in Message Broker

I have a Person object in the system. When a Person does some action, there is an Administrator actor who is interested in monitoring this kind of event.
Person
{
Id: string
}
PersonAction
{
ActionType: enum
PersonId: string
}
Currently I have this subscription implemented through Service Bus topics and subscriptions: Administrators subscribe to the actions of all Persons in the system:
Azure Service Bus broker has PersonActions topic.
Every time a Person does any action, a PersonAction event is sent to the topic.
Every Administrator creates its own subscription to the topic and monitors all Persons' actions.
Now I have a new requirement that introduces grouping of Persons and I need a way to allow Administrators to subscribe to PersonActions events based on groups they want to monitor:
Persons can be part of one or more groups.
Administrators are interested in monitoring groups of Persons and, hence, receiving all PersonAction events for groups they are monitoring.
Administrators may subscribe to one or several groups.
Here are my thoughts on how to do this:
Add to PersonAction a routing property that will contain information about the groups this Person is a member of.
When an Administrator creates a new subscription he will specify the set of groups he wants to monitor, which should then somehow be used in a subscription filter to select PersonAction messages in the topic.
So, cutting to the chase, I want to leverage Service Bus topic filtering capabilities to deliver PersonAction messages specifically to the Administrators that are interested in them, based on groups.
In general this doesn't seem to be a straightforward task with Service Bus (or any other message broker) because there is a many-to-many relation: one Person can be in multiple groups and an Administrator may want to subscribe to multiple groups. Filters usually support matching on a single-valued property (like "groupId=1234"), whereas in my case it's an array.
So far I've come up with two solutions, but don't quite like either of them:
Use a LIKE SqlFilter. Concatenate all groups of the Person into a single comma-separated string (groups=1,2,5,8) and then have a filter like groups LIKE '%1%' OR groups LIKE '%5%' (in reality the group ids will be GUIDs, so don't mind the problem of one group id being a substring of another).
Add each group id as a property with an empty value and then use an EXISTS filter to check whether the event has that group id defined. The filter would be EXISTS(1) OR EXISTS(5) and the PersonAction properties: {1:null, 2:null, 5:null, 8:null}.
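What each of the two filters effectively matches can be shown with a small local simulation (Python; the function names are mine, and the LIKE case is modeled as exact token matching, which is what the GUID assumption buys you):

```python
def like_filter_matches(groups_csv: str, wanted: list[str]) -> bool:
    """Approach 1: a filter like groups LIKE '%1%' OR groups LIKE '%5%'
    over a property 'groups=1,2,5,8'. Modeled here as exact token matching,
    since GUID ids make accidental substring hits a non-issue."""
    groups = groups_csv.split(",")
    return any(g in groups for g in wanted)

def exists_filter_matches(properties: dict, wanted: list[str]) -> bool:
    """Approach 2: EXISTS(1) OR EXISTS(5) over per-group message properties."""
    return any(g in properties for g in wanted)

print(like_filter_matches("1,2,5,8", ["1", "5"]))            # True
print(exists_filter_matches({"1": None, "2": None}, ["5"]))  # False
```

Note the trade-off: approach 1 grows the filter expression with the Administrator's group list, while approach 2 grows the message properties with the Person's group list.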
Is there a better way to do such filtering and how is many-to-many filtering rule done in message brokers?
Answers describing this for Any message broker(not only ServiceBus) will be also extremely helpful.
I'm not really that familiar with other brokers but here is something that comes to mind for Azure Service Bus.
You could have 2 (3 with the bonus) levels of entities instead of 1 for such a scenario:
The first level is a topic where all the PersonAction messages come in; it would have subscriptions for each group with auto-forward set up to their own topics
The second level is where each group has its own topic and administrators would subscribe to multiple topics based on the groups they want to monitor but will have to de-duplicate messages
You could remove this level and have direct subscriptions (one per group per administrator) but would likely hit the limit of 2000 subscriptions per topic
(Bonus) Auto Forward the messages from the subscriptions into administrator queues and enable Duplicate Detection
Note that the number of operations billed would increase as mentioned in the Auto Forward Considerations section of the docs
Here is a more elaborate explanation of the same:
1. Input Topic
This is where the PersonAction messages would first come in.
This topic would have subscriptions that filter messages based on the group (either of your approaches; I'd prefer a Correlation Filter since it's more efficient) and auto-forward the messages into the respective topics.
2. Topic per Group
This is where the PersonAction messages filtered by group go into.
At this point, there would be copies of the same message in different topics, one for each group the Person is part of.
Administrators would create subscriptions to the topics required depending on the groups they want to monitor but will have to handle the duplicate messages that they could potentially receive.
3. (Bonus) Administrator Queues
The subscriptions created by administrators could be set up to auto-forward messages into their personal queues, and these queues could have duplicate detection enabled, allowing the administrators to process the messages as-is without worrying about duplicates.
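The duplicate-detection step boils down to remembering recently seen message ids within a window; Service Bus does this broker-side keyed on MessageId, but the idea can be sketched locally (Python, hypothetical names):

```python
from collections import OrderedDict

class DeduplicatingQueue:
    """Drop messages whose id was already seen within a sliding window,
    mimicking what duplicate detection does broker-side (keyed on MessageId)."""

    def __init__(self, window: int = 1000):
        self.window = window
        self.seen = OrderedDict()   # insertion-ordered set of recent ids
        self.messages = []

    def enqueue(self, message: dict) -> bool:
        msg_id = message["id"]
        if msg_id in self.seen:
            return False                   # duplicate: silently dropped
        self.seen[msg_id] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # evict the oldest remembered id
        self.messages.append(message)
        return True

q = DeduplicatingQueue()
q.enqueue({"id": "m1", "group": "g1"})
accepted = q.enqueue({"id": "m1", "group": "g2"})  # same PersonAction via a 2nd group
print(accepted)  # False
```

This is why the fan-out through per-group topics is safe: the copies of one PersonAction share an id, so the administrator's queue keeps only the first arrival.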

ActiveMQ new topic, without consumer, doesn't discard messages

I'm building a software solution which creates JMS topics for each new category of something. The topic is created when the first round of data is integrated and must be communicated.
Durable subscriptions to that topic are created by consumers, but only some time after the category and first data are created. All the data belonging to the category is sent as messages to the consumers, so that they are updated too.
Between the moment when the category is created and when the durable subscriptions are created, it would be better if the messages were discarded. The consumer first does an initial sync of the existing data, then creates the durable subscription and listens for create/update messages.
One option would be to let the consumers create the topic when registering the first durable subscription. In the meantime, if data is added to the category, it is not sent by the producers, thus not creating the topic either.
Another option would be to discard the messages if no consumers exist. I'm not talking about active consumers, I'm talking about no consumers at all. Any idea if this can be implemented? Since there are no durable/non-durable subscriptions for the topic, I was expecting that the messages would be discarded automatically, but I was wrong.
Which option would you choose?
If you look at the image below you will see a topic which never had subscribers with 4498 messages enqueued. Am I interpreting this information in a wrong manner?
Messages sent to a topic when no subscriptions exist (whether durable or not) should be discarded. That's the expected behavior.
The "Messages Enqueued" metric visible on the web console does not mean what you think it means. This metric simply indicates the total number of messages sent to the topic since the last restart. It doesn't indicate how many messages have been retained in subscriptions on that topic (if any).

Multiple subscriptions to a topic

I have been using pubsub for a bit of asynchronous work, and was wondering why someone may create multiple subscriptions for a single topic. My default values are as follows:
project_id = 'project'
topic_name = 'app'
subscription_name = 'general'
The routing of the actual function -- and how to process it -- is being done in the subscriber receiver itself.
What would be reasons why there would be various subscription names? The only thing I can think of is to spread items across multiple servers for processing, such as:
server1 -- `main-1`
server2 -- `main-2`
etc.
Are there any other reasons why a single subscription would not suffice?
In general, there are two paradigms for having multiple subscribers:
Load balancing: The goal is to parallelize the processing of the load by having multiple subscribers using the same subscription. In this scenario, every subscriber receives a subset of the messages. One can horizontally scale processing by creating more subscribers for the same subscription.
Fan out: The goal is to have multiple subscribers receive the entire feed of messages. This is accomplished by having multiple subscriptions. The reason to have fan out is if there are multiple downstream applications interested in the full feed of messages. Imagine there is a feed where the messages are user events on a shopping website. Perhaps one application backs up the data to files, another analyzes the feed for trends in what people are looking at, and another looks through activity to try to find potentially fraudulent transactions. In this scenario, every one of those applications acting as a subscriber needs the full feed of messages, which requires separate subscriptions.
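The two paradigms can be contrasted with a toy model (Python; the subscription and application names are made up): one subscription shared by several subscribers versus one subscription per application:

```python
from itertools import cycle

messages = [f"event-{i}" for i in range(6)]

# Load balancing: one subscription, three subscribers. Each message goes
# to exactly one subscriber (round-robin here, for simplicity; real
# Pub/Sub delivery is not strictly round-robin).
workers = {"worker-a": [], "worker-b": [], "worker-c": []}
for msg, worker in zip(messages, cycle(workers)):
    workers[worker].append(msg)

# Fan out: three subscriptions, so each application gets the full feed.
apps = {name: list(messages) for name in ("backup", "trends", "fraud")}

assert sum(len(v) for v in workers.values()) == len(messages)  # split once
assert all(len(v) == len(messages) for v in apps.values())     # full copies
```

The assertions capture the difference: load balancing splits the feed (6 messages total across workers), while fan-out duplicates it (6 messages per application).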

Select consumers before publishing a message rabbitmq

I am trying to build a system where I need to select the next available and suitable consumer to send a message to from a queue (or maybe some other solution not using a queue).
Requirements
We have multiple publishers/clients who send objects (images) to process on one side, and multiple Analysts who process them; once an object is processed, the publisher should get the corresponding response.
The publishers do not care which Analyst is going to process the data.
Users have a web app where they can map each client/publisher to one, several, or all agents. Say, for instance, Publisher P1 is mapped to Agents A & B; then all objects coming from P1 can be processed by Agent A or Agent B. Note: an object can only be processed by one agent.
Depending on the mapping, I need a middleware which consumes the messages from all publishers and distributes them to the agents.
Solution 1
My initial thoughts were to have a queue where all publishers post their messages, and another queue where agents publish a message saying they are waiting to process an object.
A middleware picks up a message, gets the possible list of agents it can send the message to (from a cached database), goes through the agents queue to find the next suitable and available agent, and publishes the message to that agent.
The issue with this solution: if the agents queue looks like a, b, c, d and the message I receive can only be processed by agent b, I will be rejecting agents c & d and they will end up at the tail of the queue. With around 180 agents, some might never be picked; or, if the next message can only be processed by agent d (for example), we have to reject all the agents ahead of it to get there.
Solution 2
First bit from publishers to middleware is still the same
Have a scaled, fast NoSQL database where agents add a record to advertise their availability. Basically a key-value pair.
The middleware gets the config from cache, gets the next available + suitable agent from the NoSQL database, sends the message to the agent's queue (through a direct exchange), updates the NoSQL store to set isAvailable to false, and gets the next message.
The issue with this solution is that the db and middleware can become a bottleneck. Also, if I scale the middleware I will end up with database concurrency issues. For example, if I have two copies of the middleware running and each receives a message which can be processed by Agents A & B, and both agents are available,
the two middleware copies would query the db, might both get A as available, and end up sending both messages to A while B is still waiting for a message to process.
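One common fix for that race is an atomic claim (compare-and-set) on the agent record, so only one middleware copy wins. A sketch (Python; a lock stands in for the conditional-write atomicity that most NoSQL stores offer):

```python
import threading

class AgentRegistry:
    """Stand-in for the NoSQL store. try_claim models the atomic
    'set isAvailable=false only if it is currently true' update that
    real stores expose as conditional writes / compare-and-set."""

    def __init__(self, agents):
        self._available = {a: True for a in agents}
        self._lock = threading.Lock()  # stands in for the store's atomicity

    def try_claim(self, agent: str) -> bool:
        with self._lock:
            if self._available.get(agent):
                self._available[agent] = False
                return True
            return False  # another middleware copy got there first

registry = AgentRegistry(["A", "B"])
first = registry.try_claim("A")   # this middleware copy wins A
second = registry.try_claim("A")  # the other copy loses and falls through to B
print(first, second)  # True False
```

A middleware copy that loses the claim simply retries with the next suitable agent, which removes the double-assignment problem without serializing the middleware itself.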
I will have around 100 publishers and 180 agents to start with.
Any ideas on how to improve these solutions, or any other feasible solution, would be highly appreciated.
Depending on this I also need to figure out how the Agent would send response back to the publisher.
Thank you
I'll answer this from the perspective of my open-source service bus: Shuttle.Esb.
Typically one would ignore any content-based routing and simply use a distributor pattern. All messages go to the primary endpoint and it distributes the messages. However, if you decide to stick to these logical groupings, you could have a primary endpoint for each logical grouping (per agent group). You would still have the overall primary endpoint, but instead of having worker endpoints mapped to agents you would have agent groupings mapped to the logical primary endpoint, with workers backing it.
Then in the primary endpoint you would, based on your content (being the agent identifier), forward the message to the relevant logical primary endpoint. All the while you keep track of the original sender. In the worker you would then send a message back to the queue of the original sender.
I'm sure you could do pretty much the same using any service bus.
I see several requirements in here, that can be boiled down to a few things, I think:
publisher does not care which agent processes the image
publisher needs to know when the image processing is done
agent can only process 1 image at a time
agent can only process certain images
are these assumptions correct? did I miss anything important?
if not, then your solution is pretty much built into RabbitMQ with routing and queues. there should be no need to build custom middle-tier service to manage this.
With RabbitMQ, you can have a consumer set to only process 1 message at a time. The consumer sets its "prefetch" limit to 1, and retrieves a message from the queue with "no ack" set to false - meaning it must acknowledge the message when it is done processing it.
To consume only messages that a particular agent can handle, use RabbitMQ's routing capabilities with multiple queues. The queues would be created based on the type of image or some other criteria by which the consumers can select images.
For example, if there are two types of images: TypeA and TypeB, you would have 2 queues - one for TypeA and one for TypeB.
Then, if Agent1 can only handle TypeA images, it would only consume from the TypeA queue. If Agent2 can handle both types of images, it would consume from both queues.
To put the right images in the right queue, the publisher needs to use the right routing key. If you know the image type (or whatever the selection criteria is), you would change the routing key on the publisher side to match that selection criteria. The routing in RabbitMQ would be set up to move messages for TypeA into the TypeA queue, etc.
The last part is getting a response when the image is done processing. That can be accomplished through RabbitMQ's "reply to" field and related code. The gist of it is that the publisher has its own exclusive queue. When it publishes a message, it includes the name of its exclusive queue in the "reply to" header of the message. When the agent finishes processing the image, it sends a status update message back through the queue found in the "reply to" header. That status update message tells the publisher the status of the request.
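The reply-to flow can be sketched with in-process queues standing in for RabbitMQ queues (the queue names are made up; the RPC tutorial linked below shows the real client-library version):

```python
import queue

broker = {}  # queue name -> queue.Queue, standing in for the RabbitMQ broker

def publish_request(image: str, reply_to: str) -> None:
    """Publisher sends an image for processing, naming its own exclusive
    queue in the message's 'reply to' field."""
    broker.setdefault("images", queue.Queue())
    broker["images"].put({"body": image, "reply_to": reply_to})

def agent_process_one() -> None:
    """Agent takes one request and sends the status update back through
    whatever queue the 'reply to' field names."""
    msg = broker["images"].get()
    result = {"status": "done", "image": msg["body"]}
    broker.setdefault(msg["reply_to"], queue.Queue())
    broker[msg["reply_to"]].put(result)

# The publisher declares its exclusive queue, then names it in "reply to".
broker["publisher-1-replies"] = queue.Queue()
publish_request("cat.jpg", reply_to="publisher-1-replies")
agent_process_one()
print(broker["publisher-1-replies"].get())  # {'status': 'done', 'image': 'cat.jpg'}
```

The key property is that the agent never needs to know who the publisher is; the reply address travels with the request itself.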
From a RabbitMQ perspective, these pieces can be put together using the examples and documentation found here:
http://www.rabbitmq.com/getstarted.html
Look at these specifically:
Work Queues: http://www.rabbitmq.com/tutorials/tutorial-two-python.html
Topics: http://www.rabbitmq.com/tutorials/tutorial-five-python.html
RPC (aka Request/Response): http://www.rabbitmq.com/tutorials/tutorial-six-python.html
You'll find examples in many languages, in these docs.
I also cover most of these scenarios (and others) in my RabbitMQ Patterns eBook
Since the total number of senders and receivers is only in the hundreds, how about creating one queue for each of your senders? Based on your sender-receiver mapping, receivers subscribe to the sender queues (updating the subscriptions on mapping changes). You could configure each receiver to only receive the next message from the queues it subscribes to (in a random way) once it finishes processing a message.

How do I group consumers in RabbitMQ?

We are writing a mail sync system, and we use RabbitMQ for that. Every producer pushes mail ids, then a consumer gets the ids and inserts the mails into the db. In a situation where we have 100 consumers (for example) and the producers generate ids too fast, every consumer will get ids and use the api to fetch mails, and then there will be an exception about the limit of concurrent requests to the api.
Can we limit the consumers for each producer (for example, so that at most 3 consumers receive the ids of one producer, the next 3 receive from another one, and so on)?
Can we limit the consumers for each producer (for example, so that at most 3 consumers receive the ids of one producer, the next 3 receive from another one, and so on)?
You could do this by using simple routing.
ProducerA sends messages with routing key routeA and consumer1, consumer2 and consumer3 are subscribed to exchange with routing key routeA.
ProducerB sends messages with routing key routeB and consumer4, consumer5 and consumer6 are subscribed to exchange with routing key routeB.
.. and so on
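A sketch of that routing (Python; direct-exchange style, with queue names made up): each producer tags its messages with a routing key, the key binds to one queue, and that queue's consumers then share the messages as competing consumers:

```python
# Direct-exchange style routing: routing key -> the single queue bound
# to it. Each queue's consumer group (e.g. consumer1..3 on queueA)
# then shares that queue's messages as competing consumers.
bindings = {
    "routeA": "queueA",  # consumed by consumer1, consumer2, consumer3
    "routeB": "queueB",  # consumed by consumer4, consumer5, consumer6
}
queues = {name: [] for name in bindings.values()}

def publish(routing_key: str, body: str) -> None:
    """Producer side: the routing key alone decides which queue gets the message."""
    queues[bindings[routing_key]].append(body)

publish("routeA", "mail-id-1")
publish("routeB", "mail-id-2")
print(queues)  # {'queueA': ['mail-id-1'], 'queueB': ['mail-id-2']}
```

Combined with a prefetch of 1 on each consumer, this caps the concurrency per producer at the size of its consumer group (3 here), which is exactly the limit being asked about.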
You could also use topic exchange.
However, it seems to me that this may not solve the problem of the exception about the limit of concurrent requests to the API. You didn't specify which API, so I can only assume that either this number is configurable and you can increase it, or concurrent access is simply not allowed (which is hard to imagine since, you know, it's not the 70s), in which case the whole idea of parallelism crumbles and falls...