Consuming data from multiple topics, aggregating it, and calling an API

We are consuming data from multiple Kafka topics (topic 1: employee basic details; topic 2: address details) and then calling an API (/createEmployee) on another system. In order to call the /createEmployee API, we need to aggregate the data from both topics first and then call the API.
How can we do that?

Kafka Streams can be used to join and aggregate topics, as well as process the aggregate:
https://kafka.apache.org/33/documentation/streams/developer-guide/write-streams.html
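For a sense of the join logic involved, here is a plain-Java sketch (field names and payload format are hypothetical) of what such a join does conceptually: buffer the latest record per employee ID from each topic and emit an aggregate once both sides have arrived, at which point the real system would call /createEmployee:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Plain-Java sketch of the join Kafka Streams would perform for us.
// Field names, topics, and the payload format are hypothetical.
public class EmployeeJoin {
    static final Map<String, String> basicById = new HashMap<>();   // from topic 1
    static final Map<String, String> addressById = new HashMap<>(); // from topic 2

    // Called for each record consumed from the basic-details topic; returns
    // the aggregated payload once BOTH sides for this employee ID are present.
    static Optional<String> onBasic(String id, String basic) {
        basicById.put(id, basic);
        return tryJoin(id);
    }

    // Called for each record consumed from the address topic.
    static Optional<String> onAddress(String id, String address) {
        addressById.put(id, address);
        return tryJoin(id);
    }

    static Optional<String> tryJoin(String id) {
        if (basicById.containsKey(id) && addressById.containsKey(id)) {
            // In the real system this payload would be POSTed to /createEmployee.
            return Optional.of("{\"id\":\"" + id + "\",\"basic\":\"" + basicById.get(id)
                    + "\",\"address\":\"" + addressById.get(id) + "\"}");
        }
        return Optional.empty(); // still waiting for the other topic
    }

    public static void main(String[] args) {
        System.out.println(onBasic("e1", "Alice"));        // empty: address not seen yet
        System.out.println(onAddress("e1", "42 Main St")); // joined payload
    }
}
```

A real Kafka Streams topology would express this with a KStream/KTable join and handle the state stores, repartitioning, and fault tolerance for you.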

Related

How do I design SpringBoot Pagination Restful API with Kafka topic?

I am trying to build a paginated RESTful API that fetches data from a Kafka topic.
For example, inside my Kafka topics, I have 1 billion messages whose data structure is like the following:
class Record {
    String ID;
    JsonObject studentInfo;
}
How do I get the paginated query result for a specific student id? For example, I want to get 200 records of the student whose id is 0123 and this student might or might not have 200 records on the Kafka topic.
My intuitive approach was to poll data from the Kafka topic, keep track of the offset, and keep reading until I have 200 records for that student or reach the end of the topic. However, I am not sure this is the right approach to take.
The Confluent REST Proxy already does what you want, so I would recommend using that rather than reinventing the wheel.
GET /consumers/(string:group_name)/instances/(string:instance)/records
Fetch data for the topics or partitions specified using one of the subscribe/assign APIs
Rather than a number of records to poll, you give it a timeout (e.g. consumer.poll(Duration timeout)) or max_bytes (the consumer config fetch.max.bytes, I think).
Re-GET that API endpoint to get the next "batch" (i.e. page) of records
https://docs.confluent.io/platform/current/kafka-rest/api.html
for a specific student id?
You wouldn't. That's not how Kafka works. If this is a feature you really need, you can use the Interactive Queries feature of Kafka Streams, for which Spring provides an InteractiveQueryService class that can help with this.
Or, as mentioned in the comments, dump your topic to a database, indexed by ID, then build an API endpoint that will query and paginate from that.
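The scan-and-filter approach from the question can be sketched with an in-memory list standing in for the topic (names are hypothetical; a real consumer would poll batches and track committed offsets):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of scan-based "pagination" over a topic: collect records for one
// student ID, starting at a given offset, until pageSize matches are found
// or the end of the (in-memory stand-in for the) topic is reached.
public class TopicPager {
    // Returns the offsets of matching records; the next page resumes at the
    // offset after the last one returned.
    static List<Integer> page(List<String> topic, String studentId, int fromOffset, int pageSize) {
        List<Integer> matches = new ArrayList<>();
        for (int offset = fromOffset; offset < topic.size() && matches.size() < pageSize; offset++) {
            if (topic.get(offset).equals(studentId)) {
                matches.add(offset);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> topic = List.of("0123", "9999", "0123", "0123", "8888");
        System.out.println(page(topic, "0123", 0, 2)); // [0, 2]
        System.out.println(page(topic, "0123", 3, 2)); // [3]
    }
}
```

The drawback this makes visible is that every page is a linear scan of the topic, which is why dumping the data into an indexed database is usually the better answer.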

Many to many filtering in Message Broker

I have a Person object in the system. When a Person does some action, there is an Administrator actor who is interested in monitoring these kinds of events.
Person
{
    Id: string
}
PersonAction
{
    ActionType: enum
    PersonId: string
}
Currently I have this subscription implemented through a Service Bus topic and subscriptions: Administrators subscribe to the actions of all Persons in the system:
Azure Service Bus broker has PersonActions topic.
Every time a Person does any action, a PersonAction event is sent to the topic.
Every Administrator creates their own subscription to the topic and monitors all Persons' actions.
Now I have a new requirement that introduces grouping of Persons, and I need a way to allow Administrators to subscribe to PersonAction events based on the groups they want to monitor:
Persons can be part of one or more groups.
Administrators are interested in monitoring groups of Persons and, hence, receiving all PersonAction events for groups they are monitoring.
Administrators may subscribe to one or several groups.
Here are my thoughts on how to do this:
Add to PersonAction a routing property that contains information about the groups this Person is a member of.
When an Administrator creates a new subscription, they will specify the set of groups they want to monitor, and that set should then somehow be used in a subscription filter to filter PersonAction messages in the topic.
So, cutting to the chase, I want to leverage Service Bus topic filtering capabilities to deliver PersonAction messages specifically to the Administrators that are interested in them, based on groups.
In general this doesn't seem to be a straightforward task with Service Bus (or any other message broker) because there is a many-to-many relation: one Person can be in multiple groups, and an Administrator may want to subscribe to multiple groups. Usually filters support matching on a single-valued property (like "groupId=1234"), and in my case it's an array.
So far I've come up with two solutions but don't quite like either of them:
Use a LIKE SqlFilter: concatenate all of the Person's groups into a single comma-separated string (groups=1,2,5,8) and then have a filter such as groups LIKE '%1%' OR groups LIKE '%5%' (in reality the group ids will be GUIDs, so don't mind the problem of one group id being a substring of another).
Add each group id as a property with an empty value and then use an EXISTS filter to check whether the event has that group id defined. The filter would be EXISTS(1) OR EXISTS(5), and the PersonAction properties would be {1: null, 2: null, 5: null, 8: null}.
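Both proposed filters reduce to the same membership test. Here is a client-side simulation of that test (this is not the broker's SQL filter engine, just an illustration of the matching semantics):

```java
import java.util.Set;

// Client-side simulation of the two Service Bus filter ideas: both reduce to
// "does the message's group set intersect the administrator's subscribed set?"
public class GroupFilter {
    // Approach 1: groups concatenated into one comma-separated property,
    // matched with LIKE-style substring checks.
    static boolean likeFilterMatches(String groupsProperty, Set<String> subscribed) {
        return subscribed.stream().anyMatch(groupsProperty::contains);
    }

    // Approach 2: one property per group id, matched with EXISTS checks.
    static boolean existsFilterMatches(Set<String> messageGroupProps, Set<String> subscribed) {
        return subscribed.stream().anyMatch(messageGroupProps::contains);
    }

    public static void main(String[] args) {
        Set<String> subscribed = Set.of("1", "5");
        System.out.println(likeFilterMatches("1,2,5,8", subscribed));          // true
        System.out.println(existsFilterMatches(Set.of("2", "8"), subscribed)); // false
    }
}
```

The simulation also shows why the LIKE variant is fragile for short ids: a substring match is not a set-membership test, which is exactly the caveat noted above about one group id being a substring of another.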
Is there a better way to do such filtering, and how are many-to-many filtering rules handled in message brokers?
Answers describing this for any message broker (not only Service Bus) would also be extremely helpful.
I'm not really that familiar with other brokers but here is something that comes to mind for Azure Service Bus.
You could have 2 (3 with the bonus) levels of entities instead of 1 for such a scenario:
The first level is a topic where all the PersonAction messages come in; it would have subscriptions for each group, with auto-forward set up to their own topics.
The second level is where each group has its own topic; administrators would subscribe to multiple topics based on the groups they want to monitor, but will have to de-duplicate messages.
You could remove this level and have direct subscriptions (one per group per administrator) but would likely hit the limit of 2000 subscriptions per topic
(Bonus) Auto Forward the messages from the subscriptions into administrator queues and enable Duplicate Detection
Note that the number of operations billed would increase as mentioned in the Auto Forward Considerations section of the docs
Here is a more elaborate explanation:
1. Input Topic
This is where the PersonAction messages would first come in.
This topic would have subscriptions that filter messages based on the group (either of your approaches; I'd prefer using a Correlation Filter since it's more efficient) and auto-forward the messages into the respective topics.
2. Topic per Group
This is where the PersonAction messages filtered by group go into.
At this point, there would be copies of the same message in different topics, based on all of the groups the Person is part of.
Administrators would create subscriptions to the topics required depending on the groups they want to monitor but will have to handle the duplicate messages that they could potentially receive.
3. (Bonus) Administrator Queues
The subscriptions created by administrators could be set up to auto-forward messages into their personal queues, and these queues could have duplicate detection enabled, allowing the administrators to freely process the messages as-is without worrying about duplicates.
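The two-level design can be simulated in memory (names hypothetical) to show why duplicate detection matters when an administrator subscribes to several of the groups a single message belongs to:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory sketch of the two-level design: a message fans out to one
// "group topic" per group it carries, an administrator subscribed to several
// groups sees one copy per matching group, and duplicate detection (by
// message ID) collapses those copies in their queue.
public class FanOutSketch {
    static final Set<String> seenMessageIds = new HashSet<>(); // dedup state

    // Returns the messages actually delivered to the administrator's queue.
    static List<String> deliver(String messageId, Set<String> messageGroups, Set<String> adminGroups) {
        List<String> delivered = new ArrayList<>();
        for (String group : messageGroups) {          // level 1: forward per group
            if (adminGroups.contains(group)) {        // level 2: admin's subscriptions
                if (seenMessageIds.add(messageId)) {  // bonus: duplicate detection
                    delivered.add(messageId);
                }
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Message m1 is in groups 1 and 5; the admin watches both,
        // yet receives a single copy thanks to dedup.
        System.out.println(deliver("m1", Set.of("1", "5"), Set.of("1", "5"))); // [m1]
    }
}
```

Without the dedup check, the admin would receive one copy per matching group, which is the duplication the second level introduces by design.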

Multiple subscriptions to a topic

I have been using Pub/Sub for a bit of asynchronous work and was wondering why someone might create multiple subscriptions for a single topic. My default values are as follows:
project_id = 'project'
topic_name = 'app'
subscription_name = 'general'
The routing of the actual function (and how to process it) is being done in the subscriber receiver itself.
What would be reasons why there would be various subscription names? The only thing I can think of is to spread items across multiple servers for processing, such as:
server1 -- `main-1`
server2 -- `main-2`
etc.
Are there any other reasons why a single subscription name would not work well?
In general, there are two paradigms for having multiple subscribers:
Load balancing: The goal is to parallelize the processing of the load by having multiple subscribers using the same subscription. In this scenario, every subscriber receives a subset of the messages. One can horizontally scale processing by creating more subscribers for the same subscription.
Fan out: The goal is to have multiple subscribers receive the entire feed of messages. This is accomplished by having multiple subscriptions. The reason to have fan out is if there are multiple downstream applications interested in the full feed of messages. Imagine there is a feed where the messages are user events on a shopping website. Perhaps one application backs up the data to files, another analyzes the feed for trends in what people are looking at, and another looks through activity to try to find potentially fraudulent transactions. In this scenario, every one of those applications acting as a subscriber needs the full feed of messages, which requires separate subscriptions.
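The two paradigms can be sketched with in-memory queues (a deliberately simplified model; real Pub/Sub does not guarantee round-robin distribution, only that each message in a subscription goes to some subscriber):

```java
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of the two paradigms. Load balancing splits ONE
// subscription's messages across its subscribers, while fan-out gives EACH
// subscription (and thus each downstream application) the full feed.
public class PubSubParadigms {
    // Load balancing: distribute one subscription's messages over n subscribers
    // (round-robin here for illustration only).
    static List<List<String>> loadBalance(List<String> subscription, int subscribers) {
        List<List<String>> perSubscriber = new ArrayList<>();
        for (int i = 0; i < subscribers; i++) perSubscriber.add(new ArrayList<>());
        for (int i = 0; i < subscription.size(); i++) {
            perSubscriber.get(i % subscribers).add(subscription.get(i));
        }
        return perSubscriber;
    }

    // Fan out: every subscription receives a full copy of the feed.
    static List<List<String>> fanOut(List<String> feed, int subscriptions) {
        List<List<String>> perSubscription = new ArrayList<>();
        for (int i = 0; i < subscriptions; i++) perSubscription.add(new ArrayList<>(feed));
        return perSubscription;
    }

    public static void main(String[] args) {
        List<String> feed = List.of("a", "b", "c", "d");
        System.out.println(loadBalance(feed, 2)); // [[a, c], [b, d]]
        System.out.println(fanOut(feed, 2));      // [[a, b, c, d], [a, b, c, d]]
    }
}
```

In the shopping-website example, the backup, analytics, and fraud applications would each own one subscription (fan-out), and each could then scale its own workers on that subscription (load balancing).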

Google PubSub : How to customize distribution of messages to consumers?

I have a scenario where we will be sending customer data to Pub/Sub and consuming it with Java subscribers. I have multiple subscribers subscribed to the same subscription. Is there a way to route all messages with the same customerID to the same subscriber?
I know Google Dataflow has session-based windowing. However, I wanted to know if we can achieve this using simple Java consumers.
Update June 2020: Filtering is now an available feature in Google Cloud Pub/Sub. When creating a subscription, one can specify a filter that looks at message attributes. If a message does not match the filter, the Pub/Sub service automatically acknowledges the message without delivering it to the subscriber.
In this case, you would need to have different subscriptions and each subscriber would consume messages from one of the subscriptions. Each subscription would have a filter set up to match the customer ID. If you know the list of customer IDs and it is short, you would set up an exact match filter for each customer ID, e.g.,
attribute.customerID = "customerID1"
If you have a lot of customer IDs and wanted to partition the set of IDs received by each subscriber, you could use the prefix operator to do so. For example, if the IDs are numbers, you could have filters such as:
hasPrefix(attribute.customerID, "0")
hasPrefix(attribute.customerID, "1")
hasPrefix(attribute.customerID, "2")
hasPrefix(attribute.customerID, "3")
...
hasPrefix(attribute.customerID, "9")
Previous answer:
At this time, Google Cloud Pub/Sub has no way to filter messages delivered to particular subscribers, no. If you know a priori the number of subscribers you have, you could do it yourself, though. You could create as many topics as you have subscribers and then bucket customer IDs into the different topics, publishing messages to the right topic for each customer ID. You'd create a single subscription on each topic, and each subscriber would receive messages from one of these subscriptions.
The disadvantage is that if you have any subscribers that want the data for all customer IDs, then you'll have to have an additional subscription on each topic and that subscriber will have to get messages from all of those subscriptions.
Keep in mind that you won't want to create more than 10,000 topics or else you may run up against quotas.
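The bucketing idea in the previous answer needs a deterministic mapping from customer ID to topic; here is a minimal sketch, assuming a hypothetical topic naming scheme:

```java
// Sketch of the "bucket customer IDs into topics" idea: a deterministic hash
// picks the same topic for a given customer ID every time, so all of that
// customer's messages land with the same subscriber.
public class CustomerBucketing {
    static String topicFor(String customerId, int topicCount) {
        // Math.floorMod keeps the bucket non-negative even if hashCode() is negative.
        int bucket = Math.floorMod(customerId.hashCode(), topicCount);
        return "customer-topic-" + bucket; // hypothetical topic naming scheme
    }

    public static void main(String[] args) {
        // The same customer always maps to the same topic.
        System.out.println(topicFor("customerID1", 4).equals(topicFor("customerID1", 4))); // true
    }
}
```

Note that changing topicCount later would remap customers to different topics, so the bucket count should be fixed up front (matching the "know a priori" caveat above).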

How to use Azure service bus topics & Subscriptions to load balance messages

In reading many MSDN pages about Azure Service Bus, I see allusions to the ability to set up a "load balancing" pattern with the topic/subscription model, but they never say how this is done.
My question is: is this possible? Essentially, we are looking to create topics that would have some number n of subscribers that could be dynamically ramped up and down based on incoming load. So it would not use the traditional "multicast" pattern but instead round-robin the messages to the subscribers. The reason we want to use this pattern is that we want to take advantage of the rules and filtering that reside in topics and subscriptions while allowing for dynamic scaling.
Any ideas?