Creating multiple subs on same topic to implement load sharing (pub/sub) - rabbitmq

I spent almost a day on the Google Pub/Sub documentation to create a small app. I am thinking of switching from RabbitMQ to Google Pub/Sub. Here is my question:
I have an app that pushes messages to a topic (T). I wanted to do load sharing via subscribers, so I created 3 subscribers to T. I gave all 3 subscriptions the same name (S) so that I don't get the same message 3 times.
I have 2 issues:
Nowhere in the console do I see 3 subscribers to T. It shows only 1.
If I try to start all 3 subscriber instances at the same time, I get "A service error has occurred.". The error disappears if I start them sequentially.
Lastly, is Google serious about Pub/Sub? Looking at the documentation and the level of public participation, I am not sure whether I should switch to Google Pub/Sub.
Thanks,

In Pub/Sub, each subscription gets a copy of every message. So to load balance message handling, you don't want 3 different subscriptions, but a single subscription that distributes messages to 3 workers.
If you are using pull delivery, simply create a single subscription (as a one-time action when you set up the system), and have each worker pull from the same subscription.
If you are using push delivery, have a single subscription pushing to a single endpoint that provides load balancing (e.g. push to an HTTP load balancer with multiple instances in a backend service).
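To make the difference between "3 subscriptions" and "1 subscription with 3 workers" concrete, here is a toy in-memory model in Python. It does not use the Pub/Sub client library; the Topic class and the drain helper are made up for illustration, and the round-robin order is only an assumption of the model (the real service does not promise any particular distribution).

```python
from collections import deque


class Topic:
    """Toy in-memory topic: every subscription receives its own copy of each message."""

    def __init__(self):
        self.subscriptions = {}

    def subscription(self, name):
        # Asking for an existing name returns the same queue, which is why
        # three "subscribers" named S show up as a single subscription.
        return self.subscriptions.setdefault(name, deque())

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)


def drain(queue, workers):
    """Let the workers take turns pulling from one shared subscription queue."""
    handled = {worker: [] for worker in workers}
    turn = 0
    while queue:
        worker = workers[turn % len(workers)]
        handled[worker].append(queue.popleft())
        turn += 1
    return handled
```

Each message published to the topic lands once in every subscription, but within one subscription it is handled by exactly one of the workers that pull from it.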
Google is serious about Pub/Sub: it is deeply integrated into many products (GCS, BigQuery, Dataflow, Stackdriver, Cloud Functions, etc.) and Google uses it internally.

As per the GCP documentation: https://cloud.google.com/pubsub/architecture
Load-balanced subscribers are possible, but all of them have to use the same subscription. I don't have a code sample or POC ready, but I am working on one.

Related

How to organize scheduled data polling during the application scaling?

I have a microservice that among other things is used as a "caching proxy" (I'm not sure that this term is correct). It is in between the application API and Azure API. This microservice periodically fetches some data from Azure for several resources and stores it in Redis. Application API from the other side requests the resource data but reads it not from Azure itself, but from Redis.
(This is done in order to limit the scale of requests hitting the Azure API when having a high load on the application API.)
The periodical polling is currently implemented as a naive "while not canceled - fetch, update Redis and sleep for 15 seconds".
This worked well while I had only one instance of the microservice. But now, due to new requirements, my microservice scales automatically. That means that if there are 5 instances of the microservice running right now, I'm hitting the Azure API 5 times more frequently than I should.
My question is: how can I fix this to do "one request to Azure API per resource every 15 seconds", no matter how many microservice instances I have?
My constraints are:
do the minimal changes since the microservice is already in Production;
use the existing resources as much as possible (apart from Redis the microservice is already using message queues - Azure Service Bus).
Ideas I have:
make only one instance a "master" - only this instance will fetch data from Azure. But what should I do when auto-scaling shuts this instance down? How can I detect this and decide on a new master instance? Maybe I could store the master instance identifier in a short-living key in Redis and prolong it every time the resource data is retrieved from Azure? If there is no key in Redis - a new master instance is selected.
use Azure Service Bus message scheduling - on microservice application startup the instance schedules a message for 15 seconds later, which will be received by only one microservice instance. On receiving this message the microservice instance will fetch the data from Azure, update Redis, and schedule another message for 15 seconds later. This time another microservice instance can receive the message and do the same: fetch data, update Redis, and schedule the next message. But I don't know how to avoid parallel message chains being started when several microservice instances are started/restarted.
Anyway, I don't see any good solution for my problem and would appreciate a hint.
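For what it's worth, the first idea (a short-living master key that the current master keeps prolonging) can be sketched roughly like this. This is only an in-memory illustration in Python: LeaseStore, poll_once and the "poller-master" key are hypothetical names, and the store is not safe across processes. With real Redis, the acquire step would have to be an atomic set-if-absent with an expiry (SET with the NX and EX options), and the renewal would need an atomic check that the key still belongs to the caller.

```python
import time


class LeaseStore:
    """In-memory stand-in for a key-value store with set-if-absent and expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (owner, expires_at)

    def acquire_or_renew(self, key, owner, ttl):
        """Return True if `owner` holds the lease on `key` after this call."""
        now = self._clock()
        current = self._entries.get(key)
        if current is not None:
            holder, expires_at = current
            if expires_at > now and holder != owner:
                return False  # someone else still holds a live lease
        self._entries[key] = (owner, now + ttl)
        return True


def poll_once(store, instance_id, fetch, ttl=20.0):
    """Fetch only when this instance holds the lease; return whether it fetched."""
    if not store.acquire_or_renew("poller-master", instance_id, ttl):
        return False
    fetch()
    return True
```

Every instance keeps its 15-second loop but calls poll_once instead of fetching directly. The ttl is assumed to be a little longer than the polling interval, so a healthy master renews before the key expires, and when the master is shut down another instance takes over once the key has expired.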

Handling of pubsub subscribers for distributed longrunning tasks

I am evaluating the use of pubsub for long-running tasks such as video transcoding, where a particular transcode may take between 2 and 10 minutes. Is pubsub a good approach for such task distribution? For example, let's say I have five servers:
- publisher1
- publisher2
- publisher3
- publisher4
- publisher5
And a topic called "videos". Would it be possible to spread out the messages equally across those five servers? What about when servers are added or removed? What would be a good approach to doing this, or is pubsub not the right tool for something like this?
This does sound like a reasonable use case for pubsub. Specifically, if you use a pull subscriber, you can configure the flow control settings to allow at most one outstanding message per server, and configure the max ack extension period (in Java) to be a reasonable upper bound on your processing time. The API is described here: http://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/pubsub/v1/package-summary.html
This should effectively load balance across your servers by default if you use the same subscription ID for all jobs. If a server is added and a backlog exists, it will receive a new entry. If a server is removed, it will no longer be sent messages. If it is removed while processing, or crashes, the message it was working on will be resent to another server.
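The behaviour described above (one outstanding message per server, and redelivery when a server goes away) can be illustrated with a small toy model in Python. This is not the Pub/Sub client API; the Subscription class and its method names are invented for the sketch, and a lost worker is signalled explicitly here, whereas the real service redelivers once the ack deadline passes without an ack.

```python
from collections import deque


class Subscription:
    """Toy model of a shared subscription with per-worker flow control."""

    def __init__(self, messages):
        self.backlog = deque(messages)
        self.outstanding = {}  # worker -> message currently being processed

    def pull(self, worker):
        """Hand out one message, unless the worker already has one outstanding."""
        if worker in self.outstanding or not self.backlog:
            return None
        message = self.backlog.popleft()
        self.outstanding[worker] = message
        return message

    def ack(self, worker):
        """The worker finished; its message is done for good."""
        return self.outstanding.pop(worker)

    def worker_lost(self, worker):
        """The worker crashed or was removed; its unacked message returns to the backlog."""
        message = self.outstanding.pop(worker, None)
        if message is not None:
            self.backlog.appendleft(message)
```

A newly added server just starts calling pull and gets the next backlog entry, so nothing has to be reconfigured when servers come and go.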
One concern, however, is that pubsub has a limit of 10 MB per message. You might consider instead putting the data itself in a Google Cloud Storage bucket. Cloud Storage can publish the file location to a pubsub topic when an upload is complete: https://cloud.google.com/storage/docs/pubsub-notifications

Sending huge amount of emails using Amazon SES

I'm going to use Amazon SES for sending emails from the website I'm currently building. Following the sample Java code in their API documentation, I developed the functionality and was able to send emails. But when it comes to handling a huge number of emails in a very short period of time, what is the best mechanism to use? Do they provide any queue mechanism for emails? I couldn't find this in their API documentation, and their technical support is available only to users who have purchased an account.
Has anyone come across a solution for this problem?
Generally I use a custom SQS solution for a batch mailing process like this.
Sending more than a few emails from a web server isn't ideal, so I usually only have the website submit the request for the emails to a back-end process in a single call. Then I create an SQS message for each recipient and (in my case) use a Windows service that requests messages from SQS and sends the emails at the pace I want them to go out. If errors are encountered, the message stays in the queue and gets retried automatically.
With an architecture like this, depending on your volumes you can spin up new instances automatically if the SQS queue size gets too large for a single instance to process in a timely manner.
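As a rough illustration of the flow (one queue entry per recipient, a worker that sends at its own pace, failures left in the queue for a retry), here is a small Python sketch. It uses a plain in-memory deque instead of SQS, and enqueue_recipients, send_batch and the send callback are hypothetical names for this example only.

```python
from collections import deque


def enqueue_recipients(queue, recipients):
    """One queue entry per recipient, as the web tier would submit them."""
    for address in recipients:
        queue.append(address)


def send_batch(queue, send, batch_size):
    """Process up to `batch_size` entries; failed sends go back on the queue."""
    sent = []
    for _ in range(min(batch_size, len(queue))):
        address = queue.popleft()
        try:
            send(address)
        except Exception:
            queue.append(address)  # stays queued and is retried on a later pass
        else:
            sent.append(address)
    return sent
```

The worker would call send_batch on a timer, and batch_size together with the timer interval sets the pace at which emails go out.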

MassTransit and RabbitMQ - how many consumers are connected

I am using MassTransit and RabbitMQ in both a "competing consumers" model and a Pub/Sub model.
3 tiers,
1st tier = UI, 2nd tier = gateway, 3rd tier = many distributed services
I have a working competing consumers model but I wish to do the following with Pub/Sub:
The gateway service publishes a message that all connected subscriber instances consume and then respond to, back to the gateway. The gateway doesn't respond to the UI until all of the 3rd-tier services have responded; it accumulates the responses and finally passes them back to the UI.
I cannot find a way to inspect MassTransit (whether I use SAGAs or not) in the 2nd tier to know how many subscribers I have in the 3rd tier (to work out if they've all responded). The overall goal is that the UI gets a single response with the accumulated results from the 3rd tier.
A similar question is here - no answers as yet.
UPDATE
Effectively I want to count the number of sinks on the inboundPipeline. Should I be doing this and is there a clean way to do it?
Pub/Sub in general doesn't allow you to know how many consumers for a given message exist. The whole idea is that you aren't coupled to that answer.
To do this, you need to build the tracking into your application. When a consumer comes up, have it publish a message that registers it with the gateway. When it shuts down, have it publish another message that removes that registration.
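A minimal sketch of that bookkeeping on the gateway side, in plain Python rather than MassTransit, might look like this. ResponseTracker and its method names are made up for the example; the register/unregister calls stand for handling the registration messages described above.

```python
class ResponseTracker:
    """Gateway-side bookkeeping: who is registered and who has answered."""

    def __init__(self):
        self.registered = set()
        self.responses = {}  # request_id -> {consumer_id: result}
        self.expected = {}   # request_id -> consumers registered at publish time

    def register(self, consumer_id):
        self.registered.add(consumer_id)

    def unregister(self, consumer_id):
        self.registered.discard(consumer_id)
        # Stop waiting on a consumer that has gone away.
        for waiting in self.expected.values():
            waiting.discard(consumer_id)

    def start_request(self, request_id):
        self.expected[request_id] = set(self.registered)
        self.responses[request_id] = {}

    def record(self, request_id, consumer_id, result):
        """Store one response; return the accumulated results once all have answered."""
        self.responses[request_id][consumer_id] = result
        if self.expected[request_id] <= set(self.responses[request_id]):
            return self.responses[request_id]
        return None
```

A consumer that crashes without sending its shutdown message would leave the gateway waiting, so in practice this probably also needs a timeout per request.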

"Archiving" publish/subscribe message in Redis

I am using Redis' publish/subscribe feature. So the server is publishing 10 items then the client gets those 10 items.
Now however, a new client subscribes to the feed. I would like them to get the previous 10 items as well as any new items.
Does Redis have a way of doing this using the publish and subscribe functionality? Is a feed history stored anywhere in the database? Is there an easy way of doing this? Is the best way to also store the messages in a list and have the client do an LRANGE my_list 0 9 on the list (both indexes are inclusive, so 0 9 returns 10 items)?
I'd keep a separate archive of the data and have events added to both. New clients can subscribe and queue the real time events, read the archive until it's up to date with the first published event, then catch up with the published events. That way you shouldn't miss any published events while switching between the archive and real time events.
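Here is a toy Python model of that subscribe-first, then read-the-archive approach. It is in-memory only and does not use Redis; the Feed class, catch_up, and the per-event sequence numbers used for de-duplication are assumptions of the sketch.

```python
class Feed:
    """Toy feed that both archives every event and pushes it to live subscribers."""

    def __init__(self):
        self.archive = []       # plays the role of the stored history
        self.subscribers = []   # per-client buffers of live events

    def publish(self, payload):
        event = (len(self.archive), payload)  # (sequence number, payload)
        self.archive.append(event)
        for buffer in self.subscribers:
            buffer.append(event)

    def subscribe(self):
        buffer = []
        self.subscribers.append(buffer)
        return buffer


def catch_up(feed, live_buffer):
    """Merge the archive with buffered live events, dropping duplicates by sequence."""
    seen = set()
    merged = []
    for seq, payload in list(feed.archive) + list(live_buffer):
        if seq not in seen:
            seen.add(seq)
            merged.append(payload)
    return merged
```

Subscribing before reading the archive is what closes the gap: an event published in between shows up in both places and is simply dropped the second time.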
Stumbled on this during some research. I know it is old but I wanted to add that with the Redis Streams data structure it is not overly complex to implement persistent messaging.
The publisher would publish messages to a Stream, and a subscriber would just get the latest message if that is all it cared about. You can also create consumer groups, so that each message is delivered to only one consumer in the group, and then acknowledge messages to avoid duplicate processing. This is good when you want a message to be handled only once and need a way to confirm that.
I ended up creating a Node.js app for this sort of purpose. In my case, user data that I wanted to store was published to the Redis server. I subscribed to the Redis channel with a Node.js app and then saved the details to a database; I've played around with MySQL and MongoDB so far. Let me know if this is of any interest and I'll paste some code. There are some similarities with trying to store a publish history.
Cheers