Consume objects from S3 balancing between Mule servers - amazon-s3

Scenario is this:
S3 Bucket full of csv files with hundreds of formatted lines each.
N number of Mule servers. Clustered or not. Both options available.
One unique Mule flow installed in all mule server.
Mule flow behavior is simple. Polls S3 to lazy fetch available files, retrieve each single file contents, transform csv lines into sql statements and insert in DB.
Problems:
All Flows from different Mule server successfully polls s3, retrieves files, process them, and insert in DB. So files and registries are processed several times.
Wish List:
load balance is done between all active servers.
flows installed in different mule servers are equal (we don't modify flow to get different files)
files and registries inside them are not processed twice
Failed Approach:
We tried a processed/non processed mechanism common to all mule servers, in clustered mode. We used Mule's 3.5 Object Store to keep a list of the files that has been processed, visible to all servers. Problem here is, we are not balancing, all workload its on one servers, rest are idle almost all time.
Questions:
Which could be best architecture design is we want load balancing?
Maybe we need an specific mule app to do s3 file download, and let this
app to divide equally the work load between the Mule servers?
Here is an schema of scenario:

Configure your S3 bucket to push events to a SQS queue (see here), and have your mule servers pull events from that queue, instead of polling S3. This way, each event will be pulled by one worker only.
It works as follows: In each worker, you need to repeatedly call ReceiveMessage() to get the next message in the queue. Once a worker gets a message, that message becomes invisible to other workers for a certain amount of time (which you can control by setVisibilityTimeout()). After a worker processes a message, it should call deleteMessage() to remove it completely from the queue. In case of failure in the worker, deleteMessage() is not called, and so after the visibility timeout period, another worker will pick up that message.
In other words, the Queue in SQS doesn't deal with distributing the work. The workers pull messages from the queue when they are ready, and this is what creates the load balancing.

Related

Replaying Kafka events stored in S3

I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (i.e. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/Gb/mo. or less) after seven days using the S3 Sink Connector and eventually continually move them down the chain of S3 storage levels.
However, I don't understand that if I need to replay a topic from the beginning to restore a service, how would Kafka get that data back on demand from S3? I know I can utilize a source connector, but it seems that is only for setting up a new topic. Not for pulling data back from an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back into. But you may want to refer the storage configuration properties regarding topics.dir and topic relationship.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
Keep in mind, for your recovery payment calculations that reads from different tiers of S3 cost more and more.
You may also want to follow the developments of Kafka native tiered storage support (or similarly, look at Apache Pulsar as an alternative)

Multiple service instances using Hangfire (shared tasks/objects), is it possible?

I need to run multiple instances of the same service, with the same database, for redundancy reason.
I found some question about "Hangfire multiple instances" but for a differenct purpose then mine: usually about running multiple instances for different tasks on the same database, or similar to this.
I need to know if there are problems of concurrency when 2 or more instances of Hangfire use the same Database (we want to use MongoDB) and if this is the solution to make the service resilient.
The goal is to have instance that take care of all the jobs when another instance goes down.
Any suggestion wellcome for covering this scenario.
In our environment, we have a replica set used by about 10 Hangfire servers. If there are multiple Hangfire servers servicing the same queue, it means they will share the load and whichever Hangfire server checks the queue first, picks up the job and continues. If you remove all but 1 server, the jobs will continue (as long as there are enough workers otherwise they will remain queued until a worker is available).
To answer your question, yes, you can have 2 or more Hangfire servers using the same MongoDB. MongoDB provides multi-threading support so its safe to have various servers accessing the same database backend. If you have two servers, both will be active and if one instance goes off line, other instance (based on queues) will continue to process the jobs in queue.
Keep in mind, Hangfire servers processes the jobs in Specific Queues. If both servers are part of the same queue then you are load balancing the jobs among the two servers. If they are part of different queues, then you read about that scenario where each Hangfire instance processes different jobs (because they are part of different queues).
Read about configuring Job Queues here

Using NServiceBus how can I process multiple messages as a batch?

We currently have an process that involves sending a third-party an xml file containing changes that have occurred within our system.
We are moving to use NServiceBus and the changes are modelled as individual commands sent to an endpoint.
We do not want to send these changes as individual files; instead we want to batch receive a number of commands, concatenating this information into a single file.
How might one go about batching multiple commands into a single export file?
Have you looked at Sagas? Sagas allow you to model long-running business processes. So if a file has multiple parts, the Saga could begin when it gets the first part, and complete when it has all the parts it needs.
http://cdn.nservicebus.com/sagas.aspx
You can send the commands in using Bus.Send(IMessage[]messages). Note it takes an array and the messages will be packed into one queue message over the wire. On the receiving side, the handler will be invoked once per message. In your handler you should be able to just keep appending to your file. In the handler you could place logic to determine when to "roll" the file if necessary.

Redis Pub/Sub Usefulness?

I have a question that is bugging me quite heavily. What is the Redis pub/sub feature actually used for? I can only think of inter-process communication over TCP (either locally or distributed), however not much else.
Can someone please prove me wrong.
It's an easy way to plug into an event stream, generally between processes or machines. For instance, an user creates a published event. One process handles updating the database from the event, another updates user stats, another global stats, another updates the text search database, etc. They're all loosely coupled by subscribing to the channel. You can add new processes for testing updates and monitoring the system. It's a little different from a message queue in that there's no storing messages until they're processed, but Redis has other structures for those sorts of jobs.
a real use case in my experience.
Lets say you have a web application deployed on 4 different servers(nodes,virtual machines) mostly on your virtual private cloud.
The web application maintains an in memory java map for its static data cache which occasionally changes .
Now every time the data changes in your database you would need all your servers to update there own in memory caches,this is the problem.
one way is to maintain all the static data in redis or any other cache on a separate server and the cache updates based on a scheduler.But here to access the static content which occasionally changes you need a scheduler and a separate cache server like redis or memcached etc. and each server points to this external cache.
Using pubsub of redis here:
all servers subscribe to redis channel and if redis publishes the message when ever there is an update,addition,deletion of the data as a message to all of its subscribers.On receiving the message object and its type of update(ADD,REMOVED,UPDATED) each server updates its in memory static data map.

Multiple clients load distribution with redis

We are using redis as a queue for asynchronous processing of jobs. One application pushes jobs to redis (lpush), other application reads the redis queue (blpop) and processes the same. We wanted to scale the processing application so we ran two different instances on 2 different machines to process the jobs from queue, but we observed that one instance is taking 70% of the load from queue while other instances processes only a meagre amount. Is there any well defined strategy or configuration in using multiple clients with redis and proper load sharing? Or we have to maintain separate queues for the two instances and push the requests in a round robin manner?