Manage in-memory cache on multiple servers in AWS - amazon-s3

Once or twice a day some files are uploaded to an S3 bucket. On every S3 upload, I want the in-memory data of each server to be refreshed with the uploaded data.
Note that there are multiple servers running and I want to store the same data on all of them. The servers also scale with traffic (new instances come up and older ones go down, so the set of server instances is not always the same).
In short, I want to keep the data in every cache up to date.
I want to build an architecture that supports auto-scaling of the servers. I came across the AWS fan-out architecture, using SNS with multiple SQS queues that the different servers can poll.
How can the queues be handled as servers scale up and down?
Or is there any other way to handle the scenario?
PS: I'm totally new to the AWS environment.
Any references would be a great help.

To me there are a few things that you need to have to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on an EC2 of some sort (Elastic Beanstalk, Fargate, Raw EC2s with auto scaling, etc.) and that you've solved for having the application installed and configured when a scale-up event occurs.
Conceptually, the flow is: S3 bucket → SNS topic → one SQS queue per instance → application cache refresh.
The setup involves having the S3 bucket publish events (likely s3:ObjectCreated) to the SNS topic. These events will be published when an object in the bucket is created or updated.
Next:
During startup your application will pull the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 (see here for some examples). The queue would need to subscribe to the SNS topic. If the queue already exists, that's not an error. (A code sketch of these steps appears below.)
Your application would have a background thread or process that polls the SQS queue for messages.
If you get a message on the queue then that needs to tell the application to refresh the cache from S3.
When an instance is shut down, there is an event (at least from Elastic Beanstalk and the load balancers) signalling that the instance is going away. Remove the SQS queue tied to the instance at that time.
The only issue might be that a hard crash of an environment would leave orphan queues. It may be advisable to either manually clean these up or have a periodic task clean them up.
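A minimal sketch of the per-instance queue setup and polling loop described above, using boto3. The topic ARN, queue naming convention, and the refresh_cache callback are placeholder assumptions, not anything prescribed by AWS:

```python
import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:s3-upload-events"  # placeholder topic

def setup_instance_queue(instance_id):
    """Create (or reuse) a queue named after this instance and subscribe it to the SNS topic."""
    queue_url = sqs.create_queue(QueueName=f"cache-refresh-{instance_id}")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    # Allow the SNS topic to deliver messages to this queue.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": TOPIC_ARN}},
        }],
    }
    sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})
    sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)
    return queue_url

def poll_for_refresh(queue_url, refresh_cache):
    """Background loop: any message on the queue means 'reload the cache from S3'."""
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if messages:
            refresh_cache()  # e.g. re-download the relevant objects and rebuild the in-memory data
            for m in messages:
                sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
```

On shutdown (or from the periodic cleanup task for orphaned queues), sqs.delete_queue(QueueUrl=queue_url) removes the instance's queue and sns.unsubscribe removes its subscription.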

Related

Scaling Kafka Connect to handle 10K S3 buckets

I want to load data from various S3 buckets (more than 10,000 buckets and each file is around 20-50MB) into Apache Kafka. The list of buckets is dynamic - buckets are added and removed at runtime. Ideally, each bucket configuration should have its own polling interval (how often to scan for new files - at least 60 seconds, but might be much more) and priority (number of concurrent files being processed).
Note that setting up notifications from each of the S3 buckets to SQS/SNS/Lambda is not an option due to various IT policies in the organizations of each of the bucket owners.
Kafka Connect seems to be the most commonly used tool for such tasks, and its pluggable architecture will make it easier to add new sources in the future, so it fits well. Configuring each S3 bucket as its own connector will let me set a different number of tasks (which maps to priorities) and a different polling interval for each one. And building a custom Java Kafka Connect source task for my expected file format sounds reasonable.
However, the Kafka Connect code indicates that each running task is assigned its own thread for the lifetime of the task. So if I have 10K buckets, each configured with its own connector and with a single task, I will have 10K threads running in my Kafka Connect distributed worker pool. That's a lot of threads that are mostly just sleep()-ing.
What is the correct approach to scaling the number of tasks/connectors in Kafka Connect?
Kafka Connect is a distributed framework that can run in standalone or distributed mode. In distributed mode you create a Kafka Connect cluster from several commodity servers; each one hosts a Connect instance and can execute connector tasks. If you need more capacity, you can add more servers hosting Connect instances.
Reading the S3 Source Connector documentation, I did not find a way to "whitelist" or use a regex to have it read from multiple buckets...
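For reference, connectors in a distributed Connect cluster are created through the workers' REST API, so buckets can be added and removed at runtime without restarting anything. A sketch of registering one connector per bucket; the connector class and the s3.bucket / poll.interval.ms properties are hypothetical stand-ins for whatever your custom source connector defines:

```python
import requests

CONNECT_URL = "http://connect-worker:8083"  # any worker in the distributed cluster

def register_bucket_connector(bucket, poll_interval_ms, tasks):
    """Create one connector per bucket; the cluster spreads its tasks across the workers."""
    config = {
        "name": f"s3-source-{bucket}",
        "config": {
            # Hypothetical custom source connector and its assumed properties:
            "connector.class": "com.example.MyS3SourceConnector",
            "tasks.max": str(tasks),                     # maps to the per-bucket 'priority'
            "s3.bucket": bucket,                         # assumed property name
            "poll.interval.ms": str(poll_interval_ms),   # assumed property name
        },
    }
    resp = requests.post(f"{CONNECT_URL}/connectors", json=config)
    resp.raise_for_status()

# Buckets can later be removed with DELETE /connectors/<name>.
register_bucket_connector("bucket-0001", poll_interval_ms=60_000, tasks=1)
```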

Cloud and local application sync ideas

I have a situation where my central MySQL DB and file storage (S3) live in the cloud, with the DB running on an EC2 instance.
But one of my applications runs locally at my client's site on a Pi 3 device, and it needs to look up data and files from both the DB and the file store in the cloud. The application in turn generates transactional records and needs to upload them to the DB and FS (maybe at day end).
The catch is that sometimes the cloud may not be available due to connectivity issues (the site being in a remote area).
What could be the best strategies to accommodate this kind of a scenario?
Can AWS Greengrass help in here?
How to keep the lookup data (DB and FS) in sync with the local devices?
How to update/sync the transactional data generated by the local devices?
And finally, what could be the risks in such a deployment model?
Appreciate some help/suggestions.
How to keep the lookup data (DB and FS) in sync with the local devices?
You can create a Greengrass group and include all of the devices in that group. Make the devices subscribe to a topic, e.g. DB/Cloud/update. Once a device receives a message on that topic, it triggers an on-demand Lambda to download the latest information from the cloud. To make sure a device does not miss any updates while offline, you can use persistent sessions; they ensure the device receives all the missed messages when it comes back online.
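As a rough illustration (the bucket, key, and local path are made-up names), the on-demand Lambda triggered by the DB/Cloud/update message could be as simple as a handler that pulls the latest lookup files down from S3 to the Greengrass core:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-lookup-data"        # placeholder bucket holding the exported lookup data
LOCAL_PATH = "/local/lookup.db"  # placeholder local path on the Pi / Greengrass core

def handler(event, context):
    """Invoked when a message arrives on DB/Cloud/update; refresh the local copy."""
    key = event.get("key", "lookup/latest.db")  # the message can carry which object changed
    s3.download_file(BUCKET, key, LOCAL_PATH)
    return {"refreshed": key}
```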
How to update/sync the transactional data generated by the local devices?
You may try with the Stream Manager. https://docs.aws.amazon.com/greengrass/latest/developerguide/stream-manager.html
Right now, it allows you to add a local Lambda to pre-process the data and sync it up with the cloud.

Clone RabbitMQ admin users, etc. on replacement server

We have a couple of crusty AWS hosts running a RabbitMQ implementation in a cluster. We need to upgrade the hardware, and therefore we developed a Chef cookbook to spawn replacement servers.
One thing that we would rather not recreate by hand is the admin users, the queues, etc.
What is the best method to get that stuff from the old hosts to the new ones? I believe it's everything that lives in the /var/lib/rabbitmq/mnesia directory.
Is it wise to copy the files from one host to another?
Is there a programmatic means to do this?
Can it be coded into our Chef cookbook?
You can definitely export and import configuration via command line: https://www.rabbitmq.com/management-cli.html
I'm not sure about admin user, though.
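Besides the CLI, the management plugin exposes the same export/import over HTTP at /api/definitions; the exported JSON covers users, vhosts, queues, exchanges, bindings, and policies. A sketch, assuming the management plugin is enabled on both the old and new clusters (hosts and credentials below are placeholders):

```python
import requests

OLD = ("http://old-host:15672", ("admin", "secret"))  # placeholder host and credentials
NEW = ("http://new-host:15672", ("admin", "secret"))

# Export the broker definitions from the old cluster...
definitions = requests.get(f"{OLD[0]}/api/definitions", auth=OLD[1]).json()

# ...and import them into the new one.
resp = requests.post(f"{NEW[0]}/api/definitions", json=definitions, auth=NEW[1])
resp.raise_for_status()
```

Note that this moves configuration only; the messages sitting in the queues are not included.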
If you create new RabbitMQ nodes on your new hardware and join them to the existing cluster, you will get all the users on the new nodes. This is easy to try:
run a docker container with a RabbitMQ image (with the management plugin) and create a user
run another container and add that node to the cluster of the first one
kill RabbitMQ on the first one, or delete the docker container, and you will see that you still have the newly created user on the 2nd (but now master) node
I wrote docker since it's faster to create a cluster this way, but if you already have a cluster you could use it for testing if you prefer.
For the queues and exchanges, I don't want to quote almost everything found in the RabbitMQ doc page on high availability, but I will just say that you have to pay attention to the following:
exclusive queues, because they are gone once the client connection is gone
queue mirroring (if you have any set up; if not, it would be wise to consider it, if not outright necessary)
I would do the migration gradually, waiting for the queues to be emptied and then killing off the nodes on the old hardware. It may be doable in a big-bang fashion, but that seems riskier. If you have a running system, then set up queue mirroring and try to find an appropriate moment to do a manual sync - but be careful, this has a huge impact on broker performance.
Additionally there is the Shovel plugin (I have to point out that I have not used or even explored it), but that may be another way to go since (quoting from the link):
In essence, a shovel is a simple pump. Each shovel:
connects to the source broker and the destination broker, consumes messages from the queue, re-publishes each message to the destination broker (using, by default, the original exchange name and routing_key).
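If you go the shovel route, a dynamic shovel can be declared per queue through the management HTTP API's parameters endpoint. The sketch below uses the classic src/dest parameter names from the dynamic-shovel docs; treat it as a starting point to verify against your RabbitMQ version, with hosts and credentials as placeholders:

```python
import requests

MGMT = "http://old-host:15672"  # management API of the old cluster (placeholder)
AUTH = ("admin", "secret")      # placeholder credentials

def shovel_queue(queue):
    """Declare a dynamic shovel that drains `queue` on the old broker into the new one."""
    value = {
        "src-uri": "amqp://old-host",
        "src-queue": queue,
        "dest-uri": "amqp://new-host",
        "dest-queue": queue,
    }
    # PUT /api/parameters/shovel/<vhost>/<name> (the default vhost '/' is URL-encoded as %2F)
    resp = requests.put(
        f"{MGMT}/api/parameters/shovel/%2F/move-{queue}",
        json={"value": value},
        auth=AUTH,
    )
    resp.raise_for_status()

shovel_queue("orders")
```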

Does RabbitMQ contain functionality to deal with offline target nodes

Being new to RabbitMQ, I was wondering how to deal with an offline target node.
As an example this scenario:
1 log recording application that stores logs to some persistent storage
N log publishing applications that want their logs to be written to the persistent storage via the log recording server.
There would be two options:
Each publishing application publishes its log messages to its local RabbitMQ instance, and the log recording server must subscribe to each of these.
The log recording application has its own local RabbitMQ instance to which each log publishing application delivers its messages.
Option 1 would require me to reconfigure/recode/notify the recording application each time a new application appears or moves. Therefore I would think Option 2 is the right one: each new publishing application simply writes to the RabbitMQ node of the recording application.
The only thing I am struggling with is how to deal with a situation in which the Node of the recording application is down. Do I need to build my own system to store the messages until it's back online or can I use some functionality of RabbitMQ to deal with that? I.e. could the local RabbitMQ of each of the publishing applications just receive the messages and forward them to the recording application RabbitMQ as soon as it's back online?
I found something about the Federation plugin but couldn't understand whether that's the solution. Maybe I need something different, or maybe I have to write my own local queueing system (which I hope I don't have to) to queue messages when the target node is offline.
Any links to architectural examples or solutions are more than welcome.
BTW: https://groups.google.com/forum/#!topic/easynetq/nILIKSjxyMg states that you shouldn't be installing a RabbitMQ node for each application, so maybe I should resort to something like MSMQ or ZeroMQ (?)
From experience in what sounds like a similar situation, I would suggest using something other than a queue to store the messages locally, when offline.
Years ago, I built a system that had to work offline - no network connection at all - and then had to push messages through a message queue to the central server, when the laptop was brought back to the office.
I solved this by using a local database (sqlite at the time) to store my messages when the message queue was not available.
You should do something similar. Use a local database or even a plain text file or CSV file to store your messages when RabbitMQ is offline. When it reconnects, read the messages from your local file system and send them through RabbitMQ.
This is a good strategy to use, even if you do not expect RabbitMQ to go offline. Frankly, it will go offline at some point and you will have to deal with it. You should be prepared for that situation, and having a local store for your messages will help that.
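A minimal sketch of that store-and-forward idea in Python with pika and sqlite3; the host, queue name, and outbox file are placeholders:

```python
import sqlite3
import pika

DB = sqlite3.connect("outbox.db")
DB.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, body TEXT)")

def publish(body):
    """Try RabbitMQ first; fall back to the local outbox if the broker is unreachable."""
    try:
        conn = pika.BlockingConnection(pika.ConnectionParameters("recording-host"))
        ch = conn.channel()
        ch.basic_publish(exchange="", routing_key="logs", body=body)
        conn.close()
    except pika.exceptions.AMQPConnectionError:
        DB.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        DB.commit()

def flush_outbox():
    """Call periodically: replays stored messages once the broker is reachable again."""
    rows = DB.execute("SELECT id, body FROM outbox ORDER BY id").fetchall()
    if not rows:
        return
    conn = pika.BlockingConnection(pika.ConnectionParameters("recording-host"))
    ch = conn.channel()
    for row_id, body in rows:
        ch.basic_publish(exchange="", routing_key="logs", body=body)
        DB.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    DB.commit()
    conn.close()
```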
...
regarding an RMQ node per application: bad idea. This adds a ton of complexity to your system. You want as few RabbitMQ nodes as you can get away with. Meaning, 1 per system (a system being comprised of many applications) when possible... with the exception of RabbitMQ clusters for availability - but that's another line of questions and design, entirely.
...
I did an interview with Aria Stewart about designing for failure with RabbitMQ and messaging systems, and have a small excerpt where she talks about how networks fail.
The point is, the network or RabbitMQ or something will fail and you will need a solution like a local datastore so that you can recover when RabbitMQ comes back online.

Where are unique ReceiveFrom addresses really necessary on MassTransit with RabbitMQ?

Background
My group are complete noobs with MassTransit and messaging in general. I understand the simple demos found online, but I'm confused on how to set things up for non-trivial scenarios. (many producers, many consumers, with consumers communicating back to producers)
We currently make 3rd party web service calls directly from web code via synchronous calls. Some of them are notoriously slow and unreliable to the point of browser timeouts and YSODs that aren't directly our code's fault. We want to replace these sync calls with messages and eventual consistency for retries and poison queue.
We also want to replace various scheduled/batch tasks with messaging to get closer to real time processing instead of waiting for next batch to run.
Our website runs on a farm of 6 IIS servers behind a hardware load balancer. There are 2 additional "application" servers that run the scheduled tasks. I figure we will put our new worker services on the app servers or maybe even all 8 servers.
Questions
So... The "common gotchas" section of the MT docs says that each application needs its own address. My question is around what exactly counts as an application in this case.
I have 6 web servers running the website. Does each of these need a unique address, or can they all just be "rabbitmq://localhost/MyApp/Website"? What if IIS is configured for multiple worker processes? Does each of those also need a different Rabbit address?
The same question goes for my 2 application servers. If I'm running the same worker on both boxes, does it need different addresses? Some of what I've read says competing consumers should share an address, but that "event" type messages need different addresses if they are to be delivered to everyone.
What if you need both event (broadcast) and command (consumed once) messages sent to a worker cluster? (Multiple instances of the same workers to handle more load.)
What if I have consumers hosted in the web application directly? (I'm not sure this is a good idea to start with.)
What about request/response messages? I assume the responses should go back to the originating web server. Otherwise the MT request call will never unblock or at best timeout.
Each instance of an IServiceBus needs its own ReceiveFrom address. And yeah, if there are multiple worker processes, each should have its own queue. You can use a temporary queue for this in web apps, though.
For competing consumers, each process/IServiceBus that is one of the consumers should be an exact copy. If there's an event that doesn't need to be competing, then it needs to have its own process.
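If it helps to see the underlying RabbitMQ topology that this maps to, here is a plain pika sketch (not MassTransit; all names are made up): commands that should be consumed once go to one shared, named queue that every worker instance reads from, while events that every instance must see go through a fanout exchange with one queue per instance.

```python
import socket
import pika

def handle_command(channel, method, properties, body):
    # Processed by exactly one instance (competing consumers on the shared queue).
    channel.basic_ack(delivery_tag=method.delivery_tag)

def handle_event(channel, method, properties, body):
    # Every instance gets its own copy via its own queue bound to the fanout exchange.
    channel.basic_ack(delivery_tag=method.delivery_tag)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Commands: one shared, well-known queue that all worker instances consume from.
ch.queue_declare(queue="myapp.commands", durable=True)
ch.basic_consume(queue="myapp.commands", on_message_callback=handle_command)

# Events: a fanout exchange; each instance binds its own queue so everyone receives a copy.
ch.exchange_declare(exchange="myapp.events", exchange_type="fanout")
instance_queue = f"myapp.events.{socket.gethostname()}"
ch.queue_declare(queue=instance_queue, auto_delete=True)
ch.queue_bind(queue=instance_queue, exchange="myapp.events")
ch.basic_consume(queue=instance_queue, on_message_callback=handle_event)

ch.start_consuming()
```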