Best Practice for Restoring Both S3 and DynamoDB Together

One of the things I see becoming more of a problem in microservice architectures is disaster recovery. For instance, a common pattern is to store large data objects such as multimedia in S3, whilst JSON data goes in DynamoDB. But what happens when a hacker comes along and manages to delete a whole bunch of data from your DynamoDB tables?
You also need to make sure your S3 bucket is restored to the same state it was in at that time, but are there elegant ways of doing this? The concern is that it will be difficult to guarantee that the S3 backup and the DynamoDB backup are in sync.

I am not aware of a solution that does a genuinely synchronised backup and restore across services. However, you could use native DynamoDB point-in-time recovery and the third-party s3-pit-restore library to restore both services to a common point in time.
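As a rough illustration of that approach, here is a minimal sketch with boto3; the table names and restore timestamp are hypothetical, and it assumes point-in-time recovery is already enabled on the table and versioning on the S3 bucket.

    import boto3
    from datetime import datetime, timezone

    # The common point in time both restores should target (hypothetical).
    restore_point = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)

    # DynamoDB: restore the table to that timestamp under a new name.
    dynamodb = boto3.client("dynamodb")
    dynamodb.restore_table_to_point_in_time(
        SourceTableName="orders",
        TargetTableName="orders-restored",
        RestoreDateTime=restore_point,
    )

    # S3: run the s3-pit-restore tool separately against the versioned bucket,
    # passing the same timestamp, so both services end up at the same point
    # in time (see that project's README for its exact command-line flags).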

Related

Using Redis as main database in production

I have used Redis before in many projects, usually as a temporary cache or to share small pieces of data between different subsystems; you get the idea. But now I'm building a service that mainly revolves around the concept of a key-value store and needs super-fast data retrieval, so I'm using Redis. Because of this, all my data would be in Redis (it's not just a temporary cache or something of that sort). So I'm wondering: is this OK, or should I take backups to another more "stable" database like MongoDB or MySQL?
Note that I'm not talking about the usual periodic backups. I would do that regardless of what database I end up using.

S3 or EBS for storing data in flat files

I have flat files in which I store data and retrieve it instead of storing it in a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-go model whereas with EBS you pay for the volume you provision?
Please advise: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
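To make that put/get/delete workflow concrete, here is a minimal boto3 sketch (the bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Store an object, read it back, then remove it.
    s3.put_object(Bucket="my-bucket", Key="data/records.csv", Body=b"id,value\n1,42\n")
    body = s3.get_object(Bucket="my-bucket", Key="data/records.csv")["Body"].read()
    s3.delete_object(Bucket="my-bucket", Key="data/records.csv")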
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you can access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely be superior performance-wise to using S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use both S3 and EBS for this option. If you want reduced latency and your file sizes are larger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is the better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS.
Pros to EBS:
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.
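As a small, hypothetical example of what that looks like with boto3 (table name, key schema, and attributes are all made up; the table is assumed to have customer_id as partition key and order_id as sort key):

    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("orders")

    # Insert a row, then fetch every row for one customer.
    table.put_item(Item={"customer_id": "c-42", "order_id": "o-1001", "total": 2599})
    response = table.query(KeyConditionExpression=Key("customer_id").eq("c-42"))
    for item in response["Items"]:
        print(item)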

DynamoDB + S3 + CloudSearch + Redis

I'm currently creating a scheme for my application and I'm wondering if my thinking is right.
Example: e-commerce site
In DynamoDB, I would put products (product_id, metadata link to S3).
S3 I would use for storing the Search Data Format (SDF/JSON) documents
(product name, product description, price, etc.).
Amazon CloudSearch would be used to index the documents in S3 and to be able to search them.
Redis would be used to cache results.
Is my scheme right? Can S3 be a good "database"?
Is DynamoDB here even needed?
If you are thinking that S3 would just be the source of record for your products and you are not expecting heavy reads/writes, then it COULD work, but you have to recognize that it will be far, far slower than using a real database. Not just 1-2x slower but MANY orders of magnitude slower. We use S3 for storing audit data for realtime data stored in Postgres - it works a charm, but this is data that is written once and read rarely. Retrieval times when it does have to fetch audit records are > 50ms. This type of speed is usually not acceptable when you need to manipulate multiple records at one time.
If you are going to be using DynamoDB anyway, why not just use that to store what you'd be storing in S3? Trying to adhere to the concept of keeping it simple, I would use the following stack:
DynamoDB to be the system of record and to do some searches
CloudSearch for more flexible searching than what DynamoDB can provide
S3 for static files (product images, etc.)
And again, to keep things simple, skip Redis for caching if you are already using DynamoDB and don't plan on using any of Redis' specialized data types - i.e., your caching will be nothing more than keys to strings, etc. Use Redis if you plan on taking advantage of its other data types or if you want to have a cache closer to your app - i.e., you plan on running Redis on the web server.
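For context, the keys-to-strings caching being described is just a cache-aside lookup, something like this redis-py sketch (the key format and the load_product callback are made up):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def get_product(product_id, load_product):
        # load_product is whatever function fetches the item from DynamoDB.
        key = f"product:{product_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        product = load_product(product_id)
        r.setex(key, 300, json.dumps(product))  # cache for 5 minutes
        return product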
DynamoDB is suited to write-heavy data. If your application does not require heavy writes over product_id and metadata, I think RDS/MySQL is a better fit.
When designing an application, you really should keep things as simple as possible from the beginning. It will always get worse with time :)
S3 is not a good DB. It has not been designed for this and is too slow. It's for file storage only. If you want to stick with DynamoDB, you should put all your products info in it, including the metadata.
CloudSearch may be a good option. You can also build your own "indexes" on top of DynamoDB. It requires more design and programming but might be worth considering. Here is a link to an excellent blog post on this matter: http://blog.coredumped.org/2012/01/amazon-dynamodb.html.
So,
Is DynamoDB even needed: yes, or RDS, Mongo, ... any real DB depending on your needs.
Is S3 a good DB: I don't think so.
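On the do-it-yourself "indexes" point above, the usual approach is to mirror the attribute you want to query into a second table keyed on that attribute. A rough sketch with hypothetical table names and keys:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    products = dynamodb.Table("products")                  # main table, partition key: product_id
    by_category = dynamodb.Table("products_by_category")   # "index" table, keys: category + product_id

    def put_product(item):
        # Write the product, then mirror the attribute you want to query on
        # into the index table; querying by_category on "category" then gives
        # you the matching product_ids. A real implementation would guard
        # against one of the two writes failing (e.g. with a transaction).
        products.put_item(Item=item)
        by_category.put_item(Item={
            "category": item["category"],
            "product_id": item["product_id"],
        })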

AWS S3 standard vs. reduced redundancy storage?

Does anyone know of any real-world analysis of data loss using these two AWS S3 storage options? I know from the AWS docs (via Quora) that one is 99.999999999% guaranteed and the other is only 99.99% guaranteed, but I'm looking for data from a non-AWS source.
Anecdotes or something more thorough would both be great. I apologize if this isn't the right SE site for this question. Feel free to suggest a place to migrate it.
I guess it depends on the data you're storing whether you really need the 99.999999999% level of durability …
If you keep copies of your data locally and are just using S3 as a convenient place to store data that is actively being accessed by services within the AWS infrastructure, RRS might be the right choice for you :)
In my case, I keep fresh files at the normal durability level until I've created a local backup, and then move them to RRS, which saves quite a bit of money.

S3: Duplicate buckets

What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.
One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).
Cool, I may look into writing a script to host on EC2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.
If you're worried about deletion, you should probably look at S3's new Versioning feature.
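Enabling versioning is a one-time bucket setting; a minimal boto3 sketch (the bucket name is a placeholder):

    import boto3

    # With versioning enabled, deletes create delete markers and overwrites
    # keep the previous versions, so objects can be recovered later.
    s3 = boto3.client("s3")
    s3.put_bucket_versioning(
        Bucket="my-bucket",
        VersioningConfiguration={"Status": "Enabled"},
    )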
I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job of backing up the data, so I don't think you necessarily need to worry about data loss, unless you're backing up for archival purposes or you want to safeguard against accidentally deleting files.
You can make an application or service that is responsible for creating two instances of AmazonS3Client, one for the source and the other for the destination; the source AmazonS3Client loops over the source bucket and streams objects in, and the destination AmazonS3Client streams them out to the destination bucket.
Note: this doesn't work for cross-account syncing, but it does work for cross-region syncing on the same account.
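Here is roughly the same two-client idea in Python/boto3 (the answer above refers to the AmazonS3Client class from the AWS SDKs; the profile, region, and bucket names below are made up, and very large objects would want a streaming copy rather than read()):

    import boto3

    # One client per account/region; credentials come from named profiles.
    src = boto3.session.Session(profile_name="source-account").client("s3")
    dst = boto3.session.Session(profile_name="backup-account",
                                region_name="eu-west-1").client("s3")

    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="source-bucket"):
        for obj in page.get("Contents", []):
            # Stream each object in from the source and out to the destination.
            body = src.get_object(Bucket="source-bucket", Key=obj["Key"])["Body"].read()
            dst.put_object(Bucket="backup-bucket", Key=obj["Key"], Body=body)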
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One example is the AWS Instance Scheduler (personally unverified): https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html