S3: Duplicate buckets - amazon-s3

What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.

One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).

Cool, I may look into writing a script to host on Ec2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.

If you're worried about deletion, you should probably look at S3's new Versioning feature.

I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job backup up the data so I don't think you necessarily need to worry about data loss, unless your back up for archival purposes, or you want to safeguard against accidentally deleting files.

You can make an application or service that responsible to create two instances of AmazonS3Client one for the source and the other for the destination, then the source AmazonS3Client start looping in the source bucket and streaming objects in, and the destination AmazonS3Client streaming them out to the destination bucket.

Note: this doesn't work for cross-account syncing, but this works for cross-region on the same account.
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One is example is the AWS instance-schedule (personally unverified) https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html

Related

Ceph Object Gateway: what is the best backup strategy?

I have a Ceph cluster managed by Rook with a single RGW store over it. We are trying to figure out the best backup strategy for this store. We are considering the following options: using rclone to backup object via an S3 interface, using s3fs-fuse (haven’t tested it yet but s3fs-fuse is known to be not reliable enough), and using NFS-Ganesha to reexport the RGW store as an NFS share.
We are going to have quite a lot of RGW users and quite a lot of buckets, so all three solutions do not scale well for us. Another possibility is to perform snapshots of RADOS pools backing the RGW store and to backup these snapshots, but the RTO will be much higher in that case. Another problem with snapshots is that it does not seem possible to perform them consistently across all RGW-backing pools. We never delete objects from the RGW store, so this problem does not seem to be that big if we start snapshotting from the metadata pool - all the data it refers to will remain in place even if we create a snapshot on the data pool a bit later. It won’t be super consistent but it should not be broken either. It’s not entirely clear how to restore single objects in a timely manner using this snapshotting scheme (to be honest, it’s not entirely clear how to restore using this scheme at all), but it seems to be worth trying.
What other options do we have? Am I missing something?
We're planning to implement Ceph in 2021.
We don't expect a large number of users and buckets, initially.
While waiting for https://tracker.ceph.com/projects/ceph/wiki/Rgw_-_Snapshots, I successfully tested this solution to address the Object Store protection by taking advantage of multisite configuration + sync policy (https://docs.ceph.com/en/latest/radosgw/multisite-sync-policy/) in the "Octopus" version.
Assuming you have all zones in the Prod site Zone Sync'd to the DRS,
create a Zone in the DRS, e.g. "backupZone", not Zone Sync'd from
or to any of the other Prod or DRS zones;
the endpoints for this backupZone are in 2 or more DRS cluster
nodes;
using (https://rclone.org/s3/) write a bash script: for each the
"bucket"s in the DRS zones, create a version enabled "bucket"-p in the backupZone
and schedule sync, e.g. twice a day, from "bucket" to "bucket"-p;
protect the access to the backupZone endpoints so that no ordinary
user (or integration) can access them, only accessible from the other nodes in the
cluster (obviously) and the server running the rclone-based script;
when there is a failure, just recover all the objects from the *-p
buckets, once again using rclone, to the original buckets or to
filesystem.
This protects from the following failures:
Infra:
Bucket or pool failure;
Object pervasive corruption;
Loss of a site
Human error:
Deletion of versions or objects;
Removal of buckets
Elimination of entire Pools
Notes:
Only the latest version of each object is sync'd to the protected
(*-p) bucket, but if the script runs several times you have the
latest states of the objects through time;
when an object is deleted in the prod bucket, rnode just flags the
object with the DeleteMarker upon sync
this does not scale!! As the number of buckets increases, the time to
sync becomes untenable

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40kb per file which makes it a not so cost-efficient solution for small files and better to use an Infrequent access storage class or to zip the content
I gave it a try and created a vault. But I couldn't run the aws glacier command. I get some SSL error which is apparently related to a Python library, wether I run it on my Mac or from some dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that? Is there something to do what I want in AWS? Or any advice to do it in a not too fastidious way? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volumes, plus per request. It is less-efficient to store many, small files in Glacier. Instead, it is recommended to create archives (eg zip files) that make fewer, larger files. This can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility, such as Cloudberry Backup, however these utilities are designed to backup from a computer to Glacier. They probably won't backup S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
Using a S3 archiving bucket did the job.
Here is how I proceeded:
First, I created a S3 bucket called mycompany-archive, with a lifecycle rule that turns the Storage class into Glacier 1 day after the file creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp relacements, and I did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end, and the lines corresponding the buckets I didn't want to copy (mainly mycompany-archive had to be removed from there), and I had what I needed to do the transfers.
That command could be executed directly, but I prefer to run such commands using the screen util, to make sure the process wouldn't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under MacOS, I ran cafeinate to make sure the computer wouldn't sleep before it's over. To run it, issued ps|grep aws to locate the process id of the command. And then caffeinate -w 31299 (the process id) to ensure my Mac wouldn't allow sleep before the process is done.
It did the job (well, it's still running), I have now a bucket containing a folder for each archived bucket. Next step will be to remove the undesired S3 buckets.
Of course this way of doing could be improved in many ways, mainly by turning everything into a fault-tolerant replayable script. In this case, I have to be pragmatic and thinking about how to improve it would take far more time for almost no gain.

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EB) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds on Glacier are not necessarily good, but since there is no online element of this process, it would seem feasible (esp since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling take a Snapshot to push the volume into S3 and decomission the volume and EC2 instance. Then when you're ready to do the analysis just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
Personally, I would probably push the data into Redshift. :-)
--
Chris
If your analysis will not be immediate then you can adopt one of the following 2 approaches
Approach 1) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon S3-> archive regularly to glacier. You can store your last X days data in Amazon S3 and use it for adhoc processing as well.
Approach 2) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store in EBS disks - Move them frequently to Amazon S3-> Analysis through EMR or other tools and store the processed results in S3/DB/MPP and move the raw files to glacier
Approach 4) if your data is structured, then Amazon EC2 crawler -> store in EBS disks and move them to Amazon RedShift and move the raw files to glacier
Additional tips:
If you can retrieve the data again(from source)then you can use ephemeral disks for your crawlers instead of EBS
Amazon has introduced Data pipeline service, check whether it fits your needs on data movement.

S3 or EBS for storing data in flat files

I have flat files in which I store data and retrieve it instead of storing to database. This is temporary and may last for couple of months.I was wondering If I should be using EBS or S3. EBS is mainly used for I/O , S3 for content delivery , but S3 is on use you go model and EBS is you have to pay for the volume purchased ?
Pls guide, which one is better ?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll get return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
To build on the good answer from willglynn. If you are interacting with the data regularly, or need more file-system-like access you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like elasticache for in-memory storage which would likely be superior performance-wise then using s3 or EBS.
Similarly, you might look at DynamoDb for document type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use both S3 and EBS for this option. If you want reduced latency and file sizes are bigger then EBS is better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That let you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient and has not been a problem so far but we want to move to a more scalable system so we're looking in to EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and to provide low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight in to how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/cloudfront/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine and if this is what you need, you should first evaluate Amazon's SimpleDB : a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazons load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2, snapshot your OS drive often and place any volatile data (e.g. a database files) on an EBS block. EC2 doesn't fail often but when it does you don't have access to the hardware, and if you have an up to date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you see if this is an appropriate solution to help:
WTF is a Super Column