How to make a sync backup (aws)? - amazon-s3

Currently I have two buckets in the same region (50 TB).
I am trying to back up all the contents from one to the other through the sync command. But the sync command just copies; it doesn't really back up.
After the sync finished I realized that the version IDs of the objects in the new bucket are not the same as in the old one. I use the version ID for some features in my application, so sync is not really an option for me.
Is there an alternative to sync that can back up an entire bucket with its metadata etc. intact, namely the version IDs?

If you don't have a region limitation, you can use cross-region replication to back up the bucket.
https://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html
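As a rough sketch, setting that up with the AWS CLI could look like the following, assuming versioning is enabled on both buckets and an IAM role with replication permissions already exists (the bucket names and role ARN are placeholders):

    # versioning is required on both source and destination for replication
    aws s3api put-bucket-versioning --bucket source-bucket --versioning-configuration Status=Enabled
    aws s3api put-bucket-versioning --bucket destination-bucket --versioning-configuration Status=Enabled
    # attach the replication rule (bucket names and role ARN below are hypothetical)
    aws s3api put-bucket-replication --bucket source-bucket --replication-configuration file://replication.json

where replication.json might look like:

    {
      "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
      "Rules": [
        {
          "Status": "Enabled",
          "Prefix": "",
          "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" }
        }
      ]
    }

Keep in mind that a replication rule only applies to objects written after it is in place, so existing objects would still need to be copied by other means.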


Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40 KB per file, which makes it not so cost-efficient for small files; for those it's better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command: I get an SSL error which is apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that approach? Is there something in AWS to do what I want? Or any advice on doing it in a not-too-tedious way? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus per request. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (e.g. zip files) so that you have fewer, larger files. This can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility such as CloudBerry Backup; however, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
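As a sketch, that lifecycle rule could be created from the AWS CLI roughly like this (the bucket name and the one-day transition are illustrative placeholders):

    # attach a lifecycle rule that transitions objects to Glacier (bucket name is hypothetical)
    aws s3api put-bucket-lifecycle-configuration --bucket my-archive-bucket --lifecycle-configuration file://lifecycle.json

with lifecycle.json along these lines:

    {
      "Rules": [
        {
          "ID": "archive-to-glacier",
          "Filter": { "Prefix": "" },
          "Status": "Enabled",
          "Transitions": [ { "Days": 1, "StorageClass": "GLACIER" } ]
        }
      ]
    }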
Using an S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that changes the storage class to Glacier 1 day after object creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end and the lines corresponding to the buckets I didn't want to copy (mainly mycompany-archive itself had to be removed from there), and I had what I needed to do the transfers.
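For illustration, with two hypothetical buckets the generated command would look something like:

    aws s3 cp --recursive s3://bucket-one s3://mycompany-archive/bucket-one && \
    aws s3 cp --recursive s3://bucket-two s3://mycompany-archive/bucket-two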
That command could be executed directly, but I prefer to run such commands using the screen util, to make sure the process wouldn't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before it's over. To do that, I issued ps | grep aws to locate the process ID of the command, and then ran caffeinate -w 31299 (the process ID) to ensure my Mac wouldn't sleep before the process is done.
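If you prefer a one-liner, and assuming there is a single matching aws process, something like this should do the same (pgrep -f matches against the full command line):

    caffeinate -w "$(pgrep -f 'aws s3 cp')"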
It did the job (well, it's still running), I have now a bucket containing a folder for each archived bucket. Next step will be to remove the undesired S3 buckets.
Of course this way of doing things could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. In this case I have to be pragmatic, and thinking about how to improve it would take far more time for almost no gain.

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds on Glacier are not necessarily good, but since there is no online element of this process, it seems feasible (especially since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3 and decommission the volume and EC2 instance. Then, when you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
Personally, I would probably push the data into Redshift. :-)
--
Chris
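A rough sketch of that snapshot approach with the AWS CLI (volume ID, snapshot ID, and availability zone are placeholders):

    # snapshot the crawl volume once crawling is done (IDs are hypothetical)
    aws ec2 create-snapshot --volume-id vol-0abc1234 --description "crawl data"
    # after the snapshot completes, release the volume (and the instance)
    aws ec2 delete-volume --volume-id vol-0abc1234
    # later, rebuild a volume from the snapshot for analysis
    aws ec2 create-volume --snapshot-id snap-0def5678 --availability-zone us-east-1a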
If your analysis will not be immediate, then you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store on EBS disks -> move them frequently to Amazon S3 -> archive regularly to Glacier. You can keep your last X days of data in Amazon S3 and use it for ad hoc processing as well.
Approach 2) Amazon EC2 crawler -> store on EBS disks -> move them frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store on EBS disks -> move them frequently to Amazon S3 -> analyze with EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured, then Amazon EC2 crawler -> store on EBS disks -> move them to Amazon Redshift, and move the raw files to Glacier.
Additional tips:
If you can retrieve the data again (from the source), then you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your needs for data movement.
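As a sketch of the recurring S3 step in approaches 1 and 3 (the local path and bucket name are placeholders), something like this could run periodically, with a lifecycle rule on the bucket handling the later transition to Glacier:

    # push the latest crawl output from the EBS volume to S3 (names are hypothetical)
    aws s3 sync /mnt/crawl-data s3://my-crawl-archive/raw/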

EC2 snapshots vs. bundled instances

It's my understanding that EC2 snapshots are incremental in nature, so snapshot B contains only the difference between itself and snapshot A. Then, if you delete snapshot A, that difference is allocated to snapshot B in Amazon S3 so that you still have a complete snapshot. This also leads me to believe that it isn't prohibitively expensive to have daily snapshots A-Z, for example; that in storage cost it is basically the same as one snapshot.
What I really want is to back up my snapshots to a bucket in Amazon S3, so that if an entire EC2 region is having some problems --ahem cough, cough-- the snapshot can be moved into another region and launched as a backup instance in a new region.
However, it seems you can only bundle an instance and then upload a bundled instance to S3, not a snapshot.
The bundle is the entire instance, correct? If this is the case, then are historical bundled instances significantly more costly in practice than snapshots?
I use an instance store AMI and store my changing data on EBS volumes using the XFS filesystem. This means I can freeze the filesystems, create a snapshot and unfreeze them.
My volumes are 1GB (although mostly empty) and the storage cost is minuscule.
I don't know how an EBS-backed AMI would work with this, but I can't see why it would be any different. Note, however, that you need to bundle an instance in order to start it. Perhaps you could just snapshot everything as a backup and only bundle when required.
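A minimal sketch of that freeze-and-snapshot sequence, assuming the XFS volume is mounted at /data and using a placeholder volume ID:

    sudo xfs_freeze -f /data    # pause writes so the filesystem is consistent
    aws ec2 create-snapshot --volume-id vol-0abc1234 --description "nightly backup"
    sudo xfs_freeze -u /data    # resume writes once the snapshot has been initiated

The snapshot captures the point-in-time state as soon as it is initiated, so the filesystem only needs to stay frozen for that one call.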

I use Amazon S3 for storage. How can I create a backup of that data?

What is the most efficient way to back up data stored in Amazon's S3 service?
Is it to copy to another bucket? What are some tools for doing this, or should I just code it myself?
Is it to copy to another service?
Or just to copy an archive to a data center? Is there an easy way to do it incrementally?
Amazon has a system for you to send them a drive, but that seems inefficient.
Copying the data to another bucket is not going to properly cover the case where Amazon loses your data. You'd need to either dump the data from S3, or write out a local copy when writing to S3, and then archive that in a separate location. The S3 export functionality is one way of dumping that data.
As with many things, it ultimately depends on the requirements for your app.
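If you do decide to keep a local copy outside of S3, a simple sketch with the AWS CLI (bucket name and local path are placeholders) would be:

    # pull an incremental local copy; sync only copies new or changed objects
    aws s3 sync s3://my-bucket /backups/my-bucket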

S3: Duplicate buckets

What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.
One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).
Cool, I may look into writing a script to host on EC2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.
If you're worried about deletion, you should probably look at S3's new Versioning feature.
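Turning versioning on is a one-liner with the AWS CLI (the bucket name is a placeholder):

    aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled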
I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job of backing up the data, so I don't think you necessarily need to worry about data loss, unless your backup is for archival purposes or you want to safeguard against accidentally deleting files.
You can build an application or service that is responsible for creating two instances of AmazonS3Client, one for the source and the other for the destination; the source AmazonS3Client loops over the source bucket and streams objects in, and the destination AmazonS3Client streams them out to the destination bucket.
Note: this doesn't work for cross-account syncing, but it does work cross-region on the same account.
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
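One way to do that splitting is to run several recursive copies in parallel, each restricted to a key prefix via --exclude/--include (the prefixes below are placeholders):

    # run each of these in its own terminal or job to parallelize by prefix
    aws s3 cp s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME --recursive --exclude "*" --include "2019*"
    aws s3 cp s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME --recursive --exclude "*" --include "2020*"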
There are a variety of ways to run this nightly. One example is the AWS Instance Scheduler (personally unverified): https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html
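Another option, if you already have a machine or EC2 instance running anyway, is a plain cron entry; a sketch, assuming the aws binary lives at /usr/local/bin/aws and credentials are available to the cron user:

    # run the sync every night at 02:00
    0 2 * * * /usr/local/bin/aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME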