EBS snapshots: exporting only delta - amazon-s3

Can I somehow export only an EBS snapshot's delta (i.e. the changes from the previous snapshot) to S3?
I'm trying to export an EC2 instance to S3, and exporting the full AMI/instance volumes every time looks expensive to me.
Anyway, if I don't want to export via an AMI, I have to take an EBS snapshot of the instance volume and then:
1. make a temporary volume from it
2. attach the temporary volume to another instance/agent and then copy its contents to an S3 bucket
What I want is to avoid step 1, i.e. somehow read the block changes directly from the internal EBS snapshot structure and push them to S3.
I've heard that EBS snapshots are already stored in S3 (but in regions invisible to users). Does Amazon really not provide any public API to copy them 1-to-1 into a user's own region, or to make them visible, without restoring the full history from the first full snapshot?
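For reference, the manual route described above might look roughly like this with the AWS CLI (all IDs, the device name, the mount point and the bucket are placeholders); note that it still copies the full volume contents, not just the delta:
# 1. Take a snapshot of the instance volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "export to S3"
# 2. Create a temporary volume from the snapshot
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
# 3. Attach the temporary volume to a helper instance/agent
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf
# 4. On the helper instance: mount the volume and copy its contents to a bucket
sudo mount /dev/xvdf /mnt/export
aws s3 sync /mnt/export s3://my-export-bucket/instance-backup/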

Related

AWS S3 Sync --force-glacier-transfer

A few days back I was experimenting with S3 & Glacier and my data was archived, so to restore it I had to use their expedited retrieval (which costs a lot). I want to move all of my content from one bucket to another bucket in the same region and same account.
When I try to sync the data it gives the following error
Completed 10.9 MiB/~10.9 MiB (30.0 KiB/s) with ~0 file(s) remaining (calculating)
warning: Skipping file s3://bucket/zzz0dllquplo1515993694.mp4. Object is of storage class GLACIER. Unable to perform copy operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 copy help for additional parameter options to ignore or force these transfers.
I am using the command below and was wondering what it would cost me in terms of dollars, because the storage class of all my files has changed from "Standard" to "Glacier", so I am forced to use the --force-glacier-transfer flag.
aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD
If you have already restored them and are still within the expiry date, you should be able to sync them without an additional restore. You get the Glacier error for all recursive commands because the API they use doesn't check whether the objects are restored. You can read about it in the ticket where they added --force-glacier-transfer:
https://github.com/aws/aws-cli/issues/1699
When using the --force-glacier-transfer flag it doesn't do another restore; it just ignores the API saying the object is in Glacier and tries anyway. It will fail if the object is not restored (it won't try to restore it).
Note that this only applies to the recursive commands (e.g. sync and cp/mv with --recursive); if you copy just one file it will work without the force flag.
Copy file of a Glacier storage class to a different bucket
You wrote: "I want to move all of my content from one bucket to another bucket in the same region same account."
If you want to copy files stored in a Glacier storage class from one bucket to another, even with the sync command, you have to restore the files first, i.e. make them available for retrieval, before you can actually copy them. The exception is when a file is stored in the "Amazon S3 Glacier Instant Retrieval" storage class; in that case you don't need to explicitly restore it.
Therefore, you have to issue a restore-object command for each of the files to initiate a restore request. Then you have to wait until the restore request completes. After that, you will be able to copy your files within the number of days you specified in the restore request.
Retrieval pricing
You also wrote: "I was wondering what would it cost me in terms of dollars".
With the command you provided, aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD, you copy the files from a Glacier storage class to the Standard storage class. In this case, you first pay for retrieval (one-off) and then you pay (monthly) for storing both copies of each file: one copy in the Glacier tier and another copy in the Standard storage class.
According to Amazon (quote),
To change the object's storage class to Amazon S3 Standard, use copy (by overwriting the existing object or copying the object into another location).
However, for a file stored in the Glacier storage class, you can only copy it to another location in S3 within the same bucket; you cannot actually retrieve the file contents unless you restore it, i.e. make it available for retrieval.
Since you have asked "what would it cost me in terms of dollars", you will have to pay according to the retrieval prices and storage prices published by Amazon.
You can check the retrieval pricing at https://aws.amazon.com/s3/glacier/pricing/
The storage prices are available at https://aws.amazon.com/s3/pricing/
The retrieval prices depend on what kind of Glacier storage class you initially selected to store the files: "S3 Glacier Instant Retrieval", "S3 Glacier Flexible Retrieval" or "S3 Glacier Deep Archive". The storage class can be modified by lifecycle rules, so to be more correct, it is the current storage class for each file that matters.
Unless you store your files in the "S3 Glacier Instant Retrieval" storage class, the cheapest option is to first restore the files (make them available for retrieval) using the "Bulk" retrieval option (restore tier), which is free for "S3 Glacier Flexible Retrieval" and very cheap for "S3 Glacier Deep Archive". Thus you can copy the files with minimal restoration costs, if any.
Since you prefer to use the command line, you can use the Perl script to make the files available for retrieval with the "Bulk" retrieval option (restore tier). Otherwise, the aws s3 sync command that you gave will use the "Standard" restore tier.
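For example, a single object can be made retrievable with the cheaper "Bulk" tier roughly like this (the bucket, key and 7-day window are placeholders):
aws s3api restore-object --bucket my-bucket --key path/to/file.mp4 --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'
# check progress: the Restore field in the output shows whether the restore is still ongoing
aws s3api head-object --bucket my-bucket --key path/to/file.mp4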
As of today, in the US East (Ohio) region, the retrieval prices are the following.
For "S3 Glacier Instant Retrieval", it costs $0.03 per GB to restore, with no other options.
For "S3 Glacier Flexible Retrieval", the "Standard" retrieval costs $0.01 per GB while "Bulk" retrieval is free.
For "S3 Glacier Deep Archive", the "Standard" retrieval costs $0.02 while "Bulk" costs $0.0025 per GB.
You will also pay for retrieval requests regardless of the data size. However, for "S3 Glacier Instant Retrieval" you won't pay for retrieval requests, and for "Bulk" retrievals the request costs are minimal (for S3 Glacier Deep Archive), if not free (S3 Glacier Flexible Retrieval).
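For example, the following shell script initiates a restore request for every object under a prefix and then checks its restore status (it uses the "Standard" tier; change Tier=Standard to Tier=Bulk for the cheaper option):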
BUCKET=my-bucket
DATE=$1
BPATH=/pathInBucket/FolderPartitioDate=$DATE
DAYS=5
# restore every object under the prefix, then check its restore status
for x in $(aws s3 ls s3://$BUCKET$BPATH --recursive | awk '{print $4}'); do
    echo "1:Restore $x"
    aws s3api --profile sriAthena restore-object --bucket "$BUCKET" --key "$x" \
        --restore-request "Days=$DAYS,GlacierJobParameters={Tier=Standard}"
    echo "2:Monitor $x"
    aws s3api head-object --bucket "$BUCKET" --key "$x"
done
https://aws.amazon.com/premiumsupport/knowledge-center/restore-s3-object-glacier-storage-class/

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
- a bucket name may not be available anymore if one day we want to restore it
- there's an indexing overhead of about 40 KB per file, which makes it a not-so-cost-efficient solution for small files; it's better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command. I get an SSL error which is apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that? Is there something in AWS to do what I want? Or any advice on doing it in a way that's not too tedious? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus per request. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (e.g. zip files) to make fewer, larger files. This can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility such as Cloudberry Backup; however, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
Using an S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that changes the storage class to Glacier 1 day after file creation.
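For reference, such a lifecycle rule could be set up from the command line roughly like this (the rule ID and file name are illustrative):
aws s3api put-bucket-lifecycle-configuration --bucket mycompany-archive --lifecycle-configuration file://lifecycle.json
where lifecycle.json moves every object to Glacier one day after creation:
{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}]
    }
  ]
}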
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end, as well as the lines corresponding to the buckets I didn't want to copy (mainly mycompany-archive had to be removed from there), and I had what I needed to do the transfers.
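The same list of commands could also be generated with a small shell loop instead of the editor step, something along these lines (the archive bucket itself is skipped):
for b in $(aws s3 ls | awk '{print $3}'); do
    [ "$b" = "mycompany-archive" ] && continue
    aws s3 cp --recursive "s3://$b" "s3://mycompany-archive/$b"
done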
That command could be executed directly, but I prefer to run such commands using the screen utility, to make sure the process won't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before it's over. To do that, I issued ps | grep aws to locate the process id of the command, and then caffeinate -w 31299 (the process id) to ensure my Mac wouldn't sleep before the process is done.
It did the job (well, it's still running); I now have a bucket containing a folder for each archived bucket. The next step will be to remove the undesired S3 buckets.
Of course this way of doing things could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. In this case, I have to be pragmatic, and thinking about how to improve it would take far more time for almost no gain.

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds to Glacier are not necessarily good, but since there is no online element of this process, it seems feasible (especially since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze/thaw, and you'll have to move the data before doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3, and decommission the volume and EC2 instance. Then, when you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
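A rough sketch of that snapshot-and-decommission flow with the AWS CLI (all IDs and the availability zone are placeholders):
# when crawling is done: snapshot the volume, then retire the volume and the instance
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "crawl data"
# (wait for the snapshot to reach the completed state before deleting the volume)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 delete-volume --volume-id vol-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
# later, when ready to analyze: recreate a volume from the snapshot and attach it
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0abcdef1234567890 --device /dev/sdf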
Personally, I would probably push the data into Redshift. :-)
--
Chris
If your analysis will not be immediate, then you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store on EBS disks -> move the data frequently to Amazon S3 -> archive regularly to Glacier. You can keep your last X days of data in Amazon S3 and use it for ad-hoc processing as well.
Approach 2) Amazon EC2 crawler -> store on EBS disks -> move the data frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store on EBS disks -> move the data frequently to Amazon S3 -> analyze with EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured: Amazon EC2 crawler -> store on EBS disks -> move the data to Amazon Redshift, and move the raw files to Glacier.
Additional tips:
If you can retrieve the data again (from the source), then you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your needs for data movement.

EC2 snapshots vs. bundled instances

It's my understanding that EC2 snapshots are incremental in nature, so snapshot B contains only the difference between itself and snapshot A. Then, if you delete snapshot A, the difference is allocated to snapshot B in Amazon S3 so that you still have a complete snapshot. This also leads me to believe that it isn't prohibitively expensive to have daily snapshots A-Z, for example; in storage cost it is basically the same as one snapshot.
What I really want is to back up my snapshots to a bucket in Amazon S3, so that if an entire EC2 region is having some problems --ahem cough, cough-- the snapshot can be moved into another region and launched as a backup instance in a new region.
However, it seems you can only bundle an instance and then upload a bundled instance to S3, not a snapshot.
The bundle is the entire instance, correct? If this is the case, then are historical bundled instances significantly more costly in practice than snapshots?
I use an instance store AMI and store my changing data on EBS volumes using the XFS filesystem. This means I can freeze the filesystems, create a snapshot, and unfreeze them.
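That freeze/snapshot/unfreeze sequence might look something like this with the current AWS CLI (the mount point and volume ID are placeholders):
# quiesce the XFS filesystem so the snapshot is consistent
sudo xfs_freeze -f /data
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "daily backup"
# unfreeze as soon as the snapshot call returns; the snapshot completes in the background
sudo xfs_freeze -u /data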
My volumes are 1GB (although mostly empty) and the storage cost is minuscule.
I don't know how an EBS-backed AMI would work with this, but I can't see why it would be any different. Note, however, that you need to bundle an instance in order to start it. Perhaps you could just snapshot everything as a backup and only bundle when required.

How to transfer an image to an Amazon EBS volume for EC2 usage?

I have a local filesystem image that I want to transfer to an Amazon EBS volume and boot as an EC2 micro instance. The instance should have the EBS volume as its root filesystem, and I will be booting the instance with the Amazon PV-GRUB "kernels".
I have used ec2-bundle-image to create a bundle from the image, and I have used ec2-upload-bundle to upload the bundle to Amazon S3. However, now that I'd like to use ec2-register to register the image for usage, I can't seem to find a way to make the uploaded bundle the EBS root image. It seems that it requires an EBS snapshot to make the root device, and I have no idea how I would convert the bundle into an EBS snapshot.
I do realize that I could probably do this by starting a "common" instance, attaching an EBS volume to it, and then just using scp or something to transfer the image directly to the EBS volume - but is this really the only way? Also, I have no desire to use EBS snapshots as such; I'd rather have none. Can I create a micro instance with just the EBS volume as root, without an EBS snapshot?
I did not find any way to do this :(
So, I launched a new instance, created a new EBS volume, attached it to the instance, and transferred the data via ssh.
Then, to be able to boot the volume, I still need to create a snapshot of it and then create an AMI that uses the snapshot - and as a result, I get another EBS volume that is created from the snapshot and becomes the running instance's root volume.
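Roughly, those steps could look like this with the current AWS CLI (IDs, sizes, devices and the image details are placeholders; for PV-GRUB you would also pass the matching --kernel-id):
# create and attach an empty volume to a running helper instance
aws ec2 create-volume --size 10 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
# copy the local filesystem image straight onto the raw device over ssh
dd if=local-image.img bs=1M | ssh ec2-user@helper-host "sudo dd of=/dev/xvdf bs=1M"
# snapshot the volume and register an AMI whose root device is that snapshot
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "root image"
aws ec2 register-image --name my-ebs-root-image --architecture x86_64 --root-device-name /dev/sda1 --block-device-mappings "DeviceName=/dev/sda1,Ebs={SnapshotId=snap-0123456789abcdef0}"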
Now, if I want to minimize expenses, I can remove the created snapshot and the original EBS volume.
NOTE: If the only copy of the EBS volume is the root volume of an instance, it may be deleted when the instance is terminated. This setting can be changed with the command-line tools - or the instance may simply be "stopped" instead of "terminated", and then a snapshot can be generated from the EBS volume. After taking a snapshot, the instance can of course be terminated.
Yes, there is no way to upload an EBS image via S3, and using an instance where you attach an additional volume is the best way. If you attach that volume after the instance is started, it will also not be deleted.
Note: do not worry too much about volume -> snapshot -> volume, as those share the same data blocks (as long as you don't modify them). The storage cost is not tripled, only about 1.1 times one volume. EBS snapshots and image creation are quite handy in that regard. Don't hesitate to use multiple snapshots. The less you "work" in a snapshot, the smaller its block usage will be later on if you start it as an AMI.