AWS S3 Sync --force-glacier-transfer

A few days back I was experimenting with S3 and Glacier, and my data got archived, so to restore it I had to use the expedited retrieval tier (which costs a lot). Now I want to move all of my content from one bucket to another bucket in the same region and the same account.
When I try to sync the data, it gives the following error:
Completed 10.9 MiB/~10.9 MiB (30.0 KiB/s) with ~0 file(s) remaining (calculatingwarning: Skipping file s3://bucket/zzz0dllquplo1515993694.mp4. Object is of storage class GLACIER. Unable to perform copy operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 copy help for additional parameter options to ignore or force these transfers.
I am using the command below and I was wondering what it would cost me in terms of dollars, because the storage class of all of my files has changed from "Standard" to "Glacier", so I am forced to use the --force-glacier-transfer flag.
aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD

If you have already restored them and you are still before the expiry date, you should be able to sync them without an additional restore. You get the Glacier error for all recursive commands because the API they use doesn't check whether the objects are restored. You can read about it in the ticket where --force-glacier-transfer was added:
https://github.com/aws/aws-cli/issues/1699
When you use the --force-glacier-transfer flag it doesn't do another restore; it just ignores the API saying the object is in Glacier and tries anyway. It will fail if the object is not restored (it won't try to restore it).
Note that this only applies to the recursive commands (e.g. sync, and cp/mv with --recursive); if you copy just one file it will work without the force flag.
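For example, you can check whether a given object has already been restored by looking at the Restore field that head-object returns, and then run the sync with the force flag. This is a minimal sketch; the bucket and key names are placeholders:
# If the restore is complete, the output contains a line like
# "Restore": "ongoing-request=\"false\", expiry-date=\"...\""
aws s3api head-object --bucket bucketname1 --key path/to/file.mp4
# Once the objects are restored (and before the expiry date), sync with the force flag
aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD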

Copying a file in a Glacier storage class to a different bucket
You wrote: "I want to move all of my content from one bucket to another bucket in the same region same account."
If you want to copy files kept in a Glacier storage class from one bucket to another bucket, even with the sync command, you have to restore the files first, i.e. make the files available for retrieval, before you can actually copy them. The exception is when a file is stored in the "Amazon S3 Glacier Instant Retrieval" storage class; in that case you don't need to explicitly restore the files.
Therefore, you have to issue the restore-object command for each of the files to initiate a restore request. Then you have to wait until the restore request completes. After that, you will be able to copy your files within the number of days that you specified in the restore request.
Retrieval pricing
You also wrote: "I was wondering what would it cost me in terms of dollars".
With the command you provided, aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD, you copy the files from Glacier to the Standard storage class. In this case, you first have to pay for retrieval (one-off) and then you will pay (monthly) for storing both copies of each file: one copy in the Glacier tier and another copy in the Standard storage class.
According to Amazon (quote),
To change the object's storage class to Amazon S3 Standard, use copy (by overwriting the existing object or copying the object into another location).
However, for a file stored in the Glacier storage class, you can only copy it to another location in S3 within the same bucket; you cannot actually retrieve the file contents unless you restore it, i.e. make it available for retrieval.
Since you have asked "what would it cost me in terms of dollars", you will have to pay according to the retrieval prices and storage prices published by Amazon.
You can check the retrieval pricing at https://aws.amazon.com/s3/glacier/pricing/
The storage prices are available at https://aws.amazon.com/s3/pricing/
The retrieval prices depend on what kind of Glacier storage class you initially selected to store the files: "S3 Glacier Instant Retrieval", "S3 Glacier Flexible Retrieval" or "S3 Glacier Deep Archive". The storage class can be modified by lifecycle rules, so to be more correct, it is the current storage class for each file that matters.
Unless you store your files in the "S3 Glacier Instant Retrieval" storage class, the cheapest option is to first restore the files (make them available for retrieval) using the "Bulk" retrieval option (restore tier), which is free for "S3 Glacier Flexible Retrieval" and very cheap for "S3 Glacier Deep Archive". That way you can copy the files with minimal restore costs, if any.
Since you prefer to use the command line, you can use the Perl script to make the files available for retrieval with the "Bulk" retrieval option (restore tier); otherwise, the aws s3 sync command that you gave will use the "Standard" restore tier.
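For example, a Bulk restore can be requested per object with restore-object. This is a minimal sketch with placeholder bucket and key names and an arbitrary 7-day retrieval window:
# Initiate a Bulk-tier restore that keeps the object retrievable for 7 days
aws s3api restore-object --bucket bucketname1 --key path/to/file.mp4 \
  --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'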
As of today, in the US East (Ohio) region, the retrieval prices are the following.
For "S3 Glacier Instant Retrieval", it costs $0.03 per GB to restore, with no other options.
For "S3 Glacier Flexible Retrieval", the "Standard" retrieval costs $0.01 per GB, while "Bulk" retrieval is free.
For "S3 Glacier Deep Archive", the "Standard" retrieval costs $0.02 per GB, while "Bulk" costs $0.0025 per GB.
You will also pay for retrieval requests regardless of the data size. However, for "S3 Glacier Instant Retrieval" you won't pay for retrieval requests; and for "Bulk" restores, the request costs are minimal (for "S3 Glacier Deep Archive"), if not free ("S3 Glacier Flexible Retrieval").
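As a rough worked example at these rates: restoring 100 GB from "S3 Glacier Flexible Retrieval" costs about 100 × $0.01 = $1.00 with the "Standard" tier and nothing with "Bulk", while 100 GB in "S3 Glacier Deep Archive" costs about $2.00 with "Standard" or $0.25 with "Bulk", plus the small per-request charges and the ongoing Standard-class storage for the new copies.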

#!/bin/bash
# Request a Standard-tier restore for every object under a prefix, then print
# head-object output so the restore status of each key can be monitored.
# Note: the simple awk split below does not handle keys that contain spaces.
BUCKET=my-bucket
DATE=$1
BPATH=/pathInBucket/FolderPartitioDate=$DATE
DAYS=5
for x in $(aws s3 ls "s3://$BUCKET$BPATH" --recursive | awk '{print $4}')
do
echo "1:Restore $x"
aws s3api --profile sriAthena restore-object --bucket "$BUCKET" --key "$x" --restore-request "Days=$DAYS,GlacierJobParameters={Tier=Standard}"
echo "2:Monitor $x"
aws s3api head-object --bucket "$BUCKET" --key "$x"
done
https://aws.amazon.com/premiumsupport/knowledge-center/restore-s3-object-glacier-storage-class/

Related

Copy Files from S3 SignedURL to GCS Signed URL

I am developing a service in which two different cloud storage providers are involved. I am trying to copy data from an S3 bucket to GCS.
To access the data I have been given signed URLs, and to upload the data to GCS I also have signed URLs available which allow me to write content to a specified storage path.
Is there a possibility to move this data "in cloud"? Downloading from S3 and uploading the content to GCS would create bandwidth problems.
I must also mention that this is an on-demand job and it only moves a small number of files. I cannot do a full bucket transfer.
Kind regards
You can use Skyplane to move data across cloud object stores. To move a single file from S3 to Google Cloud, you can use the command:
skyplane cp s3://<BUCKET>/<FILE> gcs://<BUCKET>/<FILE>
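Note that, at the time of writing, Skyplane first needs access to your AWS and GCP credentials, which is typically done with a one-time setup command (check the Skyplane documentation for the current steps):
skyplane init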

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another; some of the files are up to 2 GB in size. After using the AWS CLI's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and verifying the files that had been transferred, it became clear that the large files had lost the metadata that was present on the original files in the source bucket.
This is a "known" issue with larger files, where S3 switches to multipart upload to handle the transfer.
This multipart handling can be configured via the .aws/config file, which has been done like so:
[default]
s3 =
  multipart_threshold = 4500MB
However, when testing the transfer again, the metadata on the larger files is still not present. It is present on the smaller files, so it's clear that I'm hitting the multipart upload issue.
Given this is an s3 to s3 transfer is the local s3 configuration taken into consideration at all?
As an alternative to this is there a way to just sync the metadata now that all the files have been transferred?
I have also tried aws s3 cp, with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects are copied between the buckets. You can, though, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
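For example, a copy-in-place that rewrites the metadata can be done with the CLI. This is a sketch; the bucket, key, and metadata values are placeholders, and note that REPLACE discards any existing user metadata you don't re-specify:
# Copy the object onto itself, replacing its user metadata in the process
aws s3 cp s3://destination-bucket/folder/bigfile.dat s3://destination-bucket/folder/bigfile.dat \
  --metadata-directive REPLACE \
  --metadata key1=value1,key2=value2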
I finally managed to implement a solution for this, and took the opportunity to play around with the Serverless Framework and Step Functions.
The general flow I went with was:
The Step Function is triggered by a CloudWatch Events rule targeting S3 events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transferred using a multipart process.
The initial Task on the Step Function checks if all the required MetaData is present on the object that raised the event.
If it is present then the Step Function is finished
If it is not present then the second lambda task is fired which copies all metadata from the source object to the destination object.
This could be achieved without Step Functions, but it was a good, simple exercise to give them a go. The first 'Check Meta' task is actually redundant, as the metadata is never present if a multipart transfer was used; I was originally also triggering off PutObject and CopyObject, which is why I had the Check Meta task.

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40 KB per file, which makes it a not-so-cost-efficient solution for small files; for those it's better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command. I get some SSL error which is apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that? Is there something in AWS to do what I want? Or any advice on doing it in a not-too-tedious way? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus per request. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (e.g. zip files) so that you have fewer, larger files. This can make it harder to retrieve specific files, though.
If you are going to use Glacier directly, it is much easier to use a utility such as Cloudberry Backup; however, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
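For example, a lifecycle rule that transitions everything in a bucket to Glacier after 30 days could look roughly like this (the bucket name, rule ID and timing are placeholders):
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "archive-to-glacier",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
    }
  ]
}'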
Using a S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that transitions the storage class to Glacier 1 day after file creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end and the lines corresponding to the buckets I didn't want to copy (mainly mycompany-archive itself had to be removed), and I had what I needed to do the transfers.
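With hypothetical bucket names, the resulting chained command looked roughly like this:
aws s3 cp --recursive s3://bucket-one s3://mycompany-archive/bucket-one && \
aws s3 cp --recursive s3://bucket-two s3://mycompany-archive/bucket-two && \
aws s3 cp --recursive s3://bucket-three s3://mycompany-archive/bucket-three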
That command could be executed directly, but I prefer to run such commands using the screen utility, to make sure the process won't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before it was over. To do that, I issued ps | grep aws to locate the process id of the command, and then ran caffeinate -w 31299 (the process id) to ensure my Mac wouldn't allow sleep before the process is done.
It did the job (well, it's still running); I now have a bucket containing a folder for each archived bucket. The next step will be to remove the undesired S3 buckets.
Of course, this way of doing things could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. But in this case I have to be pragmatic, and thinking about how to improve it would take far more time for almost no gain.

How to move object from Amazon S3 to Glacier with Vault Locked enabled?

I'm looking for a solution for moving Amazon S3 objects to Glacier with Vault Lock enabled (like described here https://aws.amazon.com/blogs/aws/glacier-vault-lock/). I'd like to use Amazon built in tools for that (lifecycle management or some other) if possible.
I cannot find any instructions or options to do that. S3 seems to only allow moving objects to the Glacier storage class, but that does not provide data integrity nor defend against data loss.
I know I could do it with a program: it would download the S3 object and move it to Glacier through their respective REST APIs. This approach seems too complicated for such a simple task.
Picture it this way:
Glacier is a service of AWS.
S3 is a service of AWS.
But S3 is also a customer of the Glacier service.
When you migrate an object in S3 to the Glacier storage class, S3 stores the object in Glacier... using an AWS account that is owned by S3.
Those objects in S3 that use the GLACIER storage class aren't in "your" Glacier vaults, they're in vaults owned by S3.
This is consistent with the externally-observable evidence:
You can't see these S3 objects in vaults from the Glacier console.
You don't have to give S3 any IAM permissions to access Glacier (by contrast, you do have to give S3 permission to publish event notifications to SQS, SNS, or Lambda).
Glacier doesn't bill you for Glacier storage class objects -- S3 does.
In that light, what you are trying to accomplish is completely different. You want to store some archives in your Glacier vault, with your policy, and that content currently just "happens to be" stored in S3 at the moment.
Downloading from S3 and then uploading to Glacier is the solution.
But that does not provide data integrity nor defend against data loss.
The integrity of the payload can be assured when uploading to Glacier because the tree hash algorithm effectively prevents corrupt uploads.
When downloading from S3, unless the object is stored with SSE-C, the ETag is the MD5 hash of the stored object if a single-part upload was used, or the hex-encoded MD5 hash of the concatenated binary MD5 hashes of the parts, followed by - and the number of parts. Ideally, when uploading to S3, you'd store a better hash (e.g. sha256) in the object metadata, e.g. x-amz-meta-content-sha256.
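For example, one way to record such a hash at upload time (the file and bucket names are placeholders) is:
# Compute a SHA-256 of the local file and store it as user metadata on the object
# (S3 exposes it as x-amz-meta-content-sha256)
sha256=$(sha256sum file.bin | awk '{print $1}')
aws s3 cp file.bin s3://my-bucket/file.bin --metadata content-sha256=$sha256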
Defense against data loss -- yes, Glacier does offer more functionality here, but S3 is not entirely without capability: bucket policies with a matching DENY action will always override any conflicting ALLOW action, whether it is in the bucket policy or any other IAM policy (e.g. role, user).

Delete an S3 Bucket that has some data archived in Glacier

We have a huge bucket for which we have setup lifecycle rules to archive data to Glacier.
Now we have decided that we do not need the data in that bucket and hence want to delete all the data stored in Glacier as well as in S3.
If I delete the bucket from S3, would we incur a Glacier cost for retrieval of the data, or would the deletes be free?
The bucket has TBs of data and we definitely don't want to pay AWS thousands of dollars in retrieval costs.
You can't delete a bucket that is not empty, so you'll need to delete everything stored in the bucket, including what's stored in Glacier, first.
If everything in Glacier was migrated to the glacier storage class over 3 months ago, then you should not incur any charges.
If you don't restore the Glacier objects -- you just delete them -- then the only charge will be for anything that has been in Glacier for less than 3 months. Deleting those objects will incur the documented pro-rated charge for early deletions, which is equivalent to the charge for storing the content in Glacier for 3 months, less the charge already incurred for storing the objects in Glacier.
http://aws.amazon.com/s3/faqs/#How_am_I_charged_for_deleting_objects_from_Amazon_Glacier_that_are_less_than_3_months_old
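Once you've confirmed you don't need the data, emptying and removing the bucket from the CLI can be as simple as the following sketch (the bucket name is a placeholder; for a versioned bucket you also have to delete all object versions and delete markers first):
# Delete every object in the bucket, then remove the bucket itself
aws s3 rm s3://my-bucket --recursive
aws s3 rb s3://my-bucket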