How to move objects from Amazon S3 to Glacier with Vault Lock enabled? - amazon-s3

I'm looking for a solution for moving Amazon S3 objects to Glacier with Vault Lock enabled (like described here https://aws.amazon.com/blogs/aws/glacier-vault-lock/). I'd like to use Amazon's built-in tools for that (lifecycle management or some other) if possible.
I cannot find any instructions or options to do that. S3 seems to only allow moving objects to the Glacier storage class. But that neither provides data integrity nor defends against data loss.
I know I could do it with a program. It would download the S3 object and move it to Glacier through their respective REST APIs. This approach seems too complicated for this simple task.

Picture it this way:
Glacier is a service of AWS.
S3 is a service of AWS.
But S3 is also a customer of the Glacier service.
When you migrate an object in S3 to the Glacier storage class, S3 stores the object in Glacier... using an AWS account that is owned by S3.
Those objects in S3 that use the GLACIER storage class aren't in "your" Glacier vaults, they're in vaults owned by S3.
This is consistent with the externally-observable evidence:
You can't see these S3 objects in vaults from the Glacier console.
You don't have to give S3 any IAM permissions to access Glacier (by contrast, you do have to give S3 permission to publish event notifications to SQS, SNS, or Lambda).
Glacier doesn't bill you for Glacier storage class objects -- S3 does.
In that light, what you are trying to accomplish is completely different. You want to store some archives in your Glacier vault, with your policy, and that content just happens to be stored in S3 at the moment.
Downloading from S3 and then uploading to Glacier is the solution.
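A minimal sketch of that with the aws CLI (the bucket, object, and vault names here are placeholders for illustration):
# Download the object from S3, then upload it as an archive into your own vault
aws s3 cp s3://my-bucket/my-object ./my-object
aws glacier upload-archive --account-id - --vault-name my-vault \
    --archive-description "my-object from my-bucket" --body ./my-object
For archives larger than 4 GB you would have to use Glacier's multipart upload operations instead of a single upload-archive call.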
You wrote: "But that neither provides data integrity nor defends against data loss."
The integrity of the payload can be assured when uploading to Glacier because the tree hash algorithm effectively prevents corrupt uploads.
When downloading from S3, unless the object is stored with SSE-C, the ETag is the MD5 hash of the stored object if a single-part upload was used, or the hex-encoded MD5 hash of the concatenated binary-encoded MD5 hashes of the parts, followed by - and the number of parts, for a multipart upload. Ideally, when uploading to S3, you'd store a better hash (e.g. SHA-256) in the object metadata, e.g. x-amz-meta-content-sha256.
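For example, a sketch of recording such a hash at upload time with the aws CLI (the file and bucket names are placeholders):
# Store a SHA-256 of the content as user metadata (exposed as x-amz-meta-content-sha256),
# so integrity can be re-verified after download regardless of how the ETag was computed
aws s3 cp ./my-object s3://my-bucket/my-object \
    --metadata content-sha256=$(sha256sum ./my-object | awk '{print $1}')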
Defense against data loss -- yes, Glacier Vault Lock does offer more functionality here, but S3 is not entirely without capability: a bucket policy statement with a matching Deny will always override any conflicting Allow, whether that Allow is in the bucket policy or in any other IAM policy (e.g. a role or user policy).
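As an illustration, here is a sketch of a bucket policy statement that flatly denies object deletion (the bucket name is a placeholder; a real policy would be tailored to your protection requirements):
aws s3api put-bucket-policy --bucket my-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyObjectDeletion",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::my-bucket/*"
  }]
}'
Unlike a Vault Lock policy, though, the bucket policy itself can still be changed by a principal with sufficient permissions, which is the gap Vault Lock is designed to close.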

Related

Copy Files from S3 SignedURL to GCS Signed URL

I am developing a service in which two different cloud storage providers are involved. I am trying to copy data from an S3 bucket to GCS.
To access the data I have been given signed URLs, and to upload the data to GCS I also have signed URLs available which allow me to write content into a specified storage path.
Is there a possibility to move this data "in cloud"? Downloading from S3 and uploading the content to GCS would create bandwidth problems.
I must also mention that this is an on-demand job and it only moves a small number of files; I cannot do a full bucket transfer.
Kind regards
You can use Skyplane to move data across cloud object stores. To move a single file from S3 to Google Cloud, you can use the command:
skyplane cp s3://<BUCKET>/<FILE> gcs://<BUCKET>/<FILE>
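Note that Skyplane needs credentials for both clouds configured before it can copy; as far as I recall this is done once with:
skyplane init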

Unique resource identifiers for S3 blobs

Is there any "standard" way to uniquely identify an S3 blob using a single string?
There are multiple services that support the S3 protocol: AWS, Minio, GCS.
Usually, to access an S3 blob, you must provide endpoint (+region), bucket and key. The 's3://' URIs seem to only contain bucket and key. Is there any standard that also includes the endpoint?
At least for GCS there is no standard for how you should name the objects that are uploaded to a bucket; they just need to follow the format rules mentioned here.
Nevertheless, as mentioned in this video, there are some tricks that you can follow to get better performance for read/write operations in a bucket through object naming.
If you have a bit more time, you can also check this other video, which goes more in depth on how GCS works as well as performance and reliability tips.

Get S3 bucket free space in Ceph cluster from amazon S3 API

Is there any way to get the available free space in a Ceph cluster from the Amazon S3 API?
I need to implement automatic deletion of outdated objects from a bucket when the Ceph cluster has no space to store new objects. I know there are ways to calculate the used space in a bucket, but that is the logical data size and I can't compare it to the raw size of the cluster disks.
From the S3 perspective you can't check free space.
Ceph has implemented an expiration mechanism (S3 lifecycle expiration), which will delete outdated objects from your object storage.
Check the doc: http://docs.ceph.com/docs/master/radosgw/s3/#features-support
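For reference, a sketch of setting an age-based expiration rule through the S3 API against an RGW endpoint (the endpoint URL, bucket name, and 30-day window are placeholders):
aws s3api put-bucket-lifecycle-configuration \
    --endpoint-url http://rgw.example.com \
    --bucket my-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "expire-old-objects",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Expiration": {"Days": 30}
      }]
    }'
This expires objects by age rather than by remaining cluster space, which is as close as the S3 API gets to what you describe.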

AWS S3 Sync --force-glacier-transfer

A few days back I was experimenting with S3 & Glacier and my data got archived, so to restore it I had to use their expedited service (which costs a lot). I want to move all of my content from one bucket to another bucket in the same region and account.
When I try to sync the data, it gives the following error:
Completed 10.9 MiB/~10.9 MiB (30.0 KiB/s) with ~0 file(s) remaining (calculatingwarning: Skipping file s3://bucket/zzz0dllquplo1515993694.mp4. Object is of storage class GLACIER. Unable to perform copy operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 copy help for additional parameter options to ignore or force these transfers.
I am using the command below and I was wondering what it would cost me in terms of dollars, because all of my files' storage class has changed from "Standard" to "Glacier", so I am forced to use the --force-glacier-transfer flag:
aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD
If you restored them and are still before the expiry-date, you should be able to sync them without an additional restore. You get the Glacier error for all recursive commands because the API they use doesn't check whether the objects are restored. You can read about it in the ticket where they added --force-glacier-transfer:
https://github.com/aws/aws-cli/issues/1699
When using the --force-glacier-transfer flag it doesn't do another restore; it just ignores the API saying the object is in Glacier and tries anyway. It will fail if the object is not restored (it won't try to restore it).
Note that this only applies to the recursive commands (e.g. sync, and cp/mv with --recursive); if you just copy one file it will work without the force flag.
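For example, a single-object copy like the following should work without the flag, assuming the object has already been restored (bucket names from your command, key from the warning above):
aws s3 cp s3://bucketname1/zzz0dllquplo1515993694.mp4 \
    s3://bucketname2/zzz0dllquplo1515993694.mp4 --storage-class STANDARD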
Copying a file in a Glacier storage class to a different bucket
You wrote: "I want to move all of my content from one bucket to another bucket in the same region and account."
If you want to copy files kept in a Glacier storage class from one bucket to another bucket, even with the sync command, you have to restore the files first, i.e. make the files available for retrieval, before you can actually copy them. The exception is when a file is stored in the "Amazon S3 Glacier Instant Retrieval" storage class; in this case, you don't need to explicitly restore the files.
Therefore, you have to issue the restore-object command for each of the files to initiate a restore request. Then you have to wait until the restore request completes. After that, you will be able to copy your files within the number of days that you have specified during the restore request.
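You can check whether the restore of an object has completed with head-object, for example (using the bucket and the file from your example):
# While the restore is in progress, the Restore field shows ongoing-request="true";
# once it completes it changes to ongoing-request="false" with an expiry-date.
aws s3api head-object --bucket bucketname1 --key zzz0dllquplo1515993694.mp4 --query Restore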
Retrieval pricing
You also wrote: "I was wondering what it would cost me in terms of dollars".
With the command you provided, aws s3 sync s3://bucketname1 s3://bucketname2 --force-glacier-transfer --storage-class STANDARD, you copy the files from Glacier to the Standard storage class. In this case, you have to first pay for retrieval (one-off) and then you will pay (monthly) for storing both copies of the file: one copy in the Glacier tier and another copy in the Standard storage class.
According to Amazon (quote),
To change the object's storage class to Amazon S3 Standard, use copy (by overwriting the existing object or copying the object into another location).
However, for a file stored in the Glacier storage class, you can only copy it to another location in S3 within the same bucket; you cannot actually retrieve the file contents unless you restore it, i.e. make it available for retrieval.
Since you have asked what it would cost you in terms of dollars: you will have to pay according to the retrieval prices and storage prices published by Amazon.
You can check the retrieval pricing at https://aws.amazon.com/s3/glacier/pricing/
The storage prices are available at https://aws.amazon.com/s3/pricing/
The retrieval prices depend on what kind of Glacier storage class you initially selected to store the files: "S3 Glacier Instant Retrieval", "S3 Glacier Flexible Retrieval" or "S3 Glacier Deep Archive". The storage class can be modified by lifecycle rules, so to be more correct, it is the current storage class for each file that matters.
Unless you store your files in the "S3 Glacier Instant Retrieval" storage class, the cheapest option is to first restore the files (make them available for retrieval) using the "Bulk" retrieval option (restore tier), which is free for "S3 Glacier Flexible Retrieval" and very cheap for "S3 Glacier Deep Archive". Thus you can copy the files with minimal restoration costs, if any.
Since you prefer to use the command line, you can use a script to make the files available for retrieval with the "Bulk" retrieval option (restore tier); otherwise, the aws s3 sync command that you gave will use the "Standard" restore tier.
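A minimal sketch of initiating a "Bulk" restore for a single object with the aws CLI (the bucket, key, and 7-day availability window are placeholders):
aws s3api restore-object --bucket my-bucket --key path/to/my-object \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}}'
You would repeat this for each object (for instance by looping over aws s3 ls output, as the script further below does for the "Standard" tier) and run the sync only after the restores have completed.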
As of today, in the Ohio US region, the prices for retrieval are the following.
For "S3 Glacier Instant Retrieval", at the time of writing, it costs $0.03 per GB to restore, with no other options.
For "S3 Glacier Flexible Retrieval", the "Standard" retrieval costs $0.01 per GB while "Bulk" retrieval is free.
For "S3 Glacier Deep Archive", the "Standard" retrieval costs $0.02 while "Bulk" costs $0.0025 per GB.
You will also pay for retrieval requests regardless of the data size. However, for "S3 Glacier Instant Retrieval" you won't pay for retrieval requests; and for "Bulk", retrieval requests costs are minimal (for S3 Glacier Deep Archive), if not free (S3 Glacier Flexible Retrieval).
#!/bin/bash
# Initiate a Standard-tier restore for every object under the given prefix, then print its status.
# Note: the awk-based listing assumes object keys contain no spaces.
BUCKET=my-bucket
DATE=$1
BPATH="/pathInBucket/FolderPartitioDate=$DATE"
DAYS=5
for x in $(aws s3 ls "s3://$BUCKET$BPATH" --recursive | awk '{print $4}'); do
    echo "1:Restore $x"
    aws s3api --profile sriAthena restore-object --bucket "$BUCKET" --key "$x" \
        --restore-request "Days=$DAYS,GlacierJobParameters={Tier=Standard}"
    echo "2:Monitor $x"
    aws s3api --profile sriAthena head-object --bucket "$BUCKET" --key "$x"
done
https://aws.amazon.com/premiumsupport/knowledge-center/restore-s3-object-glacier-storage-class/

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40 KB per file, which makes it not so cost-efficient for small files; for those it's better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command: I get some SSL error which is apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that? Is there something in AWS to do what I want? Or any advice on doing it in a way that's not too tedious? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus a per-request fee. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (eg zip files) to get fewer, larger files. This can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility such as CloudBerry Backup; however, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
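A sketch of such a lifecycle policy applied with the aws CLI (the bucket name and the 1-day transition delay are placeholders):
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "archive-to-glacier",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}]
      }]
    }'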
Using an S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that transitions the storage class to Glacier 1 day after file creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the list of buckets: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end and the lines corresponding to the buckets I didn't want to copy (mainly mycompany-archive had to be removed from there), and I had what I needed to do the transfers.
That command could be executed directly, but I prefer to run such commands using the screen utility, to make sure the process wouldn't stop if I closed my session by accident.
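An equivalent way to generate the same list of commands without an editor, as a one-liner sketch (it assumes the default aws s3 ls output of date, time, and bucket name):
aws s3 ls | awk '{print "aws s3 cp --recursive s3://" $3 " s3://mycompany-archive/" $3 " && \\"}'
The trailing && \ and the mycompany-archive line still need to be removed by hand, exactly as described above.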
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before it's over: I issued ps | grep aws to locate the process id of the command, and then ran caffeinate -w 31299 (the process id).
It did the job (well, it's still running); I now have a bucket containing a folder for each archived bucket. The next step will be to remove the undesired S3 buckets.
Of course this way of doing things could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. But in this case I had to be pragmatic: thinking about how to improve it would take far more time for almost no gain.