aws s3 sync cli ignoring multipart upload config when syncing between buckets - amazon-s3

I'm trying to sync a large number of files from one bucket to another; some of the files are up to 2GB in size. I used the AWS CLI's s3 sync command like so:
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
On verifying the files that had been transferred, it became clear that the large files had lost the metadata that was present on the original objects in the source bucket.
This is a "known" issue with larger files, where S3 switches to multipart upload to handle the transfer.
This multipart handling can be configured via the .aws/config file, which has been done like so:
[default]
s3 =
  multipart_threshold = 4500MB
However, when testing the transfer again, the metadata on the larger files is still not present. It is present on the smaller files, so it's clear that I'm hitting the multipart upload issue.
Given this is an S3-to-S3 transfer, is the local s3 configuration taken into consideration at all?
As an alternative, is there a way to just sync the metadata now that all the files have been transferred?
I have also tried aws s3 cp, with no luck either.

You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will copy between the buckets. You can, however, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
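If it helps, here is a minimal sketch of that compare-and-fix approach using boto3. The bucket names and prefix are placeholders taken from the question, and it assumes the metadata in question is plain user metadata; adjust to your setup. Note that CopyObject (copying an object onto itself with MetadataDirective='REPLACE') only works for objects up to 5GB, which covers the 2GB files mentioned above.

import boto3

SOURCE_BUCKET = "bucket"                 # placeholder, from the question
DEST_BUCKET = "destination-bucket"       # placeholder
PREFIX = "folder/folder"                 # placeholder

s3 = boto3.client("s3")

def fix_missing_metadata():
    """Find destination objects whose user metadata differs from the source
    object and rewrite it by copying the object onto itself."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=DEST_BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            src_meta = s3.head_object(Bucket=SOURCE_BUCKET, Key=key)["Metadata"]
            dst_meta = s3.head_object(Bucket=DEST_BUCKET, Key=key)["Metadata"]
            if src_meta != dst_meta:
                # Copy the object onto itself, replacing its metadata.
                s3.copy_object(
                    Bucket=DEST_BUCKET,
                    Key=key,
                    CopySource={"Bucket": DEST_BUCKET, "Key": key},
                    Metadata=src_meta,
                    MetadataDirective="REPLACE",
                )
                print(f"fixed metadata on {key}")

if __name__ == "__main__":
    fix_missing_metadata()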

Finally managed to implement a solution for this, and took the opportunity to play around with the Serverless Framework and Step Functions.
The general flow I went with was:
Step Function triggered using a CloudWatch Event Rule targeting S3 events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transferred using a multipart process
The initial Task on the Step Function checks if all the required MetaData is present on the object that raised the event.
If it is present then the Step Function is finished
If it is not present then the second lambda task is fired which copies all metadata from the source object to the destination object.
This could be achieved without Step Functions, but it was a good, simple exercise to give them a go. The first 'Check Meta' task is actually redundant, as the metadata is never present when a multipart transfer is used; I was originally also triggering off PutObject and CopyObject as well, which is why I had the Check Meta task.
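For reference, here is a minimal sketch of what that second 'copy metadata' Lambda task could look like. The event field names assume a CloudTrail-delivered CompleteMultipartUpload event and the source bucket name is a placeholder, so check an actual sample event for your setup before relying on this.

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "bucket"  # placeholder: the bucket the objects were synced from

def lambda_handler(event, context):
    # CloudTrail-based events carry the bucket and key in requestParameters
    # (field names may vary depending on how the event reaches the function).
    params = event["detail"]["requestParameters"]
    dest_bucket = params["bucketName"]
    key = params["key"]

    # Read the user metadata from the original object...
    source_meta = s3.head_object(Bucket=SOURCE_BUCKET, Key=key)["Metadata"]

    # ...and copy the destination object onto itself, replacing its metadata.
    # CopyObject is limited to 5GB; larger objects would need a multipart copy.
    s3.copy_object(
        Bucket=dest_bucket,
        Key=key,
        CopySource={"Bucket": dest_bucket, "Key": key},
        Metadata=source_meta,
        MetadataDirective="REPLACE",
    )
    return {"status": "metadata copied", "key": key}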

Related

File copy from one S3 bucket to another S3 bucket using Lambda - timing constraint?

I need to copy large files (maybe even greater than 50 GB) from one S3 bucket to another S3 bucket (event based). I am planning to use s3.Object.copy_from to do this inside Lambda (using boto3).
I wanted to see if anyone has tried this. Will this have any performance issues for larger files (100 GB etc.) that could cause a Lambda timeout?
If yes, is there an alternate option? (I am trying to use code since I might need to do some additional logic like renaming the file, moving the source file to an archive, etc.)
Note: I am also exploring AWS S3 Replication options, but I'm looking for other solutions in parallel.
You can use the AWS S3 Replication feature.
It supports key prefix and API filtering as well.
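If you do stay with the Lambda/boto3 route from the question, a minimal sketch (bucket and key names are placeholders) of boto3's managed copy is below. It performs a server-side multipart copy for large objects, so the data never passes through the Lambda function itself; the function still has to stay alive until the copy finishes, though, so the 15-minute Lambda timeout remains a constraint for very large objects.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Placeholder bucket and key names.
copy_source = {"Bucket": "source-bucket", "Key": "big-file.dat"}

# Managed copy: above multipart_threshold, boto3 switches to a parallel
# multipart copy using server-side UploadPartCopy requests.
config = TransferConfig(multipart_threshold=256 * 1024 * 1024,
                        max_concurrency=10)

s3.copy(copy_source, "destination-bucket", "big-file.dat", Config=config)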

How can I search the changes made on an `s3` bucket between two timestamps?

I am using an s3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences in the files in my bucket between two dates. If not, is there a way for me to build one via the aws cli or an SDK?
The reason I want to check this is that I have an s3 bucket and my clients keep pushing data to it. I want to have a look at how much data they have pushed since the last time I loaded it. Is there a pattern in aws that supports this query? Or do I have to create rules on the s3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
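A minimal sketch of that "list it yourself" approach, assuming you keep the previous listing in a local JSON file (bucket name and file path are placeholders):

import json
import boto3

BUCKET = "my-bucket"              # placeholder
STATE_FILE = "last_listing.json"  # where the previous listing is kept

s3 = boto3.client("s3")

def list_bucket():
    """Return {key: last_modified_iso} for every object in the bucket."""
    listing = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            listing[obj["Key"]] = obj["LastModified"].isoformat()
    return listing

def compare_with_previous(current):
    try:
        with open(STATE_FILE) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}
    added = [k for k in current if k not in previous]
    changed = [k for k in current if k in previous and current[k] != previous[k]]
    deleted = [k for k in previous if k not in current]
    return added, changed, deleted

if __name__ == "__main__":
    current = list_bucket()
    added, changed, deleted = compare_with_previous(current)
    print(f"added: {len(added)}, changed: {len(changed)}, deleted: {len(deleted)}")
    with open(STATE_FILE, "w") as f:
        json.dump(current, f)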
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
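For the Lambda option, a minimal handler sketch is below; what you do with each new object is up to you, the print is only illustrative.

import urllib.parse

def lambda_handler(event, context):
    # S3 event notifications can batch several records per invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        # Placeholder: process or forward the new object as needed.
        print(f"New object: s3://{bucket}/{key} ({size} bytes)")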

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40kb per file, which makes it a not-so-cost-efficient solution for small files; it's better to use an Infrequent Access storage class or to zip the content
I gave it a try and created a vault, but I couldn't run the aws glacier command: I get an SSL error which is apparently related to a Python library, whether I run it on my Mac or from a dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
So, is there something in AWS to do what I want? Or any advice on doing it in a way that's not too tedious? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volume, plus per request. It is less efficient to store many small files in Glacier. Instead, it is recommended to create archives (eg zip files) to make fewer, larger files, though this can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility such as CloudBerry Backup; however, these utilities are designed to back up from a computer to Glacier. They probably won't back up S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
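A minimal sketch of such a lifecycle rule via boto3 (the bucket name, rule ID, and 1-day transition are just examples):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",  # example bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": ""},   # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 1, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)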
Using an S3 archiving bucket did the job.
Here is how I proceeded:
First, I created an S3 bucket called mycompany-archive, with a lifecycle rule that changes the storage class to Glacier 1 day after the file's creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp replacements, and did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end, and the lines corresponding to the buckets I didn't want to copy (mainly mycompany-archive had to be removed from there), and I had what I needed to do the transfers.
That command could be executed directly, but I prefer to run such commands using the screen utility, to make sure the process won't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under macOS, I ran caffeinate to make sure the computer wouldn't sleep before it's over. To do that, I issued ps | grep aws to locate the process id of the command, and then caffeinate -w 31299 (the process id) to ensure my Mac wouldn't sleep before the process is done.
It did the job (well, it's still running), I have now a bucket containing a folder for each archived bucket. Next step will be to remove the undesired S3 buckets.
Of course this way of doing things could be improved in many ways, mainly by turning everything into a fault-tolerant, replayable script. In this case though, I have to be pragmatic, and thinking about how to improve it would take far more time for almost no gain.

Does S3 multipart upload actually create multiple objects in my bucket?

Here is an example to help me understand the under-the-hood mechanism.
I decide to upload a 2GB file to my S3 bucket, and I decide to use a part size of 128MB. Then I will have
(2 * 1024) / 128 => 16 parts
Here are my questions:
Am I going to see 16 128MB objects in my bucket, or a single 2GB object in my bucket?
How can S3 understand the order of the parts (1 -> 2 -> ... -> 16) and reassemble them into a single 2GB file when I download them back? Is there an extra 'meta' object (see the above question) that I need to download first to give the client the information it needs to reassemble the parts?
When the S3 client downloads the above in parallel, at what point does it write the file descriptor for this 2GB file in the local file system (I guess it does not know all the needed information before all the parts have been downloaded)?
While the individual parts are being uploaded, there will be an in-progress multipart upload stored in Amazon S3, which you can view with the ListMultipartUploads command.
When completing a multipart upload with the CompleteMultipartUpload command, you must specify a list of the individual parts uploaded, in the correct order. The parts will then be combined into a single object.
Downloading depends upon the client/code you use -- you could download an object in parallel or just single-threaded.
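A minimal sketch of that flow with boto3 (bucket, key, file name, and part size are placeholders); note the ordered parts list passed to CompleteMultipartUpload, and that the single object only appears in the bucket after that final call.

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big-file.bin"   # placeholders
PART_SIZE = 128 * 1024 * 1024               # 128MB, as in the question

# 1. Start the multipart upload; S3 returns an UploadId.
upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]

# 2. Upload each part, keeping the ETag and PartNumber for the final step.
parts = []
with open("big-file.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(PART_SIZE)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                              PartNumber=part_number, Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# 3. Complete the upload: only now does the single 2GB object exist in the bucket.
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)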

AWS S3 Multipart Upload: Can I start downloading a file before it's 100% uploaded?

Actually, the title was the question :)
Does AWS S3 support file streaming in the case where the file is not yet 100% uploaded? Client #1 splits a file into small chunks and starts uploading them using Multipart Upload. Client #2 starts downloading them from S3. As a result, client #2 doesn't need to wait until client #1 has uploaded the whole file.
Is it possible to do it without additional streaming server?
This is not natively supported by S3.
S3 allows the individual parts of a multipart upload to be uploaded sequentially, or in parallel, or even out of their logical order, over an essentially unlimited period of time.
It is not until you send the CompleteMultipartUpload request that S3 verifies that all the parts are present and have the correct checksums; only then is the final object assembled from the parts and created in the bucket (overwriting the former object with the same key, if there was one). Until then, the object -- as an object at the designated key -- does not technically exist, so it can't be downloaded.