File Copy from one S3 bucket to other S3 bucket using Lambda - timing constraint? - amazon-s3

I need to copy large files ( may be even greater than 50 GB) from one S3 bucket to other S3 bucket ( event based). I am planning to use s3.Object.copy_from to do this inside Lambda ( using boto3).
I wanted to see if anyone has tried this? will this have any performance issue for larger files (100 GB etc.) causing Lambda timeout?
If yes, is there any alternate option ? ( I am trying to use code since I might need to do some other additional logic like rename file, move source file to archive etc.).
Note- I am also exploring AWS S3 Replication options, but looking for other solutions in parallel.

You can use AWS S3 replication feature.
It supports key prefix and API filtering as well.

Related

Merging dask partitions into one file when writing to s3 Bucket in AWS

I've managed to write an oracle database table to an s3 bucket in AWS in parquet format using Dask. However, I was hoping to have a single file written out like in Pandas. I know Dask partitions the data which creates separate files and a folder. I've tried setting append to true and the number of partitions to false but it doesn't make a difference. Is there a way to merge/append the partitions while writing to an s3 Bucket to create a single parquet file without the folder?
Thanks
No this functionality does not exist currently within Dask. It probably is not too hard to leverage pyarrow or fastparquet to do the work, though, taking the partitions and streaming them into whatever new chunking scheme you like.
I am not sure, but it may be possible to use s3 copy functionality to selectively chop out bytes chunks from the data files and paste into the master file you want to make... This would be far more involve.

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another, some of the files are up to 2GB in size after using the aws cli's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and verifying the files that had been transferred it became clear that the large files had lost the metadata that was present on the original file in the original bucket.
This is a "known" issue with larger files where s3 switches to multipart upload to handled the transfer.
This multipart handeling can be configured via the .aws/config file which has been done like so
[default]
s3 =
multipart_threshold = 4500MB
However when again testing the transfer the metadata on the larger files is still not present, it is present on any of the smaller files so it's clear that I'm heating the multipart upload issue.
Given this is an s3 to s3 transfer is the local s3 configuration taken into consideration at all?
As an alternative to this is there a way to just sync the metadata now that all the files have been transferred?
Have also tried doing aws s3 cp with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will copy between the buckets. You can, however, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
Finally managed to implement a solution for this and took the oportunity to play around with the Serverless framework and Step Functions.
The general flow I went with was:
Step Function triggered using a Cloudwatch Event Rule targetting S3 Events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transfered using a multipart process
The initial Task on the Step Function checks if all the required MetaData is present on the object that raised the event.
If it is present then the Step Function is finished
If it is not present then the second lambda task is fired which copies all metadata from the source object to the destination object.
This could be achieved without Step Functions however was a good simple exercise to give them a go. The first 'Check Meta' task is actually redundant as the metadata is never present if multipart transfer is used, I was originally also triggering off of PutObject and CopyObject as well which is why I had the Check Meta task.

AWS DMS Redshift as target

I am planning to do continuous migration of RDS to Redshift using DMS. As per the docs, it states if the target is redshift , DMS uses a S3 bucket to temporarily store the data before copying to redshift. I could not find any document confirming if this S3 bucket is temporary (used only for initial copying) and is deleted once the copying is done. (https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html)
Any thoughts on this?
Probably you've figured out the answer already. If not, yes it does create a bucket and the contents are deleted but the bucket itself is not deleted. Generally the name of the bucket starts with dms-.
Also there is an option to provide a custom bucket.

Lambda triggers high s3 costs

I created a new Lambda based on a 2MB zip file (it has a heavy dependency). After that, my S3 costs really increased (from $12.27 to $31).
Question 1: As this is uploaded from a CI/CD pipeline, could it be that it's storing every version and then increasing costs?
Question 2: Is this storage alternative more expensive than choosing directly an owned s3 bucket instead of the private one owned by Amazon where this zip goes? Looking at the S3 prices list, only 2MB can't result in 19 Dollars.
Thanks!
few things you can do to mitigate cost:
Use Lambda Layers for dependencies
Use S3 Infrequent access for your lambda archive
Being that I don't have your full configuration of S3, its hard to tell what can be causing cost...things like S3 versioning would do it.
The reason was that Object versioning was enabled, and after some stress tests those versions were accumulated and stored. Costs went back to $12 after they were removed.
It's key to keep the "Show" enabled (see image) to keep track of those files.

Move many S3 buckets to Glacier

We have a ton of S3 buckets and are in the process of cleaning things up. We identified Glacier as a good way to archive their data. The plan is to store the content of those buckets and then remove them.
It would be a one-shot operation, we don't need something automated.
I know that:
a bucket name may not be available anymore if one day we want to restore it
there's an indexing overhead of about 40kb per file which makes it a not so cost-efficient solution for small files and better to use an Infrequent access storage class or to zip the content
I gave it a try and created a vault. But I couldn't run the aws glacier command. I get some SSL error which is apparently related to a Python library, wether I run it on my Mac or from some dedicated container.
Also, it seems that it's a pain to use the Glacier API directly (and to keep the right file information), and that it's simpler to use it via a dedicated bucket.
What about that? Is there something to do what I want in AWS? Or any advice to do it in a not too fastidious way? What tool would you recommend?
Whoa, so many questions!
There are two ways to use Amazon Glacier:
Create a Lifecycle Policy on an Amazon S3 bucket to archive data to Glacier. The objects will still appear to be in S3, including their security, size, metadata, etc. However, their contents are stored in Glacier. Data stored in Glacier via this method must be restored back to S3 to access the contents.
Send data directly to Amazon Glacier via the AWS API. Data sent this way must be restored via the API.
Amazon Glacier charges for storage volumes, plus per request. It is less-efficient to store many, small files in Glacier. Instead, it is recommended to create archives (eg zip files) that make fewer, larger files. This can make it harder to retrieve specific files.
If you are going to use Glacier directly, it is much easier to use a utility, such as Cloudberry Backup, however these utilities are designed to backup from a computer to Glacier. They probably won't backup S3 to Glacier.
If data is already in Amazon S3, the simplest option is to create a lifecycle policy. You can then use the S3 management console and standard S3 tools to access and restore the data.
Using a S3 archiving bucket did the job.
Here is how I proceeded:
First, I created a S3 bucket called mycompany-archive, with a lifecycle rule that turns the Storage class into Glacier 1 day after the file creation.
Then, (with the aws tool installed on my Mac) I ran the following aws command to obtain the buckets list: aws s3 ls
I then pasted the output into an editor that can do regexp relacements, and I did the following one:
Replace ^\S*\s\S*\s(.*)$ by aws s3 cp --recursive s3://$1 s3://mycompany-archive/$1 && \
It gave me a big command, from which I removed the trailing && \ at the end, and the lines corresponding the buckets I didn't want to copy (mainly mycompany-archive had to be removed from there), and I had what I needed to do the transfers.
That command could be executed directly, but I prefer to run such commands using the screen util, to make sure the process wouldn't stop if I close my session by accident.
To launch it, I ran screen, launched the command, and then pressed CTRL+A then D to detach it. I can then come back to it by running screen -r.
Finally, under MacOS, I ran cafeinate to make sure the computer wouldn't sleep before it's over. To run it, issued ps|grep aws to locate the process id of the command. And then caffeinate -w 31299 (the process id) to ensure my Mac wouldn't allow sleep before the process is done.
It did the job (well, it's still running), I have now a bucket containing a folder for each archived bucket. Next step will be to remove the undesired S3 buckets.
Of course this way of doing could be improved in many ways, mainly by turning everything into a fault-tolerant replayable script. In this case, I have to be pragmatic and thinking about how to improve it would take far more time for almost no gain.