Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually, and I can script it too. But I want the makefile to upload my image file my_image.jpg only if the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?

The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
An S3 object will require copying if:
the sizes of the two S3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the S3 object does not exist under the specified bucket and prefix destination.
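Those three checks are essentially the make-style dependency test you're after. Not what aws s3 sync runs internally, but as a rough boto3 sketch of the same decision (using the file and bucket names from the question):

import os
import boto3
from botocore.exceptions import ClientError

def needs_upload(local_path, bucket, key):
    """Return True if the local file should be copied to s3://bucket/key."""
    s3 = boto3.client("s3")
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True                                   # object missing at the destination
    if head["ContentLength"] != os.path.getsize(local_path):
        return True                                   # sizes differ
    # source is newer than the destination
    return os.path.getmtime(local_path) > head["LastModified"].timestamp()

if needs_upload("my_image.jpg", "mybucket.mydomain.com", "my_image.jpg"):
    print("aws s3 cp my_image.jpg s3://mybucket.mydomain.com/")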

I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that, and there are several projects that present S3 as a local filesystem. I haven't tried any of them, but it seems like the way to go.

Related

How to add the files of an S3 bucket folder to a zip file and download the zip file

I have a folder in an S3 bucket. I want to zip the files inside it and then download the zip file. Everything I found was related to Lambda. Is there a way I can do it without using Lambda? If not, what is the proper way to do it?
Thank you in advance.
S3 can't zip it on the fly for you since it's only an object storage service. You could use Lambda, of course, but the simplest way to download a "folder" on S3 is to use the AWS CLI.
aws s3 sync s3://<bucket_name>/<folder_key> <local_dest_path>
You can then zip it on your local machine if needed.
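If you'd rather do the same thing from Python than shell out to the CLI, a minimal boto3 sketch might look like this; bucket_name and folder_key are placeholders, and each object is read fully into memory, so it suits modest file sizes:

import zipfile
import boto3

bucket = "bucket_name"   # placeholder, as in the command above
prefix = "folder_key/"   # placeholder "folder" prefix
s3 = boto3.client("s3")

with zipfile.ZipFile("folder.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):                 # skip zero-byte "folder" marker objects
                continue
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            zf.writestr(key[len(prefix):], body)  # store relative to the prefix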

Get zip files from one S3 bucket and unzip them to another S3 bucket

I have zip files in one S3 bucket.
I need to unzip them and copy the unzipped folder to another S3 bucket, keeping the source path.
For example, if in the source bucket the zip file is under
"s3://bucketname/foo/bar/file.zip"
then in the destination bucket it should be "s3://destbucketname/foo/bar/zipname/files.."
How can it be done?
I know it is possible to do this with Lambda so I won't have to download the files locally, but I have no idea how.
Thanks!
If you want to trigger the above process as soon as the zip file is uploaded into the bucket, you could write an AWS Lambda function.
When the Lambda function is triggered, it will be passed the name of the bucket and object that was uploaded. The function should then:
Download the Zip file to /tmp
Unzip the file (Beware: maximum storage available: 500MB)
Loop through the unzipped files and upload them to the destination bucket
Delete all local files created (to free-up space for any future executions of the function)
For a general example, see: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda
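A minimal handler along those lines might look like the sketch below; destbucketname, the /tmp paths, and the event wiring are assumptions, and a real function would add error handling:

import os
import shutil
import zipfile
from urllib.parse import unquote_plus
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "destbucketname"  # assumed destination bucket name

def lambda_handler(event, context):
    # Bucket and key of the uploaded zip come from the S3 event record
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])      # e.g. foo/bar/file.zip

    local_zip = "/tmp/archive.zip"
    s3.download_file(bucket, key, local_zip)         # 1. download the zip to /tmp

    extract_dir = "/tmp/unzipped"
    with zipfile.ZipFile(local_zip) as zf:           # 2. unzip (watch the /tmp limit)
        zf.extractall(extract_dir)

    zip_name = os.path.splitext(os.path.basename(key))[0]
    parent = os.path.dirname(key)
    dest_prefix = (parent + "/" if parent else "") + zip_name + "/"
    for root, _, files in os.walk(extract_dir):      # 3. upload each extracted file
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, extract_dir)
            s3.upload_file(path, DEST_BUCKET, dest_prefix + rel)

    os.remove(local_zip)                             # 4. free /tmp for later invocations
    shutil.rmtree(extract_dir)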
You can use AWS Lambda for this. You can also set an event notification on your S3 bucket so that a Lambda function is triggered every time a new file arrives. You can write Python code that uses boto3 to connect to S3, read the files into a buffer, unzip them using these libraries, gzip them, and then re-upload them to S3 under your desired folder/path:
import gzip
import io
import zipfile
import boto3

s3 = boto3.resource("s3")
destinationbucket = s3.Bucket("dest-bucket-name")  # example destination bucket
source_object = s3.Object("source-bucket-name", "archive.zip")  # example source zip
# Open the source zip in memory, then gzip each member and re-upload it
zipped = zipfile.ZipFile(io.BytesIO(source_object.get()["Body"].read()))
for file in zipped.namelist():
    final_file_path = "desired/path/" + file  # example destination key
    with zipped.open(file, "r") as f_in:
        gzipped_content = gzip.compress(f_in.read())
        destinationbucket.upload_fileobj(io.BytesIO(gzipped_content),
                                         final_file_path,
                                         ExtraArgs={"ContentType": "text/plain"})
There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
Arguably Python is simpler to use for your Lambda, but if you are considering Java, I've made a library that manages unzipping of data in AWS S3 utilising stream download and multipart upload.
Unzipping is achieved without keeping data in memory or writing to disk. That makes it suitable for large data files - it has been used to unzip files of size 100GB+.
It is available in Maven Central, here is the GitHub link: nejckorasa/s3-stream-unzip

Sync with S3 with s3cmd, but not re-download files that only changed name

I'm syncing a bunch of files between my computer and Amazon S3. Say a couple of the files change name, but their content is still the same. Do I have to have the local file removed by s3cmd and then the "new" file re-downloaded, just because it has a new name? Or is there any other way of checking for changes? I would like s3cmd to, in that case, simply change the name of the local file in accordance with the new name on the server.
s3cmd upstream (the master branch at github.com/s3tools/s3cmd) and the latest published version, 1.5.0-rc1, can figure this out, provided you used a recent version with the --preserve option when you put the file into S3 in the first place, so that the md5sum of each file was stored. Using the md5sums, it knows that you have a duplicate (even if renamed) file locally and won't re-download it; instead it will do a local copy (or hardlink) from the existing file system name to the name from S3.
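s3cmd's own implementation differs, but the md5 idea can be sketched with boto3 roughly as follows, assuming single-part uploads (where an object's ETag equals its md5) and a flat local directory; my-bucket is a placeholder:

import hashlib
import os
import shutil
import boto3

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

bucket = "my-bucket"   # placeholder
local_dir = "."        # placeholder: flat local directory being synced
s3 = boto3.client("s3")

# Index local files by their md5
local_by_md5 = {}
for name in os.listdir(local_dir):
    path = os.path.join(local_dir, name)
    if os.path.isfile(path):
        local_by_md5[md5_of(path)] = path

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        target = os.path.join(local_dir, obj["Key"])
        if os.path.exists(target):
            continue
        etag = obj["ETag"].strip('"')                 # equals the md5 only for single-part uploads
        if etag in local_by_md5:
            shutil.copy(local_by_md5[etag], target)   # reuse the renamed local copy (or os.link)
        else:
            s3.download_file(bucket, obj["Key"], target)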

Is it possible to sync a single file to s3?

I'd like to sync a single file from my filesystem to s3.
Is this possible or can only directories be synced?
Use the --include/--exclude options of the sync command:
e.g. To sync just /var/local/path/filename.xyz to S3 use:
aws s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
Just to comment on pythonjsgeo's answer: that seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude options. I was including the = symbol and getting weird behavior with the sync command.
aws s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
You can mount an S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize files or directories.

S3: Move all files from subdirectories into a common directory

I have a lot of subdirectories containing a lot of images (millions) on S3. Having the files in these subdirectories has turned out to be a lot of trouble, and since all file names are actually unique, there is no reason why they should reside in subdirectories. So I need to find a fast and scalable way to move all files from the subdirectories into one common directory or alternatively delete the sub directories without deleting the files.
Is there a way to do this?
I'm on Ruby, but open to almost anything.
I have added a comment to your other question, explaining why S3 does not have folders, but file name prefixes instead (See Amazon AWS IOS SDK: How to list ALL file names in a FOLDER).
With that in mind, you will probably need to use a combination of two S3 API calls to achieve what you want: copy each file to a new key (removing the prefix from the file name) and then delete the original. Maybe there is a Ruby S3 SDK or framework out there exposing a rename feature, but under the hood it will likely be a copy/delete.
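The question mentions Ruby, but the same two calls exist in every SDK. As an illustration, a copy-then-delete pass might be sketched in boto3 like this (my-bucket is a placeholder, and note that copy_object only handles objects up to 5 GB):

import boto3

bucket = "my-bucket"  # placeholder
s3 = boto3.client("s3")

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if "/" not in key:
            continue                          # already at the top level
        new_key = key.rsplit("/", 1)[-1]      # drop the "subdirectory" prefix
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)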
Related question: Amazon S3 boto: How do you rename a file in a bucket?