I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed.
I know that compressing it with gzip would bring it down to about 2.2GB. How can I download this file locally as quickly as possible when transfer is the bottleneck (250kB/s)?
I've not found any straightforward way to compress the file on S3, or enable compression on transfer in s3cmd, boto, or related tools.
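For scale, here is a quick back-of-the-envelope calculation with the figures above. It assumes decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes) and a steady 250 kB/s, both of which are simplifications; `transfer_hours` is just an illustrative helper name.

```python
def transfer_hours(size_gb: float, rate_kb_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_kb_per_s kilobytes per second."""
    seconds = (size_gb * 1e9) / (rate_kb_per_s * 1e3)
    return seconds / 3600


uncompressed = transfer_hours(17.7, 250)
compressed = transfer_hours(2.2, 250)
print(f"uncompressed: {uncompressed:.1f} h, gzip: {compressed:.1f} h")
# prints "uncompressed: 19.7 h, gzip: 2.4 h"
```

So compressing before the transfer would cut roughly 17 hours off the download, which is why it is worth doing the compression on the AWS side first.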
S3 does not support stream compression, nor is it possible to compress an uploaded file remotely.
If this is a one-time process, I suggest downloading the file to an EC2 instance in the same region, compressing it there, and then uploading it to your destination.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
If you need this more frequently, see:
Serving gzipped CSS and JavaScript from Amazon CloudFront via S3
Late answer but I found this working perfectly.
aws s3 sync s3://your-pics .
find . -name "*.jpg" -print -exec gzip {} \;
aws s3 sync . s3://your-pics --content-encoding gzip --dryrun
This downloads all files in the S3 bucket to the machine (or EC2 instance), compresses the image files, and uploads them back to the S3 bucket.
Verify the data before removing the --dryrun flag.
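If you would rather do the compression step in a script, here is a minimal sketch of the same idea that checks each file round-trips before the original is deleted. `gzip_tree` is a hypothetical helper name, the `*.jpg` pattern is taken from the commands above, and the verification reads each file into memory, so it assumes the files are reasonably small.

```python
import gzip
import shutil
from pathlib import Path


def gzip_tree(root: str, pattern: str = "*.jpg") -> list[Path]:
    """Gzip every file under root matching pattern; return the new .gz paths."""
    written = []
    # sorted() materializes the listing before any files are added or removed.
    for src in sorted(Path(root).rglob(pattern)):
        dst = src.with_name(src.name + ".gz")
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        # Verify the compressed copy decompresses to the original bytes.
        with gzip.open(dst, "rb") as check:
            if check.read() != src.read_bytes():
                raise ValueError(f"round-trip mismatch for {src}")
        src.unlink()  # only remove the original after the check passed
        written.append(dst)
    return written
```

After this runs, the directory holds `name.jpg.gz` files in place of the originals, which is the same layout the `gzip` command leaves behind.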
There are now pre-built apps in Lambda that you could use to compress images and files in S3 buckets. So just create a new Lambda function and select a pre-built app of your choice and complete the configuration.
Step 1 - Create a new Lambda function
Step 2 - Search for a pre-built app
Step 3 - Select the app that suits your needs and complete the configuration by providing the S3 bucket names.
I have an application that needs to create a compressed file from several objects stored on S3. I would like to compress the objects on the fly, without first downloading them into a container and compressing them there. The files can be quite big, so I could easily run out of disk space, and there is also the extra round trip of downloading the files to disk, compressing them, and uploading the compressed file to S3 again.
It is worth mentioning that I would like to place the files in different directories inside the compressed file, so that when a user decompresses it, the files end up in different folders.
Since S3 has no concept of a physical folder structure, I am not sure whether this is possible, or whether there is a better way than downloading and re-uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how to build a zip file from objects in S3 without first downloading them to a local disk, zipping them there, and uploading the result. I would like to zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
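On the directory-structure part: a zip archive has no real folders either. Each entry simply carries a path-like name, so the layout is whatever names you give the entries. Below is a minimal sketch using Python's standard `zipfile` module that copies each input stream into the archive in chunks rather than loading it whole. `build_zip` is a hypothetical helper, and the in-memory streams stand in for S3 object bodies (for example the `Body` returned by boto3's `get_object`, which is an assumption here and not shown).

```python
import io
import shutil
import zipfile


def build_zip(out, members):
    """Write a zip archive to the binary file object `out`.

    members: iterable of (archive_path, readable_binary_stream) pairs.
    """
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as zf:
        for arcname, stream in members:
            # force_zip64 permits entries larger than 4 GiB when the size
            # is not known before writing starts.
            with zf.open(arcname, "w", force_zip64=True) as entry:
                shutil.copyfileobj(stream, entry, length=1024 * 1024)


buf = io.BytesIO()
build_zip(buf, [
    ("images/big-file1", io.BytesIO(b"first")),
    ("images/big-file2", io.BytesIO(b"second")),
    ("videos/big-file3", io.BytesIO(b"third")),
])
print(zipfile.ZipFile(buf).namelist())
# prints ['images/big-file1', 'images/big-file2', 'videos/big-file3']
```

This only addresses memory and naming. The code still has to run somewhere (Lambda, ECS, EC2), and the archive bytes still have to be sent back to S3 from there.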
I have almost the same use case as yours. I researched it for about 2 months and tried multiple approaches, but in the end I had to use ECS (EC2), because the zip file can be huge (around 100GB).
Currently AWS does not offer a native way to perform the compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are about 3 GB in size, Lambda may be enough to meet your requirement.
If your files are larger than 4 GB, I believe it is safer to use ECS or EC2 and attach more volumes if the compression needs more space.
Yes, there are at least two ways: using AWS Lambda or using AWS EC2.
EC2
The aws-cli cp command can stream to stdout and from stdin (use - in place of a local path), so you can pipe an S3 object through any archiver with a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
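The same streaming idea can be sketched in Python: read from one file-like object, write gzip output to another, and hold only one chunk in memory at a time. `gzip_stream` is a hypothetical helper name and the 1 MiB chunk size is an arbitrary choice.

```python
import gzip
import io
import shutil


def gzip_stream(src, dst, chunk_size=1024 * 1024):
    """Compress everything readable from src into dst, one chunk at a time."""
    # GzipFile writes the gzip trailer on close but leaves dst itself open.
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, length=chunk_size)


source = io.BytesIO(b"col1,col2\n" * 50_000)
target = io.BytesIO()
gzip_stream(source, target)
print(len(source.getvalue()), "->", len(target.getvalue()), "bytes")
```

Called with `sys.stdin.buffer` and `sys.stdout.buffer`, this would play the role of the `gzip` stage in the pipe above.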
AWS-Lambda
Since maintaining an EC2 instance just for compression may be too expensive, you can use Lambda for one-time compressions.
But keep in mind that a Lambda invocation can run for at most 15 minutes. So, if your files are really huge, try this sequence:
To make sure each invocation finishes in time, compress the file in parts, one part per Lambda invocation
The compressed parts can then be merged on S3 into one file using Upload Part - Copy
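This sequence works because the gzip format allows several independently compressed members to be concatenated into one valid gzip file. A small local sketch of that property follows; `compress_in_parts` is a hypothetical helper, and the byte-string join stands in for the server-side merge on S3.

```python
import gzip


def compress_in_parts(data: bytes, part_size: int) -> list[bytes]:
    """Gzip each part_size slice of data independently of the others."""
    return [gzip.compress(data[i:i + part_size])
            for i in range(0, len(data), part_size)]


original = b"row,value\n" * 100_000          # 1,000,000 bytes
parts = compress_in_parts(original, part_size=256 * 1024)
merged = b"".join(parts)                     # stands in for the merge on S3
print(len(parts))                            # prints 4
print(gzip.decompress(merged) == original)   # prints True
```

One caveat, as far as I know: in an S3 multipart upload every part except the last must be at least 5 MB, so each compressed part has to be at least that large for the merge to be accepted.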
I've got a file (4GB) which is too big to upload on AWS S3 with unstable internet connection, so I split the file into several parts using WinZip.
So, file.csv became a series of files:
- file.z01
- file.z02
- ...
- file.z12
After uploading it on AWS S3 I need to unzip it. How do I do it?
You won't be able to do it without the help of an EC2 instance.
If you have already uploaded these small zip files, launch a new EC2 instance, download the files from S3 using curl or wget, combine them, and upload the result to S3 again.
Since you are using WinZip, consider launching a Windows-based instance, as it may be hard to find a Linux-based equivalent for WinZip.
After some googling, it appears there is no API or tool to upload files from a URL directly to S3 without downloading them first?
I could probably download the files locally first and then upload them to S3. Is there a good tool (Mac) that lets me batch upload all files in a given directory?
Or are there any PHP scripts I could install on a shared hosting account to download one file at a time and then upload it to S3?
The AWS Command Line Interface (CLI) can upload files to Amazon S3, e.g.:
aws s3 cp file s3://my-bucket/file
aws s3 cp . s3://my-bucket/path --recursive
aws s3 sync . s3://my-bucket/path
The sync command is probably best for your use case: it synchronizes local files with remote files and only copies new or changed files. Alternatively, use cp to copy specific files.
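To make "new or changed" concrete, here is a rough sketch of the documented default rule for uploads: a local file is copied if it is missing remotely, if the sizes differ, or if the local copy was modified more recently. `FileInfo` and `needs_upload` are illustrative names, not part of the CLI, and flags such as `--size-only` change the real behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    size: int      # bytes
    mtime: float   # last-modified time, seconds since the epoch


def needs_upload(local: FileInfo, remote: Optional[FileInfo]) -> bool:
    """Approximate the default sync decision for one local file."""
    if remote is None:                 # not in the bucket yet
        return True
    return local.size != remote.size or local.mtime > remote.mtime


print(needs_upload(FileInfo(10, 100.0), None))                # prints True
print(needs_upload(FileInfo(10, 100.0), FileInfo(10, 200.0))) # prints False
print(needs_upload(FileInfo(12, 100.0), FileInfo(10, 200.0))) # prints True
```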
I am trying to load data into one of my S3 buckets.
The file I am trying to load is a huge tarball on the web. I don't want to download the file to my disk and then upload it to the S3 bucket.
Is there any way to specify the URL directly and have the file added to S3?
You have to "put" the file to S3 yourself; S3 does not "get" it from a URL for you.
Is there a way to transfer a file from the web directly to my Amazon S3 account?
For example, I want to transfer a large RDF file from www.data.gov directly to Amazon S3 without having to download the file to my local machine first.
You need a server somewhere that will execute the curl command. The easiest way is probably to use a tool that I wrote for AWS EC2: https://github.com/mjhm/cURLServer. You can check out the docs on a live version at http://ec2-204-236-157-181.us-west-1.compute.amazonaws.com/
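Whatever machine does the transfer, the relay itself can be small. Here is a minimal sketch, assuming the third-party boto3 package is installed and AWS credentials are configured on that machine. `url_to_s3` is a hypothetical helper, and the URL, bucket, and key are placeholders you would supply. It hands the open HTTP response straight to boto3's `upload_fileobj`, which reads it in chunks, so the file is never written to local disk.

```python
import urllib.request


def url_to_s3(url: str, bucket: str, key: str) -> None:
    """Stream the body at `url` into s3://bucket/key without a temp file."""
    import boto3  # third-party; assumed to be installed and configured

    s3 = boto3.client("s3")
    with urllib.request.urlopen(url) as response:
        # upload_fileobj accepts any binary file-like object with read().
        s3.upload_fileobj(response, bucket, key)
```

Running this on an EC2 instance in the bucket's region keeps the upload leg inside AWS, so only the download from the source site crosses the public internet.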