How to decompress split zip files on AWS S3?

I've got a 4 GB file which is too big to upload to AWS S3 over an unstable internet connection, so I split the file into several parts using WinZip.
So, file.csv became a series of files:
- file.z01
- file.z02
- ...
- file.z12
After uploading the parts to AWS S3, I need to unzip them. How do I do that?

You won't be able to do it without the help of an EC2 instance.
If you have already uploaded these small zip files, launch a new EC2 instance, download the files from S3 using curl or wget, combine them, extract the archive, and upload the result to S3 again.
Since you are using WinZip, consider launching a Windows-based instance, as it may be hard to find a Linux-based equivalent for WinZip.
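If you do end up on a Linux instance instead, a minimal sketch of the workflow could look like the following, assuming Python 3 with boto3 plus Info-ZIP's zip installed, and assuming zip -s 0 accepts the WinZip-produced parts (worth verifying for your archive); the bucket name is a placeholder.

import subprocess
import zipfile
import boto3

BUCKET = "your-bucket"  # placeholder: replace with your bucket
# The split set consists of file.z01 .. file.z12 plus the final file.zip
# segment that WinZip also creates.
PARTS = [f"file.z{i:02d}" for i in range(1, 13)] + ["file.zip"]

s3 = boto3.client("s3")

# 1. Download every part of the split archive into the working directory.
for part in PARTS:
    s3.download_file(BUCKET, part, part)

# 2. Rejoin the split archive into a single zip (-s 0 removes the splits),
#    then extract it.
subprocess.run(["zip", "-s", "0", "file.zip", "--out", "joined.zip"], check=True)
zipfile.ZipFile("joined.zip").extractall()

# 3. Upload the extracted CSV back to S3.
s3.upload_file("file.csv", BUCKET, "file.csv")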

Related

Streaming compression to S3 bucket with a custom directory structure

I have an application that needs to create a compressed file from different objects stored on S3. The issue I am facing is that I would like to compress the objects on the fly, without downloading the files into a container and doing the compression there. The reason is that the files can be quite big, so I can easily run out of disk space, and of course there is the extra round-trip time of downloading the files to disk, compressing them, and uploading the compressed file to S3 again.
It is worth mentioning that I would like to place the files in the output archive under different directories, so that when a user decompresses the file they see the content organized into different folders.
Since S3 does not have the concept of a physical folder structure, I am not sure whether this is possible, or whether there is a better way than downloading and re-uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading the objects to a local disk, then create a zip file and upload it to S3. I would like to simply zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I researched it for about two months and tried multiple approaches, but finally I had to use ECS (EC2) for my use case, because the zip files can be huge, around 100 GB.
Currently AWS doesn't support a native way to perform compression. I have talked to them and they are considering it as a feature, but there is no timeline given yet.
If your files are around 3 GB in size, you can consider Lambda to achieve your requirement.
If your files are larger than 4 GB, I believe it is safer to do it with ECS or EC2, attaching more volume if the compression requires more space/memory, as sketched below.
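For the directory-structure part, here is a minimal sketch of what the EC2/ECS job could look like using boto3 and Python's zipfile; the bucket name and the key-to-folder mapping are placeholders, and it assumes the instance has enough local disk attached for the intermediate zip.

import zipfile
import boto3

BUCKET = "your-bucket"          # placeholder
LAYOUT = {                      # S3 key -> path inside the output zip
    "big-file1": "images/big-file1",
    "big-file2": "images/big-file2",
    "big-file3": "videos/big-file3",
}
OUTPUT_KEY = "big-zip.zip"

s3 = boto3.client("s3")

with zipfile.ZipFile("big-zip.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for key, arcname in LAYOUT.items():
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        # Stream the object straight into a zip entry; the arcname decides
        # which folder the file appears in after decompression.
        with zf.open(arcname, "w", force_zip64=True) as dest:
            for chunk in iter(lambda: body.read(1024 * 1024), b""):
                dest.write(chunk)

s3.upload_file("big-zip.zip", BUCKET, OUTPUT_KEY)

The objects are streamed into the archive chunk by chunk, so memory stays flat, but the finished zip still needs to fit on the attached volume before it is uploaded.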
Yes, there are at least two ways: using AWS Lambda or AWS EC2.
EC2
Since the aws-cli's cp command can read from stdin and write to stdout, you can pipe an S3 file through any archiver using a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
AWS-Lambda
Since maintaining and using an EC2 instance just for compression may be too expensive, you can use Lambda for one-off compressions.
But keep in mind that Lambda has an execution time limit of 15 minutes. So, if your files are really huge, try this sequence:
To make sure each piece can be compressed within the limit, compress the file in parts using Lambda
The compressed parts can then be merged on S3 into one object using Upload Part - Copy, as sketched below
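A rough sketch of that merge step with boto3 (multipart upload plus UploadPartCopy) is below; the bucket and part keys are placeholders, and every part except the last must be at least 5 MB. Whether the concatenated object is still a valid archive depends on the format; concatenated gzip members, for example, decompress as a single stream.

import boto3

BUCKET = "your-bucket"                      # placeholder
PART_KEYS = [                               # placeholder keys of the pieces
    "compressed/part-001.gz",               # produced by the Lambda step
    "compressed/part-002.gz",
    "compressed/part-003.gz",
]
TARGET_KEY = "compressed/merged.gz"

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=TARGET_KEY)

parts = []
for number, key in enumerate(PART_KEYS, start=1):
    # Server-side copy of each compressed piece into the target object;
    # no data is downloaded to the caller.
    result = s3.upload_part_copy(
        Bucket=BUCKET,
        Key=TARGET_KEY,
        UploadId=upload["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"ETag": result["CopyPartResult"]["ETag"], "PartNumber": number})

s3.complete_multipart_upload(
    Bucket=BUCKET,
    Key=TARGET_KEY,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)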

GoReplay - Upload to S3 does not work

I am trying to capture all incoming traffic on a specific port using GoReplay and upload it directly to S3.
I am running a simple file server on port 8000 and a gor instance using the (simple) command
gor --input-raw :8000 --output-file s3://<MyBucket>/%Y_%m_%d_%H_%M_%S.log
It does create a temporary file at /tmp/, but other than that, it does not upload anything to S3.
Additional information :
The OS is Ubuntu 14.04
AWS cli is installed.
The AWS credentials are defined within the environment
It seems the information you are providing, or the scenario you explained, is not complete. However, uploading a file from your EC2 machine to S3 is as simple as the command below.
aws s3 cp yourSourceFile s3://your-bucket
To see your file, use the command below:
aws s3 ls s3://your-bucket
However, S3 is object storage, and you can't use it for files that are continually being edited, appended to, or updated.

Problems reading files from an S3 bucket mounted to Ubuntu using s3fs

I am trying to use s3fs (https://github.com/s3fs-fuse/s3fs-fuse) to handle files being uploaded by a different process to an S3 bucket. I can see the files listed when I ls in that mounted directory, and the files show the correct timestamps and sizes. But trying to open the files, either in code or using vi or nano, just shows garbage such as
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#
If I curl the s3 link then I do see content so it is available and there do not seem to be permission issues.
I am able to create a new file in this directory and the content is saving fine - can view it in the S3 explorer.
Any thoughts? Is this an s3fs-fuse problem, and might I fare better using an alternative such as https://github.com/kahing/goofys?
I have signed up for the AWS EFS preview but no idea what the wait time is for that.

Merging pdf files stored on Amazon S3

Currently I'm downloading all my PDF files to my server and then using pdfbox to merge them together. It works perfectly fine, but it's very slow, since I have to download them all.
Is there a way to perform all of this on S3 directly? I'm trying to find a way to do it, if not in Java then in Python, but have been unable to do so.
I read the following:
Merging files on S3 Amazon
https://github.com/boazsegev/combine_pdf/issues/18
Is there a way to merge files stored in S3 without having to download them?
EDIT
The way I ended up doing it was with concurrent.futures, implemented using concurrent.futures.ThreadPoolExecutor. I set a maximum of 8 worker threads to download all the PDF files from S3.
Once all files were downloaded, I merged them with pdfbox. Simple.
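For reference, a minimal sketch of that download step (bucket and key names are placeholders; the merge itself still happens afterwards with pdfbox or a Python PDF library):

import os
from concurrent.futures import ThreadPoolExecutor

import boto3

BUCKET = "your-bucket"                                          # placeholder
PDF_KEYS = ["reports/a.pdf", "reports/b.pdf", "reports/c.pdf"]  # placeholder
DEST_DIR = "downloads"

s3 = boto3.client("s3")
os.makedirs(DEST_DIR, exist_ok=True)

def download(key):
    local_path = os.path.join(DEST_DIR, os.path.basename(key))
    s3.download_file(BUCKET, key, local_path)
    return local_path

# 8 worker threads, as described above.
with ThreadPoolExecutor(max_workers=8) as pool:
    local_files = list(pool.map(download, PDF_KEYS))

print(local_files)  # hand these paths to the merge step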
S3 is just a data store, so at some level you need to transfer the PDF files from S3 to a server and then back. You'll probably gain the best speed by doing your conversions on an EC2 instance located in the same region as your S3 bucket.
If you don't want to spin up an EC2 instance yourself just to do this then another alternative may be to make use of AWS Lambda, which is a compute service where you can upload your code and have AWS manage the execution of it.

Can I cUrl a file from the web to Amazon S3?

Is there a way to transfer a file from the web directly to my Amazon S3 account?
For example, I want to transfer a large RDF file from www.data.gov directly to Amazon S3 without having to download the file to my local machine first.
You need a server somewhere that will execute the curl command. The easiest way is probably to use a tool that I wrote for AWS EC2: https://github.com/mjhm/cURLServer. You can check out the docs on a live version at http://ec2-204-236-157-181.us-west-1.compute.amazonaws.com/
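If the server you use has Python available, a minimal sketch of streaming the download straight into S3 with requests and boto3, without writing it to local disk first, could look like this (the URL, bucket, and key are placeholders):

import boto3
import requests

SOURCE_URL = "https://example.com/large-dataset.rdf"  # placeholder
BUCKET = "your-bucket"                                 # placeholder
KEY = "datasets/large-dataset.rdf"

s3 = boto3.client("s3")

with requests.get(SOURCE_URL, stream=True) as resp:
    resp.raise_for_status()
    # upload_fileobj reads from the response stream and performs a multipart
    # upload, so the whole file never sits in memory or on local disk.
    s3.upload_fileobj(resp.raw, BUCKET, KEY)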