Merging pdf files stored on Amazon S3 - amazon-s3

Currently I download all my PDF files to my server and then use pdfbox to merge them. It works perfectly well, but it's very slow because I have to download them all first.
Is there a way to do all of this on S3 directly? I've been looking for a way to do it, in Python if not in Java, but haven't found one.
I read the following:
Merging files on S3 Amazon
https://github.com/boazsegev/combine_pdf/issues/18
Is there a way to merge files stored in S3 without having to download them?
EDIT
I ended up using concurrent.futures.ThreadPoolExecutor with a maximum of 8 worker threads to download all the PDF files from S3 in parallel.
Once all files were downloaded I merged them with pdfbox. Simple.
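The parallel download described in the edit can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `download_all` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever performs a single download.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(keys, fetch, max_workers=8):
    """Run fetch(key) for each key on a thread pool and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `keys`,
        # so the order of the later merge stays stable.
        return list(pool.map(fetch, keys))
```

With boto3, `fetch` could be a small function that calls `s3.download_file(bucket, key, local_path)` and returns `local_path`; the bucket name and paths would be your own. The returned list can then be handed to the merge step.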

S3 is just a data store, so at some level you need to transfer the PDF files from S3 to a server and then back. You'll probably get the best speed by doing the merge on an EC2 instance located in the same region as your S3 bucket.
If you don't want to spin up an EC2 instance yourself just for this, an alternative may be AWS Lambda, a compute service where you upload your code and AWS manages its execution.

Related

Streaming compression to S3 bucket with a custom directory structure

I have an application that needs to create a compressed file from different objects stored on S3. I would like to compress the objects on the fly, without first downloading them into a container and compressing them there. The files can be quite big, so I could easily run out of disk space, and there would also be the extra round trip of downloading the files to disk, compressing them, and uploading the compressed file to S3 again.
I would also like to place the files in different directories inside the output archive, so that a user who decompresses it sees them in different folders.
Since S3 has no concept of a physical folder structure, I am not sure whether this is possible, or whether there is a better way than downloading and re-uploading the files.
NOTE
My question is not about how to use AWS Lambda to export a set of big files. It is about how to export files from S3 without downloading the objects to a local disk, creating a zip file there, and uploading it to S3. I would simply like to zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case. I researched it for about two months and tried multiple approaches, but in the end I had to use ECS (EC2), because the zip file can be huge, around 100 GB.
Currently AWS has no native way to perform compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are about 3 GB in size, you can consider Lambda.
If your files are larger than 4 GB, I believe it is safer to use ECS or EC2 and attach more volume if compression needs more space/memory.
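On such an instance, the custom directory structure comes from the member names written into the archive, not from anything on S3. A minimal sketch using only Python's standard library, where `zip_streams` is a hypothetical helper and the in-memory streams are stand-ins for real object bodies:

```python
import io
import shutil
import zipfile


def zip_streams(entries, out, chunk_size=1024 * 1024):
    """Write each (archive_path, readable) pair from `entries` into a zip written to `out`."""
    with zipfile.ZipFile(out, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for archive_path, source in entries:
            # force_zip64 allows members larger than 4 GiB whose size is not known up front.
            with archive.open(archive_path, mode="w", force_zip64=True) as member:
                # Copy in chunks so a whole input never has to sit in memory or on disk.
                shutil.copyfileobj(source, member, chunk_size)


# Stand-in inputs; the archive paths decide the folders the user sees after unzipping.
entries = [
    ("images/big-file1", io.BytesIO(b"first")),
    ("videos/big-file3", io.BytesIO(b"third")),
]
archive_bytes = io.BytesIO()
zip_streams(entries, archive_bytes)
```

In practice each `source` might be the `Body` of a boto3 `get_object` response, which can be read in chunks. The bytes still pass through the machine doing the work; the sketch only avoids staging the inputs on local disk.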
Thanks,
Yes, there are at least two ways: using either AWS Lambda or EC2.
EC2
Since the aws-cli cp command can read from and write to standard streams, you can pipe an S3 object through any archiver using a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
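The same pipe idea can be sketched in Python with the standard library. `gzip_stream` is a hypothetical helper, and the in-memory streams below are stand-ins for an S3 download stream and an upload target:

```python
import gzip
import io
import shutil


def gzip_stream(source, sink, chunk_size=1024 * 1024):
    """Compress everything readable from `source` into `sink`, one chunk at a time."""
    with gzip.GzipFile(fileobj=sink, mode="wb") as compressed:
        shutil.copyfileobj(source, compressed, chunk_size)


# Stand-in streams; in practice `source` might be an S3 object body
# and `sink` a temporary file that is uploaded afterwards.
source = io.BytesIO(b"hello " * 1000)
sink = io.BytesIO()
gzip_stream(source, sink)
```

Closing the `GzipFile` writes the gzip trailer but leaves `sink` open, so the caller decides what happens to the compressed bytes next.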
AWS-Lambda
Since maintaining an EC2 instance just for compression may be too expensive, you can use Lambda for one-off compression jobs.
Keep in mind, though, that a Lambda invocation is limited to 15 minutes. So if your files are really huge, try this sequence:
To make sure each invocation finishes in time, compress the file in parts using Lambda
Merge the compressed parts on S3 into one file using Upload Part - Copy
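The merge step can be sketched with boto3's multipart upload calls. This is an illustration under assumptions: `concat_objects` is a hypothetical helper, `s3` is assumed to be a boto3 S3 client, and every source object except the last must be at least 5 MiB, which is the S3 minimum part size. Concatenated gzip members form a valid gzip stream, so separately gzipped parts can be joined this way.

```python
def concat_objects(s3, bucket, part_keys, dest_key):
    """Concatenate existing objects into dest_key server-side using Upload Part - Copy."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)["UploadId"]
    parts = []
    try:
        for number, key in enumerate(part_keys, start=1):
            # Each source object becomes one part; S3 copies it without a download.
            result = s3.upload_part_copy(
                Bucket=bucket,
                Key=dest_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append({"ETag": result["CopyPartResult"]["ETag"], "PartNumber": number})
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the incomplete upload does not linger in the bucket.
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
    return parts
```

A call might look like `concat_objects(boto3.client("s3"), "yours-bucket", ["part-1.gz", "part-2.gz"], "compressed_file")`, where the part keys are placeholders for whatever the Lambda invocations wrote.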

How to decompress split zip files on AWS S3?

I've got a file (4 GB) which is too big to upload to AWS S3 over an unstable internet connection, so I split it into several parts using WinZip.
So, file.csv became a series of files:
- file.z01
- file.z02
- ...
- file.z12
After uploading it on AWS S3 I need to unzip it. How do I do it?
You won't be able to do it without the help of an EC2 instance.
If you have already uploaded these small zip files, launch a new EC2 instance, download the files from S3 using curl or wget, combine them, and upload the result to S3 again.
Since you are using WinZip, consider launching a Windows-based instance, as it will be tough to find a Linux-based equivalent of WinZip.

block file system on S3

I am a little puzzled and hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (s3n) does not really work for this. I am not sure what the problem is, but my guess is that the reader cannot seek to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 with the s3:// URI, which is a block file system backed by S3, much like HDFS, and it worked great.
But I am a little worried after reading this documentation about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR. I create my files by launching an EC2 cluster and then use S3 as cold storage. But I am puzzled and not sure which filesystem I am using when I store my files on S3 with the s3:// URI scheme: is it EMRFS, or the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to and from S3 with the aws s3 cp command. There's also a sync command that makes it easy to synchronize data to and from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate directly with S3, there will be fewer "moving parts".
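One thing an app gains by talking to S3 directly is the ability to fetch only the bytes it needs, which is relevant to the seek concern raised in the question. A minimal sketch, assuming `s3` is a boto3 S3 client; `read_range` is a hypothetical helper name:

```python
def read_range(s3, bucket, key, start, end):
    """Return bytes start..end (inclusive) of an object using an HTTP Range request."""
    response = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    # Body is a stream; read() here returns only the requested slice.
    return response["Body"].read()
```

Whether a particular ORC reader issues requests like this depends on the filesystem layer it runs on, so treat this as an illustration of the S3 capability and not of any specific reader.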

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work ?
Thanks
The s3n:// URL scheme is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you need to either use a library that handles the s3:// URL format, such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object through HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL classes built into Java.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created bootstrap step for my EMR Job, that runs the script in 1).
This approach doesn't require making the S3 data public.

Allowing users to download files as a batch from AWS s3 or Cloudfront

I have a website that allows users to search for music tracks and download those they select as MP3s.
I have the site on my server and all of the MP3s on S3, distributed via CloudFront. So far so good.
The client now wants users to be able to select a number of music tracks and then download them all in bulk, as a batch, instead of one at a time.
Usually I would place all the files in a zip and then present the user with a link to that new zip file. In this case, as the files are on S3, that would require first copying all the files from S3 to my web server, processing them into a zip, and then serving the download from my server.
Is there any way I can create a zip on S3 or CloudFront, or is there some way to batch or group files into a zip?
Maybe i could set up an EC2 instance to handle this?
I would greatly appreciate some direction.
Best
Joe
I am afraid you won't be able to create the batches without additional processing. Firing up an EC2 instance to create a batch per user might be an option.
I am facing the exact same problem. So far the only thing I have been able to find is the AWS CLI's s3 sync command:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
In my case, I am using Rails + its Paperclip addon which means that I have no way to easily download all of the user's images in one go, because the files are scattered in a lot of subdirectories.
However, if you can group your user's files in a better way, say like this:
/users/<ID>/images/...
/users/<ID>/songs/...
...etc., then you can solve your problem right away with:
aws s3 sync s3://<your_bucket_name>/users/<user_id>/songs /cache/<user_id>
Keep in mind that you'll have to give your server the proper credentials so the S3 CLI tools can work without prompting for them.
And that should sort you.
Additional discussion here:
Downloading an entire S3 bucket?
S3 serves one object per HTTP request.
So the way to get the same effect is to download with multiple threads.
The Java API provides TransferManager for this:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html
You can get great performance with multiple threads.
There is no bulk download, sorry.