How do I scalably sort a compressed (gzip) S3 file using AWS Lambda?

How do I scalably sort a compressed (gzip) S3 file using AWS Lambda?
Is there an example of how to do it using Python?
I want to avoid downloading the file or renting an EC2 instance.

Related

How to make OHIF look at S3 for loading studies

I have built an object storage plugin to store Orthanc data in an S3 bucket in legacy mode. I am now trying to eliminate Orthanc's local storage of files and move it to S3 completely. I also have the OHIF viewer integrated, which is serving Orthanc data; how do I make it fetch from the S3 bucket? I have read that a JSON file of the DICOM file can be used to do this, but I don't know how, because the JSON file has the URL of each instance in the S3 bucket. How do I generate this JSON file, if this is the way to do it?

Streaming compression to S3 bucket with a custom directory structure

I have an application that needs to create a compressed file from different objects that are saved on S3. The issue I am facing is that I would like to compress the objects on the fly, without downloading the files into a container and doing the compression there. The reason is that the files can be quite big, so I can easily run out of disk space, and of course there is the extra round-trip time of downloading the files to disk, compressing them, and uploading the compressed file to S3 again.
It is worth mentioning that I would like to place the files in different directories inside the output compressed file, so that when a user decompresses it they see the files stored in different folders.
Since S3 does not have the concept of a physical folder structure, I am not sure if this is possible, or if there is a better way than downloading/uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading the objects to a local disk, create a zip file, and upload it to S3. I would like to simply zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I researched it for about two months and tried multiple approaches, but finally I had to use ECS (EC2) for my use case because the zip file can be huge, like 100 GB.
Currently AWS doesn't support a native way to perform compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are around 3 GB in size, you can consider Lambda to achieve your requirement, as sketched below.
If your files are more than 4 GB, I believe it is safe to do it with ECS or EC2 and attach more volume if it requires more space/memory for compression.
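For the Lambda-sized case, a minimal sketch in Python with boto3 could look like the following, assuming the objects fit comfortably in memory; the bucket names, keys, and the images/videos layout are placeholders taken from the example above:

import io
import zipfile

import boto3

s3 = boto3.client("s3")

# (source key, path inside the zip) - the archive path controls the folder layout
FILES = [
    ("big-file1", "images/big-file1"),
    ("big-file2", "images/big-file2"),
    ("big-file3", "videos/big-file3"),
]

def lambda_handler(event, context):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for key, arcname in FILES:
            obj = s3.get_object(Bucket="source-bucket", Key=key)
            # writestr keeps everything in memory, so this only works for
            # archives that fit within the Lambda memory limit
            archive.writestr(arcname, obj["Body"].read())
    buffer.seek(0)
    s3.upload_fileobj(buffer, "destination-bucket", "big-zip.zip")

For anything larger you would need a streaming zip approach or the ECS route described above.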
Thanks,
Yes, there are at least two ways: either using AWS-Lambda or AWS-EC2
EC2
Since aws-cli has support of cp command, you can pipe S3 file to any archiver using unix-pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
AWS-Lambda
Since maintaining and using an EC2 instance just for compressing may be too expensive, you can use Lambda for one-time compressions.
But keep in mind that Lambda has a lifetime limit of 15 minutes. So, if your files are really huge, try this sequence:
To make sure each piece can be compressed within the limit, compress the file in parts using Lambda
The compressed parts can then be merged on S3 into one file using Upload Part - Copy
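A rough sketch of that sequence in Python with boto3 is below. The bucket and key names, the 100 MB chunk size, and the event structure are placeholders, and it relies on the fact that concatenating independent gzip members still yields a valid gzip stream:

import gzip

import boto3

s3 = boto3.client("s3")
CHUNK = 100 * 1024 * 1024  # bytes of source data per Lambda invocation (placeholder)

def compress_range(event, context):
    # Compress one byte range of the source object into its own part object.
    part = event["part"]  # 0, 1, 2, ... (hypothetical event field)
    start, end = part * CHUNK, (part + 1) * CHUNK - 1
    obj = s3.get_object(Bucket="yours-bucket", Key="huge_file",
                        Range=f"bytes={start}-{end}")
    s3.put_object(Bucket="yours-bucket",
                  Key=f"parts/huge_file.gz.{part:05d}",
                  Body=gzip.compress(obj["Body"].read()))

def merge_parts(bucket, part_keys, target_key):
    # Concatenate the compressed part objects server-side with Upload Part - Copy.
    upload = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
    parts = []
    for number, key in enumerate(part_keys, start=1):
        resp = s3.upload_part_copy(Bucket=bucket, Key=target_key,
                                   UploadId=upload["UploadId"],
                                   PartNumber=number,
                                   CopySource={"Bucket": bucket, "Key": key})
        parts.append({"PartNumber": number, "ETag": resp["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(Bucket=bucket, Key=target_key,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})

Note that every copied part except the last has to be at least 5 MB, so the chunk size must be chosen so that each compressed piece stays above that limit.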

Loading a CSV file from S3 into Neo4j graph DB

I am seeking suggestions about loading CSV files from an S3 bucket into a Neo4j graph DB. In the S3 bucket the files are in csv.gz format. I need to import them into my Neo4j graph DB, which is on an EC2 instance.
1. Is there any direct way to load csv.gz into the Neo4j DB without unzipping it?
2. Can I set/add the S3 bucket path in the neo4j.conf file under dbms.directories.import, which by default is neo4j/import?
Kindly suggest some ideas for loading files from S3.
Thank you
You can achieve both of these goals with APOC. The docs give you two approaches:
Load from the GZ file directly, assuming the file in the bucket has a public URL
Load the file from S3 directly, with an access token
Here's an example of the first approach - the section after the ! is the filename within the archive to load, and this should work with .zip, .gz, .tar files, etc.
CALL apoc.load.csv("https://pablissimo-so-test.s3-us-west-2.amazonaws.com/mycsv.zip!mycsv.csv")

Compress file on S3

I have a 17.7 GB file on S3. It was generated as the output of a Hive query, and it isn't compressed.
I know that by compressing it, it'll be about 2.2 GB (gzip). How can I download this file locally as quickly as possible when the transfer is the bottleneck (250 kB/s)?
I've not found any straightforward way to compress the file on S3, or to enable compression on transfer in s3cmd, boto, or related tools.
S3 does not support stream compression, nor is it possible to compress the uploaded file remotely.
If this is a one-time process, I suggest downloading it to an EC2 machine in the same region, compressing it there, then uploading it to your destination.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
If you need this more frequently, see:
Serving gzipped CSS and JavaScript from Amazon CloudFront via S3
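For the one-time EC2 route, here is a minimal sketch in Python with boto3, assuming the instance is in the same region as the bucket; the bucket and key names are placeholders, and only the compressed bytes ever touch the local disk:

import gzip
import shutil

import boto3

s3 = boto3.client("s3")

def compress_in_region(bucket, source_key, target_key):
    # Stream the object through gzip so the uncompressed data never hits disk
    with open("/tmp/compressed.gz", "wb") as raw, \
         gzip.GzipFile(fileobj=raw, mode="wb") as gz:
        body = s3.get_object(Bucket=bucket, Key=source_key)["Body"]
        shutil.copyfileobj(body, gz)  # StreamingBody is file-like, so this streams
    s3.upload_file("/tmp/compressed.gz", bucket, target_key)

compress_in_region("my-bucket", "hive-output", "hive-output.gz")

After that, pulling the roughly 2.2 GB compressed object over the slow link is the only transfer left.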
Late answer, but I found this working perfectly.
aws s3 sync s3://your-pics .
find . -name "*.jpg" | while read -r file; do gzip "$file"; echo "$file"; done
aws s3 sync . s3://your-pics --content-encoding gzip --dryrun
This will download all the files in the S3 bucket to the machine (or EC2 instance), compress the image files, and upload them back to the S3 bucket.
Verify the data before removing the --dryrun flag.
There are now pre-built apps in Lambda that you can use to compress images and files in S3 buckets. Just create a new Lambda function, select a pre-built app of your choice, and complete the configuration.
Step 1 - Create a new Lambda function
Step 2 - Search for pre-built apps
Step 3 - Select the app that suits your needs and complete the configuration process by providing the S3 bucket names.

Is it possible to add files to Amazon S3 buckets using web URL as source?

I am trying to load data into one of my S3 buckets.
The file I am trying to load is a huge tarball on the web; I don't want to download it to my disk and then upload it to the S3 bucket again.
Is there any way that I can directly specify this URL and have it added to S3?
You have to "put" to S3; S3 does not "get" from a URL for you.
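Whatever does the "put", though, can stream the tarball straight through without saving it to disk. A minimal sketch with requests and boto3 (the URL, bucket, and key are placeholders), which could run on a small EC2 instance or, within its time limit, in Lambda:

import boto3
import requests

s3 = boto3.client("s3")

url = "https://example.com/huge-tarball.tar.gz"  # placeholder URL

# stream=True keeps the download out of memory, and upload_fileobj performs a
# multipart upload from the file-like response, so nothing is written to disk
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    s3.upload_fileobj(response.raw, "my-bucket", "huge-tarball.tar.gz")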