Can I concatenate two already gzipped files (using gzip) and then gunzip them?

Currently I download the gzipped files from remote servers, gunzip them individually, and then cat the results together to merge them.
I'm looking to speed this up by concatenating the gzipped files first and then gunzipping once.

Yes. The concatenation of gzip streams is also a valid gzip stream. The result of gunzipping is the concatenation of the uncompressed data.
You could have just tried it.
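For example, a quick check with Python's gzip module (file names and contents here are just illustrative), doing the equivalent of cat part1.gz part2.gz > combined.gz followed by gunzip combined.gz:

    import gzip

    # Two independently gzipped files (illustrative names and contents).
    with gzip.open("part1.gz", "wb") as f:
        f.write(b"hello ")
    with gzip.open("part2.gz", "wb") as f:
        f.write(b"world\n")

    # Byte-for-byte concatenation, same as `cat part1.gz part2.gz > combined.gz`.
    with open("combined.gz", "wb") as out:
        for name in ("part1.gz", "part2.gz"):
            with open(name, "rb") as part:
                out.write(part.read())

    # gzip treats the result as a multi-member stream and decompresses both parts.
    with gzip.open("combined.gz", "rb") as f:
        assert f.read() == b"hello world\n"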

Related

consume gzip files with databricks autoloader

I am currently unable to find a direct way to load .gz files via Autoloader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.
Therefore, I would like to know if there is a way to consume the content of a gzip file via Databricks Autoloader.

Streaming compression to S3 bucket with a custom directory structure

I have an application that needs to create a compressed file from different objects that are saved on S3. The issue I am facing is that I would like to compress the objects on the fly, without downloading the files into a container and doing the compression there. The reason is that the files can be quite big, so I can easily run out of disk space, and of course there is the extra round-trip time of downloading the files to disk, compressing them, and uploading the compressed file to S3 again.
It is worth mentioning that I would like to place the files in different directories inside the output compressed file, so that when a user decompresses it they see the contents organized into different folders.
Since S3 does not have the concept of a physical folder structure, I am not sure if this is possible, or if there is a better way than downloading/uploading the files.
NOTE
My issue is not about how to use AWS Lambda to export a set of big files. It is about how I can export files from S3 without downloading the objects to a local disk, create a zip file, and upload it back to S3. I would like to simply zip the files on S3 on the fly and, most importantly, be able to customize the directory structure.
For example,
inputs:
big-file1
big-file2
big-file3
...
output:
big-zip.zip
with the directory structure of:
images/big-file1
images/big-file2
videos/big-file3
...
I have almost the same use case as yours. I researched it for about two months and tried multiple approaches, but in the end I had to use ECS (EC2) for my use case because the zip file can be huge, around 100 GB.
Currently AWS doesn't offer a native way to perform the compression. I have talked to them and they are considering it as a feature, but no timeline has been given yet.
If your files are around 3 GB in size, you can consider Lambda to achieve your requirement.
If your files are more than 4 GB, I believe it is safer to do it with ECS or EC2 and attach more volume if the compression requires more space/memory.
Thanks,
Yes, there are at least two ways: either using AWS-Lambda or AWS-EC2
EC2
Since the aws-cli cp command can read from and write to standard streams, you can pipe an S3 file through any archiver with a Unix pipe, e.g.:
aws s3 cp s3://yours-bucket/huge_file - | gzip | aws s3 cp - s3://yours-bucket/compressed_file
AWS-Lambda
Since maintaining and using an EC2 instance just for compression may be too expensive, you can use Lambda for one-off compressions.
But keep in mind that Lambda has an execution time limit of 15 minutes. So, if your files are really huge, try this sequence:
Compress the file in parts using Lambda, so that each invocation stays within the limit and the file is guaranteed to get compressed
The compressed parts can then be merged on S3 into one file using Upload Part - Copy, as sketched below
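A rough boto3 sketch of that merge step, with made-up bucket and key names (every part except the last must be at least 5 MB for multipart copy to work):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                  # hypothetical
    target_key = "merged/big-archive.gz"  # hypothetical
    part_keys = ["tmp/part-0001.gz", "tmp/part-0002.gz", "tmp/part-0003.gz"]

    upload = s3.create_multipart_upload(Bucket=bucket, Key=target_key)
    parts = []
    for number, key in enumerate(part_keys, start=1):
        # Upload Part - Copy: the bytes are copied server-side, nothing is downloaded.
        result = s3.upload_part_copy(
            Bucket=bucket,
            Key=target_key,
            UploadId=upload["UploadId"],
            PartNumber=number,
            CopySource={"Bucket": bucket, "Key": key},
        )
        parts.append({"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]})

    s3.complete_multipart_upload(
        Bucket=bucket,
        Key=target_key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )

And if the parts are individual gzip streams, the merged object is itself a valid gzip file, as noted in the first answer above.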

How can I decompress a GZip compressed file in chunks in memory?

I can decompress a small GZip file in memory, but there are memory limitations on the cloud box this will run on. I can get around this by doing it in chunks (~32 KB). Is there an easy way to split up a GZip-compressed file without reading through it?
Thanks,
Marc
Yes, you can use zlib to read a gzip file in chunks. No, you cannot split a gzip file without decoding it.
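For example, a minimal sketch with Python's zlib (the file name, chunk size, and process() handler below are just placeholders):

    import zlib

    def gunzip_in_chunks(path, chunk_size=32 * 1024):
        # wbits = 16 + MAX_WBITS tells zlib to expect the gzip header/trailer
        # around the deflate data, so the file never sits in memory all at once.
        decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                yield decomp.decompress(chunk)
        yield decomp.flush()

    for piece in gunzip_in_chunks("big-file.gz"):
        process(piece)  # hypothetical downstream handler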

Most straightforward way of inflating gzip memory stream

I have a gzipped file that I need to read and decompress in my application. I just read through the zlib manual, and it appears that the zlib functions are able to operate on memory buffers, but the gzip interface is all file-based.
What is the most common method of dealing with gzipped files like this? Do I need to handle the gzip file format myself, pull out the deflated data, and pass it to the zlib functions?
Note: The reason file-based will not work is because the file is in an archive on a read-only medium, so I can't extract the file first and use the gzip functions from zlib. This is an embedded Linux system.
You need to "read through" the zlib manual again, this time reading through it. inflateInit2() has an option to decompress gzip streams.

Why does Amazon S3 return an Error 330 for simple files?

I have added the "Content-Encoding: gzip" header to my S3 files and now when I try to access them, it returns me a "Error 330 (net::ERR_CONTENT_DECODING_FAILED)".
Note that my files are simply images, js and css.
How do I solve that issue?
You're going to have to manually gzip them and then upload them to S3. S3 doesn't have the ability to gzip on the fly like your web server does.
EDIT: Images are already compressed so don't gzip them.
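For example, a small boto3 sketch (bucket and file names are hypothetical) that gzips a CSS file and uploads it so the Content-Encoding header and the body actually agree:

    import gzip
    import boto3

    s3 = boto3.client("s3")

    # Compress the asset yourself; S3 serves the bytes exactly as stored.
    with open("styles.css", "rb") as f:
        body = gzip.compress(f.read())

    s3.put_object(
        Bucket="my-bucket",
        Key="assets/styles.css",
        Body=body,
        ContentEncoding="gzip",  # must match what the body really is
        ContentType="text/css",
    )

Do the same for your JS, but leave the images alone, since they are already compressed.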
I don't know if you are using Grunt as a deployment tool, but if so, use this to compress your files:
https://github.com/gruntjs/grunt-contrib-compress
Then use this to upload the compressed files to Amazon S3:
https://github.com/MathieuLoutre/grunt-aws-s3
Et voilà!