Consume gzip files with Databricks Autoloader

I am currently unable to find a direct way to load .gz files via Autoloader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.
Therefore, I would like to know if there is a way to consume the content of a gzip file via Databricks Autoloader.
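One direction that seems workable is to keep reading the .gz files as binary content with Autoloader and decompress them in a UDF. The sketch below assumes each .gz file holds a single UTF-8 XML document; the paths, the gunzip helper, and the Delta target are placeholders rather than a confirmed recipe:

import gzip
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# 'spark' is the ambient Databricks session; all paths below are placeholders
input_path = "s3://my-bucket/landing/gz/"
checkpoint_path = "s3://my-bucket/_checkpoints/gz"
target_path = "s3://my-bucket/bronze/xml_raw"

@udf(StringType())
def gunzip(content):
    # 'content' holds the raw bytes of one .gz file read by the binaryFile reader
    return gzip.decompress(bytes(content)).decode("utf-8")

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")   # pick up the .gz files as binary
    .load(input_path)
    .withColumn("xml", gunzip("content"))        # decompress into a string column
)

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .start(target_path))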

Related

How to make OHIF look at S3 for loading studies

I have built an object storage plugin to store Orthanc data in an S3 bucket in legacy mode. I am now trying to eliminate local storage of Orthanc files and move everything to S3. I also have the OHIF viewer integrated, which is serving Orthanc data. How do I make it fetch from the S3 bucket? I have read that a JSON file describing the DICOM files can be used to do this, but I don't know how, because the JSON file needs the URL of each instance in the S3 bucket. How do I generate this JSON file, if this is the way to do it?
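In case it helps, here is a heavily hedged sketch of generating such a JSON file with boto3 presigned URLs so OHIF can fetch instances straight from S3. The presigned-URL part is standard boto3; the studies/series/instances structure, the UID fields, and the "dicomweb:" prefix shown are only illustrative and need to be checked against the JSON your OHIF version expects:

import json
import boto3

# Hypothetical bucket/prefix names; adjust to your Orthanc S3 layout
BUCKET = "my-orthanc-bucket"
PREFIX = "dicom/study-123/"

s3 = boto3.client("s3")

def presigned(key, expires=3600):
    # Time-limited URL so OHIF can fetch the instance directly from S3
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=expires
    )

# Collect one URL per DICOM instance stored under the study prefix
instances = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # the "dicomweb:" prefix is how some OHIF setups mark direct file URLs; verify for your version
        instances.append({"url": "dicomweb:" + presigned(obj["Key"])})

# The surrounding structure and UID values are illustrative only
manifest = {
    "studies": [{
        "StudyInstanceUID": "1.2.3",
        "series": [{"SeriesInstanceUID": "1.2.3.4", "instances": instances}],
    }]
}

with open("study.json", "w") as f:
    json.dump(manifest, f, indent=2)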

How can I upload a gzipped JSON file to BigQuery via the HTTP API?

When I try to upload an uncompressed JSON file, it works fine; but when I try a gzipped version of the same JSON file, the job fails with a lexical error resulting from a failure to parse the JSON content.
I gzipped the JSON file with the gzip command on Mac OS X 10.8, and I have set the sourceFormat to "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should gzipped JSON files be processed differently?
I believe that when using a multipart/related request it is not possible to submit binary data (such as the compressed file). However, if you don't want to use uncompressed data, you may be able to use a resumable upload.
What language are you coding in? The Python jobs.insert() API takes a media upload parameter, which you should be able to give a filename in order to do a resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at the source code here.
If you aren't using Python, the Google APIs client libraries for other languages should have similar functionality.
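For reference, a minimal sketch of that resumable-upload route using the google-api-python-client; the project, dataset, table, and file names are placeholders, and google.auth.default() is assumed to find usable credentials:

import google.auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Placeholders: adjust project/dataset/table and the local file name
PROJECT, DATASET, TABLE = "my-project", "my_dataset", "my_table"

creds, _ = google.auth.default()
bq = build("bigquery", "v2", credentials=creds)

job = {
    "configuration": {
        "load": {
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "destinationTable": {"projectId": PROJECT, "datasetId": DATASET, "tableId": TABLE},
        }
    }
}

# resumable=True sends the job metadata and the gzipped table data as separate streams
media = MediaFileUpload("data.json.gz", mimetype="application/octet-stream", resumable=True)
bq.jobs().insert(projectId=PROJECT, body=job, media_body=media).execute()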
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs
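If you take the Cloud Storage route, a load job only takes a few lines. The sketch below uses the newer google-cloud-bigquery client rather than the raw HTTP API, and the bucket, object, and table names are placeholders; BigQuery should pick up the gzip compression of the .gz object without an explicit flag:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: your bucket/object and destination table
uri = "gs://my-bucket/data.json.gz"
table_id = "my-project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)

# The gzip compression of the GCS object is recognized automatically
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish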

Compress file on S3

I have a 17.7 GB file on S3. It was generated as the output of a Hive query, and it isn't compressed.
I know that by compressing it, it will be about 2.2 GB (gzip). How can I download this file locally as quickly as possible when transfer is the bottleneck (250 kB/s)?
I haven't found any straightforward way to compress the file on S3, or to enable compression on transfer in s3cmd, boto, or related tools.
S3 does not support stream compression, nor is it possible to compress the uploaded file remotely.
If this is a one-time process, I suggest downloading it to an EC2 machine in the same region, compressing it there, then uploading it to your destination.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
If you need this more frequently, see Serving gzipped CSS and JavaScript from Amazon CloudFront via S3.
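If you do take the one-time EC2 route, a minimal boto3 sketch of the download/compress/upload hop might look like this (bucket and key names are placeholders):

import gzip
import shutil
import boto3

s3 = boto3.client("s3")

# Placeholders: the uncompressed Hive output and where the gzipped copy should go
SRC_BUCKET, SRC_KEY = "my-bucket", "hive-output/result.tsv"
DST_BUCKET, DST_KEY = "my-bucket", "hive-output/result.tsv.gz"

# 1. Download to the EC2 instance (same region as the bucket, so this hop is fast)
s3.download_file(SRC_BUCKET, SRC_KEY, "/tmp/result.tsv")

# 2. Compress locally
with open("/tmp/result.tsv", "rb") as src, gzip.open("/tmp/result.tsv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# 3. Upload the compressed copy, then pull that over the slow link instead
s3.upload_file("/tmp/result.tsv.gz", DST_BUCKET, DST_KEY)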
Late answer but I found this working perfectly.
aws s3 sync s3://your-pics .
find . -name "*.jpg" | while read -r file; do gzip "$file"; echo "$file"; done
aws s3 sync . s3://your-pics --content-encoding gzip --dryrun
This will download all the files in the S3 bucket to the machine (or EC2 instance), compress the image files, and upload them back to the S3 bucket.
Verify the data before removing the --dryrun flag.
There are now pre-built apps in Lambda that you could use to compress images and files in S3 buckets. So just create a new Lambda function and select a pre-built app of your choice and complete the configuration.
Step 1 - Create a new Lambda function
Step 2 - Search for prebuilt app
Step 3 - Select the app that suits your need and complete the configuration process by providing the S3 bucket names.
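If you want to see what such a function does under the hood, here is a hedged sketch of a handler reacting to S3 PUT events; the bucket wiring, the target prefix, and the idea of gzipping every new object are illustrative assumptions, not the pre-built app's actual code:

import gzip
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 PUT event; gzip the new object into a separate prefix
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        s3.put_object(
            Bucket=bucket,
            Key="compressed/" + key + ".gz",   # illustrative target prefix
            Body=gzip.compress(body),
            ContentEncoding="gzip",
        )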

Why does Amazon S3 return an Error 330 for simple files?

I have added the "Content-Encoding: gzip" header to my S3 files, and now when I try to access them, I get an "Error 330 (net::ERR_CONTENT_DECODING_FAILED)".
Note that my files are simply images, JS, and CSS.
How do I solve this issue?
You're going to have to gzip them manually and then upload them to S3. S3 doesn't have the ability to gzip on the fly like your web server does.
EDIT: Images are already compressed, so don't gzip them.
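For the CSS and JS, one way to do that manual step is a small boto3 script that compresses each asset and uploads it with matching Content-Encoding and Content-Type headers. This is only a sketch; the bucket and file names are placeholders:

import gzip
import boto3

s3 = boto3.client("s3")
BUCKET = "my-static-site"          # placeholder bucket name

def upload_gzipped(local_path, key, content_type):
    # Compress the text asset locally, then store it with the matching headers
    with open(local_path, "rb") as f:
        compressed = gzip.compress(f.read())
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=compressed,
        ContentType=content_type,
        ContentEncoding="gzip",    # only set this header for bodies that really are gzipped
    )

upload_gzipped("site.css", "css/site.css", "text/css")
upload_gzipped("app.js", "js/app.js", "application/javascript")
# Leave images (JPEG/PNG) alone; they are already compressed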
I don't know if you are using Grunt as a deployment tool, but if so, use this to compress your files:
https://github.com/gruntjs/grunt-contrib-compress
Then:
https://github.com/MathieuLoutre/grunt-aws-s3
to upload the compressed files to Amazon S3. Et voilà!

Is there a way to load a Gzipped file from Amazon S3 into Pentaho (PDI / Spoon / Kettle)?

Is there a way to load a Gzipped file from Amazon S3 into Pentaho Data Integration (Spoon)?
There is a "Text File Input" step that has a Compression attribute supporting Gzip, but this step can't connect to S3 as a source.
There is an "S3 CSV Input" step, but it has no Compression attribute, so it can't decompress the gzipped content into tabular form.
Also, there is no way to save the data from S3 to a local file. The downloaded content can only be "hopped" to another step, but no step can read gzipped data from a previous step; the Gzip-compatible steps all read only from files.
So, I can get gzipped data from S3, but I can't send that data anywhere that can consume it.
Am I missing something? Is there a way to decompress gzipped data from a non-file source?
Kettle uses VFS (Virtual File System) when working with files. Therefore, you can fetch a file through HTTP, SSH, FTP, zip, ... and use it as a regular, local file in all the steps that read files. Just use the right "URL". You will find more here and here, and a very nice tutorial here. Also, check out the VFS transformation examples that come with Kettle.
This is the URL template for S3: s3://<Access Key>:<Secret Access Key>@s3<file path>
In your case, you would use "Text file input" with the compression setting you mentioned, and the selected file would be:
s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip
I really don't know how, but if you really need this you can look into using S3 through the VFS capabilities that Pentaho Data Integration provides. I can see a vfs-providers.xml with the following content in my PDI CE distribution:
../data-integration/libext/pentaho/pentaho-s3-vfs-1.0.1.jar
<providers>
  <provider class-name="org.pentaho.s3.vfs.S3FileProvider">
    <scheme name="s3"/>
    <if-available class-name="org.jets3t.service.S3Service"/>
  </provider>
</providers>
You can also try the GZIP Input control in Pentaho Kettle; it is there.