Is there a way to load a Gzipped file from Amazon S3 into Pentaho (PDI / Spoon / Kettle)?

Is there a way to load a Gzipped file from Amazon S3 into Pentaho Data Integration (Spoon)?
There is a "Text File Input" that has a Compression attribute that supports Gzip, but this module can't connect to S3 as a source.
There is an "S3 CSV Input" module, but no Compression attribute, so it can't decompress the Gzipped content into tabular form.
Also, there is no way to save the data from S3 to a local file. The downloaded content can only be "hopped" to another Step, but no Step can read gzipped data from a previous Step; the Gzip-compatible steps all read only from files.
So, I can get gzipped data from S3, but I can't send that data anywhere that can consume it.
Am I missing something? Is there a way to unzip zipped data from a non-file source?

Kettle uses VFS (Virtual File System) when working with files. Therefore, you can fetch a file through http, ssh, ftp, zip, ... and use it as a regular, local file in all the steps that read files. Just use the right "url". More details, along with a very nice tutorial, can be found in the Pentaho/VFS documentation. Also, check out the VFS transformation examples that come with Kettle.
This is the URL template for S3: s3://<Access Key>:<Secret Access Key>#s3<file path>
In your case, you would use "Text file input" with the compression setting you mentioned, and the selected file would be:
s3://aCcEsSkEy:SecrEttAccceESSKeeey#s3/your-s3-bucket/your_file.gzip

I really don't know how, but if you really need this you can look at using S3 through the VFS capabilities that Pentaho Data Integration provides. In my PDI CE distribution I can see a vfs-providers.xml with the following content inside ../data-integration/libext/pentaho/pentaho-s3-vfs-1.0.1.jar:
<providers>
  <provider class-name="org.pentaho.s3.vfs.S3FileProvider">
    <scheme name="s3"/>
    <if-available class-name="org.jets3t.service.S3Service"/>
  </provider>
</providers>

You can also try the GZIP input step that is available in Pentaho Kettle.
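Not covered by the answers above, but if the VFS route is not an option in your PDI build, you could also stage the object locally before the transformation runs and point "Text file input" at the result. A minimal sketch, assuming Python with boto3 is available; bucket, key, and path names are placeholders:

# Sketch only: download the gzipped object from S3 and decompress it to a local file
# that "Text file input" (or "CSV file input") can then read. Bucket, key, and paths
# are placeholders; boto3 must be installed and AWS credentials configured.
import gzip
import shutil
import boto3

s3 = boto3.client("s3")
s3.download_file("your-s3-bucket", "your_file.gzip", "/tmp/your_file.csv.gz")

with gzip.open("/tmp/your_file.csv.gz", "rb") as src, open("/tmp/your_file.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)  # /tmp/your_file.csv is now plain CSV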

Related

how to decompress the filename.csv.Z file in Azure data factory

We have a filename.csv.Z file coming from the source (an SFTP server), and we need to unzip this file using Azure Data Factory and load it into an Azure SQL database. We have tried the compression type "ZipDeflate" for the source (SFTP) dataset, but it throws the error below:
“Can't find SFTP path 'xxxxxxxxxxxx.csv.Z'. Please check if the path exists. If the path you configured does not start with '/', note it is a relative path under the given user's default folder. No such file”
We need help decompressing the .Z file in Azure Data Factory; doing it with PowerShell would also work for us.
Data Factory only supports the bzip2, gzip, deflate, ZipDeflate, snappy, and lz4 compression formats.
The .Z format is not supported, so Data Factory cannot compress or decompress such files.
For more details, please reference:
Copy data from and to SFTP server using Azure Data Factory
Supported file formats and compression codecs in Azure Data Factory
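If you can stage the file on a VM, an Azure Function, or a Batch step before the copy activity, one way to do the decompression yourself is sketched below. This is only an illustration: it assumes Python and relies on GNU gzip, which can decompress compress-style .Z files; the paths are placeholders.

# Sketch: decompress a compress-style .Z file outside Data Factory before loading it.
# GNU gzip/gunzip can decompress .Z (Unix compress) files; paths are placeholders.
import subprocess

src = "/staging/filename.csv.Z"
subprocess.run(["gzip", "-d", "-f", src], check=True)  # produces /staging/filename.csv
# The resulting .csv can then be picked up by a normal Copy activity into Azure SQL.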
Hope this helps.

loading csv file from S3 in neo4j graphdb

I am seeking suggestions about loading CSV files from an S3 bucket into a Neo4j graph database. In the S3 bucket the files are in csv.gz format, and I need to import them into my Neo4j database, which runs on an EC2 instance.
1. Is there any direct way to load a csv.gz file into the Neo4j database without unzipping it?
2. Can I set/add the S3 bucket path in the neo4j.conf import directory setting, which by default is neo4j/import?
Kindly suggest some ideas for loading the files from S3.
Thank you
You can achieve both of these goals with APOC. The docs give you two approaches:
Load from the GZ file directly, assuming the file in the bucket has a public URL
Load the file from S3 directly, with an access token
Here's an example of the first approach. The section after the ! is the filename within the archive to load; this should work with .zip, .gz, .tar files, etc.
CALL apoc.load.csv("https://pablissimo-so-test.s3-us-west-2.amazonaws.com/mycsv.zip!mycsv.csv")
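For the second approach, if you would rather not embed the access key and secret in the URL, one alternative sketch (not from the original answer) is to generate a temporary pre-signed HTTPS URL with boto3 and pass it to apoc.load.csv exactly like the public-URL example above; bucket and key names are placeholders:

# Sketch: pre-sign a private S3 object so APOC can fetch it over plain HTTPS.
# Assumes boto3 is installed and AWS credentials are configured in the environment.
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "your-s3-bucket", "Key": "mycsv.csv.gz"},  # placeholder names
    ExpiresIn=3600,  # the link stays valid for one hour
)
print(url)  # use this URL in: CALL apoc.load.csv("<url>") YIELD map RETURN map LIMIT 5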

CSV Files from AWS S3 to MarkLogic 8

Can CSV files from an AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere first and then get ingested using MLCP?
Assuming you have CSV files in the S3 bucket and that each row of a CSV file is to be inserted as a single XML record (that wasn't clear in your question, but it is the most common use case). If your plan is instead to just pull the files in and persist them as CSV documents, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in that way. Either way, the MLCP documentation is very helpful for understanding this versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket as a local filesystem on an EC2 instance so that MLCP can reach the files without a separate copy step; Google is your friend if that's important. I personally haven't seen a production-stable method for that, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can be run and can reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
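As an illustration of the land-then-ingest flow described above, here is a rough sketch, assuming Python with boto3 and an MLCP installation on the same host; bucket names, paths, ports, and credentials are all placeholders:

# Sketch of "land the files locally, then run MLCP in import mode on them".
# Assumes boto3 plus a local MLCP install; every name below is a placeholder.
import os
import subprocess
import boto3

staging_dir = "/tmp/mlcp-staging"
os.makedirs(staging_dir, exist_ok=True)

s3 = boto3.client("s3")
s3.download_file("your-bucket", "data/records.csv", os.path.join(staging_dir, "records.csv"))

subprocess.run(
    [
        "mlcp.sh", "import",
        "-host", "localhost",
        "-port", "8000",
        "-username", "admin",
        "-password", "admin",
        "-input_file_path", staging_dir,
        "-input_file_type", "delimited_text",  # one document per CSV row
    ],
    check=True,
)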

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop itself to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then read it using the URL classes that Java supports natively.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
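If the map program happens to be written in Python (for example with Hadoop Streaming), the library route mentioned above might look like the following sketch using boto3; the bucket, the key, and the assumption that the EMR instance role allows s3:GetObject are all placeholders:

# Sketch: read a small static S3 object from inside a mapper with boto3.
# Bucket and key are placeholders; the EMR instance role must allow s3:GetObject.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="your-bucket", Key="path/to/lookup-data.txt")
lookup_text = obj["Body"].read().decode("utf-8")

# Parse once at mapper start-up and reuse for every input record.
lookup = dict(line.split("\t", 1) for line in lookup_text.splitlines() if line)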
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created bootstrap step for my EMR Job, that runs the script in 1).
This approach doesn't require making the S3 data public.

how can I upload a gzipped json file to bigquery via the HTTP API?

When I try to upload an uncompressed JSON file, it works fine; but when I try a gzipped version of the same JSON file, the job fails with a lexical error resulting from a failure to parse the JSON content.
I gzipped the JSON file with the gzip command on Mac OS X 10.8 and set sourceFormat to "NEWLINE_DELIMITED_JSON".
Did I do something incorrectly, or should gzipped JSON files be processed differently?
I believe that with a multipart/related request it is not possible to submit binary data (such as the compressed file). However, if you don't want to use uncompressed data, you may be able to use a resumable upload.
What language are you coding in? The Python jobs.insert() API takes a media upload parameter, which you should be able to give a filename to in order to do a resumable upload (which sends your job metadata and new table data as separate streams). I was able to use this to upload a compressed file.
This is what bq.py uses, so you could look at the source code here.
If you aren't using python, the googleapis client libraries for other languages should have similar functionality.
You can upload gzipped files to Google Cloud Storage, and BigQuery will be able to ingest them with a load job:
https://developers.google.com/bigquery/loading-data-into-bigquery#loaddatagcs
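For reference, a minimal sketch of that Cloud Storage route using the current google-cloud-bigquery Python client (which postdates this answer); project, dataset, table, and bucket names are placeholders:

# Sketch: load a gzipped newline-delimited JSON file from Cloud Storage into BigQuery.
# All names are placeholders; requires the google-cloud-bigquery package and credentials.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema instead
)
load_job = client.load_table_from_uri(
    "gs://your-bucket/data.json.gz",  # BigQuery detects the gzip compression itself
    "your-project.your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes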