Hive to read encrypted Parquet files

I have an encrypted Parquet file that I would like to read in Hive through an external table. If the file is not encrypted, I can read it without any problem.
Per PARQUET-1817, I should set parquet.crypto.factory.class to my implementation of the DecryptionPropertiesFactory interface, but I'm not quite sure where to put this setting. I tried a couple of places, but none of them worked. The example in PARQUET-1817 uses Spark; I tested it and it works without any issue there, so my implementation of the DecryptionPropertiesFactory interface must be fine.
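For reference, the kinds of places one would try look like the following sketch (com.example.MyDecryptionPropertiesFactory is a placeholder for an actual DecryptionPropertiesFactory implementation, and the table name is hypothetical):

```sql
-- At the session level, before querying the external table:
SET parquet.crypto.factory.class=com.example.MyDecryptionPropertiesFactory;

-- Or as a table property on the external table itself:
ALTER TABLE encrypted_events
  SET TBLPROPERTIES ('parquet.crypto.factory.class' = 'com.example.MyDecryptionPropertiesFactory');
```

The same key could also be placed in hive-site.xml so it applies cluster-wide.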
So now I'm wondering whether Hive supports PARQUET-1817 at all. If so, how should I configure it? I'm using Hive 3.1.3 with Hive Standalone Metastore 3.0.0.
Thanks.

Related

Spark unable to write to S3 Encrypted Bucket even after specifying the hadoopConfigs

When I try to write to an S3 bucket that is AES-256 encrypted from my Spark Streaming app running on EMR, it throws a 403 error. For whatever reason, the Spark session is not honoring the "fs.s3a.server-side-encryption-algorithm" config option.
Here is the code I am using:
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.access.key",accessKeyId);
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", secretKeyId);
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm","AES256");
When I use regular Java code with the AWS SDK, I can upload the files without any issues.
Somehow the Spark session is not honoring this setting.
Thanks
Sateesh
I was able to resolve it; it was a silly mistake on my part.
We need to set the following property as well:
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3.enableServerSideEncryption","true");

Move S3 files to Snowflake stage using Airflow PUT command

I am trying to find a solution to move files from an S3 bucket to Snowflake internal stage (not table directly) with Airflow but it seems that the PUT command is not supported with current Snowflake operator.
I know there are other options like Snowpipe but I want to showcase Airflow's capabilities.
COPY INTO is also an alternative solution but I want to load DDL statements from files, not run them manually in Snowflake.
This is the closest I could find but it uses COPY INTO table:
https://artemiorimando.com/2019/05/01/data-engineering-using-python-airflow/
Also: How to call snowsql client from python
Is there any way to move files from S3 bucket to Snowflake internal stage through Airflow+Python+Snowsql?
Thanks!
I recommend you execute the COPY INTO command from within Airflow to load the files directly from S3 instead. There isn't a great way to get files into an internal stage from S3 without hopping the files through another machine (such as the Airflow machine): you'd use SnowSQL to GET the files from S3 to local storage, then PUT them from local to the internal stage. The only way to execute a PUT to an internal stage is through SnowSQL.
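As a rough sketch of the recommended COPY INTO approach, run from Airflow (e.g. via the Snowflake operator or hook), assuming an external stage named my_s3_stage already points at the S3 bucket (the stage, table, and path names are all hypothetical):

```sql
-- Load the files straight from S3 via an external stage:
COPY INTO my_table
  FROM @my_s3_stage/incoming/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- If the files truly must land in an internal stage, the hop via a local
-- machine uses SnowSQL, e.g.:
--   snowsql -q "PUT file:///tmp/data/*.csv @my_internal_stage AUTO_COMPRESS=TRUE"
```

The COPY INTO route avoids the extra hop entirely, which is why it is the recommended path.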

CSV Files from AWS S3 to MarkLogic 8

Can CSV files from the AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere first and then get ingested using MLCP?
Assuming you have CSV files in the S3 bucket and that one row in the CSV file is to be inserted as a single XML record (that wasn't clear in your question, but it is the most common use case). If your plan is just to pull the files in and persist them as CSV files, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files in. Anyway, the MLCP documentation is very helpful in understanding this very versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket as a local filesystem on an EC2 instance so that MLCP can reach the files directly. Google is your friend if that's important; I personally haven't seen a production-stable method for it, but it's been a long time since I've tried.
Regardless, you need to make those files available on a supported source, most likely a filesystem location in this case, where MLCP can be run and can reach the files. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode. The documentation is very good for understanding all the options.
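As a sketch, once the CSVs have landed on a filesystem MLCP can reach, a delimited-text import might look like this (the host, credentials, and paths are hypothetical; adjust to your environment):

```shell
mlcp.sh import -host localhost -port 8000 \
  -username admin -password admin \
  -input_file_path /data/landing/csv \
  -input_file_type delimited_text \
  -output_uri_prefix /imported/
```

With -input_file_type delimited_text, MLCP creates one document per CSV row by default, matching the one-row-per-record use case above.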

Hive LLAP doesn't work with Parquet format

After finding out about Hive LLAP, I really want to use it.
I started an Azure HDInsight cluster with LLAP enabled. However, it doesn't seem to perform any better than normal Hive. My data is stored in Parquet files, and I only see ORC files mentioned in LLAP-related docs and talks.
Does it also support Parquet format?
Answering my own question.
We reached out to Azure support. Hive LLAP only works with the ORC file format (as of May 2017).
So with Parquet, either we have to use Apache Impala (https://impala.incubator.apache.org) as an alternative to LLAP for fast interactive queries, or change the stored file format to ORC.
Update: this is a work in progress and will no longer be the case with the next release of HDP. As of HDP 3.0, LLAP will support caching for the Parquet file format. This update should flow into HDInsight shortly after the 3.0 release.
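Converting an existing Parquet-backed table to ORC can be done with a one-off CTAS statement (the table names here are hypothetical):

```sql
-- Rewrite the Parquet table's data into an ORC-backed table so LLAP can cache it:
CREATE TABLE events_orc STORED AS ORC AS
SELECT * FROM events_parquet;
```

After validating the new table, the Parquet original can be dropped or kept as an archive.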

how to query amazon dynamodb using pentaho

I have tried to connect to Amazon DynamoDB from Pentaho using a generic database connection, but I'm unable to connect. How do I query Amazon DynamoDB using Pentaho?
Assumption: no DynamoDBInput or DynamoDBQuery Kettle plugins are available.
My Suggested Solutions:
You can write your own Kettle plugin using the AWS DynamoDB Java SDK.
Bonus: Contribute it to the community :)
You can write a Java class that does what you need using the AWS DynamoDB SDK. Build a jar file that includes all the dependencies and drop it into the Kettle lib directory.
Then use the "User Defined Java Class" step: create an instance of your class, pass parameters from the stream as input to the DynamoDB query, and pass any output and error codes/messages back to the stream.
Both of these solutions are reusable. I have used the second solution to connect to other data sources, and it works well.
Good luck.
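A minimal sketch of the helper class from the second suggestion, using the AWS SDK for Java v1 (the table name, key names, and class name are all hypothetical; credentials come from the default provider chain, and error handling is left to the caller):

```java
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

// Packaged into a fat jar, dropped into Kettle's lib directory, and
// instantiated from a "User Defined Java Class" step.
public class DynamoDbQueryHelper {

    private final AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

    // Queries the (hypothetical) "orders" table by its partition key and
    // returns the matching items for the UDJC step to push to the stream.
    public QueryResult queryByCustomer(String customerId) {
        QueryRequest request = new QueryRequest()
            .withTableName("orders")
            .withKeyConditionExpression("customerId = :cid")
            .withExpressionAttributeValues(
                Map.of(":cid", new AttributeValue().withS(customerId)));
        return client.query(request);
    }
}
```

The UDJC step would iterate over QueryResult.getItems() and copy each attribute into output row fields.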