Hive LLAP doesn't work with Parquet format

After finding out about Hive LLAP, I really want to use it.
I started an Azure HDInsight cluster with LLAP enabled. However, it doesn't seem to perform any better than normal Hive. My data is stored in Parquet files, and I only see ORC mentioned in LLAP-related docs and talks.
Does LLAP also support the Parquet format?

Answering my own question.
We reached out to Azure support: Hive LLAP only works with the ORC file format (as of May 2017).
So with Parquet we either have to use Apache Impala (https://impala.incubator.apache.org) as an alternative to LLAP for fast interactive queries, or convert the stored data to ORC.
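For the conversion route, a minimal sketch using a CTAS statement (the table names below are placeholders):

-- Copy the Parquet data into a new ORC table so LLAP can cache and accelerate it.
-- Table names are placeholders; partitioning and table properties would need to be carried over as required.
CREATE TABLE my_table_orc
STORED AS ORC
AS SELECT * FROM my_table_parquet;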

Update: this is currently a work in progress and will no longer be the case with the next release of HDP. As of HDP 3.0, LLAP will support caching for the Parquet file format. This should flow into HDInsight shortly after the 3.0 release.

Related

Hive to read encrypted Parquet files

I have an encrypted Parquet file that I would like to read in Hive through an external table. If the file is not encrypted, I can read it without any problem.
Per PARQUET-1817, I should set parquet.crypto.factory.class to my implementation of the DecryptionPropertiesFactory interface, but I'm not quite sure where to put this setting. I tried a couple of places, but none of them works. The example in PARQUET-1817 uses Spark; I tested it and it works without any issue in Spark, so my implementation of the DecryptionPropertiesFactory interface must be fine.
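For illustration, one obvious place to try is a session-level override in Hive, roughly like this (the factory class and table names below are placeholders for my own implementation):

-- attempted session-level override; the class and table names are placeholders
SET parquet.crypto.factory.class=com.example.MyDecryptionPropertiesFactory;
SELECT * FROM encrypted_parquet_table LIMIT 10;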
So now I'm wondering whether Hive supports PARQUET-1817. If so, how should I configure it? I'm using Hive 3.1.3 with Hive Standalone Metastore 3.0.0.
Thanks.

Redshift External tables via Hive metastore

I have a Redshift database set up, and we periodically archive data into S3. I would like to create Redshift external tables on top of these archived files. The AWS documentation suggests this can be done either via Athena or via a Hive metastore. Since Athena is quite expensive, I would like to do this via the Hive metastore, but I'm struggling with the connectivity.
Below are the links to the steps I followed (the DDL I ran is sketched after them):
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html
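Following those docs, the DDL looks roughly like this (the schema, database, host, table, columns, and IAM role below are placeholders):

-- external schema backed by the Hive metastore
CREATE EXTERNAL SCHEMA hive_archive
FROM HIVE METASTORE
DATABASE 'archive_db'
URI 'xx.xxx.xxx.xx' PORT 9083
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- external table over the archived files in S3
CREATE EXTERNAL TABLE hive_archive.orders_archive (
  order_id BIGINT,
  created_at TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-archive-bucket/orders/';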
Creating the external schema works fine, but while creating the table I get the following error:
Invalid operation: Hive Metastore error. HOST: XX.XXX.XXX.XX PORT: 9083 ERROR: Default TException.
Any idea what can be done here?

CSV Files from AWS S3 to MarkLogic 8

Can CSV files from an AWS S3 bucket be configured to go straight into MarkLogic, or do the files need to land somewhere first and then get ingested using MLCP?
I'll assume you have CSV files in the S3 bucket and that each row of a CSV file is to be inserted as a single XML record; that wasn't clear in your question, but it is the most common use case. If your plan is just to pull the files in and persist them as CSV, there are undocumented XQuery functions that could be used to access the S3 bucket and pull the files from there. In any case, the MLCP documentation is very helpful for understanding this versatile and powerful tool.
According to the documentation (https://developer.marklogic.com/products/mlcp) the supported data sources are:
Local filesystem
HDFS
MarkLogic Archive
Another MarkLogic Database
You could potentially mount the S3 bucket as a local filesystem on an EC2 instance so MLCP can reach the files without an intermediate copy; Google is your friend if that's important. I personally haven't seen a production-stable method for it, but it's been a long time since I tried.
Regardless, you need to make the files available on a supported source (most likely a filesystem location in this case) where MLCP can run and reach them. I suppose that's what you meant by having the files land somewhere. MLCP can process delimited files in import mode, and the documentation covers all the options well.

Save and Load models from S3

Any way to allow an H2O cluster to save/load directly to S3?
model.save('s3n://my-domain/gbm-from-the-future')
model.load('s3n://my-domain/gbm-from-the-future')
Historically, I have achieved this by:
- Saving to a filesystem off the cluster
- Syncing with S3
- Downloading from S3
- Loading from the filesystem
Obviously, there has to be a better way from the cluster itself.
According to the Python docs for h2o.save_model(), this is already supported (you did not mention which of the APIs you are using, so I am using Python as an example). Have you tried putting an S3 address in the file-location argument of the standard model save and load functions, h2o.save_model() and h2o.load_model()? If you find that this is not working, please file a bug report on the H2O JIRA.

Sentiment Analysis of Twitter

I want to do sentiment analysis of Twitter data using Apache Hive and Flume. I have a Twitter account and I have set up the configuration file, but the problem is the format of the data: it is not loading into Hive. Kindly help me; I have been working on this for a month.
It sounds like you are able to configure the Flume agent to fetch the data from Twitter, and your problem is the format of that data.
Apache Flume offers several sink types, two of which are useful for your requirement:
HDFS Sink
Hive Sink
Using HDFS Sink:
Configure the Flume agent with a TwitterSource and an HDFS sink.
Provide your Twitter OAuth details (i.e., keys) to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it in an HDFS path as JSON documents.
Once the data is available in HDFS, create a Hive external table that uses the JSON SerDe and a LOCATION clause pointing at that path (sketched below the link).
JSON SerDe Code link: https://github.com/cloudera/cdh-twitter-example/blob/master/hive-serdes/src/main/java/com/cloudera/hive/serde/JSONSerDe.java
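A minimal sketch of that table (the jar path, HDFS location, and column subset are assumptions; the full schema is in the linked example):

-- register the SerDe jar and define an external table over the tweets landed by the HDFS sink
ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';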
Using Hive Sink:
Flume can also write data directly into a Hive table using the Hive sink (a sketch of a suitable target table follows these steps), so the Flume agent is configured as follows:
TwitterSource --> Channel --> Hive Sink
Configure the Flume agent with a TwitterSource and a Hive sink.
Provide your Twitter OAuth details (i.e., keys) to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it directly in the Hive table; this path uses a SerDe as well.
The Hive sink has a serializer parameter that tells it which SerDe to use; the supported serializers are DELIMITED and JSON.
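As a rough sketch of the target table (Hive streaming, which the Hive sink relies on, generally needs a bucketed, transactional ORC table; the database, table, columns, and bucket count below are assumptions):

-- target table for the Flume Hive sink; names and bucketing are placeholders
CREATE TABLE twitter_db.tweets (
  id BIGINT,
  created_at STRING,
  text STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');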
So configure your Flume agent using either of the approaches above.
See the Flume User Guide for more details about the HDFS and Hive sink parameters:
https://flume.apache.org/FlumeUserGuide.html
You can try adding this jar file (it contains the JSON SerDe referenced above):
hive-serdes-1.0-SNAPSHOT.jar
You can follow the blog post below for a complete walkthrough of sentiment analysis on tweets with Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/