Can Spark SQL be executed against Hive tables without any Map/Reduce (/Yarn) running?

It is my understanding that Spark SQL reads HDFS files directly - no need for MapReduce here. Specifically, none of the MapReduce-based Hadoop InputFormats/OutputFormats are employed (except in special cases like HBase).
So then, are there any built-in dependencies on a functioning Hive server? Or is it only required to have
a) Spark Standalone
b) HDFS and
c) Hive metastore server running
i.e. YARN/MRv1 are not required?
The Hadoop-related I/O formats for accessing Hive files seem to include:
TextInputFormat / TextOutputFormat
ParquetFileInputFormat / ParquetFileOutputFormat
Can Spark SQL/Catalyst read Hive tables stored in those formats - with only the Hive metastore server running?

Yes.
The Spark SQL Readme says:
Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
This is implemented by depending on Hive libraries for reading the data, but the processing happens inside Spark, so there is no need for MapReduce or YARN.
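For concreteness, here is a minimal HiveQL sketch (table names, columns, and HDFS paths are made up) of the kind of DDL and query a HiveContext or the spark-sql shell can run with only (a), (b) and (c) above in place; Spark performs the scan itself and no MapReduce or YARN jobs are launched:

    -- Text-format table (TextInputFormat/TextOutputFormat under the hood)
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_text (
      ts     STRING,
      page   STRING,
      status INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/web_logs_text';

    -- Parquet-format table (ParquetFileInputFormat/OutputFormat under the hood)
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_parquet (
      ts     STRING,
      page   STRING,
      status INT
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/web_logs_parquet';

    -- Both are readable through Spark SQL/Catalyst via the metastore
    SELECT page, COUNT(*) AS hits
    FROM web_logs_parquet
    GROUP BY page;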

Related

Apache Atlas: when using Impala, Spark and Presto to operate on Hive, the Hive hook is not triggered

I'm testing the Apache Atlas Hive integration. Our company operates on Hive with Presto and Impala, and Atlas cannot find the tables or lineage created through Presto/Impala. The reason is that Presto and Impala do not trigger the Hive hook when they operate on Hive. Have you encountered such a situation?

ConvertJsonToSQL for Hive Insert query

I want to insert JSON into a Hive database.
I am trying to transform the JSON to SQL using the ConvertJsonToSQL NiFi processor. How can I use the PARTITION (....) part in my query?
Can I do this, or should I use the ReplaceText processor to build the query?
What version of Hive are you using? There are Hive 1.2 and Hive 3 versions of PutHiveStreaming and PutHive3Streaming (respectively) that let you put the data directly into Hive without having to issue HiveQL statements. For external Hive tables in ORC format, there are also ConvertAvroToORC (for Hive 1.2) and PutORC (for Hive 3) processors.
Assuming those don't work for your use case, you may also consider ConvertRecord with a FreeFormTextRecordSetWriter that generates the HiveQL with the PARTITION statement and such. It gives a lot more flexibility than trying to patch a SQL statement to turn it into HiveQL for a partitioned table.
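For illustration, the statement emitted per record by such a writer (the table name, partition column, and values below are hypothetical) might look something like the following, which a downstream processor such as PutHiveQL would then execute:

    INSERT INTO TABLE my_db.events
    PARTITION (dt = '2020-05-01')
    VALUES ('id-123', 'click', 42);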
EDIT: I forgot to mention that the Hive 3 NAR/components are not included with the NiFi release due to space reasons. You can find the Hive 3 NAR for NiFi 1.11.4 here.

Presto query engine with Azure Data Lake

I have a requirement to deploy a Presto server which can help me query data stored in ADLS in the Avro file format.
I have gone through this tutorial and it seems that Hive is used as a catalog/connector in Presto to query from ADLS. Can I bypass Hive and have any connector to extract data from ADLS?
Can I bypass Hive and have any connector to extract data from ADLS?
No.
Hive plays two roles here:
- storage for metadata. It contains information like:
  - schema and table name
  - columns
  - data format
  - data location
  (see the DDL sketch after this list)
- execution: it is capable of reading data from distributed file systems (like HDFS, S3, ADLS), and it tells how execution can be distributed.
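As a hypothetical illustration of the metadata role (the database, table, columns, and ADLS path below are made up, and the adl:// scheme assumes ADLS Gen1), the Hive catalog could hold a definition like this, which Presto's hive connector then uses to plan the query and read the Avro files itself:

    CREATE EXTERNAL TABLE analytics.events (
      event_id   STRING,
      event_type STRING,
      event_time BIGINT
    )
    STORED AS AVRO
    LOCATION 'adl://myaccount.azuredatalakestore.net/data/events/';

    -- then, from Presto:
    SELECT event_type, count(*) AS cnt
    FROM hive.analytics.events
    GROUP BY event_type;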

Loading or pointing to multiple parquet paths for data analysis with hive or prestodb

I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries on a given date range of this data. I don't want to do this in Spark because I want our QA team, which has no knowledge of Spark, to be able to do this. What I'd like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS, and run my queries against it. I have figured out how to create tables with Hive and point them to a single S3 path. But then that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is to figure out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive can support partitions, but my S3 files are not set up that way.
I have also looked into Presto, which looks to be the favored tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I did install it on my cluster and it works great. But it looks like you can't really load data into your tables and you have to rely on Hive to do that part. Is this the right way to use Presto?
I watched a Netflix presentation about their use of Presto with S3 in place of HDFS. If this works it's great, but I wonder how the data is moved into memory. At what point will the Parquet files be moved from S3 to the cluster? Do I need to have a cluster that can load the entire data into memory? How is this generally set up?
You can install Hive and create Hive tables with your data in S3, as described in the blog post here: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure Presto to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
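One way to handle the multi-day layout from the question is sketched below (the column names are invented; the paths mirror the question's layout). Because the folders do not follow Hive's dt=... naming convention, each day is attached explicitly as a partition with its own location, and Hive or Presto can then query across days without copying anything into HDFS first:

    CREATE EXTERNAL TABLE table_a (
      id  STRING,
      val DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3://mybucket/table_a/';

    ALTER TABLE table_a ADD IF NOT EXISTS
      PARTITION (dt = 'day_1') LOCATION 's3://mybucket/table_a/day_1/'
      PARTITION (dt = 'day_2') LOCATION 's3://mybucket/table_a/day_2/'
      PARTITION (dt = 'day_3') LOCATION 's3://mybucket/table_a/day_3/';

    -- an ad-hoc query over a range of days:
    SELECT dt, count(*) AS row_count
    FROM table_a
    WHERE dt IN ('day_1', 'day_2', 'day_3')
    GROUP BY dt;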
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mentioned.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby

Where to create staging data table in BigData environment?

I currently have Hadoop 2, Pig, Hive and HBase.
I have some input data, which I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which Big Data component (Pig/Hive/HBase) should I create the staging table? It will have data coming in based on a condition, and later we might want to run MapReduce jobs with complex logic on it.
Please assist.
Hive: if you have an OLAP kind of workload and don't need real-time read/write.
HBase: if you have an OLTP kind of workload and need to do real-time/streaming read/write. Some batch or OLAP processing can be done using MapReduce, and SQL-like querying is possible using Apache Phoenix.
You can run MapReduce jobs on both Hive and HBase.
Anywhere you want. Pig is not an option as it does not have a metastore. Hive if you want SQL-like queries; HBase based on your access patterns.
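For the Hive route, a hypothetical sketch (the column names, condition, and HDFS path are made up) is to expose the raw HDFS data as an external table and populate the staging table from it based on the condition; later MapReduce jobs or further Hive queries can then read the staging table:

    -- external table over the data already loaded into HDFS
    CREATE EXTERNAL TABLE raw_input (
      id     STRING,
      amount DOUBLE,
      status STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/hadoop/inputdata';

    -- staging table populated based on a condition
    CREATE TABLE staging_data STORED AS ORC AS
    SELECT id, amount, status
    FROM raw_input
    WHERE status = 'VALID';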
When you run a Hive query on top of the data, it is converted into MapReduce.
If you create the staging table in Hive, process it with Hive queries and not MapReduce; if you are going to use MapReduce, then use Pig instead - you will not benefit from creating a Hive table on top of the data.