HiveContext vs. Spark engine in Hive [duplicate]

The SparkSQL CLI internally uses HiveQL, whereas with Hive on Spark (HIVE-7292) Hive uses Spark as its backend engine. Can somebody shed some more light on how exactly these two scenarios differ, and on the pros and cons of each approach?

1. When SparkSQL uses Hive
SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to better optimize the queries it executes. Here Spark is the query processor.
2. When Hive uses Spark (see the JIRA entry HIVE-7292)
Here the data is accessed via Spark, and Hive is the query processor. So we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2, 2016.
3. A third option: use SparkSQL without Hive
Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
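As a rough illustration of options 1 and 3, the sketch below shows how a PySpark application might opt in to or out of the Hive Metastore. It assumes Spark 2.x or later, where `SparkSession.builder.enableHiveSupport()` takes the place of the `HiveContext` class from the 1.x releases discussed here. `make_session` is a hypothetical helper, and the builder is passed in so that the function itself does not need a Spark installation.

```python
def make_session(builder, use_hive_metastore):
    """Create a session from a SparkSession-style builder.

    builder: an object such as pyspark.sql.SparkSession.builder.
    use_hive_metastore: True for option 1 (table metadata comes from the
    Hive Metastore), False for option 3 (plain SparkSQL, no Hive).
    """
    if use_hive_metastore:
        # Option 1: Spark stays the query processor but reads table
        # definitions from the Hive Metastore.
        builder = builder.enableHiveSupport()
    # Option 3 falls through: only Spark's own catalog is available.
    return builder.getOrCreate()


# Usage with a real Spark installation (not executed here):
#   from pyspark.sql import SparkSession
#   spark = make_session(SparkSession.builder.appName("demo"), True)
#   spark.sql("SHOW TABLES").show()
```

In both cases Spark executes the query; the flag only decides where the table metadata comes from.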

SparkSQL vs. the Spark API: you can simply imagine you are in the RDBMS world.
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as its execution engine. SparkSQL uses Hive's syntax, so as a language I would say they are almost the same.
But Hive on Spark has much better support for Hive features, especially HiveServer2 and the security features. The Hive features in SparkSQL are really buggy. There is a HiveServer2 implementation in SparkSQL, but in the latest release (1.6.x) it no longer works with the hivevar and hiveconf arguments, and the username for logging in via JDBC doesn't work either.
See https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really a very low priority.
Sadly, Hive on Spark integration is not that easy either. There are a lot of dependency conflicts, such as
https://issues.apache.org/jira/browse/HIVE-13301
Also, when I'm trying out Hive with Spark integration, for debugging purposes I always start the Hive CLI like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
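The two commands above can also be scripted. The following is a minimal sketch in Python, not a tested setup: `hive_debug_invocation` is a hypothetical helper, and the `hive.execution.engine` setting is an addition that assumes you want to select the engine per session rather than in hive-site.xml.

```python
import os


def hive_debug_invocation(hive_home=".", engine="spark"):
    """Build the argv and environment for a debug Hive CLI session."""
    argv = [
        os.path.join(hive_home, "bin", "hive"),
        # Send DEBUG-level logs to the console, as in the commands above.
        "--hiveconf", "hive.root.logger=DEBUG,console",
        # Assumption: choose the execution engine for this session only.
        "--hiveconf", "hive.execution.engine=" + engine,
    ]
    env = dict(os.environ)
    # Ask Hadoop to put the user classpath ahead of its own jars.
    env["HADOOP_USER_CLASSPATH_FIRST"] = "true"
    return argv, env


argv, env = hive_debug_invocation()
print(" ".join(argv))
# To actually start the CLI: subprocess.call(argv, env=env)
```

Building the argument list separately from launching it makes it easy to log the exact command when chasing classpath conflicts.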
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization). Currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
I hope this helps you get a better idea of which direction to go.

Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL 
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL. 
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive’s parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive’s integration with authorization, monitoring, auditing, and other operational tools. 
3. Hive-Level Design 
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design. 
The approach of executing Hive’s MapReduce primitives on Spark that is different from what Shark or Spark SQL does has the following direct advantages: 
1. Spark users will automatically get the whole set of Hive’s rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive’s Spark execution engine.
3. It will also limit the scope of the project and reduce long-term maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.

Related

How does a Hive UDF run when a SQL query is running?

I'm confused about the execution of Hive UDFs in SQL.
Do the nodes get the JARs from HiveServer when a SQL query is compiled into an MR job? If some static variables are declared in the UDF, where do these static variables exist?

Yes, your understanding is right. Hive makes these JARs available on all its nodes by adding them to the distributed cache, as ADD JAR does.
You have to register the UDF so that Hive recognizes it. Registering a UDF creates an entry in the Hive metastore. This is usually done with something like CREATE FUNCTION <function_name> AS org.apache...
This should usually be available in hive-env.sh
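A minimal sketch of what such a registration statement might look like, written as a small Python helper that only builds the HiveQL text. The function name `my_lpad`, the class `com.example.udf.MyLpad` and the JAR location are hypothetical placeholders, and the `USING JAR` clause assumes Hive 0.13 or later.

```python
def create_function_sql(name, class_name, jar_uri):
    """Return a HiveQL statement that registers a permanent UDF."""
    # The JAR named here is what Hive ships to the nodes at query time.
    return "CREATE FUNCTION {} AS '{}' USING JAR '{}'".format(
        name, class_name, jar_uri
    )


stmt = create_function_sql(
    "my_lpad",                   # hypothetical function name
    "com.example.udf.MyLpad",    # hypothetical UDF class
    "hdfs:///udfs/my-udfs.jar",  # hypothetical JAR location
)
print(stmt)
```

The resulting statement can be run from the Hive CLI or through HiveServer2 like any other DDL.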

Does SchemaCrawler support Hive?

I am looking for a data visualization tool that supports Hive databases.
Does SchemaCrawler support Hive? If not, is there a roadmap to support it in the future?
Are there any other tools that support Hive for viewing its metadata?
Let me know.

Please take a look at the SchemaCrawler Database System Support, which states that SchemaCrawler supports any database that has a JDBC driver. It seems that Hive has a JDBC driver, so you can try it out.

Vora Spark shell syntax

Are there any programmatic differences between the Spark shell used for Vora and the regular Scala Spark syntax? I need to make sure I can use the widely available Spark examples. Thanks.

The Spark shell is the same. Using it with Vora means defining a new SAP SQL context that adds new functionality to the Spark SQL context.
So you can still use it to run Spark SQL scenarios that do not necessarily involve Vora.

Using Hive UDFs in Pig

Is there any reason not to use Hive UDFs in Pig 0.15?
I'm thinking mostly about performance, but if there are any other reasons I'd be happy to hear them.
For example, we have a simple Java implementation of lpad that we use. Should we bother keeping it, or can we use the Hive version?

Hive UDFs are supported in Pig 0.15. See below.
http://hortonworks.com/blog/announcing-apache-pig-0-15-0/

How to access the Impala Parser

Does Impala reuse the Hive SQL parser?
I am trying to write custom Java code to check query correctness in my application. I am searching for an API that can consume a SQL query and tell me whether it is grammatically correct for Impala.
How can I access the parser from custom Java code to check for query compatibility?

No, Impala does not reuse the Hive parser.
Further, Impala does not expose a Java API for checking if a query is grammatically correct.
The easiest thing to do is probably to submit an explain query via JDBC and check the result.
If you don't have a running Impala cluster, in theory you should be able to instantiate the scanner and parser as Impala does in the parser unit tests, but I can imagine it might be difficult to get that working as the Impala build/test environment is quite complicated. Note that this is not a supported API.
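The EXPLAIN approach can be sketched as follows. This is a minimal illustration in Python rather than Java, using the standard DB-API cursor interface, with an in-memory SQLite database as a stand-in so the example runs without a cluster. Against Impala you would instead obtain the cursor from a driver such as impyla, or use the same pattern over JDBC; `is_query_valid` is a hypothetical helper name.

```python
import sqlite3


def is_query_valid(cursor, sql):
    """Return True if the engine accepts `EXPLAIN <sql>`, else False."""
    try:
        cursor.execute("EXPLAIN " + sql)
        cursor.fetchall()
        return True
    except Exception:
        # DB-API drivers raise driver-specific error types on invalid SQL.
        return False


# Stand-in database; replace with a connection to the real engine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER)")
print(is_query_valid(cur, "SELECT id FROM t"))  # True
print(is_query_valid(cur, "SELEC id FROM t"))   # False
```

Note that an EXPLAIN usually also fails for unknown tables or columns, so this checks more than grammar alone, and the accepted syntax is that of whichever engine the cursor is connected to.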
