I am looking for a data visualization tool that supports the Hive database.
Does SchemaCrawler support Hive? If not, is there a roadmap to support it in the future?
Are there any other tools that support Hive for viewing its metadata?
Let me know.
Please take a look at the SchemaCrawler Database System Support, which states that SchemaCrawler supports any database that has a JDBC driver. It seems that Hive has a JDBC driver, so you can try it out.
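If you want to sanity-check the JDBC side first, here is a minimal sketch (host, port, database, and credentials are placeholders; it assumes the hive-jdbc driver and its dependencies are on the classpath) that opens a HiveServer2 connection and lists tables. If this works, SchemaCrawler should be able to use the same driver and URL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver; URL, user, and password are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive_user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // list the tables visible through the driver
            }
        }
    }
}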
The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend execution engine. Can somebody shed some more light on how exactly these two scenarios differ, and on the pros and cons of each approach?
When SparkSQL uses Hive
SparkSQL can use HiveMetastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries that it executes. Here Spark is the query processor.
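As a rough illustration (Spark 1.6-era API; the table name is a placeholder, and a hive-site.xml pointing at your metastore is assumed), this is what "SparkSQL with the Hive metastore" looks like from application code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class SparkSqlWithHiveMetastore {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sparksql-with-hive-metastore");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // HiveContext picks up table definitions from the Hive metastore (via hive-site.xml),
        // but the query below is executed by the Spark engine, not by Hive.
        HiveContext hiveContext = new HiveContext(sc.sc());
        DataFrame result = hiveContext.sql(
                "SELECT product, SUM(amount) AS total FROM sales GROUP BY product");
        result.show();
        sc.stop();
    }
}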
When Hive uses Spark: see the JIRA entry HIVE-7292
Here the data is accessed via Spark, and Hive is the query processor, so we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2, 2016.
There is a third option to process data with SparkSQL:
Use SparkSQL without using Hive. In this case SparkSQL does not have access to the metadata from the Hive metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
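For contrast, here is a rough sketch of option 3 (again Spark 1.6-era API; the file path and columns are made up): a plain SQLContext, no Hive metastore, with the schema inferred from the data itself.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlWithoutHive {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("sparksql-without-hive");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc.sc());
        // No metastore involved: the schema is inferred from the file on every run.
        DataFrame sales = sqlContext.read().json("hdfs:///data/sales.json");
        sales.registerTempTable("sales");
        sqlContext.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show();
        sc.stop();
    }
}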
For SparkSQL vs the Spark API, you can simply imagine you are in the RDBMS world:
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
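To make the analogy concrete, here is a small sketch (Spark 1.6-era API; the "orders" table and its columns are placeholders) of the same question asked in pure SQL versus composed with the Spark API:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;

public class SqlVsSparkApi {
    // SparkSQL style: state the whole result in one declarative SQL statement.
    static DataFrame sqlStyle(SQLContext sqlContext) {
        return sqlContext.sql("SELECT customer, amount FROM orders WHERE amount > 100");
    }

    // Spark API style: build the same result step by step, like procedural code.
    static DataFrame apiStyle(SQLContext sqlContext) {
        return sqlContext.table("orders")
                .filter(col("amount").gt(100))
                .select("customer", "amount");
    }
}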
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as its execution engine. SparkSQL uses Hive's syntax, so as a language, I would say they are almost the same.
But Hive on Spark has much better support for Hive features, especially HiveServer2 and security features. Hive feature support in SparkSQL is really buggy: there is a HiveServer2 implementation in SparkSQL, but in the latest release version (1.6.x), HiveServer2 in SparkSQL no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either...
See https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really very low-priority stuff...
Sadly, Hive on Spark integration is not that easy; there are a lot of dependency conflicts... such as
https://issues.apache.org/jira/browse/HIVE-13301
And when I'm trying Hive with Spark integration, for debugging purposes I always start the Hive CLI like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization); currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
Hope this helps you get a better idea of which direction you should go.
Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL.
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive’s parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive’s integration with authorization, monitoring, auditing, and other operational tools.
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design.
The approach of executing Hive’s MapReduce primitives on Spark, which is different from what Shark or Spark SQL does, has the following direct advantages:
1. Spark users will automatically get the whole set of Hive’s rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive’s Spark execution engine.
3. It will also limit the scope of the project and reduce long-term maintenance by keeping Hive-on-Spark congruent to Hive on MapReduce and Tez.
How can I see different versions of HBase data in Hive?
As per my understanding, when using HBaseStorageHandler only the latest version of HBase data is available in Hive. Is my understanding correct/up to date?
Is there any way to access different versions of HBase data using Hive?
Thanks in advance :)
(New to Hbase-Hive Integration)
That would depend on the version of Hive that you are using.
Prior to Hive 1.1, HBase timestamps were not accessible through the Hive-HBase integration [1] (related: [2]).
So the answer is: you need Hive 1.1 or higher.
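As a rough sketch of what that looks like (assuming the ":timestamp" column mapping added by [1] is available in your Hive 1.1+ build; table, column family, and qualifier names are placeholders), you can expose the cell timestamp as a regular Hive column:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HBaseTimestampViaHive {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive_user", "");
             Statement stmt = conn.createStatement()) {
            // Map the HBase row key, one column, and the cell timestamp into Hive columns.
            // The ":timestamp" token is the mapping this sketch assumes is available.
            stmt.execute("CREATE EXTERNAL TABLE hbase_with_ts (rowkey STRING, val STRING, ts TIMESTAMP) "
                    + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                    + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val,:timestamp') "
                    + "TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table')");
            // Note: this exposes the timestamp of the version Hive reads (the latest one),
            // not every stored version of the cell.
            try (ResultSet rs = stmt.executeQuery("SELECT rowkey, val, ts FROM hbase_with_ts")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getTimestamp(3));
                }
            }
        }
    }
}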
Hope it helps.
[1] https://issues.apache.org/jira/browse/HIVE-2828
[2] https://issues.apache.org/jira/browse/HIVE-8267
Not a 100% answer, but some directions. In normal life, HBase is always about special cases.
Here is a slightly outdated but really simple article for understanding the approach:
http://hortonworks.com/blog/hbase-via-hive-part-1/
So in practice you can implement any InputFormat or OutputFormat you need.
But this is related to the MapReduce gears.
In principle Spark can always rely on an InputFormat too, so the question is only about your special case.
Another good idea is depicted here: http://www.slideshare.net/HBaseCon/ecosystem-session-3a
So snapshots could help to take the state of the tables you really need, and then you are free to use any gear to connect Hive with HBase as long as it follows the standards.
In general, the basic idea is to tune the gears which connect Hive to your HBase data so that they apply the needed version filters for you. This does not depend so much on versions, as this interface is pretty stable.
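To make the "version filter" idea concrete, here is a rough sketch with the plain HBase client API (table, family, and qualifier names are placeholders); whatever gear you put between Hive and HBase ultimately has to configure a Scan like this one to see more than the latest version:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadAllVersions {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_hbase_table"))) {
            Scan scan = new Scan();
            scan.setMaxVersions(); // return all stored versions, not just the latest
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    List<Cell> cells = row.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("val"));
                    for (Cell cell : cells) {
                        System.out.println(Bytes.toString(row.getRow()) + "\t"
                                + cell.getTimestamp() + "\t"
                                + Bytes.toString(CellUtil.cloneValue(cell)));
                    }
                }
            }
        }
    }
}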
Hope this will help you.
I'm using a third-party tool to query our data stored in BigQuery. The third-party tool uses a BigQuery JDBC driver. I would like to take advantage of UDFs, but I do not see any documentation or support for UDFs in the JDBC driver. Are they supported? If not, is there an ETA?
We're currently only supporting UDFs through the API, but we do have work planned to support declarative definition (and persistence) of your functions. This won't be shipped before 2016.
Just out of curiosity, what tool are you using? Tableau?
There is an inline way to define a UDF, but it is "alpha", unsupported, and undocumented. Use at your own risk.
https://stackoverflow.com/a/36208489/2259571
I am using CloverETL Designer version 4.0.0.030M2 for one of my projects. I want to read from a MySQL database, do some comparison, and then write back to the database. But I can't find a MySQL reader in the tool, whereas a MongoDB reader is present, as well as a MySQL writer. Please help me with how to read from a MySQL database in a Clover graph. Thanks.
You can use the DbInputTable component (http://doc.cloveretl.com/documentation/UserGuide/index.jsp?topic=/com.cloveretl.gui.docs/docs/dbinputtable.html), which is a generic reader for all JDBC-enabled databases. For writing to a JDBC database you can use DbOutputTable.
Writers like MySQLDataWriter are used for fast bulk data writing using database-specific infrastructure.
By the way, the best place for asking CloverETL-related questions is http://forum.cloveretl.com/
Does Mondrian support NoSQL databases like MongoDB in the current version? I read some blogs and bug reports related to this.
Any help is appreciated.
Thanks,
Lokesh
Please read the following blog post from Julian Hyde, creator of Mondrian:
http://julianhyde.blogspot.com.es/2014/03/improvements-to-optiqs-mongodb-adapter.html
Here you can see that Julian has been working on a new approach that converts even complex SQL queries into MongoDB queries behind the scenes.
Mondrian does not directly support MongoDB at the moment. MongoDB does not have a JDBC implementation.
There are a few options. One of them can be set up if you have access to a Pentaho Data Integration server. You can use a thin JDBC implementation which will allow Mondrian to access a SQL-to-Mongo bridge.
There are certainly other ways to set this up, since there are a lot of data federation engines out there.
Not directly, as far as I know. Maybe someone is working on a dialect; is that even possible...? Interesting question though... It may be worth linking the blogs you found so far?
One solution, however, could be to use the Kettle JDBC driver; this driver works with Mondrian, and then the other end can be any ETL process. So you could use a MongoDB input step, etc.
There is Apache Drill. You can query MongoDB through standard SQL, and Drill has a JDBC driver, so it may be possible for Mondrian to use this driver.
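As a rough sketch of that idea (the connection string, the "mongo" storage plugin name, and the database/collection are assumptions about your setup), Drill lets you run plain SQL over a MongoDB collection through JDBC, which is the kind of interface Mondrian expects:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillMongoJdbc {
    public static void main(String[] args) throws Exception {
        // Assumes the Drill JDBC driver is on the classpath, a drillbit runs locally,
        // and a MongoDB storage plugin is registered under the name "mongo".
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, price FROM mongo.shop.products LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getDouble("price"));
            }
        }
    }
}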
Uwe