Is there any way to get table statistics information from the Hive metastore so that I can use it for processing via an API?
I found two ways to get statistics:
1. The Meta tool client
2. The API (the IMetaStoreClient interface)
Can anyone tell me the difference between these two? Or is there any other way to access the statistics?
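For context, Hive keeps its basic table statistics (numRows, totalSize, rawDataSize, numFiles) as string-valued entries in the table's parameters map, which a metastore client returns along with the table object. Below is a minimal sketch of turning those entries into numbers, assuming the parameters have already been fetched into a plain dictionary. The sample values are invented, and `parse_table_stats` is a hypothetical helper, not part of any Hive API.

```python
from typing import Dict, Optional

# Parameter keys Hive uses for basic table statistics.
STAT_KEYS = ("numRows", "totalSize", "rawDataSize", "numFiles")


def parse_table_stats(parameters: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Convert the string-valued statistics in a table's parameters to integers.

    Missing or non-numeric values become None. Negative values (assumed here
    to mean "statistics not computed") are also reported as None.
    """
    stats: Dict[str, Optional[int]] = {}
    for key in STAT_KEYS:
        raw = parameters.get(key)
        try:
            value = int(raw) if raw is not None else None
        except ValueError:
            value = None
        stats[key] = value if value is not None and value >= 0 else None
    return stats


# Invented sample, shaped like the parameters map of a metastore table object.
sample_parameters = {
    "numRows": "1250000",
    "totalSize": "734003200",
    "numFiles": "12",
    "transient_lastDdlTime": "1571234567",
}

stats = parse_table_stats(sample_parameters)
print(stats)
```

A `None` in the result usually means the statistic was never gathered for that table, so the caller would need to compute statistics first rather than treat it as zero.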
Related
I have huge data from different DB sources (Oracle, Mongo, Cassandra), plus event data available in Kafka. I use Tableau for analytics and am facing performance issues with this volume of data, so I am planning to store the data some other way while still using Tableau for visualization. I have multiple options and need some help to settle on an approach.
Option 1:-
Read DB data and store them in Parquet file and then expose it over Spark SQL or HiveQL or Presto SQL and let Tableau connect to this SQL.
Option 2:-
Read DB data and store them in Parquet file in S3 and then use AWS Athena for analytics and let Tableau connect to Athena.
Option 3:-
Read DB data and store them in Parquet file in S3 and then move to Redshift for analytics and let Tableau connect to Redshift.
I am not sure whether any of the above approaches would also be a good solution for analytics on streaming data (Kafka).
Note: I have multiple big tables and need joins between them.
I understand you have huge data from different sources, and you also have access to AWS. Then, you plan to use this data for analytics and dashboarding via Tableau.
Option 1 and 2
Your Options 1 and 2 are basically the same: AWS Athena and Hive rest on the same principle of creating tables over flat files via a metastore that stores the table definitions. Both Athena's Presto engine and Spark are distributed and highly efficient on huge data (terabytes). The main difference is the pricing model: Athena is serverless and billed by the amount of data scanned per query, whereas Spark may imply infrastructure costs.
However, neither option may perform well, as they are not OLAP systems designed for self-service BI (they are better suited to ad hoc queries over huge data).
You may also have trouble managing your data model with flat files and tables or views over them (data storage and compression won't be optimized for each table, which may impact Tableau performance).
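To make the Athena pricing model concrete, here is a rough back-of-the-envelope sketch. The numbers are assumptions to check against the pricing page for your region: a price of 5 USD per terabyte scanned, a 10 MB minimum charge per query, and a terabyte counted as 1024^4 bytes. `athena_query_cost` is a hypothetical helper, not an AWS API.

```python
MB = 1024 ** 2
TB = 1024 ** 4  # assumption: terabyte taken as 1024^4 bytes


def athena_query_cost(bytes_scanned: int,
                      price_per_tb: float = 5.0,      # assumed USD price per TB scanned
                      min_bytes: int = 10 * MB) -> float:  # assumed per-query minimum
    """Estimate the cost in USD of one query from the bytes it scans."""
    billed_bytes = max(bytes_scanned, min_bytes)
    return billed_bytes / TB * price_per_tb


# Scanning a whole 2 TB table versus reading only an eighth of it,
# for example because Parquet lets the engine skip unused columns.
full_scan = athena_query_cost(2 * TB)
pruned_scan = athena_query_cost(2 * TB // 8)
print(f"full scan: {full_scan:.2f} USD, pruned scan: {pruned_scan:.2f} USD")
```

Under this model, a dashboard that reruns the same wide queries on every refresh pays for every scan, which is one reason columnar formats and partitioning matter so much for cost.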
Option 3
Option 3 is better, as it is based on Redshift, which is designed to support OLAP workloads. You can connect Tableau directly to Redshift, but you'll suffer from latency and you may have trouble managing your cluster load depending on the number of users and/or requests. Still, it can work the way you describe it.
If you then have performance issues, you'll be able to create data source extracts from Redshift to Tableau later on. You can also implement an intermediate database to store pre-aggregated queries (= datamarts) and connect Tableau directly to it, which avoids running the same query on Redshift each time a dashboard is opened in Tableau (in that case, Redshift also caches queries).
Also, as you need to perform multiple joins, you'll be able to optimize data storage for such queries in Redshift by setting the right distribution and sort keys.
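Redshift exposes these storage choices in the table DDL as a distribution key (DISTKEY) and sort keys (SORTKEY). The sketch below generates such DDL for two tables that are joined on the same column, so that matching rows can be placed on the same node. The table and column names are invented for illustration, and `redshift_create_table` is a hypothetical helper, not part of any Redshift client library.

```python
from typing import Dict, Sequence


def redshift_create_table(name: str,
                          columns: Dict[str, str],
                          dist_key: str,
                          sort_keys: Sequence[str]) -> str:
    """Build a CREATE TABLE statement with DISTKEY and SORTKEY attributes."""
    for key in (dist_key, *sort_keys):
        if key not in columns:
            raise ValueError(f"{key!r} is not a column of {name}")
    column_sql = ",\n".join(f"    {col} {sql_type}" for col, sql_type in columns.items())
    return (
        f"CREATE TABLE {name} (\n{column_sql}\n)\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical tables joined on order_id: distributing both on the join
# column lets the join run without moving rows between nodes.
orders_ddl = redshift_create_table(
    "orders",
    {"order_id": "BIGINT", "customer_id": "BIGINT", "order_date": "DATE"},
    dist_key="order_id",
    sort_keys=["order_date"],
)
items_ddl = redshift_create_table(
    "order_items",
    {"order_id": "BIGINT", "product_id": "BIGINT", "quantity": "INTEGER"},
    dist_key="order_id",
    sort_keys=["order_id"],
)
print(orders_ddl)
print(items_ddl)
```

Which column makes a good distribution key depends on the actual join patterns and on how evenly its values are spread, so treat the choice above as a placeholder.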
Note that you can also access flat files directly from Redshift using Redshift Spectrum (via the Athena/Glue metastore).
Documentation:
https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
https://aws.amazon.com/fr/athena/pricing/
Can we access Spark external tables (MS SQL as the metastore, external files on Azure Data Lake) through the Hive Metastore service (Presto) from a Linux machine?
We are trying to access Spark Delta tables, stored as Parquet files on ADLS, through Presto. The scenario is below. I would like to know if there is a way to achieve this. We are doing this as a POC only, and we believe knowing the answer will take us to the next step.
Our central data repository consists entirely of Spark Delta tables created by many pipelines. The data is stored in Parquet format, and MS SQL is the external metastore. The data in these Spark tables is used by other teams and applications, and they would like to access it through Presto.
We learnt that Presto uses the Hive metastore service to access Hive table details. We tried accessing the tables from Hive (thinking that if this works, Presto will work too), but we ran into problems with the different file systems. We have set up Hadoop and Hive on a single Linux machine; the versions are 3.1.2 and 3.1.1 respectively. The Hive service connects to the SQL metastore and shows the results of a few basic commands. However, when it comes to accessing the actual data stored as Parquet in an ADLS path, it fails with a file system exception. I understand that the problem comes from the interaction of several file systems (ADLS, HDFS, Linux), but I cannot find any blog that guides us. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore should have the exact full path of the ADLS location where the files are stored. If so, how will my Hive/Hadoop on Linux understand the path?
If it can recognize the path, in which configuration file (any .xml) should I give the credentials for accessing the data lake?
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
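The exception above shows that the table location stored in the metastore uses the `dbfs:` scheme, for which a stock Hadoop installation has no FileSystem implementation. One possible direction, sketched here under assumptions, is to translate such locations into `abfss://` URIs of the form `abfss://<container>@<account>.dfs.core.windows.net/<path>`, which the Hadoop ABFS driver (hadoop-azure) understands, provided the Hadoop release in use ships that driver. The mount point, container, and storage account names below are invented, and `dbfs_to_abfss` is a hypothetical helper. Credentials for the storage account would typically be set in core-site.xml, for example through the `fs.azure.account.key.<account>.dfs.core.windows.net` property.

```python
from typing import Dict, Tuple

# Hypothetical mapping from DBFS mount points to (container, storage account).
MOUNTS: Dict[str, Tuple[str, str]] = {
    "/mnt/datalake": ("data", "examplestorageacct"),
}


def dbfs_to_abfss(location: str, mounts: Dict[str, Tuple[str, str]] = MOUNTS) -> str:
    """Rewrite a dbfs: table location as an abfss:// URI.

    Locations that do not use the dbfs: scheme are returned unchanged.
    """
    prefix = "dbfs:"
    if not location.startswith(prefix):
        return location
    # Accept both dbfs:/mnt/... and dbfs:///mnt/...
    path = "/" + location[len(prefix):].lstrip("/")
    # Try the longest mount point first so nested mounts resolve correctly.
    for mount in sorted(mounts, key=len, reverse=True):
        if path == mount or path.startswith(mount + "/"):
            container, account = mounts[mount]
            rest = path[len(mount):]
            return f"abfss://{container}@{account}.dfs.core.windows.net{rest}"
    raise ValueError(f"no known mount point for {location}")


print(dbfs_to_abfss("dbfs:/mnt/datalake/orders/dlvry_orders"))
```

This only addresses the path scheme. Reading Delta tables correctly from Hive or Presto is a separate question that this sketch does not cover.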
I want to design a web UI which fetches data from HDFS. I want to generate some reports from the data stored in HDFS, using my own custom report format. I am writing REST APIs to fetch the data, but running Hive queries gives latency issues. Hence I want a different approach, and I could think of two:
1. Using Impala to create tables. But I am not sure about REST support for Impala.
2. Using Hive, but with Spark instead of MR as the execution engine. spark-job-server provides REST support and can fetch data with Spark SQL.
Which of these approaches would be suitable, or is there a better approach?
Can anyone please help? I am very new to this.
I'd choose Impala if latency is the main consideration. It is dedicated to SQL processing on HDFS and does it well. As for the REST API and the application logic you are building, this seems to be a good example
As part of my Spark job I am storing the output to a Hive table on HDInsight. I now want to expose the data to any COTS tools that can consume an OData feed, like Tableau or other such tools. Does anyone have pointers on how this can be accomplished?
It's easy to do if the data is stored in Hive.
An HDInsight Spark cluster has a Thrift server set up that allows BI tools like Tableau/Power BI to process data via Spark.
See:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/
Can one Hive instance store different tables across HDFS clusters, and then run HiveQL on these tables?
My use case is that I have one Hive table on one HDFS cluster. I want to do some processing on it with HiveQL and have the output written to another HDFS cluster. I wish to achieve this directly with Hive alone, without running through a dump / copy / import process. Is that possible? I don't really think it is; however, I noticed a design page at:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27837073
which says:
"Note that, even today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different data centers also"
Apart from this, I failed to find anything related on Google.
Does anyone have any ideas on this? Thanks.
There are multiple ways to handle this. You can go with mirroring (using tools like Apache Falcon), in which case the data is stored in both clusters. If you want to query across clusters holding different tables without mirroring, use a tool like Apache Drill, which can join data from different data sources. It currently supports Hive, Mongo, JSON, Kudu, etc.