How to access the Metastore from beeline?

I need to run some SQL queries (as here) directly against the Metastore. PS: the SHOW/DESCRIBE commands are not enough.
How can I enable access to it as a database, or what is the database name of the Metastore? ... Is this possible nowadays (2019)?
NOTES
What is the Metastore? For me it is a very important element of the Hive architecture, and end users need some access to it... "All Hive implementations need a metastore service, where it stores metadata. It is implemented using tables in a relational database. By default, Hive uses a built-in Derby SQL server" [1]. Of course, in your context you need a "standard" Metastore. On my corporation's Hadoop cluster we are planning to standardize the Metastore (as a local, long-term standard), perhaps on PostgreSQL, and also a PostgREST API for external consumption of some SQL views on it.
The SQL definitions (table names, etc.) will be stable, and Metastore queries will be reliable, once the Metastore is a long-term local standard.
The Metastore is closely connected to Hive, where it is exposed through a Java API, but it is also a standard RDBMS and offers a standard (SQL) connection to the outside world. PS: my interest in the Metastore is in this external context.
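For illustration, a minimal sketch of what such external access could look like, assuming a PostgreSQL-backed Metastore reachable over JDBC. The host, database name and credentials are placeholders, and the exact table and column names (DBS, TBLS, ...) depend on the Hive version and on how the metastore schema was initialized:

import java.sql.DriverManager

object MetastoreTables {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details for a PostgreSQL-backed Metastore.
    val url  = "jdbc:postgresql://metastore-host:5432/hive_metastore"
    val conn = DriverManager.getConnection(url, "hive_ro", "secret")
    try {
      // DBS and TBLS belong to the standard metastore schema; on PostgreSQL
      // they are usually created as quoted, upper-case identifiers.
      val rs = conn.createStatement().executeQuery(
        """SELECT d."NAME" AS db_name, t."TBL_NAME", t."TBL_TYPE"
          |FROM "TBLS" t JOIN "DBS" d ON t."DB_ID" = d."DB_ID"
          |ORDER BY 1, 2""".stripMargin)
      while (rs.next()) {
        val db  = rs.getString("db_name")
        val tbl = rs.getString("TBL_NAME")
        val typ = rs.getString("TBL_TYPE")
        println(s"$db.$tbl ($typ)")
      }
    } finally conn.close()
  }
}

In practice, a read-only database user and a few SQL views over these tables (exposed, for example, through the PostgREST API mentioned above) would give external consumers this kind of access.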

Spark-shell solution
Spark accesses the Metastore under the hood and has first-class metadata support: a DataFrame's schema property exposes column names, types, etc., and each field offers a getComment method.
See https://stackoverflow.com/a/57857021/287948
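A minimal spark-shell sketch of that approach, using a hypothetical table name (spark is the SparkSession that spark-shell provides):

// Inspect column names, types and comments of a table registered in the Metastore.
val fields = spark.table("mydb.mytable").schema.fields
fields.foreach { f =>
  val comment = f.getComment().getOrElse("")
  println(s"${f.name}  ${f.dataType.simpleString}  comment=$comment")
}

// The catalog API exposes similar metadata as a DataFrame.
spark.catalog.listColumns("mydb", "mytable").show(false)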

Related

Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

We would like to know whether we can access Spark external tables, with MS SQL as the metastore and the external files on Azure Data Lake, using the Hive Metastore service (Presto) from a Linux machine.
We are trying to access Spark Delta tables, whose Parquet files are on ADLS, through Presto. Below is the scenario. I would like to know if there is a possible way to achieve this. We are doing this as a POC only, and we believe knowing the answer will take us to the next step.
Our central data repository consists of Spark Delta tables created by many pipelines. The data is stored in Parquet format. MS SQL is the external metastore. The data in these Spark tables is used by other teams/applications, and they would like to access it through Presto.
We learned that Presto uses the Hive metastore service to access the Hive table details. We tried accessing the tables from Hive (thinking that if this works, Presto will work too), but we ran into problems with different filesystems. We have set up Hadoop and Hive on a single Linux machine; the versions are 3.1.2 and 3.1.1. The Hive service connects to the SQL metastore and shows the results of a few basic commands. However, when it comes to accessing the actual data stored as Parquet on an ADLS path, it fails with a filesystem exception. I understand that this problem comes from the interaction of many file systems (ADLS, HDFS, Linux), but I cannot find any blog posts that guide us. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore should have the exact full ADLS path where the files are stored. If it does, how will my Hive/Hadoop on Linux understand the path?
Even if it can recognize the path, in which configuration file should I provide the credentials for accessing the data lake (in some .xml)?
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
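Not a full answer, but a hedged sketch of the usual pattern for supplying ADLS Gen2 credentials to a Hadoop-compatible client, shown here through Spark; the storage account, container and key are placeholders, it assumes abfss:// paths and account-key auth, and for plain Hive the corresponding fs.azure.* property would go into core-site.xml instead. It does not address the dbfs:// scheme in the error above, which is Databricks-specific.

import org.apache.spark.sql.SparkSession

// Placeholder storage account, container and key; requires the hadoop-azure (ABFS) jars.
val spark = SparkSession.builder()
  .appName("adls-read-sketch")
  .config("spark.hadoop.fs.azure.account.key.mystorageacct.dfs.core.windows.net", "<storage-account-key>")
  .enableHiveSupport()
  .getOrCreate()

// Read the Parquet files directly from the lake path.
spark.read.parquet("abfss://mycontainer@mystorageacct.dfs.core.windows.net/path/to/table").show(5)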

Managing data in two relational databases in a single location

Background: We currently have our data split between two relational databases (Oracle and Postgres). There is a need to run ad-hoc queries that involve tables in both databases. Currently we are doing this in one of two ways:
ETL from one database to another. This requires a lot of developer time.
Oracle foreign data wrapper on our Postgres server. This is working, but the queries run extremely slowly.
We already use Google Cloud Platform (for the project that uses the Postgres server). We are familiar with Google BigQuery (BQ).
What we want to do:
We want most of our tables from both these databases (as-is) available at a single location, so querying them is easy and fast. We are thinking of copying over the data from both DB servers into BQ, without doing any transformations.
It looks like we need to take full dumps of our tables on a periodic basis (daily) and reload BQ, since BQ is append-only; the recently added DML support in BQ seems to be very limited.
We are aware that loading the tables as is to BQ is not an optimal solution and we need to denormalize for efficiency, but this is a problem we have to solve after analyzing the feasibility.
My question is whether BQ is a good solution for us, and if yes, how to efficiently keep BQ in sync with our DB data, or whether we should look at something else (like say, Redshift)?
WePay has been publishing a series of articles on how they solve these problems. Check out https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka.
To keep everything synchronized they:
The flow of data starts with each microservice’s MySQL database. These databases run in Google Cloud as CloudSQL MySQL instances with GTIDs enabled. We’ve set up a downstream MySQL cluster specifically for Debezium. Each CloudSQL instance replicates its data into the Debezium cluster, which consists of two MySQL machines: a primary (active) server and secondary (passive) server. This single Debezium cluster is an operational trick to make it easier for us to operate Debezium. Rather than having Debezium connect to dozens of microservice databases directly, we can connect to just a single database. This also isolates Debezium from impacting the production OLTP workload that the master CloudSQL instances are handling.
And then:
The Debezium connectors feed the MySQL messages into Kafka (and add their schemas to the Confluent schema registry), where downstream systems can consume them. We use our Kafka connect BigQuery connector to load the MySQL data into BigQuery using BigQuery’s streaming API. This gives us a data warehouse in BigQuery that is usually less than 30 seconds behind the data that’s in production. Other microservices, stream processors, and data infrastructure consume the feeds as well.
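For a concrete (though much simplified) picture of the consuming side of that pipeline, here is a hedged sketch of a downstream Kafka consumer tailing a Debezium change-event topic. The broker address, group id and topic name are placeholders, and a real deployment would use Avro with the schema registry rather than plain strings:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ChangeEventTail {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")    // placeholder broker address
    props.put("group.id", "change-event-tail")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("dbserver1.payments.payments"))  // hypothetical Debezium topic
    while (true) {
      // Each record is a Debezium change event describing an insert/update/delete.
      consumer.poll(Duration.ofSeconds(1)).forEach(r => println(s"${r.key()} -> ${r.value()}"))
    }
  }
}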

Use hive metastore service WITHOUT Hadoop/HDFS

I know the question is a little bit strange. I love Hadoop & HDFS, but I have recently been working on SparkSQL with the Hive Metastore.
I want to use SparkSQL as a vertical SQL engine to run OLAP queries across different data sources such as RDBMSs, Mongo, Elastic... without an ETL process. I then register the different schemas as external tables in the Metastore with the corresponding Hive storage handlers.
Moreover, HDFS is not used as a data source in my work, and MapReduce has already been replaced by the Spark engine. That sounds to me like Hadoop/HDFS is useless except as a base for the Hive installation. I don't want to install all of that.
I wonder: if I only start the Hive metastore service, without Hadoop/HDFS, to support SparkSQL, what kind of issues will happen? Would I be putting myself into the jungle?
What you need is "Hive Local Mode" (search for "Hive, Map-Reduce and Local-Mode" in the page).
Also this may help.
This configuration is only suggested if you are experimenting locally. But in this case you only need the metastore.
Also, from here:
Spark SQL uses the Hive Metastore, even when we don't configure it to. When not configured, it uses a default Derby DB as the metastore.
So this seems to be quite legitimate (see the sketch after this list):
Arrange your metastore in Hive.
Start Hive in local mode.
Make Spark use the Hive metastore.
Use Spark as an SQL engine for all data sources supported by Hive.
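A minimal sketch of the last two steps, assuming a metastore service already running at a placeholder thrift:// address and a hypothetical external table registered in it:

import org.apache.spark.sql.SparkSession

// Point Spark at the existing Hive metastore service (placeholder host/port).
val spark = SparkSession.builder()
  .appName("sparksql-over-hive-metastore")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Query an external table that was registered in the metastore
// (hypothetical name; backed by whatever storage handler was used).
spark.sql("SELECT count(*) FROM ext_mongo_orders").show()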

Accessing Redshift dataset from SQL Server

We have a vendor who has their data on a Redshift instance. We have a SQL Server instance which stores all our data. Is there a way that I could 'mount' the Redshift instance as a schema or something so that we are able to access the vendor data by connecting to the remote Redshift instance?
Imho it's not possible (and it is not possible for any combination of vendors).
Different DBMSs use different (proprietary) formats for storing their data, and communicating with other databases is out of scope for them.
For connecting to disparate data sources you could use some ETL tool (Microsoft has SSIS; you can also try CloverETL, Talend, or Pentaho).
(Or you could implement some offloading from Redshift to CSV and load those CSVs into your MSSQL, but again, to do this reliably, keep the data reasonably up to date, etc., you would need some scripting or an ETL tool.)
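As an illustration of the "some scripting" option, a hedged sketch of a one-shot copy using Spark's JDBC source and sink; hosts, credentials and table names are placeholders, and it assumes the Redshift and SQL Server JDBC drivers are on the classpath:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("redshift-to-mssql-copy").getOrCreate()

// Read the vendor table from Redshift (placeholder connection details).
val vendorData = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://vendor-cluster:5439/vendordb")
  .option("dbtable", "public.vendor_table")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// Write it into a staging table on SQL Server (placeholder connection details).
vendorData.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=staging")
  .option("dbtable", "dbo.vendor_table")
  .option("user", "loader")
  .option("password", "secret")
  .mode(SaveMode.Overwrite)
  .save()

Scheduling this periodically (or switching to incremental extracts) would still be needed to keep the copy reasonably up to date, as the answer notes.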

Hive metastore in hdfs

Usually you store the metadata in a JDBC-compliant database like MySQL. Is it possible to keep it in HDFS somehow, for example in HBase? I did not find anything useful in the wikis. Thanks.
No, this is not possible, as neither HDFS nor HBase supports a proper JDBC interface yet, which is what Hive's current metastore mechanism demands.