What does a catalog mean in Presto? - hive

I understand the concept of a DB, schemas within a DB, and tables within a schema. Where does a catalog fit into this whole space? Does this extend to Hive as well?

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside the catalog.
For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog and Hive contains a table clicks in database web, that table can be accessed in Presto as hive.web.clicks.
In short, a catalog is the mount point for a connector.
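To make the catalog.schema.table hierarchy concrete, here is a minimal sketch using the presto-python-client package, assuming the Hive connector is mounted as a catalog named hive (via a file such as etc/catalog/hive.properties on the coordinator); the host, user, web database, and clicks table are placeholders:

import prestodb  # pip install presto-python-client

# Connect with a default catalog ("hive") and schema ("web" = the Hive database "web").
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)
cur = conn.cursor()
# Fully qualified name: <catalog>.<schema>.<table>
cur.execute("SELECT * FROM hive.web.clicks LIMIT 10")
rows = cur.fetchall()

The same table could also be referenced as just clicks, because the connection's default catalog and schema already point at hive.web.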

Related

Export data table from Databricks dbfs to azure sql database

I am quite new to Databricks and looking for a smart way to export a data table from the Databricks gold schema to an Azure SQL database.
I am using Databricks as part of an Azure resource group; however, I do not find the Databricks data in any of the storage accounts within the same resource group. Does that mean it is physically stored in an implicit Databricks storage account/data lake?
Thanks in advance :-)
The tables you see in Databricks could have their data stored within that Databricks workspace's file system (DBFS) or somewhere external (e.g. a Data Lake, which could be in a different Azure resource group) - see here: Databricks databases and tables
For writing data from Databricks to Azure SQL, I would suggest the Apache Spark connector for SQL.
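As a hedged sketch of that suggestion (server, database, table, and secret names are placeholders), a write from a Databricks notebook using the Apache Spark connector for SQL Server and Azure SQL could look roughly like this:

# Runs in a Databricks notebook, where `spark` and `dbutils` are already defined.
df = spark.table("gold.my_table")  # hypothetical table in the gold schema

(df.write
   .format("com.microsoft.sqlserver.jdbc.spark")  # Apache Spark connector for SQL Server / Azure SQL
   .mode("append")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb")
   .option("dbtable", "dbo.my_table")
   .option("user", dbutils.secrets.get(scope="my-scope", key="sql-user"))
   .option("password", dbutils.secrets.get(scope="my-scope", key="sql-password"))
   .save())

Plain JDBC (format("jdbc")) works as well; the dedicated connector is intended to be faster for bulk writes.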

How to access Metastore from beeline?

I need to do some SQL queries (as here) directly against the Metastore. PS: the SHOW/DESCRIBE commands are not enough.
How do I enable access to it as a database, and what is the database name of the Metastore? ... Is that possible nowadays (2019)?
NOTES
What is the Metastore? For me it is a very important element of the Hive architecture, and the final user needs some access to it... "All Hive implementations need a metastore service, where it stores metadata. It is implemented using tables in a relational database. By default, Hive uses a built-in Derby SQL server", 1. Of course, in your context you need a "standard" Metastore. On my corporation's Hadoop cluster we are planning to standardize the Metastore (a local and long-term standard), perhaps PostgreSQL, and also a PostgREST API for external consumption of some SQL views from it.
The SQL definitions (table names, etc.) will be stable, and Metastore queries will be reliable, when the Metastore is a long-term local standard.
The Metastore is closely connected to Hive, where it is exposed through a Java API, but the Metastore is also a standard RDBMS and offers a standard (SQL) connection to the outside world. PS: my interest in the Metastore is in this external context.
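To illustrate that external SQL access, here is a rough sketch of querying a PostgreSQL-backed Metastore directly with psycopg2 (connection details are placeholders; DBS and TBLS are standard Hive metastore tables):

import psycopg2  # assumes a PostgreSQL-backed Metastore; connection details are placeholders

conn = psycopg2.connect(host="metastore-db.example.com", dbname="hive_metastore",
                        user="hive_ro", password="...")
cur = conn.cursor()
# DBS holds the Hive databases, TBLS the tables; join them to list everything.
cur.execute('''
    SELECT d."NAME" AS db_name, t."TBL_NAME", t."TBL_TYPE"
    FROM "TBLS" t
    JOIN "DBS" d ON t."DB_ID" = d."DB_ID"
    ORDER BY 1, 2
''')
for db_name, tbl_name, tbl_type in cur.fetchall():
    print(db_name, tbl_name, tbl_type)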
Spark-shell solution
Spark accesses the Metastore under the hood and has first-class metadata access: a DataFrame's schema property exposes column names, types, etc., and its fields offer a getComment method.
See https://stackoverflow.com/a/57857021/287948
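For reference, an equivalent PySpark sketch (database and table names are placeholders) that reads the same metadata through Spark's catalog API instead of querying the metastore database directly:

from pyspark.sql import SparkSession

# enableHiveSupport makes Spark read table metadata from the Hive Metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

print(spark.catalog.listDatabases())         # databases registered in the Metastore
print(spark.catalog.listTables("default"))   # tables in a given database
for col in spark.catalog.listColumns("my_table", "default"):  # placeholder table
    # `description` carries the column comment stored in the Metastore.
    print(col.name, col.dataType, col.description)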

Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

Would like to know if we can access Spark external tables (with MS SQL as the metastore and the external files on Azure Data Lake) using the Hive metastore service (Presto) from a Linux machine.
We are trying to access Spark Delta tables backed by Parquet files on ADLS through Presto. Below is the scenario. I would like to know if there is a possible way to achieve this. We are doing this as a POC only, and we believe knowing the answer will take us to the next step.
Our central data repository consists of Spark Delta tables created by many pipelines. The data is stored in Parquet format. MS SQL is the external metastore. Data in these Spark tables is used by other teams/applications, and they would like to access it through Presto.
We learnt that Presto uses the Hive metastore service to access Hive table details. We tried accessing the tables from Hive (thinking that if this works, Presto will also work), but we ran into problems with the different filesystems. We have set up Hadoop and Hive on a single Linux machine; the versions are 3.1.2 and 3.1.1. The Hive service connects to the SQL metastore and shows the results of a few basic commands. However, when it comes to accessing the actual data stored as Parquet on an ADLS path, it fails with a filesystem exception. I understand that this problem is an interaction of many file systems (ADLS, HDFS, Linux), but I am not finding any blogs that guide us. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore should hold the exact full path on ADLS where the files are stored. If it does, how will my Hive/Hadoop on Linux understand the path?
If it can recognize the path, in which configuration file should I give the credentials for accessing the data lake (some .xml file)? A sketch of that kind of configuration follows below.
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
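Purely as an illustration of the configuration question above (the account name and key are placeholders, and this is not a confirmed fix for the "dbfs" error), the hadoop-azure ABFS driver reads a storage account key from core-site.xml when the hadoop-azure jar is on the classpath; it applies to table locations that use the abfss:// scheme:

<property>
  <name>fs.azure.account.key.mystorageaccount.dfs.core.windows.net</name>
  <value>PLACEHOLDER-STORAGE-ACCOUNT-KEY</value>
</property>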

Audit hive table

I have a Hive table, let's call it table A. My requirement is to capture all DML and DDL operations on table A in table B. Is there any way to capture this?
Thanks in advance..
I have not come across any such tool; however, Cloudera Navigator helps to manage this. Refer to the detailed documentation.
Cloudera Navigator
Cloudera Navigator auditing supports tracking access to:
HDFS entities accessed by HDFS, Hive, HBase, Impala, and Solr services
HBase and Impala
Hive metadata
Sentry
Solr
Cloudera Navigator Metadata Server
Alternatively, if you are not using the Cloudera distribution, you can still access the hive-metastore log file under /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out and check the changes applied to the different tables.
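As a rough illustration of that log-scanning approach (not a supported audit API; the log path is the one above, and the table and operation names are illustrative):

# Scan the metastore log for lines mentioning a given table. Adjust the
# operation strings to whatever your metastore version actually logs.
LOG = "/var/log/hive/hadoop-cmf-hive-HIVEMETASTORE.log.out"
TABLE = "table_a"                                    # hypothetical table name
OPS = ("create_table", "alter_table", "drop_table")

with open(LOG) as log:
    for line in log:
        if TABLE in line and any(op in line for op in OPS):
            print(line.rstrip())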
I haven't used Apache Atlas yet, but from the documentation it looks like it has an audit store and a Hive bridge. That works for operational events as well.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/atlas-overview/content/apache_atlas_features.html

infer avro schema for BigQuery table load

I'm using the Java API, trying to load data from Avro files into BigQuery.
When creating external tables, BigQuery automatically detects the schema from the .avro files.
Is there a way to specify a schema/data file in GCS when creating a regular BigQuery table for data to be loaded into?
thank you in advance
You could create the schema definition manually with configuration.load.schema; however, the documentation says that:
When you load Avro, Parquet, ORC, Cloud Firestore export data, or Cloud Datastore export data, BigQuery infers the schema from the source data.
It seems the problem was that the table already existed and I did not specify CreateDisposition.CREATE_IF_NEEDED.
You do not need to specify the schema at all, just as with external tables.
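The thread uses the Java client, but as an illustrative sketch with the Python google-cloud-bigquery client (project, dataset, table, and GCS URI are placeholders): the schema is inferred from the Avro files, and CREATE_IF_NEEDED lets the load job create the table if it does not exist yet.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder destination table
uri = "gs://my-bucket/path/*.avro"           # placeholder Avro files in GCS

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,  # schema is inferred from the Avro files
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete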