Using Azure HDInsight and Hive


I have created an HDInsight cluster and want to upload a database through the portal and use Hive on it. What steps do I need to take?
I know how to use Hive, but I don't know how to connect the data uploaded to the blob container with Hive. By the way, I am using PowerShell.

You need to link the storage account that holds the container to the HDInsight cluster.
To do that, add the following property to core-site.xml:
<property>
<name>fs.azure.account.key.[STORAGE ACCOUNT NAME].blob.core.windows.net</name>
<value>[STORAGE ACCOUNT KEY]</value>
</property>
Once it is linked, you will be able to access that storage account.
To create a Hive table on data residing in blob storage, use an external Hive table whose location points to the blob directory of your data.
Example:
CREATE EXTERNAL TABLE [TABLE NAME] (col1 datatype, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://[CONTAINER NAME]@[STORAGE ACCOUNT NAME].blob.core.windows.net/PATH/OF/DATA/';
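As a rough illustration of how the pieces of the location fit together, here is a minimal Python sketch that assembles the wasb URI and the external-table statement as strings. It assumes the usual wasb://<container>@<account>.blob.core.windows.net/<path> form; the helper names, the container, account, table, and column names are all hypothetical placeholders, and the sketch only builds text, it does not talk to the cluster.

```python
def wasb_uri(container, account, path):
    """Build a wasb:// URI for a directory inside an Azure blob container."""
    # Assumed form: wasb://<container>@<account>.blob.core.windows.net/<path>/
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.strip('/')}/"


def external_table_ddl(table, columns, location):
    """Render a CREATE EXTERNAL TABLE statement for comma-delimited text data."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical names, for illustration only.
location = wasb_uri("mycontainer", "mystorageaccount", "PATH/OF/DATA")
ddl = external_table_ddl("my_table", {"col1": "string", "col2": "int"}, location)
print(ddl)
```

The printed statement can then be pasted into the Hive CLI once the storage account key has been added to core-site.xml.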

Related

How to read data partitions in S3 from Trino

I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specific Avro schema, which I put in a file on the local file system.
Then I created an external Hive table that points to the data location in S3 and to the Avro schema on the local file system.
The table is created.
Normally, I should then be able to query my data and partitions in S3 from Trino.
trino> select * from hive.default.my_table;
It returns only the column names.
trino> select * from hive.default."my_table$partitions";
It returns only the partition names.
Could you please suggest how I can read data partitions in S3 from Trino?
Note that I'm using Apache Hive 2. Even when I query the table in Hive to list its partitions, it returns OK and displays nothing. I think that with Hive 2 we should use the MSCK command.

In Hive, uploading partition folders and files to S3 and creating the table is not enough; the partition metadata must also be created. Folders under the table location are not necessarily mounted as partitions. To mount all existing sub-folders in the table location as partitions,
use the MSCK REPAIR TABLE command:
MSCK [REPAIR] TABLE tablename;
or the Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
This creates the partition metadata in the Hive metastore, and the partitions become available.
For more details about both commands, see the "Recover Partitions" section of the Hive documentation.

I faced the same issue. Once the table is created, you need to manually sync the partition metadata to the metastore using the following Trino command:
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html
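To make concrete what "registering partitions" means, here is a small Python sketch of the idea behind these commands: folders named key=value under the table location are turned into partition entries. This is not how Hive or Trino implement it, only an illustration; the function names, bucket, table name, and folder layout are hypothetical.

```python
def partition_spec(table_location, folder):
    """Parse Hive-style key=value folder names found below the table location."""
    if not folder.startswith(table_location):
        raise ValueError("folder is not under the table location")
    relative = folder[len(table_location):].strip("/")
    spec = {}
    for part in relative.split("/"):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a key=value partition folder: {part!r}")
        spec[key] = value
    return spec


def add_partition_sql(table, table_location, folder):
    """Render the ALTER TABLE statement that registers one folder as a partition."""
    spec = partition_spec(table_location, folder)
    cols = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({cols}) LOCATION '{folder}'"


# Hypothetical layout: s3://bucket/my_table/year=.../month=.../
table_location = "s3://bucket/my_table/"
folder = "s3://bucket/my_table/year=2023/month=01/"
spec = partition_spec(table_location, folder)
statement = add_partition_sql("my_table", table_location, folder)
print(statement)
```

Until a statement like this (or MSCK REPAIR TABLE, or sync_partition_metadata) has been run, the metastore has no record of the folder, which is why queries return only column names.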

Load Data into Hive on EMR

I created a cluster with the EMR service and then connected to it with PuTTY.
I also chose 'presto' when building the cluster.
How do I transfer a file from S3 or from my local computer into Hive?
For example, I need to load the student file, but when I run the following code I naturally get an error. Where do I put the student file?
hive> load data local inpath 'student' into table student_nopart;
I'm trying to follow an example from here:
https://github.com/weltond/LearnBasicBigDataTech

In your code,
load data local inpath ...
"local" refers to the EMR node, not your computer. You should first upload the file to the EMR node, using sftp or something similar, and then load it.
Or use this:
load data inpath 's3://bucket/path/to/file/' into table <tablename>;
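The difference between the two forms can be sketched in Python as a helper that picks the statement from the path. It assumes, for illustration only, that a path without a URI scheme is a file already copied onto the EMR node and that anything with a scheme (such as s3://) is remote storage; the function name and the example paths are hypothetical.

```python
from urllib.parse import urlparse


def load_data_sql(path, table):
    """Pick LOAD DATA LOCAL INPATH for node-local files, LOAD DATA INPATH otherwise."""
    # Assumption: no scheme means the file sits on the EMR node's own disk.
    scheme = urlparse(path).scheme
    local = "" if scheme else "LOCAL "
    return f"LOAD DATA {local}INPATH '{path}' INTO TABLE {table}"


# A file uploaded to the node (for example with sftp) versus a file in S3.
node_sql = load_data_sql("/home/hadoop/student", "student_nopart")
s3_sql = load_data_sql("s3://bucket/path/to/file/", "student_nopart")
print(node_sql)
print(s3_sql)
```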

If you already have data in S3, you can build a Hive table on top of the S3 location or alter an existing Hive table:
ALTER TABLE student SET LOCATION 's3://bucket/path/to/folder_with_table_files';

Presto query engine with Azure Data Lake

I have a requirement to deploy a Presto server that can query data stored in ADLS in Avro file format.
I have gone through this tutorial, and it seems that Hive is used as a catalog/connector in Presto to query ADLS. Can I bypass Hive and use some other connector to extract data from ADLS?

Can I bypass Hive and have any connector to extract data from ADLS?
No.
Hive plays two roles here:
1. Storage for metadata. It contains information such as:
   schema and table name
   columns
   data format
   data location
2. Execution:
   it is capable of reading data from distributed file systems (such as HDFS, S3, ADLS)
   it tells how execution can be distributed.

Hive External Table with Azure Blob Storage

Is there a way to create a Hive external table using a SerDe, with the location pointing to Azure Storage, organized so that the data uses the fewest possible blobs? For example, if I insert 10000 records, I would like it to create just 100 page blobs with 100 records each, instead of perhaps 10000 blobs with 1 record each. I am deserializing from the blobs, so fewer blobs will take less time. What would be the optimal format in Hive?

First, there is a way to create a Hive external table using a SerDe with the location pointing to Azure Blob Storage, but not directly. Please see the section "Create Hive database and tables"; the HiveQL looks like this:
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>' TBLPROPERTIES("skip.header.line.count"="1");
Note the following explanation of <storage location>:
<storage location>: the Azure storage location in which to save the data of Hive tables. If you do not specify LOCATION, the database and the tables are stored in the hive/warehouse/ directory in the default container of the Hive cluster. If you want to specify the storage location, it has to be within the default container for the database and tables. This location has to be given relative to the default container of the cluster, in the format 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the query is executed, the relative directories are created within the default container.
This means you can access an Azure Blob Storage location from Hive via the wasb protocol, which requires the hadoop-azure library that lets Hadoop use Azure Storage as a file system. If your Hive on Hadoop is not deployed on Azure, refer to the official Hadoop document "Hadoop Azure Support: Azure Blob Storage" to configure it.
As for the SerDe, it depends on the file format you use. For the ORC file format, for example, the HQL using OrcSerde looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
As for your second question, the optimal format in Hive is the ORC file format.

What do the Hive metastore and NameNode do in a cluster?

In a cluster with Hive installed, what do the metastore and the NameNode hold? I understand that the metastore has all the table schema, partition details, and metadata. What exactly is this metadata? What does the NameNode hold? And where does the metastore live in a cluster?

The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism. It also helps clients with reads and writes by receiving their requests and redirecting them to the appropriate DataNode.
The metadata that the metastore stores includes things like:
IDs of databases
IDs of tables
IDs of indexes
The creation time of an index
The creation time of a table
IDs of roles assigned to a particular user
InputFormat used for a table
OutputFormat used for a table, and so on.
Is this what you wanted to know?
It is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
HTH
P.S.: You might find the E/R diagram of the metastore useful.

Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if you run "CREATE TABLE xxx (...) PARTITIONED BY (...)" in Hive, the partitioning information is stored in the metastore database (Oracle, MySQL, ...).
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
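The two lookups above can be combined into a single join. The following Python sketch mimics that with an in-memory SQLite database standing in for the metastore: it recreates only the TBLS and PARTITION_KEYS columns used here and inserts the one example row by hand. The pkey_type and integer_idx values are assumptions made for the illustration, and a real metastore has many more columns and tables.

```python
import sqlite3

# In-memory stand-in for the metastore database (not a real Hive metastore).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbls (tbl_id INTEGER PRIMARY KEY, tbl_name TEXT);
    CREATE TABLE partition_keys (
        tbl_id INTEGER, pkey_name TEXT, pkey_type TEXT, integer_idx INTEGER
    );
    -- Rows corresponding to the employee_table example; values partly assumed.
    INSERT INTO tbls VALUES (8, 'employee_table');
    INSERT INTO partition_keys VALUES (8, 'region', 'string', 0);
""")

# Join table names to their partition keys in one query.
rows = conn.execute("""
    SELECT t.tbl_name, p.pkey_name, p.pkey_type
    FROM tbls t
    JOIN partition_keys p ON p.tbl_id = t.tbl_id
    ORDER BY t.tbl_name, p.integer_idx
""").fetchall()
print(rows)
```

Only this descriptive information lives in the metastore; the rows of employee_table themselves stay in HDFS.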

Metastore - It's a database that stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses a Derby database, but you can use any other database, such as MySQL or Oracle.
Use of the metastore: Whenever you fire a query from the Hive CLI, the execution engine gathers all the details about the table and creates an execution plan (job). These details come from the metastore. Finally, the execution engine sends the job to Hadoop. From there, the usual Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine to successfully execute the MR job.

The diagram above is an excellent way to understand how Hive and Hadoop communicate.
Regarding the Hive metastore (not a Hadoop metastore):
A metastore is not mandatory in your Hadoop environment; it is required only if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only. Hive uses it to store meta information about the database objects it creates, not the actual data, which is already in HDFS: Hive does not store data itself but uses data already stored in the file system.
A Hive implementation requires a metastore service backed by an RDBMS.
Regarding the NameNode (Hadoop NameNode):
It is a core part of Hadoop and behaves like a metastore for the cluster.
It is not an RDBMS. It stores file system meta information in the file system only.
