I have a requirement to build a data warehouse in Hive and use HBase to serve real-time access.
So I would like to know what the architecture for this would be.
Can I first dump the data into HBase, access it as a REST service, and create an external table in Hive to run Hive queries on it?
Will Hive be distributed, i.e. do I need to install Hive on all nodes of my cluster, or will it be central?
In answer to your questions:
Hive will be distributed.
For best performance, I would consider installing Hive on every node of the cluster. Hive translates HiveQL into MapReduce jobs - the jobs will be performed where the data is. If that's not possible, the data will have to move to the job. For the sake of response time, you'll want Hive on every node.
To create a Hive table that references data stored in HBase, you can check out the Hive - HBase Integration wiki. Here's a quick example:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
Hello :) I am preparing to move the entire contents of one HBase table to Hive. The table is very large (500 terabytes).
From my research there is HBase Export, but it only supports moving data between HBase and HBase (the files it drops in HDFS are not plain text, so Hive cannot read them directly).
Also, Hive's HBase storage handler cannot be used, because HBase is on a remote cluster and there are various security policies.
It would be nice if INSERT INTO syntax were supported as it is from Hive to Hive, but I am looking for another way. Is there a good way to separate each column of the HBase table with commas and drop it into HDFS?
You can try the ExportSnapshot tool to move data from HBase to HDFS on another cluster, e.g.:
$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://yourserver:8020/hbase_root_dir -mappers 16
Check this out for more details.
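Note that ExportSnapshot copies an existing snapshot, so you would first need to take one on the source cluster; a minimal sketch in the HBase shell, assuming the source table is named my_table:
hbase> snapshot 'my_table', 'MySnapshot'
hbase> list_snapshots
Once the copy finishes, the snapshot can be cloned into a table on the destination cluster with clone_snapshot, or processed further from HDFS.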
Are two Hive tables (native, external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (via CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My read queries against the external table are slow and consume DynamoDB read throughput, whereas queries against the native table are fast and consume no read throughput.
My questions:
Is this standard/best practice, i.e. create an external table mapped to a DynamoDB table, then create a CTAS table and run all read queries against the CTAS table?
Where or how do GSIs on DynamoDB come into the picture on the Hive side of things? Out of curiosity I tried to map an external Hive table column to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question #2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a native Hive table is populated (via CTAS) from a Hive external table that references a DynamoDB table, DynamoDB read throughput is not consumed when you query the native table from EMR. You do, however, need to account for periodically refreshing the data in the native Hive table.
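A minimal sketch of that pattern, assuming a hypothetical DynamoDB table named Orders with attributes OrderId and Total, and using the DynamoDB storage handler that EMR ships:
-- external table mapped to DynamoDB; reading it consumes DynamoDB throughput
CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);

-- native copy; subsequent reads hit HDFS/S3 instead of DynamoDB
CREATE TABLE orders_local STORED AS ORC AS
SELECT * FROM ddb_orders;
The native copy then has to be refreshed (re-run the CTAS, or INSERT OVERWRITE it) whenever the DynamoDB data changes.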
I want to create a Hive table/view that will access a BigSQL (BigInsights 4.2) table. Data will be loaded into the BigSQL table, and I'm trying to fetch that data from Hive. Is there any procedure to sync data between BigSQL and Hive tables?
It will be automatic. The tables already belong to Hive, so when you insert data into a Big SQL Hadoop table you should be able to see it through Hive queries.
The procedure that synchronizes the Hive metastore and Big SQL is HCAT_SYNC_OBJECTS, and it runs automatically.
db2 "call SYSHADOOP.HCAT_SYNC_OBJECTS('Schema', 'TableName', 'a', 'REPLACE', 'CONTINUE')"
As per the IBM Knowledge Center:
Tables that are created under the Hive default schema are not automatically synced; you must synchronize these tables manually if you want them in Db2 Big SQL.
You can invoke the HCAT_SYNC_OBJECTS stored procedure for all current schemas and tables by selecting the Run Metadata Sync service actions menu item.
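For example, to manually sync a hypothetical table my_hive_table that was created under the Hive default schema (the argument positions follow the HCAT_SYNC_OBJECTS call shown above):
db2 "call SYSHADOOP.HCAT_SYNC_OBJECTS('default', 'my_hive_table', 't', 'REPLACE', 'CONTINUE')"
Here 't' restricts the sync to tables, whereas 'a' (as above) covers all object types.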
I currently have Hadoop 2, Pig, Hive and HBase.
I have some input data, which I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which big data component (Pig/Hive/HBase) should I create the staging table? It will have data coming in based on a condition, and later we might want to run MapReduce jobs with complex logic on it.
Please assist
Hive: if you have an OLAP kind of workload and don't need real-time read/write.
HBase: if you have an OLTP kind of workload and need to do real-time/streaming read/write. Some batch or OLAP processing can still be done using MapReduce, and SQL-like querying is possible using Apache Phoenix.
You can run MapReduce jobs on both Hive and HBase.
Anywhere you want. Pig is not an option, as it does not have a metastore. Use Hive if you want SQL-like queries, or HBase depending on your access patterns.
When you run a Hive query on top of the data, it is converted into MapReduce for you.
So if you create the staging table in Hive, work with Hive queries rather than hand-written MapReduce; if you would rather express the processing yourself, use Pig instead, since you gain little from putting a Hive table on top of the data in that case.
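As a rough sketch of the Hive option, a staging table over data already sitting in HDFS is usually just an external table pointing at that directory; the path, columns and delimiter below are hypothetical:
-- external staging table over the raw files in HDFS
CREATE EXTERNAL TABLE staging_input (
  id      BIGINT,
  payload STRING,
  load_dt STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/input';

-- materialize only the rows that match your condition
CREATE TABLE staging_filtered STORED AS ORC AS
SELECT id, payload, load_dt
FROM   staging_input
WHERE  load_dt = '2016-01-01';
MapReduce jobs (or further Hive queries) can then run against either table.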
In a cluster with Hive installed, what do the metastore and the NameNode hold? I understand that the metastore has all the table schemas, partition details and metadata. Now what exactly is this metadata? Then what does the NameNode hold? And where in the cluster does this metastore live?
The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism, and it helps clients with reads/writes by receiving their requests and redirecting them to the appropriate DataNode.
The metadata which the metastore stores contains things like:
IDs of Database
IDs of Tables
IDs of Index
The time of creation of an Index
The time of creation of a Table
IDs of roles assigned to a particular user
InputFormat used for a Table
OutputFormat used for a Table, and so on.
Is this what you wanted to know?
And it is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
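For instance, pointing the metastore at an external MySQL database is just a matter of hive-site.xml configuration; a minimal sketch, assuming a hypothetical host metastore-db and database hive_metastore:
<!-- JDBC connection to the metastore database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>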
HTH
P.S : You might find the E/R diagram of metastore useful.
Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if in Hive you run "CREATE TABLE xxx (...) PARTITIONED BY (...)", the Hive partitioning metadata is stored in the metastore database (Oracle, MySQL, ...).
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
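For example (hypothetical values, and assuming the default warehouse location):
hive> INSERT INTO employee_table PARTITION (region='emea') VALUES (1, 'alice');
The row ends up as a file in HDFS under a per-partition directory, something like /user/hive/warehouse/employee_table/region=emea/, whose block locations the NameNode tracks, while the partition itself is recorded in the metastore tables shown above.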
Metastore: it's a database which stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses the Derby database, but you can use any other database such as MySQL or Oracle.
Use of the metastore: whenever you fire a query from the Hive CLI, the execution engine gathers all the details regarding the table and creates an execution plan (job). These details come from the metastore. Finally, the execution engine sends the job to Hadoop, where an ordinary Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine so the MR job can run successfully.
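You can see the plan the execution engine builds (and hence what will be submitted to Hadoop) with EXPLAIN, for example against the employee_table from the earlier example:
hive> EXPLAIN SELECT region, count(*) FROM employee_table GROUP BY region;
The output lists the map and reduce stages that will be run.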
Regarding the Hive metastore (not a Hadoop metastore):
It is not compulsory to have a metastore in your Hadoop environment, as it is only required if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only, used by Hive to store meta-information about the database objects you create (not the actual data, which is already in HDFS, because Hive does not store data itself; it uses data already stored in the file system).
A Hive installation requires a metastore service backed by some RDBMS.
Regarding the NameNode (Hadoop NameNode):
It is a core part of Hadoop and behaves like a metastore for the cluster.
It is not an RDBMS; it stores the file system metadata in files on its own disk.
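A quick way to see what the NameNode tracks is to ask it for the blocks and DataNode locations behind a Hive table's files, e.g. (using the hypothetical warehouse path from the earlier answer):
$ hdfs fsck /user/hive/warehouse/employee_table -files -blocks -locations
The table definition itself, by contrast, lives only in the metastore database.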