How to migrate a Hive Derby metastore to a PostgreSQL metastore - hive

I have been using Derby as the Hive metastore for quite some time.
Is there a way to migrate the metastore to PostgreSQL?
I am using Apache Hive 0.13.

The best approach I have found so far is as below:
**Export from existing database**
Use the Derby tool 'ij' (assuming you are in the root installation folder of Derby):
java -cp lib/derby-10.10.1.1.jar:lib/derbytools-10.10.1.1.jar:lib/derbyclient-10.10.1.1.jar org.apache.derby.tools.ij
Then run the following commands to extract the content of the somedb database:
CONNECT 'jdbc:derby:/path/to/somedb'
CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(null, 'TABLE1', 'table1', null, null, null);
This should create the file: 'table1'.
**Import the data to the PostgreSQL database**
Run the 'psql' application in a console/terminal.
Connect to somedb and ingest the data (then fix the auto-increment sequences):
\c somedb
COPY table1 FROM '/path/to/table1' with csv;
SELECT SETVAL('table1_guid_seq', (SELECT MAX(guid) FROM table1));
Repeat this for every table you want to export from Derby and import into PostgreSQL; a scripted version is sketched below.
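To avoid doing the two steps by hand for each table, they can be combined in a script. The following is a minimal sketch, assuming the Derby 10.10 jars from the command above, a PostgreSQL database named somedb, and a hand-maintained list of table names (extend TABLES to cover the full list in your metastore schema; lower-case table names on the Postgres side are an assumption here):
#!/bin/bash
# Export each Derby table to a CSV file, then COPY it into PostgreSQL.
TABLES="DBS TBLS SDS COLUMNS_V2"
for t in $TABLES; do
  # ij reads SQL from stdin; this writes one CSV file per table
  echo "CONNECT 'jdbc:derby:/path/to/somedb';
CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(null, '$t', '/tmp/$t.csv', null, null, null);" \
    | java -cp lib/derby-10.10.1.1.jar:lib/derbytools-10.10.1.1.jar org.apache.derby.tools.ij
  # COPY into PostgreSQL; ${t,,} lower-cases the name (bash 4+)
  psql somedb -c "COPY ${t,,} FROM '/tmp/$t.csv' WITH csv;"
done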

Related

Export external hive table from one VM to another

I have 2 environments, Dev and Stage. Both have Hive installed (same version, 2.1). On Dev I have external Hive tables pointing to an HBase table. I have to export these Hive tables to Stage. There is no requirement that the HBase table also be migrated; a managed Hive table created with the data in it would be sufficient. Can anyone suggest how to do this? Below is a diagrammatic representation of the scenario; a solution to any of the expected scenarios would be useful.
I tried:
Dumping the Hive table's data into a CSV file and loading it into a managed Hive table on Stage. But the data contains Japanese characters (non-UTF-8), which causes a higher row count on Stage than on Dev.
I think this is a conceptual problem, so I am not adding the queries; please let me know if you wish to see them.
Dev Hive table -> Dev HDFS location -> Distcp -> Stage HDFS location -> Import -> Stage Hive table
You can export the Hive table (data plus metadata) to an HDFS location using the EXPORT command below. Note that IMPORT only understands the layout written by EXPORT, so a plain INSERT OVERWRITE DIRECTORY will not work with it.
EXPORT TABLE department TO 'hdfs_exports_location/department';
Copy the exported HDFS data to the stage environment's HDFS using distcp:
hadoop distcp <hdfs_export_location>/department hdfs://<stage name node>/<import location>
Import the table from the copied HDFS files
import from '<import location>';
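Putting the three steps together, a minimal end-to-end sketch could look like this (the table name department, the paths, and the stage NameNode URI stage-nn:8020 are placeholders to adapt):
# On Dev: export the table's data and metadata to HDFS
hive -e "EXPORT TABLE department TO '/tmp/hive_exports/department';"
# Copy the export directory to the stage cluster
hadoop distcp /tmp/hive_exports/department hdfs://stage-nn:8020/tmp/hive_imports/department
# On Stage: import it (recreates department as a managed table)
hive -e "IMPORT FROM '/tmp/hive_imports/department';"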
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport

Hive tables not shown in /user/hive/warehouse

I have created 2 DBs:
DB1 'Airline',
DB2 'Students'.
I can see DB1 in the Hue browser, but I cannot see its tables inside /user/hive/warehouse/Airline.db/.
I can see the tables of Students.db in /user/hive/warehouse/Students.db, but I cannot see that database in the Hue browser.
Is there anything I need to set?
Do you have access to the Hive CLI? If yes, try running this command:
describe database Airline;
You should see something like this:
Airline Airline Hive database hdfs://<host-fqdn>:8020/apps/hive/warehouse public
This is how you can find the location of a database. You can run the same command for the 'Students' database; a way to verify the contents in HDFS is sketched below.
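Once you know the database's location from that output, you can list what actually exists there in HDFS. A quick sketch (paths adapted from the sample output above; note that Hive usually lower-cases database names, so the directory tends to be airline.db rather than Airline.db, and on some distributions the warehouse root is /apps/hive/warehouse instead of /user/hive/warehouse):
hdfs dfs -ls /apps/hive/warehouse/airline.db
hdfs dfs -ls /user/hive/warehouse/students.db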

HIVE create table is hanging - CDH 5.7

A CREATE TABLE script in Hive is hanging and does not complete for a long time. I am using CDH 5.7; 'show databases' also takes a while to retrieve data, but finally shows the list of all databases. Below is the create script I am using:
create table dept (
  dep_id int,
  dep_name string
);
Am I missing some configuration settings related to Hive? I can also see a green (healthy) status for Hive in Cloudera Manager (CM).
It looks like the Hive metastore was hanging; after restarting the Hive service it started working. Thanks for your help in the Cloudera community.
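For reference, besides restarting from Cloudera Manager, on a package-based CDH install the Hive services can also be restarted from the shell; a sketch (the service names assume the standard CDH packages):
sudo service hive-metastore restart
sudo service hive-server2 restart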

HIVE query logs location

I am finding it very difficult to locate the Hive query logs; basically I want to see what queries were executed. For example, I want to find queries of this form:
select foo, count(*) from table where field=value group by foo;
From Hive documentation:
hive.exec.scratchdir Default Value:
/tmp/${user.name} in Hive 0.2.0 through 0.8.0
/tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0
/tmp/hive in Hive 0.14.0 and later
This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as to store the intermediate outputs of these stages.
hive.start.cleanup.scratchdir Default Value: false
Execute the query with the command below:
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "select foo, count(*) from table where field=value group by foo"
It will create a log file in the logs folder. Make sure the logs folder exists in the current directory.
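After the run, the executed statement should be searchable in the generated log. A quick check (the file name hive.log is the usual default for the DRFA appender; verify what actually appears in your logs directory):
grep -i "select foo" ./logs/hive.log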

What does the hive metastore and name node do in a cluster?

In a cluster with Hive installed, what do the metastore and the NameNode contain? I understand that the metastore has all the table schemas, partition details, and metadata. But what exactly is this metadata? Then what does the NameNode have? And where is the metastore located in a cluster?
The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism, and it helps clients with reads/writes by receiving their requests and redirecting them to the appropriate DataNodes.
The metadata which the metastore stores contains things like:
IDs of Database
IDs of Tables
IDs of Index
The time of creation of an Index
The time of creation of a Table
IDs of roles assigned to a particular user
InputFormat used for a Table
OutputFormat used for a Table, and so on.
Is this what you wanted to know?
And it is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can host the metastore.
HTH
P.S.: You might find the E/R diagram of the metastore useful.
Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
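You can see this block-to-DataNode mapping yourself by asking the NameNode through fsck; a sketch (the warehouse path is a placeholder for one of your own table directories):
hdfs fsck /user/hive/warehouse/somedb.db/sometable -files -blocks -locations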
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if in Hive you run "CREATE TABLE xxx (...) PARTITIONED BY (...)", the Hive partitioning metadata is stored in the metastore database (Oracle, MySQL, ...).
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
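Along the same lines, the metastore can tell you where each table's data lives in HDFS. A sketch joining the standard metastore tables TBLS, DBS, and SDS (the column names assume the stock schema, so verify them against your metastore version):
SQL> select d.name, t.tbl_name, s.location
     from tbls t
     join dbs d on t.db_id = d.db_id
     join sds s on t.sd_id = s.sd_id;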
Metastore - It's a database which stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses the Derby database, but you can use any other database such as MySQL or Oracle.
Use of the Metastore: whenever you fire a query from your Hive CLI, the execution engine gathers all the details regarding the table and creates an execution plan (job). These details come from the metastore. Finally the execution engine sends the job to Hadoop; from there the usual Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine to successfully execute the MR job.
Regarding the Hive metastore (not a Hadoop metastore):
It is not necessary/compulsory to have a metastore in your Hadoop environment, as it is only required if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only, used by Hive to store the meta information of created database objects (not the actual data, which already lives in HDFS; Hive does not store data itself, it uses data already stored in the file system).
A Hive installation requires a metastore service backed by an RDBMS.
Regarding the NameNode (the Hadoop NameNode):
It is a core part of Hadoop and behaves like a metastore for the cluster.
It is not an RDBMS; it stores the file system metadata in the file system itself.