schematool failing metastore validation for Hive 1.1.0 - hive

I just deployed Cloudera 5.12 and I'm installing Hive. Following the instructions, I run
\i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-1.1.0.postgres.sql
as a final step, to create the metastore. When I run the schematool validation
schematool -dbType postgres -validate
I get an error:
Validating metastore schema tables
Table(s) [ [compaction_queue, completed_txn_components, hive_locks, next_compaction_queue_id, next_lock_id, next_txn_id, txn_components, txns] ] are missing from the metastore database schema.
Failed in schema table validation.
[FAIL]
Everything else is SUCCESS and I can access the Hive databases without problems. How do I fix this error?

I believe this is a bug. Here is what I found:
$ find . -name "*.sql" -print |xargs grep compaction_queue
./postgres/hive-txn-schema-0.14.0.postgres.sql:CREATE TABLE "compaction_queue" (
./postgres/hive-txn-schema-0.14.0.postgres.sql:CREATE TABLE "next_compaction_queue_id" (
./postgres/hive-txn-schema-0.14.0.postgres.sql:INSERT INTO "next_compaction_queue_id" VALUES(1);
./postgres/hive-schema-0.14.0.postgres.sql:CREATE TABLE "compaction_queue" (
./postgres/hive-schema-0.14.0.postgres.sql:CREATE TABLE "next_compaction_queue_id" (
./postgres/hive-schema-0.14.0.postgres.sql:INSERT INTO "next_compaction_queue_id" VALUES(1);
As you can see, the table next_compaction_queue_id only exists in schema version 0.14 for PostgreSQL. It does not exist in any of the other versions or in any other database type. I do not believe these tables are used. If you have Cloudera Support, please create a support case and ask support to create a JIRA.
A workaround would be to find the CREATE TABLE statements of the claimed missing tables in the hive-schema-0.14.0.postgres.sql file and add those tables to your Hive metastore database, as sketched below. Since they are not used, this won't harm anything, but it will get rid of the error in your schematool command.
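One way to apply that workaround, assuming the 0.14 transactional-tables script sits next to the 1.1.0 script used during installation (the path below mirrors it) and that the metastore database is named metastore, is to source it from psql:
\c metastore
\i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-0.14.0.postgres.sql
That file holds the CREATE TABLE statements (and initial INSERTs) for the transactional tables the validator reports as missing.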

I made it work by recreating the Hive metastore using schematool. I first had to drop the current metastore database on PostgreSQL, using the information from here:
> su - postgres
> psql
REVOKE CONNECT ON DATABASE metastore FROM public;
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'metastore' AND pid <> pg_backend_pid();
DROP DATABASE metastore;
and then recreating it using
/usr/lib/hive/bin/schematool -dbType postgres -initSchema -verbose -userName hiveuser -passWord thepassword
Validation worked afterwards:
> /usr/lib/hive/bin/schematool -dbType postgres -validate
Starting metastore validation
Validating schema version
Succeeded in schema version validation.
[SUCCESS]
Validating sequence number for SEQUENCE_TABLE
Succeeded in sequence number validation for SEQUENCE_TABLE.
[SUCCESS]
Validating metastore schema tables
Succeeded in schema table validation.
[SUCCESS]
Validating DFS locations
Succeeded in DFS location validation.
[SUCCESS]
Validating columns for incorrect NULL values.
Succeeded in column validation for incorrect NULL values.
[SUCCESS]
Done with metastore validation: [SUCCESS]
schemaTool completed
So I think that if there's a bug, it's in the metastore creation step:
\i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-1.1.0.postgres.sql
detailed in the Cloudera installation manual.

Related

Redshift Spectrum and Hive Metastore - Ambiguous Error

From Redshift, I created an external schema using the Hive Metastore. I can see the Redshift metadata about the tables (for example using: select * from SVV_EXTERNAL_TABLES); however, when querying one of these tables, I get an ambiguous error: "error: Assert".
I tried creating the external schema and querying the tables. I can query the metadata about the tables, but cannot actually query the tables themselves.
I created the external schema as follows:
create external schema hive_schema
from hive metastore
database 'my_database_name'
uri 'my_ip_address' port 9083
iam_role 'arn:aws:iam::123456789:role/my_role_name';
Here is the error message when running "select * from hive_schema.my_table_name;"
-----------------------------------------------
error: Assert
code: 1000
context: loc->length() > 5 && loc->substr(0, 5) == "s3://" -
query: 1764
location: scan_range_manager.cpp:221
process: padbmaster [pid=26902]
-----------------------------------------------
What is the LOCATION of your Hive table? It seems like Redshift is asserting that the location starts with s3://.
You can see the LOCATIONs of your tables by running this query:
select location from SVV_EXTERNAL_TABLES
Where are your Hive tables stored? Is it maybe HDFS? I doubt that Redshift supports locations other than S3 - in the section Considerations When Using AWS Glue Data Catalog of this AWS guide they describe how to set up your Hive Metastore to store data in S3.
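For illustration, a Hive table definition that Redshift Spectrum can scan points its LOCATION at S3. A minimal sketch (table, columns, format, and bucket name are all made up):
create external table my_table_name (id bigint, name string)
stored as parquet
location 's3://my-bucket/my_table_name/';
If the location reported by SVV_EXTERNAL_TABLES starts with hdfs:// instead, that would explain the assert.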

How to migrate hive derby metastore to postgres metastore

I have been using Derby as the Hive metastore for quite some time.
Is there a way to migrate the metastore to PostgreSQL?
I am using Apache Hive - 0.13
The best approach I have found so far is as below:
**Export from existing database**
Use the Derby tool 'ij' (assuming you are in the root installation folder, so the Derby jars under lib/ resolve):
java -cp lib/derby-10.10.1.1.jar:lib/derbytools-10.10.1.1.jar:lib/derbyclient-10.10.1.1.jar org.apache.derby.tools.ij
Then run the following commands to extract the content of the somedb database:
CONNECT 'jdbc:derby:/path/to/somedb';
CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(null, 'TABLE1', 'table1', null, null, null);
This should create the file: 'table1'.
**Import the data to the PostgreSQL database**
Run the 'psql' application on the console/terminal.
Connect to somedb, ingest the data, and fix the automated sequences:
\c somedb
COPY table1 FROM '/path/to/table1' with csv;
SELECT SETVAL('table1_guid_seq', (SELECT MAX(guid) FROM table1));
Repeat this for all tables you want to export from Derby and import into PostgreSQL; a quick way to list them is sketched below.
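Before repeating the export, you can list the user tables present in the Derby metastore from the same ij session; a quick sketch using Derby's system catalog:
SELECT TABLENAME FROM SYS.SYSTABLES WHERE TABLETYPE = 'T';
Each name returned is a candidate for the SYSCS_EXPORT_TABLE / COPY round trip above.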

HIVE query logs location

I am finding it very difficult to locate the Hive query logs; basically I want to see what queries were executed.
For example, I want to find queries of this form:
select foo, count(*) from table where field=value group by foo;
From Hive documentation:
hive.exec.scratchdir Default Value:
/tmp/${user.name} in Hive 0.2.0 through 0.8.0
/tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0
/tmp/hive in Hive 0.14.0 and later
This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as to store the intermediate outputs of these stages.
hive.start.cleanup.scratchdir Default Value: false
Execute the query with the command below:
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "select foo, count(*) from table where field=value group by foo"
It will create a log file in the logs folder. Make sure that the logs folder exists in the current directory.

HIVE Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I am getting the below error when creating a Hive database:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version:**hadoop-1.2.1**
HIVE Version: **hive-0.12.0**
Hadoop path:/home/hadoop_test/data/hadoop-1.2.1
hive path :/home/hadoop_test/data/hive-0.12.0
I have copied hive*.jar, jline*.jar, and antlr-runtime*.jar from hive-0.12.0/lib to hadoop-1.2.1/lib.
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
Make sure the location is specified correctly
I solved the problem in the following way:
set hive.msck.repair.batch.size=1;
set hive.msck.path.validation=ignore;
If you cannot set the value and you get the error Error: Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1),
then add the following to hive-site.xml:
key:
hive.security.authorization.sqlstd.confwhitelist.append
value:
hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size
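Put together, the whitelist entry in hive-site.xml would look roughly like this (standard hive-site.xml property syntax, with the key and value from above):
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size</value>
</property>
After restarting HiveServer2, both parameters should be settable at runtime.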
Set the hive.metastore.schema.verification property in hive-site.xml to true; by default it is false.
For further details check this link.
Amazon Athena
If you get here because of Amazon Athena errors, the steps below might help. First check that all your files have the same schema:
If you run an ALTER TABLE ADD PARTITION (or MSCK REPAIR TABLE) statement and mistakenly specify a partition that already exists and an incorrect Amazon S3 location, zero byte placeholder files of the format partition_value_$folder$ are created in Amazon S3. You must remove these files manually.
We removed the files with the AWS CLI:
aws s3 rm s3://bucket/key/table/ --exclude="*" --include="*folder*" --recursive --dryrun
See also the docs with some extra steps included.
To properly fix this with MSCK:
Remove the older partitions from the metastore, if their paths no longer exist, using
ALTER TABLE dbname.tablename DROP IF EXISTS PARTITION (partition_column_name > 0);
Run the MSCK REPAIR command:
MSCK REPAIR TABLE dbname.tablename;
Step 1 is required because the MSCK REPAIR command will throw an error if a partition has been removed from the file system (HDFS); by removing all such partitions from the metastore first and then syncing with MSCK, the required partitions are added back properly.
The reason we got this error was that we had added a new column to the external Hive table. set hive.msck.path.validation=ignore; worked for fixing the Hive queries, but Impala had additional issues which were solved with the steps below:
After doing an invalidate metadata, Impala queries started failing with Error: incompatible Parquet schema for column
Impala error SOLUTION: set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using the Cloudera distribution, the steps below will make the change permanent and you won't have to set the option per session.
Cloudera Manager -> Clusters -> Impala -> Configuration -> Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
Add the value: PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
NOTE: do not use SET or semi-colon when setting the parameter in Cloudera Manager
Open the Hive CLI using "hive --hiveconf hive.root.logger=DEBUG,console" to enable logs and debug from there. In my case, a camel-case partition name had been written to HDFS while I had created the Hive table with the name fully in lowercase.
None of the proposed solutions worked for me.
I discovered a 0B file named _$folder$ inside my table location path (at the same level as the partitions).
Removing it allowed me to run MSCK REPAIR TABLE t without issues.
This file was coming from an S3 restore (a rollback to a previous versioned state).
I faced the same error. The reason in my case was a directory created in the HDFS warehouse with the same name. When this directory was deleted, my issue was resolved.
It's probably because your metastore_db is corrupted. Delete the .lck files from metastore_db.
hive -e "msck repair table database.tablename"
It will repair the metastore schema of the table.
Setting the below property and then doing MSCK REPAIR worked for me:
set hive.mapred.mode=nonstrict;
I faced a similar issue when the underlying HDFS directory was updated with new partitions and hence the Hive metastore went out of sync.
I solved it using the following two steps:
MSCK TABLE table_name showed which partitions were out of sync.
MSCK REPAIR TABLE table_name added the missing partitions.

What do the hive metastore and name node do in a cluster?

In a cluster with Hive installed, what do the metastore and the NameNode hold? I understand that the metastore has all the table schemas, partition details, and metadata. Now what is this metadata? Then what does the NameNode have? And where is this metastore present in a cluster?
The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It also keeps track of all the DataNodes (dead and live) through a heartbeat mechanism, and it helps clients with reads and writes by receiving their requests and redirecting them to the appropriate DataNodes.
The metadata which the metastore stores contains things like (see the query sketch after this list):
IDs of databases
IDs of tables
IDs of indexes
The time of creation of an index
The time of creation of a table
IDs of roles assigned to a particular user
The InputFormat used for a table
The OutputFormat used for a table, etc.
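As a rough sketch, much of this metadata can be read directly from the RDBMS backing the metastore, using the standard metastore tables DBS, TBLS, and SDS (on a PostgreSQL-backed metastore the identifiers need double quotes):
SELECT d.NAME AS db_name, t.TBL_NAME, t.CREATE_TIME, s.INPUT_FORMAT, s.OUTPUT_FORMAT
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;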
Is this what you wanted to know?
And it is not mandatory to have the metastore in the cluster itself. Any machine (inside or outside the cluster) with a JDBC-compliant database can be used for the metastore.
HTH
P.S : You might find the E/R diagram of metastore useful.
Hive data (not metadata) is spread across Hadoop HDFS DataNode servers. Typically, each block of data is stored on 3 different DataNodes. The NameNode keeps track of which DataNodes have which blocks of actual data.
For a Hive production environment, the metastore service should run in an isolated JVM. Hive processes can communicate with the metastore service using Thrift. The Hive metastore data is persisted in an ACID database such as Oracle DB or MySQL. You can use SQL to find out what is in the Hive metastore:
Here are the tables in the Hive metastore:
SQL> select table_name from user_tables;
DBS
DATABASE_PARAMS
SEQUENCE_TABLE
SERDES
TBLS
SDS
CDS
BUCKETING_COLS
TABLE_PARAMS
PARTITION_KEYS
SORT_COLS
SD_PARAMS
COLUMNS_V2
SERDE_PARAMS
You can describe the structure of each table:
SQL> describe partition_keys;
TBL_ID NUMBER
PKEY_COMMENT VARCHAR2(4000)
PKEY_NAME VARCHAR2(128)
PKEY_TYPE VARCHAR2(767)
INTEGER_IDX NUMBER(10)
And find the contents of each table:
SQL> select * from partition_keys;
So if in Hive you "CREATE TABLE xxx (...) PARTITIONED BY (...)", the Hive partitioning data is stored in the metastore (Oracle, MySQL, ...) database.
For example, in Hive if you create a table like this:
hive> create table employee_table (id bigint, name string) partitioned by (region string);
You will find this in the metastore:
SQL> select tbl_id,pkey_name from partition_keys;
TBL_ID PKEY_NAME
------ ---------
8 region
SQL> select tbl_name from tbls where tbl_id=8;
TBL_NAME
--------
employee_table
When you insert data into employee_table, the data will be stored in HDFS on Hadoop DataNodes and the NameNode will keep track of which DataNodes have the data.
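Continuing the example, the table's HDFS location is itself recorded in the metastore's SDS table, which is how Hive maps the table back to its data path (a sketch; column names are the standard metastore ones):
SQL> select s.location from sds s join tbls t on t.sd_id = s.sd_id where t.tbl_id = 8;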
Metastore - It's a database which stores metadata, i.e. all the details about the tables you create in Hive. By default, Hive comes with and uses the Derby database, but you can use any other database like MySQL or Oracle.
Use of the metastore: Whenever you fire a query from your Hive CLI, the execution engine gathers all the details regarding the table and creates an execution plan (job). These details come from the metastore. Finally, the execution engine sends the job to Hadoop. From there, the common Hadoop MapReduce job is executed and the result is sent back to Hive. The NameNode communicates with the execution engine to successfully execute the MR job.
Regarding the Hive metastore (not a Hadoop metastore):
It is not necessary/compulsory to have the metastore inside your Hadoop environment, as it is only required if you are using Hive on top of your HDFS cluster.
The metastore is the metadata repository for Hive only; Hive uses it to store meta information about the database objects it creates (not the actual data, which already lives in HDFS, because Hive does not store data itself; it uses data already stored in the file system).
A Hive installation requires a metastore service, which can be backed by any RDBMS.
Regarding the NameNode (the Hadoop NameNode):
It is a core part of Hadoop and behaves like a metastore for the cluster.
It is not an RDBMS; it stores file system metadata in the file system only.