HIVE query logs location - hive

I am finding it very difficult to locate the Hive query logs; basically, I want to see which queries were executed.
For example, I want to find queries like this one:
select foo, count(*) from table where field=value group by foo;

From Hive documentation:
hive.exec.scratchdir Default Value:
/tmp/${user.name} in Hive 0.2.0 through 0.8.0
/tmp/hive-${user.name} in Hive 0.8.1 through 0.14.0
/tmp/hive in Hive 0.14.0 and later
This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as to store the intermediate outputs of these stages.
hive.start.cleanup.scratchdir Default Value: false
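Since hive.exec.scratchdir is an HDFS path, you can peek at the per-query plan files it holds; a quick sketch, assuming the Hive 0.8.1 through 0.14.0 default of /tmp/hive-${user.name}:
# List the scratch dir contents for the current user (path varies by Hive version)
hdfs dfs -ls /tmp/hive-$USER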

Execute the query with the command below:
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "select foo, count(*) from table where field=value group by foo"
It will create a log file in the logs folder. Make sure the logs folder exists in the current directory.
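Once the log exists, you can pull the executed query text back out of it; a minimal grep sketch, assuming the log landed at ./logs/hive.log and that your Hive version logs query text in the Driver's "Compiling command" / "Executing command" lines (the exact wording varies between versions):
# Show the lines where the Driver records each query's text
grep -E "Compiling command|Executing command" ./logs/hive.log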

Related

schematool failing metastore validation for Hive 1.1.0

I just deployed Cloudera 5.12 and I'm installing Hive. Following the instructions, I run
\i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-1.1.0.postgres.sql
as a final step, to create the metastore. When I run the schematool validation
schematool -dbType postgres -validate
I get an error:
Validating metastore schema tables
Table(s) [ [compaction_queue, completed_txn_components, hive_locks, next_compaction_queue_id, next_lock_id, next_txn_id, txn_components, txns] ] are missing from the metastore database schema.
Failed in schema table validation.
[FAIL]
Everything else is SUCCESS and I can access the Hive databases without problems. How do I fix this error?
I believe this is a bug. Here is what I found:
$ find . -name "*.sql" -print |xargs grep compaction_queue
./postgres/hive-txn-schema-0.14.0.postgres.sql:CREATE TABLE "compaction_queue" (
./postgres/hive-txn-schema-0.14.0.postgres.sql:CREATE TABLE "next_compaction_queue_id" (
./postgres/hive-txn-schema-0.14.0.postgres.sql:INSERT INTO "next_compaction_queue_id" VALUES(1);
./postgres/hive-schema-0.14.0.postgres.sql:CREATE TABLE "compaction_queue" (
./postgres/hive-schema-0.14.0.postgres.sql:CREATE TABLE "next_compaction_queue_id" (
./postgres/hive-schema-0.14.0.postgres.sql:INSERT INTO "next_compaction_queue_id" VALUES(1);
As you can see, the table next_compaction_queue_id only exists in schema version 0.14 for PostgreSQL. It does not exist in any of the other versions or for any other database type. I do not believe these tables are used. If you have Cloudera Support, please create a support case and ask support to file a JIRA.
A workaround is to find the CREATE TABLE statements for the claimed-missing tables in the hive-schema-0.14.0.postgres.sql file and add those tables to your Hive metastore database. Since they are not used, this won't harm anything, but it will get rid of the error from your schematool command.
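The grep output above shows that at least the compaction tables also live in hive-txn-schema-0.14.0.postgres.sql; assuming the other claimed-missing transactional tables are defined there too, and that your metastore database is named metastore and owned by hiveuser (adjust both to your setup), one hedged way to apply them in bulk is:
# Load the transactional-table DDL straight into the metastore database
psql -U hiveuser -d metastore -f /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-0.14.0.postgres.sql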
I made it work by recreating the Hive metastore using schematool. First I had to drop the current metastore on PostgreSQL, using the information from here:
> su - postgres
> psql
REVOKE CONNECT ON DATABASE metastore FROM public;
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'metastore' AND pid <> pg_backend_pid();
DROP DATABASE metastore;
and then recreating using
/usr/lib/hive/bin/schematool -dbType postgres -initSchema -verbose -userName hiveuser -passWord thepassword
Validation worked afterwards:
> /usr/lib/hive/bin/schematool -dbType postgres -validate
Starting metastore validation
Validating schema version
Succeeded in schema version validation.
[SUCCESS]
Validating sequence number for SEQUENCE_TABLE
Succeeded in sequence number validation for SEQUENCE_TABLE.
[SUCCESS]
Validating metastore schema tables
Succeeded in schema table validation.
[SUCCESS]
Validating DFS locations
Succeeded in DFS location validation.
[SUCCESS]
Validating columns for incorrect NULL values.
Succeeded in column validation for incorrect NULL values.
[SUCCESS]
Done with metastore validation: [SUCCESS]
schemaTool completed
So I think that if there's a bug, it's on the metastore creation step:
\i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-1.1.0.postgres.sql
detailed in the Cloudera installation manual.

How to migrate hive derby metastore to postgres metastore

I have been using Derby as the Hive metastore for quite some time.
Is there a way to migrate the metastore to PostgreSQL?
I am using Apache Hive 0.13.
The best approach I have found so far is the following:
**Export from the existing database**
Use the Derby tool 'ij' (assuming you are in the root of the Derby installation folder):
java -cp lib/derby-10.10.1.1.jar:lib/derbytools-10.10.1.1.jar:lib/derbyclient-10.10.1.1.jar org.apache.derby.tools.ij
Then run the following commands to extract the content of the somedb database:
CONNECT 'jdbc:derby:/path/to/somedb';
CALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(null, 'TABLE1', 'table1', null, null, null);
This should create the file: 'table1'.
**Import the data to the PostgreSQL database**
Run the 'psql' application on the console/terminal.
Log on to somedb and ingest the data (and fix the automatic sequences):
\c somedb
COPY table1 FROM '/path/to/table1' with csv;
SELECT SETVAL('table1_guid_seq', (SELECT MAX(guid) FROM table1));
Repeat this for all tables you want to export from Derby and import into PostgreSQL.
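To avoid doing this by hand for every table, here is a hedged shell sketch of the export side, assuming the Derby database path and jar versions from above and that ij accepts a script on stdin; the lowercase file names match the psql COPY step:
# Export each Derby table to a lowercase-named flat file via ij
for t in TABLE1 TABLE2 TABLE3; do
  printf "CONNECT 'jdbc:derby:/path/to/somedb';\nCALL SYSCS_UTIL.SYSCS_EXPORT_TABLE(null, '%s', '%s', null, null, null);\n" \
    "$t" "$(echo "$t" | tr '[:upper:]' '[:lower:]')" |
  java -cp lib/derby-10.10.1.1.jar:lib/derbytools-10.10.1.1.jar:lib/derbyclient-10.10.1.1.jar org.apache.derby.tools.ij
done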

How to save the results of an impala query

I've loaded a large set of data from S3 into HDFS and then inserted the data into a table in Impala.
I then ran a query against this data, and I'm looking to get the results back into S3.
I'm using Amazon EMR with Impala 1.2.4. If it's not possible to get the results of the query back to S3 directly, are there options to get the data back to HDFS and then somehow send it back to S3 from there?
I have messed around with the impala-shell -o filename option, but that appears to only work on the local Linux file system.
I thought this would be a common scenario, but I'm having trouble finding any information about saving the results of a query anywhere.
Any pointers appreciated.
To add to the knowledge above, here is the command that writes the query results to a file with a custom delimiter: --delimited switches the output to delimited mode (tab-separated by default), and --output_delimiter overrides that default tab delimiter.
impala-shell -q "query " --delimited --output_delimiter='\001' --print_header -o 'filename'
What I usually do if it's a smallish result set is run the script from the command line then upload to s3 using the AWS command line tool:
impala-shell -e "select ble from bla" -o filename
aws s3 cp filename s3://mybucket/filename
An alternative is to use Hive as the last step in your data pipeline after you've run your query in Impala:
1. Impala step:
create table processed_data
as
select blah
--do whatever else you need to do in here
from raw_data1
join raw_data2 on a=b
2. Hive step:
create external table export
like processed_data
location 's3://mybucket/export/';
insert into table export
select * from processed_data;
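If you script the pipeline, the two steps can be chained from the shell; a hedged sketch reusing the table and bucket names from the example above:
# 1. Impala builds the result table
impala-shell -q "create table processed_data as select blah from raw_data1 join raw_data2 on a=b"
# 2. Hive copies it out to S3
hive -e "create external table export like processed_data location 's3://mybucket/export/'; insert into table export select * from processed_data;"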
If you have the AWS CLI installed, you can pipe impala-shell's standard output straight to S3 using a Unix pipe and the stream argument (-):
impala-shell -B -q "select ble from bla" | aws s3 cp - s3://mybucket/outputfilename

changing schemas in hive from command line

How do I change the Hive schema from the command line?
I need to run HQL scripts to create tables, but those tables need to be created inside a particular schema.
I am using hive -f createTable.hql to create the tables.
You can pass one or more parameters to the script:
hive -hiveconf myschema=newschema -f createTable.hql
Then in the script:
CREATE SCHEMA IF NOT EXISTS ${hiveconf:myschema}
LOCATION "/foo/dir";
USE ${hiveconf:myschema};
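Invoking the script and verifying that the schema exists might look like this (a sketch; newschema is the value passed on the command line above):
# Run the script, then confirm the schema was created
hive -hiveconf myschema=newschema -f createTable.hql
hive -e "show databases like 'newschema'"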
Any tables you then create will be in that working schema. You can also make the variable part of the name:
hive -hiveconf name=Bob -f createTable.hql
In the script:
CREATE SCHEMA IF NOT EXISTS ${hiveconf:name}_SCHEMA
LOCATION "/foo/dir";
USE ${hiveconf:name}_SCHEMA;
You can do that inside your HQL/DDL script:
USE schemaName;
CREATE TABLE my_table (id INT, name STRING);
...
So, basically, you can tell Hive to use a specific schema with the USE statement. You can also run it directly in the Hive CLI.
If you want to run queries in a particular schema via the command line, you can use the command below:
hive -e "use schema_name; show tables;"
Note that -f and -e cannot both be used in the same command.
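Since -f and -e are mutually exclusive, a hedged shell workaround is to splice the USE statement in front of the script's contents (schema_name and createTable.hql as above):
# Prepend the schema switch to the script body and run it all via -e
hive -e "use schema_name; $(cat createTable.hql)"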

HIVE Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I am getting the error below when creating a Hive database:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version:**hadoop-1.2.1**
HIVE Version: **hive-0.12.0**
Hadoop path:/home/hadoop_test/data/hadoop-1.2.1
hive path :/home/hadoop_test/data/hive-0.12.0
I have copied hive*.jar, jline-*.jar, and antlr-runtime.jar from hive-0.12.0/lib to hadoop-1.2.1/lib.
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
Make sure the location is specified correctly.
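To check which location the metastore has recorded, a quick sketch (table_name as above; the Location field in the output shows where Hive expects the data):
# Inspect the table's metastore details, including its Location
hive -e "describe formatted table_name;"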
I solved the problem in the following way:
set hive.msck.repair.batch.size=1;
set hive.msck.path.validation=ignore;
If you cannot set the value and instead get the error Error: Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime (state=42000,code=1), add this property to hive-site.xml:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hive\.msck\.path\.validation|hive\.msck\.repair\.batch\.size</value>
</property>
Set the hive.metastore.schema.verification property in hive-site.xml to true; by default it is false.
For further details, check this link.
Amazon Athena
If you get here because of Amazon Athena errors, the bit below might help. First check that all your files have the same schema:
If you run an ALTER TABLE ADD PARTITION (or MSCK REPAIR TABLE) statement and mistakenly specify a partition that already exists and an incorrect Amazon S3 location, zero byte placeholder files of the format partition_value_$folder$ are created in Amazon S3. You must remove these files manually.
We removed the files with the AWS CLI:
aws s3 rm s3://bucket/key/table/ --exclude="*" --include="*folder*" --recursive --dryrun
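(Note: the --dryrun flag above only previews what would be deleted; drop it to actually remove the placeholder files.)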
See also the docs with some extra steps included.
To properly fix this with MSCK:
Remove the older partitions from the metastore, if their paths no longer exist, using
ALTER TABLE dbname.tablename DROP PARTITION IF EXISTS (partition_column_name > 0);
Run the MSCK REPAIR command:
MSCK REPAIR TABLE dbname.tablename;
Step 1 is required because the MSCK REPAIR command will throw an error if a partition has been removed from the file system (HDFS); by removing all the partitions from the metastore first and then syncing with MSCK, the required partitions are added back properly.
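The two steps as a single shell invocation (a sketch; dbname.tablename and partition_column_name are the placeholders from the statements above):
# Drop stale partitions from the metastore, then re-sync with the file system
hive -e "ALTER TABLE dbname.tablename DROP PARTITION IF EXISTS (partition_column_name > 0); MSCK REPAIR TABLE dbname.tablename;"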
The reason we got this error was that we had added a new column to the external Hive table. set hive.msck.path.validation=ignore; worked for fixing the Hive queries, but Impala had additional issues, which were solved with the steps below:
After doing an INVALIDATE METADATA, Impala queries started failing with Error: incompatible Parquet schema for column.
SOLUTION: set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
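Per session, that might look like this in impala-shell (a sketch; your_table is a placeholder):
# Set the option and run a quick query in the same session
impala-shell -q "set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; select * from your_table limit 1;"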
If you're using a Cloudera distribution, the steps below will make the change permanent, so you don't have to set the option per session:
Cloudera Manager -> Clusters -> Impala -> Configuration -> Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
Add the value: PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
NOTE: do not use SET or a semicolon when setting the parameter in Cloudera Manager.
Open the Hive CLI using "hive --hiveconf hive.root.logger=DEBUG,console" to enable logging and debug from there. In my case, a camelCase partition name had been written to HDFS, while I had created the Hive table with the name fully in lowercase.
None of the proposed solutions worked for me.
I discovered a 0-byte file named _$folder$ inside my table location path (at the same level as the partitions).
Removing it allowed me to run MSCK REPAIR TABLE t without issues.
This file came from an S3 restore (a rollback to a previous versioned state).
I faced the same error. The reason in my case was a directory in the HDFS warehouse with the same name; when this directory was deleted, it resolved my issue.
It's probably because your metastore_db is corrupted. Delete the .lck files from metastore_db.
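A hedged sketch of that cleanup (assuming the embedded Derby metastore_db directory sits in the directory you start Hive from):
# Remove Derby's lock files so the metastore can be opened again
find metastore_db -name "*.lck" -delete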
hive -e "msck repair table database.tablename"
It will repair the table's metastore schema.
Setting the property below and then running MSCK REPAIR worked for me:
set hive.mapred.mode=nonstrict;
I faced a similar issue when the underlying HDFS directory was updated with new partitions and the Hive metastore went out of sync.
I solved it using the following two steps:
MSCK TABLE table_name showed which partitions were out of sync.
MSCK REPAIR TABLE table_name added the missing partitions.