The following Sqoop import works fine when run from the command shell:
import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera -m 1 --table categories --hive-database retail_stage --hive-table categories --fields-terminated-by "|" --hive-import
But when the same statement is run in a Hue workflow, it fails with the following error:
>>> Invoking Sqoop command line now >>>
2019-02-04 11:46:18,411 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2019-02-04 11:46:18,609 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.6-cdh5.13.0
2019-02-04 11:46:18,664 [main] WARN org.apache.sqoop.tool.BaseSqoopTool - Setting your password on the command-line is insecure. Consider using -P instead.
2019-02-04 11:46:18,696 [main] WARN org.apache.sqoop.ConnFactory - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2019-02-04 11:46:18,936 [main] INFO org.apache.sqoop.manager.MySQLManager - Preparing to use a MySQL streaming resultset.
2019-02-04 11:46:18,951 [main] INFO org.apache.sqoop.tool.CodeGenTool - Beginning code generation
2019-02-04 11:46:20,510 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
2019-02-04 11:46:20,555 [main] INFO org.apache.sqoop.manager.SqlManager - Executing SQL statement: SELECT t.* FROM `categories` AS t LIMIT 1
2019-02-04 11:46:20,565 [main] INFO org.apache.sqoop.orm.CompilationManager - HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
2019-02-04 11:46:25,907 [main] ERROR org.apache.sqoop.tool.ImportTool - Import failed: java.io.IOException: Error returned by javac
at org.apache.sqoop.orm.CompilationManager.compile(CompilationManager.java:222)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:494)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:621)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
at org.apache.oozie.action.hadoop.SqoopMain.runSqoopJob(SqoopMain.java:187)
at org.apache.oozie.action.hadoop.SqoopMain.run(SqoopMain.java:170)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:81)
at org.apache.oozie.action.hadoop.SqoopMain.main(SqoopMain.java:51)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:235)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
<<< Invocation of Sqoop command completed <<<
No child hadoop job is executed.
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Bear in mind that a list-databases command works fine when run from a Hue workflow.
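For reference, the working list-databases action looks roughly like this (a hedged sketch, assuming the same connection parameters as the import above):
list-databases --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera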
Cloudera Quickstart VM (Docker image) details:
Version: Cloudera Express 5.13.0 (#55 built by jenkins on 20171002-1719 git: bd657e597e6743c458ee2c9aabe808b7c972981c)
Java VM Name: Java HotSpot(TM) 64-Bit Server VM
Java VM Vendor: Oracle Corporation
Java Version: 1.7.0_67
In fact, any command placed as a Sqoop action in Oozie fails.
The following is how the Cloudera Quickstart VM is started as a Docker image.
Start the Cloudera Quickstart container:
docker run --hostname=quickstart.cloudera --privileged=true -t -i -v /Users/Yunus/Documents/ClouderaShare:/src --publish-all=true -p 8888:8888 -p 8020:8020 -p 8032:8032 -p 7180:7180 -p 80:80 -p 50070:50070 -p 11000:11000 -p 21050:21050 -p 8088:8088 -p 8042:8042 cloudera-5-13 /usr/bin/docker-quickstart
Start Cloudera Manager: /home/cloudera/cloudera-manager --express
Fix the clock offset problem: /etc/init.d/ntpd start
The MySQL connector is already in the /usr/share/java/ directory; copy it into the Oozie sharelib: sudo -u hdfs hadoop fs -put /usr/share/java/mysql-connector-java-5.1.34-bin.jar /user/oozie/share/lib/lib_20171023234839/sqoop
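After adding a jar to the sharelib, Oozie usually has to be told to pick up the change; a hedged follow-up step (assuming the default Oozie URL on the Quickstart VM) would be:
oozie admin -oozie http://quickstart.cloudera:11000/oozie -sharelibupdate
(alternatively, restart the Oozie service).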
Related
We are trying to run a fixed number of Hive queries in parallel using GNU parallel.
Even with parallelism set to 1 (i.e. sequential execution) via -j1, the first execution works, but the second one gets stuck:
$ parallel -j1 --eta --verbose beeline -e '"SELECT \"{}\";"' ::: a b c
beeline -e "SELECT \"a\";"
Computers / CPU cores / Max jobs to run
1:local / 40 / 1
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 3 AVG: 0.00s local:1/0/100%/0.0s
+------+
| _c0 |
+------+
| a |
+------+
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. Set system property 'log4j2.debug' to show Log4j2 internal initialization logging.
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://en02.example.cloud:2181,mn01.example.cloud:2181,mn02.example.cloud:2181/default;principal=hive/_HOST#example.cloud;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2
21/11/16 10:41:39 [main]: INFO jdbc.HiveConnection: Connected to mn01.example.cloud:10000
Connected to: Apache Hive (version 3.1.3000.7.1.6.0-297)
Driver: Hive JDBC (version 3.1.3000.7.1.6.0-297)
Transaction isolation: TRANSACTION_REPEATABLE_READ
INFO : Compiling command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b): SELECT "a"
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b); Time taken: 0.102 seconds
INFO : Executing command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b): SELECT "a"
INFO : Completed executing command(queryId=hive_20211116104139_7be8d8ee-f58d-4572-88a4-43533846160b); Time taken: 0.006 seconds
INFO : OK
1 row selected (0.191 seconds)
Beeline version 3.1.3000.7.1.6.0-297 by Apache Hive
Closing: 0: jdbc:hive2://en02.example.cloud:2181,mn01.example.cloud:2181,mn02.example.cloud:2181/default;principal=hive/_HOST#example.cloud;serviceDiscoveryMode=zooKeeper;ssl=true;zooKeeperNamespace=hiveserver2
beeline -e "SELECT \"b\";"
ETA: 79s Left: 2 AVG: 42.00s local:1/1/100%/47.0s
Simplifying this further, even a parallel call to beeline --help gets stuck for the second run in the same way, so it doesn't seem to be related to the connection to the Hive DB.
The solutions with which we finally got it working are:
parallel -j1 --eta --verbose beeline -e '"SELECT \"{}\";"' < /dev/null ::: a b c
and (thanks @OleTange!)
parallel -j1 --eta --verbose --tty beeline -e '"SELECT \"{}\";"' ::: a b c
How we found out:
We added set -x to the beeline bash script and some of the scripts it calls, logged the results of the parallel runs to separate files, and diffed them.
We saw that there was a part in the logs about
[ -p /dev/stdin ]
and below that a few environment variables that got set in the first parallel execution, but not in the second one.
We then played around with various options to give beeline a stdin, and the /dev/null version finally worked.
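A rough sketch of that debugging approach for anyone reproducing it (a variation that traces via bash -x instead of editing the scripts; the log file names are assumptions, and it assumes beeline resolves to a shell wrapper script):
bash -x "$(which beeline)" -e 'SELECT "a";' > with_tty.log 2>&1
bash -x "$(which beeline)" -e 'SELECT "a";' < /dev/null > without_tty.log 2>&1
diff with_tty.log without_tty.log   # look for the [ -p /dev/stdin ] branch and the env vars set in only one run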
I'm trying to import all tables from a MySQL database into a Hive database using the Apache Sqoop CLI. When I execute the following import command:
[hadoop@localhost bin]$ sqoop import-all-tables --connect jdbc:mysql://localhost/mysql --username root --password root
the import fails and I get the following error message at the end of the output:
20/05/09 23:06:27 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
.
.
.
20/05/09 23:06:38 ERROR tool.ImportAllTablesTool: Encountered IOException running import job:
java.io.IOException: Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
Prior to running Sqoop 1.4.7, I had the following installed and up and running:
[hduser@localhost ~]$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
[hadoop@localhost ~]$ hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar
MySQL Server version: 5.6.48
Hive-1.2.2
Based on this configuration, how can I import all tables into Hive successfully?
It is possible to use a character (text) column as the split-by attribute.
From your error message:
Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
You only need to add -Dorg.apache.sqoop.splitter.allow_text_splitter=true
right after the sqoop import-all-tables statement, like this:
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://localhost/mysql \
--username root \
--password root
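Since the question is about landing the tables in Hive, the same command can be extended with the Hive import options; a hedged sketch (the target Hive database name is an assumption):
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://localhost/mysql \
--username root \
--password root \
--hive-import \
--hive-database default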
I'm currently importing Postgres data to HDFS and planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
The exception is:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
at org.kitesdk.data.Datasets.exists(Datasets.java:624)
at org.kitesdk.data.Datasets.exists(Datasets.java:646)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Note: the same Sqoop command with an HDFS target dir works. I'm also able to manually write to the S3 bucket from the cluster nodes (using the aws s3 command).
The Kite SDK has been upgraded. All you have to do is download the new SDK onto the EMR cluster and run the Sqoop command again.
Use wget to download the kite-data-s3-1.1.0.jar
wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar
Move the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/)
sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/
Grant permission on the JAR
sudo chmod 755 kite-data-s3-1.1.0.jar
Use the s3n connector for the import:
sqoop import \
--connect "jdbc:postgresql://:/?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username \
--password-file \
--table addresses \
--target-dir s3n://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Source: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
There are two ways to Sqoop to Parquet:
Using --as-parquetfile
Using HCatalog (see the sketch after this list)
But either way, it's not possible to Sqoop directly to Parquet in EMR 5.x.
Problems with the two approaches:
Sqoop uses the Kite SDK to read/write Parquet, and it has some limitations, so --as-parquetfile cannot be used. According to AWS Support, EMR will remove the Kite SDK in the future.
Parquet support through HCatalog was added for Hive v2.4.0 and v2.3.7 (see the corresponding Hive JIRA) and for Hive v3.0.0 (Hive JIRA). But EMR 5.x uses Hive version 2.3.5.
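For reference, the HCatalog route would look roughly like this on a Hive version that supports it (a hedged sketch; the HCatalog database/table names are assumptions, and per the above it does not work on EMR 5.x):
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--hcatalog-database default \
--hcatalog-table addresses \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as parquetfile" \
--num-mappers 100 \
--split-by id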
A workaround for now in EMR 5.x:
Use an intermediate text table to pull the data, then use a separate Hive query to copy the data from the text table into the desired Parquet table.
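A rough sketch of that workaround (the column definitions, intermediate path, and table names are assumptions, not taken from the original post):
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/addresses_text \
--as-textfile \
--fields-terminated-by ',' \
--num-mappers 100 \
--split-by id
hive -e "CREATE EXTERNAL TABLE addresses_text (id INT, street STRING, city STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://my-bucket/data/addresses_text'; CREATE TABLE addresses_parquet STORED AS PARQUET AS SELECT * FROM addresses_text;"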
You'll need to change the --target-dir protocol from s3 to s3a:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
@Makubex, I was able to import after changing the URI to s3a,
but the time taken by the import job is very high.
I am using EMR 5.26.0. Do I need to make any configuration changes to improve the time?
Please try executing the Sqoop command as specified below:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp
Do make sure the target directory doesn't already exist in S3.
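For example, it can be checked and cleared beforehand (a hedged example, assuming the AWS CLI is configured on the cluster):
aws s3 ls s3://my-bucket/data/temp/
aws s3 rm --recursive s3://my-bucket/data/temp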
I have run the Aerospike server inside a Docker container using the command below.
$ docker run -d -p 3000:3000 -p 3001:3001 -p 3002:3002 -p 3003:3003 -p 8081:8081 --name aerospike aerospike/aerospike-server
89b29f48c6bce29045ea0d9b033cd152956af6d7d76a9f8ec650067350cbc906
It is running successfully. I verified it using the command below.
$ docker ps
CONTAINER ID   IMAGE                        COMMAND                CREATED              STATUS              PORTS                                                      NAMES
89b29f48c6bc   aerospike/aerospike-server   "/entrypoint.sh asd"   About a minute ago   Up About a minute   0.0.0.0:3000-3003->3000-3003/tcp, 0.0.0.0:8081->8081/tcp   aerospike
I'm able to connect to it successfully with aql.
$ aql
Aerospike Query Client
Version 3.13.0.1
C Client Version 4.1.6
Copyright 2012-2016 Aerospike. All rights reserved.
aql>
But when I launch AMC for the Aerospike server in Docker, it hangs and does not display any data. I've attached a screenshot.
Did I miss any configuration? Why is it not loading any data?
You can try the following:
version: "3.9"
services:
aerospike:
image: "aerospike:ce-6.0.0.1"
environment:
NAMESPACE: testns
ports:
- "3000:3000"
- "3001:3001"
- "3002:3002"
amc:
image: "aerospike/amc"
links:
- "aerospike:aerospike"
ports:
- "8081:8081"
Then go to http://localhost:8081 and enter "aerospike:3000" in the connect window.
Preamble: I'm new to Hadoop/Hive. I have installed standalone Hadoop and now am trying to get Hive to work. I keep getting an error about initializing the metastore and cannot seem to figure out how to resolve it. (Hadoop 2.7.2 and Hive 2.0)
HADOOP_HOME and HIVE_HOME are set
ubuntu15-laptop: ~ $>echo $HADOOP_HOME
/usr/hadoop/hadoop-2.7.2
ubuntu15-laptop: ~ $>echo $HIVE_HOME
/usr/hive
HDFS is working
ubuntu15-laptop: ~ $>hadoop fs -ls /
Found 2 items
drwxrwxr-x - testuser supergroup 0 2016-04-13 21:37 /tmp
drwxrwxr-x - testuser supergroup 0 2016-04-13 21:38 /user
ubuntu15-laptop: ~ $>hadoop fs -ls /user
Found 1 items
drwxrwxr-x - testuser supergroup 0 2016-04-13 21:38 /user/hive
ubuntu15-laptop: ~ $>hadoop fs -ls /user/hive
Found 1 items
drwxrwxr-x - testuser supergroup 0 2016-04-13 21:38 /user/hive/warehouse
ubuntu15-laptop: ~ $>groups
testuser adm cdrom sudo dip plugdev lpadmin sambashare
Hive is not working; it says I need to initialize my metastore
ubuntu15-laptop: ~ $>hive
Logging initialized using configuration in
jar:file:/usr/hive/lib/hive-common-2.0.0.jar!/hive-log4j2.properties
Exception in thread "main" java.lang.RuntimeException: Hive metastore database
is not initialized. Please use schematool (e.g. ./schematool -initSchema
-dbType ...) to create the schema. If needed, don't forget to include the
option to auto-create the underlying database in your JDBC connection string
(e.g. ?createDatabaseIfNotExist=true for mysql)
So I try to initialize it using Postgres, but schematool tries to use Derby
ubuntu15-laptop: ~ $>schematool -initSchema -dbType postgres
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.0.0
Initialization script hive-schema-2.0.0.postgres.sql
Error: Syntax error: Encountered "statement_timeout" at line 1, column 5.
(state=42X01,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization
FAILED! Metastore state would be inconsistent !!
*** schemaTool failed ***
So I change hive-site.xml to use the Postgres drivers, etc., but because I don't have the drivers installed, it fails
ubuntu15-laptop: ~ $>cp /usr/hive/conf/hive-site.xml.templ /usr/hive/conf/hive-site.xml
ubuntu15-laptop: ~ $>schematool -initSchema -dbType postgres
Metastore connection URL: jdbc:postgresql://localhost:5432/hivedb
Metastore Connection Driver : org.postgresql.Driver
Metastore connection User: 123456
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to load driver
*** schemaTool failed ***
So then I try to use Derby.
First, move the hive-site.xml out of the way again so the default is Derby:
ubuntu15-laptop: ~ $>mv /usr/hive/conf/hive-site.xml /usr/hive/conf/hive-site.xml.templ
Then I try initializing again with Derby, but it appears to already be initialized, per the error "Error: FUNCTION 'NUCLEUS_ASCII' already exists":
ubuntu15-laptop: ~ $>schematool -initSchema -dbType derby
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.0.0
Initialization script hive-schema-2.0.0.derby.sql
Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization
FAILED! Metastore state would be inconsistent !!
*** schemaTool failed ***
I've been at this for two days. Any help would be very much appreciated.
So..
Here's what happened.
After installing hive, the first thing I did was run hive, which attempted to create/initialize the metastore_db, but apparently didn't get it right. On that initial run, I got this error:
Exception in thread "main" java.lang.RuntimeException: Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql)
Running hive, even though it failed, created a metastore_db directory in the directory from which I ran hive:
ubuntu15-laptop: ~ $>ls -l |grep meta
drwxrwxr-x 5 testuser testuser 4096 Apr 14 12:44 metastore_db
So when I then tried running
ubuntu15-laptop: ~ $>schematool -initSchema -dbType derby
The metastore already existed, but not in complete form.
Soooooo the answer is:
Before you run hive for the first time, run
schematool -initSchema -dbType derby
If you already ran hive and then tried to initSchema and it's failing:
mv metastore_db metastore_db.tmp
Re-run
schematool -initSchema -dbType derby
Run hive again
Also of note: if you change directories, the metastore_db created above won't be found! I'm sure there's a good reason for this that I don't know yet, because I'm literally trying to use Hive for the first time today. Ahh, here's information on this: "metastore_db created wherever I run Hive".
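One way around that per-directory behaviour (a hedged sketch, not from the original answer: pin the embedded Derby metastore to a fixed path in hive-site.xml; the path here is an assumption):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/testuser/hive/metastore_db;create=true</value>
</property>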