Problem importing all tables from MySQL to Hive using Apache Sqoop 1.4.7

I'm trying to import all tables from a MySQL database into a Hive database using the Apache Sqoop CLI. When I execute the following import command:
[hadoop@localhost bin]$ sqoop import-all-tables --connect jdbc:mysql://localhost/mysql --username root --password root
the import fails and I get the following error message at the end of the output:
20/05/09 23:06:27 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
.
.
.
20/05/09 23:06:38 ERROR tool.ImportAllTablesTool: Encountered IOException running import job:
java.io.IOException: Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
Prior to running Sqoop 1.4.7, I had the following installed and running:
[hduser@localhost ~]$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
[hadoop@localhost ~]$ hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar
MySQL Server version: 5.6.48
Hive-1.2.2
Given this configuration, how can I import all tables into Hive successfully?

It is possible to use a character (text) column as the split-by column, but Sqoop disallows it by default because lexicographic splits can be uneven. As your error message says:
Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
You only need to add -Dorg.apache.sqoop.splitter.allow_text_splitter=true
immediately after the Sqoop tool name (generic -D options must precede the tool-specific arguments), like this:
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://localhost/mysql \
--username root \
--password root
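If you would rather avoid splitting on text columns altogether, one alternative is to let Sqoop fall back to a single mapper for any table that lacks a numeric primary key. The flag below is a standard Sqoop option; treat this as a sketch to adapt to your setup:

```shell
# Fall back to one mapper for tables without a numeric primary key,
# instead of generating splits on a text column.
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/mysql \
  --username root \
  --password root \
  --autoreset-to-one-mapper
```

This trades parallelism for correctness on keyless tables, so large tables without numeric keys will import more slowly.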

Related

Sqoop Export Error: Mixed update/insert is not supported against the target database yet

I am trying to export my data from a Hive table to an RDBMS (Microsoft SQL Server 2016) using this command:
sqoop export \
--connect connectionStirng \
--username name \
--password password \
--table Lab_Orders \
--update-mode allowinsert \
--update-key priKey \
--driver net.sourceforge.jtds.jdbc.Driver \
--hcatalog-table lab_orders \
-m 4
I want to do an incremental export, so I have specified --update-mode and --update-key. However, when I run this command it fails with this error:
ERROR tool.ExportTool: Error during export:
Mixed update/insert is not supported against the target database yet
at org.apache.sqoop.manager.ConnManager.upsertTable(ConnManager.java:684)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:73)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
I went through all the solutions I could find, including removing --driver, but if I remove the driver Sqoop doesn't recognize the RDBMS table. I am using Sqoop 1.4.6-cdh5.11.1 on a Cloudera cluster.
Can someone please help with a possible solution?

Sqoop import postgres to S3 failing

I'm currently importing Postgres data into HDFS and am planning to move the storage from HDFS to S3. When I try to provide an S3 location, however, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to the S3 bucket from all nodes in the cluster.
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
The exception is:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
at org.kitesdk.data.Datasets.exists(Datasets.java:624)
at org.kitesdk.data.Datasets.exists(Datasets.java:646)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Note: the same Sqoop command works with an HDFS target directory. I'm also able to write to the S3 bucket manually from the cluster nodes (using the aws s3 CLI).
The Kite SDK has been upgraded. All you have to do is download the new SDK onto the EMR cluster and run the sqoop command again.
Use wget to download kite-data-s3-1.1.0.jar:
wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar
Move the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/):
sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/
Grant permissions on the JAR:
sudo chmod 755 kite-data-s3-1.1.0.jar
Then use the s3n connector for the import:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3n://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Source: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
There are two ways to Sqoop to Parquet:
Using --as-parquetfile
Using HCatalog
But either way, it is not possible to Sqoop directly to Parquet on EMR 5.x.
The problem with both approaches:
Sqoop uses the Kite SDK to read and write Parquet, and that SDK has some limitations, so --as-parquetfile cannot be used. According to AWS Support, EMR will remove the Kite SDK in the future.
Parquet support through HCatalog was added for Hive v2.4.0 and v2.3.7 (jira card) and Hive v3.0.0 (jira card), but EMR 5.x uses Hive version 2.3.5.
A workaround for now on EMR 5.x: use an intermediate text table to pull the data, then a separate Hive query to copy the data from the text table into the desired Parquet table.
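The two-step workaround can be sketched as follows. This is only a sketch: the staging location, table names, and two-column schema are hypothetical, and you would match the Hive DDL to your real addresses schema and field delimiter:

```shell
# Step 1: import from Postgres into a plain-text staging location on S3,
# avoiding the Kite-based Parquet writer entirely.
sqoop import \
  --connect "jdbc:postgresql://<machine_ip>:<port>/<database>" \
  --username <username> \
  --password-file <password_file_path> \
  --table addresses \
  --target-dir s3://my-bucket/data/staging/addresses \
  --num-mappers 100 \
  --split-by id \
  --as-textfile

# Step 2: expose the text files as an external Hive table, then copy the
# rows into a Parquet-backed table with a Hive query.
hive -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS addresses_text (id BIGINT, address STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://my-bucket/data/staging/addresses';

  CREATE TABLE IF NOT EXISTS addresses_parquet (id BIGINT, address STRING)
  STORED AS PARQUET;

  INSERT OVERWRITE TABLE addresses_parquet SELECT * FROM addresses_text;
"
```

The staging directory can be dropped once the Parquet table is populated.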
You'll need to change target-dir protocol from s3 to s3a:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
@Makubex, I was able to import after changing the URI scheme to s3a,
but the time taken by the import job is too high.
I am using EMR 5.26.0. Do I need to make any configuration changes to improve the import time?
Please try executing sqoop command as specified below :
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp
Do make sure the target directory doesn't already exist in S3.
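To guard against a leftover target directory, you can remove the prefix first with the standard AWS CLI. The bucket and prefix below are the ones from the example command:

```shell
# Delete the target prefix if it already exists, so the import does not
# fail with an "already exists" error; rm on a missing prefix is harmless.
aws s3 rm s3://my-bucket/data/temp/ --recursive
```

Run this from a node with the same S3 permissions the Sqoop job uses.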

sqoop failed to overwrite

I'm using the command below to import data from SQL Server into Azure Blob Storage:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password" \
--username test --password "password" \
--query "select top 5 * from employ where \$CONDITIONS" \
--delete-target-dir \
--target-dir 'wasb://sample@workingclusterblob.blob.core.windows.net/source/employ' \
-m 1
and I get the error below:
18/01/30 03:35:45 INFO tool.ImportTool: Destination directory wasb://sample@workingclusterblob.blob.core.windows.net/source/employ is not present, hence not deleting.
18/01/30 03:35:45 INFO mapreduce.ImportJobBase: Beginning query import.
18/01/30 03:35:46 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.19:10200
18/01/30 03:35:46 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasb://sample@workingclusterblob.blob.core.windows.net/source/employ already exists
The log statements are confusing: they say the directory is not present when deleting, yet that it already exists when writing.
From Apache Sqoop user guide:
By default, imports go to a new target location. If the destination
directory already exists in HDFS, Sqoop will refuse to import and
overwrite that directory’s contents. If you use the --append argument,
Sqoop will import data to a temporary directory and then rename the
files into the normal target directory in a manner that does not
conflict with existing filenames in that directory.
I'm not in an Azure environment to reproduce and validate the solution, but per the guide above, try replacing --delete-target-dir with --append in your sqoop import (the two options conflict) and let me know:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password" \
--username test --password "password" \
--query "select top 5 * from employ where \$CONDITIONS" \
--append \
--target-dir 'wasb://sample@workingclusterblob.blob.core.windows.net/source/employ' \
-m 1

Cannot find hive odbc connector error messages using unixODBC

I am trying to set up unixODBC to use the Hive driver connector from Cloudera (on an Ubuntu machine).
In my ~/.local/lib folder I have links to the .so files provided by Cloudera,
and the environment variable LD_LIBRARY_PATH contains /home/luca/.local/lib:/opt/cloudera/hiveodbc/lib/64/.
I created the file /etc/odbcinst.ini containing the following:
[hive]
Description = Cloudera ODBC Driver for Apache Hive (64-bit)
Driver = /home/luca/.local/lib/libclouderahiveodbc64.so
ODBCInstLib= /home/luca/.local/lib/libodbcinst.so
UsageCount = 1
DriverManagerEncoding=UTF-16
ErrorMessagesPath=/opt/cloudera/hiveodbc/ErrorMessages/
LogLevel=0
SwapFilePath=/tmp
and in my home folder I have .odbc.ini containing:
[hive]
Driver=hive
HOST=<thehost>
PORT=<theport>
Schema=<theschema>
FastSQLPrepare=0
UseNativeQuery=0
HiveServerType=2
AuthMech=2
#KrbHostFQDN=[Hive Server 2 Host FQDN]
#KrbServiceName=[Hive Server 2 Kerberos service name]
UID=<myuid>
When I test the connection using isql -v hive, I get the following error message:
[S1000][unixODBC][DSI] The error message NoSQLGetPrivateProfileString could not be found in the en-US locale. Check that /en-US/ODBCMessages.xml exists.
[ISQL]ERROR: Could not SQLConnect
How can I fix this issue (and why is the path absolute, /en-US/)?
The SQLGetPrivateProfileString symbol was not found in your ODBCInstLib library: either the library could not be loaded, or it does not contain that symbol.
Use strace isql -v hive 2>&1 | grep ini to see if your configuration file is being loaded, and strace isql -v hive 2>&1 | grep odbcinst.so to see where it is looking for the library.
Make sure the library exists at the given location and has the correct architecture: use file -L /home/luca/.local/lib/libodbcinst.so to check the architecture, and nm /home/luca/.local/lib/libodbcinst.so | grep SQLGetPrivateProfileString to check that it contains the symbol.
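Those checks can be gathered into one quick diagnostic run. The library path is the one from the odbcinst.ini in the question; note that nm -D is used here because the exported symbols of a shared library live in its dynamic symbol table (plain nm shows nothing on a stripped .so):

```shell
# Show which config files and driver-manager libraries isql touches.
strace isql -v hive 2>&1 | grep -E 'ini|odbcinst'

# Confirm the library exists, matches your architecture,
# and exports the symbol the Cloudera driver needs.
LIB=/home/luca/.local/lib/libodbcinst.so
file -L "$LIB"
nm -D "$LIB" | grep SQLGetPrivateProfileString
```

If the last grep prints nothing, point ODBCInstLib at a library that does export the symbol (e.g. the unixODBC-provided libodbcinst).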

How can I import a table from an RDBMS into Hive?

I am trying to import a table from MySQL to Hive, and I am getting the following error.
Can you please provide a solution for this?
bin/sqoop import --connect jdbc:mysql://202.63.155.22:3306/demo --username careuser -P --table caremanager --hive-import --verbose -m 1
13/12/30 02:42:05 WARN hive.TableDefWriter:
Column createddate had to be cast to a less precise type in Hive
13/12/30 02:42:05 ERROR tool.ImportTool: Encountered IOException running import job:
java.io.IOException: Cannot run program "hive": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
Sqoop is trying to launch hive to create the Hive table, but it cannot execute it. Is Hive installed on your machine, and is the Hive bin directory in your PATH environment variable? Please verify this; it should solve your problem.
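A minimal fix, assuming Hive is installed under /opt/hive (that path is an assumption; substitute your actual install directory):

```shell
# Make the `hive` launcher visible to Sqoop (install path is assumed).
export HIVE_HOME=/opt/hive
export PATH="$PATH:$HIVE_HOME/bin"

# Verify that `hive` now resolves before re-running the sqoop import.
command -v hive
```

Add the two export lines to the shell profile of the user running Sqoop so they survive new sessions.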