Sqoop import from Postgres to S3 failing - amazon-s3

I'm currently importing Postgres data to HDFS. I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster and I have read/write access to that S3 bucket from all nodes in the cluster.
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
The exception is:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
at org.kitesdk.data.Datasets.exists(Datasets.java:624)
at org.kitesdk.data.Datasets.exists(Datasets.java:646)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Note: the same Sqoop command with an HDFS target dir works. I'm also able to manually write to the S3 bucket from a cluster node (using the aws s3 command).
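For reference, the kind of manual write check I mean is something like the following (the object name is just an example):
echo "write test" > /tmp/s3_write_check.txt
aws s3 cp /tmp/s3_write_check.txt s3://my-bucket/data/s3_write_check.txt
aws s3 rm s3://my-bucket/data/s3_write_check.txt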

The Kite SDK has been upgraded. All you have to do is download the new SDK onto the EMR cluster and run the Sqoop command again.
Use wget to download kite-data-s3-1.1.0.jar:
wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar
Move the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/):
sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/
Grant permissions on the JAR:
sudo chmod 755 /usr/lib/sqoop/lib/kite-data-s3-1.1.0.jar
Use the s3n connector for the import:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3n://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Source: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/

There are two ways to sqoop to Parquet:
Using --as-parquetfile
Using HCatalog
But either way, it's not possible to sqoop directly to Parquet on EMR 5.x.
Problems with both approaches:
Sqoop uses the Kite SDK to read/write Parquet, and it has some limitations. So it's not possible to use --as-parquetfile. EMR will remove the Kite SDK in the future, as per AWS Support.
Parquet support through HCatalog was added in Hive v2.4.0 and v2.3.7 (jira card) and Hive v3.0.0 (jira card). But EMR 5.x uses Hive version 2.3.5.
What could be a workaround for now on EMR 5.x:
Use an intermediate text table to pull the data, then use a separate Hive query to copy the data from the text table into the desired Parquet table; a sketch follows below.
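A minimal sketch of that workaround (the staging paths, the comma delimiter, and the id/address columns are assumptions for illustration, not a verified recipe):
# Step 1: pull the table from Postgres as delimited text into a staging location
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/addresses_text \
--num-mappers 100 \
--split-by id \
--as-textfile \
--fields-terminated-by ','
# Step 2: expose the text files as an external Hive table, then copy them into a Parquet table
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS addresses_text (id INT, address STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/data/addresses_text';
CREATE TABLE IF NOT EXISTS addresses_parquet (id INT, address STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/data/addresses_parquet';
INSERT OVERWRITE TABLE addresses_parquet SELECT * FROM addresses_text;
"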

You'll need to change the --target-dir protocol from s3 to s3a:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile

@Makubex, I was able to import after adding s3a as the URI pattern,
but the time taken by the import job is too high.
I am using EMR 5.26.0. Do I need to make any configuration changes to improve the time?

Please try executing the Sqoop command as specified below:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp
Do make sure the target directory doesn't exist in S3.
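For example, a quick pre-check with the AWS CLI (the bucket/prefix below is just the placeholder from the question) might look like this:
# if the listing returns anything, clear the prefix before re-running the import
aws s3 ls s3://my-bucket/data/temp/
aws s3 rm s3://my-bucket/data/temp/ --recursive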

Related

Problem in importing all tables from MySQL to Hive using Apache Sqoop-1.4.7

I'm trying to import all tables from a MySQL database into a Hive database using the Apache Sqoop CLI. When I execute the following import command:
[hadoop@localhost bin]$ sqoop import-all-tables --connect jdbc:mysql://localhost/mysql --username root --password root
the import somehow fails and I get the following error message at the end of the output:
20/05/09 23:06:27 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
.
.
.
20/05/09 23:06:38 ERROR tool.ImportAllTablesTool: Encountered IOException running import job:
java.io.IOException: Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
Prior to running Sqoop 1.4.7, I had the following installed and up and running:
[hduser@localhost ~]$ java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
[hadoop@localhost ~]$ hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar
MySQL Server version: 5.6.48
Hive-1.2.2
Based on this configuration, how can I import all tables into Hive successfully?
It is possible to use a character (text) column as the split-by column.
As your error message says:
Generating splits for a textual index column allowed only in case of
"-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
you only need to add -Dorg.apache.sqoop.splitter.allow_text_splitter=true
right after the sqoop import-all-tables statement, like this:
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://localhost/mysql \
--username root \
--password root

Sqoop Export Error: Mixed update/insert is not supported against the target database yet

I am trying to export my data from a Hive table to an RDBMS (Microsoft SQL Server 2016) using this command:
sqoop export \
--connect connectionStirng \
--username name \
--password password \
--table Lab_Orders \
--update-mode allowinsert \
--update-key priKey \
--driver net.sourceforge.jtds.jdbc.Driver \
--hcatalog-table lab_orders \
-m 4
I want to do an incremental export, so I have specified update-mode and update-key. However, when I run this command it fails with this error:
ERROR tool.ExportTool: Error during export:
Mixed update/insert is not supported against the target database yet
at org.apache.sqoop.manager.ConnManager.upsertTable(ConnManager.java:684)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:73)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
I went through all possible solutions, including removing --driver. If I remove the driver, it doesn't recognize the RDBMS table. I am using Sqoop 1.4.6-cdh5.11.1 on a Cloudera cluster.
Can someone please help with a possible solution?

Error while using sqoop to copy data to s3

I am using Sqoop to copy a Postgres table to S3 using the following command:
sqoop import -m 1 --connect jdbc:postgresql://xx.us-west-2.rds.amazonaws.com:5432/prod_db --username user_ro --password user_pwd --table content --target-dir s3://test/user/sqoop_test --as-avrodatafile
This works the first time. Before the next execution, I deleted the target directory using:
aws s3 rm s3://test/user/sqoop_test
The next execution of Sqoop results in the following error:
18/07/21 05:31:53 ERROR tool.ImportTool: Encountered IOException running import job: com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: Directory 'user/sqoop_test' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:453)
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:380)
I have also tried doing "emrfs delete..." followed by "emrfs import..." and "emrfs sync...", but that didn't help resolve the problem. Any help will be appreciated.

sqoop import to hive failing with java.net.UnknownHostException

I have a sqoop import script to load data from Oracle to Hive.
Query
sqoop-import -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/abc" -D mapreduce.job.queuename="queue_name" \
--connect ${jdbc_url} \
--username ${username} \
--password ${password} \
--query "$query" \
--target-dir ${target_hdfs_dir} \
--delete-target-dir \
--fields-terminated-by ${hiveFieldsDelimiter} \
--hive-import ${o_write} \
--null-string '' \
--hive-table ${hiveTableName} \
$split_opt \
$numMapper\
Logs
18/02/01 12:38:45 DEBUG hive.TableDefWriter: Create statement: CREATE TABLE IF NOT EXISTS db_test.test ( USR_ID STRING, ENT_TYPE STRING, VAL STRING) COMMENT 'Imported by sqoop on 2018/02/01 12:38:45' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012' STORED AS TEXTFILE
18/02/01 12:38:45 DEBUG hive.TableDefWriter: Load statement: LOAD DATA INPATH 'hdfs://nn-sit/user/test_tmp' OVERWRITE INTO TABLE db_test.test
18/02/01 12:38:45 INFO hive.HiveImport: Loading uploaded data into Hive
18/02/01 12:38:45 DEBUG hive.HiveImport: Using in-process Hive instance.
18/02/01 12:38:45 DEBUG util.SubprocessSecurityManager: Installing subprocess security manager
Logging initialized using configuration in jar:file:/path/demoapp/lib/demoapp-0.0.1-SNAPSHOT.jar!/hive-log4j.properties
OK
Time taken: 2.189 seconds
Loading data to table db_test.test
Failed with exception java.net.UnknownHostException: nn-dev
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Observations:
We recently migrated to a new cluster. The new namenode is nn-sit; earlier it was nn-dev. Can somebody enlighten me as to where Sqoop is reading the older namenode name nn-dev from, as shown in the error message:
Failed with exception java.net.UnknownHostException: nn-dev
The import of the data from Oracle to the target HDFS path hdfs://nn-sit/user/test_tmp is successful. However, it fails while loading into the Hive table.
The below individual command is successful from beeline.
LOAD DATA INPATH 'hdfs://nn-sit/user/test_tmp' OVERWRITE INTO TABLE db_test.test
Any help would be greatly appreciated.

sqoop failed to overwrite

I'm using the below command for importing data from SQL Server to Azure Blob Storage:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password" \
--username test --password "password" --query "select top 5 * from employ where \$CONDITIONS" --delete-target-dir --target-dir 'wasb://sample@workingclusterblob.blob.core.windows.net/source/employ' \
-m 1
I'm getting the below error:
18/01/30 03:35:45 INFO tool.ImportTool: Destination directory wasb://sample@workingclusterblob.blob.core.windows.net/source/employ is not present, hence not deleting.
18/01/30 03:35:45 INFO mapreduce.ImportJobBase: Beginning query import.
18/01/30 03:35:46 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.19:10200
18/01/30 03:35:46 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasb://sample@workingclusterblob.blob.core.windows.net/source/employ already exists
The log statements are confusing: they say the directory is not present (hence not deleting), yet then report that it already exists when writing.
From the Apache Sqoop user guide:
By default, imports go to a new target location. If the destination
directory already exists in HDFS, Sqoop will refuse to import and
overwrite that directory’s contents. If you use the --append argument,
Sqoop will import data to a temporary directory and then rename the
files into the normal target directory in a manner that does not
conflict with existing filenames in that directory.
I'm not in an Azure environment to reproduce and validate the solution, but try adding --append to your sqoop import and let me know.
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password" \
--username test --password "password" --query "select top 5 * from employ where \$CONDITIONS" --delete-target-dir --append --target-dir 'wasb://sample@workingclusterblob.blob.core.windows.net/source/employ' \
-m 1