Sqoop logs in Amazon EMR

I'm running a sqoop job in EMR:
sqoop import --connect 'jdbc:sqlserver://server.com:1433;databaseName=db' --table myTable --target-dir s3://mylocation --username admin --password pass
It was working fine in previous runs, but now it's stuck for one of the tables and doesn't throw any errors. After I run the job, it hangs at:
17/03/07 13:28:19 INFO mapreduce.Job: The url to track the job: http://xxx:20888/proxy/application_1488891031868_0010/
17/03/07 13:28:19 INFO mapreduce.Job: Running job: job_1488891031868_0010
How can I see the detailed log and find out what went wrong? Thanks.

Add --verbose to your sqoop command to get extended logs. It will print more information about the classpath, split values, the JDBC connection, the MapReduce job, etc.
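For example, using the command from the question unchanged except for the added flag:
sqoop import --connect 'jdbc:sqlserver://server.com:1433;databaseName=db' --table myTable --target-dir s3://mylocation --username admin --password pass --verbose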

Related

Sqoop import postgres to S3 failing

I'm currently importing Postgres data into HDFS and planning to move the storage from HDFS to S3. When I provide an S3 location, the sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
The exception is:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
at org.kitesdk.data.Datasets.exists(Datasets.java:624)
at org.kitesdk.data.Datasets.exists(Datasets.java:646)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Note: the same sqoop command with an HDFS target dir works. I'm also able to write to the S3 bucket manually from a cluster node (using the aws s3 command).
The Kite SDK has been upgraded. All you have to do is download the new SDK onto the EMR cluster and run the sqoop command again.
Use wget to download kite-data-s3-1.1.0.jar
wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar
Move the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/)
sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/
Grant permission on the JAR
sudo chmod 755 kite-data-s3-1.1.0.jar
Run the import using the s3n connector
sqoop import \
--connect "jdbc:postgresql://:/?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username \
--password-file \
--table addresses \
--target-dir s3n://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Source: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
There are two ways to sqoop to Parquet:
Using --as-parquetfile
Using HCatalog
But either way, it is not possible to sqoop directly to Parquet on EMR 5.x.
Problems with both approaches:
Sqoop uses the Kite SDK to read/write Parquet, and it has some limitations, so --as-parquetfile can't be used. According to AWS Support, EMR will remove the Kite SDK in a future release.
Parquet support through HCatalog was added in Hive v2.4.0/v2.3.7 and in Hive v3.0.0 (see the corresponding JIRA tickets), but EMR 5.x uses Hive 2.3.5.
What could be a workaround for now on EMR 5.x:
Use an intermediate text table to pull the data, then use a separate Hive query to copy the data from the text table into the desired Parquet table, as sketched below.
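A minimal sketch of that workaround, reusing the command from the question (the staging path, column list, and delimiter are illustrative assumptions, not taken from the original table):
# 1) Import as plain text instead of Parquet (drop --as-parquetfile)
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/addresses_text \
--num-mappers 100 \
--split-by id \
--fields-terminated-by ','
# 2) Point an external Hive text table at that location, then copy the data into a Parquet table
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS addresses_text (id BIGINT, street STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/data/addresses_text';
CREATE TABLE IF NOT EXISTS addresses_parquet STORED AS PARQUET AS SELECT * FROM addresses_text;
"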
You'll need to change the target-dir scheme from s3 to s3a:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
@Makubex, I was able to import after changing the URI scheme to s3a,
but the time taken by the import job is too high.
I am using EMR 5.26.0. Do I need to make any configuration changes to improve the time?
Please try executing the sqoop command as specified below:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp
Do make sure the target directory doesn't exist in S3
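For example, you can check for and remove it first (bucket and path taken from the command above):
aws s3 ls s3://my-bucket/data/temp/
aws s3 rm --recursive s3://my-bucket/data/temp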

Google DataProc Hive and Presto query doesn't work

I have a Google DataProc cluster with Presto installed as an optional component. I created an external table in Hive, and its size is ~1GB. While the table is queryable (for example, group-by statements, distinct, etc. succeed), I have problems performing a simple select * from tableA with both Hive and Presto:
For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
This happens even though I've updated the configuration in mapred-site.xml as follows:
mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
For Presto, statements such as group-by and distinct similarly work. However, select * from tableA hangs forever at about RUNNING 60% every time until it times out, and I get the same issue regardless of whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table can have such issue. Any help is appreciated, thank you!
The Presto CLI binary /usr/bin/presto specifies a jvm -Xmx argument inline (it uses some tricks to bootstrap itself as a java binary); unfortunately, that -Xmx is not normally fetched from /opt/presto-server/etc/jvm.config like the settings for the actual presto-server.
In your case, if you're selecting everything from a 1G Parquet table, you're probably actually dealing with something like 6G of uncompressed text, and you're trying to stream all of that to the console output. This is also likely not going to work with Dataproc job submission, because the streamed output is designed to print out human-readable amounts of data and will slow down considerably when dealing with non-human amounts of data.
If you want to still try doing that with the CLI, you can run:
sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto
to modify the JVM settings for the CLI on the master before starting it back up. You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6G of text streaming through your screen.
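For example, something along these lines on the master (standard Presto CLI options; the hive catalog, default schema, and output path are assumptions based on the question):
/usr/bin/presto --server localhost:8080 --catalog hive --schema default \
--execute "SELECT * FROM tableA" --output-format CSV > /tmp/tableA.csv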
I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server, which OOMed. To avoid that, you can query the data from the cluster directly without going through the server.
Try following the Dataproc Presto tutorial (the Presto CLI queries section) and run these commands from your local machine:
gcloud compute ssh <master-node> \
--project=${PROJECT} \
--zone=${ZONE} \
-- -D 1080 -N
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default

Error while using sqoop to copy data to s3

I am using sqoop to copy a Postgres table to S3 using the following command:
sqoop import -m 1 --connect jdbc:postgresql://xx.us-west-2.rds.amazonaws.com:5432/prod_db --username user_ro --password user_pwd --table content --target-dir s3://test/user/sqoop_test --as-avrodatafile
This works the first time. Before the next execution I deleted the target directory using:
aws s3 rm s3://test/user/sqoop_test
The next execution of sqoop results in the following error:
18/07/21 05:31:53 ERROR tool.ImportTool: Encountered IOException running import job: com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: Directory 'user/sqoop_test' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:453)
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:380)
I have also tried "emrfs delete..." followed by "emrfs import..." and "emrfs sync..", but that didn't resolve the problem. Any help will be appreciated.
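For reference, the emrfs commands for that prefix would look roughly like this (illustrative only; the exact options used are elided above):
emrfs delete s3://test/user/sqoop_test
emrfs sync s3://test/user/sqoop_test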

sqoop failed to overwrite

I'm using the command below to import data from SQL Server to Azure Blob Storage:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password"
--username test --password "password" --query "select top 5 * from employ where \$CONDITIONS" --delete-target-dir --target-dir 'wasb://sample#workingclusterblob.blob.core.windows.net/source/employ'
-m 1
I'm getting the error below:
18/01/30 03:35:45 INFO tool.ImportTool: Destination directory wasb://sample#workingclusterblob.blob.core.windows.net/source/employ is not present, hence not deleting.
18/01/30 03:35:45 INFO mapreduce.ImportJobBase: Beginning query import.
18/01/30 03:35:46 INFO client.AHSProxy: Connecting to Application History server at headnodehost/10.0.0.19:10200
18/01/30 03:35:46 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasb://sample#workingclusterblob.blob.core.windows.net/source/employ already exists
The log statements are confusing: they say the directory is not present (hence not deleting), and then that it already exists while writing.
From the Apache Sqoop user guide:
By default, imports go to a new target location. If the destination
directory already exists in HDFS, Sqoop will refuse to import and
overwrite that directory’s contents. If you use the --append argument,
Sqoop will import data to a temporary directory and then rename the
files into the normal target directory in a manner that does not
conflict with existing filenames in that directory.
I'm not in an Azure environment to reproduce and validate the solution, but try adding --append to your sqoop import and let me know.
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:sqlserver://server-IP;database=database_name;username=user;password=password"
--username test --password "password" --query "select top 5 * from employ where \$CONDITIONS" --delete-target-dir --append --target-dir 'wasb://sample#workingclusterblob.blob.core.windows.net/source/employ'
-m 1

Sqoop hive import from mysql to hive is failing

I am trying to load a table from MySQL into Hive using --hive-import in Parquet format. We want to do incremental updates of the Hive table. When we try the command below, it fails with the error mentioned below. Can anybody please help here?
sqoop job --create users_test_hive -- import --connect 'jdbc:mysql://dbhost/dbname?characterEncoding=utf8&dontTrackOpenResources=true&defaultFetchSize=1000&useCursorFetch=true&useUnicode=yes&characterEncoding=utf8' --table users --incremental lastmodified --check-column n_last_updated --username username --password password --merge-key user_id --mysql-delimiters --as-parquetfile --hive-import --warehouse-dir /usr/hive/warehouse/ --hive-table users_test_hive
The error while running it:
16/02/27 21:33:17 INFO mapreduce.Job: Task Id : attempt_1454936520418_0239_m_000000_1, Status : FAILED
Error: parquet.column.ParquetProperties.newColumnWriteStore(Lparquet/schema/MessageType;Lparquet/column/page/PageWriteStore;I)Lparquet/column/ColumnWriteStore;