I have created a Dataproc cluster with an updated init action to install Datalab.
Everything works fine, except that when I query a Hive table from the Datalab notebook with
hc.sql("""select * from invoices limit 10""")
I get a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.
Create cluster
gcloud beta dataproc clusters create ds-cluster \
--project my-exercise-project \
--region us-west1 \
--zone us-west1-b \
--bucket dataproc-datalab \
--scopes cloud-platform \
--num-workers 2 \
--enable-component-gateway \
--initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'CONDA_PACKAGES="python==3.5"' \
--metadata gcs-connector-version=1.9.11
datalab-updated.sh
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
In the Datalab notebook
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show tables in default""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""")
UPDATE
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "~/Downloads/my-exercise-project-f47054fc6fd8.json")
UPDATE 2 (datalab-updated.sh)
function run_datalab(){
if docker run -d --restart always --net=host \
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
echo 'Cloud Datalab Jupyter server successfully deployed.'
else
err 'Failed to run Cloud Datalab'
fi
}
You should use the Datalab initialization action to install Datalab on a Dataproc cluster:
gcloud dataproc clusters create ${CLUSTER} \
--image-version=1.3 \
--scopes cloud-platform \
--initialization-actions=gs://dataproc-initialization-actions/datalab/datalab.sh
After this, Hive works with GCS out of the box in Datalab:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""SHOW TABLES IN default""").show()
Output:
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Creating an external table on GCS using Hive in Datalab:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://<BUCKET>/datasets/invoices'""")
Output:
DataFrame[]
Querying a GCS table using Hive in Datalab:
hc.sql("""SELECT * FROM invoices LIMIT 10""")
Output:
DataFrame[SubmissionDate: date, TransactionAmount: double, TransactionType: string]
If you want to use Hive in Datalab, you have to enable the Hive metastore:
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"
In your case it will be
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://$PROJECT-warehouse/datasets/invoices'""")
And make sure to add the following settings to enable GCS:
sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# These two are required if you are authenticating with a service account:
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required if you are using OAuth:
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
Related
I am trying to export my data from a Hive table to an RDBMS (Microsoft SQL Server 2016) using this command:
sqoop export \
--connect connectionStirng \
--username name \
--password password \
--table Lab_Orders \
--update-mode allowinsert \
--update-key priKey \
--driver net.sourceforge.jtds.jdbc.Driver \
--hcatalog-table lab_orders \
-m 4
I want to do an incremental export, so I have specified --update-mode and --update-key. However, when I run this command it fails with this error:
ERROR tool.ExportTool: Error during export:
Mixed update/insert is not supported against the target database yet
at org.apache.sqoop.manager.ConnManager.upsertTable(ConnManager.java:684)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:73)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:99)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
I went through all possible solutions, including removing --driver. If I remove the driver, it doesn't recognize the RDBMS table. I am using
Sqoop 1.4.6-cdh5.11.1 on a Cloudera cluster.
Can someone please help with a possible solution?
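A commonly suggested workaround when the connection manager does not support mixed update/insert is to export insert-only into a staging table, then perform the upsert inside SQL Server with MERGE. A sketch only; the staging table Lab_Orders_Stage, the column col1, and the sqlcmd connection details are hypothetical and must be adapted to the real schema:

```shell
# 1) Plain insert-only export into a staging table
#    (no --update-mode/--update-key, so the generic manager is happy).
sqoop export \
  --connect "$connectionString" \
  --username name \
  --password password \
  --table Lab_Orders_Stage \
  --driver net.sourceforge.jtds.jdbc.Driver \
  --hcatalog-table lab_orders \
  -m 4

# 2) Upsert from the staging table into the target with T-SQL MERGE,
#    then clear the staging table for the next run.
sqlcmd -S <server> -d <database> -Q "
  MERGE INTO Lab_Orders AS t
  USING Lab_Orders_Stage AS s ON t.priKey = s.priKey
  WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
  WHEN NOT MATCHED THEN INSERT (priKey, col1) VALUES (s.priKey, s.col1);
  TRUNCATE TABLE Lab_Orders_Stage;"
```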
I'm currently importing Postgres data to HDFS. I'm planning to move the storage from HDFS to S3. When I try to provide an S3 location, the Sqoop job fails. I'm running it on an EMR (emr-5.27.0) cluster, and I have read/write access to that S3 bucket from all nodes in the cluster.
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Exception is,
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/10/21 09:27:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/10/21 09:27:33 INFO manager.SqlManager: Using default fetchSize of 1000
19/10/21 09:27:33 INFO tool.CodeGenTool: Beginning code generation
19/10/21 09:27:33 INFO tool.CodeGenTool: Will generate java class as codegen_addresses
19/10/21 09:27:34 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:34 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/10/21 09:27:37 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/412c4a70c10c6569443f4c38dbdc2c99/codegen_addresses.jar
19/10/21 09:27:37 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
19/10/21 09:27:37 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
19/10/21 09:27:37 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
19/10/21 09:27:37 INFO mapreduce.ImportJobBase: Beginning import of addresses
19/10/21 09:27:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "addresses" AS t LIMIT 1
19/10/21 09:27:39 INFO conf.HiveConf: Found configuration file file:/etc/hive/conf.dist/hive-site.xml
19/10/21 09:27:39 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:s3://<bucket>/<data>/temp
Check that JARs for s3 datasets are on the classpath
at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128)
at org.kitesdk.data.Datasets.exists(Datasets.java:624)
at org.kitesdk.data.Datasets.exists(Datasets.java:646)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:118)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:132)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:264)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Note: the same Sqoop command with an HDFS target dir works. I'm also able to write to the S3 bucket manually from the cluster nodes (using the aws s3 command).
The Kite SDK has been upgraded. All you have to do is download the new SDK onto the EMR cluster and run the sqoop command again.
Use wget to download the kite-data-s3-1.1.0.jar
wget https://repo1.maven.org/maven2/org/kitesdk/kite-data-s3/1.1.0/kite-data-s3-1.1.0.jar
Move the JAR to the Sqoop library directory (/usr/lib/sqoop/lib/)
sudo cp kite-data-s3-1.1.0.jar /usr/lib/sqoop/lib/
Grant permission on the JAR
sudo chmod 755 kite-data-s3-1.1.0.jar
Use the s3n connector to run the import
sqoop import \
--connect "jdbc:postgresql://:/?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username \
--password-file \
--table addresses \
--target-dir s3n://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
Source: https://aws.amazon.com/premiumsupport/knowledge-center/unknown-dataset-uri-pattern-sqoop-emr/
There are two ways to Sqoop to Parquet:
Using --as-parquetfile
Using HCatalog
But either way, it's not possible to Sqoop directly to Parquet in EMR 5.x.
Problems with both approaches:
Sqoop uses the Kite SDK to read/write Parquet, and it has some limitations. It's not possible to use --as-parquetfile, and EMR will remove the Kite SDK in the future, as told by AWS Support.
Parquet support through HCatalog has been added for Hive v2.4.0 and v2.3.7 (jira card) and for Hive v3.0.0 (jira card). But EMR 5.x uses Hive version 2.3.5.
What could be a workaround for now in EMR 5.x:
Use an intermediate text table to pull the data, then use a separate Hive query to copy the data from the text table to the desired Parquet table.
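The two-step workaround above might look like this. A sketch only; addresses_text, addresses_parquet, the staging path, and the elided column list are placeholders to be filled in from the real schema:

```shell
# Step 1: import as plain text (no Kite SDK involved) into a staging location.
sqoop import \
  --connect "$jdbc_url" \
  --username "$username" \
  --password-file "$password_file" \
  --table addresses \
  --target-dir s3://my-bucket/data/addresses_text \
  --as-textfile \
  --num-mappers 100 \
  --split-by id

# Step 2: expose the text files as an external table, then copy into
# the Parquet table with a Hive query.
hive -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS addresses_text (...)  -- columns elided; match the source schema
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://my-bucket/data/addresses_text';
  INSERT OVERWRITE TABLE addresses_parquet SELECT * FROM addresses_text;"
```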
You'll need to change the target-dir protocol from s3 to s3a:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--table addresses \
--target-dir s3a://my-bucket/data/temp \
--num-mappers 100 \
--split-by id \
--as-parquetfile
@Makubex, I was able to import after adding s3a as the URI pattern,
but the time taken by the import job is too high.
I am using EMR 5.26.0. Do I need to make any configuration changes to improve the time?
Please try executing the sqoop command as specified below:
sqoop import \
--connect "jdbc:postgresql://<machine_ip>:<port>/<database>?sslfactory=org.postgresql.ssl.NonValidatingFactory&ssl=true" \
--username <username> \
--password-file <password_file_path> \
--num-mappers 100 \
--split-by id \
--table addresses \
--as-parquetfile \
--target-dir s3://my-bucket/data/temp
Do make sure the target directory doesn't exist in S3.
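That pre-existence check can be done with the AWS CLI, using the same bucket/path as in the question:

```shell
# List the target prefix; any output means the directory exists and
# Sqoop will refuse to import into it.
aws s3 ls s3://my-bucket/data/temp/

# Remove it (and everything under it) before re-running the import.
aws s3 rm s3://my-bucket/data/temp --recursive
```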
I've created an S3 Batch Operations job using an S3 inventory JSON manifest that points to a few billion objects in my S3 bucket.
The job has been stuck in "Preparing" status for 24 hours now.
What preparation times should I expect at these kinds of volumes?
Would the preparation time shorten if, instead of providing the JSON manifest, I joined all the inventory CSVs into one uber-CSV?
I've used awscli to create the request like so:
aws s3control create-job \
--region ... \
--account-id ... \
--operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::some-bucket","MetadataDirective":"COPY"}}' \
--manifest '{"Spec":{"Format":"S3InventoryReport_CSV_20161130"},"Location":{"ObjectArn":"arn:aws:s3:::path_to_manifest/manifest.json","ETag":"..."}}' \
--report '{"Bucket":"arn:aws:s3:::some-bucket","Prefix":"reports", "Format":"Report_CSV_20180820", "Enabled":true, "ReportScope":"AllTasks"}' \
--priority 42 \
--role-arn ... \
--client-request-token $(uuidgen) \
--description "Batch request"
After ~4 days the tasks completed the preparation phase and were ready to run.
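While waiting, the job's phase can be polled with describe-job; a sketch using the same region/account placeholders as the create-job call, with a hypothetical job id:

```shell
# Returns the job's Status (e.g. Preparing, Ready, Active, Complete)
# along with a ProgressSummary of task counts.
aws s3control describe-job \
  --region ... \
  --account-id ... \
  --job-id <job-id-returned-by-create-job>
```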
How do I return a list of RUNNING jobs using the BigQuery CLI?
This doesn't work: bq ls -j -a --filter='states:RUNNING'
The filter flag with state:[STATE] applies to transfer runs. Also, the --filter flag filters on labels.
A workaround could be to run bq ls -j -a | grep RUNNING if you're on a Linux command line, or to run a curl command against the BigQuery API. For instance:
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" 'https://www.googleapis.com/bigquery/v2/projects/<Your Project>/jobs?alt=json&allUsers=true&projection=full&stateFilter=RUNNING'
Additionally, you could file a feature request asking the BigQuery engineering team to consider adding an option to filter running jobs with the bq ls -j command.
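If jq is available, the curl output can be trimmed down to just the running job ids. The JSON below is a hand-made sample mimicking the shape of the jobs.list response (fields trimmed), not real output:

```shell
# Hand-made sample of the jobs.list response shape.
cat > /tmp/jobs.json <<'EOF'
{"jobs":[{"id":"myproject:job_1","status":{"state":"RUNNING"}},
         {"id":"myproject:job_2","status":{"state":"DONE"}}]}
EOF

# Keep only jobs whose status.state is RUNNING and print their ids.
jq -r '.jobs[] | select(.status.state == "RUNNING") | .id' /tmp/jobs.json
# -> myproject:job_1
```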
I have a sqoop import script to load data from Oracle to Hive.
Query
sqoop-import -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/abc" -D mapreduce.job.queuename="queue_name" \
--connect ${jdbc_url} \
--username ${username} \
--password ${password} \
--query "$query" \
--target-dir ${target_hdfs_dir} \
--delete-target-dir \
--fields-terminated-by ${hiveFieldsDelimiter} \
--hive-import ${o_write} \
--null-string '' \
--hive-table ${hiveTableName} \
$split_opt \
$numMapper
Logs
18/02/01 12:38:45 DEBUG hive.TableDefWriter: Create statement: CREATE TABLE IF NOT EXISTS db_test.test ( USR_ID STRING, ENT_TYPE STRING, VAL STRING) COMMENT 'Imported by sqoop on 2018/02/01 12:38:45' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012' STORED AS TEXTFILE
18/02/01 12:38:45 DEBUG hive.TableDefWriter: Load statement: LOAD DATA INPATH 'hdfs://nn-sit/user/test_tmp' OVERWRITE INTO TABLE db_test.test
18/02/01 12:38:45 INFO hive.HiveImport: Loading uploaded data into Hive
18/02/01 12:38:45 DEBUG hive.HiveImport: Using in-process Hive instance.
18/02/01 12:38:45 DEBUG util.SubprocessSecurityManager: Installing subprocess security manager
Logging initialized using configuration in jar:file:/path/demoapp/lib/demoapp-0.0.1-SNAPSHOT.jar!/hive-log4j.properties
OK
Time taken: 2.189 seconds
Loading data to table db_test.test
Failed with exception java.net.UnknownHostException: nn-dev
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Observations:
We recently migrated to a new cluster. The new namenode is nn-sit; earlier it was nn-dev. Can somebody enlighten me as to where Sqoop is reading the older namenode name nn-dev from, as shown in the error message:
Failed with exception java.net.UnknownHostException: nn-dev
The import from Oracle to the target HDFS path hdfs://nn-sit/user/test_tmp succeeds. However, it fails while loading the data into the Hive table.
The below individual command is successful from beeline.
LOAD DATA INPATH 'hdfs://nn-sit/user/test_tmp' OVERWRITE INTO TABLE db_test.test
Any help would be greatly appreciated.