Is there a better way of running multiple jobs in parallel without repetition?
I need to run multiple Python scripts simultaneously with different variables, such as:
python main.py --table table_1
python main.py --table table_2
python main.py --table table_3
My .gitlab-ci.yml looks like this:
extract-table1:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table table_1
extract-table2:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table table_2
...
and so on.
How do I avoid repetition like this? I don't want to loop over the tables inside a single job, because I need them to run in parallel. I have 10 tables to do, so the repetition adds up quickly.
You can use parallel:matrix. It looks like this:
extract-table:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table "$TABLE_NAME" --class "$CLASS_NAME"
  parallel:
    matrix:
      - TABLE_NAME: [table_1, table_2, table_3]
        CLASS_NAME: class_1
      - TABLE_NAME: [table_4, table_5, table_6]
        CLASS_NAME: class_2
      - TABLE_NAME: [table_7, table_8]
        CLASS_NAME: [class_3, class_4]
The job will run:
python main.py --table "table_1" --class "class_1"
python main.py --table "table_2" --class "class_1"
python main.py --table "table_3" --class "class_1"
python main.py --table "table_4" --class "class_2"
python main.py --table "table_5" --class "class_2"
python main.py --table "table_6" --class "class_2"
python main.py --table "table_7" --class "class_3"
python main.py --table "table_8" --class "class_3"
python main.py --table "table_7" --class "class_4"
python main.py --table "table_8" --class "class_4"
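For reference, here is a minimal sketch of how main.py could consume those flags; the argument names match the CI job above, but the extract() body is only a placeholder, not your actual logic:

import argparse

def extract(table, cls):
    # Placeholder for the real extraction logic.
    print(f"Extracting {table} for {cls}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--table", required=True)
    # "class" is a reserved word in Python, so map it to a different attribute name.
    parser.add_argument("--class", dest="cls", required=True)
    args = parser.parse_args()
    extract(args.table, args.cls)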
I have created a Dataproc cluster with an updated init action to install Datalab.
Everything works fine, except that when I query a Hive table from the Datalab notebook:
hc.sql("""select * from invoices limit 10""")
I run into a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.
Create cluster
gcloud beta dataproc clusters create ds-cluster \
--project my-exercise-project \
--region us-west1 \
--zone us-west1-b \
--bucket dataproc-datalab \
--scopes cloud-platform \
--num-workers 2 \
--enable-component-gateway \
--initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'CONDA_PACKAGES="python==3.5"' \
--metadata gcs-connector-version=1.9.11
datalab-updated.sh (excerpt; the full function is in UPDATE 2 below)
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
In the Datalab notebook:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show tables in default""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""")
UPDATE
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "~/Downloads/my-exercise-project-f47054fc6fd8.json")
UPDATE 2 (datalab-updated.sh)
function run_datalab(){
  if docker run -d --restart always --net=host \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    mkdir -p ${HOME}/datalab
    gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
You should use the Datalab initialization action to install Datalab on the Dataproc cluster:
gcloud dataproc clusters create ${CLUSTER} \
--image-version=1.3 \
--scopes cloud-platform \
--initialization-actions=gs://dataproc-initialization-actions/datalab/datalab.sh
After this, Hive works with GCS out of the box in Datalab:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""SHOW TABLES IN default""").show()
Output:
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Creating external table on GCS using Hive in Datalab:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://<BUCKET>/datasets/invoices'""")
Output:
DataFrame[]
Querying GCS table using Hive in Datalab:
hc.sql("""SELECT * FROM invoices LIMIT 10""")
Output:
DataFrame[SubmissionDate: date, TransactionAmount: double, TransactionType: string]
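Note that hc.sql() returns a lazy DataFrame, which is why only the schema is printed above; call .show() (or .collect()) to actually pull the rows:

hc.sql("""SELECT * FROM invoices LIMIT 10""").show()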
If you want to use Hive in Datalab, you have to enable the Hive metastore by adding these flags when creating the cluster:
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"
In your case it will be:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://$PROJECT-warehouse/datasets/invoices'""")
And make sure to add the following settings to enable GCS access:
sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# Required if you are using a service account; set it to 'true' in that case.
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required if you are using OAuth.
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
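As a quick sanity check after setting these properties, you can try reading straight from GCS; the bucket path below is just a placeholder, substitute your own:

# <BUCKET> is a placeholder bucket name.
hc.read.parquet("gs://<BUCKET>/datasets/invoices").printSchema()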
I'm trying to do a Sqoop incremental import to a Hive table using "--incremental append".
I did an initial Sqoop import and then created a job for the incremental imports.
Both execute successfully and new files are added to the original Hive table directory in HDFS, but when I check my Hive table, the imported rows are not there; the table looks the same as it did before the Sqoop incremental import.
How can I solve that?
I have about 45 Hive tables and would like to update them daily automatically after the Sqoop incremental import.
First Sqoop Import:
sqoop import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--class-name db2fcs_cust_atu \
--query "SELECT * FROM db2fcs.cust_atu WHERE \$CONDITIONS" \
--split-by PTC_NR \
--fetch-size 10000 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-import \
--hive-table fcs.cust_atu \
-m 64;
Then I run Sqoop incremental import:
sqoop job \
--create cli_atu \
-- import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--table db2fcs.cust_atu \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-table fcs.cust_atu \
--split-by PTC_NR \
--incremental append \
--check-column TS_CUST \
--last-value '2018-09-09'
It might be difficult to understand/answer your question without looking at your full query, because the outcome also depends on your choice of arguments and directories. Would you mind sharing your full query?
I am trying to import all tables from a MySQL schema into Hive using the below Sqoop query:
sqoop import-all-tables --connect jdbc:mysql://ip-172-31-20-247:3306/retail_db --username sqoopuser -P --hive-import --create-hive-table -m 3
It fails with:
18/09/01 09:24:52 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
hdfs://ip-172-31-35-141.ec2.internal:8020/user/kumarrupesh2389619/categories already exists
Your command is failing because the output directory already exists. Remove it with the command below and re-run the import:
hdfs dfs -rmr /user/kumarrupesh2389619/categories
I'm trying to import data from Oracle to HDFS using Sqoop. Oracle version: 10.2.0.2
The table has no constraints. When I specify the number of mappers (-m) and the --split-by parameter, it shows the error "No more data to read from socket". If I set -m 1 (a single mapper), it runs, but takes too much time.
Sqoop command:
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price -m 4 --split-by MNTH_YR
Please help me.
Instead of specifying the number of mappers, why don't you try using --direct? What does it show then?
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price --direct
or
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price --split-by MNTH_YR --direct
I am importing all tables from an RDBMS into Hive using the Sqoop command (v1.4.6).
Below is the command
sqoop-import-all-tables --verbose --connect jdbcconnection --username user --password pass --hive-import -m 1
This command works fine and loads all the tables into the default schema. Is there a way to load the tables into a particular schema?
Use --hive-database <db name> in your import query.
Modified command:
sqoop-import-all-tables --verbose --connect jdbcconnection --username user --password pass --hive-import --hive-database new_db -m 1