Run the same repetitive job in parallel in GitLab CI/CD - gitlab-ci

Is there a better way of running multiple jobs in parallel without repetition?
I need to run multiple Python scripts simultaneously with different variables, such as:
python main.py --table table_1
python main.py --table table_2
python main.py --table table_3
My .gitlab-ci.yml looks like this:
extract-table1:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table table_1

extract-table2:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table table_2

... and so on.
How do I avoid repetition like this? I don't want to loop over the tables in a single job because I need them to run in parallel. I have 10 tables to process, so this much repetition seems excessive.

You can use parallel:matrix; it looks like this:
extract-table:
  stage: run
  extends:
    - .execute-script
  script:
    - python main.py --table "$TABLE_NAME" --class "$CLASS_NAME"
  parallel:
    matrix:
      - TABLE_NAME: [table_1, table_2, table_3]
        CLASS_NAME: class_1
      - TABLE_NAME: [table_4, table_5, table_6]
        CLASS_NAME: class_2
      - TABLE_NAME: [table_7, table_8]
        CLASS_NAME: [class_3, class_4]
The job will run the following commands (see the note on matrix expansion after the list):
python main.py --table "table_1" --class "class_1"
python main.py --table "table_2" --class "class_1"
python main.py --table "table_3" --class "class_1"
python main.py --table "table_4" --class "class_2"
python main.py --table "table_5" --class "class_2"
python main.py --table "table_6" --class "class_2"
python main.py --table "table_7" --class "class_3"
python main.py --table "table_8" --class "class_3"
python main.py --table "table_7" --class "class_4"
python main.py --table "table_8" --class "class_4"
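Within a single matrix entry, GitLab expands the listed values to their Cartesian product, which is why the last entry (two tables times two classes) produces four jobs. Purely as an illustration of that expansion (a hedged sketch; this Python is not part of the pipeline), the last entry is equivalent to:
from itertools import product

# Reproduce the combinations GitLab generates from the last matrix entry,
# where both TABLE_NAME and CLASS_NAME list several values.
entry = {"TABLE_NAME": ["table_7", "table_8"], "CLASS_NAME": ["class_3", "class_4"]}
for table, cls in product(entry["TABLE_NAME"], entry["CLASS_NAME"]):
    print(f"python main.py --table {table} --class {cls}")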

Related

Issue querying a Hive table in Datalab

I have created a Dataproc cluster with an updated init action to install Datalab.
Everything works fine, except that when I query a Hive table from the Datalab notebook with
hc.sql("""select * from invoices limit 10""")
I run into a "java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found" exception.
Create cluster
gcloud beta dataproc clusters create ds-cluster \
--project my-exercise-project \
--region us-west1 \
--zone us-west1-b \
--bucket dataproc-datalab \
--scopes cloud-platform \
--num-workers 2 \
--enable-component-gateway \
--initialization-actions gs://dataproc_mybucket/datalab-updated.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'CONDA_PACKAGES="python==3.5"' \
--metadata gcs-connector-version=1.9.11
datalab-updated.sh (excerpt; the full function is shown in UPDATE 2 below):
-v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
In the Datalab notebook:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show tables in default""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""")
UPDATE
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "~/Downloads/my-exercise-project-f47054fc6fd8.json")
UPDATE 2 ( datalab-updated.sh )
function run_datalab(){
  if docker run -d --restart always --net=host \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    mkdir -p ${HOME}/datalab
    gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
You should use the Datalab initialization action to install Datalab on the Dataproc cluster:
gcloud dataproc clusters create ${CLUSTER} \
--image-version=1.3 \
--scopes cloud-platform \
--initialization-actions=gs://dataproc-initialization-actions/datalab/datalab.sh
After this, Hive works with GCS out of the box in Datalab:
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""SHOW TABLES IN default""").show()
Output:
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Creating an external table on GCS using Hive in Datalab:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://<BUCKET>/datasets/invoices'""")
Output:
DataFrame[]
Querying the GCS table using Hive in Datalab:
hc.sql("""SELECT * FROM invoices LIMIT 10""")
Output:
DataFrame[SubmissionDate: date, TransactionAmount: double, TransactionType: string]
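Note that hc.sql() returns a DataFrame lazily, which is why only the schema is printed above. A minimal sketch (assuming the same hc HiveContext as in the previous snippets) to actually fetch and display the rows:
# hc.sql() is lazy: call an action such as show() to execute the query and print rows.
df = hc.sql("SELECT * FROM invoices LIMIT 10")
df.show(10, truncate=False)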
If you want to use Hive in Datalab, you have to enable the Hive metastore when creating the cluster:
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore"
In your case the table definition will be:
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://$PROJECT-warehouse/datasets/invoices'""")
And make sure to add the following settings to enable GCS access:
sc._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# These two settings are required if you are authenticating with a service account key file
sc._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
sc._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# The following are required instead if you are using OAuth
# (in that case, set fs.gs.auth.service.account.enable to 'false')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
sc._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
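If the original ClassNotFoundException still appears after these settings, the GCS connector jar is most likely missing from the cluster's classpath, and no Hadoop property will fix that. A hedged sanity check from the notebook (assumes a running SparkContext named sc, as in the snippets above; _jvm is a private PySpark attribute used here only for debugging):
# Ask the JVM for the GCS connector class directly; if the connector jar is
# missing, this raises the same ClassNotFoundException as the failing query.
sc._jvm.java.lang.Class.forName('com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')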

Hive table outdated after Sqoop incremental import

I'm trying to do a Sqoop incremental import to a Hive table using "--incremental append".
I did an initial Sqoop import and then created a job for the incremental imports.
Both executed successfully and new files were added to the original Hive table directory in HDFS, but when I check my Hive table, the newly imported records are not there; the table looks exactly the same as before the incremental import.
How can I solve this?
I have about 45 Hive tables and would like to update them automatically every day after the Sqoop incremental import.
First Sqoop Import:
sqoop import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--class-name db2fcs_cust_atu \
--query "SELECT * FROM db2fcs.cust_atu WHERE \$CONDITIONS" \
--split-by PTC_NR \
--fetch-size 10000 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-import \
--hive-table fcs.cust_atu \
-m 64;
Then I run Sqoop incremental import:
sqoop job \
--create cli_atu \
-- import \
--connect jdbc:db2://... \
--username root \
--password 9999999 \
--table db2fcs.cust_atu \
--target-dir /apps/hive/warehouse/fcs.db/db2fcs_cust_atu \
--hive-table fcs.cust_atu \
--split-by PTC_NR \
--incremental append \
--check-column TS_CUST \
--last-value '2018-09-09'
It might be difficult to understand/answer your question without looking at your full query, because the outcome also depends on your choice of arguments and directories. Would you mind sharing your full query?

Sqoop all tables from MySQL to Hive import

I am trying to import all tables from a MySQL schema into Hive using the Sqoop command below:
sqoop import-all-tables --connect jdbc:mysql://ip-172-31-20-247:3306/retail_db --username sqoopuser -P --hive-import --create-hive-table -m 3
It fails with:
18/09/01 09:24:52 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
hdfs://ip-172-31-35-141.ec2.internal:8020/user/kumarrupesh2389619/categories already exists
Your command is failing since the output directory already exists. Remove it with the command below, then rerun the import:
hdfs dfs -rmr /user/kumarrupesh2389619/categories

Sqoop import from oracle to hdfs: No more data to read from socket

I'm trying to import data from Oracle to HDFS using Sqoop. Oracle version: 10.2.0.2
The table has no constraints. When I specify the number of mappers (-m) and the --split-by parameter, it fails with the error "No more data to read from socket". If I set -m 1 (a single mapper), it runs, but takes too much time.
Sqoop command:
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price -m 4 --split-by MNTH_YR
Please help me.
Instead of specifying the number of mappers, why don't you try using --direct?
What does it show then?
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price --direct
or
sqoop import --connect jdbc:oracle:thin:@host:port:SID --username uname --password pwd --table abc.market_price --target-dir /ert/etldev/etl/market_price --split-by MNTH_YR --direct

Sqoop command - How to give a schema name for Import All Tables

I am importing all tables from an RDBMS into Hive using the Sqoop command (v1.4.6).
Below is the command:
sqoop-import-all-tables --verbose --connect jdbcconnection --username user --password pass --hive-import -m 1
This command works fine and loads all the tables into the default schema. Is there a way to load the tables into a particular schema?
Use --hive-database <db name> in your import command.
Modified command:
sqoop-import-all-tables --verbose --connect jdbcconnection --username user --password pass --hive-import --hive-database new_db -m 1