Run Bash script on GCP Dataproc - apache-pig

I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.
Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this to Dataproc. Is there anything similar available for Dataproc?
Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?

As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:
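In the spirit of the original question, hello.sh might compute a dynamic argument in the shell and hand it to a Pig script; the my_job.pig name and the RUN_DATE parameter below are made up for illustration:
#!/bin/bash
# Compute a dynamic argument (here: yesterday's date) and pass it to a Pig script.
RUN_DATE=$(date -d "yesterday" +%F)
pig -p RUN_DATE="${RUN_DATE}" my_job.pig
Stage the script in GCS and submit the Pig job that runs it: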
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
-e "fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh"
The pig fs command uses Hadoop paths, so to copy your script from GCS you must copy to a destination specified with file:/// to make sure it lands on the local filesystem rather than in HDFS; the sh commands afterwards reference the local filesystem automatically, so you don't use file:/// there.
Alternatively, you can take advantage of the way --jars works to automatically stage a file into the temporary directory created just for your Pig job, rather than explicitly copying from GCS into a local directory; simply specify your shell script itself as a --jars argument:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
Or:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
In these cases, the script is only temporarily downloaded into a directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3 and which exists only for the lifetime of the Pig job.

There is no shell job type in Dataproc at the moment. As an alternative, you can use a Pig job with the sh command that forks your shell script, which can then (again) run your Pig job. (You can use PySpark similarly if you prefer Python.)
For example:
# cat a.sh
HELLO=hello
pig -e "sh echo $HELLO"
# pig -e "sh $PWD/a.sh"

Related

Using aws emr add-steps for spark-submit

We have a complicated spark-submit command that we would like to submit to AWS EMR using the aws emr add-steps CLI command. We are having trouble figuring out the correct syntax to use. For example, consider the example command from Apache's Running Spark on YARN page:
$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Following the guidance from this EMR command-runner page, we created something like this:
$ aws emr add-steps \
--cluster-id j-123456789 \
--steps Type=CUSTOM_JAR,Name='Test_Job',Jar='command-runner.jar',ActionOnFailure=CONTINUE,Args=[\
./bin/spark-submit,\
--class,my.main.Class,\
--master,yarn,\
--deploy-mode cluster,\
--jars,my-other-jar.jar,my-other-other-jar.jar,my-main-jar.jar,app_arg1,app_arg2]
However, due to the apparent parsing at commas, the command appears to associate --jars with only "my-other-jar.jar", while "my-other-other-jar.jar" is treated as a separate argument. I'm hoping somebody can tell us the proper syntax to use. For example, should we use --jars for each extra jar, like:
--jars,my-other-jar.jar,--jars,my-other-other-jar.jar
or maybe there is some special list syntax, e.g.,
--jars,[my-other-jar.jar,my-other-other-jar.jar]
or something else. Can anybody tell us, or point us to, the correct syntax to use for spark-submit arguments that might take a list, i.e., not just --jars, but also --conf, --files, ...?
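One way to sidestep the shorthand comma parsing entirely (a sketch, not from the original thread) is to supply the step definition as JSON, where Args is a list of strings, so a comma inside a single string such as the --jars value is preserved as-is; with command-runner.jar on EMR, spark-submit is resolved from the PATH:
# Write the step definition to a JSON file; each Args element is its own string.
cat > step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Test_Job",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--class", "my.main.Class",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "--jars", "my-other-jar.jar,my-other-other-jar.jar",
      "my-main-jar.jar",
      "app_arg1", "app_arg2"
    ]
  }
]
EOF
aws emr add-steps --cluster-id j-123456789 --steps file://step.json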

Shell script to copy data from remote server to Google Cloud Storage using Cron

I want to sync my server data to Google Cloud Storage automatically using a shell script. I don't know how to write the script. Every time I need to run:
gsutil -m rsync -d -r [Source] gs://[Bucket-name]
If anyone knows the answer please help me!
To automate the sync process, use a cron job:
Create a script to run with cron: $ nano backup.sh
Paste your gsutil command into the script: $ gsutil -m rsync -d -r [Source_PATH] gs://bucket-name
Make the script executable: $ chmod +x backup.sh
Based on your use case, put the shell script (backup.sh) in one of these folders: a) /etc/cron.daily b) /etc/cron.hourly c) /etc/cron.monthly d) /etc/cron.weekly
If you want to run the script at a specific time instead, go to the terminal and type: $ crontab -e
Then schedule the script with cron as often as you want, for example at midnight: 00 00 * * * /path/to/your/backup.sh
If you are using Windows on your local server, the commands are the same as above, but make sure to use Windows paths instead.
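Putting the steps together, a minimal backup.sh could look like this sketch (the source path and bucket name are placeholders, and the log redirect in the crontab line is an optional addition):
#!/bin/bash
# Mirror the local directory into the bucket: -m parallelizes the transfer,
# -r recurses into subdirectories, -d deletes objects that no longer exist locally.
gsutil -m rsync -d -r [Source_PATH] gs://bucket-name
And the matching crontab entry to run it every night at midnight:
00 00 * * * /path/to/your/backup.sh >> /var/log/gcs-backup.log 2>&1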

How to start pig with -t ColumnMapKeyPrune on aws emr

In my Pig script I want the file name with each record for some further processing, so I used the -tagFile option. After using -tagFile, the column names were getting misaligned, so I used the command below to get only the required columns, after referring to this blog: http://www.webopius.com/content/764/resolved-apache-pig-with-tagsource-tagfile-option-generates-incorrect-columns
pig -x mapreduce -t ColumnMapKeyPrune
Now I want to run the script on AWS EMR, but I am not sure how to enable this -t ColumnMapKeyPrune option in EMR Pig.
I am using the AWS CLI to create the cluster and submit jobs.
Any pointers on how to enable -t ColumnMapKeyPrune in EMR Pig?
I got the solution. I need to add the line below to the Pig script:
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';
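For context, a sketch of how that line could sit at the top of a script that loads with -tagFile (the input path and schema here are hypothetical):
-- Disable the column-pruning rule, the script-level equivalent of pig -t ColumnMapKeyPrune.
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';
-- With the -tagFile option, the source file name is prepended as the first field.
recs = LOAD 's3://my-bucket/input/' USING PigStorage(',', '-tagFile')
       AS (fname:chararray, f1:int, f2:int);
DUMP recs;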

Getting results of a pig script on a remote cluster

Is there a way to get the results of a pig script run on a remote cluster directly without STORE-ing them and retrieving them separately?
You can use Pig parameters to run your scripts. For example:
example.pig
A = LOAD '$PATH_TO_FOLDER_WITH_DATA' AS (f1:int, f2:int, f3:int);
-- do something with your data to produce a relation B
STORE B INTO '$OUTPUT_PATH';
Then you can run the script like:
pig -p "/path/to/local/file" -p "/path/to/the/output" example.pig
So to automate in BASH:
storelocal.sh
#!/bin/bash
pig -p '$PATH_TO_FILES' -p '$PATH_TO_HDFS_OUT' example.pig
hdfs dfs -getmerge '$PATH_TO_HDFS_OUT' '$PATH_TO_LOCAL'
And you can run it ./storelocal.sh /path/to/local/file /path/to/the/local/output

Using beeline to compile ddl objects from .hql file

We have a couple of .hql files for compiling DDLs.
In Hive we used the following command from bash:
hive -v -f abc.hql
But in Beeline this doesn't work from bash. Any idea what the equivalent command for Beeline would be?
Make sure your HiveServer2 is up and running on some port.
In Beeline:
beeline -u "jdbc:hive2://localhost:port/database_name/" -f abc.hql
Refer to this doc for more commands:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
Refer to this doc if you have not yet configured HiveServer2:
https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2