Using aws emr add-steps for spark-submit - hadoop-yarn

We have a complicated spark-submit command that we would like to submit to AWS EMR using the aws emr add-steps CLI command. We are having trouble figuring out the correct syntax to use. For example, consider the example command from Apache's Running Spark on YARN page:
$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Following the guidance from this EMR command-runner page, we created something like this:
$ aws emr add-steps \
--cluster-id j-123456789 \
--steps Type=CUSTOM_JAR,Name='Test_Job',Jar='command-runner.jar',ActionOnFailure=CONTINUE,Args=[\
./bin/spark-submit,\
--class,my.main.Class,\
--master,yarn,\
--deploy-mode,cluster,\
--jars,my-other-jar.jar,my-other-other-jar.jar,my-main-jar.jar,app_arg1,app_arg2]
However, because the arguments are apparently split at commas, --jars seems to be associated only with "my-other-jar.jar", while "my-other-other-jar.jar" is not. I'm hoping somebody can tell us the proper syntax to use. For example, should we use --jars for each extra jar, like:
--jars,my-other-jar.jar,--jars,my-other-other-jar.jar
or maybe there is some special list syntax, e.g.,
--jars,[my-other-jar.jar,my-other-other-jar.jar]
or something else. Can anybody tell us, or point us to, the correct syntax to use for spark-submit arguments that might take a list, i.e., not just --jars, but also --conf, --files, ...?
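One way to sidestep the comma splitting is to describe the step in a JSON file and pass it with the file:// prefix, so that each argument is its own array element and a comma-separated list stays in one piece. The following is a minimal sketch, not a confirmed answer from this thread: the file name step.json is an assumption, and plain spark-submit is used on the assumption that command-runner.jar resolves it on the cluster. The same pattern should apply to --files (one comma-separated element) and to --conf (repeated once per key=value pair).

```shell
# Sketch: write the step definition as JSON so every argument is one array element.
# step.json is a hypothetical file name; j-123456789 is the placeholder ID from the question.
cat > step.json <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Test_Job",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "spark-submit",
      "--class", "my.main.Class",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      "--jars", "my-other-jar.jar,my-other-other-jar.jar",
      "my-main-jar.jar",
      "app_arg1", "app_arg2"
    ]
  }
]
EOF

# Print the submission command rather than running it, so the sketch has no side effects.
echo "aws emr add-steps --cluster-id j-123456789 --steps file://./step.json"
```

Because the JSON array already separates the arguments, the comma inside the --jars value is no longer treated as a delimiter.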

Related

AWS CLI for S3 Select

I have the following code, which runs a SQL query on a keyfile located in an S3 bucket. This runs perfectly. However, I do not want the output written to an output file. Could I see the output on the screen instead (my preference #1)? If not, is it possible to append to the output file rather than overwrite it (my preference #2)? I am using the AWS CLI binaries to run this query. If there is another way, I am happy to try it (as long as it is within bash).
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' "OutputFile"
Yes, you can do this with the AWS CLI, since stdout is just a special file on Linux:
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
Note the /dev/stdout at the end.
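For the second preference (appending rather than overwriting), the /dev/stdout trick can be combined with a pipe. This is a sketch under assumptions: select_rows is a hypothetical wrapper name and results.txt is a hypothetical output file. A pipe into tee -a is used instead of a direct >> redirection because, on Linux, a program that reopens /dev/stdout for writing may truncate the file that stdout was redirected to.

```shell
# Hypothetical wrapper around the command from the question; it writes rows to stdout.
select_rows() {
  aws s3api select-object-content \
    --bucket "project2" \
    --key keyfile1 \
    --expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
    --output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
}

# Usage (results.txt is an assumed file name):
#   select_rows | tee -a results.txt      # show on screen and append to the file
#   select_rows | cat >> results.txt      # append only
```

With tee -a, each run adds its rows to the end of the file while still printing them to the terminal.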
The AWS CLI does not offer such options.
However, you are welcome to instead call it via an AWS SDK of your choice.
For example, in the boto3 Python SDK, there is a select_object_content() function that returns the data as a stream. You can then read, manipulate, print or save it however you wish.
I think it opens /dev/stdout twice, causing chaos.

Run Bash script on GCP Dataproc

I want to run a shell script on Dataproc that will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script.
Currently these scripts run on AWS with the help of script-runner.jar. I am not sure how to move this to Dataproc. Is there anything similar available for Dataproc?
Or will I have to change all my scripts and calculate the arguments in Pig with the help of pig sh or pig fs?
As Aniket mentions, pig sh would itself be considered the script-runner for Dataproc jobs; instead of having to turn your wrapper script into a Pig script in itself, just use Pig to bootstrap any bash script you want to run. For example, suppose you have an arbitrary bash script hello.sh:
gsutil cp hello.sh gs://${BUCKET}/hello.sh
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
-e "fs -cp -f gs://${BUCKET}/hello.sh file:///tmp/hello.sh; sh chmod 750 /tmp/hello.sh; sh /tmp/hello.sh"
The pig fs command uses Hadoop paths, so to copy your script from GCS you must specify the destination as file:/// to make sure it lands on the local filesystem instead of HDFS. The sh commands that follow reference the local filesystem automatically, so you don't use file:/// there.
Alternatively, you can take advantage of the way --jars works to automatically stage a file into the temporary directory created just for your Pig job rather than explicitly copying from GCS into a local directory; you simply specify your shell script itself as a --jars argument:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
Or:
gcloud dataproc jobs submit pig --cluster ${CLUSTER} \
--jars gs://${BUCKET}/hello.sh \
-e 'sh chmod 750 ${PWD}/hello.sh; sh ${PWD}/hello.sh'
In these cases, the script would only temporarily be downloaded into a directory that looks like /tmp/59bc732cd0b542b5b9dcc63f112aeca3 and which only exists for the lifetime of the pig job.
There is no shell job type in Dataproc at the moment. As an alternative, you can use a Pig job with the sh command to fork your shell script, which can then (again) run your Pig job. (You can use PySpark similarly if you prefer Python.)
For example:
# cat a.sh
HELLO=hello
pig -e "sh echo $HELLO"
# pig -e "sh $PWD/a.sh"

Google DataProc Hive and Presto query doesn't work

I have a Google DataProc cluster with Presto installed as an optional component. I created an external table in Hive, and its size is ~1GB. While the table is queryable (for example, group by and distinct statements succeed), I have problems performing a simple select * from tableA with Hive and Presto:
For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
Though I've updated the configurations in mapred-site.xml as:
mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
For Presto, statements such as group by and distinct similarly work. However, select * from tableA hangs every time at about RUNNING 60% until it times out. I get the same issue regardless of whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table can have such an issue. Any help is appreciated, thank you!
The Presto CLI binary /usr/bin/presto specifies a jvm -Xmx argument inline (it uses some tricks to bootstrap itself as a java binary); unfortunately, that -Xmx is not normally fetched from /opt/presto-server/etc/jvm.config like the settings for the actual presto-server.
In your case, if you're selecting everything from a 1G parquet table, you're probably actually dealing with something like 6G uncompressed text, and you're trying to stream all of that to the console output. This is likely also not going to work with the Dataproc job-submission because the streamed output is designed to print out human-readable amounts of data, and will slow down considerably if dealing with non-human amounts of data.
If you still want to try doing that with the CLI, you can modify the JVM settings for the CLI on the master before starting it back up:
sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto
You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6G of text streaming through your screen.
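Redirecting the CLI output to a file might look like the sketch below. It assumes the Presto CLI on the master is invoked as presto and uses its --execute and --output-format options; dump_query is a hypothetical helper name and tableA.csv a hypothetical file name, while the catalog and schema mirror the values used elsewhere in this thread.

```shell
# Hypothetical helper: run one query non-interactively and write the rows to a file.
# $1 is the SQL text, $2 is the destination file on the local disk of the master.
dump_query() {
  presto --catalog hive --schema default \
    --output-format CSV \
    --execute "$1" > "$2"
}

# Usage on the master node (tableA.csv is an assumed file name):
#   dump_query "SELECT * FROM tableA" tableA.csv
```

Running the query with --execute avoids the interactive pager, so the result goes straight to the file instead of the terminal.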
I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server which OOMed. To avoid that, you can query data from the cluster directly without going through the server.
Try following the "Presto CLI queries" section of the Dataproc Presto tutorial and run these commands from your local machine (the SSH tunnel in one terminal, the Presto CLI in another):
gcloud compute ssh <master-node> \
--project=${PROJECT} \
--zone=${ZONE} \
-- -D 1080 -N
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default

How to start pig with -t ColumnMapKeyPrune on aws emr

In my Pig script I want the file name attached to each record for further processing, so I used the -tagFile option. After adding -tagFile, the column names became misaligned, so I used the command below to get only the required columns, following this blog post: http://www.webopius.com/content/764/resolved-apache-pig-with-tagsource-tagfile-option-generates-incorrect-columns
pig -x mapreduce -t ColumnMapKeyPrune
Now I want to run the script on AWS EMR, but I am not sure how to enable the -t ColumnMapKeyPrune option on EMR Pig.
I am using the AWS CLI to create the cluster and submit jobs.
Any pointers on how to enable -t ColumnMapKeyPrune on EMR Pig?
I found the solution. I needed to add the line below to the Pig script:
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';
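If the script is submitted through the AWS CLI, the set statement can be added to a copy of the script before it is uploaded. The helper below is a minimal sketch; disable_column_prune and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: write a copy of a Pig script with the optimizer rule disabled.
# $1 is the original script, $2 is the copy to create.
disable_column_prune() {
  {
    printf '%s\n' "set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';"
    cat "$1"
  } > "$2"
}

# Usage (file names are assumptions):
#   disable_column_prune my_script.pig my_script_emr.pig
```

The resulting file could then be copied to S3 and referenced from a Pig step, assuming the usual Type=PIG step with Args=[-f,s3://...] pointing at the uploaded copy.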

AWS EMR --steps

I am running the following .sh to run a command on AWS using EMR:
aws emr create-cluster --name "Big Matrix Re Run 5" --ami-version 3.1.0 --auto-terminate --log-uri FILE LOCATION --enable-debugging --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.xlarge InstanceGroupType=CORE,InstanceCount=3,InstanceType=c3.xlarge --steps NAME AND LOCATION OF FILE
I've deleted the pertinent file name and locations as those aren't my issue, but I am having an issue with the --steps portion of the script.
How do I specify the steps that I want to run in the cluster? The documentation doesn't give any examples.
Here is the error:
Error parsing parameter '--steps': should be: Key value pairs, where values are separated by commas, and multiple pairs are separated by spaces.
--steps Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2 Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2
Thanks!
The documentation page for the AWS Command-Line Interface create-cluster command shows examples for using the --steps parameter.
Steps can be supplied inline on the command line or loaded from a local JSON file using the file:// prefix. In either case, the step arguments can refer to scripts stored in Amazon S3.
From a local JSON file:
aws emr create-cluster --steps file://./multiplefiles.json --ami-version 3.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
Inline, referencing a script in Amazon S3:
aws emr create-cluster --steps Type=HIVE,Name='Hive program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] --applications Name=Hive --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
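For comparison, the Hive step shown inline above could also be written as a JSON file and passed with file://. This is only a sketch of what such a file might contain: hive-steps.json is a hypothetical name and is not the multiplefiles.json mentioned in this answer, whose contents are not shown here.

```shell
# Sketch: the inline Hive step expressed as a JSON steps file (hypothetical file name).
cat > hive-steps.json <<'EOF'
[
  {
    "Type": "HIVE",
    "Name": "Hive program",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-f", "s3://elasticmapreduce/samples/hive-ads/libs/model-build.q",
      "-d", "INPUT=s3://elasticmapreduce/samples/hive-ads/tables",
      "-d", "OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32",
      "-d", "LIBS=s3://elasticmapreduce/samples/hive-ads/libs"
    ]
  }
]
EOF

# Print the option to use with create-cluster rather than launching anything.
echo "--steps file://./hive-steps.json"
```

Additional steps would be further objects in the same top-level array.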