How to get the EMR cluster version from a running cluster

I have several accounts and they run different versions of EMR. I need to run a query to figure out which release (version) each cluster is running. I see the list-release-labels command, but it is not very intuitive. It looks as if I have to use list-clusters --active and maybe list-release-labels.
I'd appreciate any pointers.
Thanks
This just gives me the list of active clusters; I still need to find out the release/version:
aws emr list-clusters --active --query "Clusters[*].{ClusterName:Name}" --output text

Unfortunately, there is no EMR API that lists clusters and includes the release label in the response, so you will have to list your clusters first (using aws emr list-clusters) and then look up the release label used by each cluster (using aws emr describe-cluster). The list-release-labels command is unrelated: it lists the release labels that are available when creating new clusters.
Here is some example shell script code that could be used to look up the release label for each of your active clusters:
for cluster in $(aws emr list-clusters --active --query 'Clusters[*].Id' --output text); do
  # Print the cluster ID and its release label on one line, tab-separated
  printf '%s\t' "$cluster"
  aws emr describe-cluster --cluster-id "$cluster" --query 'Cluster.ReleaseLabel' --output text
done

Related

How do I enable LLAP for Hive 2.1.0 in Dataproc cluster?

I'm trying to set up LLAP (interactive query) for Hive 2.1.0 which comes along with the Google Cloud Dataproc. I have already enabled Tez as the execution engine, but I'm not able to find any documentation/steps for enabling LLAP for making Hive even faster. Most of the available ones are for Hortonworks cluster, which is done through Ambari.
I think you can follow Hive Configuration Properties - LLAP and add the relevant properties when creating the cluster:
--properties 'hive:hive.llap.execution.mode=<mode>,hive:hive.server2.llap.concurrent.queries=<n>,...'
Note that the "hive:" prefix is necessary for Dataproc to plumb the properties through to Hive.
According to the Using Apache Hive on Cloud Dataproc document (Cloud SQL I/O and Hive Metastore section), you can create the cluster like this:
gcloud dataproc clusters create hive-cluster \
--scopes sql-admin \
--image-version 1.3 \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--properties 'hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets,hive:hive.llap.execution.mode=<mode>,hive:hive.server2.llap.concurrent.queries=<n>' \
--metadata "hive-metastore-instance=<PROJECT_ID>:<REGION>:<INSTANCE_NAME>"
If you need to set any Hive configuration (hive-site.xml), just add it with a hive: prefix in --properties.
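As a minimal sketch of that (the cluster name and property values below are placeholders I chose, not from the original question), a creation command that pushes LLAP-related Hive settings through hive-site.xml might look like:
# Hypothetical cluster name and property values; the "hive:" prefix routes each
# property into hive-site.xml on the Dataproc cluster.
gcloud dataproc clusters create my-llap-cluster \
  --image-version 1.3 \
  --properties 'hive:hive.execution.engine=tez,hive:hive.llap.execution.mode=all,hive:hive.server2.llap.concurrent.queries=2'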

aws emr with yarn scheduler

I am creating an AWS EMR cluster using a CloudFormation template. I need to run the steps in parallel, so I am trying to change the YARN scheduler from FIFO to the fair/capacity scheduler.
I have added:
yarn.resourcemanager.scheduler.class : 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Do I need to add a FairScheduler.xml file in the conf.empty folder? If so, can you please share the XML file?
And if I want to add fairscheduler.xml through the CloudFormation template, do I need to use a bootstrap action for it? If so, could you provide the bootstrap file?
It looks like, even after changing the scheduler, EMR won't allow jobs to run concurrently.
You can configure your cluster by specifying the configuration in your CloudFormation template.
Here is an example configuration:
- Classification: fair-scheduler
  ConfigurationProperties:
    <key1>: <value1>
    <key2>: <value2>
- Classification: yarn-site
  ConfigurationProperties:
    yarn.acl.enable: true
    yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Please follow these:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-configuration.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
EMR recently added the ability to run multiple steps in parallel:
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/
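If you are on a release that supports parallel steps (emr-5.28.0 or later, per the announcement above), the concurrency limit itself is a cluster-level setting: CloudFormation exposes it as the StepConcurrencyLevel property of AWS::EMR::Cluster, and the CLI can change it on a running cluster. A sketch with a placeholder cluster ID:
# Placeholder cluster ID; allows up to 5 steps to run at the same time
# (the default concurrency level is 1, i.e. sequential steps).
aws emr modify-cluster \
  --cluster-id j-XXXXXXXXXXXXX \
  --step-concurrency-level 5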

How do I download one or more files from a stopped Fargate task?

I have an ECS task that runs some test cases. I have it running in Fargate. Yay!
Now I want to download the test results file(s) from the container. I have the task and container IDs handy. I can find the exit code with
aws ecs describe-tasks --cluster Fargate --tasks <my-task-id>
How do I download the log and/or files produced?
It looks like, as of right now, the only way to get test results off of my server is to send the results to S3 before the container shuts down.
From this thread, there's no way to mount a volume / EFS onto a Fargate container.
Here's my bash script for running my tests (in build.sh) and then uploading the results to S3:
#!/bin/bash
echo "Running tests..."
pushd ~circleci/project/
export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_KEY
commandToRun="~/project/.circleci/build_scripts/build.sh"
# Run the command, teeing its output to a log file
eval $commandToRun 2>&1 | tee /tmp/build-$FEATURE.log
# Get the exit code of the command itself, not of tee
exitCode=${PIPESTATUS[0]}
# Upload the log to S3 before the container shuts down
aws s3 cp /tmp/build-$FEATURE.log s3://$CICD_BUCKET/build.log \
  --storage-class REDUCED_REDUNDANCY \
  --region us-east-1
exit ${exitCode}
Of course, you'll have to pass in the AWS_ACCESS_KEY, AWS_SECRET_KEY, and CICD_BUCKET environment variables (plus FEATURE, which is used to name the log file). The bucket you choose needs to be created in advance, but any directory structure below it does NOT need to be created ahead of time.
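For example, the bucket can be created up front with the CLI (the bucket name here is just a placeholder):
# Create the bucket ahead of time; "directories" under it are created implicitly on upload
aws s3 mb s3://my-cicd-bucket --region us-east-1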
You probably want to look at using CodeBuild for this use case, which can automatically copy artifacts to S3.
It's actually quite easy to orchestrate the following using a simple bash script and the AWS CLI:
Idempotently create/update a CodeBuild project (using a simple CloudFormation template you can define in your source repository)
Run a CodeBuild job that executes a given revision of your source repository (again using a buildspec.yml specification defined in your source repository)
Attach to the CloudWatch Logs log group for your CodeBuild job and stream the log output
Finally, detect whether the job completed successfully and then download any artifacts locally from S3
I use this approach to run builds in CodeBuild, with Bamboo as the overarching continuous delivery system.
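A rough sketch of that flow using only bash and the AWS CLI (the project name, artifact bucket, and artifact prefix are placeholders; the polling loop is simplified and error handling is omitted):
#!/bin/bash
# Hypothetical CodeBuild project and artifact bucket
PROJECT=my-test-project
ARTIFACT_BUCKET=my-artifact-bucket

# Kick off a build and capture its ID
BUILD_ID=$(aws codebuild start-build --project-name "$PROJECT" \
  --query 'build.id' --output text)

# Poll until the build leaves the IN_PROGRESS state
while true; do
  STATUS=$(aws codebuild batch-get-builds --ids "$BUILD_ID" \
    --query 'builds[0].buildStatus' --output text)
  [ "$STATUS" != "IN_PROGRESS" ] && break
  sleep 15
done
echo "Build finished with status: $STATUS"

# Download whatever artifacts the buildspec uploaded to S3
aws s3 cp "s3://$ARTIFACT_BUCKET/$PROJECT/" ./artifacts/ --recursive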

Flink on EMR cannot access S3 bucket from "flink run" command

I'm prototyping the use of AWS EMR for a Flink-based system that we're planning to deploy. My cluster has the following versions:
Release label: emr-5.10.0
Hadoop distribution: Amazon 2.7.3
Applications: Flink 1.3.2
The documentation provided by Amazon (Amazon Flink documentation) and the documentation from Flink (Apache Flink documentation) both mention using S3 resources directly as an integrated file system with the s3://<bucket>/<file> pattern. I have verified that all the correct permissions are set; I can use the AWS CLI to copy S3 resources to the master node with no problem, but attempting to start a Flink job using a JAR from S3 does not work.
I am executing the following step:
JAR location : command-runner.jar
Main class : None
Arguments : flink run -m yarn-cluster -yid application_1513333002475_0001 s3://mybucket/myapp.jar
Action on failure: Continue
The step always fails with
JAR file does not exist: s3://mybucket/myapp.jar
I have spoken to AWS support, and they suggested having a previous step copy the S3 file to the local Master node and then referencing it with a local path. While this would obviously work, I would rather get the native S3 integration working.
I have also tried using the s3a filesystem and get the same result.
You need to download your JAR from S3 so that it is available locally to the Flink client:
aws s3 cp s3://mybucket/myapp.jar myapp.jar
and then run flink run -m yarn-cluster myapp.jar against the local copy.
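If you would rather keep it as a single EMR step, one option (a sketch only, assuming a command-runner.jar step invoking bash -c works in your environment) is to copy and submit in one command, reusing the bucket and application ID from the question:
# Single command-runner.jar step: copy the JAR locally, then submit it to the running YARN session
bash -c "aws s3 cp s3://mybucket/myapp.jar /home/hadoop/myapp.jar && flink run -m yarn-cluster -yid application_1513333002475_0001 /home/hadoop/myapp.jar"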

External checkpoints to S3 on EMR

I am trying to deploy a production cluster for my Flink program. I am using a standard hadoop-core EMR cluster with Flink 1.3.2 installed, using YARN to run it.
I am trying to configure RocksDB to write my checkpoints to an S3 bucket. I am working through these docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/aws.html#set-s3-filesystem. The problem seems to be getting the dependencies working correctly. I receive this error when trying to run the program:
java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:93)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.initialize(HadoopFileSystem.java:328)
at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:350)
at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
at org.apache.flink.core.fs.Path.getFileSystem(Path.java:293)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory.<init>(FsCheckpointStreamFactory.java:99)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createStreamFactory(FsStateBackend.java:282)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createStreamFactory(RocksDBStateBackend.java:273)
I have tried both adjusting the core-site.xml and leaving it as is. I have tried setting HADOOP_CLASSPATH to /usr/lib/hadoop/share, which contains (what I assume are) most of the JARs described in the above guide. I also tried downloading the Hadoop 2.7.2 binaries and copying them into the flink/lib directory. All of these resulted in the same error.
Has anyone successfully gotten Flink being able to write to S3 on EMR?
EDIT: My cluster setup
AWS Portal:
1) EMR -> Create Cluster
2) Advanced Options
3) Release = emr-5.8.0
4) Only select Hadoop 2.7.3
5) Next -> Next -> Next -> Create Cluster ( I do fill out names/keys/etc)
Once the cluster is up I ssh into the Master and do the following:
1 wget http://apache.claz.org/flink/flink-1.3.2/flink-1.3.2-bin-hadoop27-scala_2.11.tgz
2 tar -xzf flink-1.3.2-bin-hadoop27-scala_2.11.tgz
3 cd flink-1.3.2
4 ./bin/yarn-session.sh -n 2 -tm 5120 -s 4 -d
5 Change conf/flink-conf.yaml
6 ./bin/flink run -m yarn-cluster -yn 1 ~/flink-consumer.jar
In my conf/flink-conf.yaml I added the following fields:
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://bucket/location
state.checkpoints.dir: s3://bucket/location
My program's checkpointing setup:
env.enableCheckpointing(getCheckpointRate,CheckpointingMode.EXACTLY_ONCE)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(getCheckpointMinPause)
env.getCheckpointConfig.setCheckpointTimeout(getCheckpointTimeout)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
env.setStateBackend(new RocksDBStateBackend("s3://bucket/location", true))
If there are any steps you think I am missing, please let me know
I assume that you installed Flink 1.3.2 yourself on the EMR YARN cluster, because Amazon does not yet offer Flink 1.3.2, right?
Given that, it seems as if you have a dependency conflict. The method org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V was only introduced with Hadoop 2.4.0. Therefore, I assume that you have deployed a Flink 1.3.2 version that was built against Hadoop 2.3.0. Please deploy a Flink version built against the Hadoop version running on EMR. This will most likely solve all dependency conflicts.
Putting the Hadoop dependencies into the lib folder does not seem to work reliably, because the flink-shaded-hadoop2-uber.jar appears to take precedence on the classpath.
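One way to get such a build, assuming you compile Flink 1.3.2 from source against EMR's Hadoop 2.7.3 (the archive URL and Maven flags below are the standard ones for that release line, but double-check them against the Flink 1.3 build docs):
# Build Flink 1.3.2 against the Hadoop version EMR is running (2.7.3 here)
wget https://archive.apache.org/dist/flink/flink-1.3.2/flink-1.3.2-src.tgz
tar -xzf flink-1.3.2-src.tgz && cd flink-1.3.2
mvn clean install -DskipTests -Dhadoop.version=2.7.3
# The built distribution ends up under build-target/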