aws emr with yarn scheduler

aws emr with yarn scheduler - hadoop-yarn

I am creating AWS EMR using cloudformation template. I need to run the steps parallel. For that I am trying to change the YARN Scheduler from FIFO to fair / capacity scheduler.
I have added:
yarn.resourcemanager.scheduler.class : 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Do I need to add FairScheduler.xml file in conf.empty folder? If so, can you please share the xml file.
and if I want to add fairscheduler.xml through cloudformation template, then do I need to use bootstrap for it? if so could you provide me the bootstrap file please.

Looks like even though after changing the scheduler, EMR won't allow to run jobs concurrently.

You can configure your cluster by specifying the configuration in cloud-formation scripts.
This a example to configure
- Classification: fair-scheduler
ConfigurationProperties:
<key1>: <value1>
<key2>: <value2>
- Classification: yarn-site
ConfigurationProperties:
yarn.acl.enable: true
yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Please follow these -
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-configuration.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
EMR recently allows you to run multiple steps in parallel -
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/

Related

Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store the Flink's checkpoints using the FsStateBackend. But somehow I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: I am using Flink 1.10.0 version.

I have found the solution for the above issue, so here I am listing it in steps that are required.
Steps
We need to add some configs in the flink-conf.yaml file which I have listed below.
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/ #"s3://<your-bucket>/<endpoint>"
s3.access-key: XXXXXXXXXXXXXXXXXXX #your-access-key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx #your-secret-key
s3.endpoint: http://127.0.0.1:9000 #your-endpoint-hostname (I have used Minio)
After completing the first step we need to copy the respective(flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) JAR files from the opt directory to the plugins directory of your Flink.
E.g:--> 1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar // Recommended for StreamingFileSink
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar //Recommended for checkpointing
Add this in checkpointing code
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
After completing all the above steps re-start the Flink if it is already running.
Note:
If you are using both(flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink then please use s3p:// specificly for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop instead of s3://.
For more details click here.

Snakemake - Tibanna config support

I trying to run snakemake --tibanna to deploy Snakemake on AWS using the "Unicorn" Step Functions Tibanna creates.
I can't seem to find a way to change the different arguments Tibanna accepts like which subnet, AZ or Security Group will be used for the actual EC2 instance deployed.
Argument example (when running Tibanna without Snakemake):
https://github.com/4dn-dcic/tibanna/blob/master/test_json/unicorn/shelltest4.json#L32
Thanks!

Did you noticed this option?
snakemake --help
--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]
Additional tibanan config e.g. --tibanna-config
spot_instance=true subnet= security
group=
I think it was added recently.
-jk

In GitlabCI - How can we trigger a build/pipeline if a specific build/pipeline is completed successfully?

We are using GitLab Enterprise Edition 10.8.7-ee 075705a and trying to use Gitlab CI.
Here is my scenario:-
I've two repositories repo1 and repo2 and I'm setting up two pipelines pipeline1 and pipeline2.
Now I'm looking for an option where I can configure pipeline2 to trigger a build if pipeline1 build is successful. One more thing, I need to get the version number of the pipeline1 in pipeline2
Note:- I know we can trigger pipeline2 from pipeline1 but I need other way around.
Please suggest.

A couple of options.
Use the gitlab api to do this (triggers).
Use webhooks to do this.
gitlab webhooks docs
gitlab triggers docs
with this. you can get any data / meta data for your stack.
and can automagically call it/set it on any condition.
This can also be done if your stack is using aws (CLI) and (or) Jenkins
Some sections that may interest you in gitlab triggers docs
When used with multi-project Pipelines
When a pipeline depends on the artifacts of another Pipeline
Triggering a pipeline from a webhook
Using cron to trigger nightly (or pretty much *ly) pipelines

Ambari - Execute script when adding node to cluster

Is it possible (and how) to specify a shell script somewhere which will be executed each time a new node is added to Ambari cluster?
I'm using HDP Ambari for that and I would like to add some symbolic links when setup of new node is completed, but I want to automatize that so that I (or someone else) don't forget it.

There is no functionality that currently exists that will enable you to execute a script when a node is added to the cluster. What you're asking for is a custom hook. You would have to look through the Ambari source code and see if you can define a custom hook for the stack. There are a few hooks provided in each stack, for examples see: https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks

How to run scripts automatically after deployment in AWS using EB CLI?

I am trying to make a Django server on AWS. My django app depends on some mathematical python libraries like numpy, scipy, sklearn etc. However there is an issue for which I need to this after every deployment
sudo nano /etc/httpd/conf.d/wsgi.conf
---------------------------------------
add this line in the file
WSGIApplicationGroup %{GLOBAL}
---------------------------------------
sudo /etc/init.d/httpd reload
Basically I need "WSGIApplicationGroup %{GLOBAL}" in my wsgi.conf file otherwise I get 504. I am using a Custom AMI built on top of Amazon Linux 2014 and I am using EB CLI for deployment. However whenever I deploy the wsgi.conf is reset and it does not contain the line that I have added previously and I need to manually SSH into the EC2 instance and do this task myself. It gives a overhead on every deployment and its also not feasible once we scale up (cloning or creating instances also resets it). So is there a way that this will be automatically done after every deployment ?
The content of the wsgi.conf is fixed, so basically I can make a script easily to create it but the issue is how to trigger the script automatically ?
PS:I am new to AWS

You need to use AWS Elastic Beanstalk feature called .ebextensions: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
In your case you can't use Files or Commands sections, because:
The commands are processed in alphabetical order by name, and they run
before the application and web server are set up and the application
version file is extracted.
You need to use Container_commands section:
They run after the application and web server have been set up and the
application version file has been extracted, but before the
application version is deployed.
Example .ebextensions/01wsgi.config (not tested :-))
container_commands:
apache_reload:
command: |
echo "WSGIApplicationGroup %{GLOBAL}" >> /etc/httpd/conf.d/wsgi.conf
/etc/init.d/httpd reload
Feel free to tweak my example as you want, for example you can copy your temporary wsgi.conf file somewhere and then replace original in Container_commands section.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

aws emr with yarn scheduler - hadoop-yarn

Looks like even though after changing the scheduler, EMR won't allow to run jobs concurrently.

Related

Apache Flink to use S3 for backend state and checkpoints

Snakemake - Tibanna config support

In GitlabCI - How can we trigger a build/pipeline if a specific build/pipeline is completed successfully?

Ambari - Execute script when adding node to cluster

How to run scripts automatically after deployment in AWS using EB CLI?

Categories

Resources