How to redo a failed step in AWS EMR - amazon-emr

My step in AWS EMR is failing. How can I redo only that step from the UI, without creating another cluster? I couldn't find this information online.

The option to 'clone' the step appears when you select it; you will find this under the 'Steps' tab.
(Screenshot: example of the Steps tab.)

Related

Apache Flink to use S3 for backend state and checkpoints

Background
I was planning to use S3 to store the Flink's checkpoints using the FsStateBackend. But somehow I was getting the following error.
Error
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded.
Flink version: 1.10.0.
I have found the solution to this issue, so I am listing the required steps below.
Steps
We need to add the following configuration to the flink-conf.yaml file:
state.backend: filesystem
state.checkpoints.dir: s3://s3-bucket/checkpoints/            # s3://<your-bucket>/<path>
state.backend.fs.checkpointdir: s3://s3-bucket/checkpoints/   # s3://<your-bucket>/<path>
s3.access-key: XXXXXXXXXXXXXXXXXXX                            # your access key
s3.secret-key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   # your secret key
s3.endpoint: http://127.0.0.1:9000                            # your endpoint hostname (I have used MinIO)
After completing the first step, we need to copy the respective JAR files (flink-s3-fs-hadoop-1.10.0.jar and flink-s3-fs-presto-1.10.0.jar) from the opt directory to the plugins directory of your Flink installation.
E.g.:
1. Copy /flink-1.10.0/opt/flink-s3-fs-hadoop-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar (recommended for the StreamingFileSink)
2. Copy /flink-1.10.0/opt/flink-s3-fs-presto-1.10.0.jar to /flink-1.10.0/plugins/s3-fs-presto/flink-s3-fs-presto-1.10.0.jar (recommended for checkpointing)
Add this to your checkpointing code:
env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"))
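For context, here is a minimal sketch of how that call might fit into a Flink 1.10 job. The bucket path, checkpoint interval, class name, and the trivial pipeline are placeholders for illustration, not from the original post.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToS3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 60 seconds (example interval)
        env.enableCheckpointing(60_000);
        // Write checkpoints to S3 through the FsStateBackend; the bucket/path is a placeholder
        env.setStateBackend(new FsStateBackend("s3://s3-bucket/checkpoints/"));
        // A trivial pipeline just so the job has something to execute
        env.fromElements("a", "b", "c").print();
        env.execute("checkpoint-to-s3-example");
    }
}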
After completing all the above steps, restart Flink if it is already running.
Note:
If you are using both plugins (flink-s3-fs-hadoop and flink-s3-fs-presto) in Flink, then use the scheme s3p:// specifically for flink-s3-fs-presto and s3a:// for flink-s3-fs-hadoop, instead of s3://.
For more details, see the official Flink documentation on S3 file systems.

AWS EMR with YARN scheduler

I am creating an AWS EMR cluster using a CloudFormation template. I need to run the steps in parallel, so I am trying to change the YARN scheduler from FIFO to the fair/capacity scheduler.
I have added:
yarn.resourcemanager.scheduler.class : 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Do I need to add a FairScheduler.xml file in the conf.empty folder? If so, can you please share the XML file?
And if I want to add fairscheduler.xml through the CloudFormation template, do I need to use a bootstrap action for it? If so, could you provide the bootstrap file?
It looks like, even after changing the scheduler, EMR won't allow jobs to run concurrently.
You can configure your cluster by specifying the configuration in your CloudFormation template.
This is an example configuration:
- Classification: fair-scheduler
  ConfigurationProperties:
    <key1>: <value1>
    <key2>: <value2>
- Classification: yarn-site
  ConfigurationProperties:
    yarn.acl.enable: true
    yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Please follow these -
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-configuration.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
EMR now also allows you to run multiple steps in parallel, cancel running steps, and integrate with AWS Step Functions -
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/
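For illustration, here is a minimal sketch of how the scheduler configuration and parallel steps might be declared together on an AWS::EMR::Cluster resource. The resource name, cluster name, instance types, and concurrency value are placeholders, it assumes the default EMR IAM roles already exist, and StepConcurrencyLevel requires EMR release 5.28.0 or later.

Resources:
  MyEmrCluster:                            # placeholder resource name
    Type: AWS::EMR::Cluster
    Properties:
      Name: fair-scheduler-demo            # placeholder cluster name
      ReleaseLabel: emr-5.28.0             # parallel steps need 5.28.0 or later
      JobFlowRole: EMR_EC2_DefaultRole     # assumes the default EMR roles exist
      ServiceRole: EMR_DefaultRole
      StepConcurrencyLevel: 5              # run up to 5 steps at the same time
      Applications:
        - Name: Hadoop
      Instances:
        MasterInstanceGroup:
          InstanceCount: 1
          InstanceType: m5.xlarge          # placeholder instance type
        CoreInstanceGroup:
          InstanceCount: 2
          InstanceType: m5.xlarge
      Configurations:
        - Classification: yarn-site
          ConfigurationProperties:
            yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler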

Spinnaker:There is no stage type to be selected

I deployed Spinnaker with the './InstallSpinnaker.sh' script from https://github.com/spinnaker/spinnaker, but when I create a pipeline and add a stage, I find there is no stage type to select. Could you help with this? Thanks!
Remember, this was the option for Spinnaker as a dev deployment on a local Debian host.
Please install it in a Kubernetes cluster instead, or get started easily with Minnaker.
thanks

Ambari - Execute script when adding node to cluster

Is it possible (and how) to specify a shell script somewhere that will be executed each time a new node is added to an Ambari cluster?
I'm using HDP Ambari, and I would like to add some symbolic links when the setup of a new node is completed, but I want to automate that so that I (or someone else) don't forget it.
No functionality currently exists that will let you execute a script when a node is added to the cluster. What you're asking for is a custom hook. You would have to look through the Ambari source code and see if you can define a custom hook for the stack. There are a few hooks provided in each stack; for examples, see: https://github.com/apache/ambari/tree/trunk/ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks

Use of Enable blocking in PDI - Pig Script Executor

I am exploring the Big Data plugin in Pentaho 5.2 and was trying to run the Pig Script Executor. I am unable to understand the usage of
'Enable blocking'. The PDI documentation says that
If checked, the Pig Script Executor job entry will prevent downstream
entries from executing until the script has finished processing.
I am aware that running a Pig script converts the execution into MapReduce jobs. I am running the job as Start job -> Pig Script. If I disable the 'Enable blocking' option, I am unable to execute the script; I get permission denied errors.
What does 'downstream' mean here? I do not pass any hops out of the Pig Script entry. I am unable to understand the 'Enable blocking' option. Any hints would be helpful and appreciated.
Enable blocking enabled: the task is deployed to the Hadoop cluster; PDI follows its progress and only proceeds with the rest of the job entries AFTER the Hadoop job finishes.
Enable blocking disabled: PDI deploys the task to the Hadoop cluster and forgets about it; the rest of the job entries proceed immediately after the cluster accepts the task, and PDI does not wait for it to complete.