Is there a way to set up bootstrap actions to run on EMR after core services (Spark etc.) are installed?

Is there a way to set up bootstrap actions to run on EMR after core services (Spark, etc.) are installed? I am using emr-5.27.0.

You can submit a script as a step rather than as a bootstrap action. For example, I made an SSL certificate update script and apply it to the EMR cluster as a step. The snippet below is part of my Lambda function, written in Python, but you can also add the step manually in the console or from another language.
Steps=[{
    'Name': 'PrestoCertificate',
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 's3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar',
        'Args': ['s3://myS3/PrestoSteps_InstallCertificate.sh']
    }
}]
The key point is script-runner.jar, which is pre-built by Amazon; you can use it in any region by changing the region prefix in the S3 path. It takes a .sh file and runs it.
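For completeness, here is the same step as a minimal, self-contained boto3 sketch (the cluster ID is a placeholder):
import boto3

emr = boto3.client('emr', region_name='ap-northeast-2')

# Submit the shell script as a step; script-runner.jar downloads the
# script from S3 and executes it.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
    Steps=[{
        'Name': 'PrestoCertificate',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3://ap-northeast-2.elasticmapreduce/libs/script-runner/script-runner.jar',
            'Args': ['s3://myS3/PrestoSteps_InstallCertificate.sh']
        }
    }]
)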
One caveat: a step's script runs on the master node, whereas bootstrap actions run on every node in the cluster. If you need a script to run only on the master instance, guard it with an if statement:
#!/bin/bash
IS_MASTER=$(jq -r .isMaster /emr/instance-controller/lib/info/instance.json)
if [ "$IS_MASTER" == "true" ]
then
    <your code>
fi
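For contrast, this is roughly how a script is declared as a bootstrap action with boto3; note that bootstrap actions run before applications such as Spark are installed, which is exactly why a step is recommended here (the name and S3 path are placeholders):
# Passed as the BootstrapActions parameter of boto3's run_job_flow call.
# Bootstrap actions run on every node, which is where the isMaster guard
# above becomes useful.
BootstrapActions=[{
    'Name': 'MasterOnlySetup',  # hypothetical name
    'ScriptBootstrapAction': {
        'Path': 's3://myS3/master_only_setup.sh'  # placeholder script
    }
}]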

Related

Snakemake - Tibanna config support

I am trying to run snakemake --tibanna to deploy Snakemake on AWS using the "Unicorn" Step Function that Tibanna creates.
I can't seem to find a way to change the arguments Tibanna accepts, such as which subnet, AZ, or security group will be used for the EC2 instance it deploys.
Argument example (when running Tibanna without Snakemake):
https://github.com/4dn-dcic/tibanna/blob/master/test_json/unicorn/shelltest4.json#L32
Thanks!
Did you notice this option?
snakemake --help

--tibanna-config TIBANNA_CONFIG [TIBANNA_CONFIG ...]
                      Additional tibanna config e.g. --tibanna-config
                      spot_instance=true subnet=<subnet id> security
                      group=<security group id>
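Usage would look something like this (the IDs are placeholders; quote the key if it really contains a space, as the help text suggests):
snakemake --tibanna --tibanna-config spot_instance=true subnet=<subnet id> 'security group=<security group id>'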
I think it was added recently.
-jk

aws emr with yarn scheduler

I am creating an AWS EMR cluster using a CloudFormation template. I need to run the steps in parallel, so I am trying to change the YARN scheduler from FIFO to the fair/capacity scheduler.
I have added:
yarn.resourcemanager.scheduler.class : 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Do I need to add a FairScheduler.xml file to the conf.empty folder? If so, can you please share the XML file?
And if I want to add fairscheduler.xml through the CloudFormation template, do I need to use a bootstrap action for it? If so, could you provide the bootstrap file?
It looks like even after changing the scheduler, EMR won't run jobs concurrently.
You can configure your cluster by specifying the configuration in your CloudFormation script. This is an example configuration:
- Classification: fair-scheduler
  ConfigurationProperties:
    <key1>: <value1>
    <key2>: <value2>
- Classification: yarn-site
  ConfigurationProperties:
    yarn.acl.enable: true
    yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
Please follow these -
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-elasticmapreduce-cluster-configuration.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
EMR now also allows you to run multiple steps in parallel -
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/
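If the end goal is simply running steps in parallel, the feature in that last link is controlled by the cluster's StepConcurrencyLevel (available on recent EMR releases, 5.28.0 onwards if I remember correctly). A minimal boto3 sketch against an existing cluster, where the cluster ID and region are placeholders:
import boto3

emr = boto3.client('emr', region_name='us-east-1')

# StepConcurrencyLevel controls how many steps run at once (default is 1).
emr.modify_cluster(
    ClusterId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
    StepConcurrencyLevel=10
)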

Running custom script extension on deployed scale set instances

Currently I'm using the Custom Script Extension to run scripts on demand on my Azure VM servers as part of our software solution. Our other dev team is moving an application to a scale set, and I am no longer able to deploy the Custom Script Extension on demand to the scale set instances. The only solution I have found for running the Custom Script Extension on scale set instances is to reconfigure the deployment template with it. That method is not good for me, since the scripts run on demand and change frequently, and updating the template every time is bad practice.
Is there any way to configure the Custom Script Extension on scale set instances on demand, as on regular virtual machines?
PowerShell example of a regular on-demand script deployment on a VM:
Set-AzureRmVMCustomScriptExtension -ResourceGroupName myResourceGroup `
    -VMName myVM `
    -Location myLocation `
    -FileUri myURL `
    -Run 'myScript.ps1' `
    -Name DemoScriptExtension
I found a workaround for this using PowerShell (version 5.1) and ARM JSON templates. In commandToExecute under virtualMachineProfile in your JSON template, specify a value that changes on every deployment; this forces the command to re-execute each time the template is deployed. You will see in my template that I appended ' -Date ', deployment().name to the commandToExecute. The value for deployment().name is specified in my New-AzureRmResourceGroupDeployment command as:
-Name $($(Get-Date -format "MM_dd_yyyy_HH_mm"))
The deployment name is based on the date and time, so it changes every minute.
PowerShell Command:
New-AzureRmResourceGroupDeployment -ResourceGroupName $ResourceGroupName -TemplateFile $PathToJsonTemplate -TemplateParameterFile $PathToParametersFile -Debug -Name $($(Get-Date -format "MM_dd_yyyy_HH_mm")) -force
The custom script extension section under virtualMachineProfile in my script appears as such (pay attention to the commandToExecute):
"virtualMachineProfile": {
"extensionProfile": {
"extensions": [
{
"type": "Microsoft.Compute/virtualMachines/extensions",
"name": "MyExtensionName",
"location": "[parameters('location')]",
"properties": {
"publisher": "Microsoft.Compute",
"type": "CustomScriptExtension",
"typeHandlerVersion": "1.8",
"autoUpgradeMinorVersion": true,
"settings": {
"fileUris": [
"[concat(parameters('customScriptExtensionSettings').storageAccountUri, '/scripts/MyScript.ps1')]"
],
"commandToExecute": "[concat('powershell -ExecutionPolicy Unrestricted -File MyScript.ps1', ' -Date ', deployment().name)]"
},
"protectedSettings": {
"storageAccountName": "[parameters('customScriptExtensionSettings').storageAccountName]",
"storageAccountKey": "[listKeys(variables('accountid'),'2015-05-01-preview').key1]"
}
}
},
This will allow you to update a Custom Script Extension on a Virtual Machine Scale Set that has already been deployed. I hope this helps!
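If you drive deployments from Python rather than PowerShell, the same unique-name trick can be scripted through the Azure CLI. A sketch, assuming az is installed and logged in; the resource group, template, and parameter file names are placeholders:
import subprocess
from datetime import datetime

# A deployment name that changes every minute, so deployment().name in the
# template (and therefore commandToExecute) differs on each run, forcing
# the Custom Script Extension to re-execute.
deployment_name = datetime.now().strftime('%m_%d_%Y_%H_%M')

subprocess.run(
    [
        'az', 'deployment', 'group', 'create',
        '--resource-group', 'myResourceGroup',  # placeholder
        '--name', deployment_name,
        '--template-file', 'template.json',     # placeholder
        '--parameters', '@parameters.json',     # placeholder
    ],
    check=True,
)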
Is there any way to configure custom script extension on scale set instances on demand like on regular virtual machines?
For now, Azure does not support this.
We can only use the VMSS Custom Script Extension to install software at the time the scale set is provisioned.
For more information about VMSS extensions, please refer to this link.

How do you make use of cloudformation outputs within serverless framework?

If you deploy a CloudFormation template that creates a Kinesis stream, how can you provide the outputs, such as an ARN, to a Lambda created in the same deployment? Does CF run before Serverless creates the Lambdas, and is there a way to store the CloudFormation values in the Lambda?
To store the ARN from your CloudFormation template "s-resource-cf.json", add some items to the "Outputs" section.
"Outputs": {
"InsertVariableNameForLaterUse": {
"Description": "This is the Arn of My new Kinesis Stream",
"Value": {
"Fn::GetAtt": [
"InsertNameOfCfSectionToFindArnOf",
"Arn"
]
}
}
}
Fn::GetAtt is a CF intrinsic function that retrieves an attribute from another resource being created.
When you deploy the CF template using serverless resources deploy -s dev -r eu-west-1, the Kinesis stream is created for that stage/region and the ARN is saved into the region properties file /_meta/resources/variables/s-variables-dev-euwest1.json. Note the change to an initial lowercase letter: insertVariableNameForLaterUse.
You can then use that in the function's s-function.json as ${insertVariableNameForLaterUse}, for example in the environment section:
"environment": {
"InsertVariableNameWeWantToUseInLambda": "${insertVariableNameForLaterUse}"
...
}
and reference this variable in your Lambda using something like:
var myKinesisStreamArn = process.env.InsertVariableNameWeWantToUseInLambda;
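If the Lambda were written in Python instead, the equivalent read of the same (hypothetical) variable would be:
import os

# The environment section above injects the value at deploy time.
my_kinesis_stream_arn = os.environ['InsertVariableNameWeWantToUseInLambda']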
CloudFormation deployment happens before Lambda deployment, though you should probably control that with a script rather than relying on the dashboard:
serverless resources deploy -s dev -r eu-west-1
serverless function deploy --a -s dev -r eu-west-1
serverless endpoint deploy --a -s dev -r eu-west-1
Hope that helps.
What deployment steps are you following in Serverless? For the first part of your question: you can run 'sls resources deploy' to deploy all CF-related resources, and then 'sls function deploy' or 'sls dash deploy' to deploy the Lambda functions. So technically, resource deploy (CF) does not actually deploy Lambda functions.
For the second part: if you have a use case where you want to consume the output of a CF resource being created, that feature has (as of now) been merged into v0.5 of Serverless, which has not yet been released.

How to run scripts automatically after deployment in AWS using EB CLI?

I am trying to run a Django server on AWS. My Django app depends on some mathematical Python libraries like numpy, scipy, sklearn, etc. However, there is an issue for which I need to do this after every deployment:
sudo nano /etc/httpd/conf.d/wsgi.conf
---------------------------------------
add this line in the file
WSGIApplicationGroup %{GLOBAL}
---------------------------------------
sudo /etc/init.d/httpd reload
Basically, I need "WSGIApplicationGroup %{GLOBAL}" in my wsgi.conf file, otherwise I get 504 errors. I am using a custom AMI built on top of Amazon Linux 2014 and I am using the EB CLI for deployment. However, every deployment resets wsgi.conf so it no longer contains the line I added, and I have to SSH into the EC2 instance and redo this myself. That is overhead on every deployment, and it is also not feasible once we scale up (cloning or creating instances resets it as well). So is there a way to do this automatically after every deployment?
The content of wsgi.conf is fixed, so I can easily write a script to create it; the issue is how to trigger that script automatically.
PS: I am new to AWS.
You need to use the AWS Elastic Beanstalk feature called .ebextensions: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
In your case you can't use the files or commands sections, because:
The commands are processed in alphabetical order by name, and they run
before the application and web server are set up and the application
version file is extracted.
You need to use the container_commands section:
They run after the application and web server have been set up and the
application version file has been extracted, but before the
application version is deployed.
Example .ebextensions/01wsgi.config (not tested :-))
container_commands:
  apache_reload:
    command: |
      # append the directive only if it is not already present, then reload Apache
      grep -qF 'WSGIApplicationGroup %{GLOBAL}' /etc/httpd/conf.d/wsgi.conf || echo 'WSGIApplicationGroup %{GLOBAL}' >> /etc/httpd/conf.d/wsgi.conf
      /etc/init.d/httpd reload
Feel free to tweak my example as you like; for instance, you could keep a complete wsgi.conf somewhere and copy it over the original in container_commands.