AWS EMR - how to copy files to all the nodes?

Is there a way to copy a file to all the nodes in an EMR cluster through the EMR command line? I am working with Presto and have created a custom plugin. The problem is I have to install this plugin on all the nodes, and I don't want to log in to each node and copy it manually.

You can add it as a bootstrap script so this happens during the launch of the cluster.
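If you launch the cluster programmatically, the bootstrap action is just a script in S3 that EMR runs on every node at startup. A minimal, untested boto3 sketch (the cluster settings, bucket and script path below are placeholders, not from the question):

import boto3

emr = boto3.client('emr')
response = emr.run_job_flow(
    Name='presto-cluster',
    ReleaseLabel='emr-6.3.0',
    Applications=[{'Name': 'Presto'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    BootstrapActions=[
        {
            'Name': 'install-presto-plugin',
            'ScriptBootstrapAction': {
                # a shell script that copies your plugin from S3 into the
                # Presto plugin directory; it runs on every node during launch
                'Path': 's3://my-bucket/bootstrap/install-plugin.sh',
            },
        }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])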

If you have the option of bringing up a new EMR cluster, you should consider using an EMR bootstrap action.
But in case you want to do this on an existing EMR cluster (bootstrap actions only run at launch time),
you can do it with the help of AWS Systems Manager (SSM) and the built-in EMR client.
Something like this (Python):
emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')
You can get the list of core instances using emr_client.list_instances,
and finally send a command to each of these instances using ssm_client.send_command.
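Putting those pieces together, a rough, untested sketch (the cluster ID, S3 path and plugin directory are placeholders):

import boto3

emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')

# collect the EC2 instance IDs of the running master and core nodes
instances = emr_client.list_instances(
    ClusterId='j-XXXXXXXXXXXXX',
    InstanceGroupTypes=['MASTER', 'CORE'],
    InstanceStates=['RUNNING'],
)
instance_ids = [i['Ec2InstanceId'] for i in instances['Instances']]

# run a shell command on every node via SSM
# (each node must be registered as an SSM managed instance)
ssm_client.send_command(
    InstanceIds=instance_ids,
    DocumentName='AWS-RunShellScript',
    Parameters={
        'commands': [
            'aws s3 cp s3://my-bucket/my-presto-plugin.zip /tmp/',
            'sudo unzip -o /tmp/my-presto-plugin.zip -d /usr/lib/presto/plugin/',
        ]
    },
)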
Ref: check the last detailed example, "Installing Libraries on Core Nodes of a Running Cluster", at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html#emr-jupyterhub-install-libs
Note: if you are going with SSM, you need to have the proper SSM IAM policy attached to the IAM role of your master node.

Related

What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Gen 2?

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. I would also need my AWS profile credentials to persist.
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use an Azure Data Science Virtual Machine to: 1) install the AWS CLI, 2) create my AWS profile to store the access credentials, and 3) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool that really syncs the data; more or less all of them will do a copy, so I think you will have to implement the sync logic yourself. A couple of quick thoughts:
#3) You can run this from a batch service, and you can initiate that from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks as well (see the sketch at the end of this answer).
#4) ADF does not have any sync logic for deleted files. You can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
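For #3, a minimal sketch of what that Python wrapper could look like (it assumes the AWS CLI is installed and an aws_profile is configured wherever this runs; the bucket and destination paths are placeholders):

import subprocess

# one-way sync from S3 into a local/mounted staging area; --delete mirrors
# deletions from the source into the destination of this sync step
result = subprocess.run(
    [
        'aws', 's3', 'sync',
        's3://my-source-bucket/data/',
        '/mnt/staging/s3-data/',
        '--profile', 'aws_profile',
        '--delete',
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)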

Failure to start a Neptune notebook

I can't seem to create a Neptune notebook; every time I try I get the following error:
Notebook Instance Lifecycle Config 'arn:aws:sagemaker:us-west-2:XXXXXXXX:notebook-instance-lifecycle-config/aws-neptune-tutorial-lc'
for Notebook Instance 'arn:aws:sagemaker:us-west-2:XXXXXXXXX:notebook-instance/aws-neptune-tutorial'
took longer than 5 minutes.
Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.
Note that the CloudWatch logs it suggests looking at don't exist.
The Neptune database was created using this CloudFormation template, which created the Neptune cluster in the default VPC:
https://github.com/awslabs/aws-cloudformation-templates/blob/master/aws/services/NeptuneDB/Neptune.yaml
The notebook instance was created using this CloudFormation template, passing in the relevant values from the created Neptune stack:
https://s3.amazonaws.com/aws-neptune-customer-samples/neptune-sagemaker/cloudformation-templates/neptune-sagemaker/neptune-sagemaker-nested-stack.json
Has anyone seen this type of error and know how to get past it?
I had to go in and modify the predefined install script used by Neptune and add a nohup command to the final section of the install, as described here: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-lifecycle-script-timeout/
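For reference, a rough, untested sketch of that workaround: push the slow part of the install into the background with nohup so the lifecycle script itself returns within the 5-minute limit. The config name is taken from the error above; the install script path is hypothetical.

import base64
import boto3

on_create_script = """#!/bin/bash
set -ex
# run the long Neptune tutorial setup in the background and detach it
nohup /home/ec2-user/SageMaker/install-neptune-tutorial.sh > /tmp/install.log 2>&1 &
"""

sagemaker = boto3.client('sagemaker', region_name='us-west-2')
sagemaker.update_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='aws-neptune-tutorial-lc',
    OnCreate=[{'Content': base64.b64encode(on_create_script.encode()).decode()}],
)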
Probably what is happening is that your notebook instance does not have access to the internet. Check the NAT configuration of your VPC and that its security groups have outbound rules allowing traffic to all destinations.

Spinnaker Support for App ELB in AWS

I am facing 2 issues with a new Spinnaker installation.
I could not see my Application Load Balancers listed in the dropdown of the load balancers tab while creating a pipeline. We are currently using only application load balancers in our setup. I tried editing the pipeline's JSON with the config below and it didn't work. I verified it by checking the ASG created in my AWS account for any associated ELB/target group, but I couldn't see any.
"targetGroups": [
"TG-APP-ELB-NAME-DEV"
],
Hence, I would like to confirm how I can get Application Load Balancer support into my Spinnaker installation and how to use it.
I have also found an AMI search issue. My current setup is briefly as follows:
One managing account - prod where my spinnaker ec2 is running & my prod application instances are running
Two managed accounts - dev & test where my application test instances are running.
When I create a new AMI in my dev AWS account and try to search for the newly created AMI from Spinnaker, it fails with an error that it couldn't find the AMI. When I then shared the AMI from dev to prod, Spinnaker was able to find it, but it failed with an UnAuthorized error.
Please help me clarify:
1. Whether sharing is required for any new AMI from dev -> prod, or whether our spinnakerManaged role would take care of the permissions.
2. How to fix this problem and create the AMI successfully.
Regarding #1, have you created the App Load Balancer through the Spinnaker UI or directly through AWS?
If it is the former, then make sure it follows the naming convention expected by Spinnaker (I believe the load balancer name should start with the application name).

Is there an Ansible module for creating 'instance-store' based AMI's?

Creating AMIs from EBS-backed instances is exceedingly easy, but doing the same from an instance-store-backed instance seems like it can only be done manually using the CLI.
So far I've been able to bootstrap the creation of an 'instance-store' based server off of an HVM Amazon Linux AMI with Ansible, but I'm getting lost on the steps that follow... I'm trying to follow this: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-instance-store-ami.html#amazon_linux_instructions
Apparently I need to store my X.509 certificate and key on the instance, but which key is that? Is it...
one I have to generate on the instance with openssl,
one that I generate/convert from AWS,
one I generate with Putty, or
one that already exists in my AWS account?
After that, I can't find any reference to ec2-bundle-vol in Ansible. So I'm left wondering if the only way to do this is with Ansible's command module.
Basically what I'm hoping to find out is: is there a way to easily create instance-store based AMIs using Ansible, and if not, can anyone outline the steps necessary to automate this? Thanks!
Generally speaking, Ansible AWS modules are meant to manage AWS resources by interacting with the AWS HTTP API (i.e. actions you could otherwise do in the AWS Management Console).
They are not intended to run AWS-specific system tools on EC2 instances.
ec2-bundle-vol and ec2-upload-bundle must be run on the EC2 instance itself; they are not callable via the HTTP API.
I'm afraid you need to write a custom playbook / role to automate the process.
On the other hand, aws ec2 register-image is an AWS API call and corresponds to the ec2_ami Ansible module.
Unfortunately, this module doesn't seem to support image registering from an S3 bucket.
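If it helps, the underlying register-image call does accept the S3 bundle manifest for instance-store images, so one fallback is to call it directly (a hedged boto3 sketch; the bucket, manifest path and AMI name are placeholders) and wrap it, along with the ec2-bundle-vol/ec2-upload-bundle steps, in command tasks in your playbook:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# register the instance-store AMI from the bundle manifest that
# ec2-upload-bundle placed in S3
response = ec2.register_image(
    Name='my-instance-store-ami',
    ImageLocation='my-bundle-bucket/bundles/image.manifest.xml',
    Architecture='x86_64',
    VirtualizationType='hvm',
)
print(response['ImageId'])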

Reading file inside S3 from EC2 instance

I would like to use AWS Data Pipeline to start an EC2 instance and then run a python script that is stored in S3.
Is it possible? I would like to make a single ETL step using a python script.
Is it the best way?
Yes, it is possible and relatively straightforward using the ShellCommandActivity.
From the details you have provided so far, I believe it is the best way, as Data Pipeline provisions the EC2 instance for you on demand and shuts it down afterwards.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
There is also a tutorial that you can follow to get acclimated to the ShellCommandActivity of Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-gettingstartedshell.html
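To make the shape of such a pipeline concrete, here is a rough, untested boto3 sketch (names, roles, region and S3 paths are placeholders): an on-demand pipeline with an Ec2Resource and a ShellCommandActivity that downloads and runs the Python script from S3.

import boto3

dp = boto3.client('datapipeline', region_name='us-east-1')

pipeline = dp.create_pipeline(name='etl-python-script', uniqueId='etl-python-script-1')
pipeline_id = pipeline['pipelineId']

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {   # default object: on-demand schedule, roles and log location
            'id': 'Default', 'name': 'Default',
            'fields': [
                {'key': 'scheduleType', 'stringValue': 'ondemand'},
                {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
                {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
                {'key': 'pipelineLogUri', 'stringValue': 's3://my-bucket/dp-logs/'},
            ],
        },
        {   # the EC2 instance Data Pipeline provisions and later terminates
            'id': 'MyEc2Resource', 'name': 'MyEc2Resource',
            'fields': [
                {'key': 'type', 'stringValue': 'Ec2Resource'},
                {'key': 'instanceType', 'stringValue': 't2.micro'},
                {'key': 'terminateAfter', 'stringValue': '1 Hour'},
            ],
        },
        {   # the ETL step: fetch the script from S3 and run it
            'id': 'RunEtlScript', 'name': 'RunEtlScript',
            'fields': [
                {'key': 'type', 'stringValue': 'ShellCommandActivity'},
                {'key': 'command', 'stringValue': 'aws s3 cp s3://my-bucket/scripts/etl.py . && python etl.py'},
                {'key': 'runsOn', 'refValue': 'MyEc2Resource'},
            ],
        },
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)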
Yes, you can also directly upload and back up your data to S3:
http://awssolution.blogspot.in/2015/10/how-to-backup-share-and-organize-data.html