I just set up an EMR cluster with built-in Spark, JupyterHub and so on. I am able to access the Jupyter Notebook at http://master_hostname:9443/hub/login but I have no idea which credentials to use to log in, or where I can set them up in EMR.
Thank you in advance!
As documented here, the default admin user is named "jovyan" and the password is "jupyter" :)
I recently installed the AWS CLI on a Linux machine following the documentation on the official AWS website. At first, I was able to run the s3 commands without any issue. As part of my development, I uninstalled aws-cli and re-installed it. After that, running aws s3 ls gave me this error:
botocore.utils.BadIMDSRequestError: <botocore.awsrequest.AWSRequest object at 0x7f3f6cb44d00>
I figured it out.
I just needed to set the default region:
aws configure
AWS Access Key ID [******************RW]:
AWS Secret Access Key [******************7/]:
Default region name [None]: us-east-1
Then it works!
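For reference, aws configure writes the default region to the shared config file (~/.aws/config by default), and the AWS_DEFAULT_REGION environment variable takes precedence over that file. Below is a minimal Python sketch to check which region, if any, is configured. The helper name configured_region is made up for illustration, and it only approximates the CLI's lookup order (it does not look at instance metadata, for example).

```python
import configparser
import os
from pathlib import Path


def configured_region(config_path=None, profile="default", env=None):
    """Return the region the AWS CLI would likely pick up, or None if unset.

    Hypothetical helper: checks AWS_DEFAULT_REGION first, then the shared
    config file (AWS_CONFIG_FILE or ~/.aws/config).
    """
    env = os.environ if env is None else env

    # The environment variable overrides the config file.
    if env.get("AWS_DEFAULT_REGION"):
        return env["AWS_DEFAULT_REGION"]

    path = Path(
        config_path
        or env.get("AWS_CONFIG_FILE")
        or Path.home() / ".aws" / "config"
    )

    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored

    # Named profiles are stored as "[profile NAME]"; the default is "[default]".
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.get(section, "region", fallback=None)


if __name__ == "__main__":
    region = configured_region()
    print(region if region else "No default region configured; run 'aws configure'.")
```

If this prints that no region is configured, running aws configure as above (or exporting AWS_DEFAULT_REGION) should fill the gap.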
Is there a way to copy a file to all the nodes in an EMR cluster through the EMR command line? I am working with Presto and have created my own custom plugin. The problem is that I have to install this plugin on all the nodes. I don't want to log in to every node and copy it manually.
You can add it as a bootstrap script to let this happen during the launch of the cluster.
#Sanket9394 Thanks for the edit!
If you are able to launch a new EMR cluster, you should consider using an EMR bootstrap script.
But in case you want to do it on an existing EMR cluster (bootstrap scripts only run at launch time),
you can do this with the help of AWS Systems Manager (SSM) and the EMR client in boto3.
Something like this (Python):
import boto3
emr_client = boto3.client('emr')
ssm_client = boto3.client('ssm')
You can get the list of core instances using emr_client.list_instances and then send a command to each of these instances using ssm_client.send_command.
Ref: see the last detailed example, "Example Installing Libraries on Core Nodes of a Running Cluster", at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html#emr-jupyterhub-install-libs
Note: if you go with SSM, you need to have a proper SSM IAM policy attached to the IAM role of your master node.
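To make the SSM route more concrete, here is a minimal sketch of the two calls put together. It assumes the file has already been uploaded to S3, that the SSM agent is running on the nodes, and that the instances' IAM role allows SSM and read access to the bucket. The function name copy_to_all_nodes, the bucket, and the paths are hypothetical placeholders, not anything defined by EMR.

```python
import shlex


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def copy_to_all_nodes(cluster_id, s3_uri, dest_path,
                      emr_client=None, ssm_client=None):
    """Ask every running node of an EMR cluster to download one file from S3.

    Returns the list of SSM command IDs that were issued.
    """
    if emr_client is None or ssm_client is None:
        import boto3  # imported lazily so stub clients can be passed in
        emr_client = emr_client or boto3.client("emr")
        ssm_client = ssm_client or boto3.client("ssm")

    # Collect the EC2 instance IDs of all running nodes in the cluster.
    instance_ids = []
    paginator = emr_client.get_paginator("list_instances")
    for page in paginator.paginate(ClusterId=cluster_id,
                                   InstanceStates=["RUNNING"]):
        instance_ids.extend(inst["Ec2InstanceId"] for inst in page["Instances"])

    # Shell command each node will run; quoting guards against odd characters.
    command = f"aws s3 cp {shlex.quote(s3_uri)} {shlex.quote(dest_path)}"

    command_ids = []
    # send_command takes at most 50 instance IDs per call, so send in batches.
    for batch in chunked(instance_ids, 50):
        response = ssm_client.send_command(
            InstanceIds=batch,
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": [command]},
        )
        command_ids.append(response["Command"]["CommandId"])
    return command_ids


# Hypothetical usage; replace the cluster ID, bucket and paths with your own:
# copy_to_all_nodes("j-XXXXXXXXXXXXX",
#                   "s3://my-bucket/my-plugin.jar",
#                   "/tmp/my-plugin.jar")
```

The returned command IDs can be passed to ssm_client.list_command_invocations to check whether the copy succeeded on each node. For a Presto plugin you would probably also need to restart Presto on the nodes afterwards, which this sketch does not do.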
I am trying to set up an s3fs mount, following the link below, using an IAM role on an EC2 instance launched from the RHEL 7.7 AMI.
The problem is that the s3fs mount is not working after following the steps. Can anyone help me figure out what is wrong?
I can't seem to create a Neptune notebook. Every time I try, I get the following error:
Notebook Instance Lifecycle Config 'arn:aws:sagemaker:us-west-2:XXXXXXXX:notebook-instance-lifecycle-config/aws-neptune-tutorial-lc'
for Notebook Instance 'arn:aws:sagemaker:us-west-2:XXXXXXXXX:notebook-instance/aws-neptune-tutorial'
took longer than 5 minutes.
Please check your CloudWatch logs for more details if your Notebook Instance has Internet access.
Note that the CloudWatch logs it suggests looking at don't exist.
The Neptune database was created using this CloudFormation template: https://github.com/awslabs/aws-cloudformation-templates/blob/master/aws/services/NeptuneDB/Neptune.yaml
This created the Neptune cluster in the default VPC.
The notebook instance was created using this CloudFormation template: https://s3.amazonaws.com/aws-neptune-customer-samples/neptune-sagemaker/cloudformation-templates/neptune-sagemaker/neptune-sagemaker-nested-stack.json
passing in the relevant values from the created Neptune stack.
Has anyone seen this type of error and know how to get past it?
I had to go in and modify the predefined install script used by Neptune and add a nohup command to the final section of the install, as described here: https://aws.amazon.com/premiumsupport/knowledge-center/sagemaker-lifecycle-script-timeout/
What is probably happening is that your notebook instance does not have access to the internet. Check the NAT configuration for your VPC, and check that its security groups have outbound rules allowing all traffic.
I have set up a local Zeppelin notebook to access a Glue Dev endpoint. I'm able to run Spark and PySpark code and access the Glue catalog. But when I try spark.sql("show databases").show() or %sql show databases, only default is returned.
When spinning up an EMR cluster we have to choose "Use for Hive table metadata" to enable this. I had hoped this would be the default setting for a Glue Development endpoint, but that does not seem to be the case. Is there any workaround for this?