I have an EMR Serverless application that cannot connect to an S3 bucket in another region. Is there a workaround for this? Maybe a parameter to set in the job parameters or Spark parameters when submitting a new job.
The error is this:
ExitCode: 1. Last few exceptions: Caused by: java.net.SocketTimeoutException: connect timed out Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException
In order to connect to an S3 bucket in another region or access external services, the EMR Serverless application needs to be created with a VPC.
This is mentioned on the considerations page:
Without VPC connectivity, a job can access some AWS service endpoints in the same AWS Region. These services include Amazon S3, AWS Glue, Amazon DynamoDB, Amazon CloudWatch, AWS KMS, and AWS Secrets Manager.
Here's an example AWS CLI command to create an application in a VPC - you need to provide a list of Subnet IDs and Security Group IDs. More details can be found in configuring VPC access.
aws emr-serverless create-application \
    --type SPARK \
    --name etl-jobs \
    --release-label "emr-6.6.0" \
    --network-configuration '{
        "subnetIds": ["subnet-01234567890abcdef","subnet-01234567890abcded"],
        "securityGroupIds": ["sg-01234566889aabbcc"]
    }'
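If the application already exists, it should also be possible to attach the network configuration after the fact instead of recreating it. A minimal sketch, assuming the application ID, subnet IDs, and security group ID shown are placeholders; check the CLI reference for update-application behaviour in your version:
# Attach VPC networking to an existing EMR Serverless application.
# The application may need to be in a stopped state before it can be updated.
aws emr-serverless update-application \
    --application-id "00f1234567890abc" \
    --network-configuration '{
        "subnetIds": ["subnet-01234567890abcdef","subnet-01234567890abcded"],
        "securityGroupIds": ["sg-01234566889aabbcc"]
    }'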
Related
I read that we should use assume-role to create an AWS EKS cluster. From the documentation, I only find the use of the EKS service role, but I don't see how to create a cluster with a role. Am I missing anything?
I have a Lambda function that connects to EMR using boto3. I want to run the "aws s3 cp" command on EMR using my Lambda function to copy files from S3 to EMR's local directory.
Is there a way to run aws cli commands on EMR using Lambda?
No.
AWS Lambda runs outside of your EMR cluster. It does not have access to run commands directly on the EMR cluster instances.
You could, theoretically, install the Systems Manager Agent on the EMR instances. (I haven't tried it, but it should work.) Your AWS Lambda function can then call the Systems Manager send_command() function to execute some code on the instances.
See: AWS Systems Manager Run Command - AWS Systems Manager
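For illustration, here is a minimal sketch of that Run Command call via the AWS CLI; from Lambda you would make the equivalent send_command() call through boto3's ssm client. The instance ID, bucket, and paths are placeholders, and it assumes the SSM Agent is installed and the EMR instance profile allows Systems Manager access.
# Run an "aws s3 cp" on a specific EMR instance via SSM Run Command.
# Instance ID, bucket, and destination path are placeholders.
aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --instance-ids "i-0123456789abcdef0" \
    --parameters 'commands=["aws s3 cp s3://my-bucket/input/data.csv /home/hadoop/data.csv"]'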
Was wondering if anyone has a solution to transfer files from a non-AWS Linux Server A to an AWS S3 bucket location by running commands from a non-AWS Linux Server B? Is it possible to avoid doing two hops? The future plan is to automate the process on Server B.
New info:
I am able to upload files to S3 from Server A, for example:
aws s3 sync /path/files s3://bucket/folder
But I'm not sure how to run it from a different Linux server (Server B).
There are several steps to using the aws s3 sync command from any server that supports the AWS CLI, Linux or otherwise:
1. Enable programmatic access for the IAM user/account you will use with the AWS CLI and download the credentials
   docs: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_console
2. Download and install the AWS CLI for your operating system
   Instructions are available for Docker, Linux, macOS, and Windows
   docs: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
3. Configure your AWS credentials for the CLI
   e.g. aws configure
   docs: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
4. Create the bucket you will sync to and allow your AWS user/identity access to this bucket
   docs: https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html
5. Run the aws s3 sync command according to the rules outlined in the official documentation (see the sketch after this list)
   e.g. aws s3 sync myfile s3://mybucket
   docs: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
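Putting the steps together, here is a minimal sketch of what this looks like on Server B once the AWS CLI is installed; the bucket name and paths are the placeholders from the question:
# One-time credential setup; prompts for access key, secret key, region, and output format.
aws configure

# Sync a local directory to the bucket (path and bucket/folder are placeholders).
aws s3 sync /path/files s3://bucket/folder

# Optional sanity check: list what landed in the bucket.
aws s3 ls s3://bucket/folder/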
I'm running my Play! webapp with Docker Cloud (could also use Rancher) and AWS and I'd like to store all the logs in S3 (via volume). Any ideas on how I could achieve that with minimal effort?
Use Docker volumes to store the logs on the host system.
Then use the AWS CLI to sync that local directory with an S3 bucket:
aws s3 sync /var/logs/container-logs s3://bucket/
Create a cron job to run it every minute or so (see the crontab sketch below).
Reference: s3 aws-cli
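A minimal crontab sketch, assuming the AWS CLI is installed and credentials are configured for the user that owns the crontab; the log directory and bucket are the ones from the answer above:
# Run `crontab -e` and add a line like this to sync every minute.
# Use the full path to the aws binary (check it with `which aws`), since cron's PATH is limited.
* * * * * /usr/local/bin/aws s3 sync /var/logs/container-logs s3://bucket/ >> /var/log/s3-log-sync.log 2>&1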
I can specify what AWS credentials to use to create an EMR cluster via environment variables. However, I would like to run a mapreduce job on another AWS user's S3 bucket for which they gave me a different set of AWS credentials.
Does MRJob provide a way to do this, or would I have to copy the bucket using my account first so that the bucket and EMR keys are the same?