I have a Lambda function that connects to EMR using boto3. I want to run the "aws s3 cp" command on EMR from my Lambda function to copy files from S3 to the EMR cluster's local directory.
Is there a way to run AWS CLI commands on EMR using Lambda?
No.
AWS Lambda runs in its own managed environment, separate from your cluster. It does not have access to run commands on the EMR cluster instances.
You could, theoretically, install the Systems Manager Agent on the EMR cluster's instances. (I haven't tried it, but it should work.) Your AWS Lambda function can then call the Systems Manager send_command() function to execute some code on the instance.
See: AWS Systems Manager Run Command - AWS Systems Manager
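For illustration, a minimal boto3 sketch of that approach, assuming the SSM Agent is running on the EMR node and the node's instance profile permits Systems Manager; the instance ID, bucket, and paths are placeholders:

```python
import boto3

ssm = boto3.client("ssm")

def lambda_handler(event, context):
    # Hypothetical instance ID of an EMR node that is registered with SSM
    instance_id = "i-0123456789abcdef0"

    # Ask SSM to run the aws s3 cp command on the instance itself
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": ["aws s3 cp s3://my-bucket/my-file /home/hadoop/my-file"]
        },
    )

    # The command runs asynchronously; poll ssm.get_command_invocation()
    # with this ID if you need to wait for the result.
    return response["Command"]["CommandId"]
```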
Recently Amazon launched EMR Serverless and I want to repurpose my existing data pipeline orchestration that uses AWS Step Functions: there are steps that create an EMR cluster, run some Lambda functions, submit Spark jobs (mostly Scala jobs using spark-submit), and finally terminate the cluster. All these steps are of the sync type (arn:aws:states:::elasticmapreduce:addStep.sync).
There are documentation and GitHub samples that describe submitting jobs from orchestration frameworks such as Airflow, but there is nothing that describes how to use AWS Step Functions with EMR Serverless. Any help in this regard is appreciated.
Primarily I am interested in repurposing the task state of type arn:aws:states:::elasticmapreduce:addStep.sync, which takes parameters such as ClusterId, but in the case of EMR Serverless there is no such ID.
In summary, is there an equivalent of "Call Amazon EMR with Step Functions" for EMR Serverless?
Currently there is no direct integration of EMR Serverless with Step Functions. However, a possible solution is adding a Lambda layer on top and using the SDK to create EMR Serverless applications and submit jobs. You would also need an additional Lambda to implement a poller that tracks the success of the jobs (in case of interdependent jobs), as it is highly likely that the EMR job will outrun Lambda's 15-minute runtime limit.
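As a rough sketch of that pattern (not an official integration; the application ID, role ARN, and S3 paths below are placeholders), the submitting Lambda and the polling Lambda could use boto3's EMR Serverless client like this:

```python
import boto3

# Note: you may need a Lambda layer with an up-to-date boto3 that
# includes the "emr-serverless" client, since the bundled SDK can lag.
emr_serverless = boto3.client("emr-serverless")

def submit_job_handler(event, context):
    # Placeholder application ID and IAM role ARN
    response = emr_serverless.start_job_run(
        applicationId="00f1abcdexample",
        executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jars/my-etl-job.jar",
                "entryPointArguments": ["--date", "2023-01-01"],
                "sparkSubmitParameters": "--class com.example.MyJob",
            }
        },
    )
    return response["jobRunId"]

def poll_job_handler(event, context):
    # Called repeatedly from the state machine until the job finishes
    job = emr_serverless.get_job_run(
        applicationId=event["applicationId"],
        jobRunId=event["jobRunId"],
    )
    return job["jobRun"]["state"]  # e.g. RUNNING, SUCCESS, FAILED
```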
Was wondering if anyone has a solution to transfer files from a non-AWS Linux Server A to an AWS S3 bucket location by using/running commands from a non-AWS Linux Server B? Is it possible to avoid doing two hops? The future plan is to automate the process on Server B.
New info:
I am able to upload files to S3 from Server A like this:
aws s3 sync /path/files s3://bucket/folder
But I am not sure how to run/execute it from a different Linux server (Server B).
There are several steps to using the aws s3 sync command from any server that supports the AWS CLI, Linux or otherwise:
1. Enable programmatic access for the IAM user/account you will use with the AWS CLI and download the credentials.
   Docs: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html#id_users_create_console
2. Download and install the AWS CLI for your operating system. Instructions are available for Docker, Linux, macOS, and Windows.
   Docs: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
3. Configure your AWS credentials for the CLI, e.g. aws configure
   Docs: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
4. Create the bucket you will sync to and allow your AWS user/identity access to this bucket.
   Docs: https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html
5. Run the aws s3 sync command according to the rules outlined in the official documentation, e.g. aws s3 sync myfile s3://mybucket
   Docs: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
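If you later automate the process on Server B in Python instead of the CLI, a minimal boto3 sketch could look like the following (bucket name, prefix, and local path are placeholders; unlike aws s3 sync, this uploads everything rather than only changed files):

```python
import os
import boto3

s3 = boto3.client("s3")  # uses the credentials set up via `aws configure` or env vars

def upload_directory(local_dir, bucket, prefix=""):
    """Upload every file under local_dir to s3://bucket/prefix/..."""
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # Build the S3 key relative to the directory being uploaded
            key = os.path.join(prefix, os.path.relpath(local_path, local_dir))
            s3.upload_file(local_path, bucket, key)

upload_directory("/path/files", "bucket", "folder")
```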
We have ETL jobs, i.e. a Java JAR (which performs the ETL operations) is run via a shell script. The shell script is passed some parameters depending on the job being run. These shell scripts are run via crontab as well as manually, depending on the requirements. Sometimes there is also a need to run some SQL commands/scripts on the PostgreSQL RDS DB before the shell script runs.
We have everything on AWS, i.e. an EC2 Talend server, PostgreSQL RDS, Redshift, Ansible, etc.
How can we automate this process? How do we deploy it and handle passing custom parameters, etc.? Pointers are welcome.
I would prefer to go with AWS Data Pipeline, and add steps to perform any pre/post operations on your ETL job, like running shell scripts or any HQL, etc.
AWS Glue runs on the Spark engine, and it has other features as well, such as AWS Glue development endpoints, crawlers, the Data Catalog, and job schedulers. I think AWS Glue would be ideal if you are starting afresh, or plan to move your ETL to AWS Glue. Please refer here for a price comparison.
AWS Data Pipeline: For details on AWS Data Pipeline
AWS Glue FAQ: For details on supported languages for AWS Glue
Please note according to AWS Glue FAQ:
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
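For illustration, if you did move the ETL to Glue, custom parameters can be passed as job arguments and read inside the script. This is only a minimal Python sketch using the standard Glue job boilerplate (it runs only inside the Glue environment, and the source_path/target_path argument names are hypothetical placeholders):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Custom parameters are passed as --JOB_NAME, --source_path, --target_path job arguments
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder transformation: read, transform, write
df = spark.read.parquet(args["source_path"])
df.write.mode("overwrite").parquet(args["target_path"])

job.commit()
```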
Edit: As Jon scott commented, Apache Airflow is another option for job scheduling, but I have not used it.
You can use AWS Glue for performing serverless ETL. Glue also has triggers, which let you automate your jobs.
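As a small sketch of that, an existing Glue job can also be started on demand with boto3 and given custom parameters as job arguments; the job name and argument values here are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Start an existing Glue job and pass custom parameters as job arguments
run = glue.start_job_run(
    JobName="my-etl-job",
    Arguments={
        "--source_path": "s3://my-bucket/input/",
        "--target_path": "s3://my-bucket/output/",
    },
)

# Check the run's status later with get_job_run
status = glue.get_job_run(JobName="my-etl-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```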
I'm running my Play! webapp with Docker Cloud (could also use Rancher) and AWS and I'd like to store all the logs in S3 (via volume). Any ideas on how I could achieve that with minimal effort?
Use Docker volumes to store the logs on the host system.
Try the AWS CLI's aws s3 sync to sync your local log directory with an S3 bucket:
aws s3 sync /var/logs/container-logs s3://bucket/
Create a cron job to run it every minute or so.
Reference: s3 aws-cli
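For example, a small wrapper script that cron could invoke every minute; it simply shells out to the same sync command (the log path and bucket are placeholders, and it assumes the AWS CLI and credentials are set up on the host):

```python
import subprocess

# Sync the host directory that holds the container logs to S3
subprocess.run(
    ["aws", "s3", "sync", "/var/logs/container-logs", "s3://bucket/"],
    check=True,
)
```

A crontab entry such as `* * * * * python3 /opt/sync_logs.py` (path is a placeholder) would then run it every minute.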
I can specify which AWS credentials to use to create an EMR cluster via environment variables. However, I would like to run a MapReduce job on another AWS user's S3 bucket, for which they gave me a different set of AWS credentials.
Does MRJob provide a way to do this, or would I have to copy the bucket using my account first so that the bucket and EMR keys are the same?