S3DistCp "Input path does not exist" error referencing an EC2 instance that is not in the EMR cluster - amazon-emr

I'm running a process on an EMR cluster with a step that executes an s3-dist-cp command to copy files from my S3 bucket to EMR.
I'm getting an "Input path does not exist" error, but the IP of the machine shown in the log is not part of the cluster.
Has anyone faced this problem before?
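For reference, this is a rough sketch of how an s3-dist-cp step is commonly submitted with boto3; the cluster ID, bucket, and paths are placeholders, not taken from your setup. An "Input path does not exist" message from S3DistCp typically means the --src prefix could not be found, so it's worth double-checking that value in the failing step.

    import boto3

    emr = boto3.client("emr")

    # Placeholders: cluster ID, source bucket/prefix, and HDFS destination
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "copy-input-from-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's generic command runner
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-bucket/input/",   # must point at an existing prefix
                    "--dest", "hdfs:///input/",
                ],
            },
        }],
    )
    print(response["StepIds"])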

Related

How do we restore Kubeflow from backups if the installation is destroyed, or how can we bring Kubeflow back as it was if the EKS cluster is destroyed?

How can I back up my Kubeflow pipelines and restore them if the installation fails or the EKS cluster is destroyed? I have looked into getting an image of the vanilla database I am using and figuring out how to take a backup and restore it, but I haven't had any luck so far.
I have Kubeflow running on an AWS EKS cluster, with 15-16 Kubeflow pipelines running, and I used the vanilla deployment for the database.
So now I need your help to know how to back up the pipelines and restore them if anything happens to Kubeflow or EKS.
If you inspect the Kubeflow manifest file, you'll see the list of dependencies it has. The largest one is the database. For those on AWS, you can use RDS as a target for running the database rather than a self-hosted one in Kubernetes.
You can see the instructions for that here.
I used to be the Product Manager for Open Source MLOps at AWS, and my team wrote that post.

Run aws cli commands on EMR using Lambda

I have a Lambda function that connects to EMR using boto3. I want to run the "aws s3 cp" command on EMR from my Lambda function to copy files from S3 to EMR's local directory.
Is there a way to run aws cli commands on EMR using Lambda?
No.
AWS Lambda runs outside the cluster. It does not have access to run commands on the EMR cluster instances.
You could, theoretically, install the Systems Manager Agent on EMR. (I haven't tried it, but it should work.) Your AWS Lambda function could then call the Systems Manager send_command() function to execute some code on the instance.
See: AWS Systems Manager Run Command - AWS Systems Manager
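For illustration, a minimal sketch of what that could look like from the Lambda side with boto3; the instance ID and the copy command are placeholders, and it assumes the SSM Agent is running on the node and the instance profile allows SSM:

    import boto3

    ssm = boto3.client("ssm")

    def lambda_handler(event, context):
        # Hypothetical instance ID of an EMR node with the SSM Agent installed
        instance_id = "i-0123456789abcdef0"  # placeholder

        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",  # built-in SSM document for shell commands
            Parameters={
                # Example command; adjust the S3 path and local directory to your setup
                "commands": ["aws s3 cp s3://my-bucket/my-prefix/ /home/hadoop/data/ --recursive"]
            },
        )
        # The command runs asynchronously; poll get_command_invocation() for status/output
        return response["Command"]["CommandId"]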

Processing AWS ELB access logs (from S3 bucket to InfluxDB)

We would like to process AWS ELB access logs and write them into InfluxDB
to be used for application metrics and monitoring (ex. Grafana).
We configured ELB to store access logs into S3 bucket.
What would be the best way to process those logs and write them to InfluxDB?
What we tried so far was to mount the S3 bucket to the filesystem using s3fs and then use the Telegraf agent for processing. But this approach has some issues: mounting with s3fs feels like a hack, and all the files in the bucket are compressed and need to be unzipped before Telegraf can process them, which makes the task overcomplicated.
Is there any better way?
Thanks,
Oleksandr
Can you just install the telegraf agent on the AWS instance that is generating the logs, and have them sent directly to InfluxDB in real-time?
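If you do keep pulling the logs from S3, a rough alternative to the s3fs mount is a small poller that decompresses the objects in memory and writes points directly. This is only a sketch, assuming the influxdb Python client; the bucket name, prefix, connection details, and parsed field positions are placeholders to adapt to your actual log format:

    import gzip
    import boto3
    from influxdb import InfluxDBClient  # assumes the influxdb Python client package

    s3 = boto3.client("s3")
    # Placeholder connection details
    influx = InfluxDBClient(host="localhost", port=8086, database="elb_logs")

    def parse_line(line):
        """Hypothetical parser: field positions differ between classic ELB and ALB
        access logs, so adapt the indices to the format you actually have."""
        fields = line.split(" ")
        return {
            "measurement": "elb_access",
            "time": fields[0],                    # assumes the timestamp is the first field
            "fields": {"raw_status": fields[7]},  # placeholder: pick the columns you care about
        }

    def process_bucket(bucket="my-elb-logs-bucket", prefix="AWSLogs/"):  # placeholder names
        points = []
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            # Decompress the gzipped log object in memory instead of mounting with s3fs
            for line in gzip.decompress(body).decode("utf-8").splitlines():
                points.append(parse_line(line))
        if points:
            influx.write_points(points)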

Changing AWS EMR log location?

Is it possible to change the bucket & path of the Log URI of a running EMR cluster? Or is the only way to change it to terminate & restart the cluster? We had a typo in the bucket name of a cluster we have already spun up and done some work on, so we'd like to change the log location without terminating the cluster.

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the user's local machine when copying the files. The method copyFromLocalFile(..) needs a local path, but in this case "local" will be the EC2 cluster...
I'm not sure if I stated the problem correctly; does anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the userMachine to at least 1 node before you will be able to use them through a MapReduce.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)
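As a sketch of that first hop, the upload from the user's machine could go directly to S3 with boto3 rather than through the cluster; the bucket and prefix below are placeholders. Once the picture is in S3, the job flow (or s3distcp, as above) can pick it up:

    import boto3

    def upload_picture(local_path, bucket="my-upload-bucket", key_prefix="uploads/"):
        """Upload a file from the user's machine to S3 so the EMR job can read it.
        Bucket and prefix are placeholders."""
        s3 = boto3.client("s3")
        key = key_prefix + local_path.rsplit("/", 1)[-1]
        s3.upload_file(local_path, bucket, key)  # runs on the user's machine, not on the cluster
        return f"s3://{bucket}/{key}"

    # Example: pass the returned s3:// URI to the MapReduce job or to s3distcp
    # print(upload_picture("/home/me/pictures/cat.jpg"))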