I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and it failed for unknown reasons. I then tried to terminate the job flow through the AWS Management Console, but the 'Terminate' button was disabled. Next I tried to terminate the job flow using the CLI; it reported that the job flow was terminated, yet it still shows as failed in the job flow list, both in the CLI and in the Elastic MapReduce tab of the Management Console.
Please let me know how to remove the job flow from the list.
When I tried to debug the job flow, it showed two errors:
The debugging functionality is not available for this job flow because you did not specify an Amazon S3 Log Path when you created it.
Job flow failed with reason: Invalid bucket name 'testBucket': buckets names must contain only lowercase letters, numbers, periods (.), and dashes (-).
You are facing two issues here:
Job Flow Failure
First and foremost, the problem that caused the Amazon EMR job flow to fail in the first place can be remedied immediately:
I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and
it failed due to some unknown reasons.
The reason for the job flow failure can actually be inferred from error 2 in the listing you provided:
Job flow failed with reason: Invalid bucket name 'testBucket': buckets
names must contain only lowercase letters, numbers, periods (.), and
dashes (-). [emphasis mine]
Your bucket name 'testBucket' violates the stated lowercase naming requirement; changing the name to lowercase only (e.g. 'testbucket' or 'test-bucket') will allow you to run the job flow as desired.
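As a quick local sanity check, here is a minimal Python sketch of the character rule quoted in the error message (the is_valid_bucket_name helper is hypothetical, and real S3 bucket naming has further restrictions, e.g. on length, that are not covered here):

import re

# Checks only the character rule quoted in the error message:
# lowercase letters, numbers, periods (.) and dashes (-).
def is_valid_bucket_name(name):
    return re.fullmatch(r"[a-z0-9.-]+", name) is not None

print(is_valid_bucket_name("testBucket"))   # False - contains an uppercase 'B'
print(is_valid_bucket_name("test-bucket"))  # True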
Termination State
Furthermore, the job flow's termination state is presumably no problem at all. While it can happen in rare cases that Amazon EC2 instances or other resources actually get stuck in some state, what you are seeing is perfectly reasonable and normal at first sight:
It may take a while to terminate a job flow completely in the first place; see TerminateJobFlows:
The call to TerminateJobFlows is asynchronous. Depending on the
configuration of the job flow, it may take up to 5-20 minutes for
the job flow to completely terminate and release allocated
resources, such as Amazon EC2 instances. [emphasis mine]
Even terminated EC2 resources may still be listed for quite a while; see e.g. the AWS team response to EC2 Instance stuck in "terminated" state:
Terminated means "gone forever"; although sometimes it hangs around in
the UI for several hours. [emphasis mine]
I regularly witness this behavior with EC2 instances, which indeed usually vanish from the instance listing only several hours later. Consequently, I suspect the terminated job flow has meanwhile vanished from your job flow list as well.
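If you want to verify the actual state rather than rely on the console listing, you can also query the API directly. The job flow calls from back then have since been superseded, but as a minimal sketch with today's boto3 EMR client (region and cluster ID are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Describe the cluster (formerly "job flow") and print its lifecycle state;
# a terminated-but-still-listed cluster reports TERMINATED or
# TERMINATED_WITH_ERRORS here even while the console keeps showing it.
response = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")
print(response["Cluster"]["Status"]["State"])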
Update
I had actually suspected this to be the case, but am still unable to find related information in the official documentation; however, terminated job flows apparently remain visible one way or another for up to two months, see e.g. the AWS team response to Console not showing jobs older than a month:
While the console lists all running job flows, it only displays
terminated job flows launched in the last month. Alternatively, you
can use the Ruby CLI to list all job flows launched in the last two
months with the following command: [...] [emphasis mine]
If your application is running on Hadoop YARN, you can always use YARN to manage your application:
yarn application -list
yarn application -kill <application_id>
I launch a Dataproc cluster and serve Hive on it. Remotely, from any machine, I use PyHive or PyODBC to connect to Hive and do things. It's not just one query; it can be a long session with intermittent queries. (The query itself has issues; I will ask about that separately.)
Even during a single, active query, the operation does not show up as a "Job" (I guess it's YARN) on the dashboard. In contrast, when I "submit" tasks via PySpark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, so the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to accompany my Python session, and cancel/delete it at times of my choosing? For my case, it would be a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let YARN detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the Jobs UI page and to have them taken into account by the cluster TTL feature.
If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"
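If you can drive the Hive queries from Python anyway, here is a hedged sketch of submitting them through the Jobs API with the google-cloud-dataproc client, so they show up as Jobs and count against the idle timeout (project, region, cluster name and query are placeholders; check the exact request shape for your client library version):

from google.cloud import dataproc_v1

region = "us-central1"  # assumed region
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A Hive job submitted this way appears on the Jobs UI page and is taken
# into account by the cluster TTL feature, unlike a raw PyHive/PyODBC session.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "hive_job": {"query_list": {"queries": ["SHOW TABLES;"]}},
}
operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
print(operation.result().status.state)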
I used an AWS Glue job with PySpark to read more than 10 TB of Parquet data from S3, but the job kept failing during the execution of a Spark SQL query with the error
java.io.IOException: No space left on the device
On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory and 64 GB of disk each, so we tried to increase the number of workers.
Even after increasing the number of Glue workers (G.1X) to 50, the job keeps failing with the same error.
Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?
I tried configuring the property in the SparkSession builder, but Spark still uses the local tmp directory:
SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
As @Prajappati stated, there are several solutions.
These solutions are described in detail in the AWS blog post that presents the S3 shuffle feature. I am going to omit the shuffle configuration tweaking since it is not very reliable. So, basically, you can either:
Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.
Disaggregate compute and storage: which in this case means using S3 as the storage service for shuffle and spill data.
At the time of writing, to configure this disaggregation, the job must be configured with the following settings:
Glue 2.0 Engine
Glue job parameters:
Parameter                     | Value                                                                  | Explanation
--write-shuffle-files-to-s3   | true                                                                   | Main parameter (required)
--write-shuffle-spills-to-s3  | true                                                                   | Optional
--conf                        | spark.shuffle.glue.s3ShuffleBucket=S3://<your-bucket-name>/<your-path> | Optional. If not set, the path --TempDir/shuffle-data will be used instead
Remember to assign the proper IAM permissions to the job so it can access the bucket and write under the S3 path provided or configured by default.
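As a sketch, the same parameters can also be set when creating the job programmatically with boto3 (job name, role, script location and bucket are placeholders):

import boto3

glue = boto3.client("glue")

# Creates a Glue 2.0 job with the shuffle-to-S3 parameters from the table above.
glue.create_job(
    Name="my-etl-job",            # placeholder
    Role="my-glue-job-role",      # must be allowed to read/write the shuffle bucket
    GlueVersion="2.0",
    WorkerType="G.1X",
    NumberOfWorkers=50,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--write-shuffle-files-to-s3": "true",
        "--write-shuffle-spills-to-s3": "true",
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-bucket/shuffle/",
    },
)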
According to the error message, it appears as if the Glue job is running out of disk space when writing a DynamicFrame.
As you may know, Spark performs a shuffle on certain operations and writes the results to disk. When the shuffle is too large, the job will fail with this kind of error.
There are two options to consider.
Upgrade your worker type to G.2X and/or increase the number of workers.
Implement the AWS Glue Spark shuffle manager with S3 [1]. To implement this option, you will need to downgrade to Glue version 2.0.
The Glue Spark shuffle manager will write the shuffle-files and shuffle-spills data to S3, lowering the probability of your job running out of local disk space and failing.
Please add the following additional job parameters. You can do this via the following steps:
Open the "Jobs" tab in the Glue console.
Select the job you want to apply this to, then click "Actions" then click "Edit Job".
Scroll down and open the drop down named "Security configuration, script libraries, and job parameters (optional)".
Under job parameters, enter the following key value pairs:
Key: --write-shuffle-files-to-s3 Value: true
Key: --write-shuffle-spills-to-s3 Value: true
Key: --conf Value:
spark.shuffle.glue.s3ShuffleBucket=S3://
Remember to replace the triangle brackets <> with the name of the S3 bucket where you would like to store the shuffle data.
5) Click "Save" then run the job.
FWIW, I discovered that the first thing you need to check is that the Spark UI is not enabled on the job: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
The AWS documentation mentions that the logs generated for the Spark UI are flushed to the S3 path every 30 seconds, but it doesn't look like they are rotated on the worker. So sooner or later, depending on the workload and worker type, the workers run out of disk space and the run fails with "Command failed with exit code 10".
The documentation states that spark.local.dir is used to specify a local directory only.
This error can be addressed by modifying the logging properties or, depending on the cluster manager used, the cluster manager properties (such as those for YARN in this answer).
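Outside Glue, where you do control the cluster, a minimal sketch of a valid use of spark.local.dir looks like this (the mount point is an assumption; it has to be a local filesystem path, not an s3a:// URI):

from pyspark.sql import SparkSession

# spark.local.dir only accepts local directories, e.g. a larger attached
# volume; on YARN it may additionally be overridden by the node manager's
# configured local dirs.
spark = (
    SparkSession.builder
    .appName("app")
    .config("spark.local.dir", "/mnt/big-disk/spark-tmp")  # assumed mount point
    .getOrCreate()
)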
We are having one issue with EMR and Spot instances.
We have clusters in different environments (different AWS accounts) in the same region:
One master node with market type: On-Demand
Two core nodes with market type: Spot
When the Spot instances are terminated (price over my maximum bid, out of capacity or whatever), the cluster terminates and I get only this message:
All slaves in the job flow were terminated due to Spot
After some research, I found that people have already had this issue, but it was due to a master node with market type Spot, which is not my case:
AWS EMR Presto Cluster Terminated abruptly Error: All slaves in the job flow were terminated due to Spot (though this one is curious because the question presents an "on-demand" master node but then explains the problem by the termination of the master node)
AWS EMR Error : All slaves in the job flow were terminated
I tried to find an answer in the AWS documentation, but it all says the opposite of what we suspect: that the termination of the two core nodes terminates the cluster.
Regards,
This happened because you chose the core nodes to be of Spot type. If you read the best practices for instance types in AWS EMR, you will find that they suggest using at least one On-Demand instance for the core nodes. Remember that this will come at an extra cost.
You can use the instance fleet option for the core nodes and add both Spot and On-Demand instance types to this instance fleet, as sketched below the links.
So the general rule of thumb is:
Keep master and core instances as On-Demand and task instances as Spot.
I am adding a few links where you can read more about this and configure your cluster accordingly.
Link1: Cluster configuration and Best Practices
Link2: Types of nodes in EMR
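To illustrate the instance fleet suggestion above, here is a hedged boto3 sketch that keeps the master and one core capacity unit On-Demand while allowing extra Spot capacity (names, release label, instance types and counts are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# The core fleet mixes On-Demand capacity (so the cluster survives Spot
# reclamation) with additional Spot capacity; the master stays On-Demand.
emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.3.0",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceFleets": [
            {
                "Name": "master",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
            {
                "Name": "core",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 1,  # at least one On-Demand core unit
                "TargetSpotCapacity": 2,
                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
            },
        ],
    },
)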
I am having trouble with AWS EMR PrestoDB.
I launched a cluster with the master node as the Presto coordinator and core nodes as workers. The core nodes were Spot instances, but the master node was On-Demand. Five weeks after the cluster launch, I got this error message:
Terminated with errors: All slaves in the job flow were terminated due to Spot
Is it the case that if all slaves are terminated, the cluster itself terminates?
I checked the Spot pricing history, and it never reached the max price I set.
What I have already done:
I checked the logs that are dumped to S3. I didn't find any information about the cause of the termination. They just said
Failed to visit ... <many directories>
I am answering my own question. As per the Presto community, there must be at least one master node up and running in the AWS EMR Presto cluster. But since it got terminated, the whole cluster got terminated.
To avoid data loss because of Spot pricing/interruption, the data needs to be backed up by either a snapshot, frequent copies to S3, or leaving the EBS volume behind.
Ref: https://aws.amazon.com/premiumsupport/knowledge-center/spot-instance-terminate/
Your cluster should still be up, but without task nodes. Under Cluster -> Details -> Hardware you can add the task nodes.
(Screenshot: adding task nodes)
Similar scenario : AWS EMR Error : All slaves in the job flow were terminated
If you use Spot, you might want to use the instance termination notice and also set up a max price:
https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/
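For the termination notice, here is a minimal sketch of polling the instance metadata endpoint from a node, assuming IMDSv1 is reachable (IMDSv2 additionally requires a session token); the endpoint returns 404 until an interruption is scheduled:

import time
import urllib.error
import urllib.request

# Polls the EC2 instance metadata service for a Spot interruption notice.
URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

while True:
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            print("Spot interruption scheduled:", resp.read().decode())
            break
    except urllib.error.URLError:
        pass  # 404 (or timeout): no interruption notice yet
    time.sleep(5)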
In Snakemake, as far as I know, we can only adapt job resources dynamically based on the number of attempts a job has made. When trying to re-adjust resources after a failure, it would be useful to distinguish between the following types of cluster failures:
Program error
Transient node failure
Out of memory
Timeout
The last three cases, in particular, are exposed to the SLURM user via different job completion status codes. The Snakemake interface to the status script merges all types of failure into a single "failed" status.
Is there any way to do this? Or is it a planned feature? Keeping a list of previous failure reasons, instead of just the attempt count, would be most useful.
e.g. goal:
rule foo:
    resources:
        mem_gb=lambda wildcards, attempts: 100 + (20 * attempts.OOM),
        time_s=lambda wildcards, attempts: 3600 + (3600 * attempts.TIMEOUT)
    ...
The cluster I have access to has heterogeneous machines where each node is configured with various walltime and memory limits, and it would minimize scheduling times if I didn't have to conservatively bump all resources at once.
Possible workaround: I thought of keeping track of that extra info between the job status script and the cluster submission script (e.g. keeping a history of status codes for each job ID). Is the attempt number available to the cluster submission and cluster status commands?
I think it would be best to handle such cluster-specific functionality in a profile, e.g. in the SLURM profile. When an error is detected, the status script could simply resubmit silently with updated resources based on what SLURM reports. I don't think the Snakefile has to be cluttered with such platform details.
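As a rough sketch of that idea, a cluster status script in a SLURM profile could look up the detailed state with sacct, record it per job ID for later use by the submission script, and still report only the three statuses Snakemake expects (the history file name is an arbitrary choice):

#!/usr/bin/env python3
# Hypothetical cluster-status script: prints running/success/failed for
# Snakemake while appending SLURM's detailed state (TIMEOUT, OUT_OF_MEMORY,
# NODE_FAIL, ...) to a per-job history file a submission script could read.
import subprocess
import sys

jobid = sys.argv[1]
state = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True,
).stdout.split("\n")[0].strip()

with open(f"slurm_state_history.{jobid}.log", "a") as fh:
    fh.write(state + "\n")

if state.startswith("COMPLETED"):
    print("success")
elif state in ("", "PENDING", "RUNNING", "CONFIGURING", "COMPLETING", "SUSPENDED"):
    print("running")
else:
    print("failed")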