EMR : All slaves in the job flow were terminated due to Spot - amazon-emr

We are having an issue with EMR and Spot instances.
We have clusters in different environments (different AWS accounts) in the same region, each with:
One master node with market type: On-Demand
Two core nodes with market type: Spot
When the Spot instances are terminated (price above my maximum bid, out of capacity, or whatever), the cluster terminates and I only get this message:
All slaves in the job flow were terminated due to Spot
After some research, I found that other people have had this issue, but for them it was due to a master node with market type Spot, which is not my case:
AWS EMR Presto Cluster Terminated abruptly Error: All slaves in the job flow were terminated due to Spot (though this one is curious, because the question describes an "on-demand" master node but then explains the problem by the termination of the master node)
AWS EMR Error : All slaves in the job flow were terminated
I tried to find an answer in the AWS documentation, but it all says the opposite of what we suspect, namely that the termination of the two core nodes terminates the cluster.
Regards,

This happened because you chose Spot as the market type for your core nodes. Core nodes hold the cluster's HDFS data, so a cluster that loses all of them cannot keep running. If you read the best practices for instance types in AWS EMR, you will find that they suggest using at least one On-Demand instance for the core nodes. Remember that this comes at an extra cost.
You can use the instance fleet option for the core nodes and add both Spot and On-Demand instance types to that instance fleet.
So the general rule of thumb is:
Keep master and core instances as On-Demand and task instances as
Spot.
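The rule of thumb above can be sketched as a configuration. This is a minimal sketch, assuming uniform instance groups (not instance fleets) and the `InstanceGroups` shape that boto3's `run_job_flow` accepts under `Instances`. The function name, the instance type, and the node counts are illustrative assumptions, not values from the question.

```python
def build_instance_groups(core_count=2, task_count=2, instance_type="m5.xlarge"):
    """Return an InstanceGroups list: On-Demand master and core, Spot task.

    The instance type and counts are placeholders (assumptions); adjust
    them to your workload.
    """
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data, so they stay On-Demand.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so losing them to Spot is tolerable.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


groups = build_instance_groups()
for group in groups:
    print(group["InstanceRole"], group["Market"])
```

The resulting list would be passed as `Instances={"InstanceGroups": groups, ...}` when creating the cluster.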
I am adding a few links where you can read more about this and configure your cluster accordingly.
Link1: Cluster configuration and Best Practices
Link2: Types of nodes in EMR

Related

AWS EMR Presto Cluster Terminated abruptly Error: All slaves in the job flow were terminated due to Spot

I am having trouble with AWS EMR PrestoDB.
I launched a cluster with the master node as coordinator and the core nodes as workers. The core nodes were Spot instances, but the master node was On-Demand. Five weeks after launching the cluster, I got this error message:
Terminated with errors: All slaves in the job flow were terminated due to Spot
Does the cluster itself terminate if all slaves are terminated?
I checked the Spot pricing history, and the price never came close to the max price I set.
What have I already done?
I checked the logs that are dumped to S3. I didn't find any information about the cause of the termination. They just said:
Failed to visit ... <many directories>
I am answering my own question. According to the Presto community, there must be at least one master node up and running in the AWS EMR Presto cluster. Since it got terminated, the whole cluster got terminated.
To avoid data loss caused by Spot pricing/interruption, the data needs to be backed up, either by snapshot, by frequent copies to S3, or by leaving the EBS volume behind.
Ref: https://aws.amazon.com/premiumsupport/knowledge-center/spot-instance-terminate/
Your cluster should still be up, but without task nodes. Under Cluster -> Details -> Hardware you can add the task nodes.
Adding task nodes
Similar scenario : AWS EMR Error : All slaves in the job flow were terminated
When using Spot, you might want to use the instance termination notice and also set the max price:
https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/
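As a rough illustration of the termination notice: EC2 publishes a pending Spot interruption through the instance metadata path `spot/instance-action` as a small JSON document. The sketch below only parses such a document. Fetching it is left out, since that has to run on the instance itself (the path returns 404 while no interruption is scheduled, and instances that enforce IMDSv2 also need a session token). The function name and the sample values are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# Instance metadata path where EC2 publishes a pending Spot interruption.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_interruption(notice_body, now):
    """Parse the notice JSON; return (action, seconds left before it happens)."""
    notice = json.loads(notice_body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()


# Sample notice body (illustrative values, not taken from a real instance).
sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, now))
```

A process on the node could poll the URL every few seconds and start draining work or copying data to S3 once a notice appears.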

A solution to increase my EMR master node capacity without shutting down the cluster

My EMR master node has become full and I need to attach an EBS volume to it. Is there any way to do this without terminating the cluster?
You can add additional EBS volumes and also resize existing ones.
How to do this is explained here:
https://superuser.com/questions/1409373/how-to-add-an-ebs-volume-by-snapshot-id-to-amazon-emr
https://github.com/qyjohn/AWS_Tutorials/wiki/Grow-EBS-volumes-on-EMR-clusters
I don't think so. This is because you set up Amazon Elastic Block Store (Amazon EBS) volumes and configure mount points when the cluster is launched, so it’s difficult to modify the storage capacity after the cluster is running.
The feasible solutions usually involve adding more nodes to your
cluster, backing up your data to a data lake, and then launching a new
cluster with a higher storage capacity. Or, if the data that occupies
the storage is expendable, removing the excess data is usually the way
to go.
For more details, have a look at: https://aws.amazon.com/blogs/big-data/dynamically-scale-up-storage-on-amazon-emr-clusters/

Setting up a Hadoop Cluster on Amazon Web services with EBS

I was wondering how I could set up a Hadoop cluster (say 5 nodes) on AWS. I know how to create the cluster on EC2, but I don't know how to handle the following challenges:
What happens if I lose my Spot instance? How do I keep the cluster going?
I am working with datasets of about 1 TB. Would it be possible to set up EBS accordingly? How can I access HDFS in this scenario?
Any help will be great!
These suggestions depend on your requirements. However, assuming a setup with 2 masters and 3 workers, you can probably use r3 instances for the master nodes, as they are optimized for memory-intensive applications, and d2 instances for the worker nodes. d2 instances have multiple local disks and can thus withstand some disk failures while still keeping your data safe.
To answer your specific questions:
Treat Hadoop machines like any other Linux machines. What would happen if your general CentOS Spot instances were lost? Hence, it is generally advised to use reserved instances.
Hadoop typically stores data by keeping 3 copies and distributing them across the worker nodes in blocks of 128 or 256 MB. So you will have 3 TB of data to store across the three worker nodes. You also have to allow for some overhead when calculating space requirements.
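The sizing arithmetic above can be written down as a small calculation. This is a sketch only: the 25% overhead figure is an assumption chosen for illustration (the answer just says to allow "some overhead"), and the function names are hypothetical.

```python
import math


def hdfs_raw_storage_tb(dataset_tb, replication=3, overhead=0.25):
    """Raw capacity needed: dataset size times replication, plus headroom.

    The overhead fraction is an assumed planning margin, not a Hadoop constant.
    """
    return dataset_tb * replication * (1 + overhead)


def block_count(dataset_bytes, block_bytes=128 * 1024**2):
    """Number of HDFS blocks needed for one copy of the dataset."""
    return math.ceil(dataset_bytes / block_bytes)


total_tb = hdfs_raw_storage_tb(1)   # 1 TB dataset, 3 copies, 25% headroom
per_worker_tb = total_tb / 3        # spread over the three worker nodes
print(total_tb, per_worker_tb, block_count(1024**4))
```

With these assumptions, each of the three workers needs a bit more than 1 TB of usable disk for a 1 TB dataset.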
You can use AWS's EMR service - it is designed especially for Hadoop clusters on top of EC2 instances.
It is fully managed, and it comes pre-packaged with all the services you need in Hadoop.
Regarding your questions:
There are three main types of nodes in EMR:
Master - a single node; there is no need to run it on Spot.
Core - a node that handles tasks and holds part of the HDFS
Task - a node that handles tasks but does not hold any part of the HDFS
If task nodes are lost (for example because they are Spot instances), the cluster will continue to work with no problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters < four nodes
2 for clusters < ten nodes
3 for all other clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
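The default thresholds listed above translate directly into a lookup. This sketch simply encodes the numbers as stated in this answer (check the linked guide for the exact definition of node count). The function name is hypothetical.

```python
def emr_default_replication(node_count):
    """Default HDFS replication factor, using the thresholds quoted above."""
    if node_count < 4:
        return 1
    if node_count < 10:
        return 2
    return 3


for nodes in (3, 5, 12):
    print(nodes, emr_default_replication(nodes))
```

Note that with a small cluster the default is 1, which means a single lost node can take HDFS blocks with it. Overriding the default is done through the `dfs.replication` property in the `hdfs-site` configuration.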

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the
application will be greater
Thank you in advance for your time :)
By default, when data is written to Couchbase, the client returns success as soon as the data is written to one node's memory. After that, Couchbase saves it to disk and replicates it.
If you want to ensure that data is persisted to disk, most client libraries have functions that let you do that. With the help of those functions you can also ensure that the data has been replicated to another node. This function is called observe.
When a node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. For example, if you have a 3-node cluster, the stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails, you will still not lose all data: it will be available on the last node.
If the node that was master goes down and is failed over, another live node becomes master. In your client you point to all servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and manually check server availability with HA proxies or something similar. That way you get a single IP to connect to (the proxy's IP), which automatically fetches data from a live server.
Fortunately, Couchbase is a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change in the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node it should be written to. So the client selects the node using a CRC32 hash and writes to that node. Then the cluster asynchronously copies 1 or more replicas to the other nodes (depending on your configuration).
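The idea of client-side routing by CRC32 can be illustrated with a deliberately simplified sketch. It is not Couchbase's actual algorithm: the real client looks partitions up in a map supplied by the cluster, whereas here the partition-to-node step is a plain modulo. The function name, the node names, and the partition count of 1024 are all assumptions for illustration.

```python
import zlib


def pick_node(key, nodes, num_partitions=1024):
    """Hash the key with CRC32 to a partition, then map the partition to a node.

    Simplified: a real cluster hands the client a partition-to-node map
    instead of using a modulo over the node list.
    """
    partition = zlib.crc32(key.encode("utf-8")) % num_partitions
    return nodes[partition % len(nodes)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
# The same key always lands on the same node, with no server round trip.
print(pick_node("user::42", nodes) == pick_node("user::42", nodes))
```

Because client and server compute the same hash, every client agrees on where a document's active copy lives without asking a central coordinator.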
Couchbase has only 1 active copy of a document at a time, so it is easy to be consistent: applications get and set against this active document.
In case of failure, the server has some work to do. Once the failure is discovered (automatically or by a monitoring system), a "failover" occurs. This means that the replicas are promoted to active and it is now possible to work as before. Usually you then rebalance the cluster to spread the data properly across the nodes.
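Replica promotion during failover can be modelled in a few lines. This is a toy model under stated assumptions, not Couchbase's implementation: partitions are plain integers, `active` maps each partition to the node holding its active copy, and `replicas` maps each partition to the nodes holding replica copies. All names are hypothetical.

```python
def fail_over(active, replicas, failed_node):
    """Promote a surviving replica for every partition whose active node failed."""
    new_active = {}
    for partition, node in active.items():
        if node != failed_node:
            # The active copy is on a healthy node; nothing changes.
            new_active[partition] = node
            continue
        survivors = [n for n in replicas.get(partition, []) if n != failed_node]
        if not survivors:
            raise RuntimeError(f"partition {partition} has no surviving replica")
        new_active[partition] = survivors[0]
    return new_active


active = {0: "a", 1: "b", 2: "c"}
replicas = {0: ["b"], 1: ["c"], 2: ["a"]}
print(fail_over(active, replicas, "a"))
```

After the failover, node "b" serves two partitions and "c" serves one, which is the kind of imbalance a rebalance evens out.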
The sentence you quote simply says that the fewer nodes you have, the bigger the impact of a failure/rebalance will be, since you will have to route the same number of requests to a smaller number of nodes. Fortunately, you do not lose data ;)
You can find very detailed information about this way of working on the Couchbase CTO's blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I am working as developer evangelist at Couchbase

How to terminate/remove a job flow in Amazon EMR?

I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and it failed due to some unknown reasons. Then I tried to terminate the job flow through the AWS Management Console, but the 'Terminate' button was disabled. Then I tried to terminate the job flow using the CLI, and it showed that the job flow was terminated, but it still shows as failed in the job flow list, both through the CLI and in the Elastic MapReduce tab of the management console.
Please let me know how to remove the job flow from the list.
When I tried to debug the job flow it shows two errors:
The debugging functionality is not available for this job flow because you did not specify an Amazon S3 Log Path when you created it.
Job flow failed with reason: Invalid bucket name 'testBucket': buckets names must contain only lowercase letters, numbers, periods (.), and dashes (-).
You are facing two issues here:
Job Flow Failure
First and foremost, the problem triggering the termination state of the Amazon EMR job flow that's irritating you can be remedied immediately:
I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and
it failed due to some unknown reasons.
The reason for the job flow failure can actually be inferred from error 2 in the listing you provided:
Job flow failed with reason: Invalid bucket name 'testBucket': buckets
names must contain only lowercase letters, numbers, periods (.), and
dashes (-). [emphasis mine]
Your bucket name 'testBucket' clearly violates the stated lowercase naming requirement, thus changing the name to lowercase only (e.g. 'testbucket' or 'test-bucket') will allow you to run the job flow as desired.
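A quick pre-flight check for the character rule quoted in the error message could look like the sketch below. It only covers that one rule (lowercase letters, numbers, periods, and dashes). S3 has further naming constraints, such as length limits, that are not checked here. The function name is hypothetical.

```python
import re

# Only the characters named in the error message are allowed.
_ALLOWED_CHARS = re.compile(r"[a-z0-9.-]+")


def has_valid_bucket_chars(name):
    """Return True if the name uses only lowercase letters, digits, '.' and '-'."""
    return _ALLOWED_CHARS.fullmatch(name) is not None


for candidate in ("testBucket", "testbucket", "test-bucket"):
    print(candidate, has_valid_bucket_chars(candidate))
```

Running such a check before submitting the job flow surfaces the problem without waiting for the cluster to start and fail.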
Termination State
Furthermore, the job flow termination state is presumably no problem at all. While Amazon EC2 instances or other resources can in rare cases actually get stuck in some state, what you are seeing looks perfectly reasonable and normal at first sight:
It may take a while to completely terminate a job flow in the first place, see TerminateJobFlows:
The call to TerminateJobFlows is asynchronous. Depending on the
configuration of the job flow, it may take up to 5-20 minutes for
the job flow to completely terminate and release allocated
resources, such as Amazon EC2 instances. [emphasis mine]
Even terminated EC2 resources may be listed for quite a while still, see e.g. the AWS team response to EC2 Instance stuck in "terminated" state:
Terminated means "gone forever"; although sometimes it hangs around in
the UI for several hours. [emphasis mine]
I regularly witness this behavior for EC2 instances, which usually vanish from the instance listing only several hours later. Consequently, I suspect that the terminated job flow has meanwhile vanished from your job flow list.
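Because the terminate call is asynchronous, a script that needs to know when the job flow is really gone has to poll. The sketch below shows one way to do that under stated assumptions: `get_state` is a hypothetical callable that returns the current state string, and the timeout and interval defaults are arbitrary. With boto3 it could wrap `describe_cluster(ClusterId=...)["Cluster"]["Status"]["State"]`.

```python
import time

# States after which the job flow will not change any more.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_terminated(get_state, timeout_s=1800, interval_s=30, sleep=time.sleep):
    """Poll get_state() until a terminal state is reported or the timeout expires."""
    waited = 0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"still {state} after {waited} seconds")
        sleep(interval_s)
        waited += interval_s


# Simulated state sequence so the sketch runs without AWS access.
states = iter(["TERMINATING", "TERMINATING", "TERMINATED"])
print(wait_until_terminated(lambda: next(states), sleep=lambda s: None))
```

The default timeout of 30 minutes is chosen to sit above the 5-20 minute range quoted above.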
Update
I had suspected this to be the case, but I am still unable to find related information in the official documentation. However, terminated job flows apparently remain visible one way or another for up to two months, see e.g. the AWS team response to Console not showing jobs older than a month:
While the console lists all running job flows, it only displays
terminated job flows launched in the last month. Alternatively, you
can use the Ruby CLI to list all job flows launched in the last two
months with the following command: [...] [emphasis mine]
If your application is running on Hadoop YARN, you can always use YARN to manage it. List the applications, then kill one by its application ID (not by its name):
yarn application -list
yarn application -kill <application_id>
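`yarn application -kill` expects an application ID of the form `application_<timestamp>_<sequence>`, which appears in the output of `yarn application -list`. The sketch below pulls those IDs out of a listing and builds the kill command as an argument list. The sample listing is made up for illustration and its column layout is an assumption. Only the ID pattern is relied upon.

```python
import re

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID = re.compile(r"\bapplication_\d+_\d+\b")


def application_ids(listing):
    """Extract all application IDs from the text of `yarn application -list`."""
    return APP_ID.findall(listing)


def kill_command(app_id):
    """Build the argument list for killing one application."""
    return ["yarn", "application", "-kill", app_id]


# Made-up listing; real output has more columns.
sample_listing = (
    "application_1500000000000_0001\tnightly-etl\tSPARK\tRUNNING\n"
    "application_1500000000000_0002\tadhoc-query\tMAPREDUCE\tACCEPTED\n"
)
for app_id in application_ids(sample_listing):
    print(" ".join(kill_command(app_id)))
```

On a cluster node, the argument list could be handed to `subprocess.run` after capturing the real listing the same way.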