yarn not getting nodes - hadoop-yarn

This is on an AWS EMR cluster with 2 task nodes and a master.
I'm trying the hello-samza example, which launches a YARN job. The job gets stuck in the ACCEPTED state. I looked at other posts and it seems that YARN is getting no nodes. Any help on why YARN is not getting the task nodes would be appreciated.
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list
17/04/18 23:30:45 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn application -list -appStates ALL
17/04/18 23:26:30 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1492557889328_0001 wikipedia-parser_1 Samza hadoop default ACCEPTED UNDEFINED 0% N/A

I wrote a complete answer for a similar case I experienced: have a look at it, it might be the same kind of configuration issue.

It seems like the NodeManagers are not running on either node (either never started or exited with an error). Use the jps command to check whether all the daemons associated with YARN are running on the two nodes. Additionally, check both NodeManager logs to see if any exceptions might have killed them.
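A minimal sketch of those checks, assuming an EMR-style layout (the log path below is typical for EMR but may differ on your distribution; run the first two commands on each task node and the last one from the master):

# Confirm a NodeManager JVM is present on this node
jps | grep -i nodemanager
# If it is missing, look for the exception that killed it
less /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log
# From the master, re-check the registered nodes once the NodeManagers are up
yarn node -list -all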

Related

Yarn Job related commands

For any job submitted to YARN, how can I find out (from the YARN console or the YARN Cluster UI):
Who submitted the job?
To which YARN queue was the job submitted?
How much time did it take to finish?
I tried the command below, but it gives a lot of details rather than these specific ones:
yarn application -list
Take a look at the YARN admin page; it has the details of all the jobs you have submitted to the cluster.
Just access <Local_ip>:8088, e.g. localhost:8088.
Also, there is a section for logs under the /logs/userlogs directory. This directory contains the logs for all applications run by a user.
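If you prefer the command line, a sketch of how to get those three details for a single job (the application ID below is illustrative):

# List applications, then ask for one application's report; it includes the
# User, Queue, Start-Time and Finish-Time fields, which answer all three questions
yarn application -list -appStates ALL
yarn application -status application_1492557889328_0001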

s3distcp fails with "mapreduce_shuffle does not exist"

When I run the command below,
s3-dist-cp --src s3://test/9.19 --dest hdfs:///user/hadoop/test
I get an error about the auxService.
20/02/03 07:52:13 INFO mapreduce.Job: Task Id : attempt_1580716305878_0001_m_000000_2, Status : FAILED
Container launch failed for container_1580716305878_0001_01_000004 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
In many Q&As, I found a solution like this link.
But there is no NodeManager process:
[hadoop@ip-172-31-37-115 ~]$ initctl list | grep yarn
hadoop-yarn-timelineserver start/running, process 8149
hadoop-yarn-resourcemanager start/running, process 17331
hadoop-yarn-proxyserver start/running, process 8147
My EMR cluster was created from the quick-create menu with emr-5.28.0.
Does anyone know about this problem?
Thanks!
I'm sure there's some way to update the configs, but what I did was create a cluster using the 'advanced' setup and choose these software packages:
Ganglia
Hive
Hue
Mahout
Pig
Tez
Spark
Hadoop
(8 in total)
Most of those, except Spark, are installed with the default settings (the first radio button for software packages in the quick setup). One of these software packages, or something related to it, is what causes s3-dist-cp to be installed, and I was able to use it with no problems with that setup.
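Outside of the package selection, the usual cause of "The auxService:mapreduce_shuffle does not exist" is a NodeManager that is either not running or not configured with the shuffle auxiliary service. A quick check on a core/task node, assuming the standard EMR config path:

# The shuffle service must be declared in yarn-site.xml on every NodeManager host
grep -A 1 'yarn.nodemanager.aux-services' /etc/hadoop/conf/yarn-site.xml
# On emr-5.x (upstart) the NodeManager service runs on core/task nodes, not the master
sudo initctl status hadoop-yarn-nodemanager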

How do I kill a YARN container to test failure scenarios

I'm building an application on AWS EMR using YARN (and Dask), Hadoop version 2.7.3-amzn-1. I'm trying to test various failure scenarios and I want to simulate a container failure. I can't seem to find an easy way to kill a YARN container, only the whole application. Is there a command-line utility for this?
[root@node1 lillcol]# yarn container -help
20/04/24 15:04:14 INFO client.AHSProxy: Connecting to Application History server at node1/127.0.0.1:10200
usage: container
 -help                                     Displays help for all commands.
 -list <Application Attempt ID>            List containers for application
                                           attempt.
 -signal <container ID [signal command]>   Signal the container. The
                                           available signal commands are
                                           [OUTPUT_THREAD_DUMP,
                                           GRACEFUL_SHUTDOWN,
                                           FORCEFUL_SHUTDOWN] Default
                                           command is OUTPUT_THREAD_DUMP.
 -status <Container ID>                    Prints the status of the
                                           container.
This can be achieved with the command yarn container -signal [container-ID] GRACEFUL_SHUTDOWN.
I've tried it and it works; I hope that will be helpful.
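A sketch of the full flow with this CLI, which requires Hadoop 2.8+ (the IDs below are illustrative):

# Find the running attempt, list its containers, then signal one of them
yarn applicationattempt -list application_1580716305878_0001
yarn container -list appattempt_1580716305878_0001_000001
yarn container -signal container_1580716305878_0001_01_000004 GRACEFUL_SHUTDOWN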
YARN has no CLI or REST API that kills a container.
The simplest way to create a container failure is to log in to a NodeManager host and kill the process (which would be a container) spawned by the NodeManager.
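A sketch of that approach, assuming you already know the container ID from the application logs (the ID is illustrative):

# On the NodeManager host running the container, find its JVM and kill it
ps aux | grep container_1580716305878_0001_01_000004
kill -9 <pid from the output above>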
Seems like it's exposed in the API starting from version 2.8.0:
https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/yarn/client/api/YarnClient.html#signalToContainer(org.apache.hadoop.yarn.api.records.ContainerId,%20org.apache.hadoop.yarn.api.records.SignalContainerCommand)

YARN error: TaskAttempt killed because it ran on unusable node ... Container released on a *lost* node

I am using CDH 5.4 with Pig 0.12. I am getting a lot of these errors, from all nodes:
TaskAttempt killed because it ran on unusable nodename:portnumber Container released on a *lost* node
What does this mean? In particular, what does "lost" mean here? It doesn't look like the node is really lost from the cluster. Another (more important) question is how to resolve this issue. Any help would be appreciated.
This particular case turned out to be a data storage problem. I restarted the NodeManager on the nodes that had been lost with the message "1/1 local-dirs are bad: /data/hadoop/yarn/local".
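A way to confirm this condition from the ResourceManager side, using the placeholders from the error message above:

# Lost or unhealthy nodes show up here with their state
yarn node -list -all
# The node report includes a Health-Report line such as "1/1 local-dirs are bad: ..."
yarn node -status nodename:portnumber
# Check the disk backing the bad local-dir before restarting the NodeManager
df -h /data/hadoop/yarn/local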

Celery node failure: pidbox already in use on restart

I have Celery running with a RabbitMQ broker.
Today one of my Celery nodes failed: it doesn't execute tasks and doesn't respond to the service celeryd stop command. After a few attempts the node stopped, but on start I get this message:
[WARNING/MainProcess] celery@nodename ready.
[WARNING/MainProcess] /home/ubuntu/virtualenv/project_1/local/lib/python2.7/site-packages/kombu/pidbox.py:73: UserWarning: A node named u'nodename' is already using this process mailbox!
Maybe you forgot to shutdown the other node or did not do so properly?
Or if you meant to start multiple nodes on the same host please make sure
you give each node a unique node name!
warnings.warn(W_PIDBOX_IN_USE % {'hostname': self.hostname})
Can anyone suggest how to unlock the process mailbox?
According to http://celery.readthedocs.org/en/latest/userguide/workers.html#starting-the-worker, you might need to name each node uniquely. Example:
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
Under supervisor, escape it by using %%h.
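A sketch of how this looks in practice (proj, worker1/worker2 and the node name are illustrative):

# Check whether a worker by that name is still alive before restarting it
celery -A proj inspect ping -d celery@nodename
# Give each worker on the same host a unique name to avoid the pidbox warning
celery -A proj worker -n worker1.%h
celery -A proj worker -n worker2.%h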
A large log file or not enough free space was the reason, I think.
After deleting it, all is OK.