Hortonworks Nodemanager starts but then fails: Connection refused to :8042 - hadoop-yarn

I'm trying to solve an issue with a newly added datanode on our Hortonworks cluster. The YARN NodeManager on that node fails shortly after starting. The following error message is logged:
Connection failed to http://(ipaddress):8042/ws/v1/node/info (Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py", line 166, in execute
connection_timeout=curl_connection_timeout, kinit_timer_ms = kinit_timer_ms)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/curl_krb_request.py", line 198, in curl_krb_request
_, curl_stdout, curl_stderr = get_user_call_output(curl_command, user=user, env=kerberos_env)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --location-trusted -k --negotiate -u : -b /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 -c /var/lib/ambari-agent/tmp/cookies/4268dd36-9f72-4be0-8d82-5f0a124a3a72 http://gdcdrwhdb821.dir.ucb-group.com:8042/ws/v1/node/info --connect-timeout 5 --max-time 7 1>/tmp/tmp7pZrbM 2>/tmp/tmpgM4wdg' returned 7. % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed connect to (ipaddress):8042; Connection refused
)
This doesn't really tell me WHY the connection was refused though, except that whatever YARN process corresponds to port 8042 isn't running:
netstat -tulpn | grep 8042
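The same check can be scripted when netstat is not at hand. The following is only a minimal sketch in Python that attempts a TCP connection; the helper name port_is_open is made up for illustration, and 8042 is the NodeManager web port taken from the alert above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8042 is the NodeManager web port from the alert above
    print(port_is_open("127.0.0.1", 8042))
```

A False result only says that nothing answered on that address; it does not distinguish a stopped process from a firewall rule, so the NodeManager log is still the place to look for the cause.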
I've been looking for another NodeManager log with more information, but I cannot find anything useful under /var/log/hadoop-yarn or in the directories set by yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
Are there other places I can look for YARN NodeManager error logs? Does anyone know what could be causing this?
Edit: After re-checking I found this useful bit in /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-(ipaddress).log
2017-04-19 14:01:14,670 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: java.lang.ClassNotFoundException: org.apache.spark.network.yarn.YarnShuffleService

Were you able to fix this?
I faced a similar issue today.
I stopped YARN in my HDP cluster, deleted the /var/log/hadoop-yarn/nodemanager/recovery-state directory, and started YARN again.
The NodeManager is now running without failing.

Not sure if this still helps; you have probably solved it already.
You are using the external shuffle service, which runs as an auxiliary service inside the NodeManager. Currently the NodeManager cannot find the shuffle service jar on its classpath.
Please add the location of the shuffle service jar to yarn.application.classpath in yarn-site.xml.
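Before editing the classpath, it can help to confirm which jar on the node actually ships the missing class. The sketch below is one possible way to do that in Python: it walks a directory tree and reports every jar that contains org.apache.spark.network.yarn.YarnShuffleService, the class named in the log. The function name jars_containing is hypothetical, and the directory to search is an assumption you have to supply yourself.

```python
import zipfile
from pathlib import Path

# Class named in the ClassNotFoundException, written as a jar entry path
SHUFFLE_CLASS = "org/apache/spark/network/yarn/YarnShuffleService.class"


def jars_containing(root: str, class_entry: str = SHUFFLE_CLASS) -> list[str]:
    """Return paths of all .jar files under root that contain class_entry."""
    hits = []
    for jar in sorted(Path(root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_entry in zf.namelist():
                    hits.append(str(jar))
        except (zipfile.BadZipFile, OSError):
            # Skip corrupt or unreadable archives (and dangling symlinks)
            continue
    return hits
```

For example, jars_containing("/usr/hdp") would search an HDP install root, assuming that is where the stack is installed on your node. If the list comes back empty, the jar is missing from the node rather than merely missing from the classpath.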

It is also working fine on my side. Please stop the YARN service only on the specific node, not the full YARN service.

I stopped YARN in my HDP cluster and deleted /var/log/hadoop-yarn/nodemanager/recovery-state directory and started YARN again.
This worked for me too. I think it was a file permission problem.

You need to increase the timeout of the health check in the alerts.

Related

can't start rabbitmq-server after installation

I'm trying to use rabbitmq for a django tutorial but when I want to start the server I get this error:
~$ sudo rabbitmq-server
Configuring logger redirection
14:49:57.041 [error]
14:49:57.044 [error] BOOT FAILED
BOOT FAILED
14:49:57.044 [error] ===========
===========
14:49:57.044 [error] ERROR: could not bind to distribution port 25672, it is in use by another node: rabbit@wss
ERROR: could not bind to distribution port 25672, it is in use by another node: rabbit@wss
14:49:57.045 [error]
14:49:58.046 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {dist_port_already_used,25672,"rabbit","wss"} in context start_error
14:49:58.046 [error] CRASH REPORT Process <0.153.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,"rabbit","wss"}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 138
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,\"rabbit\",\"wss\"}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{dist_port_already_used,25672,"rabbit","wss"}}},{rabbit_prelau
Crash dump is being written to: erl_crash.dump...done
I checked whether the port is in use with lsof -i :25672, but I get nothing.
I don't know too much about these things so if you need anything please tell me.
Try:
sudo lsof -i :25672
sudo kill <PID>
sudo rabbitmq-server
Where <PID> is the process ID that is occupying port 25672
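If lsof shows nothing (it may hide other users' processes when run without sudo), another way to test the port is to try binding it yourself. This is only a rough sketch in Python; the name can_bind is invented for illustration, and 25672 is the distribution port from the error message.

```python
import socket


def can_bind(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if the port is free to bind on host, False otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use"; can also be a permission error
            return False
        return True


if __name__ == "__main__":
    # 25672 is the distribution port named in the BOOT FAILED message
    print(can_bind(25672))
```

If this prints False while lsof without sudo shows nothing, the port is most likely held by a process owned by another user, such as a RabbitMQ service started at boot.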
I have encountered this issue. I figured out that it happens because rabbitmq-server is already running on the machine.
I used the following command
rabbitmqctl.bat status to check the status of rabbitmq-server. This told me whether the server was up or down.
If it is up, this could be the reason you are getting the error that you specified in your post.
You can issue the following command to stop the server
rabbitmqctl.bat stop
Now you can try starting the rabbitmq-server by issuing the following command
rabbitmq-server start
Note that I am using Windows. And I have executed these commands by pointing the command prompt to C:\Program Files\RabbitMQ\rabbitmq_server-3.8.14\sbin as my rabbitmq installation directory is C:\Program Files\RabbitMQ\rabbitmq_server-3.8.14.
I have encountered this before. Here is what caused it and how I fixed it:
This is one of those commands that requires the magic word sudo (i.e. it needs superuser privileges).
If you forget to add sudo to the command, it begins the process but later fails when it hits a superuser-only roadblock. This leaves you with an incomplete process. Now when you decide to add sudo, it attempts the same process again but finds out that someone without the right privilege has made a mess or is still messing around.
Then the solution will be to cancel out whatever the first command has started and try again.
sudo lsof -i :25672
This lists details about port 25672.
You will see the PID (process ID), e.g. 1301.
Then stop the process on that port with:
sudo kill <PID>
For example, sudo kill 1301
Make sure you are killing the right process; otherwise you may get into trouble.
Now, retry the command with sudo:
sudo rabbitmq-server
Also,
in most cases this error occurs because rabbitmq-server keeps running, even after you restart your system, unless you deliberately stop it.
Another way to stop the RabbitMQ server on Windows: press Windows+R, type "services.msc", find RabbitMQ in the list, select it, and click Stop in the top-left corner.
Then rerun your RabbitMQ server.
Hi guys, I am putting up an answer that can help Googlers run multiple rabbitmq-server instances on the same machine. While trying to do that, I ran into an error similar to the one reported here and solved it by defining:
export RABBITMQ_DIST_PORT=anything_other_than_25672
as stated in the documentation:
https://www.rabbitmq.com/networking.html#epmd-inet-dist-port-range
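As a rough illustration of the port arithmetic involved: the networking guide states that the distribution port defaults to the AMQP port plus 20000, which is why a second node on the default 5672 collides on 25672. The sketch below builds the environment overrides for an additional node in Python. The function name second_node_env and the example values are assumptions made for illustration; RABBITMQ_NODENAME, RABBITMQ_NODE_PORT and RABBITMQ_DIST_PORT are the variables described in the RabbitMQ documentation.

```python
from typing import Dict, Optional

DIST_PORT_OFFSET = 20000  # documented default: dist port = AMQP port + 20000


def second_node_env(
    node_name: str, node_port: int, dist_port: Optional[int] = None
) -> Dict[str, str]:
    """Build environment overrides for an extra RabbitMQ node on one host."""
    if dist_port is None:
        dist_port = node_port + DIST_PORT_OFFSET
    return {
        "RABBITMQ_NODENAME": node_name,
        "RABBITMQ_NODE_PORT": str(node_port),
        "RABBITMQ_DIST_PORT": str(dist_port),
    }


if __name__ == "__main__":
    # Hypothetical second node listening on 5673
    for key, value in second_node_env("rabbit2", 5673).items():
        print(f"export {key}={value}")
```

The printed export lines can be pasted into the shell before starting the second rabbitmq-server. Plugins that open their own listeners (the management UI, for instance) may need separate ports as well.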
If you are using Windows, go to Task Manager and stop RabbitMQ from running,
then restart rabbitmq-server.
Others have answered for Linux. On Windows, press Ctrl+Alt+Delete, select Task Manager, and end the process that depends on Erlang.
Note that this requires Administrator privileges.
Now enter this command to start rabbitmq-server:
rabbitmq-server start
You will have to repeat these steps every time you restart your computer. To avoid that, remove the RabbitMQ service from the startup services.
I went through the same problem on Windows: RabbitMQ is already running as a service after installation,
so just enable the management plugin from the RabbitMQ command line by entering:
rabbitmq-plugins enable rabbitmq_management
Then go to localhost:15672 and you are good to go.
This means that your port 25672 is already in use.
Try:
sudo lsof -i :25672
sudo kill <PID>
and now start your rabbitmq server using
sudo rabbitmq-server

Why does the SSH connection time out in VS Code?

I installed Git rather than a standalone OpenSSH client to use Remote-SSH in VS Code. However, after I completed the config file and tried to connect to the remote host, the connection failed.
error info:
[11:27:12.631] remote-ssh#0.48.0
[11:27:12.632] win32 x64
[11:27:12.656] SSH Resolver called for "ssh-remote+23321", attempt 1
[11:27:12.659] SSH Resolver called for host: 23321
[11:27:12.659] Setting up SSH remote "23321"
[11:27:12.790] Using commit id "26076a4de974ead31f97692a0d32f90d735645c0" and quality "stable" for server
[11:27:12.798] Testing ssh with ssh -V
[11:27:13.099] ssh exited with code: 0
[11:27:13.100] Got stderr from ssh: OpenSSH_8.1p1, OpenSSL 1.1.1d 10 Sep 2019
[11:27:13.128] Running script with connection command: ssh -T -D 49485 23321 bash
[11:27:13.132] Install and start server if needed
[11:27:13.151] Terminal shell path: C:\Windows\System32\cmd.exe
[11:27:30.151] Resolver error: Connecting with SSH timed out
[11:27:30.178] ------
I had the same problem, but the above solutions didn't work with my setup.
The following setting did work:
"remote.SSH.useLocalServer": false
I found this solution in an issue reported on GitHub.
In my case, the problem was caused by an overly long authentication process on the server side.
I solved it by extending the Connect Timeout from 15 to 30 seconds.
Instructions:
1. Open the VS Code Command Palette (via the keyboard shortcut or from the View menu).
2. Search for Remote-SSH: Settings.
3. Scroll until you find Connect Timeout.
4. Change it to a duration longer than 15 seconds.
Press F1,
open Remote-SSH: Settings,
and change Connect Timeout. Raising it from 15 seconds to 60 seconds solved my connection issue.
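The Connect Timeout field in the settings UI corresponds to the remote.SSH.connectTimeout key in settings.json. For anyone who prefers to edit the file, the sketch below shows in Python what the change amounts to; the helper name with_connect_timeout is made up, and the code deliberately works on an in-memory dict rather than rewriting the file, because a real settings.json may contain comments that Python's json module cannot parse.

```python
import json


def with_connect_timeout(settings: dict, seconds: int = 60) -> dict:
    """Return a copy of settings with the Remote-SSH connect timeout set."""
    updated = dict(settings)  # leave the caller's dict untouched
    updated["remote.SSH.connectTimeout"] = seconds
    return updated


if __name__ == "__main__":
    # Print the fragment so it can be pasted into settings.json by hand
    print(json.dumps(with_connect_timeout({}), indent=4))
```

The value is in seconds, matching the 15-second default mentioned in the answers above.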
You can try the following approaches:
1. SSH to your remote server, then run the following commands to clean the data and bin folders under the .vscode-server folder on the server:
cd ~/.vscode-server
rm -rf data/*
rm -rf bin/*
2. If step 1 does not work, SSH to your remote server and delete the entire .vscode-server folder with the following command:
rm -rf ~/.vscode-server
Please note that this will also remove the extensions that you installed on the server.
3. Downgrade the Remote-SSH extension in VS Code. Look up the extension in the VS Code interface, right-click it, and choose "Install Another Version ...". Install the version just before the current one. If that does not work, keep downgrading.
I had the same problem before. I solved it by deleting "terminal.integrated.inheritEnv": false from ~/.config/Code/User/settings.json
I found the solution here in this thread from user oreilm49:
https://github.com/microsoft/vscode-remote-release/issues/1137
In the VS Code settings, search for conpty and uncheck it.
I had the same issue. My problem was solved after changing settings in the JSON file:
I removed "terminal.integrated.inheritEnv": false from ~/.config/Code/User/settings.json
I added "remote.SSH.useLocalServer": true to ~/.config/Code/User/settings.json
It worked for me after many different attempts.
This might be a very foolish solution but it actually works for me, so I will write it down in case any other people get into the same problem.
I made modifications to the config file for SSH, and after that every connection attempt ran into the error 'Connecting with SSH timed out'. I tried many possible solutions but none of them solved my problem.
Then I just closed VS Code and restarted it, and everything worked.
I had a case of this. My client (local computer) is a Mac, and I was connecting to a Linux host. I went to the "Remote Platform" setting under the Remote.SSH settings and explicitly told it that I am connecting to a Linux remote. After this, it started to work.
I had this issue because of a version mismatch between client and server. After updating both to the same version, it worked for me.
At first my issue looked like a timeout. I tried increasing the timeout in the settings, but later found that the real problem was with "tar".
The vscode-server.tar.gz archive (the file name may differ slightly) could not be unpacked because tar was not present on my host.
So I installed tar on the host with "yum install tar",
then tried reconnecting to the server, and it worked.

Metasploitable 3 - System error 67

I am trying to set up Metasploitable 3 (VirtualBox) on my Ubuntu 16.04.
I have done everything according to the guidelines of the authors (https://github.com/rapid7/metasploitable3) when it comes to dependencies etc.
However, when I'm trying to start it (via vagrant up --provision win2k8) I get this nasty little error, that I just can't fix.
It always says:
win2k8: System error 67 has occurred.
win2k8: The network name cannot be found.
The following WinRM command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
cmd /q /c "c:\tmp\vagrant-shell.bat"
Stdout from the command:
CMDKEY: Credential added successfully.
Stderr from the command:
System error 67 has occurred.
The network name cannot be found.
I just can't find anything out on the internet. I only "know" it has something to do with network settings. But I don't know what to do now.
I'd appreciate some help!

/var/run/redis/redis.pid exists, process is already running or crashed

Redis went quiet on me.
user@mycomputer:~$ redis-cli
Could not connect to Redis at 127.0.0.1:6379: Connection refused
I tried to restart the service with:
sudo /etc/init.d/redis_6379 stop
/var/run/redis/redis.pid exists, process is already running or crashed
But no luck. The logs didn't show an error either.
I fixed it by backing up the redis.rdb file. Mine is located in
/var/lib/redis
Check your config file "/etc/redis/redis.conf" for the RDB file's location, then move the file aside:
sudo mv /var/lib/redis/redis.rdb /var/lib/redis/redis_backup.rdb
Then recreate the redis.rdb file in the same directory:
sudo touch /var/lib/redis/redis.rdb
Run redis-server with the config file and it should work:
sudo redis-server /etc/redis/redis.conf
To fix it in a tidy way: recreating the redis.rdb file, as suggested in one of the answers here, purges all the cached data recorded so far, and Redis starts up fresh with no cache data.
This is a warning message to notify system crash / improper shutdown: "/var/run/redis/redis.pid exists, process is already running or crashed"
Just delete /var/run/redis/redis.pid file and restart the server again.
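Before deleting the PID file, it is worth confirming that the process it names is really gone. The following is a minimal POSIX-only sketch in Python; the function names are invented for illustration, and the path you pass in depends on your installation.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(pid_file: str) -> bool:
    """Delete pid_file if the PID inside it is dead; return True if deleted."""
    with open(pid_file) as fh:
        pid = int(fh.read().strip())
    if pid_is_running(pid):
        return False
    os.remove(pid_file)
    return True
```

Calling remove_if_stale("/var/run/redis/redis.pid") as root would then remove the file only when Redis has really crashed. One caveat: after a reboot the old PID may have been reused by an unrelated process, in which case this check errs on the side of keeping the file.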
Note: You might have lost the latest cache changes due to the untidy shutdown if they weren't flushed to disk. This data loss can be minimized by configuring more frequent disk flushes in the Redis conf file (in my case /etc/redis/6379.conf):
save 900 1
save 300 10
save 60 10000
Or try AOF persistence.
Depending on how you installed Redis, the PID file may be at /var/run/redis_6379.pid.
What happened is that Redis crashed, but the PID file is still there. So you just have to delete it.
sudo rm -f /var/run/redis_6379.pid
Then start redis again:
sudo /etc/init.d/redis_6379 start
If you can't find it, I suggest installing Redis "more properly". Follow the Redis quickstart guide, in the "Installing Redis more properly" section.
You can find it here:
https://redis.io/topics/quickstart
Run the redis-server with config.
sudo redis-server redis.conf

"node with name "rabbit" already running", but also "unable to connect to node 'rabbit'"

Rabbitmq server does not start, saying it's already running:
$: rabbitmq-server
Activating RabbitMQ plugins ...
0 plugins activated:
node with name "rabbit" already running on "android-d1af002161676bee"
diagnostics:
- nodes and their ports on android-d1af002161676bee: [{rabbit,52176},
{rabbitmqprelaunch2254,
59205}]
- current node: 'rabbitmqprelaunch2254@android-d1af002161676bee'
- current node home dir: /Users/Jordan
- current node cookie hash: ZSx3slRJURGK/nHXDTBRqQ==
But, rabbitmqctl seems to think otherwise:
rabbitmqctl -n rabbit status
Status of node 'rabbit@android-d1af002161676bee' ...
Error: unable to connect to node 'rabbit@android-d1af002161676bee': nodedown
diagnostics:
- nodes and their ports on android-d1af002161676bee: [{rabbit,52176},
{rabbitmqctl2462,59256}]
- current node: 'rabbitmqctl2462@android-d1af002161676bee'
- current node home dir: /Users/Jordan
- current node cookie hash: ZSx3slRJURGK/nHXDTBRqQ==
Any takers?
The rabbitmq server was running somewhere but it just couldn't be connected to.
One of the following will mention something about rabbits:
$: ps aux | grep epmd
$: ps aux | grep erl
Kill the process with kill -9 {pid of rabbitmq process}
I was having the same problem, then I realized I was not issuing the right command.
./rabbitmqctl stop
This works every time, although it does take down the Erlang runtime too. Also mind where your config file is.
I used rabbitmqctl stop and then restarted using rabbitmq-server as root.
This can have two causes:
1. Rabbit is already running on the server. If that is the case, use the answer you found and kill the currently running process (find it with ps aux | grep rabbit | grep -v grep).
2. You have changed the IP address of your machine but have not updated the /etc/hosts file to reflect the new IP address.
The first cause is more common, but the second is harder to find, especially if Rabbit is running on another machine. If Rabbit is installed on the other machine, your node will look at the old IP address, see another machine already running RabbitMQ, and give you the same error. This has caused me grief in the past.
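The /etc/hosts cause can be checked by looking up which addresses the file maps to your hostname and comparing them with the machine's current IP. The sketch below parses text in /etc/hosts format in Python; the function name hosts_ips_for and the sample entries are made up for illustration.

```python
def hosts_ips_for(hostname: str, hosts_text: str) -> list[str]:
    """Return every IP that hosts_text (in /etc/hosts format) maps to hostname."""
    ips = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if hostname in names:
            ips.append(ip)
    return ips


if __name__ == "__main__":
    # Hypothetical file contents; read the real /etc/hosts in practice
    sample = "127.0.0.1 localhost\n10.0.0.5 rabbit-host  # old address\n"
    print(hosts_ips_for("rabbit-host", sample))
```

If the address returned for your hostname is not one currently assigned to the machine (compare with the output of ip addr or ifconfig), the entry is stale and should be updated.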
I was having this same error on Win 7, but the solutions above did not work for me. What did solve it was removing and reinstalling the service, using a console with admin rights:
rabbitmq-service remove
rabbitmq-service install
I hope this might help someone else too
cd RabbitMQ Server\rabbitmq_server-3.7.8\sbin
rabbitmq-service remove
rabbitmq-service install
Go to Windows Services,
find RabbitMQ, and start it.
After this, enable the plugin:
rabbitmq-plugins enable rabbitmq_management
In my case under Ubuntu 11.10 it helped to
#rabbitmqctl cluster MASTER SLAVE
#rabbitmqctl start_app
before I always got this error message...
Using an admin console on Win 2012R2 with RabbitMQ 3.5.5, I got it to work by running the remove and install commands, then rabbitmq-server restart,
then Ctrl-C to terminate the job. After that I was able to use the Windows service console to start the RabbitMQ service.
In my case (Windows):
1. I stopped the service.
2. Then I started the service again.