Can't run distributed imagenet inception model (connection failed) - tensorflow

I used two Ubuntu servers to run distributed TensorFlow. Each server has TensorFlow 0.8.0 installed.
I first started ps server on server1:
```
ubuntu@i-mxdcqm20:/data1T5/org_models/inception$ sudo bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='43.254.55.221:2222' \
--worker_hosts='61.160.41.85:2222'
```
The log shows:
```
INFO:tensorflow:PS hosts are: ['43.254.55.221:2222']
INFO:tensorflow:Worker hosts are: ['61.160.41.85:2222']
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {61.160.41.85:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222
```
And `sudo netstat -tunlp` shows the server is actually listening on port 2222:
```
tcp6 0 0 :::2222 :::* LISTEN 3525/python
```
But when I start the worker on server2, it still reports a failed connection:
```
E0722 10:35:01.142377237 4045 tcp_client_posix.c:191] failed to connect to 'ipv4:43.254.55.221:2222': timeout occurred
```
I'm running the code according to the README here and I didn't change any code.
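Before digging into TensorFlow itself, it helps to confirm that the worker machine can reach the ps host's port at all. The `tcp6 ... :::2222` line means the server listens on the IPv6 wildcard, which on most Linux hosts accepts IPv4 connections too, so a timeout like this usually points at a firewall or cloud security-group rule rather than the TensorFlow setup. A minimal reachability check (a sketch; the host and port come from the question):

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the worker machine (server2):
#   can_connect('43.254.55.221', 2222)
# A False result here points at a firewall or security-group rule
# blocking port 2222, not a TensorFlow misconfiguration.
```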


Nagios CHECK_NRPE Could not complete SSL handshake

I checked all over; there are many answers to this issue, but none worked.
I am following this tutorial:
https://www.digitalocean.com/community/tutorials/how-to-install-nagios-4-and-monitor-your-servers-on-ubuntu-16-04
The Nagios host is Ubuntu 16.04 and the client is Ubuntu 18.04, running Nagios® Core™ 4.3.4.
The Nagios server and web interface are running OK; I can see that the localhost status is 'up' in the dashboard.
Something very weird: I installed NRPE 3.2.1 on both the host and the client, but for some reason the host reports v2.15.
Host:
```
root@nagios-1:/tmp/nrpe-nrpe-3.2.1# /usr/local/nagios/libexec/check_nrpe -H 10.142.0.50
NRPE v2.15
```
Client:
```
$ /usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
NRPE v3.2.1
```
Just to make sure, when running check_nrpe from client to server I am using the '-2' option to force v2 packets, but I am still getting the error.
I added the client IP to nrpe.cfg (on the server), and to be sure also the server IP to the client's nrpe.cfg file.
I enabled debug to see the messages in the syslog. This is the response:
```
Dec 4 00:35:47 nagios-1 check_nrpe: Remote 10.142.0.50 accepted a Version 2 Packet
Dec 4 00:35:51 nagios-1 nrpe[9953]: Connection from 10.142.0.11 port 49889
Dec 4 00:35:51 nagios-1 nrpe[9953]: Host address is in allowed_hosts
Dec 4 00:35:51 nagios-1 nrpe[9953]: Handling the connection...
Dec 4 00:35:51 nagios-1 nrpe[9953]: Error: Could not complete SSL handshake. 1
Dec 4 00:35:51 nagios-1 nrpe[9953]: Connection from closed.
```
On the host, port 5666 is open and listening:
```
# netstat -at | grep nrpe
tcp 0 0 *:nrpe *:* LISTEN
tcp6 0 0 [::]:nrpe [::]:* LISTEN
```
I compiled nrpe with --
I am not using xinetd; I run the daemon directly:
```
# ps aux | grep nrpe
nagios 9866 0.0 0.1 23960 2680 ? Ss 00:35 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
```
Host nrpe config file:
```
# grep -o '^[^#]*' /etc/nagios/nrpe.cfg
log_facility=daemon
pid_file=/var/run/nagios/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
allowed_hosts=127.0.0.1, 10.142.0.50, 10.142.0.0/20,10.142.0.11
dont_blame_nrpe=1
allow_bash_command_substitution=0
debug=1
command_timeout=60
connection_timeout=300
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 150 -c 200
include=/etc/nagios/nrpe_local.cfg
include_dir=/etc/nagios/nrpe.d/
```
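The `allowed_hosts` line above mixes single addresses with a CIDR range. As a quick sanity check that a given client IP is actually covered, something like the following works (a sketch of the matching logic only; nrpe itself also resolves hostnames, which this ignores):

```python
import ipaddress

def is_allowed(client_ip, allowed_hosts):
    """Check a client IP against an nrpe-style allowed_hosts value:
    a comma-separated mix of plain IPs and CIDR networks."""
    ip = ipaddress.ip_address(client_ip)
    for entry in allowed_hosts.split(','):
        entry = entry.strip()
        try:
            if ip == ipaddress.ip_address(entry):
                return True
        except ValueError:
            # Not a plain address; try it as a CIDR network.
            if ip in ipaddress.ip_network(entry, strict=False):
                return True
    return False

# allowed_hosts value from the nrpe.cfg above
allowed = '127.0.0.1, 10.142.0.50, 10.142.0.0/20,10.142.0.11'
```

In this case the syslog already confirms "Host address is in allowed_hosts", which is what points the finger at the SSL/version mismatch rather than the access list.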
If you need more info let me know and I will add it.
I found the answer!
I had two versions of NRPE on the host. The daemon that was running was 2.15. I had to kill that version and manually run the 3.2.1 version from its other location:
```
/usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -f
```
After that I was able to get a response on the client.

IPython interface to distributed Dask workers over ssh yields "Connection refused"

Today, I thought I would attempt to get to know my workers better by spawning an IPython kernel on each. Doing so seemed easy enough using the handy `client.start_ipython_workers()`.
I was able to get the connection information, and then wrote a script to dump it to JSON. I then configured some port forwarding to connect to the worker; however, the worker does not seem to accept connections:
```
connect channel 2: open failed: connect failed: Connection refused
```
It is possible that I still have some configuration problems with ssh, however I have been successfully connecting to my Jupyter notebook kernel through a similar channel. Is there some reason why the worker would be blocking connections?
```
import json
import os

winfo = client.start_ipython_workers()
for worker in winfo.keys():
    winfo[worker]['key'] = winfo[worker]['key'].decode('utf8')
    with open(os.path.join('/home/centos/kernels/',
                           'kernel-' + winfo[worker].pop('ip') + '.json'), 'w+') as f:
        winfo[worker]['ip'] = '127.0.0.1'
        json.dump(winfo[worker], f, indent=2)
```
```
#!/bin/bash
# Forward every *_port found in the kernel connection file ($2)
# through an ssh tunnel to the given host ($1).
for port in $(grep '_port' "$2" | grep -o '[0-9]\+'); do
    echo "establishing tunnel to $port"
    ssh "$1" -f -N -L "$port:127.0.0.1:$port"
done
```
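The bash loop above pulls port numbers out of the kernel JSON with grep. An equivalent sketch that parses the connection file properly and builds the ssh command (the function name and paths are mine, not part of Dask):

```python
import json

def tunnel_args(ssh_host, kernel_json_path):
    """Build an ssh argument list that forwards every *_port entry in an
    IPython kernel connection file back to localhost. Sketch only; it
    mirrors the bash loop above but parses the JSON instead of grepping."""
    with open(kernel_json_path) as f:
        info = json.load(f)
    args = ['ssh', ssh_host, '-f', '-N']
    for key, value in sorted(info.items()):
        if key.endswith('_port'):
            args += ['-L', f'{value}:127.0.0.1:{value}']
    return args
```

Run the result with `subprocess.run(tunnel_args('myhost', 'kernel-....json'))`, or just print it and paste it into a shell.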

Cannot connect to remote jmx server using jvisualvm or jconsole (netcat working)

I have a Spark application running on a remote server and I need to get its heap dump for performance purposes. I was able to run the jstatd service on the remote machine and connect to it using VisualVM. However, jstatd does not allow taking heap dumps of remote machines (I am using VisualVM 1.3.8).
To resolve this I started my application with the following extra options:
```
--conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=54321 \
-Dcom.sun.management.jmxremote.rmi.port=54320 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Djava.rmi.server.hostname=$HOSTNAME"
```
After starting the application I used netstat to list the ports opened by the process:
```
sudo netstat -lp | grep 37407
tcp 0 0 *:54321 *:* LISTEN 37407/java
tcp 0 0 *:54320 *:* LISTEN 37407/java
```
To check whether the remote ports were accessible from my local machine I used netcat, and connections to the remote host on both 54321 and 54320 succeeded.
However, when I try to connect to the host using VisualVM or JConsole, it fails. VisualVM reports the following error:
```
cannot connect to hostname:54321 using service:jmx:rmi:///jndi/rmi://hostname:54321/jmxrmi
```
What am I doing wrong here?
In order to enable the JConsole connection, try adding this flag:
```
-Dcom.sun.management.jmxremote.local.only=false
```
And to get a heap dump you don't need to connect via JConsole; just use jmap:
```
jmap -dump:format=b,live,file=<filename> <process-id>
```
Finally, if Spark has a daemon controlling it, make sure it doesn't kill the process during heap-dump creation.
The problem is that $HOSTNAME is the hostname of the server you run spark-submit from; you need to set it to the hostname of the machine the Spark driver runs on:
```
-Djava.rmi.server.hostname=<hostname of spark driver>
```
Incidentally, this is why it only worked when your Spark application and spark-submit were on the same server.
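Pulling both answers together: the flags from the question, with `local.only=false` added and the hostname pinned explicitly rather than left to `$HOSTNAME`, can be assembled like this (the helper function is illustrative, not a Spark or JMX API):

```python
def jmx_options(driver_host, jmx_port=54321, rmi_port=54320):
    """Assemble the spark.driver.extraJavaOptions value from the question,
    with java.rmi.server.hostname pinned to the Spark driver's host and
    local.only disabled so remote JConsole/VisualVM connections work."""
    flags = [
        '-Dcom.sun.management.jmxremote',
        f'-Dcom.sun.management.jmxremote.port={jmx_port}',
        f'-Dcom.sun.management.jmxremote.rmi.port={rmi_port}',
        '-Dcom.sun.management.jmxremote.authenticate=false',
        '-Dcom.sun.management.jmxremote.ssl=false',
        '-Dcom.sun.management.jmxremote.local.only=false',
        f'-Djava.rmi.server.hostname={driver_host}',
    ]
    return 'spark.driver.extraJavaOptions=' + ' '.join(flags)
```

Pass the result to spark-submit as `--conf "<value>"`, substituting the actual driver hostname.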

How to start IPython notebook remotely?

Following these instructions (Running a notebook server, and Remote access to IPython Notebooks), I proceed as follows:
On the remote server:
1) Set the NotebookApp password
```
In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
```
2) Create a profile
```
user@remote_host$ ipython profile create
```
3) Edit ~/.ipython/profile_default/ipython_notebook_config.py
```
# Password to use for web authentication
c = get_config()
c.NotebookApp.password = u'sha1:67c9e60bb8b6:9ffede0825894254b2e042ea597d771089e11aed'
```
4) Start the notebook on port 8889
```
user@remote_host$ ipython notebook --no-browser --port=8889
```
and the notebook starts:
```
[I 16:08:10.012 NotebookApp] Using MathJax from CDN: https://cdn.mathjax.org/mathjax/latest/MathJax.js
[W 16:08:10.131 NotebookApp] Terminals not available (error was No module named 'terminado')
[I 16:08:10.132 NotebookApp] Serving notebooks from local directory: /cluster/home/user
[I 16:08:10.132 NotebookApp] 0 active kernels
[I 16:08:10.132 NotebookApp] The IPython Notebook is running at: http://localhost:8889/
[I 16:08:10.132 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```
On my local machine
5) SSH tunneling
```
user@local$ ssh -N -f -L localhost:8888:127.0.0.1:8889 username@remote_host
```
On the remote host, /etc/hosts contains:
```
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
```
6) Finally, I try to open localhost:8888 in my browser, and I get:
```
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
```
All these steps work on one server, but fail on another one.
I contacted the administrator, who said the following:
> I assume that you are using two separate SSH connections: one from which you run ipython and one that you use to do port forwarding. There is no guarantee that the two connections will land you on the same login node. In the case where the two connections are on different hosts, you will experience the exact failure you have encountered. Therefore you should set up the port forwarding in the connection that you use to run ipython.
How can I setup the port forwarding in the connection that I use to run ipython?
I tried using my IP address, but it didn't work:
```
$ ssh -N -f -L local_ip_address:8888:127.0.0.1:8889 user@remote_host
```
Finally, this is how the problem was solved:
```
# Log in to the server from your local workstation and do the port
# forwarding in the same connection.
user@local$ ssh -L 8888:localhost:8889 username@remote_host
user@remote_host$ ipython notebook --no-browser --port=8889
```
Just follow these instructions:
https://coderwall.com/p/ohk6cg/remote-access-to-ipython-notebooks-via-ssh

How do I "dockerize" a redis service using phusion/baseimage-docker

I am getting started with Docker by trying to "dockerize" a simple Redis service using Phusion's baseimage. Its website says:
You can add additional daemons (e.g. your own app) to the image by
creating runit entries.
Great, so I first started this image interactively with a CMD of /bin/bash. I installed redis-server via apt-get, created a "redis-server" directory in /etc/service, and added a run file that reads as follows:
```
#!/bin/sh
exec /usr/bin/redis-server /etc/redis/redis.conf >> /var/log/redis.log 2>&1
```
I ensured that daemonize was set to "no" in the redis.conf file.
I committed my changes, and then started my newly created image with:
```
docker run -p 6379:6379 <MY_IMAGE>
```
I see this output:
```
*** Running /etc/rc.local...
*** Booting runit daemon...
*** Runit started as PID 98
```
I then run `boot2docker ip`, which gives me back an IP address. But when I run `redis-cli -h <IP>` from my Mac, it cannot connect. Same with `telnet <IP> 6379`.
`docker ps` shows the following:
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c7bd2dXXXXXX myuser/redis:latest "/sbin/my_init" 11 hours ago Up 2 minutes 0.0.0.0:6379->6379/tcp random_name
```
Can anyone suggest what I have done wrong when attempting to dockerize a simple redis service using phusion's baseimage?
It was because I did not comment out the `bind 127.0.0.1` directive in the redis.conf file.
Now it works!
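For reference, the fix amounts to commenting out that directive so Redis listens on all of the container's interfaces, which is safe here because only the published port 6379 is reachable from outside. A small sketch of the transformation (a hypothetical helper; editing redis.conf by hand works just as well):

```python
def comment_out_bind(conf_text):
    """Return redis.conf text with any `bind ...` directive commented out,
    so Redis listens on all interfaces instead of only 127.0.0.1."""
    out = []
    for line in conf_text.splitlines():
        if line.strip().startswith('bind '):
            out.append('# ' + line)
        else:
            out.append(line)
    return '\n'.join(out)
```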