Unable to run PBS script on multiple nodes using GNU parallel - ssh

I have been trying to use multiple nodes in my PBS script to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores. So, I would like to have each node run 4 jobs. My PBS script is as follows.
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat
export JOBS_PER_NODE=4
PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}
input_files.dat contains the name of job files. I have successfully used this script to run parallel jobs on one node (in which case I remove --sshloginfile nodelist.dat and sort -u $PBS_NODEFILE > nodelist.dat from the script). However, whenever I try to run this script on more than one node, I get the following error.
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes and are included in the nodelist.dat ($PBS_NODEFILE) file.
I tried to search for this problem but couldn't find much as everyone else seems to be doing fine with --sshloginfile argument, so I am not sure if this is a system specific problem.
Edit:
As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.
# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat
# changes the contents of the nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat
I verified that the nodelist.dat contains only one line, viz. username@w-900.cluster.uni.edu.
Edit-2:
It seems like the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (say, cluster_2), and it finished without any errors. In my sysadmin's words, the reason why it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."

The variable $PARALLEL is used by GNU Parallel for options. So when you also use it, it is likely to cause confusion. It does not seem to be the root cause here, though, but do yourself a favor and use another variable name (or use it as described in the man page).
The problem here seems to be ssh which will not see a number as a hostname:
$ ssh 8
ssh: connect to host 8 port 22: Invalid argument
Add the domain name, and ssh will see it as a hostname:
$ ssh 8.pi.dk
<<connects>>
If I were you I would talk to your cluster admin and ask if the worker nodes could be renamed to w-XXX, where XXX is their current name.
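Until the nodes are renamed, the bare numbers in $PBS_NODEFILE can be mapped to resolvable names before GNU Parallel reads them. Below is a minimal sketch, assuming the w-<number>.cluster.uni.edu naming from the question's edit; the helper name make_sshloginfile is hypothetical, and the prefix and domain are site-specific guesses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn the bare node numbers in a PBS node file into
# host names that ssh can resolve. The default prefix and domain are
# assumptions taken from the question's edit; adjust them for your site.
make_sshloginfile() {
    local nodefile=$1
    local prefix=${2:-w-}
    local domain=${3:-cluster.uni.edu}
    # sort -u collapses the one-line-per-core entries to one line per node.
    sort -u "$nodefile" | sed -E "s/^([0-9]+)\$/${prefix}\\1.${domain}/"
}
```

Inside the PBS script this could be used as make_sshloginfile "$PBS_NODEFILE" > nodelist.dat, followed by parallel -j 4 --sshloginfile nodelist.dat --wd "$PBS_O_WORKDIR" -a input_files.dat sh test.sh {}, which also avoids storing the options in a variable named PARALLEL.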

Related

Unable to Run MPI CLUSTER within a LAN

This is the snapshot of my /etc/hosts file
karpathy is master & client is slave
I have successfully done
SETUP PASSWORDLESS SSH
Mounted sudo mount -t nfs karpathy:/home/mpiuser/cloud ~/cloud
I can login to my client simply by ssh client
I have followed this blog
http://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
mpirun -np 5 -hosts karpathy ./cpi output
mpirun -np 5 -hosts client ./cpi
Getting Error
[mpiexec@karpathy] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@karpathy] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:179): error waiting for event
[mpiexec@karpathy] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion
I hope you have already found the solution; in case you haven't, I would suggest a couple of things.
1. Disable the firewall on both nodes:
sudo ufw disable
2. Create a file named machinefile (or whatever you like) that lists the hostnames together with the number of CPUs on each node.
My machinefile contains:
master:8
slave:4
master and slave are the hostnames, while 8 and 4 are the number of CPUs on each node.
To compile, use
mpicc -o filename filename.cpp
To run, pass the machinefile as an argument
mpirun -np 12 -f machinefile ./filename
12 is the number of processes. Since the two nodes have 12 CPUs combined, it makes sense to split the work across 12 processes.
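The value passed to -np can be derived from the machinefile instead of being hard-coded. Below is a minimal sketch, assuming the host:count format shown above; count_slots is a hypothetical helper name, and treating a line without a count as one slot is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the per-host CPU counts in a "host:count" machinefile.
# Blank lines are skipped; a host listed without ":count" is counted as one slot.
count_slots() {
    awk -F: 'NF { total += (NF > 1 ? $2 : 1) } END { print total + 0 }' "$1"
}
```

With the machinefile above, mpirun -np "$(count_slots machinefile)" -f machinefile ./filename would start 12 processes.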

autossh quits because ssh (dropbear) can't resolve host

I run autossh on a system which might have internet connectivity or might not. I don't really know when it has a connection but if so I want autossh to establish a ssh tunnel by:
autossh -M 2000 -i /etc/dropbear/id_rsa -R 5022:localhost:22 -R user@host.name -p 6022 -N
After several seconds it throws:
/usr/bin/ssh: Exited: Error resolving 'host.name' port '6022'. Name or service not known
And that's it. Isn't autossh meant to keep the ssh process running no matter what? Do I really have to check for a connection by ping or so?
You need to set the AUTOSSH_GATETIME environment variable to 0. From autossh(1):
Startup behaviour
If the ssh session fails with an exit status of 1 on the very first try, autossh
1. will assume that there is some problem with syntax or the connection setup,
and will exit rather than retrying;
2. There is a "starting gate" time. If the first ssh process fails within the
first few seconds of being started, autossh assumes that it never made it
"out of the starting gate", and exits. This is to handle initial failed
authentication, connection, etc. This time is 30 seconds by default, and can
be adjusted (see the AUTOSSH_GATETIME environment variable below). If
AUTOSSH_GATETIME is set to 0, then both behaviours are disabled: there is no
"starting gate", and autossh will restart even if ssh fails on the first run
with an exit status of 1. The "starting gate" time is also set to 0 when the
-f flag to autossh is used.
AUTOSSH_GATETIME
Specifies how long ssh must be up before we consider it a successful
connection. The default is 30 seconds. Note that if AUTOSSH_GATETIME is set to 0,
then not only is the gatetime behaviour turned off, but autossh also ignores
the first run failure of ssh. This may be useful when running autossh at
boot.
Usage:
AUTOSSH_GATETIME=0 autossh -M 2000 -i /etc/dropbear/id_rsa -R 5022:localhost:22 -R user@host.name -p 6022 -N

ssh -L forward multiple ports

I'm currently running a bunch of:
sudo ssh -L PORT:IP:PORT root@IP
where IP is the target of a secured machine, and PORT represents the ports I'm forwarding.
This is because I use a lot of applications which I cannot access without this forwarding. After performing this, I can access through localhost:PORT.
The main problem occurred now that I actually have 4 of these ports that I have to forward.
My solution is to open 4 shells and constantly search my history backwards to look for exactly which ports need to be forwarded etc, and then run this command - one in each shell (having to fill in passwords etc).
If only I could do something like:
sudo ssh -L PORT1+PORT2+PORT3:IP:PORT1+PORT2+PORT3 root@IP
then that would already really help.
Is there a way to make it easier to do this?
The -L option can be specified multiple times within the same command. Every time with different ports. I.e. ssh -L localPort0:ip:remotePort0 -L localPort1:ip:remotePort1 ...
Exactly what NaN answered, you specify multiple -L arguments. I do this all the time. Here is an example of multi port forwarding:
ssh remote-host -L 8822:REMOTE_IP_1:22 -L 9922:REMOTE_IP_2:22
Note: If you don't specify a bind address, this is the same as -L localhost:8822:REMOTE_IP_1:22.
Now with this, you can now (from another terminal) do:
ssh localhost -p 8822
to connect to REMOTE_IP_1 on port 22
and similarly
ssh localhost -p 9922
to connect to REMOTE_IP_2 on port 22
Of course, there is nothing stopping you from wrapping this into a script or automate it if you have many different host/ports to forward and to certain specific ones.
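One way to wrap this into a script is to build the repeated -L options from a list of ports. Below is a minimal sketch, assuming each local port should map to the same port number on the remote host itself (unlike the REMOTE_IP_1 example above); forward_args is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "-L port:localhost:port" option per port given.
# "localhost" here is resolved on the remote side, i.e. it is the ssh server itself.
forward_args() {
    local port
    local -a args=()
    for port in "$@"; do
        args+=(-L "${port}:localhost:${port}")
    done
    # Join the collected options with single spaces on one line.
    printf '%s\n' "${args[*]}"
}
```

It could then be used as ssh -N $(forward_args 8822 9922) remote-host, with the command substitution left unquoted on purpose so that the shell splits it back into separate arguments.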
People who forward multiple ports through the same host can set up something like this in their ~/.ssh/config
Host all-port-forwards
Hostname 10.122.0.3
User username
LocalForward PORT_1 IP:PORT_1
LocalForward PORT_2 IP:PORT_2
LocalForward PORT_3 IP:PORT_3
LocalForward PORT_4 IP:PORT_4
and it becomes a simple ssh all-port-forwards away.
You can use the following bash function (just add it to your ~/.bashrc):
function pfwd {
    # $1 is the host; every remaining argument is a port to forward.
    for i in "${@:2}"
    do
        echo "Forwarding port $i"
        ssh -N -L "$i:localhost:$i" "$1" &
    done
}
Usage example:
pfwd hostname {6000..6009}
jbchichoko and yuval have given viable solutions. But jbchichoko's answer isn't as flexible as a function, and the tunnels opened by yuval's answer cannot be shut down with Ctrl+C because they run in the background. My solution below addresses both flaws:
Define a function in ~/.bashrc or ~/.zshrc:
# fsshmap multiple ports
function fsshmap() {
echo -n "-L 1$1:127.0.0.1:$1 " > $HOME/sh/sshports.txt
for ((i=($1+1);i<$2;i++))
do
echo -n "-L 1$i:127.0.0.1:$i " >> $HOME/sh/sshports.txt
done
line=$(head -n 1 $HOME/sh/sshports.txt)
cline="ssh "$3" "$line
echo $cline
eval $cline
}
An example of running the function:
fsshmap 6000 6010 hostname
Result of this example:
You can access 127.0.0.1:16000~16009 the same as hostname:6000~6009
In my company both me and my team members need access to 3 ports of a non-reachable "target" server so I created a permanent tunnel (that is a tunnel that can run in background indefinitely, see params -f and -N) from a reachable server to the target one. On the command line of the reachable server I executed:
ssh root@reachableIP -f -N -L *:8822:targetIP:22 -L *:9006:targetIP:9006 -L *:9100:targetIP:9100
I used user root but your own user will work. You will have to enter the password of the chosen user (even if you are already connected to the reachable server with that user).
Now port 8822 of the reachable machine corresponds to port 22 of the target one (for ssh/PuTTY/WinSCP) and ports 9006 and 9100 on the reachable machine correspond to the same ports of the target one (they host two web services in my case).
Another one liner that I use and works on debian:
ssh user@192.168.1.10 $(for j in $(seq 20000 1 20100 ) ; do echo " -L$j:127.0.0.1:$j " ; done | tr -d "\n")
One of the benefits of logging into a server with port forwarding is that it makes Jupyter Notebook easier to use. This link provides an excellent description of how to do it. Here I summarize and expand on it for reference.
Situation 1. Login from a local machine named Host-A (e.g. your own laptop) to a remote work machine named Host-B.
ssh user@Host-B -L port_A:localhost:port_B
jupyter notebook --NotebookApp.token='' --no-browser --port=port_B
Then you can open a browser and enter: http://localhost:port_A/ to do your work on Host-B but see it in Host-A.
Situation 2. Login from a local machine named Host-A (e.g. your own laptop) to a remote login machine named Host-B and from there login to the remote work machine named Host-C. This is usually the case for most analytical servers within universities and can be achieved by using two ssh -L connected with -t.
ssh -L port_A:localhost:port_B user@Host-B -t ssh -L port_B:localhost:port_C user@Host-C
jupyter notebook --NotebookApp.token='' --no-browser --port=port_C
Then you can open a browser and enter: http://localhost:port_A/ to do your work on Host-C but see it in Host-A.
Situation 3. Login from a local machine named Host-A (e.g. your own laptop) to a remote login machine named Host-B and from there login to the remote work machine named Host-C and finally login to the remote work machine Host-D. This is not usually the case but might happen sometime. It's an extension of Situation 2 and the same logic can be applied on more machines.
ssh -L port_A:localhost:port_B user@Host-B -t ssh -L port_B:localhost:port_C user@Host-C -t ssh -L port_C:localhost:port_D user@Host-D
jupyter notebook --NotebookApp.token='' --no-browser --port=port_D
Then you can open a browser and enter: http://localhost:port_A/ to do your work on Host-D but see it in Host-A.
Note that port_A, port_B, port_C, port_D can be random numbers except common port numbers listed here. In Situation 1, port_A and port_B can be the same to simplify the procedure.
Here is a solution inspired from the one from Yuval Atzmon.
It has a few benefits over the initial solution:
first it creates a single background process and not one per port
it generates the alias that allows you to kill your tunnels
it binds only to 127.0.0.1 which is a little more secure
You may use it as:
tnl your.remote.com 1234
tnl your.remote.com {1234,1235}
tnl your.remote.com {1234..1236}
And finally kill them all with tnlkill.
function tnl {
    # Collect one -L option per port, then start a single background ssh process.
    TUNNEL="ssh -N "
    echo Port forwarding for ports:
    for i in "${@:2}"
    do
        echo " - $i"
        TUNNEL="$TUNNEL -L 127.0.0.1:$i:localhost:$i"
    done
    TUNNEL="$TUNNEL $1"
    $TUNNEL &
    PID=$!
    alias tnlkill="kill $PID && unalias tnlkill"
}
An alternative approach is to tell ssh to work as a SOCKS proxy using the -D flag.
That way you would be able to connect to any remote network address/port accessible through the ssh server, as long as the client applications are able to go through a SOCKS proxy (or work with something like socksify).
If you want a simple solution that runs in the background and is easy to kill - use a control socket
# start
$ ssh -f -N -M -S $SOCKET -L localhost:9200:localhost:9200 $HOST
# stop
$ ssh -S $SOCKET -O exit $HOST
I've developed loco for help with ssh forwarding. It can be used to share ports 5000 and 7000 on remote locally at the same ports:
pip install loco
loco listen SSHINFO -r 5000 -r 7000
First, it can be done using parallel execution with xargs -P 0.
Create a file (named port-forward here) listing the port bindings, e.g.
localhost:8080:localhost:8080
localhost:9090:localhost:8080
then run
xargs -P 0 -I xxx ssh -vNTCL xxx <REMOTE> < port-forward
or you can do a one-liner
echo localhost:{8080,9090} | tr ' ' '\n' | sed 's/.*/&:&/' | xargs -P 0 -I xxx ssh -vNTCL xxx <REMOTE>
Pros: the ssh port forwards are independent of each other, which avoids a single point of failure.
Cons: each ssh port forward is forked separately, which is less efficient.
Second, it can be done using the brace expansion feature in bash
echo "ssh -vNTC $(echo localhost:{10,20,30,40,50} | perl -lpe 's/[^ ]+/-L $&:$&/g') <REMOTE>"
# output
ssh -vNTC -L localhost:10:localhost:10 -L localhost:20:localhost:20 -L localhost:30:localhost:30 -L localhost:40:localhost:40 -L localhost:50:localhost:50 <REMOTE>
real example
echo "-vNTC $(echo localhost:{8080,9090} | perl -lpe 's/[^ ]+/-L $&:$&/g') gitlab" | xargs ssh
Forwarding 8080 and 9090 to gitlab server.
Pros: a single fork, which is efficient.
Cons: closing this one ssh process closes all the forwards, so it is a single point of failure.
You can use this zsh function (it relies on zsh array syntax, so it will not work unchanged in bash). Put it in ~/.zshrc:
ashL () {
    # $1 is the host; every remaining argument is a port to forward.
    local a=() i
    for i in "$@[2,-1]"
    do
        a+=(-L "${i}:localhost:${i}")
    done
    autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -NT "$1" "$a[@]"
}
Examples:
ashL db@114.39.161.24 6480 7690 7477
ashL db@114.39.161.24 {6000..6050} # Forwards the whole range. This is simply shell syntax sugar.

SSH to multiple hosts at once

I have a script which loops through a list of hosts, connecting to each of them with SSH using an RSA key, and then saving the output to a file on my local machine - this all works correctly. However, the commands to run on each server take a while (~30 minutes) and there are 10 servers. I would like to run the commands in parallel to save time, but can't seem to get it working. Here is the code as it is now (working):
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh"
done
How can I speed this up?
You should add & to the end of the ssh call so that it runs in the background.
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh" &
done
I tried using & to send the SSH commands to the background, but I abandoned this because after the SSH commands are completed, the script performs some more commands on the output files, which need to have been created.
Using & made the script skip directly to those commands, which failed because the output files were not there yet. But then I learned about the wait command which waits for background commands to complete before continuing. Now this is my code which works:
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh" &
done
wait
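A bare wait does not report whether any of the background jobs failed. Waiting on each PID separately makes failures visible before the post-processing starts. Below is a minimal sketch, assuming bash; run_on_host and run_all are hypothetical names, the report_<host>.out file naming is an assumption, and the -t -t flags from the question are left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the ssh call from the question; swap in your own command.
run_on_host() {
    ssh -n "$USER@$1" "/data/reports/formatted_report.sh"
}

# Start one background job per host, then wait for each job individually so
# that a failure on any host is noticed before the post-processing starts.
run_all() {
    local host pid failed=0
    local -a pids=()
    for host in "$@"; do
        # Each host writes to its own output file (file name is an assumption).
        run_on_host "$host" > "report_${host}.out" 2>&1 &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed host(s) failed" >&2
    # Return success only if every host succeeded.
    [ "$failed" -eq 0 ]
}
```

Calling run_all $HOSTS && post_process (where post_process stands for whatever commands work on the output files) would then run the follow-up steps only when every host finished successfully.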
Try massh http://m.a.tt/er/massh/. This is a nice tool to run ssh across multiple hosts.
The Hypertable project has recently added a multi-host ssh tool. This tool is built with libssh and establishes connections and issues commands asynchronously and in parallel for maximum parallelism. See Multi-Host SSH Tool for complete documentation. To run a command on a set of hosts, you would run it as follows:
$ ht ssh host00,host01,host02 /data/reports/formatted_report.sh
You can also specify a host name or IP pattern, for example:
$ ht ssh 192.168.17.[1-99] /data/reports/formatted_report.sh
$ ht ssh host[00-99] /data/reports/formatted_report.sh
It also supports a --random-start-delay <millis> option that will delay the start of the command on each host by a random time interval between 0 and <millis> milliseconds. This option can be used to avoid thundering herd problems when the command being run accesses a central resource.

"Connection to localhost closed by remote host." when rsyncing over ssh

I'm trying to set up an automatic rsync backup (using cron) over an ssh tunnel but am getting an error "Connection to localhost closed by remote host.". I'm running Ubuntu 12.04. I've searched for help and tried many solutions such as adding ALL:ALL to /etc/hosts.allow, check for #MaxStartups 10:30:60 in sshd_config, setting UsePrivilegeSeparation no in sshd_config, creating /var/empty/sshd but none have fixed the problem.
I have autossh running to make sure the tunnel is always there:
autossh -M 25 -t -L 2222:destination.address.edu:22 pbeyersdorf@intermediate.address.edu -N -f
This seems to be running fine, and I've been able to use the tunnel for various rsync tasks, and in fact the first time I ran the following rsync task via cron it succeeded:
rsync -av --delete-after /tank/Documents/ peteman@10.0.1.5://Volumes/TowerBackup/tank/Documents/
with the status of each file and the output
sent 7331634 bytes received 88210 bytes 40215.96 bytes/sec
total size is 131944157313 speedup is 17782.61
Ever since that first success, every attempt gives me the following output
building file list ... Connection to localhost closed by remote host.
rsync: connection unexpectedly closed (8 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(605) [sender=3.0.9]
An rsync operation of a smaller subdirectory works as expected. I'd appreciate any ideas on what could be the problem.
It seems the issue is related to autossh. If I create my tunnel via ssh instead of autossh, it works fine. I suspect I could tweak the environment variables that affect the autossh configuration, but for my purposes I've solved the problem by wrapping the rsync command in a script that first opens a tunnel via ssh, executes the backup, then kills the ssh tunnel, thereby eliminating the need for the always-open tunnel created by autossh:
#!/bin/sh
#Start SSH tunnel
ssh -t -L 2222:destination.address.edu:22 pbeyersdorf@intermediate.address.edu -N -f
#execute backup commands
rsync -a /tank/Documents/ peteman@localhost://Volumes/TowerBackup/tank/Documents/ -e "ssh -p 2222"
#Kill SSH tunnel
pkill -f "ssh.*destination.address"