Unable to Run MPI CLUSTER within a LAN - ssh

This is a snapshot of my /etc/hosts file.
karpathy is the master and client is the slave.
I have successfully set up passwordless SSH and mounted the shared directory with
sudo mount -t nfs karpathy:/home/mpiuser/cloud ~/cloud
I can log in to the client simply with ssh client.
I have followed this blog
http://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
mpirun -np 5 -hosts karpathy ./cpi runs and gives output, but
mpirun -np 5 -hosts client ./cpi
fails with this error:
[mpiexec#karpathy] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#karpathy] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:179): error waiting for event
[mpiexec#karpathy] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion

I hope you have already found the solution; in case you haven't, I would suggest doing a couple of things.
1. Disable the firewall on both nodes:
sudo ufw disable
2. Create a file named machinefile (or whatever you like) that stores the number of CPUs on each node along with its hostname.
My machinefile contains:
master:8
slave:4
master and slave are the hostnames, while 8 and 4 are the numbers of CPUs on each node.
To compile, use
mpicc -o filename filename.cpp
(for C++ sources you can also use the mpicxx wrapper).
To run, pass the machinefile as an argument:
mpirun -np 12 -f machinefile ./filename
Here 12 is the number of processes. Since both nodes together have 12 CPUs, it makes sense to divide the work across 12 processes.
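For reference, the whole sequence on the master could look roughly like the sketch below. It assumes the master/slave hostnames from the machinefile above, the shared NFS directory ~/cloud from the question, and a source file called cpi.c; adjust the names to your setup.
cd ~/cloud                           # work in the NFS-shared directory so both nodes see the same binary
printf 'master:8\nslave:4\n' > machinefile
mpicc -o cpi cpi.c                   # or mpicxx for a C++ source file
mpirun -np 12 -f machinefile ./cpi   # 12 processes spread across both nodes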

Related

restorecon and manage_accounts running continuously on Google Compute Engine CentOS 6

Any idea why 'restorecon ~/.ssh/authorized_keys' processes are being launched continuously on a GCE instance running CentOS 6? That (along with 'manage_accounts') is chewing up a lot of CPU time. The restorecon command is being run for the .ssh/authorized_keys file under several different home directories.
A couple of data points:
I'm unable to ssh into this GCE instance through the Google Cloud Console ('Open in Browser Window' option), but ssh from a regular shell using the appropriate private key (ssh -i #) works fine.
The 'ls -Z ~/.ssh/authorized_keys' command (for displaying the SE Linux context) shows the following output:
-rw-------. system_u:object_r:ssh_home_t:s0 authorized_keys
Rebooting the GCE instance did not fix the problem: The processes continue to be spawned every few seconds.
I'm perplexed. Any clues on what's spawning these processes?

Unable to run PBS script on multiple nodes using GNU parallel

I have been trying to use multiple nodes in my PBS script to run several independent jobs. Each individual job is supposed to use 8 cores and each node in the cluster has 32 cores. So, I would like to have each node run 4 jobs. My PBS script is as follows.
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
sort -u $PBS_NODEFILE > nodelist.dat
#cat ${PBS_NODEFILE} > nodelist.dat
export JOBS_PER_NODE=4
PARALLEL="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL -a input_files.dat sh test.sh {}
input_files.dat contains the names of the job files. I have successfully used this script to run parallel jobs on one node (in which case I remove --sshloginfile nodelist.dat and the sort -u $PBS_NODEFILE > nodelist.dat line from the script). However, whenever I try to run this script on more than one node, I get the following error.
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
ssh: connect to host 922 port 22: Invalid argument
ssh: connect to host 901 port 22: Invalid argument
Here, 922 and 901 are the numbers corresponding to the assigned nodes and are included in the nodelist.dat ($PBS_NODEFILE) file.
I tried to search for this problem but couldn't find much, as everyone else seems to be doing fine with the --sshloginfile argument, so I am not sure whether this is a system-specific problem.
Edit:
As @Ole Tange mentioned in his answer and comments, I need to modify the "node number" as produced by $PBS_NODEFILE, which I am doing in the following way inside the PBS script.
# provides a unique number (say, 900) associated with the node.
sort -u $PBS_NODEFILE > nodelist.dat
# changes the contents of the nodelist.dat from "900" to "username@w-900.cluster.uni.edu"
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat
I verified that the nodelist.dat contains only one line, viz. username@w-900.cluster.uni.edu.
Edit-2:
It seems like the cluster's architecture is responsible for the error I am getting. I ran the same script on a different cluster (say, cluster_2), and it finished without any errors. In my sysadmin's words, the reason why it works on cluster_2 is: "cluster_2 is a single machine. Once your job starts, you are actually on the head node of your PBS job like you would expect."
The variable $PARALLEL is used by GNU Parallel for options. So when you also use it, it is likely to cause confusion. It does not seem to be the root cause here, though, but do yourself a favor and use another variable name (or use it as described in the man page).
The problem here seems to be ssh, which will not accept a bare number as a hostname:
$ ssh 8
ssh: connect to host 8 port 22: Invalid argument
Add the domain name, and ssh will see it as a hostname:
$ ssh 8.pi.dk
<<connects>>
If I were you I would talk to your cluster admin and ask if the worker nodes could be renamed to w-XXX, where XXX is their current name.
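Putting the answer and the question's edit together, a revised job script might look roughly like the sketch below. The username@w-XXX.cluster.uni.edu pattern is the placeholder from the question's edit, and the option string is kept in a variable that is deliberately not called PARALLEL:
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=32
#PBS -l mem=128gb
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -V
#PBS -l gres=ccm
# one entry per node, rewritten into a name ssh can actually resolve
sort -u "$PBS_NODEFILE" > nodelist.dat
sed -i -r "s/([0-9]+)/username@w-\1.cluster.uni.edu/g" nodelist.dat
JOBS_PER_NODE=4   # 4 jobs of 8 cores each on a 32-core node
# not named PARALLEL, since GNU Parallel reads that variable itself
PARALLEL_CMD="parallel -j $JOBS_PER_NODE --sshloginfile nodelist.dat --wd $PBS_O_WORKDIR"
$PARALLEL_CMD -a input_files.dat sh test.sh {}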

How do I "dockerize" a redis service using phusion/baseimage-docker

I am getting started with Docker, and I am trying it out by "dockerizing" a simple redis service using Phusion's baseimage. On its website, baseimage says:
You can add additional daemons (e.g. your own app) to the image by
creating runit entries.
Great, so I first started this image interactively with a cmd of /bin/bash. I installed redis-server via apt-get. I created a "redis-server" directory in /etc/service, and made a runfile that reads as follows:
#!/bin/sh
exec /usr/bin/redis-server /etc/redis/redis.conf >> /var/log/redis.log 2>&1
I ensured that daemonize was set to "no" in the redis.conf file.
I committed my changes, and then started my newly created image with the following:
docker run -p 6379:6379 <MY_IMAGE>
I see this output:
*** Running /etc/rc.local...
*** Booting runit daemon...
*** Runit started as PID 98
I then run
boot2docker ip
It gives me back an IP address. But when I run, from my mac,
redis-cli -h <IP>
It cannot connect. Same with
telnet <IP> 6379
I ran docker ps and saw the following:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c7bd2dXXXXXX myuser/redis:latest "/sbin/my_init" 11 hours ago Up 2 minutes 0.0.0.0:6379->6379/tcp random_name
Can anyone suggest what I have done wrong when attempting to dockerize a simple redis service using phusion's baseimage?
It was because I did not comment out the
bind 127.0.0.1
line in the redis.conf file.
Now it works!
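For anyone building a similar image, the change can be baked in while preparing the image instead of edited by hand. A rough sketch of the relevant commands inside the container, assuming the stock Ubuntu/Debian redis package layout:
# make redis listen on all interfaces, not just the loopback
sed -i 's/^bind 127.0.0.1/# bind 127.0.0.1/' /etc/redis/redis.conf
# keep redis in the foreground so runit can supervise it
sed -i 's/^daemonize yes/daemonize no/' /etc/redis/redis.conf
After committing the image and running it as before, redis-cli -h $(boot2docker ip) ping from the host should answer PONG.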

SSH to multiple hosts at once

I have a script which loops through a list of hosts, connecting to each of them with SSH using an RSA key, and then saving the output to a file on my local machine - this all works correctly. However, the commands to run on each server take a while (~30 minutes) and there are 10 servers. I would like to run the commands in parallel to save time, but can't seem to get it working. Here is the code as it is now (working):
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh"
done
How can I speed this up?
You should add & to the end of the ssh call so it runs in the background.
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh" &
done
I tried using & to send the SSH commands to the background, but I abandoned this because after the SSH commands complete, the script performs some more commands on the output files, which need to have been created already.
Using & made the script skip directly to those commands, which failed because the output files were not there yet. But then I learned about the wait command, which waits for background jobs to complete before continuing. This is my code now, which works:
for host in $HOSTS; do
echo "Connecting to $host"..
ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh" &
done
wait
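Since the later steps depend on the per-host output files, it may also help to redirect each connection's output explicitly before the wait, along these lines (the report path is from the question; the local output file names are just an example):
for host in $HOSTS; do
    echo "Connecting to $host.."
    # save this host's report locally so the post-processing steps can find it
    ssh -n -t -t $USER@$host "/data/reports/formatted_report.sh" > "output_$host.txt" &
done
wait   # block until every background ssh has finished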
Try massh http://m.a.tt/er/massh/. This is a nice tool to run ssh across multiple hosts.
The Hypertable project has recently added a multi-host ssh tool. This tool is built with libssh and establishes connections and issues commands asynchronously and in parallel for maximum parallelism. See Multi-Host SSH Tool for complete documentation. To run a command on a set of hosts, you would run it as follows:
$ ht ssh host00,host01,host02 /data/reports/formatted_report.sh
You can also specify a host name or IP pattern, for example:
$ ht ssh 192.168.17.[1-99] /data/reports/formatted_report.sh
$ ht ssh host[00-99] /data/reports/formatted_report.sh
It also supports a --random-start-delay <millis> option that will delay the start of the command on each host by a random time interval between 0 and <millis> milliseconds. This option can be used to avoid thundering herd problems when the command being run accesses a central resource.

keep server running on EC2 instance after ssh is terminated

Currently, I have two servers running on an EC2 instance (MongoDB and bottlepy). Everything works when I SSH to the instance and start those two servers. However, when I close the SSH session (the instance is still running), I lose those two servers. Is there a way to keep the servers running after logging out? I am using Bitvise Tunnelier on Windows 7.
The instance I am using is Ubuntu Server 12.04.3 LTS.
For those landing here from a Google search, I would like to add tmux as another option. tmux is widely used for this purpose, and is preinstalled on new Ubuntu EC2 instances.
Managing a single session
Here is a great answer by Hamish Downer given to a similar question over at askubuntu.com:
I would use a terminal multiplexer - screen being the best known, and tmux being a more recent implementation of the idea. I use tmux, and would recommend you do too.
Basically tmux will run a terminal (or set of terminals) on a computer. If you run it on a remote server, you can disconnect from it without the terminal dying. Then when you log in again later you can reconnect and see all the output you missed.
To start it the first time, just type
tmux
Then, when you want to disconnect, press Ctrl+B, D (i.e. press Ctrl+B, release both keys, and then press D).
When you login again, you can run
tmux attach
and you will reconnect to tmux and see all the output that happened. Note that if you accidentally lose the ssh connection (say your network goes down), tmux will still be running, though it may think it is still attached to a connection. You can tell tmux to detach from the last connection and attach to your new connection by running
tmux attach -d
In fact, you can use the -d option all the time. On servers, I have this in my .bashrc:
alias tt='tmux attach -d'
So when I log in I can just type tt and reattach. You can go one step further if you want and integrate the command into an alias for ssh. I run a mail client inside tmux on a server, and I have a local alias:
alias maileo='ssh -t mail.example.org tmux attach -d'
This does ssh to the server and runs the command at the end (tmux attach -d). The -t option ensures that a terminal is started; if a command is supplied, it is not run in a terminal by default. So now I can run maileo on a local command line and connect to the server and the tmux session. When I disconnect from tmux, the ssh connection is also killed.
This shows how to use tmux for your specific use case, but tmux can do much more than this. This tmux tutorial will teach you a bit more, and there is plenty more out there.
Managing multiple sessions
This can be useful if you need to run several processes in the background simultaneously. To do this effectively, each session will be given a name.
Start (and connect to) a new named session:
tmux new-session -s session_name
Detach from any session as described above: Ctrl+B, D.
List all active sessions:
tmux list-sessions
Connect to a named session:
tmux attach-session -t session_name
To kill/stop a session, you have two options. One option is to enter the exit command while connected to the session you want to kill. The other is to use the command:
tmux kill-session -t session_name
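Applied to the question's setup, each server could be started in its own detached, named session, roughly like this (mongod's config path and the exact bottle start command are assumptions about your install):
tmux new-session -d -s mongo 'mongod --config /etc/mongodb.conf'
tmux new-session -d -s bottle 'python yourbottlepyapp.py'
tmux list-sessions                # both keep running after you log out
tmux attach-session -t bottle     # reattach later to check on the app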
If you don't want to run the process as a service (or via an Apache module) you can, like I do for IRC, use screen.
screen keeps running on your server even if you close the connection, and thus every process you started inside it will keep running too.
It would be nice if you provided more info about your environment, but assuming it's Ubuntu Linux you can start the services in the background or as daemons:
sudo service mongodb start
nohup python yourbottlepyapp.py &
(Use nohup if you are in an ssh session and want the process to keep running after the session is closed.)
You can also run your bottle.py app using Apache mod_wsgi (running under the apache service). More info here: http://bottlepy.org/docs/dev/deployment.html
Hope this helps.
Addition: (your process still runs after you exit the ssh session)
Take this example time.py
import time
time.sleep(3600)
Then run:
$ python3 time.py &
[1] 3027
$ ps -Af | grep -v grep | grep time.py
ubuntu 3027 2986 0 18:50 pts/3 00:00:00 python3 time.py
$ exit
Then ssh back to the server
$ ps -Af | grep -v grep | grep time.py
ubuntu 3027 1 0 18:50 ? 00:00:00 python3 time.py
The process is still running (notice it no longer has a controlling tty).
You will want the started services to detach from the controlling terminal. I would suggest you use nohup to do that, e.g.
ssh my.server "/bin/sh -c nohup /path/to/service"
You may need to put an & in there (inside the quotes) to run it in the background.
As others have commented, if you run proper init scripts to start/stop services (or use Ubuntu's service command), you should not see this issue.
On Linux-based instances, putting the job in the background followed by disown seems to do the job.
$ ./script &
$ disown