Open MPI - ring_c on multiple hosts fails during testing

I have recently installed Open MPI on two Ubuntu 14.04 hosts and I am now testing its functionality with the two provided test programs hello_c and ring_c. The hosts are called 'hermes' and 'zeus', and both have a user 'mpiuser' that can log in non-interactively (via ssh-agent).
The commands mpirun hello_c and mpirun --host hermes,zeus hello_c both work properly.
Calling mpirun --host zeus ring_c locally also works. The output is the same on both hermes and zeus:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
But calling mpirun --host hermes,zeus ring_c fails and gives the following output:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
I haven't found any documentation on how to solve such a problem, and the error output gives me no clue where to look for the mistake. How can I fix this?

You've changed two things between the first and second runs - you've increased the number of processes from 1 to 2, and run on multiple hosts rather than a single host.
I'd suggest you first check you can run on 2 processes on the same host:
mpirun -n 2 ring_c
and see what you get.
When debugging on a cluster it's often useful to know where each process is running. You should always print out the total number of processes as well. Try using the following code at the top of ring_c.c:
char nodename[MPI_MAX_PROCESSOR_NAME];
int namelen;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank and size are already declared in ring_c.c */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(nodename, &namelen);   /* name of the host this rank runs on */
printf("Rank %d out of %d running on node %s\n", rank, size, nodename);
The error you're getting is saying that the incoming message is too large for the receive buffer, which is weird given that the code always sends and receives a single integer.
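For context, MPI_ERR_TRUNCATE is raised when a message arrives that is larger than the buffer posted by the matching receive. A minimal standalone sketch (hypothetical, not taken from ring_c.c) that reproduces the abort when run with two processes:
/* truncate_demo.c - hypothetical example that triggers MPI_ERR_TRUNCATE:
 * rank 0 sends four ints, but rank 1 only posts room for one. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {1, 2, 3, 4};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, 201, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf;
        /* receive count (1) is smaller than the incoming message (4) */
        MPI_Recv(&buf, 1, MPI_INT, 0, 201, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
Running it with mpirun -n 2 truncate_demo produces the same MPI_ERR_TRUNCATE abort message.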

Related

Why can't I kill an Erlang process?

I am spawning 2 processes and it seems I cannot kill either of them:
restarter - a process that respawns the worker whenever it goes down
worker - a process that gets messages from the shell, collects them, and returns them in the exit reason to the restarter, which in turn forwards them to the shell.
The worker process can't be killed, since the restarter would restart it on any trapped exit message. But what keeps the restarter process alive?
-module(mon).
-compile_flags([debug_info]).
-export([worker/1,init/0,restarter/2,clean/1]).

% ctrl+g
init() ->
    Pid = spawn(?MODULE, restarter, [self(), []]),
    register(restarter, Pid),
    Pid.

restarter(Shell, Queue) ->
    process_flag(trap_exit, true),
    Wk = spawn_link(?MODULE, worker, [Queue]),
    register(worker, Wk),
    receive
        {'EXIT', Pid, {Queue, normal}} ->
            Shell ! {Queue, "From res: worker died peacefully, wont restart"};
        {'EXIT', Pid, {Queue, horrible}} ->
            Shell ! {Queue, "Processed so far:"},
            Shell ! "will restart in 5 seconds, select fresh/stale -> 1/0",
            receive
                1 ->
                    Shell ! "Will restart fresh",
                    restarter(Shell, []);
                0 ->
                    Shell ! "Will continue work",
                    restarter(Shell, Queue)
            after 5000 ->
                Shell ! "No response -> started with 666",
                restarter(Shell, [666])
            end;
        {MSG} ->
            Shell ! {"Unknown message...closing", MSG}
    end.

worker(Queue) ->
    receive
        die -> exit({Queue, horrible});
        finish -> exit({Queue, normal});
        MSG -> worker([{time(), MSG} | Queue])
    end.
Usage
mon:init().
regs(). %worker and restarter are working
whereis(worker) ! "msg 1", whereis(worker) ! "msg2".
whereis(worker) ! finish.
flush(). % should get the first clause from restarter
regs(). % worker should be up and running again
exit(whereis(restarter),reason).
regs(). % restarter should be dead
In this scenario, the restarter process is trapping exits, so exit(whereis(restarter), reason) doesn't kill it. The exit signal gets converted to a message, and gets put into the message queue of the process:
> process_info(whereis(restarter), messages).
{messages,[{'EXIT',<0.76.0>,reason}]}
The reason it's still in the message queue is that none of the clauses in the receive expression matches this message. The first two clauses are specific to the exit reasons used by the worker process, and the last clause might look like a catch-all clause but it actually isn't - it matches any message that is a tuple with one element. If it were written MSG instead of {MSG}, it would have received the exit reason message, and sent "Unknown message" to the shell.
If you really want to kill the process, use the kill reason:
exit(whereis(restarter), kill).
A kill exit signal is untrappable, even if the process is trapping exits.
Another thing: the first two receive clauses will only match if the worker's queue is still exactly what it was when the worker was spawned (for the first run, an empty list). That is because the code reuses the variable name Queue, so the queue in {'EXIT',Pid,{Queue,normal}} must be equal to the value passed as an argument to the restarter function. In a situation like this, you'd normally use NewQueue or something similar as the variable in the receive clauses.
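A sketch of how those clauses could be rewritten with both fixes applied (NewQueue and Msg are just suggested names, not part of the original module):
receive
    {'EXIT', _Pid, {NewQueue, normal}} ->
        Shell ! {NewQueue, "From res: worker died peacefully, wont restart"};
    {'EXIT', _Pid, {NewQueue, horrible}} ->
        %% same restart logic as before, but passing NewQueue on
        restarter(Shell, NewQueue);
    Msg ->
        %% a plain variable matches any other message, including the
        %% {'EXIT', Pid, reason} produced by exit/2 from the shell
        Shell ! {"Unknown message...closing", Msg}
end.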

Application failed to start after timestamp check failed in monit

I have a simple monit control file which contains the following:
check process fooBar1 with pidfile fooBar1PidFile
start program = "/etc/init.d/fooBar1 start" with timeout 10 seconds
stop program = "/etc/init.d/fooBar1 stop"
if 5 restarts within 5 cycles then unmonitor
check process fooBar2 with pidfile fooBar2PidFile
start program = "/etc/init.d/fooBar2 start" with timeout 10 seconds
stop program = "/etc/init.d/fooBar2 stop"
if 5 restarts within 5 cycles then unmonitor
check process fooBar with pidfile fooBarPidFile
start program = "/etc/init.d/fooBar start" with timeout 10 seconds
stop program = "/etc/init.d/fooBar stop"
if 5 restarts within 5 cycles then unmonitor
if memory usage > 25.0 MB for 4 cycles then alert
depends on fooBar1
depends on fooBar2
depends on checkFile
check file checkFile with path pathToFile
if timestamp > 4 minute for 8 cycles then restart
Here the intention is to restart the fooBar, fooBar1 and fooBar2 applications when the timestamp check for checkFile fails. But what actually happens is that monit tries to restart checkFile itself instead of fooBar.
This check was working fine with monit version 5.5, but not working with 5.18.
This is what I get when the timestamp check fails and 8 cycles have elapsed:
'checkFile' timestamp for pathToFile failed -- current timestamp is Fri, 08 Dec 2017 12:47:04
'fooBar' failed to start -- could not start required services: 'checkFile'
'fooBar' start action failed
Am I missing something here?
Thanks in advance
I tried a workaround:
check file checkFile with path pathToFile
if timestamp > 4 minute for 8 cycles then
exec "/etc/init.d/fooBar restart"
repeat every 1 cycles
This restarts the application fooBar every cycle while the timestamp check fails. But I am wondering, is there a better way to do it?

monit call exec on process recovery

What I would like to do, is the following:
if process-x fails (to (re)start) then execute cmd-x
if it recovers then execute cmd-y
For alerting via e-mail, a notification is sent by default on recovery. For the exec method, however, I cannot find a way to make this work. If I try this in the monitrc:
check process proc_x with pidfile /var/run/proc_x.pid
start program = "/bin/sh -c '/etc/init.d/Sxxproc_x start'"
stop program = "/bin/sh -c '/etc/init.d/Sxxproc_x stop'"
if 3 restarts within 5 cycles then exec "<some error cmd>"
else if succeeded then exec "<some restore cmd>"
this results in a "syntax error 'else'". If I remove the else line, the error command is called as expected. Apparently, the 'else' cannot be used with the restarts test. But how can I execute a command when the program starts successfully or recovers?
I found a solution thanks to the answer to this topic:
get monit to alert first and restart later
The "if not exist for ..." with corresponding "else" did the trick for me to report the recover. The error report is separate. My monitrc code now:
check process proc_x with pidfile /var/run/proc_x.pid
start program = "/bin/sh -c '/etc/init.d/Sxxproc_x start'"
stop program = "/bin/sh -c '/etc/init.d/Sxxproc_x stop'"
if 1 restart within 1 cycle then exec "<some error cmd>"
repeat every 1 cycle
if not exist for 3 cycles then restart
else if succeeded 2 times within 2 cycles then exec "<some restore cmd>"

Expect - wait until a process terminates

I am completely new to Expect, and I want to run my Python script via Telnet.
The script takes about 1 minute to execute, but when I try to run it via Telnet with Expect, it doesn't work.
I have this simple Expect code:
#! /usr/bin/expect
spawn telnet <ip_addr>
expect "login"
send "login\r"
expect "assword"
send "password\r"
expect "C:\\Users\\user>\r"
send "python script.py\r"
expect "C:\\Users\\user>\r"
close
When I replace script.py with one that has a shorter execution time, it works great. Could you tell me what I should change so that the script waits until my script.py process terminates? Should I use timeout or sleep?
If you are sure about the execution time of the script, then you can add sleep or set the timeout to the desired value
send "python script.py\r"
sleep 60; # Sleeping for 1 min
expect "C:\\Users\\user>"; # Now expecting for the prompt
Or
set timeout 60;
send "python script.py\r"
expect "C:\\Users\\user>"; # Now expecting for the prompt
But if the execution time varies, it is better to handle the timeout event and keep waiting for the prompt up to some maximum amount of time, i.e.:
set timeout 60; # Setting timeout as 1 min;
set counter 0
send "python script.py\r"
expect {
    timeout {
        # Increase the 'counter' on every timeout and continue with 'expect'
        incr counter
        # If 'counter' is equal to 5, we have waited 5 mins already,
        # so exit the program. (The check has to live inside the timeout
        # action; bare Tcl code at the pattern level is never evaluated.)
        if {$counter == 5} {
            puts "Might be some problem with python script"
            exit 1
        }
        puts "Waiting for the completion of script..."
        exp_continue;  # Causes the 'expect' to run again
    }
    "C:\\Users\\user>" {
        # The prompt is back, so the script has finished
        puts "Script execution is completed"
    }
}
A simpler alternative: if you don't care how long it takes to complete:
set timeout -1
# rest of your code here ...

Monit - how to identify crashes of a program instead of restarts

I am using monit to monitor my program. The program being monitored can potentially crash under 2 situations
Program can randomly crash. It just needs to be restarted
It gets into a bad state and crashes each time it is started subsequently
To fix the latter situation, I have a script to stop the program, reset it to a good state by cleaning its data files and restart it. I tried the below config
check process program with pidfile program.pid
start program = "programStart" as uid username and gid groupname
stop program = "programStop" as uid username and gid groupname
if 3 restarts within 20 cycles then exec "cleanProgramAndRestart" as uid username and gid groupname
if 6 restarts within 20 cycles then timeout
Say monit restarts the program 3 times in 3 cycles. After it is restarted the third time, the cleanProgramAndRestart script runs. However, as the cleanProgramAndRestart script restarts the program yet again, the condition of 3 restarts is met again in the next cycle and it becomes an infinite loop.
Could anyone suggest any way to fix this?
If any of the below actions are possible, then there may be a way around.
If there is a "crash" keyword, instead of "restarts", I will be able to run the clean script after the program crashes 3 times instead of after it is restarted 3 times
If there is a way to reset the "restarts" counter in some way after running the exec script
If there is a way to exec something only if output of the condition 3 restarts changed
Monit polls your "tests" every cycle. The cycle length is usually defined in /etc/monitrc, with the set daemon statement.
So if your cleanProgramAndRestart takes less than a cycle to complete, this shouldn't happen.
Since it is happening, I guess your cleanProgramAndRestart takes more than a cycle to complete.
You can:
increase the cycle length in the Monit configuration, or
check your program only every x cycles (make sure that cycle_length * x > cleanProgramAndRestart_length), as sketched below.
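For example, a rough sketch (the numbers and the service name are placeholders; adjust them to your setup):
# one Monit cycle = 60 seconds
set daemon 60
# test this service only every 5th cycle (300 s), which leaves time
# for cleanProgramAndRestart to finish in between
check process program with pidfile program.pid every 5 cycles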
If you can't modify these variables, there could be a little workaround, with a temp file:
check process program
with pidfile program.pid
start program = "programStart"
as uid username and gid groupname
stop program = "programStop"
as uid username and gid groupname
if 3 restarts within 20 cycles
then exec "touch /tmp/program_crash"
if 6 restarts within 20 cycles then timeout
check file program_crash with path /tmp/program_crash every x cycles #(make sure that cycle_length*x > cleanProgramAndRestart_length)
if changed timestamp then exec "cleanProgramAndRestart"
as uid username and gid groupname