kill all processes spawned by parent with `ssh -x -n` on other hosts - ssh

Hi
A program named G09 runs in parallel using Linda. It spawns its parallel child processes on other nodes (hosts) as
/usr/bin/ssh -x compute-0-127.local -n /usr/local/g09l/g09/linda-exe/l1002.exel ...other_opts...
However, when the master node kills this process, the corresponding child process on the other node (compute-0-127) does not die but keeps running in the background. Right now, I manually log in to each node that has these orphaned Linda processes and kill them with kill. Is there any way to kill such child processes automatically?
See pastebin 1 for the pstree output before killing the process, and pastebin 2 for the pstree output after the parent is killed:
pastebin1 - http://pastebin.com/yNXFR28V
pastebin2 - http://pastebin.com/ApwXrueh
Update in response to Answer 1
Thanks, Martin, for the explanation. I tried the following:
killme() { kill 0 ; }
# ...make calls to prepare for running G09...
g09 < "$g09inp" > "$g09out" &
trap killme 'TERM'
wait
but when Torque/Maui (which handles job execution) kills the job (this script) via qdel $jobid, the processes started by G09 as ssh -x $host -n still run in the background. What am I doing wrong here? (Normal termination is not a problem, as G09 itself stops those processes.) The following is the pstree before qdel:
bash
|-461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
| `-g09
| `-l1002.exe 1048576000Pd-C-C-addn-H-MO6-fwd-opt.chk
| `-cLindaLauncher /tmp/viaExecDataN6
| |-l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | |-{l1002.exel}
| | `-{l1002.exel}
| |-ssh -x compute-0-149.local -n ...
| |-ssh -x compute-0-147.local -n ...
| |-ssh -x compute-0-146.local -n ...
| |-{cLindaLauncher}
| `-{cLindaLauncher}
`-pbs_demux
and after qdel it still shows
461.norma.iitb. /opt/torque/mom_priv/jobs/461.norma.iitb.ac.in.SC
`-ssh -x -n compute-0-149 rm\040-rf\040/state/partition1/trirag09/461
l1002.exel 1048576000Pd-C-C-addn-H-MO6-fwd-opt.ch
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
|-{l1002.exel}
`-{l1002.exel}
ssh -x compute-0-149.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
ssh -x compute-0-147.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
ssh -x compute-0-146.local -n /usr/local/g09l/g09/linda-exe/l1002.exel
What am I doing wrong here? Is the trap killme 'TERM' wrong?

I would try the following approach:
- Create a script/application that wraps the g09 binary you are starting, and start that wrapper instead.
- In the script, wait for the HUP signal to arrive (it should be received when the ssh connection is closed).
- In the HUP handler, send a signal to your own process group (kill's special PID 0) that kills all processes in the group.
Sending a KILL signal to the process group is really easy: kill -9 0. Try this:
#!/bin/sh
./b.sh 1 &
./b.sh 2 &
sleep 10
kill -9 0
where b.sh is
#!/bin/sh
while /bin/true
do
    echo $1
    sleep 1
done
You can have as many child processes as you want (directly or indirectly); they will all get the signal - as long as they don't detach themselves from the process group.
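Put together for the G09 case, such a wrapper could look roughly like this (a sketch, not tested against G09; it assumes the wrapper is the process-group leader of g09 and its children, and it reuses the $g09inp/$g09out variables from the question):
#!/bin/sh
# Hypothetical wrapper: start this instead of calling g09 directly.
killgroup() {
    trap '' HUP TERM   # ignore the signal we are about to send ourselves
    kill -TERM 0       # signal every process in our process group
    wait
    exit 1
}
trap killgroup HUP TERM
g09 < "$g09inp" > "$g09out" &
wait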

I had a similar problem using ssh -N (similar to ssh -n), and kill -9 0 did not work for me when run inside the script that initiated the ssh call. I found that kill $(jobs -p) does terminate the ssh process; it is not very elegant, but it is what I am using currently.
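A minimal sketch of that workaround (the tunnel command is illustrative): have the script kill its own background jobs on exit:
#!/bin/bash
# On exit (normal or via signal), kill every background job this script started.
trap 'kill $(jobs -p) 2>/dev/null' EXIT
ssh -N -L 8080:localhost:80 user@host &   # hypothetical long-lived ssh
# ... work that uses the tunnel ...
wait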

Related

How to ask a remote server over SSH to run a background job?

I'm trying to start a long-running process on a remote server, over SSH:
$ echo Hello | ssh user@host "cat > /tmp/foo.txt; sleep 100 &"
Here, sleep 100 is a simulation of my long-running process. I want this command to exit instantly, but it waits for 100 seconds. It is important to mention that the job needs to receive input from me (Hello in the example above).
Server:
$ sshd -?
OpenSSH_8.2p1 Ubuntu-4ubuntu0.5, OpenSSL 1.1.1f 31 Mar 2020
Saying "I want this command to exit instantly" is incompatible with "long-running". Perhaps you mean that you want the long-running command to run in the background.
If the output is not immediately needed locally (i.e. it can be retrieved by another ssh in the future), then nohup is simple:
echo hello |
ssh user@host '
cat >/tmp/foo.txt;
nohup </dev/null >cmd.out 2>cmd.err cmd &
'
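The remote output can then be picked up by a later ssh, e.g.:
ssh user@host cat cmd.out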
If output must be received locally as the command runs, you can background ssh itself using -f:
echo hello |
ssh -f user@host '
cat >/tmp/foo.txt;
cmd
' >cmd.out 2>cmd.err

AIX - How to kill using process name instead of PID

Is there a way to kill a process by specifying the process name instead of PID for AIX?
E.g. for the process below, I want to kill it by specifying sapstartsrv instead of 10682424:
hmsadm 10682424 1 0 Apr 30 - 0:54 /usr/sap/HMS/ASCS01/exe/sapstartsrv pf=/usr/sap/HMS/SYS/profile/START_ASCS01_H\
Thanks.
You can use a command like this:
kill -9 $(ps -ef|grep sapstartsrv|awk '{print $2}')
Of course, first check that the command ps -ef|grep sapstartsrv|awk '{print $2}' returns only the processes you want to kill.
Try this. The brackets around the first letter of the process name help, because grep's own entry in the process list does not match the bracketed pattern. Obviously, change this to a valid server.
while true; do date; ping -c4 server; sleep 500; done &
ps -aef | grep -i '[p]ing' | awk '{print $2}' | xargs kill -9
If that doesn't work sometimes you have to kill the parent process.
ps -aef | grep -i '[p]ing' | awk '{print $2 " " $3}' | xargs kill -9
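If the bracket trick feels fragile, awk can do the matching itself; the guard against /awk/ is needed because the awk process's own command line also contains the pattern (a sketch, using the sapstartsrv example from the question):
ps -ef | awk '/sapstartsrv/ && !/awk/ {print $2}' | xargs kill -9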

Using tail to monitor an active logging file

I'm running multiple 'shred' commands on multiple hard drives in a workstation. The 'shred' commands are all run in the background in order to run them concurrently. The output of each 'shred' is redirected to a text file, and I also have the output directed to the terminal as well.

I'm using tail to monitor the log file for errors, and halt the script if any are encountered. If there are no errors, the script should simply continue on to conclusion. When I test it by forcing a drive failure (disconnecting a drive), it detects the I/O errors and the script halts as expected.

The problem I'm having is that when there are NO errors, I cannot get 'tail' to terminate once the 'shred' commands have completed, and the script just hangs at that point. Since I put the 'tail' command in the 'while' loop below, I would have thought that 'tail' would continue to run as long as the 'shred' processes were running, but would then halt after the 'shred' processes stopped, thus ending the 'while' loop. But that hasn't been the case. The script still hangs even after the 'shred' processes have ended. If I go to another terminal window while the script is hanging and kill the 'tail' process, the script continues as normal. Any ideas how to get the 'tail' process to end when the 'shred' processes are gone?
My code:
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdb 2>&1 | tee -a logfile &
shred -n 3 -vz /dev/sdc 2>&1 | tee -a logfile &
pids=$(pgrep shred)
while kill -0 $pids 2> /dev/null; do
    tail -qn0 -f logfile | \
    read LINE
    echo "$LINE" | grep -q "error"
    if [ $? = 0 ]; then
        killall shred > /dev/null 2>&1
        echo "Error encountered. Halting."
        exit
    fi
done
wait $pids
There is other code after the 'wait' that does other things, but this is where the script hangs.
Not directly related to the question, but you could use Daggy - Data Aggregation Utility.
In that case, all subprocesses are terminated together with the main daggy process.
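For reference, a bash-only sketch of one way around the hang: run the watcher in its own process group via setsid, then kill that whole group once the shred processes have finished (setsid is from util-linux; error handling is simplified here):
#!/bin/bash
shred -n 3 -vz /dev/sda 2>&1 | tee -a logfile &
# ... other shred commands ...
pids=$(pgrep shred)

# The watcher (tail + read loop) gets its own process group, so one
# negative-PID kill later takes down both the loop and the tail.
setsid bash -c '
    tail -qn0 -f logfile | while read -r LINE; do
        echo "$LINE" | grep -q "error" && killall shred
    done
' &
watcher=$!

wait $pids                        # returns once every shred has finished
kill -- -"$watcher" 2>/dev/null   # stop the whole watcher pipeline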

Use qdel to delete all my jobs at once, not one at a time

This is a rather simple question but I haven't been able to find an answer.
I have a large number of jobs running in a cluster (>20) and I'd like to delete them all and start over.
According to this site I should be able to just do:
qdel -u netid
to get rid of them all, but in my case that returns:
qdel: invalid option -- 'u'
usage: qdel [{ -a | -c | -p | -t | -W delay | -m message}] [<JOBID>[<JOBID>]|'all'|'ALL']...
-a -c, -m, -p, -t, and -W are mutually exclusive
which obviously indicates that the command does not work.
Just to check, I did:
qstat -u <username>
and I do get a list of all my jobs, but:
qdel -u <username>
also fails.
Found the answer buried in an old supercluster.org thread:
qselect -u <username> | xargs qdel
Worked flawlessly.
Building on what Gabriel answered:
qselect -u <username> | xargs qdel
qselect -u <username> -s <state> | xargs qdel
<state> would be R for running jobs only.
qselect will allow you to select jobs based on other criteria, like resources requested (-l), destination queue (-q), etc.
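For example, to remove only your jobs submitted to a given queue (the queue name here is made up):
qselect -u <username> -q batch | xargs qdel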
qdel -u <username>
will only work with SGE
Sometimes a simple grep/cut can help too:
qstat | grep $USER | cut -d. -f1 | xargs qdel
This way we can also grep for a particular keyword among the jobs and delete only those.
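For instance, to delete only my jobs whose qstat line mentions a given keyword ("bench" is a made-up job-name fragment):
qstat | grep $USER | grep bench | cut -d. -f1 | xargs qdel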
HTH
Try
$ qdel {id1..id2}
So for example:
$ qdel {1148613..1148650}
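This works because the shell's brace expansion turns the range into an explicit job-ID list before qdel ever runs:
$ echo qdel {1148613..1148616}
qdel 1148613 1148614 1148615 1148616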
For UGE:
qstat -u $(whoami) | gawk '{print $1}' | xargs qdel
# Delete all jobs owned by the current user.
#
# Command breakdown:
# ------------------
#
# qselect
# -u selects all jobs that belong to the current user
# -s EHQRTW selects all job states except for Complete
#
# xargs
# --no-run-if-empty Do not run qdel if the result set is empty
# to avoid triggering a usage error.
#
# qdel
# -a delete jobs asynchronously
#
# The backslashes are a trick to bypass any shell aliases.
\qselect -u $(whoami) -s EHQRTW | \xargs --no-run-if-empty \qdel -a
Another possibility is to run qdel all. It deletes all jobs from everyone. If you don't have access to other people's jobs, it deletes only your jobs.
It is not the most beautiful solution, but it is surely the shortest!
qstat | cut -d. -f1 | sed "s; \(.*\) 0;qdel \1;" | bash
sed's power.
Just use the following command:
qdel all
It will cancel all jobs running on the cluster.

How to kill a process in cygwin?

Hi, I have the following process which I can't kill.
I am running Cygwin on Windows XP 32-bit.
I have tried issuing the following commands:
/bin/kill -f 4760
/bin/kill -9 5000
kill -9 5000
kill 5000
When I run /bin/kill -f 4760 I get the message 'kill: couldn't open pid 4760'.
When I run /bin/kill -9 5000 I get the message 'kill: 5000: No such process'.
I simply don't understand why this process can't be killed.
Since it has a WINPID, shouldn't it be killed by /bin/kill -f 4760?
Hope someone can help, thanks :)
Most likely the process is locked by Windows; the error you are getting ('couldn't open pid XXX') points to this.
To confirm, try killing it with the Windows taskkill command:
taskkill /PID 4760
Strangely, the following works in Cygwin:
echo PID1 PID2 PID3 | xargs kill -f
For example:
ps -W | grep WindowsPooPoo | awk '{print $1}' | while read line; do echo $line | xargs kill -f; done;
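The while/read/echo plumbing is not actually needed, since awk's output can be fed straight to xargs (a simplification; same behavior assumed):
ps -W | grep WindowsPooPoo | awk '{print $1}' | xargs kill -f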
Different Windows programs handle the signals that kill sends in different ways; they were never designed to deal with them the way Linux/Cygwin programs are.
The only reliable method for killing a Windows program is to use a Windows-specific tool, such as Task Manager or Process Explorer.
That said, if you've not already, you may have luck with running your Cygwin terminal in administrator mode (right click on your shortcut and select "Run as administrator").
The method presented by @Donal Tobin is correct:
kill -f <pid>
However, I don't need to log in as administrator.
Create a file called killall.sh with this line
ps -W | grep $1 | awk '{print $1}' | while read line; do echo $line | xargs kill -f; done;
Then give it execute permissions.
chmod 777 killall.sh
In your .bash_profile add this line
alias killall="~/killall.sh" (point it to the correct location)
Then you just have to type "killall [name]"
killall.sh - Kill by process name.
#!/bin/bash
ps -W | grep "$1" | awk '{print $1}' | xargs kill -f;
Usage:
$ killall <process name>
For me this command does not work on Windows 10 in Cygwin:
$ kill -f 15916
bash: kill: (15916) - No such process
Instead, you can use one of the following commands:
$ powershell kill -f 15916
$ netstat -ano | grep ':8080' | awk '{print $5}' | xargs powershell kill -f
$ netstat -ano | grep ':8080' | awk '{print $5}' | while read pid; do powershell kill -f $pid; done;
$ netstat -ano | grep ':8080' | awk '{sub(/\r/,"",$5) ; print $5}' | while read pid; do taskkill /F /PID $pid; done;
SUCCESS: The process with PID 15916 has been terminated.