SLURM script to run indefinitely to avoid session timeout - ssh

Background:
I need to remotely access an HPC cluster across the country (the site is closed to the public; I am interning for the summer).
Once connected, I run a script; for this purpose let's call it jupyter.sh.
There are several nodes in this HPC cluster, and every time I run the .sh script, I get assigned to one, say N123.
From a Jupyter notebook in the browser, I then run the actual code/calculations/simulations in Python. The data I'm working with takes about 2 hours to run completely before I can process it and do my job.
Very often I get disconnected from node N123 because the "user doesn't have an active job running", even though my Jupyter notebook is still running and I'm actively working in it.
This means I have to run the .sh script again and get assigned a different node, say N456 (and then the ssh command for Jupyter has to be entered again, this time with the new node number).
Jupyter disconnects from the host, which forces me to restart the kernel and rerun the entire notebook, costing me the hour-plus it takes the Python code to run.
(Can't get into too many details since I don't know what I am allowed to share without getting in trouble)
My question is:
Is there a way I can run an sh script with, say, an infinite loop, so that the node sees it as an active job and doesn't kick me out for "inactivity"?
I have tried running other notebooks that take about 10 minutes total to run, but this doesn't seem to be enough to count as an active job (and I am not sure it even counts at all).
My experience with SLURM, the terminal, and ssh is very limited, so if this is a silly question, please forgive me.
Any help is appreciated.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --job-name=pytorch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=NO NEED TO SEE THIS
#SBATCH --partition=shared-gpu
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --time=04:00:00
export PORT=8892
kill -9 $(lsof -t -i:$PORT)
# Config stuff
module purge
module load anaconda/Anaconda3
module load cuda/10.2
source activate NO NEED FOR THIS
# Print stuff
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
### Define number of processors
echo This job has allocated $SLURM_JOB_NUM_NODES nodes
# Tell me which nodes it is run on
echo " "
echo This job runs on the following nodes:
echo $SLURM_JOB_NODELIST
echo " "
jupyter notebook --no-browser --port=$PORT
echo Time is `date`
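For reference, the ssh tunnel I have to re-enter after every reassignment looks roughly like the following (the login hostname and username below are placeholders, and the node name changes each time):
ssh -N -L 8892:N123:8892 myuser@hpc-login.example.edu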

Would something like:
#!/bin/bash
#SBATCH --job-name=jupyter # Job name
#SBATCH --nodes=1 # Run all processes on a single node.
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=1 # Number of CPU cores per task (multithreaded tasks)
#SBATCH --mem=2G # Job memory request. If you do not ask for enough your program will be killed.
#SBATCH --time=04:00:00 # Time limit hrs:min:sec. If your program is still running when this timer ends it will be killed.
srun jupyter.sh
work? You write this up in a text editor, save it as .slurm, then run it using
sbatch jobname.slurm
Salloc might also be a good way to go about this.
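If salloc sounds more convenient, a minimal interactive session might look something like the sketch below (the resource values are placeholders, and the exact behaviour of salloc depends on how the cluster is configured):
salloc --nodes=1 --ntasks=1 --mem=2G --time=04:00:00   # hold an allocation open for up to 4 hours
srun --pty bash                                        # get a shell on the allocated compute node
jupyter notebook --no-browser --port=8892              # the notebook keeps the allocation busy
Either way, the key point is that the notebook server itself is the job's main process, so SLURM sees an active job for as long as the server (and the time limit) lasts.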

Related

Inconsistent performance of GPU subclusters

I'm running my MATLAB code on subclusters provided by my school. One subcluster, named 'G', uses Nvidia A100 GPU cards and has 12 nodes (G[000-011]) with 128 cores/node.
Whenever I run my code on G[005] or G[006], it finishes in just 2 hours. Strangely, when I run it on any of the other nodes (i.e. G[000-004,007-011]), the computation becomes extremely slow (> 4 hours). Since all the nodes should be using the same hardware, I have no idea what is causing this difference.
Does anyone have an idea what is going on? Below is my SLURM job submission file.
Note that I already consulted the support center at my school, but they don't have any idea about this problem yet either, so I thought I could get some help here...
#!/bin/sh -l
#SBATCH -A standby
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -n 12
#SBATCH -t 4:00:00
#SBATCH --constraint="C|G|I|J"
#SBATCH --output=slurm-%j-%N.out
/usr/bin/sacct -j "$SLURM_JOBID" --batch-script
/usr/bin/sacct -j "$SLURM_JOBID" --format=NodeList,JobID
echo "------------------------"
cd ..
module load matlab/R2022a
matlab -batch "myfuncion(0,0,0)"
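One thing that might help narrow this down (an addition of mine, not part of the original submission file): log which node and which GPU each run actually lands on, since the --constraint="C|G|I|J" line allows the job to be placed on several different subclusters.
# Hypothetical added diagnostics: record the node and GPU the job actually got
echo "Running on node: $(hostname)"
nvidia-smi --query-gpu=name,clocks.max.sm,memory.total --format=csv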

Slurm multiple job arrays on a single GPU?

I want to ask if it is possible to run multiple jobs (via a job array) on a single GPU (i.e. sharing the GPU). Each task only takes up 3 GB of GPU RAM, so it would be better if I could run 8 Python scripts on a single GPU.
I tried doing something like :
#!/bin/bash
#SBATCH --job-name parallel_finetune # to give each run a different job name
#SBATCH --nodes=1
#SBATCH --nodelist=node3 #used node4
#SBATCH -t 48:00:00 # Time for running job # set long, over 10 days
#SBATCH -o ./shell_output/output_%A_%a.output
#SBATCH -e ./shell_output/error_%A_%a.error
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4GB
#SBATCH --gpus=1
#SBATCH --cpus-per-task=2
#SBATCH --array=0-7
(where I didn't use --gpus-per-task) I thought that since --gpus is specified, as opposed to --gpus-per-task, Slurm would allocate separate CPUs (as specified by --cpus-per-task) but share a single GPU. However, this is not the case and each task gets one GPU. Is there a way to do this?
Thank you in advance for anyone's help!
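One possible workaround (an assumption on my part, not something from the original post) is to drop the array entirely and launch all eight runs inside a single job, so they share the one GPU allocated to that job. Here train.py and --task-id are hypothetical stand-ins for the actual fine-tuning script and its arguments:
#!/bin/bash
#SBATCH --job-name=parallel_finetune
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=4GB
#SBATCH --gpus=1
#SBATCH -t 48:00:00
# All eight runs live inside one job, so they all see the same single GPU.
for i in $(seq 0 7); do
    python train.py --task-id "$i" &
done
wait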

monitor bash script execution using monit

We have just started using monit for process monitoring and are pretty new to it. I have a bash script at /home/ubuntu/launch_example.sh that runs continuously. Is it possible to monitor it with monit? monit should restart the script if it terminates. What should the syntax be? I tried the configuration below, but the commands are not executed as the ubuntu user (the shell script calls some Python scripts).
check process launch_example
matching "launch_example"
start program = "/bin/bash -c '/home/ubuntu/launch_example.sh'"
as uid ubuntu and gid ubuntu
stop program = "/bin/bash -c '/home/ubuntu/launch_example.sh'"
as uid ubuntu and gid ubuntu
The simple answer is "no". Monit is just for monitoring and is not some kind of supervisor/process manager. So if you want to monitor your long running executable, you have to wrap it.
check process launch_example with pidfile /run/launch.pid
start program = "/bin/bash -c 'nohup /home/ubuntu/launch_example.sh &'"
as uid ubuntu and gid ubuntu
stop program = "/bin/bash -c 'kill $(cat /run/launch.pid)'"
as uid ubuntu and gid ubuntu
This quick'n'dirty way also needs an additional line in your launch_example.sh to write the pidfile (pidfile matching should always be preferred over string matching) - it can be the first line after the shebang. It simply writes the current process ID to the pidfile. Nothing fancy here ;)
echo $$ > /run/launch.pid
In fact, it's not even hard to convert your script into a systemd unit. Here is an example of how to do it. User binding, restarts, the pidfile, and "start-on-boot" can then be managed through systemd (e.g. start program = "/usr/bin/systemctl start my_unit").
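A minimal unit along those lines might look like the sketch below (the unit name and file path are assumptions; adapt them to your setup):
# /etc/systemd/system/launch_example.service (hypothetical path)
[Unit]
Description=launch_example wrapper
After=network.target

[Service]
Type=simple
User=ubuntu
Group=ubuntu
ExecStart=/bin/bash /home/ubuntu/launch_example.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target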

SLURM overcommiting GPU

How can one run multiple jobs in parallel on one GPU? One option that works is to run a script that spawns child processes. But is there also a way to do it with SLURM itself? I tried
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --overcommit
srun python script1.py &
srun python script2.py &
wait
But that still runs them sequentially.
EDIT: We still want to allocate resources exclusively, i.e. one sbatch job should allocate a whole GPU for itself. The question is whether there is an easy way to start multiple scripts within the sbatch job in parallel, without having to set up a multiprocessing environment.
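One pattern that may be worth trying (my assumption, not a confirmed fix): skip the per-step srun calls and background the processes directly, so both run concurrently inside the same allocation and see the same GPU:
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# Both processes run inside the single allocation and share its GPU.
python script1.py &
python script2.py &
wait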

running same script over many machines

I have set up a few EC2 instances, which all have a script in the home directory. I would like to run the script simultaneously across all the EC2 instances, i.e. without going through a loop.
I have seen csshX for OSX for interactive terminal usage... but was wondering what the command-line code is to execute commands like
ssh user@ip.address . test.sh
to run the test.sh script across all instances, since...
csshX user@ip.address.1 user@ip.address.2 user@ip.address.3 . test.sh
does not work...
I would like to do this over the command line, as I would like to automate this process by adding it to a shell script.
And for bonus points... if there is a way to send a message back to the machine sending the command that it has completed running the script, that would be fantastic.
Will it be good enough to have a master shell script that runs all of these in the background? E.g.,
#!/bin/sh
pidlist="ignorethis"
for ip in ip1 ip2
do
ssh user@$ip . test.sh &
pidlist="$pidlist $!" # get the process number of the last forked process
done
# Now all processes are running on the remote machines, and we want to know
# when they are done.
# (EDIT) It's probably better to use the 'wait' shell built-in; that's
# precisely what it seems to be for.
while true
do
sleep 1
alldead=true
for pid in $pidlist
do
if kill -0 $pid > /dev/null 2>&1
then
alldead=false
echo some processes alive
break
fi
done
if $alldead
then
break
fi
done
echo all done.
it will not be exactly simultaneous, but it should kick off the remote scripts in parallel.
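With the wait built-in mentioned in the edit above, the whole thing collapses to something like this (same user@ip placeholders as before):
#!/bin/sh
# Start the remote scripts in parallel, then block until every ssh has exited.
for ip in ip1 ip2
do
    ssh "user@$ip" '. test.sh' &
done
wait
echo all done.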