Slurm multiple job arrays on a single GPU?

I want to ask whether it is possible to run multiple jobs (via a job array) on a single GPU, i.e. sharing the GPU. I am asking because each task only takes up about 3 GB of GPU RAM, so it would be better if I could run 8 Python scripts on a single GPU.
I tried doing something like:
#!/bin/bash
#SBATCH --job-name parallel_finetune # to give the job a distinct name
#SBATCH --nodes=1
#SBATCH --nodelist=node3 #used node4
#SBATCH -t 48:00:00 # Time for running job # set generously, over 10 days
#SBATCH -o ./shell_output/output_%A_%a.output
#SBATCH -e ./shell_output/error_%A_%a.error
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4GB
#SBATCH --gpus=1
#SBATCH --cpus-per-task=2
#SBATCH --array=0-7
(I didn't use --gpus-per-task here.) I thought that since --gpus is specified rather than --gpus-per-task, Slurm would allocate separate CPUs (as specified by --cpus-per-task) but share a single GPU. However, this is not the case, and each task gets its own GPU. Is there a way to do this?
Thank you in advance for anyone's help!
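One workaround that might achieve this (a sketch only, not a confirmed answer) is to drop the array and launch the 8 runs as background processes inside a single batch job, so they all inherit the same allocated GPU. Here finetune.py, its --index flag and the per-worker log names are placeholders for the real training script:
#!/bin/bash
#SBATCH --job-name=parallel_finetune
#SBATCH --nodes=1
#SBATCH -t 48:00:00
#SBATCH -o ./shell_output/output_%j.output
#SBATCH -e ./shell_output/error_%j.error
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16        # 8 workers x 2 CPUs each
#SBATCH --mem-per-cpu=4GB
#SBATCH --gpus=1                  # a single GPU shared by all workers

# All 8 workers are plain background processes inside one job, so they
# inherit the same CUDA_VISIBLE_DEVICES and share the one allocated GPU.
for i in $(seq 0 7); do
    python finetune.py --index "$i" > "./shell_output/worker_${i}.log" 2>&1 &
done
wait   # keep the job alive until every worker has finished
The trade-off is that the 8 runs no longer show up as separate array tasks in the queue.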

Related

Inconsistent performance of GPU subclusters

I'm running my MATLAB code on subclusters provided by my school. One subcluster named 'G' uses NVIDIA A100 GPU cards and has 12 nodes (G[000-011]) with 128 cores/node.
Whenever I run my code on G[005] or G[006], it finishes in just 2 hours. Strangely, when I run it on any other node (i.e. G[000-004, 007-011]), the computation becomes extremely slow (> 4 hours). Since all the nodes should be using the same hardware, I have no idea what is causing this difference.
Does anyone have an idea what is going on? Below is my SLURM job submission file.
Note that I have already consulted the support center at my school, but they have no idea about this problem yet either, so I thought I could get some help here...
#!/bin/sh -l
#SBATCH -A standby
#SBATCH -N 1
#SBATCH -G 1
#SBATCH -n 12
#SBATCH -t 4:00:00
#SBATCH --constraint="C|G|I|J"
#SBATCH --output=slurm-%j-%N.out
/usr/bin/sacct -j "$SLURM_JOBID" --batch-script
/usr/bin/sacct -j "$SLURM_JOBID" --format=NodeList,JobID
echo "------------------------"
cd ..
module load matlab/R2022a
matlab -batch "myfuncion(0,0,0)"
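One way to narrow this down might be to record the hardware each run actually lands on, since the --constraint="C|G|I|J" line allows the scheduler to place the job on subclusters other than G. A small diagnostic sketch that could be added near the top of the batch script:
# Log exactly which node and GPU this run is using (diagnostic only).
echo "Node: $(hostname)"
nvidia-smi --query-gpu=name,driver_version,clocks.sm --format=csv
lscpu | grep "Model name"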

How to bind a particular GPU to a task

I have two GPUs in my system. I want my task to be executed on GPU 1 (not on GPU 0).
Below are my sbatch options. Slurm does not bind my task to GPU 1 despite the --gpu-bind option; it starts my task on GPU 0:
#SBATCH --job-name=Genkin_CPU
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=1
#SBATCH -e log/slurm_%A_%a.err
#SBATCH -o log/slurm_%A_%a.out
#SBATCH -t "100000"
#SBATCH --array=1-1
#SBATCH -N 1
#SBATCH --gpu-bind=map_gpu:1
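A sketch of one possible workaround, assuming the job can actually see both GPUs (for example because they are not cgroup-constrained, or because both were requested); ./my_task is a placeholder for the real executable:
# Restrict CUDA to physical GPU 1 before launching the task. Note that when
# Slurm allocates GPUs as a gres resource it sets CUDA_VISIBLE_DEVICES itself
# and may hide the other device, so this only helps if the job has access to
# both GPUs (e.g. after requesting --gpus=2 and picking one inside the job).
export CUDA_VISIBLE_DEVICES=1
srun ./my_task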

SLURM script to run indefinitely to avoid session timeout

Background:
I need to remotely access an HPC cluster across the country (the place is closed to the public; I am interning for the summer)
Once I accomplish this, I load a script, for this purpose let's call it jupyter.sh
There are several nodes in this HPC, and every time I run the .sh script, I get assigned to one, say N123
From a Jupyter notebook in the browser, I have to run the actual code/calculations/simulations using Python. The data I'm working with takes about 2 hours to run completely so that I can then process it and do my job.
Very often, I get disconnected from that node N123 because "user doesn't have an active job running", even though my Jupyter notebook is still running / I'm working in it.
This results in me having to run that .sh script again, meaning I get a different node, say N456 (then the ssh command for Jupyter has to be entered again, this time with the different node number).
Jupyter disconnects from the host, and this forces me to restart the kernel and run the entire code again, costing me the hour-and-something it takes to run the Python code.
(Can't get into too many details since I don't know what I am allowed to share without getting in trouble)
My question is,
Is there a way that I can run an sh script with, say, an infinite loop, so that the node sees it as an active job running and doesn't kick me out for "inactivity"?
I have tried running different notebooks that take about 10 minutes total to run, but this doesn't seem to be enough to be considered an active job (and I am not sure if it even counts).
My experience with Slurm, the terminal and ssh is very limited, so if this is a silly question, please forgive me.
Any help is appreciated.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --job-name=pytorch
#SBATCH --mail-type=ALL
#SBATCH --mail-user== NO NEED TO SEE THIS
#SBATCH --partition=shared-gpu
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --time=04:00:00
export PORT=8892
kill -9 $(lsof -t -i:$PORT)
# Config stuff
module purge
module load anaconda/Anaconda3
module load cuda/10.2
source activate NO NEED FOR THIS
# Print stuff
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
### Define number of processors
echo This job has allocated $SLURM_JOB_NUM_NODES nodes
# Tell me which nodes it is run on
echo " "
echo This jobs runs on the following processors:
echo $SLURM_JOB_NODELIST
echo " "
jupyter notebook --no-browser --port=$PORT
echo Time is `date`
Would something like:
#!/bin/bash
#SBATCH --job-name=jupyter # Job name
#SBATCH --nodes=1 # Run all processes on a single node.
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=1 # Number of CPU cores per task (multithreaded tasks)
#SBATCH --mem=2G # Job memory request. If you do not ask for enough your program will be killed.
#SBATCH --time=04:00:00 # Time limit hrs:min:sec. If your program is still running when this timer ends it will be killed.
srun jupyter.sh
work? You write this up in a text editor, save it as .slurm, then run it using
sbatch jobname.slurm
Salloc might also be a good way to go about this.
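For the keep-alive part specifically, note that the jupyter command in the script above already blocks until it is killed, so the job stays active until the --time limit is reached. If an explicit loop is still wanted, a minimal sketch (reusing the PORT variable from the earlier script) might be:
jupyter notebook --no-browser --port="$PORT" &
NOTEBOOK_PID=$!
# Heartbeat loop: purely illustrative; as long as this loop (or the notebook
# itself) is running, Slurm sees the job as active until --time expires.
while kill -0 "$NOTEBOOK_PID" 2>/dev/null; do
    sleep 300
done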

How can I verify my Google account to use TensorBoard.dev during sbatch?

I want to run a TensorBoard.dev upload using the following bash file.
#!/bin/bash
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -t 50:00:00
#SBATCH -p medium
#SBATCH --mem=4G
#SBATCH -o hostname_tensorboard_%j.out
#SBATCH -e hostname_tensorboard_%j.err
module load python/3.7.4 conda2/4.2.13
source activate env_tf
echo y | tensorboard dev upload --logdir="mydir"
I need to authorize it with my Google account when I receive the following message.
Continue? (yes/NO) Please visit this URL to authorize this application:
https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=***
How can I do it?
Thanks in advance.
I'm not familiar with sbatch, but when you authorize TensorBoard, it creates a file so that you can upload afterwards without re-authorizing. You should be able to manually copy that file into the environment in which you will be uploading in the future.
On my workstation, the credentials file is
~/.config/tensorboard/credentials/uploader-creds.json
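For example, a sketch of copying that credentials file from the workstation to the cluster before submitting the job (cluster.example.edu is a placeholder for the actual login node):
# Run once from the workstation where the interactive authorization was done,
# so the non-interactive sbatch job can reuse the saved credentials.
ssh cluster.example.edu 'mkdir -p ~/.config/tensorboard/credentials'
scp ~/.config/tensorboard/credentials/uploader-creds.json \
    cluster.example.edu:~/.config/tensorboard/credentials/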

SLURM overcommitting GPU

How can one run multiple jobs in parallel on one GPU? One option that works is to run a script that spawns child processes. But is there also a way to do it with SLURM itself? I tried
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --overcommit
srun python script1.py &
srun python script2.py &
wait
But that still runs them sequentially.
EDIT: We still want to allocate resources exclusively, i.e. one SBATCH job should allocate a whole GPU for itself. The question is whether there is an easy way to start multiple scripts within the SBATCH job in parallel, without having to set up a multiprocessing environment.
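Two patterns that might do this are sketched below: plain background processes (which certainly share the job's single GPU), and overlapping job steps (which assumes a Slurm release recent enough to support srun --overlap, roughly 20.11 or later):
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:1

# Option A: skip srun; both processes inherit the job's single GPU and
# run concurrently on it.
python script1.py &
python script2.py &
wait

# Option B (assumes Slurm >= 20.11): let the two job steps overlap on the
# allocated resources instead of queueing behind each other.
# srun --ntasks=1 --overlap python script1.py &
# srun --ntasks=1 --overlap python script2.py &
# wait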