Setting the constraint in a SLURM job script for GPU compute capability - gpu

I am trying to set a constraint so that my job will only run on GPUs with compute capability higher than or equal to 7.
Here is my script named torch_gpu_sanity_venv385-11.slurm:
#!/bin/bash
#SBATCH --partition=gpu-L --gres=gpu:1 --constraint="cc7.0"
# -------------------------> ask for 1 GPU
d=$(date)
h=$(hostname)
echo $d $h env # show CUDA related Env vars
env|grep -i cuda
# nvidia-smi
# actual work
/research/jalal/slurm/fashion/fashion_compatibility/torch_gpu_sanity_venv385-11.bash
Without --constraint="cc7.0", my script runs correctly. I also tried the unquoted form --constraint=cc7.0, but with either form I get the following error:
[jalal@goku fashion_compatibility]$ sbatch torch_gpu_sanity_venv385-11.slurm
sbatch: error: Batch job submission failed: Invalid feature specification
When I remove the --constraint="cc7.0" term, I am able to run the job.
After removing the constraint term:
[jalal@goku fashion_compatibility]$ sbatch torch_gpu_sanity_venv385-11.slurm
Submitted batch job 28398
So, how can I set the constraint so that I am only assigned GPUs with compute capability of 7 or higher?
I followed this tutorial for constraint setting.
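The "Invalid feature specification" error usually means that no node in the cluster advertises a feature with that exact name. As a hedged aside (not from the original thread): one quick way to see which GRES and feature strings the admins have actually defined, assuming you can run sinfo from a login node:
# %N = node list, %G = GRES, %f = AVAIL_FEATURES; --constraint only accepts
# names that appear in the features column, so "cc7.0" must be listed there.
sinfo -o "%20N %15G %f"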

Related

Tensorflow serving failing with std::bad_alloc

I'm trying to run tensorflow-serving using docker compose (served model + microservice) but the tensorflow serving container fails with the error below and then restarts.
microservice | To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tensorflow-serving | terminate called after throwing an instance of 'std::bad_alloc'
tensorflow-serving | what(): std::bad_alloc
tensorflow-serving | /usr/bin/tf_serving_entrypoint.sh: line 3: 7 Aborted
(core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"
I monitored the memory usage and it seems like there's plenty of memory. I also increased the resource limit using Docker Desktop but still get the same error. Each request to the model is fairly small as the microservice is sending tokenized text with batch size of one. Any ideas?
I was encountering the same problem, and this fix worked for me:
uninstalled and reinstalled tensorflow, tensorflow-gpu, etc. at version 2.9.0 (and trained and built my model)
docker pull and docker run tensorflow/serving:2.8.0 (this did the trick and finally got rid of the problem)
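A rough sketch of those steps as shell commands (the package set and versions are just the ones this answer mentions; adjust to your environment):
# Reinstall the training-side packages at 2.9.0, then retrain/re-export the model
pip uninstall -y tensorflow tensorflow-gpu
pip install tensorflow==2.9.0 tensorflow-gpu==2.9.0
# Serve with the pinned 2.8.0 image instead of :latest
docker pull tensorflow/serving:2.8.0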
Had the same error when using tensorflow/serving:latest. Based on Hanafi's response, I used tensorflow/serving:2.8.0 and it worked.
For reference, I used
sudo docker run -p 8501:8501 \
  --mount type=bind,source=[PATH_TO_MODEL_DIRECTORY],target=/models/[MODEL_NAME] \
  -e MODEL_NAME=[MODEL_NAME] -t tensorflow/serving:2.8.0
The issue is solved for TensorFlow and TensorFlow Serving 2.11 (not yet released), and the fix is included in the nightly release of TF Serving. You can build a nightly Docker image or use the pre-compiled version.
TensorFlow 2.9 and 2.10 were also patched to fix this issue; refer to the PRs. [1, 2]
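If you want to try the nightly Serving build mentioned above, a hedged one-liner, assuming the nightly tag is published on Docker Hub:
docker pull tensorflow/serving:nightly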

SLURM script to run indefinitely to avoid session timeout

Background:
I need to remotely access an HPC cluster across the country (the site is closed to the public; I am interning for the summer)
Once I connect, I run a script; for this purpose let's call it jupyter.sh
There are several nodes in this HPC cluster, and every time I run the .sh script I get assigned to one, say N123
From the Jupyter notebook in the browser, I run the actual code/calculations/simulations in Python. The data I'm working with takes about 2 hours to run completely before I can process it and do my job
Very often I get disconnected from that node N123 because the "user doesn't have an active job running", even though my Jupyter notebook is still running and I'm working in it
This means I have to run that .sh script again and get a different node, say N456 (and the ssh command line for Jupyter has to be entered again, this time with the different node number)
Jupyter disconnects from the host, which forces me to restart the kernel and run the entire code again, costing me the hour or two it takes to run the Python code.
(Can't get into too many details since I don't know what I am allowed to share without getting in trouble)
My question is,
Is there a way I can run a .sh script with, say, an infinite loop, so that the node sees it as an active job and doesn't kick me out for "inactivity"?
I have tried running different notebooks that take about 10 minutes total to run, but this doesn't seem to be enough to be considered an active job (and I am not sure it even counts)
My experience with slurm, terminal and ssh processes is very limited, so if this is a silly question, please forgive me.
Any help is appreciated.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --job-name=pytorch
#SBATCH --mail-type=ALL
#SBATCH --mail-user= NO NEED TO SEE THIS
#SBATCH --partition=shared-gpu
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --mem=2G
#SBATCH --time=04:00:00
export PORT=8892
kill -9 $(lsof -t -i:$PORT)
# Config stuff
module purge
module load anaconda/Anaconda3
module load cuda/10.2
source activate NO NEED FOR THIS
# Print stuff
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
### Define number of processors
echo This job has allocated $SLURM_JOB_NUM_NODES nodes
# Tell me which nodes it is run on
echo " "
echo This job runs on the following processors:
echo $SLURM_JOB_NODELIST
echo " "
jupyter notebook --no-browser --port=$PORT
echo Time is `date`
Would something like:
#!/bin/bash
#SBATCH --job-name=jupyter # Job name
#SBATCH --nodes=1 # Run all processes on a single node.
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=1 # Number of CPU cores per task (multithreaded tasks)
#SBATCH --mem=2G # Job memory request. If you do not ask for enough your program will be killed.
#SBATCH --time=04:00:00 # Time limit hrs:min:sec. If your program is still running when this timer ends it will be killed.
srun jupyter.sh
work? You write this up in a text editor, save it with a .slurm extension, then run it using
sbatch jobname.slurm
Salloc might also be a good way to go about this.
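For the salloc route, a minimal sketch (the flags simply mirror the batch script above; adjust the partition, QOS and port for your site):
# Request an interactive allocation with the same limits as the batch script
salloc --partition=shared-gpu --qos=long --ntasks=1 --mem=2G --time=04:00:00
# Once the allocation is granted, launch Jupyter as a job step inside it
srun jupyter notebook --no-browser --port=8892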

SLURM overcommitting GPU

How can one run multiple jobs in parallel on one GPU? One option which works is to run a script that spawns child processes (a sketch of that option follows the EDIT below). But is there also a way to do it with SLURM itself? I tried
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --overcommit
srun python script1.py &
srun python script2.py &
wait
But that still runs them sequentially.
EDIT: We still want to allocate resources exclusively, i.e. one SBATCH job should allocate a whole GPU for itself. The question is whether there is an easy way to start multiple scripts within the SBATCH job in parallel, without having to set up a multiprocessing environment.
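For reference, a minimal sketch of the child-process option mentioned in the question (script1.py/script2.py are the same placeholders as above): the batch script backgrounds both Python processes itself, so they share the single allocated GPU without creating separate job steps.
#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# Plain background processes instead of srun job steps, so the second script
# does not wait for the first step to release the GPU.
python script1.py &
python script2.py &
wait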

How can a specific application be monitored by perf inside a KVM guest?

I have an application that I want to monitor via perf stat while it runs inside a KVM VM.
After Googling I found that perf kvm stat can do this. However, the following command fails:
sudo perf kvm stat record -p appPID
and only prints the usage text:
usage: perf kvm stat record [<options>]
-p, --pid <pid> record events on existing process id
-t, --tid <tid> record events on existing thread id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
--no-buffering collect data without buffering
-a, --all-cpus system-wide collection from all CPUs
-C, --cpu <cpu> list of cpus to monitor
-c, --count <n> event period to sample
-o, --output <file> output file name
-i, --no-inherit child tasks do not inherit counters
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-v, --verbose be more verbose (show counter open errors, etc)
-q, --quiet don't print any message
Does anyone know what the problem is?
Use KVM with vPMU (virtualization of the PMU counters) - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html ("2.2. VIRTUAL PERFORMANCE MONITORING UNIT (VPMU)"). Then run perf record -p $pid and perf stat -p $pid inside the guest.
The host system has no knowledge (tables) of guest processes (they are managed by the guest kernel, which can be non-Linux, or a different version of Linux with an incompatible table format), so the host kernel can't profile a specific guest process. It can only profile the whole guest (that is what the perf kvm command is for - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/chap-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools.html#sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-perf_kvm)
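A minimal sketch of what that looks like inside the guest, assuming vPMU is enabled on the host and appPID is the PID of the application inside the guest (the 10-second sleep is just an arbitrary measurement window):
# Run these inside the guest, not on the host
perf stat -p appPID -- sleep 10      # counter statistics for the existing process
perf record -p appPID -- sleep 10    # sampled profile written to perf.data
perf report                          # inspect the recorded samples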

Never successfully built a large Hadoop & Spark cluster

I was wondering if anybody could help me with this issue in deploying a Spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), it fails every time for one of the following reasons:
Some machines are never sshable, e.g. "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
Some nodes fail with an "Exited 100" error when deploying spark worker nodes, like "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core 128-node, 32-core 64-node, 32-core 32-node, and other configurations over 1024 total cores, but either reason 1 or reason 2 above shows up.
I also tried modifying the ssh-flag to change ConnectTimeout to 1200s, and changing bdutil_env.sh to set the polling interval to 30s, 60s, ..., but none of it worked. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information that came out of a separate email discussion: as IP mappings change and different Debian mirrors get assigned, there can be occasional problems where the concurrent calls to apt-get install during a bdutil deployment either overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These do tend to be transient, and at the moment it appears I can deploy large clusters in zones like us-east1-c and us-east1-d successfully again.
There are a few options you can take to reduce the load on the Debian mirrors:
Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10, to only deploy 10 at a time; this will make the deployment take longer, but it lightens the load as if you had just done several back-to-back 10-node deployments.
If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
Incrementally build your cluster using resize_env.sh (a sketch of this flow as commands follows this list): initially set --num_workers 10 and deploy your cluster, then edit resize_env.sh to set NEW_NUM_WORKERS=20 and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy; it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to delete only those extra workers, without affecting the ones you already deployed successfully.
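A sketch of that incremental-resize flow as commands, where <all your flags> stands for whatever bucket/project/machine-type flags you normally pass to bdutil:
# Initial deployment with a small worker count
./bdutil <all your flags> --num_workers 10 deploy
# Edit extensions/google/experimental/resize_env.sh, set NEW_NUM_WORKERS=20, then:
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy
# If a resize attempt fails, remove only the extra workers and retry
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete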
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of which clusters you have on your client machine. You can SSH into Dataproc clusters and use them basically the same way as bdutil clusters, with some minor differences, e.g. Dataproc's DEFAULT_FS is HDFS, so any GCS paths you use should fully specify the complete gs://bucket/object name.