On an NVIDIA host with 2 GPUs, how can two remote users each use one GPU via srun under SLURM?

I have an NVIDIA host with 2 GPUs, and there are two different remote users who each need to use a GPU on that host. When each of them submits tasks with srun, managed by SLURM, one of them gets GPU resources allocated immediately, but the other's job stays in the queue waiting for resources. But there are two GPUs. Why doesn't each user get a GPU?
I have already tried several alternatives in the parameters, but it seems that when using srun interactively, whoever manages to start their job first holds the whole machine until that job finishes.

Assuming Slurm is correctly configured to allow node sharing (the SelectType option) and to manage GPUs as generic resources (the GresTypes option), you can run scontrol show node and compare the AllocTRES and CfgTRES outputs.
This shows which resources are configured and which are currently allocated, and should reveal why job 2 is pending. Maybe job 1 used the --exclusive parameter? Maybe job 1 requested all the CPUs or all the memory? Maybe job 1 requested all the GPUs? etc.
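For example, a minimal check and a per-user request might look like the following (the node name nvidiahost and the resource amounts are illustrative assumptions; the srun options themselves are standard):

# Compare configured vs. currently allocated resources on the node:
$ scontrol show node nvidiahost | grep -E "CfgTRES|AllocTRES"

# Each user requests a single GPU and only part of the CPUs/memory,
# so the second job can still fit on the node:
$ srun --gres=gpu:1 --cpus-per-task=8 --mem=32G --pty bash

If job 1 instead asked for all CPUs or all memory (or used --exclusive), job 2 would queue even though a GPU is free.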

Sometimes get the error "err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory"

I'm very lost on how to solve my particular problem, which is why I followed the getting help guideline in the Object Detection API and made a post here on Stack Overflow.
To start off, my goal was to run distributed training jobs on Azure. I've previously used gcloud ai-platform jobs submit training with great ease to run distributed jobs, but it's a bit difficult on Azure.
I built the tf1 docker image for the Object Detection API from the dockerfile here.
I had a cluster (Azure Kubernetes Service/AKS Cluster) with the following nodes:
4x Standard_DS2_V2 nodes
8x Standard_NC6 nodes
In Azure, NC6 nodes are GPU nodes backed by a single K80 GPU each, while DS2_V2 are typical CPU nodes.
I used TFJob to configure my job with the following replica settings:
Master (limit: 1 GPU) 1 replica
Worker (limit: 1 GPU) 7 replicas
Parameter Server (limit: 1 CPU) 3 replicas
Here's my conundrum: The job fails as one of the workers throws the following error:
tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
I randomly tried reducing the number of workers, and surprisingly, the job worked. It worked only if I had 3 or fewer Worker replicas. Although it took a lot of time (a bit more than a day), the model could finish training successfully with 1 Master and 3 Workers.
This was a bit vexing as I could only use up to 4 GPUs even though the cluster had 8 GPUs allocated. I ran another test: when my cluster had 3 GPU nodes, I could only successfully run the job with 1 Master and 1 Worker! It seems like I can't fully utilize the GPUs for some reason.
Finally, I ran into another problem. The above runs were done with a very small amount of data (about 150 Mb) since they were tests. I ran a proper job later with a lot more data (about 12 GiB). Even though the cluster had 8 GPU nodes, it could only successfully do the job when there was 1 Master and 1 Worker.
Increasing the Worker replica count to more than 1 immediately caused the same cuda error as above.
I'm not sure if this is an Object Detection API issue, or if it is caused by Kubeflow/TFJob, or even if it's something Azure-specific. I've opened a similar issue on the Kubeflow page, but I'm also now seeing if I can get some guidance from the Object Detection API community. If you need any further details (like the tfjob yaml, or pipeline.config for the training) or have any questions, please let me know in the comments.
It might be related to the batch size used by the API.
Try reducing the batch size, maybe as described in this answer:
https://stackoverflow.com/a/55529875/2109287
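For reference, the batch size lives in the train_config block of the Object Detection API's pipeline.config; the value below is only an illustration of lowering it:

train_config {
  # start small and increase only while it still fits in GPU memory
  batch_size: 1
}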
This is because of insufficient GPU memory.
Try the commands below to find and kill the process that is holding the GPU, then reset it.
Hope it helps:
$ sudo fuser -v /dev/nvidia*
$ sudo kill -9 <pid>        # e.g. 12345
$ nvidia-smi --gpu-reset
:)

How to delete an instance if cpu is low?

I am running managed instance groups whose overall CPU usage is always below 30%, but if I check the instances individually, I find some running above 70% and others running as low as 15%.
Keep in mind that Managed Instance Groups don't look at individual instances when deciding whether a machine should be removed from the pool. GCP's MIGs keep a running average of the last 10 minutes of activity across all instances in the group and use that metric to make scaling decisions. You can find more details here.
Identifying instances with lower CPU usage than the group doesn't seem like the right goal here; instead, I would suggest focusing on why some machines are at 15% usage and others at 70%. How is work distributed to your instances? Are you using the right load-balancing strategy for your workload?
Maybe your applications have specific endpoints that cause large amounts of CPU usage while the majority are basic CRUD operations; having one machine generating a report and showing higher usage is fine. If all instances render HTML pages from templates and return the results, one machine doing much less work than the others is a distribution issue. Maybe you're using an RPS (requests-per-second) balancing mode when you want a CPU-utilization one.
In your use case, the best option is to create an alert notification that will notify you when an instance goes over the desired CPU usage. Once you receive the notification, you can manually delete the VM instance. As it is part of the Managed Instance Group, the VM instance will be recreated automatically.
I have attached an article on how to create an Alert notification here.
There is no metric within Stackdriver that will call the GCE API to delete a VM instance.
There is currently no such automation in place. It shouldn't be too difficult to implement it yourself, though. You can write a small script that runs on all your machines (started from cron or something similar) and monitors CPU usage. If it decides usage is too low, the instance can delete itself from the MIG (you can use e.g. gcloud compute instance-groups managed delete-instances <group> --instances=<instance>).
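A rough sketch of such a self-deleting script (the group name, zone, and 15% threshold are assumptions, and the VM's service account needs permission to call the compute API; the instance name is read from the GCE metadata server):

#!/bin/bash
# Run from cron on each instance; removes the instance from its MIG if CPU usage is low.
MIG_NAME="my-mig"        # assumption: your managed instance group
ZONE="us-central1-a"     # assumption: your zone
THRESHOLD=15             # percent CPU below which the instance deletes itself

# Whole-machine CPU usage (user + system) taken from the second sample of top.
CPU=$(top -bn2 -d 30 | grep "Cpu(s)" | tail -n1 | awk '{print int($2 + $4)}')

# This instance's name, from the GCE metadata server.
INSTANCE=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/name")

if [ "$CPU" -lt "$THRESHOLD" ]; then
  gcloud compute instance-groups managed delete-instances "$MIG_NAME" \
    --zone="$ZONE" --instances="$INSTANCE"
fi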

Ensuring one job per node for a specific rule

Hello and thank you for reviewing this question!
I'm working on an SGE cluster with 16 available worker nodes. Each has 32 cores.
I have a rule that defines a process which must run with only one instance per worker node. This means I could in theory run 16 jobs at a time. It's fine if there are other things happening on each worker node - there just can't be two jobs from this specific rule running at the same time. Is there a way to ensure this?
I have tried setting memory resources. But setting, for example,
resources:
    mem_mb=10000
and running
snakemake --resources mem_mb=10000
will only allow one job to run at a time across the whole cluster, not one job per node. Is there a way to set each individual node's memory limit? Or some other way to achieve one job per node for only a specific rule?
Thank you,
Eric

Can single CPU core work with multiple clients using Distributed Tensorflow?

In Distributed Tensorflow, we can run multiple clients working with workers in a Parameter-Server architecture, which is known as "Between-Graph Replication". According to the documentation,
Between-graph replication. In this approach, there is a separate
client for each /job:worker task, typically in the same process as the
worker task.
It says the client and worker are typically in the same process. However, if they are not in the same process, can the number of clients differ from the number of workers? Also, can multiple clients share and run on the same CPU core?
Clients are the Python programs that define a graph and initialize a session in order to run computations. When you start these programs, the created processes represent the servers in the distributed architecture.
Now it is possible to write programs that do not create a graph and do not run a session, but rather just call the server.join() method with the appropriate job name and task index. This way you could theoretically have a single client defining the whole graph and starting a session with its corresponding server.target; then, within this session, parts of the graph are automatically sent to the other processes/servers, and they will do the computations (as long as you have set which server/task is going to do what). This setup describes the in-graph replication architecture.
So it is basically possible to start several servers/processes on the same machine that has only a single CPU, but you are not going to gain much parallelism, because context switching between the running processes will slow you down. So unless the servers are doing some unrelated work, you should avoid this kind of setup.
Between-graph just means that every worker is going to have its own client and run its own session respectively.

Scheduling GPU resources using the Sun Grid Engine (SGE)

We have a cluster of machines, each with 4 GPUs. Each job should be able to ask for 1-4 GPUs. Here's the catch: I would like the SGE to tell each job which GPU(s) it should take. Unlike the CPU, a GPU works best if only one process accesses it at a time. So I would like to:
Job #1 GPU: 0, 1, 3
Job #2 GPU: 2
Job #4 wait until 1-4 GPUs are available
The problem I've run into is that SGE will let me create a GPU resource with 4 units on each node, but it won't explicitly tell a job which GPU to use (only that it gets 1, or 3, or whatever).
I thought of creating 4 resources (gpu0, gpu1, gpu2, gpu3), but am not sure if the -l flag will take a glob pattern, and can't figure out how the SGE would tell the job which gpu resources it received. Any ideas?
When you have multiple GPUs and you want your jobs to request a GPU, but you want the Grid Engine scheduler to handle and select the free GPUs, you can configure an RSMAP (resource map) complex (instead of an INT). This allows you to specify the amount as well as the names of the GPUs on a specific host in the host configuration. You can also set it up as a HOST consumable, so that independent of the slots requested, the number of GPU devices requested with -l gpu=2 is 2 per host (even if the parallel job got, e.g., 8 slots spread over different hosts).
qconf -mc
#name   shortcut   type    relop   requestable   consumable   default   urgency
#------------------------------------------------------------------------------
gpu     gpu        RSMAP   <=      YES           HOST          0         0
In the execution host configuration you can initialize your resources with ids/names (here simply GPU1 and GPU2).
qconf -me yourhost
hostname yourhost
load_scaling NONE
complex_values gpu=2(GPU1 GPU2)
Then, when requesting -l gpu=1, the Univa Grid Engine scheduler will select GPU2 if GPU1 is already used by a different job. You can see the actual selection in the qstat -j <jobid> output. The job gets the selected GPU by reading the $SGE_HGR_gpu environment variable, which in this case contains the chosen ID/name "GPU2". This can be used to access the right GPU without collisions.
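A small job-script sketch of how the granted name could be mapped to a CUDA device (the GPU1/GPU2 names come from the configuration above; mapping them to device indices 0/1 is an assumption, and my_gpu_program is a hypothetical binary):

#!/bin/bash
# Submitted with: qsub -l gpu=1 job.sh
# $SGE_HGR_gpu holds the granted resource name, e.g. "GPU2".
case "$SGE_HGR_gpu" in
  GPU1) export CUDA_VISIBLE_DEVICES=0 ;;
  GPU2) export CUDA_VISIBLE_DEVICES=1 ;;
esac
echo "Granted $SGE_HGR_gpu -> using CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
./my_gpu_program

Alternatively, the resources could simply be named 0 and 1 in complex_values, so the variable can be passed to CUDA_VISIBLE_DEVICES directly.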
If you have a multi-socket host you can even attach a GPU directly to some CPU cores near the GPU (near the PCIe bus) in order to speed up communication between GPU and CPUs. This is possible by attaching a topology mask in the execution host configuration.
qconf -me yourhost
hostname yourhost
load_scaling NONE
complex_values gpu=2(GPU1:SCCCCScccc GPU2:SccccSCCCC)
Now, when the UGE scheduler selects GPU2, it automatically binds the job to all 4 cores (C) of the second socket (S), so that the job is not allowed to run on the first socket. This does not even require the -binding qsub parameter.
You can find more configuration examples at www.gridengine.eu.
Note that all these features are only available in Univa Grid Engine (8.1.0/8.1.3 and higher), not in SGE 6.2u5 or other Grid Engine versions (like OGE, Sun of Grid Engine, etc.). You can try it out by downloading the 48-core limited free version from univa.com.
If you are using one of the other grid engine variants you can try adapting the scripts we use on our cluster:
https://github.com/UCL/Grid-Engine-Prolog-Scripts