Scheduling GPU resources using the Sun Grid Engine (SGE)

We have a cluster of machines, each with 4 GPUs. Each job should be able to ask for 1-4 GPUs. Here's the catch: I would like the SGE to tell each job which GPU(s) it should take. Unlike the CPU, a GPU works best if only one process accesses it at a time. So I would like something like:
Job #1: GPUs 0, 1, 3
Job #2: GPU 2
Job #4: wait until 1-4 GPUs are available
The problem I've run into is that the SGE will let me create a GPU resource with 4 units on each node, but it won't explicitly tell a job which GPU to use (only that it gets 1, or 3, or whatever).
I thought of creating 4 resources (gpu0, gpu1, gpu2, gpu3), but I'm not sure whether the -l flag will take a glob pattern, and I can't figure out how the SGE would tell the job which GPU resources it received. Any ideas?

When you have multiple GPUs and you want your jobs to request a GPU while letting the Grid Engine scheduler handle and select the free GPUs, you can configure an RSMAP (resource map) complex (instead of an INT). This allows you to specify the amount as well as the names of the GPUs on a specific host in the host configuration. You can also set it up as a HOST consumable, so that independent of the slots you request, the number of GPU devices requested with -l gpu=2 is 2 per host (even if the parallel job got, e.g., 8 slots on different hosts).
qconf -mc
#name    shortcut   type    relop   requestable   consumable   default   urgency
#--------------------------------------------------------------------------------
gpu      gpu        RSMAP   <=      YES           HOST         0         0
In the execution host configuration you can initialize your resources with ids/names (here simply GPU1 and GPU2).
qconf -me yourhost
hostname yourhost
load_scaling NONE
complex_values gpu=2(GPU1 GPU2)
Then when requesting -l gpu=1 the Univa Grid Engine scheduler will select GPU2 if GPU1 is already in use by a different job. You can see the actual selection in the qstat -j output. The job gets the selected GPU by reading the $SGE_HGR_gpu environment variable, which in this case contains the chosen id/name "GPU2". This can be used to access the right GPU without collisions.
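For illustration, a minimal job script sketch that reads the granted resource and restricts CUDA to it. The mapping from the ids/names configured above (GPU1, GPU2) to CUDA device indices is an assumption for this sketch, and my_cuda_app is a placeholder for your program:
#!/bin/bash
#$ -l gpu=1
# $SGE_HGR_gpu holds the granted id(s)/name(s), e.g. "GPU2" (space-separated if more than one)
echo "Granted GPU(s): $SGE_HGR_gpu"
devices=""
for g in $SGE_HGR_gpu; do
    idx=$(( ${g#GPU} - 1 ))               # assumed convention: GPU1 -> device 0, GPU2 -> device 1
    devices="${devices:+$devices,}$idx"
done
export CUDA_VISIBLE_DEVICES="$devices"    # hide all other GPUs from the application
./my_cuda_app                             # placeholder for your program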
If you have a multi-socket host you can even attach a GPU directly to some CPU cores near the GPU (near the PCIe bus) in order to speed up communication between GPU and CPUs. This is possible by attaching a topology mask in the execution host configuration.
qconf -me yourhost
hostname yourhost
load_scaling NONE
complex_values gpu=2(GPU1:SCCCCScccc GPU2:SccccSCCCC)
Now when the UGE scheduler selects GPU2 it automatically binds the job to all 4 cores (C) of the second socket (S) so that the job is not allowed to run on the first socket. This does not even require the -binding qsub param.
You can find more configuration examples at www.gridengine.eu.
Note that all these features are only available in Univa Grid Engine (8.1.0/8.1.3 and higher), not in SGE 6.2u5 or other Grid Engine versions (such as OGE, Son of Grid Engine, etc.). You can try it out by downloading the 48-core limited free version from univa.com.

If you are using one of the other grid engine variants you can try adapting the scripts we use on our cluster:
https://github.com/UCL/Grid-Engine-Prolog-Scripts

Related

On an NVIDIA host with 2 GPUs, how can two remote users use one gpu each by srun command under SLURM

I have an NVIDIA host with 2 GPUs, and there are two different remote users who need to use a GPU on that host. When each one executes their tasks via srun, managed by SLURM, one of them gets the GPU resources allocated immediately, while the other stays in the queue waiting for resources. But there are two GPUs, so why doesn't each user get one?
I have already tried several alternative parameters, but it seems that when using srun interactively, the user who manages to start their job holds the whole machine until the job finishes.
Assuming Slurm is correctly configured to allow node sharing (SelectType option) and to manage GPUs as generic resources (GresTypes option), you could use scontrol show node and compare the AllocTRES and CfgTRES outputs.
This would show what resources are allocated versus available and help find out why job 2 is pending. Maybe job 1 used the --exclusive parameter? Maybe job 1 requested all the CPUs or all the memory? Maybe job 1 requested all GPUs? etc.
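As a sketch of both sides of that, assuming GPUs are declared as generic resources (gres.conf / GresTypes=gpu): each user should request only one GPU and a slice of the node rather than the whole node, and you can then compare configured versus allocated resources on the node (the node name, CPU and memory values are placeholders):
$ srun --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --mem=16G --pty bash   # ask for one GPU, not the whole node
$ scontrol show node gpunode01 | grep -E 'CfgTRES|AllocTRES'            # compare configured vs. allocated resources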

Sometimes get the error "err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory"

I'm very lost on how to solve my particular problem, which is why I followed the getting help guideline in the Object Detection API and made a post here on Stack Overflow.
To start off, my goal was to run distributed training jobs on Azure. I've previously used gcloud ai-platform jobs submit training with great ease to run distributed jobs, but it's a bit difficult on Azure.
I built the tf1 docker image for the Object Detection API from the dockerfile here.
I had a cluster (Azure Kubernetes Service/AKS Cluster) with the following nodes:
4x Standard_DS2_V2 nodes
8x Standard_NC6 nodes
In Azure, NC6 nodes are GPU nodes backed by a single K80 GPU each, while DS2_V2 are typical CPU nodes.
I used TFJob to configure my job with the following replica settings:
Master (limit: 1 GPU) 1 replica
Worker (limit: 1 GPU) 7 replicas
Parameter Server (limit: 1 CPU) 3 replicas
Here's my conundrum: the job fails as one of the workers throws the following error:
tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
I randomly tried reducing the number of workers, and surprisingly, the job worked, but only if I had 3 or fewer Worker replicas. Although it took a lot of time (a bit more than a day), the model finished training successfully with 1 Master and 3 Workers.
This was a bit vexing, as I could only use up to 4 GPUs even though the cluster had 8 GPUs allocated. I ran another test: when my cluster had 3 GPU nodes, I could only successfully run the job with 1 Master and 1 Worker! It seems I can't fully utilize the GPUs for some reason.
Finally, I ran into another problem. The above runs were done with a very small amount of data (about 150 MB) since they were tests. I later ran a proper job with a lot more data (about 12 GiB). Even though the cluster had 8 GPU nodes, it could only successfully do the job with 1 Master and 1 Worker.
Increasing the Worker replica count to more than 1 immediately caused the same cuda error as above.
I'm not sure if this is an Object Detection API issue, whether it is caused by Kubeflow/TFJob, or even if it's something Azure-specific. I've opened a similar issue on the Kubeflow page, but I'm also now seeing if I can get some guidance from the Object Detection API community. If you need any further details (like the tfjob yaml, or pipeline.config for the training) or have any questions, please let me know in the comments.
It might be related to the batch size used by the API.
Try to control the batch size, maybe as described in this answer:
https://stackoverflow.com/a/55529875/2109287
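If you go down that route, the batch size for the Object Detection API is set in the train_config block of your pipeline.config; a sketch (batch_size: 1 is just a conservative starting point for a single K80 per worker):
train_config {
  batch_size: 1
  # ... keep the rest of your training configuration unchanged
}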
This is because of insufficient GPU memory. Try the commands below to find and kill the processes holding the GPU, then reset it; hope it helps:
$ sudo fuser -v /dev/nvidia*          # list the processes currently using the GPU devices
$ sudo kill -9 pid_no                 # e.g. 12345, taken from the fuser output above
$ sudo nvidia-smi --gpu-reset         # may also need -i <gpu_id> to target a specific GPU
:)

Redis stream 50k consumer support parallel - capacity requirement

What are the Redis capacity requirements to support 50k consumers within one consumer group consuming and processing messages in parallel? I'm looking to test an infrastructure for this scenario and need to understand the considerations.
Disclaimer: I worked at a company that used Redis at a somewhat large scale (probably fewer consumers than your case, but our consumers were very active); I wasn't on the infrastructure team, but I was involved in some DevOps tasks.
I don't think you will find an exact number, so I'll try to share some tips and tricks to help you:
Be sure to read the entire Redis Admin page. There's a lot of useful information there. I'll highlight some of the tips from there:
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set a high net.core.somaxconn (RabbitMQ suggests 4096). Check the documentation of tcp-backlog config in redis.conf for an explanation about this.
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set vm.overcommit_memory = 1. Read below for a detailed explanation.
Assuming you'll set up a Linux host, edit /etc/sysctl.conf and set fs.file-max. This is very important for your use case. The open file handles / file descriptors limit is essentially the maximum number of file descriptors (each client represents a file descriptor) the OS can handle. Please check the Redis documentation on this. The RabbitMQ documentation also presents some useful information about it.
If you edit the /etc/sysctl.conf file, run sysctl -p to reload it.
"Make sure to disable Linux kernel feature transparent huge pages, it will affect greatly both memory usage and latency in a negative way. This is accomplished with the following command: echo never > /sys/kernel/mm/transparent_hugepage/enabled." Add this command also to /etc/rc.local to make it permanent over reboot.
In my experience Redis is not very resource-hungry, so I believe you won't have issues with CPU. Memory usage is directly related to how much data you intend to store.
If you set up a server with many cores, consider using more than one Redis Server. Redis is (mostly) single-threaded and will not use all your CPU resources if you use a single instance in a multicore environment.
The Redis server also warns about wrong/risky configurations in its startup log, so keep an eye on that output.
Explanation on Overcommit Memory (vm.overcommit_memory)
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis [from the Redis FAQ]
There are three possible settings for vm.overcommit_memory.
0 (zero): Heuristic overcommit (the default). The kernel estimates whether enough memory is available and allows the allocation if so; clearly excessive requests are denied and an error is returned to the application.
1 (one): Always overcommit. The kernel returns success for every allocation request without checking whether memory is actually available. This is the setting Redis recommends, so that background saves (which fork the process) don't fail on hosts where the dataset approaches the size of physical RAM.
2 (two): Never overcommit. Committed memory is limited to swap plus vm.overcommit_ratio percent of physical RAM. For instance, with vm.overcommit_ratio at its default of 50 and 1 GB of RAM, the kernel permits roughly 0.5 GB plus swap to be allocated before requests start failing.
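For reference, a quick way to inspect and change the setting at runtime (persist it in /etc/sysctl.conf as noted above):
$ cat /proc/sys/vm/overcommit_memory      # show the current mode (0, 1 or 2)
$ sudo sysctl -w vm.overcommit_memory=1   # switch to "always overcommit" without rebooting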

Dask Yarn failed to allocate number of workers

We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8 with skein 0.8.
Currently we are having a problem allocating the maximum number of workers.
We are not able to run a job with more than 18 workers! (We can see the actual number of workers in the dask dashboard.)
The definition of the cluster is as follows:
cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
Even when increasing the number of workers to 50 nothing changes, although when changing the worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
Update
Following @jcrist's answer I realized that I didn't fully understand the terminology mapping between the YARN web UI application dashboard and the YarnCluster parameters.
From my understanding:
A YARN container corresponds to a dask worker.
Whenever a YARN cluster is created there are 2 additional containers running (one for the scheduler and one for a logger, each with 1 vCore).
The relation between n_workers * worker_vcores and n_workers * worker_memory is something I still need to fully grok.
There is another issue: while optimizing I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a limit of 100 GB, but the YARN web UI showed only 2 containers running (my cluster has 384 vCores and 1.9 TB, so there is still plenty of room to expand). Probably worth opening a different question.
There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on state REQUESTED to see just the pending containers
>>> cluster.application_client.get_containers(states=['REQUESTED'])
This should give you some insight into what has been requested but not yet allocated.
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.
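If you do end up filing an issue, and assuming the YARN CLI is available on an edge node, the application master logs mentioned above can usually be pulled with something like the following (the application id is a placeholder; the real one appears in the YARN web UI and in the dask-yarn startup output):
$ yarn logs -applicationId application_1234567890123_0001 > dask-yarn-am.log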

Can VMs on Google Compute detect when they've been migrated?

Is it possible to notify an application running on a Google Compute VM when the VM migrates to different hardware?
I'm a developer for an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best set.
We've been looking at running our program on Google Compute and other cloud engines, and one concern is that, if a VM migrates from one physical machine to another while running our program, the new machine might support different instructions, causing our program to either crash or execute more slowly than it could.
Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates, which would kill any currently-executing programs but would at least let the user know that they needed to restart the program.
We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.
However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that e.g. your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.
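For example, a sketch of creating an instance pinned to Skylake or newer (the instance name and zone are placeholders):
$ gcloud compute instances create hmmer-avx512-test \
      --zone us-central1-b \
      --min-cpu-platform "Intel Skylake"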
As per the Live Migration docs:
Live migration does not change any attributes or properties of the VM
itself. The live migration process just transfers a running VM from
one host machine to another. All VM properties and attributes remain
unchanged, including things like internal and external IP addresses,
instance metadata, block storage data and volumes, OS and application
state, network settings, network connections, and so on.
Google does provide a few controls to set the instance availability policies, which also let you control aspects of live migration. They also mention what you can look for to determine when live migration has taken place.
Live migrate
By default, standard instances are set to live migrate, where Google
Compute Engine automatically migrates your instance away from an
infrastructure maintenance event, and your instance remains running
during the migration. Your instance might experience a short period of
decreased performance, although generally most instances should not
notice any difference. This is ideal for instances that require
constant uptime, and can tolerate a short period of decreased
performance.
When Google Compute Engine migrates your instance, it reports a system
event that is published to the list of zone operations. You can review
this event by performing a gcloud compute operations list --zones ZONE
request or by viewing the list of operations in the Google Cloud
Platform Console, or through an API request. The event will appear
with the following text:
compute.instances.migrateOnHostMaintenance
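For example, a sketch of listing those migration events for one zone (the zone is a placeholder, and the --filter expression assumes the standard gcloud list filtering syntax):
$ gcloud compute operations list --zones us-central1-b \
      --filter="operationType=compute.instances.migrateOnHostMaintenance"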
In addition, you can detect directly on the VM when a maintenance event is about to happen.
Getting Live Migration Notices
The metadata server provides information about an instance's
scheduling options and settings, through the scheduling/
directory and the maintenance-event attribute. You can use these
attributes to learn about a virtual machine instance's scheduling
options, and use this metadata to notify you when a maintenance event
is about to happen through the maintenance-event attribute. By
default, all virtual machine instances are set to live migrate so the
metadata server will receive maintenance event notices before a VM
instance is live migrated. If you opted to have your VM instance
terminated during maintenance, then Compute Engine will automatically
terminate and optionally restart your VM instance if the
automaticRestart attribute is set. To learn more about maintenance
events and instance behavior during the events, read about scheduling
options and settings.
You can learn when a maintenance event will happen by querying the
maintenance-event attribute periodically. The value of this
attribute will change 60 seconds before a maintenance event starts,
giving your application code a way to trigger any tasks you want to
perform prior to a maintenance event, such as backing up data or
updating logs. Compute Engine also offers a sample Python script
to demonstrate how to check for maintenance event notices.
You can use the maintenance-event attribute with the waiting for
updates feature to notify your scripts and applications when a
maintenance event is about to start and end. This lets you automate
any actions that you might want to run before or after the event. The
following Python sample provides an example of how you might implement
these two features together.
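As a concrete sketch of that polling from inside the VM (the endpoint and Metadata-Flavor header are the standard metadata server interface; wait_for_change=true makes the request block until the value changes):
$ curl -s -H "Metadata-Flavor: Google" \
      "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event?wait_for_change=true"
# normally returns NONE; switches to MIGRATE_ON_HOST_MAINTENANCE roughly 60 seconds before a live migration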
You can also choose to terminate and optionally restart your instance.
Terminate and (optionally) restart
If you do not want your instance to live migrate, you can choose to
terminate and optionally restart your instance. With this option,
Google Compute Engine will signal your instance to shut down, wait for
a short period of time for your instance to shut down cleanly,
terminate the instance, and restart it away from the maintenance
event. This option is ideal for instances that demand constant,
maximum performance, and your overall application is built to handle
instance failures or reboots.
Look at the Setting availability policies section for more details on how to configure this.
If you use an instance with a GPU or a preemptible instance be aware that live migration is not supported:
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. They must be set
to terminate and optionally restart. Compute Engine offers a 60 minute
notice before a VM instance with a GPU attached is terminated. To
learn more about these maintenance event notices, read Getting live
migration notices.
To learn more about handling host maintenance with GPUs, read
Handling host maintenance on the GPUs documentation.
Live migration for preemptible instances
You cannot configure preemptible instances to live migrate. The
maintenance behavior for preemptible instances is always set to
TERMINATE by default, and you cannot change this option. It is also
not possible to set the automatic restart option for preemptible
instances.
As Ramesh mentioned, you can specify the minimum CPU platform to ensure you are only migrated to hosts that have at least the minimum CPU platform you specified. At a high level it works as follows:
In summary, when you specify a minimum CPU platform:
Compute Engine always uses the minimum CPU platform where available.
If the minimum CPU platform is not available or the minimum CPU platform is older than the zone default, and a newer CPU platform is
available for the same price, Compute Engine uses the newer platform.
If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the
server returns a 400 error indicating that the CPU is unavailable.