I have a cluster with NVIDIA GPU nodes and have deployed DC/OS 1.8. I'd like to be able to schedule jobs (batch and Spark) on the GPU nodes using GPU isolation.
DC/OS 1.8 is based on Mesos 1.0.1, which supports GPU isolation.
Unfortunately, DC/OS doesn't officially support GPUs in 1.8 (experimental support for GPUs will be coming in the next release as mentioned here: https://github.com/dcos/dcos/pull/766 ).
In this next release, only Marathon will officially be able to launch GPU services (Metronome (i.e. batch jobs) will not).
Regarding Spark, the Spark version bundled in the Universe probably doesn't have GPU support for Mesos built in yet. Spark itself has it coming soon, though: https://github.com/apache/spark/pull/14644
To enable GPU resources in a DC/OS cluster, the following steps are needed:
1. Configure the Mesos agents on the GPU nodes:
1.1. Stop dcos-mesos-slave.service:
systemctl stop dcos-mesos-slave.service
1.2. Add the following parameters to the /var/lib/dcos/mesos-slave-common file:
# A comma-separated list of GPU device IDs, as determined by running nvidia-smi on the host where the agent will run
MESOS_NVIDIA_GPU_DEVICES="0,1"
# The value of the gpus resource must match the number of GPU IDs listed above
MESOS_RESOURCES= [ {"name":"ports","type":"RANGES","ranges": {"range": [{"begin": 1025, "end": 2180},{"begin": 2182, "end": 3887},{"begin": 3889, "end": 5049},{"begin": 5052, "end": 8079},{"begin": 8082, "end": 8180},{"begin": 8182, "end": 32000}]}} ,{"name": "gpus","type": "SCALAR","scalar": {"value": 2}}]
MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume,cgroups/devices,gpu/nvidia
1.3. Start dcos-mesos-slave.service:
systemctl start dcos-mesos-slave.service
2. Enable the GPU_RESOURCES capability in the Mesos frameworks:
2.1. The Marathon framework should be launched with the option (see the sanity-check sketch after the note below):
--enable_features "gpu_resources"
2.2. The Aurora scheduler should be launched with the option -allow_gpu_resource.
Note:
Any host running a Mesos agent with Nvidia GPU support MUST have a valid Nvidia kernel driver installed. It is also highly recommended to install the corresponding user-level libraries and tools available as part of the Nvidia CUDA toolkit. Many jobs that use Nvidia GPUs rely on CUDA and not including it will severely limit the type of GPU-aware jobs you can run on Mesos.
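As a quick sanity check after these steps, something like the sketch below can confirm that the restarted agent now advertises a gpus resource and that Marathon accepts a GPU app. This is only a sketch: the agent and Marathon addresses, the app id, and its command are my assumptions, and the gpus field is only accepted once Marathon runs with --enable_features "gpu_resources" as in step 2.1 above.
# Sanity-check sketch (Python 3, standard library only). Hostnames, ports and
# the app payload are placeholders -- adjust them for your cluster.
import json
import urllib.request

AGENT = "http://gpu-agent.example.com:5051"    # assumed Mesos agent address
MARATHON = "http://marathon.example.com:8080"  # assumed Marathon address

# 1. The restarted agent should now report a "gpus" scalar among its resources.
with urllib.request.urlopen(AGENT + "/state") as resp:
    state = json.load(resp)
print("agent resources:", state.get("resources", {}))

# 2. A minimal Marathon app asking for one GPU.
app = {
    "id": "/gpu-smoke-test",
    "cmd": "nvidia-smi && sleep 300",
    "cpus": 0.1,
    "mem": 128,
    "gpus": 1,
    "instances": 1,
}
req = urllib.request.Request(
    MARATHON + "/v2/apps",
    data=json.dumps(app).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print("Marathon responded with HTTP", resp.status)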
Related
I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 automatically installed. When I start the instance, everything works great; I can run:
import tensorflow as tf
tf.config.list_physical_devices()
The call returns the CPU, the accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), it returns True.
However, if I log out, stop the instance, and then restart it, the same code only sees the CPU, and tf.test.is_gpu_available() returns False. I get an error that looks like driver initialization is failing:
E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can't see it.
Does anyone know what could be causing this? I don't want to have to reinstall everything every time I restart the instance.
Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:
import os
# This must run before TensorFlow initializes CUDA (e.g. before importing it), or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can configure your system settings on NVIDIA's website and it will provide the commands you need to run to install CUDA. It also asks whether you want to uninstall the previous CUDA version (yes!). This is luckily also very fast.
I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
Option 1:
Upgrade the Notebooks instance's environment. Refer to the link to upgrade.
Notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.
Option 2:
Connect to the notebook VM via SSH and run the commands from the link.
After running the commands, the CUDA version will be updated to 11.3 and the NVIDIA driver to version 465.19.01.
Restart the notebook VM.
Note: The issue has been fixed in the GPU images, and new notebooks will be created with image version M74. The new image version is not yet mentioned in the Google public issue tracker, but you can find image version M74 in the console.
I am interested in the RDMA support in TensorFlow 1.15, which lets workers and parameter servers communicate directly without going through the CPU. I do not have InfiniBand VERBS devices, but I can build TensorFlow from source with VERBS support:
bazel build --config=opt --config=cuda --config=verbs //tensorflow/tools/pip_package:build_pip_package
after sudo yum install libibverbs-devel on CentOS 7. However, after pip-installing the built package via
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && pip install /tmp/tensorflow_pkg/tensorflow-1.15.0-cp36-cp36m-linux_x86_64.whl,
my training failed with the following error:
F tensorflow/contrib/verbs/rdma.cc:127] Check failed: dev_list No InfiniBand device found
This is expected, since I do not have InfiniBand hardware on my machine. But do I really need InfiniBand if my job is not run across machines but on a single machine? I just want to test whether RDMA can significantly speed up parameter-server-based training. Thanks.
But do I really need InfiniBand if my job is not run across machines but on a single machine?
No, and it seems you misunderstand what RDMA actually is. RDMA (in this context, GPUDirect RDMA) is a way for third-party devices such as network interfaces and storage adapters to write directly to a GPU's memory across the PCI bus. It is intended to improve multi-node performance in things like compute clusters. It has nothing to do with multi-GPU operations within a single node ("peer-to-peer"), where GPUs connected to one node can directly access one another's memory without trips through the host CPU.
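Since everything runs on one machine, the verbs transport buys nothing there; the stock grpc protocol is enough to test parameter-server training. Below is a minimal sketch in the TensorFlow 1.x API; the ports are arbitrary and both tasks are started in one process purely for illustration.
# Minimal single-machine parameter-server sketch (TensorFlow 1.15).
# Uses the default "grpc" protocol, so no InfiniBand/verbs hardware is needed.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],      # parameter-server task
    "worker": ["localhost:2223"],  # worker task
})

# Both servers live in this one process just to keep the example self-contained.
ps = tf.train.Server(cluster, job_name="ps", task_index=0, protocol="grpc")
worker = tf.train.Server(cluster, job_name="worker", task_index=0, protocol="grpc")

with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([10]), name="weights")   # stored on the PS

with tf.device("/job:worker/task:0"):
    update = weights.assign_add(tf.ones([10]))               # computed on the worker

with tf.Session(worker.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))  # the tensor travels PS <-> worker over grpc
Swapping protocol="grpc" for "grpc+verbs" (with a verbs-enabled build) is what actually requires InfiniBand hardware, and it only changes how the processes talk to each other over the network.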
I've just started an instance on Google Compute Engine with 2 GPUs (NVIDIA Tesla K80), and straight away after the start I can see via nvidia-smi that one of them is already fully utilized.
I've checked the list of running processes and there is nothing running at all. Does that mean Google has rented out the same GPU to someone else?
It's all running on this machine:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial
Enabling "persistence mode" with nvidia-smi -pm 1 might solve the problem.
ECC in combination with non-persistence mode can lead to 100% reported GPU utilization.
Alternatively you can disable ECC with nvidia-smi -e 0.
Note: I'm not sure whether performance is actually worse. I remember that I was able to train an ML model despite the reported 100% GPU utilization, but I don't know if it was slower.
I would suggest reporting this issue on the Google Issue Tracker so that it can be investigated. Please provide your project number and instance name there. Please follow this URL, which lets you file the issue as private in the Google Issue Tracker.
I have a CPU with an integrated GPU. I also have an external GPU that I have been using for ML. What I want is to use the integrated GPU only for display and dedicate the external GPU to NN training (in order to free some memory).
In the BIOS I have set the external GPU to be the primary GPU, but configured both to be active, so they are both working. After I boot the system I can plug the monitor into either of them and both work.
The problem is that when I plug the monitor into the motherboard (integrated GPU), Theano stops using the external GPU:
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
Is there a way to explicitly point Theano to the external GPU? Here is the relevant part of my .theanorc:
[global]
floatX = float32
device = gpu
I have a similar system to yours. For Linux, installing Bumblebee worked:
sudo apt-get install bumblebee-nvidia
(adapt to your distro's package manager)
Then launch python via:
optirun python
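To confirm that the discrete card is actually picked up when launching through optirun, a quick check along the lines of the classic Theano GPU test can help. This is only a sketch for the old CUDA backend (device = gpu, as in the .theanorc above); save it as check_gpu.py and run it with optirun python check_gpu.py.
# check_gpu.py -- minimal Theano GPU check (old CUDA backend).
# If the compiled graph contains only Elemwise ops, Theano fell back to the CPU.
import numpy
import theano
import theano.tensor as T

x = theano.shared(numpy.random.rand(1000).astype(theano.config.floatX))
f = theano.function([], T.exp(x))
f()  # force compilation and one evaluation

used_cpu = all(isinstance(node.op, T.Elemwise)
               for node in f.maker.fgraph.toposort())
print("configured device:", theano.config.device)
print("ran on:", "CPU" if used_cpu else "GPU")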
I'm trying to use Singularity to schedule tasks that use CUDA on a Mesos cluster. Mesos agents do seem to offer GPU resources that frameworks can make use of, but it seems the frameworks need to be flagged as GPU consumers.
Is this an option that is supported by Singularity? And if not, is there an alternative Mesos framework that is GPU-aware and can launch tasks other than long-running ones?
Right now Singularity does not support GPU resources.
Only three frameworks support GPU resources:
Marathon (services)
Aurora (batch jobs, services, etc.)
Spark (Map/Reduce jobs; see the PySpark sketch below)
Chronos has just been revived; support for GPU resources will be added soon.
Metronome (there is an unofficial Docker image) will support GPU resources soon as well.
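For Spark specifically, GPUs on Mesos are requested through a configuration property rather than a scheduler launch flag. Here is a minimal PySpark sketch, assuming a Spark version with Mesos GPU support (2.1 or later); the master URL is a placeholder.
# Sketch: requesting GPUs from Mesos in a PySpark job.
# spark.mesos.gpus.max caps how many GPUs the job may acquire from Mesos.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gpu-smoke-test")
        .setMaster("mesos://zk://master.example.com:2181/mesos")  # placeholder
        .set("spark.mesos.gpus.max", "1"))

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # trivial job just to exercise the cluster
sc.stop()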