I'm trying to use Singularity to schedule tasks that use CUDA on a Mesos cluster. Mesos slaves do seem to support GPU resources that frameworks can make use of, but it seems the frameworks need to be flagged as GPU consumers.
Is this an option that Singularity supports? And if not, is there an alternative Mesos framework that is GPU-aware and can launch tasks other than long-running ones?
Right now Singularity does not support GPU resources (frameworks have to opt in via the GPU_RESOURCES capability; see the sketch after the list below).
Only three frameworks currently support GPU resources:
Marathon (services)
Aurora (batch jobs, services, etc)
Spark (Map/Reduce jobs)
Chronos has just been reborn; support for GPU resources should be added soon.
Metronome (there is an unofficial Docker image) is expected to support GPU resources soon as well.
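For reference, the "flag" in question is the GPU_RESOURCES capability that a framework advertises when it registers with the Mesos master. A rough sketch of just that part, assuming the mesos.interface Python protobuf bindings are installed (this is not a complete scheduler):

# gpu_framework_info.py - sketch of a framework advertising the GPU_RESOURCES capability
from mesos.interface import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                    # empty string lets Mesos fill in the current user
framework.name = "my-gpu-framework"    # placeholder name

# Without this capability the master will not include gpus in the resource offers
# it sends to the framework.
capability = framework.capabilities.add()
capability.type = mesos_pb2.FrameworkInfo.Capability.GPU_RESOURCES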
Related
Every time I need to train a 'large' deep learning model I do it from Google Colab, as it allows you to use GPU acceleration.
My PC has a dedicated GPU, and I was wondering whether it is possible to use it to run my notebooks locally at a reasonable speed. Is it possible to train models using my PC's GPU? If so, how?
I am open to working with DataSpell, VS Code, or any other IDE.
Nicholas Renotte has a great 'Getting Started' video that goes through the entire process of setting up GPU-accelerated notebooks on your PC. The part you're interested in starts around the 12-minute mark.
Yes, it is possible to run .ipynb notebooks locally using GPU acceleration. To do so, you will need to install the necessary libraries and frameworks such as TensorFlow, PyTorch, or Keras. Depending on the IDE you choose, you will need to install the relevant plugins and packages for GPU acceleration.
In terms of IDEs, DataSpell, VSCode, PyCharm, and Jupyter Notebook are all suitable for running notebooks locally with GPU acceleration.
Once the necessary libraries and frameworks are installed, you will also need the appropriate driver for your GPU (and, for NVIDIA cards, a compatible CUDA/cuDNN stack) and an environment configured for GPU acceleration; in practice the driver and CUDA toolkit are usually installed first.
Finally, you may need to adjust the notebook so the framework actually places work on the GPU (most frameworks pick up an available GPU by default) and, if you have several GPUs, specify which ones to use. Once these steps are done, you will be able to run the notebook locally with GPU acceleration.
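As a quick sanity check once everything is installed, a short snippet can confirm that the framework actually sees your GPU. This sketch assumes both TensorFlow and PyTorch are installed with GPU support; keep only the part for the framework you use:

# check_gpu.py - verify that the local GPU is visible to the framework
import tensorflow as tf
import torch

# TensorFlow: lists the physical GPU devices visible to the runtime
print("TensorFlow GPUs:", tf.config.list_physical_devices("GPU"))

# PyTorch: reports whether CUDA is usable and which device it found
print("PyTorch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("PyTorch device:", torch.cuda.get_device_name(0))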
Here is my scenario: I have a Windows VM with two Mule runtimes installed on it (Mule1 and Mule2).
Now, if I have to give 60% of the VM's CPU to Mule1 and 40% to Mule2, how can that be done?
Is that even possible?
There is a concept called CPU affinity, which applies when you have more than one core or CPU: the operating system you are using has tools for assigning cores to a process. However, I'm not aware of an out-of-the-box feature to assign or limit a percentage of CPU usage per process.
Linux:
You can use the taskset command to set which cores to assign to the Mule process.
Example:
taskset -c 0,1 ./mule
Source: https://help.mulesoft.com/s/article/How-to-set-CPU-affinity-for-Mule-ESB-process
Windows:
In the Task Manager, you can right-click the java.exe and wrapper-windows-x86-64.exe processes, select "Set affinity" and choose the processors.
This article has PowerShell commands to do the same from the command line: https://help.mulesoft.com/s/article/How-to-set-CPU-affinity-for-a-Mule-ESB-process-in-Windows-as-a-Service
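If you would rather set the affinity programmatically, a cross-platform sketch using the psutil library is shown below. The PIDs and the 60/40 core split are illustrative assumptions; note that affinity only approximates a percentage split by giving each runtime a share of the cores:

# pin_mule.py - approximate a 60/40 split by dividing the cores between the two runtimes
# Assumes psutil is installed and that the two PIDs are the Mule wrapper/java processes.
import psutil

MULE1_PID = 1234  # placeholder PID for the Mule1 process
MULE2_PID = 5678  # placeholder PID for the Mule2 process

cores = list(range(psutil.cpu_count(logical=True)))
split = max(1, int(len(cores) * 0.6))  # roughly 60% of the cores for Mule1

psutil.Process(MULE1_PID).cpu_affinity(cores[:split])  # Mule1 gets the first ~60% of cores
psutil.Process(MULE2_PID).cpu_affinity(cores[split:])  # Mule2 gets the remaining cores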
It is a completely different topic, but Docker allows something similar per container; see the sketch below.
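For completeness, if the runtimes were containerized, Docker's CPU quota settings give you an actual percentage-style limit. A sketch using the Docker SDK for Python (the image names are placeholders, and the quotas below mean 60% and 40% of a single core; scale them by the core count to cover the whole VM):

# limit_containers.py - percentage-style CPU limits per container (hypothetical images)
import docker

client = docker.from_env()

# cpu_quota / cpu_period is the fraction of CPU time the container may use:
# 60000 / 100000 = 0.6 of one core.
client.containers.run("mule-runtime:mule1", name="mule1", detach=True,
                      cpu_period=100000, cpu_quota=60000)
client.containers.run("mule-runtime:mule2", name="mule2", detach=True,
                      cpu_period=100000, cpu_quota=40000)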
I am interested in the RDMA support in TensorFlow 1.15 for workers and parameter servers to communicate directly without going through the CPU. I do not have InfiniBand/verbs devices, but I can build TensorFlow from source with verbs support:
bazel build --config=opt --config=cuda --config=verbs //tensorflow/tools/pip_package:build_pip_package
after sudo yum install libibverbs-devel on CentOS 7. However, after pip-installing the built package via
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg && pip install /tmp/tensorflow_pkg/tensorflow-1.15.0-cp36-cp36m-linux_x86_64.whl,
my training failed with the following error:
F tensorflow/contrib/verbs/rdma.cc:127] Check failed: dev_list No InfiniBand device found
This is expected since I do not have InfiniBand hardware on my machine. But do I really need InfiniBand if my job runs not across machines but on a single machine? I just want to test whether RDMA can significantly speed up parameter-server-based training. Thanks.
But do I really need InfiniBand if my job runs not across machines but on a single machine?
No, and it seems you misunderstand what RDMA actually is. RDMA (as in GPUDirect RDMA) is a way for third-party devices like network interfaces and storage adapters to write directly to a GPU's memory across the PCI bus. It is intended to improve multi-node performance in things like compute clusters. It has nothing to do with multi-GPU operations within a single node ("peer-to-peer"), where GPUs connected to one node can directly access one another's memory without trips through the host CPU.
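If the goal is just to benchmark parameter-server-style training on one machine, the verbs build isn't needed at all: the PS and worker can run on localhost over the default grpc protocol (the verbs build would use grpc+verbs instead). A minimal sketch with the TF 1.x API, with arbitrary ports and a one-PS/one-worker layout chosen for illustration:

# single_machine_ps.py - parameter server + worker on one host over plain gRPC (no RDMA)
import tensorflow as tf  # TensorFlow 1.15

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})

# Both tasks run inside this one process; protocol="grpc" is the default,
# whereas an RDMA setup would pass protocol="grpc+verbs".
ps = tf.train.Server(cluster, job_name="ps", task_index=0, protocol="grpc")
worker = tf.train.Server(cluster, job_name="worker", task_index=0, protocol="grpc")

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))       # variable placed on the parameter server

with tf.device("/job:worker/task:0"):
    update = w.assign_add(tf.ones([10]))  # worker op that reads and writes the PS variable

with tf.Session(worker.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))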
I have a cluster with GPU nodes (NVIDIA) and have deployed DC/OS 1.8. I'd like to be able to schedule jobs (batch and Spark) on the GPU nodes using GPU isolation.
DC/OS 1.8 is based on Mesos 1.0.1, which supports GPU isolation.
Unfortunately, DC/OS doesn't officially support GPUs in 1.8 (experimental support for GPUs will be coming in the next release, as mentioned here: https://github.com/dcos/dcos/pull/766).
In that next release, only Marathon will officially be able to launch GPU services (Metronome, i.e. batch jobs, will not).
Regarding Spark, the Spark version bundled with Universe probably doesn't have GPU support for Mesos built in yet. Spark itself has it coming soon, though: https://github.com/apache/spark/pull/14644
In order to enable GPU resources in a DC/OS cluster, the following steps are needed (a quick verification sketch follows the note at the end):
Configure the Mesos agents on the GPU nodes:
1.1. Stop dcos-mesos-slave.service:
systemctl stop dcos-mesos-slave.service
1.2. Add the following parameters to the /var/lib/dcos/mesos-slave-common file:
# a comma separated list of GPUs (id), as determined by running nvidia-smi on the host where the agent is to be launched
MESOS_NVIDIA_GPU_DEVICES="0,1"
# the value of the gpus resource must match the number of GPU ids listed above
MESOS_RESOURCES=[{"name":"ports","type":"RANGES","ranges":{"range":[{"begin":1025,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":32000}]}},{"name":"gpus","type":"SCALAR","scalar":{"value":2}}]
MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume,cgroups/devices,gpu/nvidia
1.3. Start dcos-mesos-slave.service:
systemctl start dcos-mesos-slave.service
Enable the GPU_RESOURCES capability in the Mesos frameworks:
2.1. The Marathon framework should be launched with the option
--enable_features "gpu_resources"
2.2. The Aurora scheduler should be launched with the option -allow_gpu_resource
Note:
Any host running a Mesos agent with NVIDIA GPU support MUST have a valid NVIDIA kernel driver installed. It is also highly recommended to install the corresponding user-level libraries and tools available as part of the NVIDIA CUDA toolkit. Many jobs that use NVIDIA GPUs rely on CUDA, and not including it will severely limit the types of GPU-aware jobs you can run on Mesos.
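Once the agents have been restarted and the frameworks relaunched with the capability enabled, it is worth confirming that the gpus resource is actually being advertised. A small sketch that queries the Mesos master's /state endpoint (the master address is a placeholder; adjust it to your cluster):

# check_gpus.py - list the gpus resource advertised by each Mesos agent
# Assumes the requests library is installed.
import requests

MESOS_MASTER = "http://master.mesos:5050"  # placeholder master address

state = requests.get(MESOS_MASTER + "/state").json()
for agent in state.get("slaves", []):
    gpus = agent.get("resources", {}).get("gpus", 0)
    print(agent["hostname"], "gpus =", gpus)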
I need to somehow monitor the CPU and MEM usage of an embedded system during automated system tests using Jenkins.
As of now, Jenkins flashes my target with the newest build and afterwards runs some tests. The system is running ARM Linux, so I would be able to make a script that polls the info over SSH during the tests.
My question is whether there already is a tool that provides this functionality. If not, how would I make Jenkins process a file and provide a graph of the CPU and memory info during these tests?
Would this plugin suit your needs?
Monitoring plugin: Monitoring of Jenkins
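If the plugin only covers the Jenkins host and you end up going the script route from the question, a minimal polling sketch is shown below. It assumes psutil is available on the target (or wherever you run it over SSH) and writes a CSV that a plotting step, for example the Jenkins Plot plugin, can graph once the tests finish:

# poll_usage.py - sample CPU and memory usage once per second until the script is killed
import csv
import time
import psutil

with open("usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "mem_percent"])
    while True:
        writer.writerow([time.time(),
                         psutil.cpu_percent(interval=1),     # averaged over the 1 s interval
                         psutil.virtual_memory().percent])   # system-wide memory usage
        f.flush()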