Different NVIDIA driver versions per Docker container

Is it possible to run two Nvidia Docker containers, each with its own Nvidia driver version?
On my cloud instance, I have an older application running for which newer Nvidia drivers are causing issues. I'd like the ability to keep running it with the older driver, while allowing newer applications on the same instance to use newer drivers. I was thinking I could accomplish this with containers but I'm worried that they only allow you to containerize things in user space.

The answer is no. The driver is installed on the host.
These articles:
NVIDIA Docker: GPU Server Application Deployment Made Easy
and the newer
Enabling GPUs in the Container Runtime Ecosystem
discuss how the stack is set up.
The key takeaway is that Nvidia brings their own version of runc (the part of Docker that actually runs the container process). This modified version of runc communicates with the host OS to make driver-level details available to the container process.
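In practice, that runtime is what the docker CLI invokes when you ask for GPUs. As a rough illustration (reusing the CUDA image tag from the answer below), the older and newer invocation styles look like this:
# nvidia-docker 2.x style: the custom runtime registered by the package
docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi
# Docker 19.03+ style with the NVIDIA Container Toolkit
docker run --gpus all --rm nvidia/cuda:11.0-base nvidia-smi
Either way, the driver itself still comes from the host.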
EDIT (Oct 2022): I believe this answer is outdated. Nvidia/Docker have since released Driver Containers which allow provisioning of the driver as a container. The only thing that's needed is the Nvidia Container Toolkit and some light configuration. See this link for details.

Update: it is actually now possible to have different driver versions for different containers. All you have to do is change the NVIDIA Container Toolkit config file (/etc/nvidia-container-runtime/config.toml) so that the root directive points to the driver container, as shown below:
disable-require = false
swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
root = "/run/nvidia/driver"
path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
user = "root:video"
ldconfig = "#/sbin/ldconfig.real"
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
This modification can be made directly with this command:
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
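To confirm the change took effect, you can simply print the directive back, e.g.:
grep '^root' /etc/nvidia-container-runtime/config.toml
# expected output: root = "/run/nvidia/driver"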
After that, you can run a container with the required driver version in the background:
sudo docker run --name nvidia-driver -d --privileged --pid=host \
-v /run/nvidia:/run/nvidia:shared \
-v /var/log:/var/log \
--restart=unless-stopped \
nvidia/driver:450.80.02-ubuntu18.04
and then run the GPU container with the required CUDA version for your work:
sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi

Containers are for isolating processes. All containers share the host's kernel, which is not the case for virtual machines. So you can create one container for the old application with the older driver and another container for the new application with the newer NVIDIA driver; containers are meant for this (see the sketch below).
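As a minimal sketch of that idea (the image tags are examples, and this assumes the single host driver is new enough for both CUDA userspace versions):
# old application against an older CUDA userspace
sudo docker run --gpus all --rm nvidia/cuda:10.2-base nvidia-smi
# new application against a newer CUDA userspace
sudo docker run --gpus all --rm nvidia/cuda:11.0-base nvidia-smi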
But with nvidia-docker you may be limited to one GPU per pod; this can be bypassed using some simple methods, which is not a good solution.

Related

Stopping and starting a deep learning google cloud VM instance causes tensorflow to stop recognizing GPU

I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 automatically installed. When I start the instance, everything works great - I can run:
import tensorflow as tf
tf.config.list_physical_devices()
and the call returns the CPU, accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), the function returns True.
However, if I log out, stop the instance, and then restart the instance, running the same exact code only sees the CPU and tf.test.is_gpu_available() results in False. I get an error that looks like the driver initialization is failing:
E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
Running nvidia-smi shows that the computer still sees the GPU, but my tensorflow can’t see it.
Does anyone know what could be causing this? I don’t want to have to reinstall everything when I’m restarting the instance.
Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can configure your system settings on NVIDIA's website and it will provide the commands you need to follow to install CUDA. It also asks whether you want to uninstall the previous CUDA version (yes!). This is luckily also very fast.
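The exact commands are version- and distro-specific and should be copied from NVIDIA's download selector, but on Ubuntu the flow is roughly:
# remove the previous CUDA/driver packages (the installer also offers to do this)
sudo apt-get --purge remove "cuda*" "nvidia*"
# then add NVIDIA's repository and install, following the commands from the download page
sudo apt-get update
sudo apt-get install cuda
sudo reboot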
I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
Option-1:
Upgrade the Notebooks instance's environment. Refer to the link to upgrade.
Notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.
Option-2:
Connect to the notebook VM via SSH and run the commands in the link.
After executing the commands, the CUDA version will be updated to 11.3 and the NVIDIA driver to 465.19.01.
Restart the notebook VM.
Note: the issue has been solved in the GPU images. New notebooks will be created with image version M74. The new image version is not yet mentioned in the Google public issue tracker, but you can find M74 in the console.

WSL2 - nvidia-smi command not running

I have Ubuntu 18.04 LTS installed inside WSL2 and I was able to use the GPU. I can run
nvidia-smi from the Windows terminal.
However, I get no result when I run nvidia-smi inside WSL2.
The fix is now available in the NVIDIA docs.
cp /usr/lib/wsl/lib/nvidia-smi /usr/bin/nvidia-smi
chmod ogu+x /usr/bin/nvidia-smi
Source: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations
From the known limitations in NVIDIA's documentation:
NVIDIA Management Library (NVML) APIs are not supported. Consequently,
nvidia-smi may not be functional in WSL 2.
However you should be able to run https://docs.nvidia.com/cuda/wsl-user-guide/index.html#unique_1238660826
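For example, CUDA containers did work under WSL2 at that point; NVIDIA's guide used a sample along these lines (image name and tag may have changed since):
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark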
EDIT: since this answer was written, nvidia-smi has been supported as of driver 465.42.
I am running it well using 470.57.02.
When I was installing CUDA 11.7.1 in WSL2, the same error was raised:
Failed to initialize NVML: GPU access blocked by the operating system
Failed to properly shut down NVML: GPU access blocked by the operating system
I fixed it by updating the Windows system to version 19044 21H2.

How to use a remote machine's GPU in jupyter notebook

I am trying to run tensorflow on a remote machine's GPU through Jupyter notebook. However, if I print the available devices using tf, I only get CPUs. I have never used a GPU before and am relatively new at using conda / jupyter notebook remotely as well, so I am not sure how to set up using the GPU in jupyter notebook.
I am using an environment set up by someone else who already executed the same code on the same GPU, but they did it via python script, not in a jupyter notebook.
This is the only code in the other person's file that has to do with the GPU:
import tensorflow as tf
from tensorflow.keras.backend import set_session  # TF 1.x; the original may import set_session from keras.backend instead
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
set_session(tf.Session(config=config))
I think the problem was that I had tensorflow in my environment instead of tensorflow-gpu. But now I get this message "cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version" and I don't know how to update the driver through the terminal.
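For TF 1.x the GPU build was a separate package, so in a pip-managed environment the switch looks roughly like:
pip uninstall tensorflow
pip install tensorflow-gpu   # TF 1.x only; TF 2.x ships GPU support in the main package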
How is your environment set up? Specifically, what is your remote environment, and what is your local environment? Sounds like your CUDA drivers are out of date, but it could be more than just that. If you are just getting started, I would recommend finding an environment that requires little to no configuration work on your part, so you can get started more easily/quickly.
For example, you can run GPUs in the cloud and connect to them via a local terminal. You can also have your "local" frontend be Colab by connecting it to a local runtime (this video explains that particular setup, but there are lots of other options).
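For the Colab local-runtime route, the setup documented at the time was roughly the following (the flags may have changed, so check Colab's local-runtime page):
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888 --NotebookApp.port_retries=0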
You may also want to try running nvidia-smi on the remote machine to see if the GPUs are visible.
Here is another solution that describes how to set up a GPU JupyterLab instance with Docker.
To update your drivers via terminal, run:
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot
Are your CUDA paths set appropriately? Like this?
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Restcomm configuration for Windows

I am beginning to set up the Restcomm Mobicents framework.
In particular, I am trying to set up Restcomm on my Windows 8.1 laptop.
Can anyone assist with how to set up Restcomm on Windows Server, please? I ask because most of the information assumes Linux/Ubuntu: things like installing screen and the sample configurations shared are for Ubuntu/Linux environments, not specifically for Windows.
Restcomm is currently prepared to run natively on Linux operating systems, indeed.
Considering the diversity of OSs and configurations, a Docker image was created to gather everything Restcomm needs to run properly, as an independent layer.
Please check the following links to install Docker on Windows 8.1 and use the Restcomm Docker image.
About docker: https://www.docker.com/what-docker
Docker installation: http://docs.docker.com/windows/step_one/
About Restcomm docker image: http://www.telestax.com/docker-image-for-mobicents-restcomm-7-3-0/
Restcomm docker image: https://hub.docker.com/r/gvagenas/restcomm/
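Once Docker is running, pulling and starting that image is along these lines (the ports and environment variables to pass are listed in the image's documentation, so treat this as a sketch):
docker pull gvagenas/restcomm
docker run -d --name restcomm gvagenas/restcomm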
You should avoid installing Restcomm on Windows. Although it will work for simulation, when you have to go live you will be unable to, because Windows does not implement the SCTP stack needed by Restcomm.

How to get Sikuli working in headless mode

If we have a headless test server running Sikuli (both Ubuntu and Windows configurations are needed), how do we get it to work without a physical monitor, preferably supporting as many screen resolutions as possible?
I successfully got Sikuli running in headless mode (no physical monitor connected).
Ubuntu: check Xvfb (see the sketch after this list).
Windows: install a display driver on the machine (to be headless) from the VirtualBox Guest Additions display drivers and use TightVNC to set the resolution remotely from another machine.
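For the Ubuntu/Xvfb route, a minimal sketch (display number, resolution, and script path are placeholders; the jar name matches the one used later in this thread):
# start a virtual framebuffer display and point Sikuli at it
Xvfb :1 -screen 0 1280x1024x24 &
export DISPLAY=:1
java -jar sikulixide-2.0.5.jar -r /path/to/mytest.sikuli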
Detailed steps for Windows 7
Assume that:
Machine A: the machine to be headless, Windows 7, with a VNC server ready (e.g. TightVNC server installed and waiting for connections).
Machine B: will be used to remotely set up the virtual display driver on Machine A.
Steps:
Download the VirtualBox Guest Additions ISO file on Machine A from here (check the latest version here and download VBoxGuestAdditions_x.y.z.iso)
Extract the ISO file (possibly with WinRAR) into a directory (let us call it folder D)
Using a command prompt, cd into folder D
Driver extraction:
- To extract the 32-bit drivers to "C:\Drivers", do the following:
VBoxWindowsAdditions-x86 /extract /D=C:\Drivers
- For the 64-bit drivers:
VBoxWindowsAdditions-amd64 /extract /D=C:\Drivers
Go to Device Manager
Add hardware
Restart and connect with a VNC viewer; now you should be able to change the screen resolution
Other valuable info is on Launchpad.
I got SikuliX working in a true headless mode in GCE with a Windows 2016 client system. It takes some duct tape and other Rube Goldberg contraptions to work, but it can be done.
The issue is that, for GCE (and probably AWS and other cloud environment Windows clients), you don't have a virtual video adapter and display, so, unless there's an open RDP connection to the client, it doesn't have a screen, and SikuliX/OpenCV will get a 1024x768 black desktop, and fail.
So, the question is, how to create an RDP connection without having an actual screen anywhere. I did this using Xvfb (X Windows virtual frame buffer). This does require a second VM, though. Xvfb runs on Linux. The other piece of the puzzle is xfreerdp 2.0. The 2.x version is required for compatibility with recent versions of Windows. 1.x is included with some Linux distros; 2.x may need to be built from sources, depending on what flavor Linux you're using. I'm using CentOS, which did require me to build my own.
The commands to establish the headless RDP session, once the pieces are in place, look something like this:
# start a virtual X display for the RDP client to render into
/usr/bin/Xvfb :0 -screen 0 1920x1080x24 &
export DISPLAY=:0.0
# open an RDP session to the Windows target so it gets a real desktop for SikuliX
/usr/local/bin/xfreerdp /size:1920x1080 /u:[WindowsUser] /p:"[WindowsPassword]" /v:[WindowsTarget]
In our environment we automated this as part of the build job kicked off by Jenkins. For this to work under the Jenkins slave, it was also necessary to run the Jenkins slave as a user process, rather than a service... this can be accomplished by enabling auto admin login and setting the slave launch script as a run (on logon) command.
For those looking to automate on ec2 windows machines, this worked for me: http://www.allianceglobalservices.com/blog/executing-automation-suite-on-disconnectedlocked-machines
In summary, I used RDC to connect, put the following code in a batch file on the remote desktop, double-clicked it, and SikuliX started working remotely (kicking me out of RDC at the same time). Note that EC2 Windows machines default to 1024x768 when tscon takes over, which may be too small, so TightVNC can be used to increase the resolution to 1280x1024 before running.
tscon.exe 0 /dest:console
tscon.exe 1 /dest:console
tscon.exe 2 /dest:console
tscon.exe 3 /dest:console
START /DC:\Sikulix /WAIT /B C:\Sikulix\runsikulix.cmd -d 3 -r C:\test.sikuli -f C:\Sikulix\log.txt -d C:\Sikulix\userlog.txt
I just figured out a way to resolve a similar issue.
My env:
local: Windows PC
remote (for running SikuliX + the app I would like to test): Windows EC2 instance
My way:
1. Create a .bat file with these contents:
rem wait ~15 seconds so you can bring the app under test to the front
ping 127.0.0.1 -n 15 > nul
rem detach the current RDP session to the physical console so the desktop stays active
for /f "skip=1 tokens=3" %%s in ('query user %USERNAME%') do (
%windir%\System32\tscon.exe %%s /dest:console
)
rem run the SikuliX script and capture its output
cd "\path\to\sikulix"
java -jar sikulixide-2.0.5.jar -r /path/to/sikulix -c > logfile.log
prepare your app
run the .bat file (right click > Run as administrator)
the ping gives you about 15 seconds so you can bring your app back to the front
you will be disconnected from the RDP connection
Explanation:
ping acts like "sleep"
the for loop kicks the current user out of the RDP session while keeping the session alive on the console