Cannot run Mesos Containers with GPU tasks - gpu

I am running Mesos on Ubuntu and am trying to execute:
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--docker_image=nvidia/cuda \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
and it is failing because: sh: 1: nvidia-smi: not found
even though when I run it without container support
mesos-execute \
--master=$(cat /etc/mesos/zk) \
--name=gpu-test \
--command="nvidia-smi" \
--framework_capabilities="GPU_RESOURCES" \
--resources="gpus:1"
it has access to the GPU.
Also, if I run it without container support but set the command to
nvidia-docker run -it nvidia/cuda nvidia-smi
it works, so it seems that the Mesos containerizer doesn't have access to the GPUs. However, in the /etc/mesos-slave/ directory I set containerizers to mesos (along with all the other flags required to run GPU commands), and non-GPU commands work fine.

This looks like a regression in 1.3.0. I downgraded to 1.2.1 on Ubuntu and can successfully use GPUs with Docker containers and the Mesos containerizer again.
sudo apt-get install mesos=1.2.1-2.0.1
It looks like someone filed a related bug but there's been no activity:
https://issues.apache.org/jira/browse/MESOS-7730

Related

What is a no-nonsense way to create an Ubuntu virtual machine using virt-manager on the command line?

I tried various methods explained on the internet, but none of them seems to work. Using a local ISO image gives one error, and using --location gives another.
Can we also set up the IP address with this command?
I am currently using this command:
sudo virt-install \
--name worker-2 \
--ram=4096 \
--disk size=100 \
--disk path=/opt/sciserver/vm/worker-2.qcow2,size=30,format=qcow2 \
--vcpus 2 \
--os-type linux \
--os-variant ubuntu20.04 \
--graphics none \
--location 'http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/' \
--extra-args "console=tty0 console=ttyS0,115200n8"
and the error says:
"ERROR Error validating install location: Could not find an installable distribution at 'http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/'
The location must be the root directory of an install tree.
See virt-install man page for various distro examples."
Your help will be much appreciated

Tensorflow Serving Compiling Failure For CPU AVX AVX2

I am following the method in the official TFX documentation to build the TensorFlow Serving devel image from its Dockerfile. The OS is macOS on an Intel CPU.
Here is the docker build script:
#!/bin/bash
USER=$1
TAG=$2
TF_SERVING_VERSION_GIT_BRANCH="2.4.1"
git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/tensorflow/serving
TF_SERVING_BUILD_OPTIONS="--copt=-mavx --local_ram_resources=4096"
cd serving && \
docker build --pull -t $USER/tensorflow-serving-devel:$TAG \
--build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
--build-arg TF_SERVING_BUILD_OPTIONS="${TF_SERVING_BUILD_OPTIONS}" \
-f tensorflow_serving/tools/docker/Dockerfile.devel .
I then ran the shell script, and after more than 3 hours the build failed.
I cannot see the details because the Docker log is clipped by the builder.
Has anyone met a similar problem and can help with this?
Thanks a lot in advance!
These instruction sets are not available on all machines, especially with older processors.
If you'd like to apply generally recommended optimizations, including utilizing platform-specific instruction sets for your processor, you can add --config=nativeopt to Bazel build commands when building TensorFlow Serving.
tools/run_in_docker.sh bazel build --config=nativeopt tensorflow_serving/...
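Before choosing between explicit flags such as --copt=-mavx and --config=nativeopt, it can help to check which instruction sets the build machine actually reports. The following is only a minimal sketch and is not part of the TensorFlow Serving tooling: has_cpu_flag is a hypothetical helper name, and it assumes a Linux environment (for example, inside the build container) where /proc/cpuinfo exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TensorFlow Serving).
# Reads cpuinfo-formatted text on stdin and succeeds if the first
# "flags" line lists the given flag as a whole word.
has_cpu_flag() {
    grep -m1 '^flags' | tr ' ' '\n' | grep -qx "$1"
}

# Example use on Linux; guarded so the script does nothing elsewhere.
if [ -r /proc/cpuinfo ]; then
    for flag in avx avx2; do
        if has_cpu_flag "$flag" < /proc/cpuinfo; then
            echo "$flag: reported by this CPU"
        else
            echo "$flag: not reported"
        fi
    done
fi
```

If a flag is not reported, passing the matching --copt option is likely to produce binaries that cannot run on that machine.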

Invalid argument --model_config_file_poll_wait_seconds

I'm trying to start tensorflow-serving with the following two options, as shown in the documentation:
docker run -t --rm -p 8501:8501 \
-v "$(pwd)/models/:/models/" tensorflow/serving \
--model_config_file=/models/models.config \
--model_config_file_poll_wait_seconds=60
The container does not start because it does not recognize the argument --model_config_file_poll_wait_seconds.
unknown argument: --model_config_file_poll_wait_seconds=60
usage: tensorflow_model_server
I'm on the latest Docker image, 1.14.0, and the line is taken straight from the documentation:
https://www.tensorflow.org/tfx/serving/serving_config
Does this argument even work?
Many thanks.
It seems https://www.tensorflow.org/tfx/serving/serving_config is talking about code that has not been released as a new version yet, which is odd. I will ask about that.
That page is generated from this source, which mentions the --model_config_file_poll_wait_seconds flag:
https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/serving_config.md
However, the same document for 1.14.0 has no mention of the flag:
https://github.com/tensorflow/serving/blob/1.14.0/tensorflow_serving/g3doc/serving_config.md
Try using the nightly tensorflow serving image and see if it works.
docker run -t --rm -p 8501:8501 \
-v "$(pwd)/models/:/models/" tensorflow/serving:nightly \
--model_config_file=/models/models.config \
--model_config_file_poll_wait_seconds=60
Just tried. Tensorflow Serving 2.1.0 supports it while 1.14.0 doesn't.
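If the same start script has to work with several image versions, the flag can be added only when the version is known to accept it. This is a rough sketch under stated assumptions: version_ge and poll_flag_for are hypothetical helper names, the 2.1.0 threshold comes only from the report above (the flag may have appeared in an earlier release after 1.14.0), and sort -V requires GNU coreutils or a compatible sort.

```shell
#!/usr/bin/env bash
# Succeeds if version $1 >= version $2, using sort -V for the ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: prints the poll flag only for versions assumed
# to support it (threshold 2.1.0 is an assumption from this thread).
poll_flag_for() {
    local version="$1" seconds="${2:-60}"
    if version_ge "$version" "2.1.0"; then
        printf -- '--model_config_file_poll_wait_seconds=%s\n' "$seconds"
    fi
}

# Dry run: show what would be appended for each version.
for v in 1.14.0 2.1.0; do
    echo "$v: $(poll_flag_for "$v" 60)"
done
```

The output of poll_flag_for can be appended to the docker run command after --model_config_file; for 1.14.0 it prints nothing, so the server starts without polling.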

virt-install hangs - GPU Passthrough for Virtual Machines

I want to run VMs that use the host's GPU. For that, I followed these docs to enable the required modules and GRUB configuration. The configuration looks successful: I can see the vfio entries with dmesg | grep -i vfio. But when I run virt-install, it hangs forever, and while it hangs I cannot even run virsh list --all. Every time, I have to restart my laptop before I can run any virsh/virt-install commands again.
veeru@ghost:~$ sudo su
[sudo] password for veeru:
root@ghost:/home/veeru# virt-install \
> --name vm0 \
> --ram 12028 \
> --disk path=/home/veeru/ubuntu14-HD.img,size=30 \
> --vcpus 2 \
> --os-type linux \
> --os-variant ubuntu16.04 \
> --network bridge=bridge:br0 \
> --graphics none \
> --console pty,target_type=serial \
> --location /home/veeru/Downloads/ubuntu-16.04.5.iso --force \
> --extra-args 'console=ttyS0,115200n8 serial' \
> --host-device 01:00.0 \
> --features kvm_hidden=on \
> --machine q35
Starting install...
Retrieving file .treeinfo... | 0 B 00:00:00
Retrieving file content... | 0 B 00:00:00
Retrieving file info... | 67 B 00:00:00
Retrieving file vmlinuz... | 6.8 MB 00:00:00
Retrieving file initrd.gz... | 14 MB 00:00:00
Below is the strace output for the process running the above command:
veeru@ghost:~$ sudo strace -p 9747
strace: Process 9747 attached
restart_syscall(<... resuming interrupted poll ...>
PS: My laptop is a Predator Helios 300 (UEFI, Secure Boot), GPU: Nvidia GeForce GTX 1050 Ti, Ubuntu Mate 18.04 (NVIDIA drivers installed), 8 GB RAM.
OK, I see the problem: the GPU is already being used by the host (my laptop), i.e. it is busy. So it is no wonder that virt-install hangs forever.
To resolve the issue, switch X11 away from the NVIDIA GPU (to the CPU's integrated graphics). I use Ubuntu Mate 18.04, which has a handy tool for switching.
After that, log out and log back in, then run nvidia-smi to check that the NVIDIA GPU is not being used by any process; the output should be similar to the following.
veeru@ghost:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Now you should be able to run virt-install, as I can.
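Another way to see whether the host still holds the GPU is to look at the "Kernel driver in use" line that lspci -k prints for the device. The snippet below is only a sketch: pci_driver_in_use is a hypothetical helper name, the address 01:00.0 is taken from the --host-device option in the question, and it assumes lspci (from pciutils) is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lspci -k` output on stdin and prints the
# name from the first "Kernel driver in use" line, if there is one.
pci_driver_in_use() {
    awk -F': ' '/Kernel driver in use/ { print $2; exit }'
}

# 01:00.0 is the PCI address used in the question; adjust as needed.
if command -v lspci >/dev/null 2>&1; then
    driver="$(lspci -k -s 01:00.0 | pci_driver_in_use)"
    echo "driver bound to 01:00.0: ${driver:-none}"
fi
```

If this reports nvidia, the host driver still owns the card. For passthrough, the device is typically expected to be bound to vfio-pci instead.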

How to enable VNC in VM in KVM when the KVM server doesn't have a GUI?

Can someone tell me how to enable VNC in a VM that is created using virt-install on a KVM hypervisor?
My server doesn't have a GUI, so I have been running the following command to spin up a VM:
virt-install \
--name centos6 \
--ram 1024 \
--disk path=/var/lib/libvirt/images/centos6.img,bus=virtio,size=30 \
--vcpus 1 \
--os-type linux \
--os-variant rhel6 \
--network bridge=br0 \
--graphics none \
--location 'http://mirror.i3d.net/pub/centos/6/os/x86_64/' \
--extra-args 'console=ttyS0,115200n8 serial'
Now I want to install a GUI on the VM (centos6) and set up VNC. Can someone tell me how to achieve that?
Thanks.
I found the answer at this link. It is written for Ubuntu, but the same steps can be replicated on other distributions: https://www.howtoforge.com/tutorial/ubuntu-gnome-vnc-headless-server
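A different approach from the linked tutorial, offered here only as a sketch: virt-install's --graphics option accepts vnc with listen and port sub-options, which exposes the guest's display from the hypervisor instead of running a VNC server inside the guest. vnc_graphics_arg is a hypothetical helper name, and the address and port below are placeholder values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the value for virt-install's --graphics
# option. Defaults (127.0.0.1, 5901) are placeholder assumptions.
vnc_graphics_arg() {
    local listen="${1:-127.0.0.1}" port="${2:-5901}"
    printf 'vnc,listen=%s,port=%s\n' "$listen" "$port"
}

# Dry run: print the option that would replace "--graphics none".
echo "--graphics $(vnc_graphics_arg 127.0.0.1 5901)"
```

With the listen address set to 127.0.0.1, the display is reachable only from the KVM host itself, for example through an SSH tunnel. Running virsh vncdisplay centos6 should then report which display the guest was given.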