VFIO - iGPU passthrough on Intel 4770 to a virtual machine (host OS: Proxmox) - gpu

I am running the latest Proxmox (6.3-3 at the time of writing, fully updated) and attempting to pass through the onboard GPU of my Core i7 4770 CPU to a Windows 10 VM. I have already enabled IOMMU on the system and told GRUB not to let the host claim the device by adding intel_iommu=on video=efifb:off to the GRUB kernel options. I've verified that IOMMU is actually available by checking dmesg:
# dmesg | grep -e DMAR -e IOMMU -e AMD-Vi
[ 0.007556] ACPI: DMAR 0x00000000D88C33C8 0000B8 (v01 INTEL HSW 00000001 INTL 00000001)
[ 0.083595] DMAR: IOMMU enabled
[ 0.180445] DMAR: Host address width 39
[ 0.180446] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[ 0.180449] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap c0000020660462 ecap f0101a
[ 0.180449] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[ 0.180451] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008020660462 ecap f010da
[ 0.180452] DMAR: RMRR base: 0x000000d8842000 end: 0x000000d884efff
[ 0.180452] DMAR: RMRR base: 0x000000db000000 end: 0x000000df1fffff
[ 0.180454] DMAR-IR: IOAPIC id 8 under DRHD base 0xfed91000 IOMMU 1
[ 0.180454] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[ 0.180455] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.180831] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 0.874497] DMAR: No ATSR found
[ 0.874527] DMAR: dmar0: Using Queued invalidation
[ 0.874531] DMAR: dmar1: Using Queued invalidation
[ 1.026818] DMAR: Intel(R) Virtualization Technology for Directed I/O
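For completeness, this is roughly what the kernel-option change looks like in /etc/default/grub (a minimal sketch, assuming the standard GRUB mechanism; "quiet" is just the Proxmox default):
# /etc/default/grub (excerpt)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on video=efifb:off"
# regenerate the boot config and reboot for it to take effect
update-grub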
I've also blacklisted the iGPU driver (and the associated HDMI audio drivers) to prevent the host OS from claiming the device:
# cat /etc/modprobe.d/blacklist.conf
blacklist snd_hda_intel
blacklist snd_hda_codec_hdmi
blacklist i915
# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=8086:0412 disable_vga=1
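A note for anyone reproducing this: modprobe.d changes generally only take effect after rebuilding the initramfs and rebooting, after which the binding can be confirmed (a minimal sketch; the PCI address matches the hostpci0 entry in the VM config below):
update-initramfs -u -k all
# after a reboot, confirm vfio-pci now owns the iGPU
lspci -nnk -s 00:02.0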
Finally, I set up a new Windows 10 VM on my host using the q35 chipset and UEFI (OVMF) firmware, as this is apparently the most "compatible" way to pass through hardware. I've also got an external screen plugged into the HDMI port of my Proxmox host; I understand that when the VM boots up, I should see this screen come to life. The QEMU config file of the VM is below:
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0;ide2;net0
cores: 4
efidisk0: local-1tb-nvme-thinpool:vm-118-disk-1,size=4M
hostpci0: 00:02,pcie=1,x-vga=1
ide2: none,media=cdrom
machine: q35
memory: 4096
name: VFIOtest
net0: virtio=52:D7:02:CA:B6:2E,bridge=vmbr0,firewall=1
numa: 0
ostype: win10
scsihw: virtio-scsi-pci
smbios1: uuid=cd9d41e9-d8c2-465e-94dc-798aa8e517e2
sockets: 1
virtio0: local-1tb-nvme-thinpool:vm-118-disk-0,backup=0,discard=on,size=60G
vmgenid: 2cb8ce5e-5dda-4870-9cf3-774bb025057f
With all that done, I can boot the VM. As soon as it boots, the screen goes to standby, indicating no signal. I can, however, RDP into the system, and the Intel HD Graphics 4600 is visible in Device Manager, so I installed the latest drivers from the Intel website. Unfortunately, the device will not start and shows an exclamation mark next to it. The Device Status shows:
Windows has stopped this device because it has reported problems. (Code 43)
Unfortunately, the Code 43 error just means something is wrong; it isn't specific about what is causing this.
Not too sure what to try from this point - any assistance on where to continue troubleshooting would be useful.

Code 43 is an NVIDIA-specific error; you will need a way to mask the true CPU by using the FancyId parameter. Here is a link to a video that covers some of the process around the error you are seeing.
Can you edit the original post to include your GRUB config file? There are some more recent changes in Proxmox 6.3 that might need to be reconfigured; there are almost no articles about setting up passthrough on 6.3.

I found it came down to setting the CPU model during VM creation. Changing it after VM creation does nothing, so something must be set at creation time. None of the other guides worked for me, so I solved the problem and wrote my own guide: https://elijahliedtke.medium.com/home-lab-guides-proxmox-6-pci-e-passthrough-with-nvidia-43ccfb9424de
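For reference, the CPU model ends up as a single line in the VM's config file; a hypothetical illustration only (the exact model and flags to use are covered in the guide above; hidden=1 is a flag commonly used to hide the hypervisor signature from guest drivers):
# /etc/pve/qemu-server/<vmid>.conf
cpu: host,hidden=1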

Related

Retrain an object detection model - Out of memory (OOM) Killer - Google coral

I was trying to retrain an object detection model for the Google Coral accelerator, per the link below:
https://coral.ai/docs/edgetpu/retrain-detection/#prerequisites
The host system is Linux Mint with a Docker environment:
CPU: Intel(R) Core(TM) i3-5005U CPU @ 2.00GHz
Graphics: Card: Intel HD Graphics 5500
OS: Linux Mint 19 Tara
Memory Size: 8G
But after starting the training job:
root@beaa5d65a1d5:/tensorflow/models/research# ./retrain_detection_model.sh --num_training_steps ${NUM_TRAINING_STEPS} --num_eval_steps ${NUM_EVAL_STEPS}
The process is killed by the OOM killer:
./retrain_detection_model.sh: line 45: 86 Killed
python object_detection/model_main.py
--pipeline_config_path="${CKPT_DIR}/pipeline.config" --model_dir="${TRAIN_DIR}" --num_train_steps="${num_training_steps}" --num_eval_steps="${num_eval_steps}"
Any help is appreciated!
This is an out-of-memory issue due to a hardware limitation. Two things you can do are to add more RAM or to add swap space (using storage as RAM), although going with the latter will be very slow.
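If you go the swap route, a minimal sketch of creating a swap file on Linux (the 8G size is just an example matching the machine's RAM):
# create and enable an 8 GiB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile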

Error while running gem5 full-system mode on ARM

I installed the gem5 simulator on Ubuntu 14.04, then I used the YouTube guide (https://www.youtube.com/watch?v=gd_DtxQD5kc) to run gem5 in full-system mode on the ARM architecture. First I downloaded arm-system-2011-08.tar.bz2 as mentioned in the video, then I ran the command below:
build/ARM/gem5.opt configs/example/fs.py --disk-image=/home/morteza/full_system_images/disks/arm-ubuntu-natty-headless.img --kernel=/home/morteza/full_system_images/binaries/vmlinux.arm.smp.fb.2.6.38.8
But I encountered the output below. Can anybody please help me?
P.S.: I added the --kernel option and renamed the bootloader in /full_system_images/binaries from boot.arm to boot_emm.arm because of some errors about the bootloader and kernel not being found. My final output is included hereunder; I'd appreciate it if anybody could tell me what the problem is.
OUTPUT:
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 compiled Jan 3 2020 05:49:20
gem5 started Jan 3 2020 17:16:17
gem5 executing on morteza-pc, pid 2499
command line: build/ARM/gem5.opt configs/example/fs.py --disk-image=/home/morteza/full_system_images/disks/arm-ubuntu-natty-headless.img --kernel=/home/morteza/full_system_images/binaries/vmlinux.arm.smp.fb.2.6.38.8
warn: Can only correctly generate a dtb for VExpress_GEM5_V1 platforms, unless custom hardware models have been equipped with generation functionality.
Global frequency set at 1000000000000 ticks per second
warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (512 Mbytes)
info: kernel located at: /home/morteza/full_system_images/binaries/vmlinux.arm.smp.fb.2.6.38.8
warn: Bootloader entry point 0x80000000 overriding reset address 0
system.vncserver: Listening for connections on port 5900
system.terminal: Listening for connections on port 3456
0: system.remote_gdb: listening for remote gdb on port 7000
info: Using bootloader at address 0x80000000
info: Using kernel entry physical address at 0x80008000
warn: DTB file specified, but no device tree support in kernel
**** REAL SIMULATION ****
warn: Existing EnergyCtrl, but no enabled DVFSHandler found.
info: Entering event queue # 0. Starting simulation...
warn: Device system.membus.badaddr_responder accessed by read to address 0x10009018 size=4
gem5.opt: build/ARM/cpu/simple/atomic.cc:418: virtual Fault AtomicSimpleCPU::readMem(Addr, uint8_t*, unsigned int, Request::Flags, const std::vector<bool>&): Assertion `!pkt.isError()' failed.
Program aborted at tick 30500
--- BEGIN LIBC BACKTRACE ---
build/ARM/gem5.opt(_Z15print_backtracev+0x15)[0x1d505e5]
build/ARM/gem5.opt(_Z12abortHandleri+0x36)[0x1d5a796]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f41e3962330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f41e1eacc37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f41e1eb0028]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fbf6)[0x7f41e1ea5bf6]
/lib/x86_64-linux-gnu/libc.so.6(+0x2fca2)[0x7f41e1ea5ca2]
build/ARM/gem5.opt(_ZN15AtomicSimpleCPU7readMemEmPhj5FlagsImERKSt6vectorIbSaIbEE+0x538)[0x1e4eca8]
build/ARM/gem5.opt(_ZN17SimpleExecContext7readMemEmPhj5FlagsImERKSt6vectorIbSaIbEE+0x21)[0x1e5c5b1]
build/ARM/gem5.opt(_Z13readMemAtomicI11ExecContextjESt10shared_ptrI9FaultBaseEPT_PN5Trace10InstRecordEmRT0_5FlagsImE+0x64)[0x1972e14]
build/ARM/gem5.opt(_ZNK10ArmISAInst27LOAD_IMM_AY_PN_SN_UN_WN_SZ47executeEP11ExecContextPN5Trace10InstRecordE+0x12d)[0x14f95cd]
build/ARM/gem5.opt(_ZN15AtomicSimpleCPU4tickEv+0x428)[0x1e4da58]
build/ARM/gem5.opt(_ZN10EventQueue10serviceOneEv+0xa1)[0x1d55f51]
build/ARM/gem5.opt(_Z9doSimLoopP10EventQueue+0x38)[0x1d65fc8]
build/ARM/gem5.opt(_Z8simulatem+0xaae)[0x1d66dfe]
build/ARM/gem5.opt[0x1dbbd3d]
build/ARM/gem5.opt[0xe08e85]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x45f7)[0x7f41e3579be7]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f41e3579ec8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f41e3579ec8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f41e3579ec8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7f41e357b772]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x563e)[0x7f41e357ac2e]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8)[0x7f41e3579ec8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x7f41e357b63d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7f41e357b772]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyRun_StringFlags+0x79)[0x7f41e35755a9]
--- END LIBC BACKTRACE ---
Aborted (core dumped)

Automounting USB-UART/FIFO IC (as ttyUSB0) with 0666 permissions - udev

I'm trying to automount the following device with 0666 permissions:
lsusb -vvv
Bus 001 Device 094: ID 0403:6014 Future Technology Devices International, Ltd FT232H Single HS USB-UART/FIFO IC
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 2.00
bDeviceClass 0 (Defined at Interface level)
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 64
idVendor 0x0403 Future Technology Devices International, Ltd
idProduct 0x6014 FT232H Single HS USB-UART/FIFO IC
bcdDevice 9.00
iManufacturer 1 FTDI
iProduct 2 C232HM-DDHSL-0
iSerial 3 FTVWEM02
bNumConfigurations 1
To achieve this I created the following udev rule in /etc/udev/rules.d:
SUBSYSTEM=="usb", ATTR{idVendor}=="0403", ATTR{idProduct}=="6014", MODE="0666", RUN+="/usr/bin/touch /tmp/udev-test.txt"code here
As you can see, I verify the functionality of the rule with a test file. The file is always created when the USB device is connected.
-rw-r--r-- 1 root root 0 Oct 20 09:56 udev-test.txt
That should mean the rule is functioning... however, it never gets the permissions right.
When running ls -l /dev/ttyU* I get the following result:
crw-rw---- 1 root dialout 188, 0 Oct 20 09:56 /dev/ttyUSB0
Strangely enough, if I run chmod from the command line as root, I can always change the permissions of the device. I would like that to happen automatically on every plug-in. Could you please help me?
I'm running Scientific Linux 7
Linux version 4.7.5-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Sat Sep 24 11:54:29 EDT 2016
The topics I already searched through:
Udev rule is not being applied - solution not working
https://stackoverflow.com/questions/34116854/udev-created-my-symblic-link-to-my-device-but-permission-not-set
Change ttyUSB permissions using udev - if I add KERNEL=="ttyUSB*" or KERNEL=="ttyUSB0", the rule no longer works
Run script with udev after USB plugged in on RPi
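When debugging rules like this, udevadm can reload the rules and simulate the event for a device to show which rules matched; a minimal sketch, assuming the device node is /dev/ttyUSB0:
sudo udevadm control --reload-rules
sudo udevadm test $(udevadm info -q path -n /dev/ttyUSB0)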

Live Migration Failure: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory

I have a two-node OpenStack Mitaka environment consisting of a controller/compute node and a compute node.
I've followed the setup guide to enable instance live migration using LVM block storage, i.e., there's no shared storage backend, just local LVM block storage.
When I use OpenStack Horizon to perform the live migration, a success message is displayed; however, the migration is far from successful. This worked pretty much out of the box with our Juno installation. I've exhausted Google and cannot find any other instances of people facing the same problem. I thought it might have been a time synchronisation problem, so I set both nodes to UTC. Still the problem persists.
Source machine /var/log/nova/nova-compute.log
2016-08-12 15:56:42.120 2230 ERROR nova.virt.libvirt.driver [req-b71ea7b0-5fa8-4b57-92d2-4edec62135c2 b017d86d1143461a92a267d4b912c104 88c686f09e1b427fb750f5c00716f84e - - -] [instance: 5763b6b6-370c-448c-8e8f-8b71eafaa8f1] Migration operation has aborted
2016-08-12 15:56:42.470 2230 ERROR nova.virt.libvirt.driver [req-b71ea7b0-5fa8-4b57-92d2-4edec62135c2 b017d86d1143461a92a267d4b912c104 88c686f09e1b427fb750f5c00716f84e - - -] [instance: 5763b6b6-370c-448c-8e8f-8b71eafaa8f1] Live Migration failure: internal error: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory
Target node /var/log/libvirt/libvirtd.log
2016-08-12 15:56:41.864+0000: 2170: error : qemuMonitorJSONGetMigrationStatsReply:2443 : internal error: info migration reply was missing return status
2016-08-12 15:56:41.864+0000: 2170: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/instance-0000006a.log': Device or resource busy
There are no other events captured in the source or target nova or libvirt logs.
I should also note that I am trying to use qemu+tcp (libvirt listening enabled, default tcp port, no auth) rather than qemu+ssh in order to keep things simple while testing. In fact, I intend to only use qemu+tcp anyway.
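Concretely, "listening enabled, no auth" means settings along these lines in /etc/libvirt/libvirtd.conf on both nodes, with libvirtd started with its --listen flag (a sketch of the intent rather than the exact files in use):
listen_tls = 0
listen_tcp = 1
auth_tcp = "none"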
Which version of Ubuntu did you deploy?
I had the same error with Ubuntu 14.04 and the Mitaka release, and I figured out that the default kernel (3.13) causes this problem.
I upgraded the kernel from 3.13 to 4.4 and the problem is gone now.
I hope my experience helps you solve this problem.
Thanks
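If it helps anyone else: one common way to get a 4.4-series kernel on Ubuntu 14.04 is the Xenial HWE kernel package (a sketch, assuming stock Trusty repositories):
sudo apt-get update
sudo apt-get install linux-generic-lts-xenial
sudo reboot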

How can I run Tensorflow on one single core?

I'm using Tensorflow on a cluster and I want to tell Tensorflow to run only on one single core (even though there are more available).
Does someone know if this is possible?
To run Tensorflow on one single CPU thread, I use:
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
device_count limits the number of CPUs being used, not the number of cores or threads.
tensorflow/tensorflow/core/protobuf/config.proto says:
message ConfigProto {
  // Map from device type name (e.g., "CPU" or "GPU" ) to maximum
  // number of devices of that type to use. If a particular device
  // type is not found in the map, the system picks an appropriate
  // number.
  map<string, int32> device_count = 1;
On Linux you can run sudo dmidecode -t 4 | egrep -i "Designation|Intel|core|thread" to see how many CPUs/cores/threads you have. E.g., the following machine has 2 CPUs, each with 8 cores, each core with 2 threads, which gives a total of 2*8*2 = 32 threads:
fra#s:~$ sudo dmidecode -t 4 | egrep -i "Designation|Intel|core|thread"
Socket Designation: CPU1
Manufacturer: Intel
HTT (Multi-threading)
Version: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Core Count: 8
Core Enabled: 8
Thread Count: 16
Multi-Core
Hardware Thread
Socket Designation: CPU2
Manufacturer: Intel
HTT (Multi-threading)
Version: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Core Count: 8
Core Enabled: 8
Thread Count: 16
Multi-Core
Hardware Thread
Tested with Tensorflow 0.12.1 and 1.0.0 with Ubuntu 14.04.5 LTS x64 and Ubuntu 16.04 LTS x64.
Yes, it is possible via thread affinity. Thread affinity lets you decide which specific core of the CPU executes a given thread. For thread affinity you can use "taskset" or "numactl" on Linux (see the sketch below); you can also use https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html and https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html
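For example, a minimal taskset sketch (the script name is hypothetical):
# pin the process, and every thread it spawns, to core 0
taskset -c 0 python my_training_script.py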
The following code will not instruct/direct Tensorflow to run only on one single core.
TensorFlow 1
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
TensorFlow 2
import os
# reduce number of threads
os.environ['TF_NUM_INTEROP_THREADS'] = '1'
os.environ['TF_NUM_INTRAOP_THREADS'] = '1'
import tensorflow
This will generate in total at least N threads, where N is the number of CPU cores. Most of the time only one thread will be running while the others sleep.
Sources:
https://github.com/tensorflow/tensorflow/issues/42510
https://github.com/tensorflow/tensorflow/issues/33627
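Complementing taskset, the sched_setaffinity call linked above is also exposed in Python's standard library; a minimal Linux-only sketch:
import os
# restrict the current process (pid 0 = self) to CPU core 0,
# ideally before TensorFlow spawns its thread pools
os.sched_setaffinity(0, {0})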
You can restrict the number of devices of a certain type that TensorFlow uses by passing the appropriate device_count in a ConfigProto as the config argument when creating your session. For instance, you can restrict the number of CPU devices as follows:
config = tf.ConfigProto(device_count={'CPU': 1})
sess = tf.Session(config=config)
with sess.as_default():
    print(tf.constant(42).eval())