GPUDirect RDMA out-of-range pinned address on Quadro P620

I want to implement FPGA-GPU RDMA with an NVIDIA Quadro P620.
I also use the common PCIe BAR resources (BAR0, BAR1, BAR2) in my custom driver for the FPGA registers and other chunk-controller handling, which is independent of the RDMA path.
The PCIe management works fine, but direct memory access to the pinned GPU RAM always goes wrong. Specifically, nvidia_p2p_get_pages() always returns, without any error, 64 KB pinned pages whose addresses start at 2955739136 (~2.7 GB), but the Quadro P620 only has 2 GB of RAM!
The virtual address obtained from cuMemAlloc() changes every time (which is correct), and I pass this address, together with the allocated size, to my driver via an ioctl system call. I also linked my custom driver against the NVIDIA driver as described in the NVIDIA GPUDirect RDMA documentation.
So everything seems fine, yet the physical addresses are out of range. Why? Is it a requirement that the Quadro GPU have 4 GB of RAM (address space) or more?
I expect to find the right solution to get the correct physical addresses and then DMA the data with the FPGA bus master.
Thanks
P.S. Before this, I implemented FPGA direct memory access to system RAM over PCIe without any problems.
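For reference, a stripped-down version of the kernel-side pinning path in my driver (names simplified, error handling trimmed; the call matches the classic nvidia_p2p_get_pages() signature from nv-p2p.h, so check it against your driver version):

#include <linux/module.h>
#include "nv-p2p.h"                      /* from the NVIDIA driver sources */

#define GPU_PAGE_SHIFT 16                /* GPUDirect pages are 64 KB */
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)

static void free_callback(void *data)
{
    /* Called by the NVIDIA driver if it has to tear the mapping down. */
}

/* cuda_va and size arrive from user space through the ioctl. */
static int pin_gpu_buffer(u64 cuda_va, u64 size,
                          struct nvidia_p2p_page_table **pt)
{
    u64 va  = cuda_va & ~(GPU_PAGE_SIZE - 1);             /* 64 KB aligned */
    u64 len = ALIGN(size + (cuda_va - va), GPU_PAGE_SIZE);
    int ret, i;

    /* Tokens are 0 with the current (token-less) API. */
    ret = nvidia_p2p_get_pages(0, 0, va, len, pt, free_callback, NULL);
    if (ret)
        return ret;

    for (i = 0; i < (*pt)->entries; i++)
        pr_info("page %d -> 0x%llx\n", i,
                (*pt)->pages[i]->physical_address);
    return 0;
}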

Related

How did the first GPUs get support from CPUs?

I imagine CPUs have to have features that allow them to communicate and work with the GPU, and I can imagine this exists today, but in the early days of GPUs, how did graphics companies get the large CPU companies to support their devices, and what features did CPU companies add to enable this?
You mean special support beyond just being devices on a bus like PCI? (Or even older, ISA or VLB.)
TL;DR: All the special CPU features that help with bandwidth for writing (and sometimes reading) video memory came after 3D graphics cards were commercially successful. They weren't necessary, just a performance boost.
Once GPUs were commercially successful and popular, and a necessary part of a gaming PC, it made obvious sense for CPU vendors to add features to make things better.
The same IO busses that let you plug in a sound card or network card already have the capabilities to access device memory and MMIO, and device IO ports, which is all that's necessary for video drivers to make a graphics card do things.
Modern GPUs are often the highest-bandwidth devices in a system (especially non-servers), so they benefit from fast buses, hence AGP for a while, until PCI Express (PCIe) unified everything again.
Anyway, graphics cards could work on standard busses; it was only once 3D graphics became popular and commercially important (and fast enough for the PCI bus to be a bottleneck), that things needed to change. At that point, CPU / motherboard companies were fully aware that consumers cared about 3D games, and thus it would make sense to develop a new bus specifically for graphics cards.
(Along with a GART, graphics address/aperture remapping table, an IOMMU that made it much easier / safer for drivers to let an AGP or PCIe video card read directly from system memory. Including I think with addresses under control of user-space, without letting user-space read arbitrary system memory, thanks to it being an IOMMU that only allows a certain address range.)
Before the GART was a thing, I assume drivers for PCI GPUs needed to have the host CPU initiate DMA to the device. Or if bus-master DMA by the GPU did happen, it could read any byte of physical memory in the system if it wanted, so drivers would have to be careful not to let programs pass arbitrary pointers.
Anyway, having a GART was new with AGP, which post-dates early 3D graphics cards like 3dfx's Voodoo and ATI 3D Rage. I don't know enough details to be sure I'm accurately describing the functionality a GART enables.
So most of the support for GPUs was in terms of busses, and thus a chipset thing, not CPUs proper. (Back then, CPUs didn't have integrated memory controllers, instead just talking to the chipset northbridge over a frontside bus.)
Relevant CPU instructions included Intel's SSE and SSE2 instruction sets, which added streaming (NT = non-temporal) stores that are good for storing large amounts of data that won't be re-read by the CPU any time soon, if at all.
SSE4.1 in 2nd-gen Core2 (2008 ish) added a streaming load instruction (movntdqa) which (still) only does anything special if used on memory regions marked in the CPU's page tables or MTRR as WC (aka USWC: uncacheable, write-combining). Copying back from GPU memory to the host was the intended use-case. (Non-temporal loads and the hardware prefetcher, do they work together?)
The MTRRs (Memory Type Range Registers), which x86 CPUs introduced, are another feature that improved CPU -> GPU write bandwidth. Again, this came after 3D graphics were commercially successful for gaming.
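To make the NT-store idea concrete, a copy loop into a WC-mapped framebuffer might look roughly like this (a minimal sketch, assuming SSE2, 16-byte-aligned pointers, and a size that is a multiple of 16 bytes):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Copy into a write-combining (WC) mapping with non-temporal stores: the
 * writes bypass the cache and combine into full-line bursts on the bus. */
static void copy_to_wc(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* ordinary cached load */
        _mm_stream_si128(&d[i], v);         /* NT store, bypasses cache */
    }
    _mm_sfence();   /* make the NT stores globally visible before signalling */
}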

Why is the maximum number of PCIe lanes in your system set by your CPU, even with DMA being widespread?

There are a few scenarios I'm curious about:
a transfer from GPU1 memory to GPU2 memory over the PCI bus
a transfer from GPU1 to main memory with DMA
a transfer from GPU1 to main memory without DMA
Will all these scenarios be limited to the total number of PCIe lanes supported by the CPU? For Intel systems, ARM systems?
Will all these scenarios be limited to the total number of PCIe lanes supported by the CPU?
PCIe is not precisely a bus -- certainly not in the way that PCI or ISA were, for instance. It's a set of point-to-point connections between peripherals and the PCIe root complex (which is usually the CPU itself). Any given root complex will support some fixed number of PCIe lanes, each of which is connected to one device. (Often in sets. For instance, it's typical to connect 16 PCIe lanes to most GPUs.)
So, yes. Any communications between PCIe devices, or between devices and memory, must pass through the CPU, and will be limited by the number of PCIe lanes the device (or devices) have connecting them to the root complex.
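If you want to check what link a particular device actually negotiated, the standard sysfs attributes expose it; a small userspace sketch (the 0000:01:00.0 address is just a placeholder):

#include <stdio.h>

static void show(const char *attr)
{
    char path[256], buf[64];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/0000:01:00.0/%s", attr);
    f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%s: %s", attr, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("current_link_width");   /* e.g. 16 for a GPU in an x16 slot */
    show("current_link_speed");
    show("max_link_width");
    return 0;
}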

Direct data copy between devices

I am trying to explore the possibility of achieving a global IO space across devices (GPUs, NICs, storage, etc.). This might boil down to the question asked in this thread - Direct communication between two PCI devices.
I have been reading up on Nvidia GPUDirect, where a memory region is pinned and its physical address is obtained with the help of the nvidia_p2p_* calls. I can't exactly understand how the GPU's physical address can be used to program the 3rd-party device's DMA controller for data transfers. I am confused by the fact that GPU memory is not visible, unlike the CPU memory space (this may be due to my poor knowledge of programming DMA controllers). Any pointers on this would be really helpful.
Also, many PCI devices expose memory regions as PCI BARs (e.g. GPUs expose a memory region of 256 MB). Is there any way to know which physical addresses this BAR memory region maps to? Is there any overlap between the BAR memory regions and the memory allocated via the Nvidia driver to the CUDA runtime?
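For illustration, the BAR ranges I am talking about are the ones a Linux driver sees through the standard pci_resource_* helpers; a rough sketch of a probe callback that prints them (device IDs and error handling omitted):

#include <linux/module.h>
#include <linux/pci.h>

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int bar;

    for (bar = 0; bar < 6; bar++) {               /* standard BARs 0..5 */
        resource_size_t start = pci_resource_start(pdev, bar);
        resource_size_t len   = pci_resource_len(pdev, bar);

        if (!len)
            continue;
        dev_info(&pdev->dev, "BAR%d: start %pa, size %pa, %s\n",
                 bar, &start, &len,
                 (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) ?
                 "MEM" : "IO");
    }
    return 0;
}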
Thanks in advance.

How does GPUDirect enforce isolation on a shared device

I have been reading about GPUDirect here: https://developer.nvidia.com/gpudirect
In their example there is a network card attached to the PCIe bus together with two GPUs and a CPU.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Is the network device using some kind of SR-IOV mechanism to enforce isolation?
I believe you're talking about rDMA, which was supported with the second release of GPUDirect. It's where the NIC can send/receive data to and from machines external to the host, and it uses peer-to-peer DMA transfers to interact with the GPU's memory.
nVidia exports a variety of functions to kernel space that allow programmers to look up where physical pages reside on the GPU itself and map them manually. nVidia also requires the use of physical addressing within kernel space, which greatly simplifies how other [3rd party] drivers interact with GPUs -- through the host machine's physical address space.
"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."
-nVidia, Design Considerations for rDMA and GPUDirect
As a result of nVidia requiring a physical addressing scheme, all IOMMUs must be disabled in the system, as these would alter the way each card views the memory space(s) of other cards. Currently, nVidia only supports physical addressing for rDMA+GPUDirect in kernel space. Virtual addressing is possible via their UVA, made available to user space.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Yes. In kernel space, each GPU's memory is being accessed by its physical address.
Is the network device using some kind of SR-IOV mechanism to enforce isolation?
The driver of the network card is what does all of the work in setting up descriptor lists and managing concurrent access to resources -- which would be the GPU's memory in this case. As I mentioned above, nVidia gives driver developers the ability to manage physical memory mappings on the GPU, allowing the 3rd party's NIC driver to control what resource(s) are or are not available to remote machines.
From what I understand about NIC drivers, I believe this to be a very rough outline of what's going on under the hood, relating to rDMA and GPUDirect:
Network card receives an rDMA request (whether it be reading or writing).
The network card's driver receives an interrupt (or its polling mechanism detects) that data has arrived.
The driver processes the request; any address translation is performed now, since all memory mappings for the GPUs are made available to kernel space. Additionally, the driver will more than likely have to configure the network card itself to prep for the transfer (e.g. set up specific registers, determine addresses, create descriptor lists, etc.) -- see the sketch after this outline.
The DMA transfer is initiated and the network card reads data directly from the GPU.
This data is then sent over the network to the remote machine.
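Very rough sketch of the descriptor setup in step 3, with a completely made-up descriptor layout (struct my_dma_desc and its fields are hypothetical; the real format is device-specific), just to show where the addresses returned by nvidia_p2p_get_pages() end up:

/* One entry of a hypothetical NIC/FPGA DMA descriptor ring. */
struct my_dma_desc {
    u64 bus_addr;   /* address the DMA engine reads/writes on the PCIe bus */
    u32 len;
    u32 flags;
};

static void build_descriptors(struct nvidia_p2p_page_table *pt,
                              struct my_dma_desc *ring)
{
    u32 i;

    for (i = 0; i < pt->entries; i++) {
        ring[i].bus_addr = pt->pages[i]->physical_address;
        ring[i].len      = 64 * 1024;   /* GPUDirect pages are 64 KB */
        ring[i].flags    = 0;
    }
}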
All remote machines requesting data via rDMA will use that host machine's physical addressing scheme to manipulate memory. If, for example, two separate computers wish to read the same buffer from a third computer's GPU with rDMA+GPUDirect support, one would expect the incoming read requests' offsets to be the same. The same goes for writing; however, an additional problem is introduced if multiple DMA engines are set to manipulate data in overlapping regions. This concurrency issue should be handled by the 3rd party NIC driver.
On a very related note, another post of mine has a lot of information regarding nVidia's UVA (Unified Virtual Addressing) scheme and how memory manipulation from within kernel space itself is handled. A few of the sentences in this post were grabbed from it.
Short answer to your question: if by "isolated" you mean how each card preserves its own unique address space for rDMA+GPUDirect operations, this is accomplished by relying on the host machine's physical address space, which fundamentally separates the physical address space(s) requested by all devices on the PCI bus. By forcing the use of each host machine's physical addressing scheme, nVidia essentially isolates each GPU in that host machine.

Userspace PCI BAR access returns 0xFF at every offset

I am trying to access a PCI BAR (#5) of a PCIe SATA bridge from userspace, but whenever I mmap() the BAR via /sys/bus/pci/devices/.../resource5, I get 0xFF at every offset in the file. Other devices such as an Intel SATA controller respond with sensible data.
The BAR is shown in lspci -vv just the same as for the Intel controller (only the address is different).
Region 5: Memory at f7b10000 (32-bit, non-prefetchable) [size=2K]
Both devices are matched by the ahci driver, and the SATA controller works otherwise - I can access the attached disks.
I am trying to access from user space because I just want to poke at the registers experimentally for now. To do this, I am using a modified form of pcimem, changed to access the registers I care about. However, any offset returns 0xFF, so even with plain pcimem:
pcimem /sys/bus/pci/devices/0000\:01\:00.0/resource5 0 w
returns 0xFFFFFFFF ("w" indicates a "word" read, hence the 4 bytes).
What is preventing BAR5 of this device from being accessible, when the BARs of other devices are? Does it even make sense to make this kind of userspace access to PCI BARs?
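For reference, the access path boils down to roughly this (a minimal sketch of what pcimem does under the hood; error handling trimmed, path hard-coded to the device above):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource5",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map one page of the BAR; offsets passed to mmap must be page-aligned. */
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* 0xFFFFFFFF here is what I see; the working controller returns real data. */
    printf("offset 0x0 = 0x%08x\n", bar[0]);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}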
Not certain how helpful this is going to be, but I saw the same thing when I was writing a driver for a new PCIe FPGA device.
The MMAP-ed PCI BAR memory space would return 0xFFFFFFFF (-1) when there was an error on the device. The only way I was able to resolve this was to reset the card by a full power reset of the computer.
I've encountered the same issue while debugging hotplug of NVMe drives.
If there was a drive in the PCIe slot before the GRUB prompt, you can hotplug other drives in that slot; if not, pci_ioremap_bar() returns a memory region that reads all 0xFFFFFFFF.
Strange.