Why is the maximum number of PCIe lanes possible in your system set by your CPU, even with DMA being widespread?

There are a few scenarios I'm curious about:
a transfer from GPU1 memory to GPU2 memory over PCIe
a transfer from GPU1 to main memory with DMA
a transfer from GPU1 to main memory without DMA
Will all these scenarios be limited to the total number of PCIe lanes supported by the CPU? Does the answer differ between Intel systems and ARM systems?

Will all these scenarios be limited to the total number of PCIe lanes supported by the CPU?
PCIe is not precisely a bus -- certainly not in the way that PCI or ISA were, for instance. It's a set of point-to-point connections between peripherals and the PCIe root complex (which is usually part of the CPU itself). Any given root complex supports some fixed number of PCIe lanes, each of which connects to one device, often in groups: for instance, it's typical to connect a GPU with 16 lanes.
So, yes. Any communication between PCIe devices, or between a device and memory, must pass through the root complex, and will be limited by the number of PCIe lanes the device (or devices) have connecting them to it.

Related

GPUDirect RDMA out-of-range pinned address with Quadro P620

I want to implement FPGA-GPU RDMA with an NVIDIA Quadro P620.
In my custom driver I also use the common PCIe BAR resources (BAR0, BAR1, BAR2) for FPGA registers and other chunk-controller handling, which is independent of the RDMA path.
The PCIe management parts are OK, but direct memory access to the pinned GPU RAM is always wrong. Specifically, I always get 64 KB pinned pages starting from address 2955739136 (~2.7 GB) from the nvidia_p2p_get_pages() API without any errors, but the Quadro P620's RAM capacity is just 2 GB!
The virtual address obtained from cuMemAlloc() changes every time (which is correct), and I pass this address, together with the allocated size, to my driver via an ioctl syscall. I also linked my custom driver against the NVIDIA driver as the NVIDIA GPUDirect RDMA documentation describes.
Everything seems OK, but the physical addresses are out of range! Why? Is a Quadro GPU with 4 GB or more of addressable RAM required?
I'm looking for a way to get the correct physical addresses so I can DMA the data with the FPGA bus master.
Thanks
P.S. Before this I implemented FPGA direct memory access to system RAM over PCIe without any problems.

How did the first GPUs get support from CPUs?

I imagine CPUs have to have features that allow them to communicate and work with the GPU, and I can imagine this exists today. But in the early days of GPUs, how did companies get support from the large CPU companies to have their devices supported, and what features did CPU vendors add to enable this?
You mean special support beyond just being devices on a bus like PCI? (Or even older, ISA or VLB.)
TL:DR: All the special features CPUs have which are useful for improved bandwidth to write (and sometimes read) video memory came after 3D graphics cards were commercially successful. They weren't necessary, just a performance boost.
Once GPUs were commercially successful and popular, and a necessary part of a gaming PC, it made obvious sense for CPU vendors to add features to make things better.
The same I/O buses that let you plug in a sound card or network card already have the capability to access device memory, MMIO, and device I/O ports, which is all a video driver needs to make a graphics card do things.
Modern GPUs are often the highest-bandwidth devices in a system (especially non-servers), so they benefit from fast buses, hence AGP for a while, until PCI Express (PCIe) unified everything again.
Anyway, graphics cards could work on standard buses; it was only once 3D graphics became popular and commercially important (and fast enough for the PCI bus to be a bottleneck) that things needed to change. At that point, CPU / motherboard companies were fully aware that consumers cared about 3D games, and thus it made sense to develop a new bus specifically for graphics cards.
(Along with a GART, graphics address/aperture remapping table, an IOMMU that made it much easier / safer for drivers to let an AGP or PCIe video card read directly from system memory. Including I think with addresses under control of user-space, without letting user-space read arbitrary system memory, thanks to it being an IOMMU that only allows a certain address range.)
Before the GART was a thing, I assume drivers for PCI GPUs needed to have the host CPU initiate DMA to the device. Or if bus-master DMA by the GPU did happen, it could read any byte of physical memory in the system if it wanted, so drivers would have to be careful not to let programs pass arbitrary pointers.
Anyway, having a GART was new with AGP, which post-dates early 3D graphics cards like 3dfx's Voodoo and ATI 3D Rage. I don't know enough details to be sure I'm accurately describing the functionality a GART enables.
So most of the support for GPUs was in terms of buses, and thus a chipset thing, not CPUs proper. (Back then, CPUs didn't have integrated memory controllers, instead just talking to the chipset northbridge over a frontside bus.)
Relevant CPU instructions included Intel's SSE and SSE2 instruction sets, which had streaming (NT = non-temporal) stores which are good for storing large amounts of data that won't be re-read by the CPU any time soon, if at all.
SSE4.1 in 2nd-gen Core2 (2008 ish) added a streaming load instruction (movntdqa) which (still) only does anything special if used on memory regions marked in the CPU's page tables or MTRR as WC (aka USWC: uncacheable, write-combining). Copying back from GPU memory to the host was the intended use-case. (Non-temporal loads and the hardware prefetcher, do they work together?)
MTRRs (Memory Type Range Registers), which let the OS mark video memory as write-combining (WC), are another x86 feature that improved CPU -> GPU write bandwidth. Again, this came after 3D graphics were commercially successful for gaming.

Can a non-enumerated device conduct DMA operations?

PCIe devices can read or write to memory, i.e. can do DMA without requiring a device driver.
If I remember correctly, if you flash a device's firmware (let's say an FPGA device) with 0xFFFF as the device and vendor ID, the device won't be enumerated by the BIOS.
I am wondering if a PCIe device can conduct DMA operations (memory reads and writes) via bus mastering even when it is not enumerated by the BIOS.
A PCIe device can only do DMA if Bus Master Enable (BME) is set in the command register. BME will only be set when a driver is active.

Setting USB communication speed

I would like to implement USB communication at a speed of 30 Mbit/s. My hardware supports high-speed USB, so the hardware platform will not limit me.
Can I implement this speed using the USB CDC class or the mass storage class, or are these USB classes speed-limited?
In the USB protocol, who determines the bit rate? Is it the device?
The USB CDC and mass storage classes do not have any kind of artificial speed limiting, so you can probably get a throughput of 30 Mbps on a high-speed USB connection (which signals at 480 Mbps on the wire). The throughput you get will be determined by how much bus bandwidth is being used by other devices and how efficiently your device-side firmware, host-side driver, and host-side software operate.
The bit rate is mostly determined by the device. The device basically signals to the host what USB speeds it supports, and the host picks one. The full story is a little bit more complicated, and there are a lot more details about how that works in the USB specification.

Direct data copy between devices

I am trying to explore the possibility of achieving a global I/O space across devices (GPUs, NICs, storage, etc.). This might boil down to the question asked in this thread - Direct communication between two PCI devices.
I have been reading up on NVIDIA GPUDirect, where a memory region is pinned and its physical address obtained with the help of the nvidia_p2p_* calls. I can't quite understand how the GPU's physical address can be used to program a third-party device's DMA controller for data transfers. I am confused by the fact that GPU memory is not visible the way CPU memory space is (this may be due to my poor knowledge of programming DMA controllers). Any pointers on this would be really helpful.
Also, many PCI devices expose memory regions as PCI BARs (e.g. GPUs expose a memory region of 256M). Is there any way to know which device physical addresses this BAR memory region maps to? Is there any overlap between the BAR memory regions and the memory allocated via the NVIDIA driver to the CUDA runtime?
Thanks in advance.