Direct data copy between devices - gpu

I am trying to explore the possibility of achieving a global IO space across devices (GPUs, NICs, storage, etc.). This might boil down to the question asked in this thread: Direct communication between two PCI devices.
I have been reading up on Nvidia GPUDirect, where a memory region is pinned and its physical address is obtained with the help of the nvidia_p2p_* calls. I can't quite understand how the GPU's physical address can be used to program a third-party device's DMA controller for data transfers. I am confused by the fact that GPU memory is not visible in the way CPU memory is (this may be due to my limited knowledge of programming DMA controllers). Any pointers on this would be really helpful.
Also, many PCI devices expose memory regions as PCI BARs (e.g. GPUs expose a memory region of 256 MB). Is there any way to find out which device physical addresses this BAR memory region maps to? Is there any overlap between the BAR memory regions and the memory the Nvidia driver allocates for the CUDA runtime?
Thanks in advance.
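On the BAR part of the question, here is a minimal sketch, assuming a Linux host. The kernel exposes each PCI device's BAR ranges in /sys/bus/pci/devices/<BDF>/resource, one line per resource with three hex fields (start, inclusive end, flags). The function name and the sample text below are made up for illustration. Note that these are host-side bus addresses of the BAR apertures; they say nothing about which device-internal addresses the aperture is currently mapped to, which is device-specific.

```python
def parse_pci_resource(text):
    """Parse the text of a Linux sysfs PCI 'resource' file.

    Each line holds three hex fields: start, end (inclusive), flags.
    Unused entries are all zeros and are skipped.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused resource slot
        regions.append(
            {"index": index, "start": start, "size": end - start + 1, "flags": flags}
        )
    return regions


# Made-up sample in the sysfs format (not taken from a real device).
SAMPLE = """0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x00000000e0000000 0x00000000efffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

for region in parse_pci_resource(SAMPLE):
    print(region["index"], hex(region["start"]), region["size"] // (1024 * 1024), "MiB")
```

On a real system the text would come from reading the resource file of the device in question rather than from SAMPLE.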

Related

GPUDirect RDMA out of range pin address by Quadro p620

I want to implement FPGA-GPU RDMA with an Nvidia Quadro P620.
I also use the usual PCIe BAR resources (BAR0, BAR1, BAR2) for the FPGA registers and other chunk-controller handling, which is independent of RDMA in my custom driver.
PCIe management works, but direct memory access to the pinned GPU RAM is always wrong. Specifically, nvidia_p2p_get_pages() always returns 64 KB pinned pages starting at address 2955739136 (~2.75 GiB) without any error, but the Quadro P620 has only 2 GB of RAM!
The virtual address obtained from cuMemAlloc() changes on every run (which is expected), and I pass this address, together with the allocated size, to my driver through an ioctl system call. I also linked my custom driver against the Nvidia driver, as the Nvidia GPUDirect RDMA documentation describes.
Everything else looks fine, but the physical addresses are out of range. Why? Is a Quadro GPU with 4 GB of RAM or more required?
I hope to find the right way to get the correct physical addresses and then DMA the data with the FPGA as bus master.
Thanks
P.S. Before this, I implemented FPGA direct memory access to system RAM over PCIe without any problems.
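A hedged reading of the numbers in this question: NVIDIA's GPUDirect RDMA documentation describes pinned GPU memory as being exposed through a PCI BAR of the GPU, so the values in the page table would be bus addresses inside that BAR aperture rather than offsets into the 2 GB of on-board RAM. The arithmetic below only checks that the reported address is consistent with that reading. The BAR start and size are hypothetical placeholders; the real ones would have to be read from the system (for example with lspci).

```python
PAGE_64K = 64 * 1024
GIB = 1024 ** 3

reported = 2955739136  # first address reported in the question


def is_64k_aligned(addr):
    return addr % PAGE_64K == 0


def in_range(addr, start, size):
    return start <= addr < start + size


# Hypothetical BAR aperture, for illustration only.
ASSUMED_BAR_START = 0xB0000000
ASSUMED_BAR_SIZE = 256 * 1024 * 1024

print(hex(reported))                    # 0xb02d0000
print(is_64k_aligned(reported))         # True
print(reported > 2 * GIB)               # True: beyond the 2 GiB of on-board RAM
print(in_range(reported, ASSUMED_BAR_START, ASSUMED_BAR_SIZE))
```

If the address falls inside the GPU's actual BAR range, it is not "out of range" at all; it is simply in a different address space from the one the VRAM size refers to.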

How did the first GPUs get support from CPUs?

I imagine CPUs have to have features that allow them to communicate and work with the GPU, and I can imagine this exists today. But in the early days of GPUs, how did companies get large CPU companies to support their devices, and what features did CPU companies add to enable this?
You mean special support beyond just being devices on a bus like PCI? (Or even older, ISA or VLB.)
TL;DR: All the special features CPUs have which are useful for improved bandwidth to write (and sometimes read) video memory came after 3D graphics cards were commercially successful. They weren't necessary, just a performance boost.
Once GPUs were commercially successful and popular, and a necessary part of a gaming PC, it made obvious sense for CPU vendors to add features to make things better.
The same IO busses that let you plug in a sound card or network card already have the capabilities to access device memory and MMIO, and device IO ports, which is all that's necessary for video drivers to make a graphics card do things.
Modern GPUs are often the highest-bandwidth devices in a system (especially non-servers), so they benefit from fast buses, hence AGP for a while, until PCI Express (PCIe) unified everything again.
Anyway, graphics cards could work on standard busses; it was only once 3D graphics became popular and commercially important (and fast enough for the PCI bus to be a bottleneck), that things needed to change. At that point, CPU / motherboard companies were fully aware that consumers cared about 3D games, and thus it would make sense to develop a new bus specifically for graphics cards.
(Along with a GART, a graphics address/aperture remapping table: an IOMMU that made it much easier and safer for drivers to let an AGP or PCIe video card read directly from system memory. I think this includes using addresses under the control of user-space, without letting user-space read arbitrary system memory, because the IOMMU only allows a certain address range.)
Before the GART was a thing, I assume drivers for PCI GPUs needed to have the host CPU initiate DMA to the device. Or if bus-master DMA by the GPU did happen, it could read any byte of physical memory in the system if it wanted, so drivers would have to be careful not to let programs pass arbitrary pointers.
Anyway, having a GART was new with AGP, which post-dates early 3D graphics cards like 3dfx's Voodoo and ATI 3D Rage. I don't know enough details to be sure I'm accurately describing the functionality a GART enables.
So most of the support for GPUs was in terms of busses, and thus a chipset thing, not CPUs proper. (Back then, CPUs didn't have integrated memory controllers, instead just talking to the chipset northbridge over a frontside bus.)
Relevant CPU instructions included Intel's SSE and SSE2 instruction sets, which had streaming (NT = non-temporal) stores which are good for storing large amounts of data that won't be re-read by the CPU any time soon, if at all.
SSE4.1 in 2nd-gen Core2 (2008 ish) added a streaming load instruction (movntdqa) which (still) only does anything special if used on memory regions marked in the CPU's page tables or MTRRs as WC (aka USWC: uncacheable speculative write-combining). Copying back from GPU memory to the host was the intended use-case. (See also: Non-temporal loads and the hardware prefetcher, do they work together?)
The introduction of MTRRs (Memory Type Range Registers) on x86 CPUs is another feature that improved CPU -> GPU write bandwidth. Again, this came after 3D graphics were commercially successful for gaming.

Cortex-M3 External RAM Region

I'm currently researching topics such as RAM/ROM/Stack/Heap and data segments etc.
I was looking at the ARM Cortex-M3 memory map and saw the region labeled "External RAM".
According to the data sheet of a random Cortex-M3 STM32 MCU, the external RAM region is mapped from 0x60000000 to 0x9FFFFFFF, so it is quite large!
I couldn't find a definitive answer about how this region is actually used.
I imagine you would have an external SRAM and you would choose between two options.
(1) Read via the SPI interface and place into a local buffer(stack), then load that local buffer into the external ram region. This option seems to have a lot of negative consequences, such as hogging the CPU and increasing the stack temporarily if the requested data is very large.
(2) Utilize a DMA and transfer from the SPI interface into the external ram region.
Now I can't understand, why you would map the data to this specific address range, what are the advantages, why don't you just place the data directly in that huge memory region?
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
-Edit-
In the data sheet linked for the STM32 device, the memory region "External RAM" is marked as reserved. It is my conclusion that the memory regions listed by ARM is showing the full potential of a 32bit MCU, as I incorrectly state that the external RAM region "is quite large!" does not necessarily mean that this is "real" size of that region, if it is even used, it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to be competitive on price, power consumption etc.
I imagine you would have an external SRAM and you would choose between two options.
(1) Read via the SPI interface and place into a local buffer(stack), then load that local buffer into the external ram region. This option seems to have a lot of negative consequences, such as hogging the CPU and increasing the stack temporarily if the requested data is very large.
(2) Utilize a DMA and transfer from the SPI interface into the external ram region.
None of the above. External memory on an SPI bus is not memory mapped. If you have an SPI memory, it is not mapped to that region; it is simply an SPI device, and the "address" is simply an offset from the start of the memory device itself. On MCUs with a Quad- or Octo-SPI controller, the attached memory can be memory mapped. QSPI RAM is not that common and is relatively expensive; QSPI is more commonly used for flash memory.
The external memory region can be used by STM32 parts with an FSMC (Flexible Static Memory Controller), an FMC (Flexible Memory Controller), or a QSPI interface. The FMC additionally supports SDRAM and is generally available on the higher-end parts. Apart from QSPI and NAND flash, these interfaces require using the GPIO EMIF (external memory interface) alternate function to create an address and data bus, so they generally require parts with a high pin count. The EMIF can be configured for an 8-, 16- or 32-bit data bus; the narrower widths reduce the pin count at the cost of slower access.
Now I can't understand, why you would map the data to this specific address range, what are the advantages, why don't you just place the data directly in that huge memory region?
Since it was precipitated by your earlier misconception, this question is perhaps redundant, but memory that exists in the memory map can be used to store data accessed as regular variables, rather than transferring to and from internal buffers, and it can be used as an execution region: code can be loaded to and executed directly from such memory.
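To make the distinction concrete, here is a small sketch that models the default ARMv7-M (Cortex-M3) address map and shows how a byte in a memory-mapped external device gets a CPU address, whereas an SPI memory only ever has a device-local offset. The function names and the choice of window base are illustrative assumptions; which windows actually exist depends on the part and the board.

```python
# Default ARMv7-M (Cortex-M3) address map: (first address, last address, name).
REGIONS = [
    (0x00000000, 0x1FFFFFFF, "Code"),
    (0x20000000, 0x3FFFFFFF, "SRAM"),
    (0x40000000, 0x5FFFFFFF, "Peripheral"),
    (0x60000000, 0x9FFFFFFF, "External RAM"),
    (0xA0000000, 0xDFFFFFFF, "External device"),
    (0xE0000000, 0xFFFFFFFF, "System"),
]

EXTERNAL_RAM_BASE = 0x60000000


def region_of(addr):
    """Name of the architectural region that contains a 32-bit address."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    raise ValueError("address outside the 32-bit address space")


def region_size(name):
    """Size in bytes of the named region."""
    for first, last, region in REGIONS:
        if region == name:
            return last - first + 1
    raise KeyError(name)


def mapped_address(device_offset, window_base=EXTERNAL_RAM_BASE):
    """CPU address of the byte at device_offset in a memory-mapped device.

    window_base is an assumed base for the mapping window (illustrative).
    """
    return window_base + device_offset


print(region_size("External RAM") // 1024 ** 3, "GiB")
print(hex(mapped_address(0x1000)), region_of(mapped_address(0x1000)))
```

The 1 GiB figure is the size of the address window the architecture reserves, not the amount of memory a given board has behind it.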
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
Self awareness is a skill. That is known as conscious incompetence and is a motivator for learning.
It is my conclusion that the memory regions listed by ARM is showing the full potential of a 32bit MCU, as I incorrectly state that the external RAM region "is quite large!" does not necessarily mean that this is "real" size of that region, if it is even used, it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to be competitive on price, power consumption etc.
No, it is largely about the number of pins available for an address bus (except for QSPI). The external memory is a matter for the board design - it is not something the MCU vendor decides must be present. The constraint is a maximum, not a required amount of physical memory. The STM32 FMC supports the following memory sizes/types:
So you can have up to 512 MB of SDRAM, for example. The space available for static memories (NOR/PSRAM/SRAM) is significantly larger than the typical size of such memories.

What is internal memory? Where can I find it?

Today I have to make a presentation about RAM, ROM and internal memory. The problem is that internal memory seems to be literally RAM and ROM. I asked my teacher, and he said it is and it is not; he also mentioned that there is internal memory in the CPU itself, or something like that. Any resources about internal memory and which parts of a computer contain it would be appreciated. If you could point me to the internal memory in a CPU I would love that, because I couldn't find it anywhere!
Some microcontrollers come with some internal RAM that's architecturally usable as memory for loads/stores. (System on chip where external RAM is optional.) http://www.avr-asm-tutorial.net/avr_en/beginner/SRAM.html shows how AVR maps SRAM to the low end of physical address space. And that the very bottom of physical address space aliases the registers! (That's unusual, most ISAs don't memory-map the register file.)
But more generally, caches and physical register files are SRAM arrays, so in most CPUs you have internal RAM that's not architecturally visible as "memory".

How does GPUDirect enforce isolation on a shared device

I have been reading about GPUDirect here: https://developer.nvidia.com/gpudirect
In their example there is a network card attached to the PCIe bus together with two GPUs and a CPU.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Is the network device using some kind SR-IOV mechanism to enforce isolation?
I believe you're talking about rDMA, which was supported with the second release of GPUDirect. It's where the NIC can send/receive data external to the host machine and uses peer-to-peer DMA transfers to interact with the GPU's memory.
nVidia exports a variety of functions to kernel space that allow for programmers to look up where physical pages reside on the GPU, itself, and map them manually. nVidia also requires the use of physical addressing within kernel space, which greatly simplifies how other [3rd party] drivers interact with GPU's -- through the host machine's physical address space.
"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."
-nVidia, Design Considerations for rDMA and GPUDirect
As a result of nVidia requiring a physical addressing scheme, all IOMMUs must be disabled in the system, as these would alter the way each card views the memory space(s) of other cards. Currently, nVidia only supports physical addressing for rDMA+GPUDirect in kernel-space. Virtual addressing is possible via their UVA, made available to user space.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Yes. In kernel space, each GPU's memory is accessed by its physical address.
Is the network device using some kind SR-IOV mechanism to enforce isolation?
The driver of the network card is what does all of the work in setting up descriptor lists and managing concurrent access to resources -- which would be the GPU's memory in this case. As I mentioned above, nVidia gives driver developers the ability to manage physical memory mappings on the GPU, allowing the 3rd party's NIC driver to control what resource(s) are available or not available to remote machines.
From what I understand about NIC drivers, I believe this to be a very rough outline of what's going on under the hood, relating to rDMA and GPUDirect:
1. Network card receives an rDMA request (whether it be reading or writing).
2. Network card's driver receives an interrupt that data has arrived or some polling mechanism has detected data has arrived.
3. The driver processes the request; any address translation is performed now, since all memory mappings for the GPU's are made available to kernel space. Additionally, the driver will more than likely have to configure the network card, itself, to prep for the transfer (e.g. set up specific registers, determine addresses, create descriptor lists, etc).
4. The DMA transfer is initiated and the network card reads data directly from the GPU.
5. This data is then sent over the network to the remote machine.
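The descriptor-list setup mentioned in the outline above can be sketched roughly as follows. This is a toy model, not any real NIC driver's code: the Descriptor type, the function name and the example addresses are all hypothetical. It assumes the driver already holds an ordered list of bus addresses for the pinned 64 KB GPU pages, and it turns a (byte offset, length) request into a list of contiguous DMA segments.

```python
from dataclasses import dataclass

PAGE_64K = 64 * 1024


@dataclass
class Descriptor:
    bus_addr: int  # where the DMA engine should start reading or writing
    length: int    # number of bytes in this segment


def build_descriptors(page_addrs, offset, length, page_size=PAGE_64K):
    """Translate a request on a pinned buffer into DMA segments.

    page_addrs: bus addresses of the pinned pages, in buffer order.
    offset, length: the requested byte range within the pinned buffer.
    """
    if offset < 0 or length <= 0 or offset + length > len(page_addrs) * page_size:
        raise ValueError("request outside the pinned buffer")
    descriptors = []
    position, remaining = offset, length
    while remaining > 0:
        page, within = divmod(position, page_size)
        chunk = min(remaining, page_size - within)
        addr = page_addrs[page] + within
        last = descriptors[-1] if descriptors else None
        if last is not None and last.bus_addr + last.length == addr:
            last.length += chunk  # pages are contiguous on the bus: merge
        else:
            descriptors.append(Descriptor(addr, chunk))
        position += chunk
        remaining -= chunk
    return descriptors


# Hypothetical page addresses: the first two are contiguous, the third is not.
pages = [0xB0000000, 0xB0010000, 0xB0040000]
for d in build_descriptors(pages, offset=0x8000, length=0x20000):
    print(hex(d.bus_addr), hex(d.length))
```

The bounds check is where a driver would refuse requests that reach outside the memory it has agreed to expose.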
All remote machines requesting data via rDMA will use that host machine's physical addressing scheme to manipulate memory. If, for example, two separate computers wish to read the same buffer from a third computer's GPU with rDMA+GPUDirect support, one would expect the incoming read request's offsets to be the same. The same goes for writing; however an additional problem is introduced if multiple DMA engines are set to manipulate data in overlapping regions. This concurrency issue should be handled by the 3rd party NIC driver.
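As a rough illustration of the overlapping-region problem, a driver could check pending transfers for overlapping byte ranges before starting them. The sketch below is only a model of that check; the tuple layout, the function names and the peer names are made up, and what to do on a conflict (serialize, reject, lock) is left open.

```python
from itertools import combinations


def ranges_overlap(a_start, a_len, b_start, b_len):
    """True if the half-open byte ranges [start, start + len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def conflicting_transfers(transfers):
    """transfers: list of (name, start, length); returns pairs of names that overlap."""
    return [
        (a[0], b[0])
        for a, b in combinations(transfers, 2)
        if ranges_overlap(a[1], a[2], b[1], b[2])
    ]


pending = [
    ("peer_a", 0x0000, 0x1000),
    ("peer_b", 0x0800, 0x1000),  # overlaps the second half of peer_a's range
    ("peer_c", 0x2000, 0x1000),
]
print(conflicting_transfers(pending))
```

Two concurrent reads of the same range are harmless, so a real check would presumably only care about pairs in which at least one side is a write.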
On a closely related note, another post of mine has a lot of information regarding nVidia's UVA (Unified Virtual Addressing) scheme and how memory manipulation from within kernel-space, itself, is handled. A few of the sentences in this post were taken from it.
Short answer to your question: if by "isolated" you mean how does each card preserve its own unique address-space for rDMA+GPUDirect operations, this is accomplished by relying on the host machine's physical address space which fundamentally separates the physical address space(s) requested by all devices on the PCI bus. By forcing the use of each host machine's physical addressing scheme, nVidia essentially isolates each GPU in that host machine.