Does a staging buffer perform better than a host-visible, host-coherent buffer if an application constantly does buffer streaming? - vulkan

By buffer streaming I mean that the vertex data is updated every single frame.
I can understand how a staging buffer will be fast for static data, but what if the input vertices are constantly changing? Does a staging buffer still have the advantage over host-visible, host-coherent memory,
or should I just use a staging buffer as a fallback when host-coherent memory isn't available?
Also, I have another question: can host-visible, host-coherent and device-local all be used together? What's the difference between not specifying device-local and specifying device-local in both cases?

Staging buffers give you the ability to transfer data into device visible, host non-visible memory (i.e. discrete card GRAM).
Shared host-device buffer access without staging needs memory to be both device visible and host visible.
The impact of these two approaches is really system dependent.
For integrated graphics and mobile graphics, there is generally no discrete GRAM. Nearly all available memory is therefore shared between the CPU and GPU, and therefore "host visible and device visible". On these devices there is no point using staging buffers unless you need layout changes (like texture upload/readback).
For discrete graphics, you have discrete GRAM. If you want resources to live in GRAM then use staging, because GRAM is not "host visible". If you want resources to live in system RAM then use "host visible and device visible". GPU access to system RAM will be slower than access to GRAM (it has to travel over the PCIe bus). It is also normally a smaller capacity than GRAM (it depends on the PCIe aperture size exposed by the device).
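On the last question: yes, a single memory type may advertise device-local together with host-visible and host-coherent (common on integrated GPUs, and on discrete GPUs that expose a host-visible device-local heap). A minimal sketch of the selection logic, assuming the usual vkGetPhysicalDeviceMemoryProperties flow; the helper names are illustrative, not from any library:

```c
/* Sketch: pick a memory type for a streamed vertex buffer. Prefer
 * DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, fall back to plain
 * HOST_VISIBLE | HOST_COHERENT. Helper names are hypothetical. */
#include <vulkan/vulkan.h>
#include <stdint.h>

static uint32_t find_memory_type(VkPhysicalDevice phys,
                                 uint32_t allowed_type_bits,   /* VkMemoryRequirements::memoryTypeBits */
                                 VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        int allowed = (allowed_type_bits & (1u << i)) != 0;
        int matches = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
        if (allowed && matches)
            return i;
    }
    return UINT32_MAX; /* not found */
}

static uint32_t pick_streaming_memory_type(VkPhysicalDevice phys, uint32_t allowed_type_bits)
{
    /* First choice: device-local AND host-visible/coherent (write straight into GRAM). */
    uint32_t idx = find_memory_type(phys, allowed_type_bits,
                                    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
                                    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                                    VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
    if (idx != UINT32_MAX)
        return idx;

    /* Fallback: host-visible/coherent system memory that the GPU reads over PCIe. */
    return find_memory_type(phys, allowed_type_bits,
                            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                            VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
}
```

The spec guarantees that at least one HOST_VISIBLE | HOST_COHERENT type exists, so the fallback path always has something to return (provided the buffer's memoryTypeBits allow it).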
TLDR: YMMV, benchmark both speed and capacity on the devices you care about.

Related

Why does static random access memory (SRAM) not require a memory controller?

I have been studying bootloaders, and most sources explain that ROM code is present on most chips to tell the chip where to go after it powers up, and that this ROM code then loads a small chunk of code into the SRAM.
My question is: DRAM requires a controller to run, but why doesn't SRAM? Who controls the SRAM, or how is it controlled?
Also, what happens after the system is done with the SRAM and things are running off of DRAM?
I do not know yet whether this makes sense, but it would be best if you could answer from the perspective of U-Boot and Linux.
Both need controllers; DRAM, however, needs to be refreshed periodically to keep its state (held in capacitors), unlike SRAM, which stores its state in latch circuits.
That means that if you want to keep the content of the memory after a reset (from Linux or U-Boot for example), you must have configured the DRAM controller to "auto refresh" the memory during the reset step. There is no such need with SRAM.
Generally, when you refer to SRAM from a bootloader perspective, it is internal RAM that is accessible to the controller. This RAM is accessed by the controller over an AHB/AXI bus (for ARM-based devices). There might be a memory bridge that converts signals from the AHB/AXI bus to the memory bus. From a software point of view it is transparent: no specific software configuration is required to access this RAM.
... then ROM code loads a small chunk of code into the SRAM.
That is a common procedure with some SoCs, but it's not required. There are alternate boot schemes.
Etrax SoCs that used CRIS processors (which are now out of production) required the DRAM parameters to be stored in nonvolatile memory (NVM). The embedded ROM boot code accessed this NVM, and initialized the DRAM controller. The ROM boot code was thus capable of directly booting a Linux kernel.
Some ARM SoCs have a Boot Memory Selector (BMS) pin (e.g. Atmel AT91SAM9xxx and Microchip SAMA5Dx) that can disable the internal ROM code, and has the processor execute code after a reset from an external NVM (e.g. NOR flash) which has execute-in-place (XIP) capability. Such a boot scheme could be customized to initialize the external DRAM, and then load U-Boot or even a Linux kernel.
My question is: DRAM requires a controller to run, but why doesn't SRAM?
Who controls the SRAM, or how is it controlled?
DRAM requires a controller because this type of memory technology requires periodic refreshing. The DRAM controller needs to be programmatically initialized before the DRAM can be accessed. One of the functions of the boot code that is loaded into SRAM is to perform this initialization of the DRAM controller.
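To make "programmatically initialized" concrete, here is a purely illustrative sketch of the kind of register sequence such SRAM-resident boot code performs. Every register name, address and value below is hypothetical, not from any real SoC; the real sequence comes from the SoC and DRAM datasheets.

```c
/* Hypothetical DRAM controller bring-up, run from internal SRAM or ROM.
 * All register names, offsets and values are made up for illustration. */
#include <stdint.h>

#define DRAMC_BASE        0xFFFFE800u              /* hypothetical controller base */
#define DRAMC_CFG         (*(volatile uint32_t *)(DRAMC_BASE + 0x00))
#define DRAMC_TIMING      (*(volatile uint32_t *)(DRAMC_BASE + 0x04))
#define DRAMC_REFRESH     (*(volatile uint32_t *)(DRAMC_BASE + 0x08))
#define DRAMC_STATUS      (*(volatile uint32_t *)(DRAMC_BASE + 0x0C))
#define DRAMC_STATUS_RDY  (1u << 0)

void dram_init(void)
{
    DRAMC_CFG     = 0x00000023u;     /* geometry: banks/rows/columns, bus width */
    DRAMC_TIMING  = 0x00421522u;     /* CAS latency, tRCD, tRP, ... from the DRAM datasheet */
    DRAMC_REFRESH = 0x0000030Cu;     /* auto-refresh interval in controller clocks */

    while (!(DRAMC_STATUS & DRAMC_STATUS_RDY))
        ;                            /* wait for the controller to finish DRAM init */
}
```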
Interfacing SRAM, by comparison, is far more straightforward. Normally there is no "SRAM controller": the control logic needed to interface SRAM typically does not reach the level of complexity that would require a "controller". For instance, I've used an SBC that had its Z80 microprocessor directly connected to the SRAM (HM6264) and EPROM (MBM2764) memory ICs, plus some logic for address decoding.
The "SRAM controller" found on a modern SoC is primarily a buffered interface between external SRAM and the internal system bus. The internal SRAM of the SoC does not require any software initialization and is accessible immediately after a reset.
Also, what happens after the system is done with the SRAM and things are running off of DRAM?
Typically the internal SRAM is left unused when it is not included as part of the memory that the Linux kernel manages. I don't know if that is due to any technical reasons such as virtual memory or caching issues, or oversight, or desire for the simplicity of homogeneous memory.
For some SoCs the amount of internal SRAM is so small (e.g. 8 KB in Atmel AT91SAM926x) that the effort to utilize it in the kernel could be deemed to have a poor cost-to-benefit trade-off.
See this discussion regarding a kernel patch for SRAM on an Atmel/Microchip SAMA5D3x.
A device driver could still utilize the internal SRAM as its private memory region for high-speed buffers. For instance there was a kernel patch to use the SRAM to hold Ethernet transmit packets to avoid transmit underrun errors.
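For a sense of what such a driver looks like, here is a hedged sketch using the kernel's genalloc API, assuming the SRAM is exposed through the generic mmio-sram binding and referenced via an "sram" phandle in the driver's device-tree node (names and sizes are illustrative, not taken from the patch mentioned above):

```c
/* Sketch: a driver claiming a chunk of on-chip SRAM for a DMA buffer. */
#include <linux/errno.h>
#include <linux/genalloc.h>
#include <linux/of.h>
#include <linux/platform_device.h>

struct my_priv {
    struct gen_pool *sram_pool;
    void            *tx_buf;     /* CPU address inside the SRAM */
    dma_addr_t       tx_dma;     /* bus address handed to the DMA engine */
};

static int my_sram_setup(struct platform_device *pdev, struct my_priv *priv)
{
    priv->sram_pool = of_gen_pool_get(pdev->dev.of_node, "sram", 0);
    if (!priv->sram_pool)
        return -ENODEV;

    /* Carve out 2 KiB of SRAM for, e.g., Ethernet transmit packets. */
    priv->tx_buf = gen_pool_dma_alloc(priv->sram_pool, 2048, &priv->tx_dma);
    if (!priv->tx_buf)
        return -ENOMEM;

    return 0;
}

static void my_sram_teardown(struct my_priv *priv)
{
    gen_pool_free(priv->sram_pool, (unsigned long)priv->tx_buf, 2048);
}
```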

Cortex-M3 External RAM Region

I'm currently researching topics such as RAM/ROM/Stack/Heap and data segments etc.
I was looking at the ARM Cortex-M3 memory map and saw the region labeled "External RAM".
According to the data sheet of a random Cortex-M3 STM32 MCU, the external RAM region is mapped from 0x60000000-0x9FFFFFFF, so it is quite large!
I couldn't find a definitive answer about how this region is actually used.
I imagine you would have an external SRAM and you would choose between two options.
(1) Read via the SPI interface and place into a local buffer (stack), then load that local buffer into the external RAM region. This option seems to have a lot of negative consequences, such as hogging the CPU and temporarily increasing the stack if the requested data is very large.
(2) Utilize a DMA and transfer from the SPI interface into the external RAM region.
Now I can't understand why you would map the data to this specific address range. What are the advantages? Why don't you just place the data directly in that huge memory region?
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
-Edit-
In the data sheet of the linked STM32 device, the memory region "External RAM" is marked as reserved. My conclusion is that the memory regions listed by ARM show the full potential of a 32-bit MCU. When I state that the external RAM region "is quite large!", that does not necessarily reflect the "real" size of that region, if it is even used at all; it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to stay competitive on price, power consumption, etc.
I imagine you would have an external SRAM and you would choose between two options.
(1) Read via the SPI interface and place into a local buffer (stack), then load that local buffer into the external RAM region. This option seems to have a lot of negative consequences, such as hogging the CPU and temporarily increasing the stack if the requested data is very large.
(2) Utilize a DMA and transfer from the SPI interface into the external RAM region.
None of the above. External memory on a plain SPI bus is not memory mapped. If you have an SPI memory, it is not mapped to that region; it is simply an SPI device, and the "address" is simply an offset from the start of the memory device itself. On MCUs with a Quad- or Octo-SPI controller, however, such memories can be memory mapped. QSPI RAM is not that common and is relatively expensive; QSPI is more commonly used for flash memory.
The external memory region can be used by STM32 parts with an FSMC (Flexible Static Memory Controller) or an FMC (Flexible Memory Controller), or by parts with a QSPI interface. The FMC additionally supports SDRAM and is generally available on the higher-end parts. Apart from QSPI and NAND flash, these interfaces require using the GPIO EMIF (external memory interface) alternate functions to create an address and data bus, so they generally require parts with a high pin count. The EMIF can be configured for an 8-, 16- or 32-bit data bus for reduced pin count (and slower access).
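As an illustration of what "using the FMC" looks like in practice, here is a rough sketch with the STM32Cube HAL, assuming an F7-class part and an external asynchronous 16-bit SRAM on bank 1. The timing numbers are placeholders that must come from the SRAM datasheet, and the GPIO alternate-function setup (normally done in HAL_SRAM_MspInit()) is omitted:

```c
/* Sketch: configure FMC bank 1 for an external asynchronous 16-bit SRAM. */
#include "stm32f7xx_hal.h"   /* assumption: F7 family; use your family's HAL header */

static SRAM_HandleTypeDef hsram1;

void external_sram_init(void)
{
    FMC_NORSRAM_TimingTypeDef timing = {0};

    hsram1.Instance = FMC_NORSRAM_DEVICE;
    hsram1.Extended = FMC_NORSRAM_EXTENDED_DEVICE;

    hsram1.Init.NSBank             = FMC_NORSRAM_BANK1;          /* 0x60000000 */
    hsram1.Init.DataAddressMux     = FMC_DATA_ADDRESS_MUX_DISABLE;
    hsram1.Init.MemoryType         = FMC_MEMORY_TYPE_SRAM;
    hsram1.Init.MemoryDataWidth    = FMC_NORSRAM_MEM_BUS_WIDTH_16;
    hsram1.Init.BurstAccessMode    = FMC_BURST_ACCESS_MODE_DISABLE;
    hsram1.Init.WaitSignalPolarity = FMC_WAIT_SIGNAL_POLARITY_LOW;
    hsram1.Init.WaitSignalActive   = FMC_WAIT_TIMING_BEFORE_WS;
    hsram1.Init.WriteOperation     = FMC_WRITE_OPERATION_ENABLE;
    hsram1.Init.WaitSignal         = FMC_WAIT_SIGNAL_DISABLE;
    hsram1.Init.ExtendedMode       = FMC_EXTENDED_MODE_DISABLE;
    hsram1.Init.AsynchronousWait   = FMC_ASYNCHRONOUS_WAIT_DISABLE;
    hsram1.Init.WriteBurst         = FMC_WRITE_BURST_DISABLE;

    timing.AddressSetupTime      = 2;    /* placeholder, in HCLK cycles */
    timing.AddressHoldTime       = 1;    /* placeholder */
    timing.DataSetupTime         = 4;    /* placeholder */
    timing.BusTurnAroundDuration = 1;
    timing.AccessMode            = FMC_ACCESS_MODE_A;

    HAL_SRAM_Init(&hsram1, &timing, NULL);   /* NULL: no extended-mode timing */
}
```

After this call the SRAM simply appears in the address map starting at the bank's base address.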
Now I can't understand why you would map the data to this specific address range. What are the advantages? Why don't you just place the data directly in that huge memory region?
Since it was precipitated by your earlier misconception this question is perhaps redundant, but memory that exists in the memory map can be used to store data accessed as regular variables rather than transferring to and from internal buffers, and it can be used as an execution region - code can be loaded to and executed directly from such memory.
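For example, once the external bus is configured, a large buffer can simply be linked into that region and used like any other variable. A minimal sketch, assuming GCC/LD and a hypothetical .ext_ram output section that you add to the linker script:

```c
/* Place a large buffer in externally mapped RAM. The ".ext_ram" section and
 * the EXTRAM region are assumptions; they must match your linker script, e.g.
 *
 *   MEMORY { EXTRAM (rwx) : ORIGIN = 0x60000000, LENGTH = 1M }
 *   .ext_ram (NOLOAD) : { *(.ext_ram) } > EXTRAM
 */
#include <stdint.h>

/* A frame buffer too large for internal SRAM, accessed like any other array. */
__attribute__((section(".ext_ram"))) static uint16_t frame_buffer[320 * 240];

void draw_pixel(uint32_t x, uint32_t y, uint16_t color)
{
    frame_buffer[y * 320u + x] = color;
}
```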
Now I'm asking this question because I have a slight feeling I have completely missed the point of what the External RAM region really is.
Self awareness is a skill. That is known as conscious incompetence and is a motivator for learning.
In the data sheet of the linked STM32 device, the memory region "External RAM" is marked as reserved. My conclusion is that the memory regions listed by ARM show the full potential of a 32-bit MCU. When I state that the external RAM region "is quite large!", that does not necessarily reflect the "real" size of that region, if it is even used at all; it depends on what the vendor can physically achieve within the MCU hardware, and I imagine they would limit hardware capabilities to stay competitive on price, power consumption, etc.
No, it is largely about the number of pins available for an address bus (except for QSPI). The external memory is a matter for the board design - it is not something the MCU vendor decides must be present. The constraint is a maximum, not a required amount of physical memory. The STM32 FMC supports NOR flash, PSRAM/SRAM, NAND flash and SDRAM, each mapped into a fixed address window.
So you can have up to 512Mb of SDRAM, for example. The address space available for the static memories (NOR/PSRAM/SRAM) is significantly larger than the typical size of such memories.

Are HOST_CACHED_BIT and HOST_COHERENT_BIT contradicting each other?

There are two types of memory in Vulkan puzzling me:
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bit indicates that the host cache management commands vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges are not needed to flush host writes to the device or make device writes visible to the host, respectively.
VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit indicates that memory allocated with this type is cached on the host. Host memory accesses to uncached memory are slower than to cached memory, however uncached memory is always host coherent.
From what I understand, modifications to memory of the COHERENT type are seen immediately by both the host and the device, while modifications to memory of the CACHED type may not be seen immediately by the host and/or the device, i.e. flushing/invalidating the memory is needed to synchronize the caches.
I have seen some implementations combine both flags, and it is a valid combination according to the "10.2. Device Memory" section of the documentation. Isn't that contradictory (cached and coherent)?
Cached/coherent memory effectively means that the GPU can see the CPU's caches. This often happens on architectures where the GPU and the CPU are sitting on the same chip. The GPU is effectively just another core on the CPU's die, with access to the CPU's caches.
But it can happen on other architectures as well. Some standalone GPUs offer cached/coherent memory. Indeed, most of them don't offer cached memory without coherency. From an architectural standpoint, it represents some way for the GPU to access data through at least part of the CPU cache.
The key thing about cached/coherent memory you should remember is this: if there is an alternative memory type for that memory pool, then the alternative is probably faster for the device to access. Also, if alternatives exist, it is entirely possible that the device may not be able to have images or buffers of certain types/formats stored in such memory types. So unless you really need cached memory access from the CPU, or the device offers no alternative, it's best to avoid it.
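In code, the practical consequence is simply whether you must flush/invalidate around mapped accesses. A minimal sketch (hypothetical helper, error handling omitted) assuming you kept the propertyFlags of the memory type the allocation came from:

```c
/* Write through a mapped pointer; flush only when HOST_COHERENT is absent. */
#include <vulkan/vulkan.h>
#include <string.h>

void upload(VkDevice device, VkDeviceMemory memory,
            VkMemoryPropertyFlags mem_props,
            const void *src, VkDeviceSize size)
{
    void *dst = NULL;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &dst);
    memcpy(dst, src, size);

    /* HOST_CACHED says "CPU accesses go through the CPU cache";
     * HOST_COHERENT says "no explicit flush/invalidate is required".
     * They are orthogonal, so only flush when COHERENT is absent. */
    if ((mem_props & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) == 0) {
        VkMappedMemoryRange range = {
            .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
            .memory = memory,
            .offset = 0,
            .size   = VK_WHOLE_SIZE,
        };
        vkFlushMappedMemoryRanges(device, 1, &range);
    }

    vkUnmapMemory(device, memory);
}
```

Reading back device writes works the same way, but with vkInvalidateMappedMemoryRanges before the CPU reads.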
There are cache schemes (bus snooping) that monitor write accesses to RAM over the memory bus and invalidate the host's cache when the memory is written to.
This allows the best of both worlds, cached coherent accesses, but at the cost of a more complex architecture.

How does GPUDirect enforce isolation on a shared device

I have been reading about GPUDirect here: https://developer.nvidia.com/gpudirect.
In their example there is a network card attached to the PCIe bus together with two GPUs and a CPU.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Is the network device using some kind of SR-IOV mechanism to enforce isolation?
I believe you're talking about rDMA, which was supported with the second release of GPUDirect. It's where the NIC can send/receive data to and from machines outside the host and utilizes peer-to-peer DMA transfers to interact with the GPU's memory.
nVidia exports a variety of functions to kernel space that allow programmers to look up where physical pages reside on the GPU itself and map them manually. nVidia also requires the use of physical addressing within kernel space, which greatly simplifies how other (3rd-party) drivers interact with GPUs -- through the host machine's physical address space.
"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."
-nVidia, Design Considerations for rDMA and GPUDirect
As a result of nVidia requiring a physical addressing scheme, all IOMMUs must be disabled in the system, as these would alter the way each card views the memory space(s) of other cards. Currently, nVidia only supports physical addressing for rDMA+GPUDirect in kernel space. Virtual addressing is possible via their UVA, made available to user space.
How is isolation enforced between all clients trying to access the network device? Are they all accessing the same PCI BAR of the device?
Yes. In kernel space, each GPU's memory is accessed by its physical address.
Is the network device using some kind of SR-IOV mechanism to enforce isolation?
The driver of the network card is what does all of the work in setting up descriptor lists and managing concurrent access to resources -- which would be the GPU's memory in this case. As I mentioned above, nVidia gives driver developers the ability to manage physical memory mappings on the GPU, allowing the 3rd party's NIC driver to control which resources are or are not available to remote machines.
From what I understand about NIC drivers, I believe this to be a very rough outline of what's going on under the hood, relating to rDMA and GPUDirect:
Network card receives an rDMA request (whether it be reading or writing).
The network card's driver receives an interrupt that data has arrived, or some polling mechanism detects that data has arrived.
The driver processes the request; any address translation is performed now, since all memory mappings for the GPUs are made available to kernel space (a sketch of this step follows the list). Additionally, the driver will more than likely have to configure the network card itself to prep for the transfer (e.g. set up specific registers, determine addresses, create descriptor lists, etc.).
The DMA transfer is initiated and the network card reads data directly from the GPU.
This data is then sent over the network to the remote machine.
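As a rough, hedged sketch of step 3, this is the shape of the (legacy) kernel interface nVidia exports for this purpose (nv-p2p.h, shipped with the driver sources); the descriptor programming itself is NIC specific and only indicated by a comment, and the function/callback names here are illustrative:

```c
/* Sketch: ask the NVIDIA driver for the bus addresses backing a GPU virtual
 * address range, then hand them to the NIC's DMA engine. gpu_vaddr and len
 * are expected to be aligned to the 64 KiB GPU page size. */
#include <linux/types.h>
#include <nv-p2p.h>

static void release_cb(void *data)
{
    /* Called by the NVIDIA driver if the GPU mapping goes away;
     * the NIC driver must stop DMA and drop its references here. */
}

int map_gpu_buffer_for_nic(uint64_t gpu_vaddr, uint64_t len,
                           struct nvidia_p2p_page_table **pt_out)
{
    int ret = nvidia_p2p_get_pages(0 /* p2p_token (legacy, unused) */,
                                   0 /* va_space */,
                                   gpu_vaddr, len, pt_out,
                                   release_cb, NULL);
    if (ret)
        return ret;

    for (uint32_t i = 0; i < (*pt_out)->entries; ++i) {
        uint64_t bus_addr = (*pt_out)->pages[i]->physical_address;
        /* Program bus_addr into the NIC's DMA descriptor ring here. */
        (void)bus_addr;
    }
    return 0;
}
```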
All remote machines requesting data via rDMA will use that host machine's physical addressing scheme to manipulate memory. If, for example, two separate computers wish to read the same buffer from a third computer's GPU with rDMA+GPUDirect support, one would expect the incoming read request's offsets to be the same. The same goes for writing; however an additional problem is introduced if multiple DMA engines are set to manipulate data in overlapping regions. This concurrency issue should be handled by the 3rd party NIC driver.
On a very related note, another post of mine has a lot of information regarding nVidia's UVA (Unified Virtual Addressing) scheme and how memory manipulation from within kernel space itself is handled. A few of the sentences in this post were grabbed from it.
Short answer to your question: if by "isolated" you mean how does each card preserve its own unique address-space for rDMA+GPUDirect operations, this is accomplished by relying on the host machine's physical address space which fundamentally separates the physical address space(s) requested by all devices on the PCI bus. By forcing the use of each host machine's physical addressing scheme, nVidia essentially isolates each GPU in that host machine.

Why is virtual memory needed in embedded systems? [closed]

Per my understanding, virtual memory is as follows:
Programs/applications/executables reside in a storage device. Storage device access is much slower than RAM. Hence, programs are copied from storage memory to main memory for execution. Since computers have limited main memory (RAM), when all of the RAM is being used (e.g., if there are many programs open simultaneously or if one very large program is in use), a computer with virtual memory enabled will swap data to the HDD and back to memory as needed, thus, in effect, increasing the total system memory.
As far as I know, most embedded devices do not have disk storage (like smartphones or in-car infotainment systems). Code is directly executed from flash memory. RAM is mainly used as a scratchpad area (local variables, return addresses etc.).
So why do we need virtual memory in embedded systems? (e.g. WinCE and QNX support virtual memory)
Your understanding is completely wrong. You are confusing virtual memory with swapping or page files. There are systems that have virtual memory and no swap or page files and there are systems that swap without virtual memory.
Virtual memory just means that a process has a view of memory that is different from the physical mapping. Among other things, it allows processes to have their own virtual address space.
Storage device access is much slower than RAM. Hence, programs are copied from storage memory to main memory for execution. Since computers have limited main memory (RAM), when all of the RAM is being used (e.g., if there are many programs open simultaneously or if one very large program is in use), a computer with virtual memory enabled will swap data to the HDD and back to memory as needed, thus, in effect, increasing the total system memory.
That's swapping (or paging). It has nothing to do with virtual memory except that most modern operating systems implement swapping using virtual memory. Swapping actually existed before virtual memory.
I think you're probably incorrect about these devices running code directly from flash memory. The read speed of flash is pretty low and RAM is very cheap. My bet is that most of the systems you mention don't run code directly from flash and instead use virtual memory to fault code into RAM as needed.
Embedded systems: the term itself covers a wide range of applications. You could call a small microcontroller with flash program space measured in kilobytes or less and RAM measured in bits or bytes (not enough to be kilobytes) an embedded system. Likewise, a TiVo running a full-blown operating system on a pretty much full-blown computer motherboard (replace TiVo with Xbox as another example) is an embedded system. So you need to be less vague in your question. Virtual memory has little to do with any of that; its applications cross those boundaries.
There are many answers here; David S has the best of course: virtual memory simply means that the memory address on one side of the virtual memory boundary is different from the physical address used on the other side of that boundary. Where, how and why that boundary exists varies.
A popular use for virtual memory, and I might argue the primary use case, is operating systems. One benefit is that, for example, all applications can be compiled for the same address space: all applications might be compiled such that, from the program's perspective, they all start at say address 0x8000, and as far as each program is concerned, when it runs and accesses memory it does so based on that address. A combination of the hardware and the operating system changes the virtual address the program is using into a physical address. If the operating system allows multitasking, then each task might think it is in the same address space, but the physical addresses are different for each of those tasks. I won't elaborate further on why an assumed, fixed address space is a benefit.

Another aspect operating systems use is memory management. Many MMUs will let you segment the memory however you like. If a user wants to allocate 100 megabytes of memory, the program may access that 100 MB in its virtual address space as if it were linear, and in that address space it is linear, but that 100 MB might be broken down into, say, 4 KB chunks that are scattered all about the physical address space. It is not always likely, but certainly technically possible, that no two chunks of that physical memory are next to any other chunk of that 100 MB. Your memory management doesn't necessarily have to try to keep large physical chunks of memory available for applications to allocate. Note that not all MMUs are exactly the same, and 4 KB is just an example.

A third major benefit of a virtual address space to an operating system is protection. If the application is bound to the virtual address space, it is often quite easy to prevent that application from touching the memory of any other application or the operating system. The application in this case would operate/execute at a protection level such that all its accesses are considered virtual and have to go through a translation to physical, and the tables used to define that virtual-to-physical mapping can contain protection flags. If the application addresses a memory location in its virtual space that it has no business accessing, the hardware can trap that and let the operating system take action as to how to handle it (virtualize some hardware, pop up an error and kill the app, pop up a warning and not kill the app but at the same time feed the app bogus data for the transaction, etc.).
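The per-task illusion described above is easy to see from user space. A small demonstration (standard POSIX, nothing platform specific): parent and child print the same virtual address for a variable, yet after the copy-on-write fork each sees its own contents, because that virtual address maps to different physical pages in the two processes.

```c
/* Demonstrate separate virtual address spaces after fork(). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 42;

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }

    if (pid == 0) {                       /* child */
        value = 1000;                     /* triggers copy-on-write */
        printf("child : &value=%p value=%d\n", (void *)&value, value);
        return EXIT_SUCCESS;
    }

    wait(NULL);                           /* let the child finish first */
    /* Same virtual address is printed, but the parent still sees 42. */
    printf("parent: &value=%p value=%d\n", (void *)&value, value);
    return EXIT_SUCCESS;
}
```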
There are lots of ways this can be used in an embedded system. First off, many embedded systems run operating systems, so all of the above applies: ease of compiling the program for a known address space, relative ease of memory management, protection of the other applications and the operating system, and other benefits not mentioned (virtualization being one, being able to enable/disable instruction/data caching on a block-by-block basis being another).
The bottom line, though, is what David S pointed out: virtual memory simply means the virtual address is not necessarily equal to the physical address. It can be, but doesn't have to be; there is some boundary, some hardware, usually table driven, that translates the virtual address into a physical address. There are lots of reasons why you would want to do this, and since some embedded systems are indistinguishable from non-embedded systems, any reason that applies to a non-embedded system can apply to an embedded system.
As much as folks may want you to believe that a system has a flat address space, it is often an illusion. In a microcontroller, for example, you might have multiple flash banks and one or more RAM banks. Each of these banks has a physical, generally zero-based address. Even if there is no MMU or anything like that, there is a place somewhere between the address bus on the processor and the address bus on the flash or RAM memory that decodes the address on the processor and uses it to address into the specific memory bank. Often the lower bits match and the upper bits are responsible for the bank choice (this is often the case with an MMU as well), so in that sense the processor is living in a virtual address space. (This is not limited to microcontrollers; it is generally how processor address busses are treated.)

With microcontrollers, depending on a pin being pulled high or low or some other mechanism, you might have a chip feature that allows one flash bank or another to be used to boot the processor. You might tie an input pin high, and the processor's built-in bootloader allows you to access and debug the system, for example to reprogram the application flash; or perhaps tie that line low and boot the application flash instead of the vendor's debugger/boot flash. Some chips get even more complicated, letting you boot one flash and then have the program write a register somewhere that instantly changes the memory architecture, moving things around, for example allowing RAM to be used for the interrupt vector table so your application can be changed after boot, rather than using a vector table in flash that is not as easy to change at will.
Now, when you talk about virtual memory as far as swapping to and from a disk, that is a trick often employed by operating systems to give the illusion of having more RAM. I mentioned that above under the category of virtualization: virtual memory in the sense that it isn't really there; I have X bytes but will let the software think there are Y bytes (where Y is larger than X) available. The operating system, through the virtual tables used by the hardware, manages which memory chunks are tied to physical RAM and are allowed to complete as-is by the hardware, and which are marked as not available in some way, causing an exception to the operating system. Upon inspection, the operating system determines that this is a valid address for this application, but the data behind this address has been swapped to disk. The operating system then, through some algorithm, finds another chunk of RAM belonging to whomever (part of the algorithm), copies that chunk of RAM to disk, marks the table entry for that virtual-to-physical mapping as not valid, then copies the desired chunk from disk to RAM, marks that chunk as valid and lets the hardware complete the memory cycle.
This is not any different from how VMware or other virtual machines work. You can execute instructions natively on the hardware using virtual memory until such time as you cause an exception. The virtual machine might think you have an xyz network interface and might have a driver that is accessing a register in that xyz network interface, but the reality is you have no xyz hardware and/or you don't want the virtual machine's applications to access that hardware, so you virtualize it: you trap that register access and, using software that simulates the hardware, you fake that access and let the program on the virtual machine continue. This is obviously not the only way to do virtual machines, but it is one way, if the hardware supports it, to let a virtual machine run very fast as a percentage of the time it is actually running instructions on the hardware. The slowest way to virtualize, of course, is to virtualize everything including the processor; every instruction in that case would be simulated, which is quite slow but has its own features (virtualizing an ARM system on an x86, or x86 on an ARM, xyz on an abc, fill in the blanks).

If that is the type of virtual memory you are talking about in an embedded system: well, if the embedded system is for the most part indistinguishable from a non-embedded system (an Xbox or TiVo, for example), then for the same reasons you could allow such a thing. If you were on a microcontroller, the use cases there would generally mean that if you needed more memory you would buy a bigger microcontroller, or add more memory to the system, or change the needs of the application so that it doesn't need as much memory. There may be exceptions, but it mostly depends on your application and requirements. A general-purpose, or general-purpose-like, system which allows applications or their data to be larger than the available RAM will require some sort of solution; the microcontroller in your keyless-entry key fob or in your TV remote control or clock radio or whatever normally would not have a need to allow "applications" to require more resources than are physically there.
The more important benefit of using virtual memory is that every process gets its own address space which is isolated from every other process's. That way virtual memory helps keep faults contained and improves security and stability. I should note that it is still possible for two processes to share a bit of memory, to facilitate communication (shared mem IPC).
You can also do other tricks, like conserving memory by mapping shared parts (libc comes to mind for embedded use) into more than one process's address space while only keeping them once in physical memory. This also gives a speed boost, and you can take it further the way Linux does to cheapen fork/clone: only the in-kernel descriptors are copied, and the memory image is left alone until the first write access, using a similar copy-on-write idea.
As a last benefit, in modern systems, it's common to do file I/O via mapping the file into the process space (cf. mmap for example).
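A minimal, self-contained illustration of that last point (POSIX mmap; the file path is arbitrary):

```c
/* File I/O via memory mapping: the file's pages appear directly in the
 * process's virtual address space and are read like an ordinary array. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file works */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(data, 1, st.st_size, stdout);        /* access file bytes as memory */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```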
It's interesting to note that one can get some of the benefits of "virtual memory" without needing a full-fledged MMU. The hardware requirements can sometimes be amazingly light. The PIC 16C505 has a 5-bit address space and 40 bytes of RAM; addresses 0x10 to 0x1F can map to either of two groups of 16 bytes of RAM. When writing an application which needed to manage two different data streams, I arranged so that all the variables associated with one data stream would be in the first group of 16 "switchable" memory locations, and those associated with the other would be at the corresponding addresses in the second group. I could then use the same code to manage both data streams. Simply set the banking bit one way, call the routine, set it the other way, and call the routine again.
One of the reasons virtual memory exists is so that your device can multitask. It can also act as an extension of your RAM, taking load off your physical RAM by swapping data back and forth to backing storage.