Where are buffers located?

I hear a lot about flushing buffers, sending to a buffer, etc., but I don't have a mental picture of where buffers reside and what they look like.
Are buffers part of the OS kernel or part of each process? If it's the former, can the same buffers be used by multiple processes?

A buffer is a generic term for a collection of bytes, typically used in the context of sending, receiving, or storing information where the internal structure of the information isn't important.
In the case of "flushing" buffers, this is typically used in the context of sending data to a file or a network; the buffer in this case is used to coalesce multiple small writes to the file or network into one larger, more-efficient-to-transmit write. After the final write has been performed (or after some "commit" point), the buffer must be "flushed" to ensure that any data still waiting to coalesce with a future write is committed immediately to the underlying file or sent over the network, rather than left waiting for a future write that might never come.
In both network and file IO, buffers are usually used in multiple places. File IO may well be buffered in the application, in a library (an implementation of fwrite may buffer the output, for instance), in the kernel, and even on the device itself; network writes may be buffered by the device while waiting for bandwidth on the wire, and hard-disk drives will buffer output from the OS to ensure that data isn't lost as the physical platters spin to the correct position for the write.
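As a hedged illustration of the application/library layer of that stack, here is a minimal C sketch using stdio: each FILE object carries its own user-space buffer, so small writes are coalesced and only pushed down to the kernel when the buffer fills, when fflush() is called, or when the file is closed.

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("log.txt", "w");
        if (!f)
            return 1;

        for (int i = 0; i < 1000; ++i)
            fprintf(f, "record %d\n", i);  /* lands in the stdio buffer first */

        fflush(f);   /* force anything still buffered down to the kernel now */
        fclose(f);   /* fclose() also flushes, then releases the buffer */
        return 0;
    }

Even after fflush() returns, the data may still sit in the kernel's page cache or in the drive's own cache, which is exactly the layering described above.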

Related

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert packed 3-channel image data to a 4-channel texture on the GPU. Instead of copying the 3-channel image to the GPU before decoding it, is it a good idea to write it to a host-visible buffer and then read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool are both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to allow you to make an SSBO in them).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
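Whether a suitable memory type exists is something you can only discover by querying the implementation. Below is a minimal sketch in plain Vulkan C of that check; the handles (physicalDevice, device, and a buffer created with storage-buffer usage) are illustrative and assumed to exist already, and error handling is omitted.

    #include <vulkan/vulkan.h>

    int32_t find_host_visible_device_local_type(VkPhysicalDevice physicalDevice,
                                                VkDevice device,
                                                VkBuffer buffer)
    {
        VkMemoryRequirements req;
        vkGetBufferMemoryRequirements(device, buffer, &req);

        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

        const VkMemoryPropertyFlags wanted =
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;

        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            int allowed = (req.memoryTypeBits & (1u << i)) != 0;   /* usable for this buffer */
            int matches = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
            if (allowed && matches)
                return (int32_t)i;
        }
        return -1;   /* no such type: fall back to a staging buffer and a copy */
    }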
Should you use such memory types for this kind of image fiddling? You'll have to profile to know. You'll also have to take into account whether your application has ways to hide any DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.

Synchronizing a vertex buffer in Vulkan?

I have a vertex buffer that is stored in device memory which is host-visible and host-coherent.
To write to the vertex buffer on the host side, I map the device memory, memcpy into it, and unmap it.
To read from it, I bind the vertex buffer in a command buffer while recording a render pass. These command buffers are submitted in a loop that acquires, submits, and presents, drawing each frame.
Currently I write once to the vertex buffer at program start up.
The vertex buffer then remains the same during the loop.
I'd like to modify the vertex buffer between each frame from the host side.
What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.
For each frame:
I wait on the fence.
I reset the fence.
The acquire signals semaphore #1.
The queue submit waits on semaphore #1 and signals semaphore #2 and signals the fence.
The present waits on semaphore #2
Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?
If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid having to stall for GPU operations. So never wait on a fence for a batch that was just issued. The same thing goes for memory: you should never desire to write to memory which is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
Each batch you submit reads from certain memory. As such, the fence for that batch will be signaled when the GPU is finished reading from that memory. So if you want to write to that memory from the CPU, you cannot begin until the fence representing the GPU's read of that memory has been set.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame; it's the fence for the batch you submitted the frame before that. Since it's been some time since the GPU received that operation, it is far less likely that the CPU will have to actually wait. That is, the fence should hopefully already be set.
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check to see if it is set, and if it isn't, go do something else useful with your time. But if you have nothing else useful you could be doing, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
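As a rough sketch of what that per-frame update might look like (the names frameFence, vertexMemory, regionSize, and newVertexData are placeholders, not from the question), assuming one fence per region, fences created already-signaled, and HOST_COHERENT memory:

    #include <stdint.h>
    #include <string.h>
    #include <vulkan/vulkan.h>

    void write_vertex_region(VkDevice device, VkDeviceMemory vertexMemory,
                             VkFence frameFence[2], uint32_t frameIndex,
                             VkDeviceSize regionSize, const void *newVertexData)
    {
        uint32_t writeRegion = frameIndex % 2;   /* region the in-flight batch is NOT reading */

        /* Non-blocking check first; only block if there is nothing better to do. */
        if (vkGetFenceStatus(device, frameFence[writeRegion]) == VK_NOT_READY) {
            /* ...do other useful CPU work here, then... */
            vkWaitForFences(device, 1, &frameFence[writeRegion], VK_TRUE, UINT64_MAX);
        }
        vkResetFences(device, 1, &frameFence[writeRegion]);

        /* The GPU has finished reading this region: overwrite it. */
        void *mapped = NULL;
        vkMapMemory(device, vertexMemory,
                    (VkDeviceSize)writeRegion * regionSize, regionSize, 0, &mapped);
        memcpy(mapped, newVertexData, (size_t)regionSize);   /* HOST_COHERENT: no flush needed */
        vkUnmapMemory(device, vertexMemory);
    }

When recording that frame's commands, bind the vertex buffer at an offset of writeRegion * regionSize (vkCmdBindVertexBuffers takes a pOffsets array), so the GPU reads the region that was just written.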
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
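A hedged sketch of that event path (cmd, event, and device are illustrative names; the destination stage and access here assume the data is read as vertex attributes):

    #include <vulkan/vulkan.h>

    void record_wait_on_host_write(VkCommandBuffer cmd, VkEvent event)
    {
        VkMemoryBarrier barrier = {
            .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_HOST_WRITE_BIT,             /* the host wrote the data */
            .dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,  /* vertex fetch reads it */
        };

        vkCmdWaitEvents(cmd, 1, &event,
                        VK_PIPELINE_STAGE_HOST_BIT,           /* source: the host sets the event */
                        VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,   /* destination: vertex input */
                        1, &barrier,                          /* one global memory barrier */
                        0, NULL, 0, NULL);                    /* no buffer/image barriers */
    }

    /* Later, on the CPU, once the writes to the mapped memory are complete:
     *     vkSetEvent(device, event);
     */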
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.

A rarely mentioned Vulkan function "vkCmdUpdateBuffer()", what is it used for?

This seems like a simple Vulkan API question, but I really cannot find an answer after searching the Internet.
I noticed there is a Vulkan function:
void vkCmdUpdateBuffer(
    VkCommandBuffer    commandBuffer,
    VkBuffer           dstBuffer,
    VkDeviceSize       dstOffset,
    VkDeviceSize       dataSize,
    const void*        pData);
At first glance, I thought it could be used to record into the command buffer, since it has the vkCmd prefix in its name, but the documentation says that
vkCmdUpdateBuffer is only allowed outside of a render pass. This command is treated as “transfer” operation, for the purposes of synchronization barriers.
So I started thinking that it is a convenience function that wraps the buffer data transfer, something like using memcpy() to copy the data from the host to the device.
Then my question is: why is there not a single Vulkan sample/tutorial (I have searched all of them) that uses vkCmdUpdateBuffer() instead of manually copying data with memcpy()? Did I understand it wrong?
All vkCmd* functions generate commands into a command buffer. This one is no exception. It is a transfer command, and like most transfer commands, you don't get to do them within a render pass. But there are plenty of command buffer generating commands that don't work in render passes.
Normally, Vulkan memory transfer operations only happen between device memory. The typical mechanism for the host to put something in device memory is to write to a mapped pointer. But by definition, that requires the destination memory to be mappable. So if you want to write something to non-mappable memory, you have to copy it to mappable memory first, then do a transfer from the mappable memory to the non-mappable memory via the vkCmdCopy* functions.
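A minimal sketch of that staging pattern (all handles are assumed to have been created elsewhere, with the staging buffer bound to mappable memory and created with TRANSFER_SRC usage, and the destination with TRANSFER_DST usage):

    #include <string.h>
    #include <vulkan/vulkan.h>

    void stage_and_copy(VkDevice device, VkDeviceMemory stagingMemory,
                        VkCommandBuffer cmd, VkBuffer stagingBuffer,
                        VkBuffer deviceLocalBuffer,
                        const void *data, VkDeviceSize size)
    {
        /* Step 1: the host writes into the mappable staging memory. */
        void *mapped = NULL;
        vkMapMemory(device, stagingMemory, 0, size, 0, &mapped);
        memcpy(mapped, data, (size_t)size);
        vkUnmapMemory(device, stagingMemory);

        /* Step 2: record a transfer from staging to the non-mappable buffer. */
        VkBufferCopy region = { .srcOffset = 0, .dstOffset = 0, .size = size };
        vkCmdCopyBuffer(cmd, stagingBuffer, deviceLocalBuffer, 1, &region);
    }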
And that's fine if you're doing a bunch of transfers all at once. You can copy a bunch of stuff into mapped memory, then submit a batch containing all of the copy operations to copy the data into the appropriate locations.
But sometimes, you're just updating a small piece of device memory. If it's not mappable, then that's a lot of work to do just to get a few kilobytes of data to the GPU. In that case, vkCmdUpdateBuffer may be the better choice, since it can "directly" copy from CPU memory to any device memory.
I say "directly" because that's obviously not what it's doing. It's really doing the same thing you would have done, except it's doing it within the command buffer. You would have copied your CPU data into GPU mappable memory, then created a command that copies from that mappable memory into non-mappable memory.
vkCmdUpdateBuffer does the exact same thing. It copies the data from the pointer/size you give it into mappable memory (which is provided by the command buffer itself. This is why it has an upper limit of 64KB). This copy happens immediately, just as it would have if you did a memcpy, so when this function returns, you can do whatever you want with the pointer you gave it. Then it creates a command in the command buffer that copies from the mappable memory in the command buffer to the destination memory location.
The documentation for this function explicitly gives warnings about using it for larger transfers. That is, it tells you not to do that. This is for quick, small, one-shot updates of unmappable memory. Nothing more.
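For illustration, a sketch of such a small, one-shot update recorded outside a render pass (cmd and constantBuffer are assumed names, not from the question; the data size must be a multiple of 4 and at most 65536 bytes):

    #include <vulkan/vulkan.h>

    typedef struct PerFrameConstants {
        float tint[4];
        float time;
        float pad[3];   /* keep the struct size a multiple of 16 bytes */
    } PerFrameConstants;

    void record_small_update(VkCommandBuffer cmd, VkBuffer constantBuffer, float time)
    {
        PerFrameConstants constants = {
            .tint = { 1.0f, 1.0f, 1.0f, 1.0f },
            .time = time,
        };

        /* Counts as a transfer write: a barrier from TRANSFER to the consuming
         * stage is still required before a shader reads the buffer. */
        vkCmdUpdateBuffer(cmd, constantBuffer, 0, sizeof(constants), &constants);
    }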
That's one reason why tutorials don't talk about it: it's a highly special-case function that many novice users will try to use because it's easier than the explicit code. But in most cases, they should not be using it.

LwIP buffer management for sending UDP messages

My embedded application uses the LwIP library to send UDP messages of varying lengths, depending on the contents.
Right now I'm calling pbuf_alloc / pbuf_free every time a message needs sending using PBUF_RAM. It appears to work fine, but I'm worried it will lead to memory fragmentation nastiness after it's been running for a long time. Should I be worried?
Also, is it true that PBUF_POOL is for receiving messages only, not for sending?
PBUF_POOL is for RX only; the idea is to separate the memory pool for received packets from that for buffered TX segments. See the PBUF_POOL define for its documentation.
In terms of PBUF_RAM and fragmentation in the memory heap, there are a number of configurations which determine how the heap is implemented, which may affect fragmentation. So you'll want to understand your configuration.
The heap could be implemented by the standard C library's malloc, by a series of fixed-size pools, or by a single static array. If using the latter, plug_holes() is called from mem_free(), which should handle fragmentation. See mem.c.
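For reference, a minimal sketch of the send path being described, assuming the pcb and destination address are set up elsewhere (names are illustrative):

    #include "lwip/pbuf.h"
    #include "lwip/udp.h"

    err_t send_message(struct udp_pcb *pcb, const ip_addr_t *dest_ip,
                       u16_t dest_port, const void *data, u16_t len)
    {
        /* PBUF_RAM allocates the payload from the lwIP heap (mem_malloc). */
        struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, len, PBUF_RAM);
        if (p == NULL)
            return ERR_MEM;                    /* heap exhausted or too fragmented */

        pbuf_take(p, data, len);               /* copy the payload into the pbuf */
        err_t err = udp_sendto(pcb, p, dest_ip, dest_port);
        pbuf_free(p);                          /* release our reference */
        return err;
    }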

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way into the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
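For reference, that synchronous form is essentially a "rep movs" loop driven by the core itself; here is a sketch (GCC-style inline assembly for x86-64, assuming the direction flag is clear as the ABI guarantees) of what large-block memcpy implementations often boil down to:

    #include <stddef.h>

    static void rep_movsb_copy(void *dst, const void *src, size_t n)
    {
        /* rep movsb copies RCX bytes from [RSI] to [RDI] in one instruction,
         * but every byte still flows through the core's load/store path. */
        __asm__ volatile ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
    }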
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of async memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The async copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power-efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from the kernel to a user buffer occurs just before handing the user buffer to the application. So the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The async memcpy instruction must return some sort of token that identifies the copy for later use, to ask whether the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- that kind of operation is better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction, and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted at in srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?