Should the amount of resource allocations be "per swap chain image"? - vulkan

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and a bit confused about the size of uniformBuffers and uniformBuffersMemory. In the tutorial it is said that:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand "per swap chain image" approach is more optimal. Please, prove me wrong, if I am. But why do we need it to be the size of swapChainImages.size()? Isn't MAX_FRAMES_IN_FLIGHT just enough, because we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation then our single uniform buffer will always be available and not used by cpu/gpu so we don't need an array of them.

do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device, because you're about the destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is "Do I need n buffers or m is enough?".
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.

Related

How do I know when Vulkan isn't using memory anymore so I can overwrite it / reuse it?

When working with Vulkan it's common that when creating a buffer, such as a uniform buffer, that you create multiple (buffers 'versions'), because if you have double buffering for example you don't know if the graphics API is still drawing the last frame (using the memory you bound and instructed it to use the last loop). I've seen this happen with uniform buffers but not vertex or index buffers or image/texture buffers. Is this because uniform buffers are updated regularly and vertex buffers or images are not?
If you wanted to update an image or a vertex buffer how would you go about it given that you don't know whether the graphics API is still using it? Do you simply reallocate new memory for that image/buffer and start anew? Even if you just want to update a section of it? And if this is the case that you allocate a new buffer, when would you know to release the old buffer? Would say, for example 5 frames into the future be OK? Or 2 seconds? After all, it could still be being used. How is this done?
given that you don't know whether the graphics API is still using it?
But you do know.
Vulkan doesn't arbitrarily use resources. It uses them exactly and only how your code tells it to use the resource. You created and submitted the commands that use those resources, so if you need to know when a resource is in use, it is you who must keep track of it and manage this.
You have to use API synchronization functions to follow the GPU's execution of commands.
If an action command uses some set of resources, then those resources are in use while that command is being executed. You have tools like events which can be used to stop subsequent commands from executing until some prior commands have finished. And events can tell when a particular command has finished, so that you'll know when those resources are no longer in use.
Semaphores have similar powers, but at the level of a batch of work. If a semaphore is signaled, then all of the commands in the batch that signaled it have completed and are no longer using the resources they use. Fences can be used for extremely coarse synchronization, at the level of a submit command.
You multi-buffer uniform data because the nature of uniform data is such that it typically needs to change every frame. If you have vertex buffers or images to change every frame, then you'll need to do the same thing with those.
For infrequent changes, you may want to have extra memory available so that you can just create new images or buffers, then delete the old ones when the memory is no longer in use. Or you may have to stall the CPU until the GPU has finished using those resources.

Could a tight loop destroy cells of a microcontroller's flash?

It is well-known that Flash memory has limited write endurance, less so that reads could also have an upper limit such as mentioned in this Flash endurance test's Conclusion (3rd point).
On a microcontroller the code is typically stored in Flash, and is executed by fetching code words directly from the Flash cells.(at least this is most commonly so on 8 bit micros, some 32 bit micros might have some small buffer).
Depending on the particular code, it might happen that a location is accessed extremely frequently, such as if on the main execution path there is some busy loop, such as a wait for an interrupt (for example from a timer, synchronizing execution to a fixed interval).This could generate 100K or even more (read) accesses per second on average to a single Flash cell (depending on clock and the particular code).
Could such code actually destroy the cells of the Flash underneath it?(Is there any necessity to be concerned about this particular problem when designing code for microcontrollers? Such as part of a system which is meant to operate for years? Of course the Flash could be periodically verified by CRC, but that doesn't prevent the system failing if it happens, only that the failure will more likely happen in a controlled manner)
Only erasing/writing will affect the memory cells, not reading. You don't need to consider the number of reads when designing the program.
Programmed flash memory does age though, meaning that the value of the cells might not be reliable after a certain amount of years. This is known as data retention and depends mainly on temperature. MCU manufacturers typically specify a worse case in years, assuming that the part is kept in maximum specified ambient temperature.
This is something to consider for products that are expected to live long (> 10 years), particularly in environments where high temperatures can be expected. CRC and/or ECC is a good counter-measure against data retention, although if you do find that a cell has been corrupted, it typically just means that the application should shut down to a non-recoverable safe state.
I know of two techniques to approach this issue:
1) One technique is to set aside a const 32-bit integer variable in the system code. Then calculate a CRC32 checksum of the compiled binary image, and inserting the checksum into the reserved variable using an ELF-editor.
A module in the system software will then calculate a CRC32 over the flash area occupied by the application and compare to the "stored" value.
If you are using GCC, the linker can define a symbol to tell you where the segment stops. This method can detect errors but cannot correct them.
2) Another technique is to use a microcontroller that supports Flash ECC. TI sells Cortex-R4 MCUs which support Flash ECC (Hercules series).
I doubt that this is a practical concern. The article you cited vaguely asserts that this can happen but with no supporting evidence or quantification of the effect. There is a vague, unsupported and unquantified reference in the introduction:
[...] flash degrades over time from erasing/writing (or even just reading, although that decay is slower) [...]
Then again in the conclusion:
We did not check flash decay for reads, but reading also causes long term decay. It would be interesting to see if we can read a spot enough times to cause failure.
The author may be referring to read-disturbance in NAND flash, but microcontrollers do not use NAND flash for code storage/execution since it is not random-access. Read disturb is not a permanent effect, erasing and re-writing the affected block restores endurance. NAND controllers maintain read counts for blocks and automatically copy and erase blocks as necessary. They also employ ECC to detect and correct errors, and identify "write-worn" areas.
There is the potential for long-term "bit-rot" but I doubt that it is caused specifically by reading rather just ageing.
Most RTOS systems spend the majority of their processing time in a do-nothing idle loop, and run happily 24/7 365 days a year. Some processors support a wait-for-interrupt instruction that halts the CPU in the idle loop, but by no means all, and it is not uncommon not to use such an instruction. Processors with flash accelerators or caches may also prevent continuous rapid fetch from a single location, but again that is by no means all microcontrollers.

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT : I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Isn't it all, or is there more to it?
Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt earlier. Around 2005 (don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end xeon processor than on an earlier Pentium II or III (again, my memory), they implemented it with a new cpu instruction sysenter (which had actually existed since Pentium Pro I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it) IIRC.
So, nowadays, a kernel trap is basically just a couple of cpu instructions, hence rather few cycles, compared to tenths or hundreds of thousands when using an interrupt (which is really slow on modern CPU's).
A context switch between processes is heavy. It means storing all processor state (registers, etc) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the cpu, and reading back the process state for the new process. It will (likely) have nothing still in the cpu cache from last time it ran, so each memory read will be a cache miss, and needed to be read from RAM. This is rather slow. When I was at the university, I "invented" (well, I did come up with the idea, knowing that there is plenty of dye in a CPU, but not enough cool if it's constantly powered) a cache that was infinite size although unpowered when unused (only used on context switches i.e.) in the CPU, and implemented this in Simics. Implemented support for this magic cache I called CARD (Context-switch Active, Run-time Drowsy) in Linux, and benchmarked rather heavily. I found that it could speed-up a Linux machine with lots of heavy processes sharing the same core with about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking, why doesn't intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks of the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional roll of OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy insruction any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, it's memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old fashioned synchronous copy.
3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kind of operations are better handled by software than core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth which can be a bottleneck in many systems. As hinted by srking's answer reading the data into the CPU's cache and then copying it around there can be a lot more efficient then memory to memory DMA. Even though the DMA may appear to work in the background there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero copy architecture where the buffer is shared between the application and the driver/hardware. That is incoming network data is read directly into preallocated buffers and doesn't need to be copied and outgiong data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

Testing Real Time Operating System for Hardness

I have an embedded device (Technologic TS-7800) that advertises real-time capabilities, but says nothing about 'hard' or 'soft'. While I wait for a response from the manufacturer, I figured it wouldn't hurt to test the system myself.
What are some established procedures to determine the 'hardness' of a particular device with respect to real time/deterministic behavior (latency and jitter)?
Being at college, I have access to some pretty neat hardware (good oscilloscopes and signal generators), so I don't think I'll run into any issues in terms of testing equipment, just expertise.
With that kind of equipment, it ought to be fairly easy to sync the o-scope to a steady clock, produce a spike each time the real-time system produces an output, an see how much that spike varies from center. The less the variation, the greater the hardness.
To clarify Bob's answer maybe:
Use the signal generator to generate a pulse at some varying frequency.
Random distribution across some range would be best.
use the signal generator (trigger signal) to start the scope.
the RTOS has to respond, do it thing and send an output pulse.
feed the RTOS output into input 2 of the scope.
get the scope to persist/collect mode.
get the scope to start on A , stop on B. if you can.
in an ideal workd, get it to measure the distribution for you. A LeCroy would.
Start with a much slower trace than you would expect. You need to be able to see slow outliers.
You'll be able to see the distribution.
Assuming a normal distribution the SD of the response time variation is the SOFTNESS.
(This won't really happen in practice, but if you don't get outliers it is reasonably useful. )
If there are outliers of large latency, then the RTOS is NOT very hard. Does not meet deadlines well. Unsuitable then it is for hard real time work.
Many RTOS-like things have a good left edge to the curve, sloping down like a 1/f curve.
Thats indicitive of combined jitters. The thing to look out for is spikes of slow response on the right end of the scope. Keep repeating the experiment with faster traces if there are no outliers to get a good image of the slope. Should be good for some speculative conclusion in your paper.
If for your application, say a delta of 1uS is okay, and you measure 0.5us, it's all cool.
Anyway, you can publish the results ( and probably in the publish sense, but certainly on the web.)
Link from this Question to the paper when you've written it.
Hard real-time has more to do with how your software works than the hardware on its own. When asking if something is hard real-time it must be applied to the complete system (Hardware, RTOS and application). This means hard or soft real-time is system design issues.
Under loading exceeding the specification even a hard real-time system will fail (hopefully with proper failure indication) while a soft real-time system with low loading would give hard real-time results. How much processing must happen in time and how much pre/post processing can be performed is the real key to hard/soft real-time.
In some real-time applications some data loss is not a failure it should just be below a certain level, again a system criteria.
You can generate inputs to the board and have a small application count them and check at what level data is going to be lost. But that gives you a rating specific to that system running that application. As soon as you start doing more processing your computational load increases and you now have a different hard real-time limit.
This board will running a bare bones scheduler will give great predictable hard real-time performance for most tasks.
Running a full RTOS with heavy computational load you probably only get soft real-time.
Edit after comment
The most efficient and easiest way I have used to measure my software's performance (assuming you use a schedular) is by using a free running hardware timer on the board and to time stamp my start and end of my cycle. Or if you run a full RTOS time stamp you acquisition and transition. Save your Max time and run a average on the values over a second. If your average is around 50% and you max is within 20% of your average you are OK. If not it is time to refactor your application. As your application grows the cycle time will grow. You can monitor the effect of all your software changes on your cycle time.
Another way is to use a hardware timer generate a cyclical interrupt. If you are in time reset the interrupt. If you miss the deadline you have interrupt handler signal a failure. This however will only give you a warning once your application is taking to long but it rely on hardware and interrupts so you can't miss.
These solutions also eliminate the requirement to hook up a scope to monitor the output since the time information can be displayed in any kind of terminal by a background task. If it is easy to monitor you will monitor it regularly avoiding solving the timing problems at the end but as soon as they are introduced.
Hope this helps
I have the same board here at work. It's a slightly-modified 2.6 Kernel, I believe... not the real-time version.
I don't know that I've read anything in the docs yet that indicates that it is meant for strict RTOS work.
I think that this is not a hard real-time device, since it runs no RTOS.
I understand being geek, but using oscilloscope to test a computer with ethernet/usb/other digital ports and HUGE internal state (RAM) is both ineffective and unreliable.
Instead of watching wave forms, you can connect any PC to the output port and run proper statistical analysis.
The established procedure (if the input signal is analog by nature) is to test system against several characteristic inputs - traditionally spikes, step functions and sine waves of different frequencies - and measure phase shift and variance for each input type. Worst case is then used in specifications of the system.
Again, if you are using standard ports, you can easily generate those on PC. If the input is truly analog, a separate DAC or simply a good sound card would be needed.
Now, that won't say anything about OS being real-time - it could be running vanilla Linux or even Win CE and still produce good and stable results in those tests if hardware is fast enough.
So, you need to simulate heavy and varying loads on processor, memory and all ports, let it heat and eat memory for a few hours, and then repeat tests. If latency stays constant, it's hard real-time. If it doesn't, under any load and input signal type, increase above acceptable limit, it's soft. Otherwise, it's advertisement.
P.S.: Implication is that even for critical systems you don't actually need hard real-time if you have hardware.