How does the CLOCK control the order of events? - hardware

How does the clock keep various events (operations) occurring in the desired sequence? What is the significance of the clock cycle time (I've heard that many operations can be issued in a single clock cycle)?
Or, simply: how does the CPU control operation ordering?

CPUs have various processing units (float, vector, integer), and pipelines of different lengths for each unit.
The clock determines the speed at which an instruction moves through the operations in a pipeline, one operation per tick. Once it reaches the end, the result is sent back to cache/memory.
Multiple pipelines can be active at the same time.
That's all I can tell you.
Ars Technica used to have great articles about this, such as this one:
Understanding the Microprocessor

The clock does not control the sequence of instructions. The clock controls the number of times per second that the CPU "ticks." Each tick is referred to as a cycle, and consequently each cycle takes some time to complete.
The sequence of instructions is dictated by the running program. Modern CPUs also include optimisations that influence the exact sequence.
These optimisations also make the clock speed (= number of cycles per second) less significant. For example, a dual-core CPU is able to execute two instructions in the same cycle (one on each core).
Yes, instructions usually complete within a couple of cycles, and compilers optimise programs so that costly instructions are used less.
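As a rough worked example of what the cycle time means (the 3 GHz frequency and the instructions-per-cycle figure below are made-up assumptions, just to show the arithmetic):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical numbers: a 3 GHz clock and an average of 2 instructions
       completed per cycle (a superscalar core can retire more than one). */
    double freq_hz = 3.0e9;
    double ipc     = 2.0;

    double cycle_time_ns = 1.0e9 / freq_hz;   /* ~0.33 ns per tick    */
    double inst_per_sec  = freq_hz * ipc;     /* ~6e9 instructions/s  */

    printf("cycle time     : %.3f ns\n", cycle_time_ns);
    printf("instructions/s : %.2e\n", inst_per_sec);
    return 0;
}
```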

Related

Should the amount of resource allocations be "per swap chain image"?

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and am a bit confused about the sizes of uniformBuffers and uniformBuffersMemory. The tutorial says:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand, the "per swap chain image" approach is more optimal. Please prove me wrong if I am. But why do we need it to be of size swapChainImages.size()? Isn't MAX_FRAMES_IN_FLIGHT just enough, because we have fences? As a simple example, if we have just a single frame in flight and call vkDeviceWaitIdle after each presentation, then our single uniform buffer will always be available and not in use by the CPU/GPU, so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device, because you're about to destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is: "Do I need n buffers, or is m enough?"
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.
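To make that concrete, here is a minimal C sketch of the fence-per-frame-in-flight pattern, assuming a tutorial-style setup; the helper names (updateUniformBuffer, recordAndSubmit) and the extern declarations are illustrative placeholders, not part of the Vulkan API or the tutorial's exact code:

```c
#include <vulkan/vulkan.h>
#include <stdint.h>

#define MAX_FRAMES_IN_FLIGHT 2

/* Assumed to be created elsewhere in the application (names are illustrative). */
extern VkDevice       device;
extern VkBuffer       uniformBuffers[MAX_FRAMES_IN_FLIGHT];
extern VkDeviceMemory uniformBuffersMemory[MAX_FRAMES_IN_FLIGHT];
extern VkFence        inFlightFences[MAX_FRAMES_IN_FLIGHT];
extern void           updateUniformBuffer(uint32_t frameIndex);
extern void           recordAndSubmit(uint32_t frameIndex); /* vkQueueSubmit signals inFlightFences[frameIndex] */

static uint32_t currentFrame = 0;

void drawFrame(void)
{
    /* Block on the fence tied to the work previously submitted for this frame slot,
       not on vkAcquireNextImageKHR. Once it signals, the GPU has finished reading
       uniformBuffers[currentFrame], so the CPU may safely overwrite it. */
    vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFences[currentFrame]);

    updateUniformBuffer(currentFrame);   /* write this frame's UBO data */

    /* Acquire a swap-chain image, record and submit the command buffer;
       the submit signals inFlightFences[currentFrame] when the GPU is done. */
    recordAndSubmit(currentFrame);

    currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
}
```

The point is that the number of these buffers tracks the number of frames you allow in flight, not the number of swap-chain images.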

How is pipelining implemented? Can we read the firmware of a modern microprocessor?

My question has two related parts.
First, most modern microprocessors implement pipelining and other means of executing code faster. How do they implement it? I mean, is it in the firmware or something else?
Second, if it is the firmware, is it possible for me to read the firmware and look at the code?
Apologies if this is a stupid question, as I have little knowledge of microprocessors.
Pipelining in processor design is a hardware concept: the idea that a stream of instructions can execute faster if it exploits a bit of parallelism in the flow of processing an instruction and breaks up critical paths in the logic. In hardware, a given design (technically, its implementation) can only "run" so fast; ie, it takes some time for signals to propagate through all the logic. The longest time it could take in the worst case is the critical path, and it defines the maximum time per cycle (and hence the maximum frequency) at which the design can run (this is where the maximum clock speed comes from).
Now, processing an instruction in the simplest processor can be broken into three big parts: fetching the instruction from memory (ie, fetch), decoding the instruction into its parts (decode), and actually executing the instruction (execute). Every instruction is fetched, decoded, and executed; then the next instruction, and the next.
The hardware for each of these stages has a critical path, ie a maximum time it can take in the worst case (Tmax_fetch for the fetch stage, Tmax_decode for decode, Tmax_exec for execute). So, for a processor without pipelining (ie, single cycle), the critical path for the full processor would be the sum of these stage critical paths (this isn't necessarily true in real designs, but we will use it as a simplified example): Tmax_inst = Tmax_fetch + Tmax_decode + Tmax_exec. So, to run through four instructions, it would take 4 * Tmax_inst = 4 * Tmax_fetch + 4 * Tmax_decode + 4 * Tmax_exec.
Pipelining allows us to break up these critical paths using hardware registers (not unlike the programmer's registers; r2 in ARM is an example), but these registers are invisible to the firmware. Now, instead of Tmax_inst being the sum of the stages, it's just three times the largest of the stages, Tmax_inst = 3 * Tmax_stage = 3 * max(Tmax_fetch, Tmax_decode, Tmax_exec), since the processor has to "wait" for the slowest stage to finish in the worst case. The processor is now slower for a single instruction, but due to the pipeline, we can run each of these stages independently as long as there isn't a dependency between the instructions being processed in each stage (like a branch instruction, where the fetch stage can't run until the branch is executed). So, for four independent instructions, the processor will only take Tmax_stage * (3 + 4 - 1), as the pipeline allows the first instruction to be fetched, then decoded at the same time the second instruction is fetched, and so on.
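To make the arithmetic concrete, here is a small C program that plugs made-up stage latencies into the two formulas above (the nanosecond figures are illustrative, not from any real processor):

```c
#include <stdio.h>

static double max3(double a, double b, double c)
{
    double m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    return m;
}

int main(void)
{
    /* Illustrative stage latencies in nanoseconds (made-up numbers). */
    double t_fetch = 2.0, t_decode = 1.5, t_exec = 3.0;
    int n = 4;                                  /* independent instructions */

    double t_single = t_fetch + t_decode + t_exec;       /* no pipelining */
    double t_stage  = max3(t_fetch, t_decode, t_exec);   /* slowest stage */

    printf("single-cycle: %.1f ns\n", n * t_single);          /* 4 * 6.5 = 26 ns */
    printf("pipelined   : %.1f ns\n", t_stage * (3 + n - 1)); /* 3 * 6   = 18 ns */
    return 0;
}
```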
This should hopefully help better explain pipelining, but to answer your questions directly:
It's a hardware design concept, so it is implemented in hardware, not firmware.
As it's a hardware concept, there is no firmware code to read.

maximum safe CPU utilization for embedded system

What is the maximum safe CPU utilization for an embedded system running critical applications? We are measuring performance with top. Is 50-75% safe?
Real-Time embedded systems are designed to meet Real-Time constraints, for example:
Voltage acquisition and processing every 500 us (let's say sensor monitoring).
Audio buffer processing every 5.8 ms (4ms processing).
Serial Command Acknowledgement within 3ms.
Thanks to a Real-Time Operating System (RTOS), which is "preemptive" (the scheduler can suspend a task to execute one with a higher priority), you can meet those constraints even at 100% CPU usage: the CPU will execute the high-priority task and then resume whatever it was doing.
But this does not mean you will meet the constraints no matter what; a few tips:
High-priority tasks' execution must be as short as possible (by calculating Execution Time / Occurrence you can estimate CPU usage; see the sketch at the end of this answer).
If the estimated CPU usage is too high, look for code optimisation, hardware equivalents (hardware CRC, DMA, ...), or a second microprocessor.
Stress test your device and measure whether your real-time constraints are met.
For the previous example:
Audio processing should be the lowest priority
Serial Acknowledgement/Voltage Acquisition the highest
Stress test can be done by issuing Serial Commands and checking for missed audio buffers, missed analog voltage events and so on. You can also vary CPU clock frequency: your device might meet constraints at much lower clock frequencies, reducing power consumption.
To answer your question: 50-75% and even 100% CPU usage is safe as long as you meet your real-time constraints, but bear in mind that if you want to add functionality later on, you will not have much room for that at 98% CPU usage.
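As a sketch of the Execution Time / Occurrence estimate from the tips above, using the example tasks in this answer (the 4 ms audio figure is from the example; the other execution times and the serial command period are made-up assumptions):

```c
#include <stdio.h>

struct task { const char *name; double exec_ms; double period_ms; };

int main(void)
{
    /* Periods from the example above; execution times (except audio) are
       hypothetical assumptions made only for the sake of the calculation. */
    struct task tasks[] = {
        { "voltage acquisition", 0.05, 0.5  },   /* assumed 50 us every 500 us */
        { "audio processing",    4.0,  5.8  },   /* 4 ms every 5.8 ms          */
        { "serial command",      0.2,  10.0 },   /* assumed 0.2 ms every 10 ms */
    };

    double total = 0.0;
    for (int i = 0; i < 3; i++) {
        double u = tasks[i].exec_ms / tasks[i].period_ms;
        printf("%-20s %5.1f %%\n", tasks[i].name, 100.0 * u);
        total += u;
    }
    printf("%-20s %5.1f %%\n", "total", 100.0 * total);   /* ~81 % */
    return 0;
}
```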
In rate-monotonic scheduling, mathematical analysis determines that real-time tasks (that is, tasks with specific real-time deadlines) are schedulable when the utilisation is below about 70% (and priorities are appropriately assigned). If you have accurate statistics for all tasks and they are deterministic, this can be as high as 85% and still guarantee schedulability.
Note however that the utilisation applies only to tasks with hard-real-time deadlines. Background tasks may utilise the remaining CPU time all the time without missing deadlines.
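The "about 70%" figure comes from the classic Liu & Layland rate-monotonic bound U <= n * (2^(1/n) - 1), which tends to ln 2 (about 69.3%) as the number of tasks grows; a quick numerical check:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Liu & Layland rate-monotonic schedulability bound: n * (2^(1/n) - 1). */
    int counts[] = { 2, 3, 5, 10, 100 };
    for (int i = 0; i < 5; i++) {
        int n = counts[i];
        double bound = n * (pow(2.0, 1.0 / n) - 1.0);
        printf("n = %3d  bound = %.1f %%\n", n, 100.0 * bound);
    }
    printf("limit (n -> inf) = %.1f %%\n", 100.0 * log(2.0));  /* ~69.3 % */
    return 0;
}
```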
Assuming by "CPU utilization" you are referring to time spent executing code other than an idle loop, in a system with a preemptive priority based scheduler, then ... it depends.
First of all there is a measurement problem: if your utilisation sampling period is sufficiently short, it will often switch between 100% and zero, whereas if it were very long you'd get a good average, but would not know whether utilisation was high for long enough to starve a lower-priority task to the extent that it might miss its deadlines. It is kind of the wrong question, because any practical utilisation sampling rate will typically be much longer than the shortest deadline, so it is at best a qualitative rather than a quantitative measure. It does not tell you much that is useful in critical cases.
Secondly, there is the issue of what you are measuring. While it is common for an RTOS to have a means to measure CPU utilisation, it is measuring utilisation by all tasks, including those that have no deadlines.
If the utilisation becomes 100% in the lowest priority task, for example, there is no harm to schedulability - the low priority task is in this sense no different from the idle loop. It may have consequences for power consumption in systems that normally enter a low-power mode in the idle loop.
If a higher-priority task takes 100% CPU utilisation such that a deadline for a lower-priority task is missed, then your system will fail (in the sense that deadlines will be missed - the consequences are application specific).
Simply measuring CPU utilisation is insufficient, but if your CPU is at 100% utilisation and that utilisation is not simply some background task in the lowest priority thread, it is probably not schedulable - there will be stuff not getting done. The consequences are of course application dependent.
Whilst having a low priority background thread consume 100% CPU may do no harm, it does render the ability to measure CPU utilisation entirely useless. If that task is preempted and for some reason the higher-priority task takes 100%, you may have no way of detecting the issue, and the background task (and any tasks lower than the preempting task) will not get done. It is better therefore to ensure that you have some idle time so that you can detect abnormal scheduling behaviour (if you have no other means to do so).
One common solution to the non-yielding background task problem is to perform such tasks in the idle loop. Many RTOS allow you to insert idle-loop hooks to do this. The background task is then preemptable but not included in the utilisation measurement. In that case of course you cannot then have a low-priority task that does not yield, because your idle-loop does stuff.
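A generic sketch of that idle-hook measurement idea (no particular RTOS API is assumed; idle_hook and the one-second housekeeping call are hypothetical names for whatever your RTOS lets you register):

```c
#include <stdint.h>

/* Generic idle-time measurement sketch; no particular RTOS API is assumed.
   The idle hook just increments a counter; a periodic (e.g. 1 s) housekeeping
   task samples it and compares it against a value calibrated with no load. */
static volatile uint32_t idle_ticks;
static uint32_t idle_ticks_per_sec_unloaded = 1000000; /* calibrated at startup (placeholder) */

void idle_hook(void)                 /* registered with the RTOS idle loop */
{
    idle_ticks++;
}

uint32_t cpu_load_percent(void)      /* called from a 1 s periodic task */
{
    uint32_t idle = idle_ticks;
    idle_ticks = 0;
    if (idle > idle_ticks_per_sec_unloaded)
        idle = idle_ticks_per_sec_unloaded;   /* clamp against calibration drift */
    return 100u - (100u * idle) / idle_ticks_per_sec_unloaded;
}
```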
When assigning priorities, the task with the shortest execution time should have the highest priority. Moreover the execution time should be more deterministic in higher priority tasks - that is if it takes 100us to run it should within some reasonable bounds always take that long. If some processing might be variable such that it takes say 100us most of the time and must occasionally do something that requires say 100ms; then the 100ms process should be passed off to some lower priority task (or the priority of the task temporarily lowered, but that pattern may be hard to manage and predict, and may cause subsequent deadlines or events to be missed).
So if you are a bit vague about your task periods and deadlines, a good rule of thumb is to keep it below 70%, but not to include non-real-time background tasks in the measurement.

What are some factors that could affect program runtime?

I'm doing some work on profiling the behavior of programs. One thing I would like to do is get the amount of time that a process has run on the CPU. I am accomplishing this by reading the sum_exec_runtime field in the Linux kernel's sched_entity data structure.
After testing this with some fairly simple programs that simply execute a loop and then exit, I am running into a peculiar issue: the program does not finish with the same runtime each time it is executed. Seeing as sum_exec_runtime is a value represented in nanoseconds, I would expect the value to differ by a few microseconds. However, I am seeing variations of several milliseconds.
My initial reaction was that this could be due to I/O waiting times; however, it is my understanding that the process should give up the CPU while waiting for I/O. Furthermore, my test programs are simply executing loops, so there should be very little to no I/O.
I am seeking any advice on the following:
Is sum_exec_runtime not the actual time that a process has had control of the CPU?
Does the process not actually give up the CPU while waiting for I/O?
Are there other factors that could affect the actual runtime of a process (besides I/O)?
Keep in mind, I am only trying to find the actual time that the process spent executing on the CPU. I do not care about the total execution time including sleeping or waiting to run.
Edit: I also want to make clear that there are no branches in my test program aside from the loop, which simply loops for a constant number of iterations.
Thanks.
Your question is really broad, but you can incur context switches for various reasons. Calling most system calls involves at least one context switch. Page faults cause context switches. Exceeding your time slice causes a context switch.
sum_exec_runtime is equal to utime + stime from /proc/$PID/stat, but sum_exec_runtime is measured in nanoseconds. It sounds like you only care about utime which is the time your process has been scheduled in user mode. See proc(5) for more details.
You can look at nr_switches, both voluntary and involuntary, which are also part of sched_entity. That will probably account for most of the variation, but I would not expect successive runs to be identical. The exact time that you get for each run will be affected by all of the other processes running on the system.
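For reference, a small C sketch of pulling those numbers out of procfs: utime and stime are fields 14 and 15 of /proc/self/stat (see proc(5)), and the voluntary/non-voluntary context-switch counts appear in /proc/self/status (this reads the current process; substitute a PID as needed):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    /* utime (field 14) and stime (field 15) from /proc/self/stat, in clock ticks. */
    FILE *f = fopen("/proc/self/stat", "r");
    if (f && fgets(buf, sizeof buf, f)) {
        unsigned long utime = 0, stime = 0;
        char *p = strrchr(buf, ')');   /* skip pid and (comm), which may contain spaces */
        if (p) {
            sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   &utime, &stime);
            printf("utime = %lu ticks, stime = %lu ticks (%ld ticks/s)\n",
                   utime, stime, sysconf(_SC_CLK_TCK));
        }
    }
    if (f) fclose(f);

    /* Context-switch counts from /proc/self/status. */
    f = fopen("/proc/self/status", "r");
    while (f && fgets(buf, sizeof buf, f)) {
        if (strncmp(buf, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(buf, "nonvoluntary_ctxt_switches", 26) == 0)
            fputs(buf, stdout);
    }
    if (f) fclose(f);
    return 0;
}
```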
You'll also be affected by the amount of file system cache used on your system and how many file system cache hits you get in successive runs if you are doing any IO at all.
To give a very concrete and obvious example of how other processes can affect the run time of the current process, think about what happens if you are exceeding your physical RAM constraints. If your program asks for more RAM, then the kernel is going to spend more time swapping. That time spent swapping will be accounted in stime, but it will vary depending on how much RAM you need and how much RAM is available. There are lots of other ways that other processes can affect your process's run time. This is just one example.
To answer your 3 points:
sum_exec_runtime is the actual time the scheduler ran the process, including system time.
If you count switching to the kernel as the process giving up the CPU, then yes; but that does not necessarily mean a different user process runs - your process may get the CPU back once the kernel is done.
I think I've already answered this above: there are lots of factors.

calculating burst time in operating system

How is burst time calculated for a process in an OS in a real-life scenario? Say a computer has 5 processes to execute and is using the shortest-job-first algorithm. How will the OS know the burst time of each process in advance?
Since an operating system cannot guess the burst time of a process a priori, usually one of the two following approaches is used:
The process is annotated by the developer (e.g., by using methods of critical-execution-path analysis).
The OS uses the execution time of a former instance of the very same code to do an estimation.
However, in "real life", SJF is rarely used, at least not as pure algorithm. It is more of theoretical interest.