Difference between Machine Cycle, Bus Cycle and Execution Cycle - hardware

I am unable to understand the difference between Bus Cycle, Instruction Cycle and Machine Cycle. Please help me out. Thanks

First off, computers use a clock. The frequency of this clock indicates how many cycles per second (kilo/mega/giga) the clock signal completes. This is the basis of any cycle for the computer.
The bus cycle is the time required to make a single read or write transaction between the CPU and an external device such as external memory.
The machine cycle is the number of clock cycles needed to perform a fetch, read, or write operation. A read or write may take more than a single bus cycle if the data being transferred is wider than the bus. For example, on an 8080 the data bus is 8 bits wide; if the CPU needs to fetch or write 16 bits of data, that requires two bus cycles.
The instruction cycle is the number of these machine cycles needed to complete an instruction, and it varies by instruction. Some instructions, after being fetched from memory, need to fetch more data before they can complete; some need to write data at the end of the instruction cycle; and some, like NOP, do very little: the instruction is fetched and nothing else happens for one machine cycle.
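To make the hierarchy concrete, here is a small sketch of the arithmetic. The cycle counts below are made-up, illustrative numbers for a hypothetical 8-bit CPU, not taken from a real datasheet:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers for a hypothetical 8-bit CPU -- not a real datasheet. */
    double clock_hz       = 2e6; /* 2 MHz system clock                      */
    int clocks_per_bus    = 4;   /* clock cycles per bus (read/write) cycle */
    int bus_per_machine   = 1;   /* bus cycles per machine cycle            */
    int machine_per_instr = 3;   /* e.g. opcode fetch + two operand reads   */

    double t_clock   = 1.0 / clock_hz;
    double t_bus     = t_clock * clocks_per_bus;
    double t_machine = t_bus * bus_per_machine;
    double t_instr   = t_machine * machine_per_instr;

    printf("clock period  : %.3f us\n", t_clock * 1e6);
    printf("bus cycle     : %.3f us\n", t_bus * 1e6);
    printf("machine cycle : %.3f us\n", t_machine * 1e6);
    printf("instruction   : %.3f us\n", t_instr * 1e6);
    return 0;
}
```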
I hope this helps a bit. If not, maybe microprocessor timing diagrams will help clear things up a bit more.

Related

Should the number of resource allocations be "per swap chain image"?

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and am a bit confused about the size of uniformBuffers and uniformBuffersMemory. The tutorial says:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand, the "per swap chain image" approach is the better one (please prove me wrong if I am mistaken). But why does the array need to be swapChainImages.size() elements long? Isn't MAX_FRAMES_IN_FLIGHT enough, since we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation, then our single uniform buffer is always available and not in use by the CPU or GPU, so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device because you're about to destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is, "Do I need n buffers, or is m enough?"
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.
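As a rough sketch of how that usually looks with the Vulkan C API (MAX_FRAMES_IN_FLIGHT, inFlightFences, uniformBuffersMapped, and the update logic are placeholder names in the spirit of the vulkan-tutorial code, not copied from it):

```c
#include <stdint.h>
#include <string.h>
#include <vulkan/vulkan.h>

#define MAX_FRAMES_IN_FLIGHT 2  /* frames the CPU may record ahead of the GPU */

/* One uniform buffer mapping and one fence per frame in flight --
 * not per swap-chain image. Creation code is omitted. */
extern VkDevice device;
extern VkFence  inFlightFences[MAX_FRAMES_IN_FLIGHT];
extern void    *uniformBuffersMapped[MAX_FRAMES_IN_FLIGHT];

void drawFrame(uint32_t *currentFrame, const void *uboData, size_t uboSize)
{
    /* Block only on the fence for THIS slot: the submission that last read
     * uniformBuffersMapped[*currentFrame] must have finished before we write. */
    vkWaitForFences(device, 1, &inFlightFences[*currentFrame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFences[*currentFrame]);

    /* Now it is safe to overwrite the per-frame uniform buffer. */
    memcpy(uniformBuffersMapped[*currentFrame], uboData, uboSize);

    /* ... acquire a swap-chain image, record and submit command buffers that
     *     signal inFlightFences[*currentFrame], then present ... */

    *currentFrame = (*currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
}
```

The point is that the CPU waits on a fence signalled by the actual submission that used the buffer, not on vkAcquireNextImageKHR or vkDeviceWaitIdle, so MAX_FRAMES_IN_FLIGHT buffers are enough regardless of how many swap-chain images the presentation engine hands you.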

TinyAVR 0-Series: Can I use pin-change sensing without entering interrupt handler?

I am evaluating the ATtiny806 running at 20MHz to build a cycle-accurate Intel 4004 microprocessor emulator. (I know it will be a bit too slow, but AVRs have a huge community.)
I need to synchronize to the external, two-phase non-overlapping clocks. These are not fast clocks (the original 4004 ran at 750kHz)
but if I spin-wait for every clock edge, I risk wasting most of my time budget.
The TinyAVR 0-series has a very nice pin-change interrupt facility that can be configured to trigger only on rising edges.
But, an interrupt routine round-trip is 8 cycles (3 in, 5 out).
My question is:
Can I leverage the pin-change sensing mechanism while never visiting an ISR?
(Other processor families let you poll for interruptible conditions without enabling interrupts from that peripheral). Can polling be done with a tight skip-on-bit/jump-back loop, followed by a set-bit instruction?
Straightforward way
You can always just poll the level of the GPIO pin using the single-cycle skip-if-bit-set/clear instruction on the appropriate PORT register and bit.
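For reference, a minimal polling sketch in C (assuming avr-gcc with the device headers, and using PA0 as a stand-in for the real clock-phase pin); when the pin is read through its VPORT, the compiler can typically implement the test with the single-cycle SBIS/SBIC instructions:

```c
#include <avr/io.h>

/* Busy-wait for the next rising edge on PA0 (placeholder pin choice).
 * VPORTA.IN lives in the low I/O space, so SBIS/SBIC can be used for the test. */
static inline void wait_for_rising_edge_pa0(void)
{
    while (VPORTA.IN & PIN0_bm)    { }  /* first wait out any current high phase */
    while (!(VPORTA.IN & PIN0_bm)) { }  /* then wait for the pin to go high      */
}
```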
But as you mention, polling does burn cycles so I'm not sure exactly what you want here - either a poll (that burns cycles but has low latency) or an interrupt (that has higher latency but allows processing to continue until the condition is true).
Note that if things get really tight and you are looking for, say, power savings by sleeping between clock signal transitions, then you can do tricks like having an ISR that never returns (saving the RETI return cycles), but that requires some careful coding, probably with something like a state machine.
INTFLAG way
Alternatively, if you want to use the internal pin state-machine logic and you can live without interrupts, then you can use the INTFLAGS register to check for the pin change configured in the ISC bits of the PINxCTRL register. As long as global interrupts are not enabled in SREG, you can spin-poll on the appropriate INTFLAGS bit to check/wait for the desired condition, and then write a 1 to that bit to clear the flag.
Note that if you want to make this fast, you will probably want to map the appropriate PORT to a VPORT, since the VPORT registers are in I/O memory. This lets you use SBIS to test the INTFLAGS bit in a single cycle and SBI to clear it in a single cycle (these instructions only work on I/O memory, and the normal PORT registers are not in I/O memory).
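A minimal sketch of the INTFLAGS approach (same assumptions: avr-gcc device headers, PA0 as a placeholder pin, global interrupts left disabled):

```c
#include <avr/io.h>

/* Configure PA0 so that rising edges are latched in PORTA.INTFLAGS,
 * but never enable global interrupts, so no ISR is ever entered. */
static void edge_capture_init(void)
{
    PORTA.DIRCLR    = PIN0_bm;             /* PA0 as input         */
    PORTA.PIN0CTRL  = PORT_ISC_RISING_gc;  /* sense rising edges   */
    VPORTA.INTFLAGS = PIN0_bm;             /* clear any stale flag */
}

/* Spin until an edge has been latched, then acknowledge it.
 * VPORTA.INTFLAGS is in the low I/O space, so SBIS/SBI apply here. */
static inline void wait_for_latched_edge_pa0(void)
{
    while (!(VPORTA.INTFLAGS & PIN0_bm)) { }  /* SBIS + RJMP style spin */
    VPORTA.INTFLAGS = PIN0_bm;                /* write 1 to clear       */
}
```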
Finally one more complication, if you need to leave the interrupts on when doing this, it is probably possible by hacking the interrupt priority registers. You'd set the pin change to be on level 0, and then make sure the interrupts you care about are level 1 or higher, and then trick the interrupt controller into thinking that there is already a level 0 running so these interrupts do not actually fire. There are also other restrictions to this strategy so avoid it if at all possible.
Programmable logic way
If you want to get really esoteric, you could likely route the input value of a pin into one of the chip's configurable custom logic (CCL) LUTs and then route the output of that module to a bit you can test with a 1-cycle bit test (maybe an unused IO pin). To do this, you'd feed the output of the LUT back into one of its inputs and use the LUT to create a strobe on the edge you are looking for. This is very complex, and since the strobe has no acknowledgement, if the signal changes while you are not looking (in a spin check) the event will be lost and you will have to wait for the next edge (probably fatal in your application).

Does the CPU stall when accessing external memory (SRAM) via FSMC?

I am using an STM32F103 chip with a Cortex-M3 core in a project. According to the manual (section 3.3.1, Cortex-M3 instructions), loading a 32-bit word with a single LDR instruction takes 2 CPU cycles to complete (assuming the destination is not PC).
My understanding is that this is only true for reading from internal memories (Flash or internal SRAM).
When reading from an external SRAM via the FSMC, it must take more cycles to complete the read operation. During the read operation, does the CPU stall until the FSMC is able to put the data together? In other words, do I lose CPU cycles when accessing external memories?
Thank you.
Edit 1: Also assume all accesses are aligned 32-bit accesses.
LDR and STR instructions are not interruptible. The FSMC is bridged from the AHB, and can run at a much slower rate, as you already know. For reads, the pipeline will stall until the data is ready, and this may cause increased worst-case interrupt latency. The write may or may not stall the pipe, depending on configuration. The reference manual says there is a two-word write buffer, but it appears that may only be used to buffer bursting memories. If you were using a CRAM (PSRAM) with a bursting interface, subsequent writes would likely not complete before the next instruction is executing, but a subsequent read would stall (longer) to allow the write to finish before initiating the read.
If using LDM and STM instructions to perform multiple reads or writes, these instructions are interruptible, and it is implementation defined whether they restart from the beginning or continue where they left off when returned to. I haven't been able to find out how ST has chosen to implement this behavior. In either case, each individual bus transaction should not be interrupted.
In regards to LDRD and STRD for working on 64-bit values, I found this discussion which references the following from the ARM-ARM:
"... LDRD, ... STRD, ... instructions are executed as a sequence of
word-aligned word accesses. Each 32-bit word access is guaranteed to
be single-copy atomic. The architecture does not require subsequences
of two or more word accesses from the sequence to be single-copy
atomic."
So, it appears that LDRD and STRD are likely to function the same way LDM and STM function.
The STM32F1xx FSMC has programmable wait states; if these are not set to zero for your memory, reads will indeed take additional cycles. The external data bus is either 16 or 8 bits wide, so 32-bit accesses will also take additional bus cycles. The write FIFO can also cause the insertion of wait states.
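As a rough back-of-the-envelope sketch (the HCLK, bus-width, and per-access cycle numbers below are placeholders, not values for a specific board; the reference manual gives the exact per-access timing formula for your FSMC mode):

```c
#include <stdio.h>

int main(void) {
    /* Placeholder numbers -- substitute your own clock and FSMC BTR settings. */
    double hclk_hz        = 72e6; /* AHB clock                                */
    int ext_bus_bits      = 16;   /* external SRAM data bus width             */
    int cycles_per_access = 4;    /* HCLK cost of one external access, derived
                                     from ADDSET/DATAST etc. in the ref manual */

    int accesses   = 32 / ext_bus_bits;             /* a 32-bit LDR -> 2 accesses */
    int ext_cycles = accesses * cycles_per_access;
    double time_ns = ext_cycles / hclk_hz * 1e9;

    printf("external 32-bit read: ~%d HCLK cycles (~%.0f ns)\n", ext_cycles, time_ns);
    printf("compare with ~2 CPU cycles for an internal LDR -- the rest is stall\n");
    return 0;
}
```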
On the other hand, the Cortex-M is a Harvard architecture core with different memories on different buses, so instruction and data fetches can occur simultaneously, minimising processor stalling to some extent.

Drive high-frequency output (32 kHz) on 20 MHz Renesas microcontroller IO pin

I need to drive a 32 kHz square wave on pin 19 of a Renesas R8C/36C microcontroller. The pin is non-negotiable (the circuit design is already complete).
The software design uses a 250 µs interrupt for simulating multi-tasking, but that's only good for a 2 kHz full wave.
Do I need to create another higher-priority interrupt for driving 32 kHz, or is there some other trick that I'm not aware of?
R8C/36C Hardware Manual
R8C/36C Software Manual
I am not familiar with the R8C, and Renesas don't say much on the subject of performance, but it is a CISC processor with typically 4 cycles per instruction, so let's estimate about 4 MIPS. Some instructions are much longer, with division taking up to 30 cycles.
So if you create a 64 kHz timer and flip the output on each interrupt, you have about 63 instruction times between interrupts, out of which come the interrupt latency plus the code to flip the bit. If it works at all, it is likely to constitute a significant CPU load and may affect the timeliness of other operations.
Be realistic: without a redesign, the project may not be viable. You are already stressing it with the 4 kHz OS tick in my opinion - the software overhead at that rate is likely to be a significant chunk of your CPU load.
[ADDED]
I previously suggested 6 instructions between interrupts - finger trouble on the calculator; I have changed that estimate to 63 and moderated my conclusion to "barely feasible".
However, I looked again at the data sheet: interrupt latency is variable because instruction execution time is variable, and the current instruction must complete before the interrupt is serviced. The worst case is when the DIVX instruction is executing, when it takes up to 51 cycles before the first instruction of the interrupt routine runs. That's 2.55 µs when you need the interrupt to trigger every 15.625 µs, so the variable latency will impose significant jitter and constitutes 6 to 16 % of your total CPU time without even considering the time used by the ISR itself. Plus, if the interrupt itself is pre-empted, or a higher-priority interrupt is running when this one becomes due, further jitter will be imposed.
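The arithmetic behind those figures, using only the numbers quoted above, can be laid out as a small sketch:

```c
#include <stdio.h>

int main(void) {
    double f_cpu       = 20e6;      /* CPU clock                                 */
    double mips_est    = 4e6;       /* ~4 MIPS estimate at 4 cycles/instruction  */
    double f_out       = 32e3;      /* required square-wave frequency            */
    double f_irq       = 2 * f_out; /* toggle twice per period -> 64 kHz IRQs    */

    double t_irq_us    = 1e6 / f_irq;      /* 15.625 us between interrupts      */
    double budget      = mips_est / f_irq; /* ~62.5 instruction slots per IRQ   */
    double divx_cycles = 51;               /* worst-case latency (DIVX running) */
    double jitter_us   = divx_cycles / f_cpu * 1e6;    /* 2.55 us              */
    double jitter_pct  = 100.0 * jitter_us / t_irq_us; /* ~16 % of the period  */

    printf("IRQ period         : %.3f us\n", t_irq_us);
    printf("instruction budget : %.1f per IRQ\n", budget);
    printf("worst-case latency : %.2f us (%.1f%% of the period)\n",
           jitter_us, jitter_pct);
    return 0;
}
```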
Whether it works will depend on the accuracy and jitter constraints of the 32 kHz output, and whatever else your code needs to get done.
As many people have pointed out, this design doesn't seem to be very good from a hardware standpoint if the 32 kHz clock is meant to be generated with a GPIO.
However, I don't know how desperate your situation is, nor do I know the volumes involved. But if it is a prototype or a very short series, and pin 20 is free, you can short-circuit pins 19 and 20, set up pin 19 as an input and pin 20 as an output. Since pin 20 can be used as an output from timer RD, you could set up that timer to output the 32 kHz without using any interrupts.
I am not a Renesas micro expert, but I'm speaking from what I've seen in the data sheet you attached and from previous experience with other MCUs.
I hope this helps.
Looking at the datasheet for that chip, it looks like your only real option is to use the pin as a generic output port; that seems to be the only usable output mode for pin 19.
Could you strap pin 19 to another pin that has the hardware to generate 32 kHz, and just make pin 19 an input? Not a proud moment, but it was easy on a DIL package.
Could you run an interrupt every 15.6 µs and toggle pin 19 in it, then on every sixteenth interrupt do the multi-tasking stuff? That is likely to be wasteful. Alternatively, with an interrupt rate of 32 kHz: set pin 19, then one time in eight do the multi-tasking decisions, and the other seven times wait until the point where you can reset pin 19 and do some background code, for less than half the CPU time.
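A rough sketch of the first variant (the ISR name, the TOGGLE_PIN19() macro, and run_scheduler() are hypothetical placeholders; configuring an R8C timer to fire at 64 kHz is not shown):

```c
#include <stdint.h>

/* Hypothetical hooks -- replace with the real port write for pin 19 and
 * your existing 250 us multi-tasking tick. */
#define TOGGLE_PIN19()   /* invert the port bit that drives pin 19 */
void run_scheduler(void);

/* Called from a hardware timer interrupt every 15.625 us (64 kHz).
 * Toggling at 64 kHz gives a 32 kHz square wave; every 16th tick
 * (16 x 15.625 us = 250 us) runs the original scheduler. */
void timer_isr_64khz(void)
{
    static uint8_t tick;

    TOGGLE_PIN19();

    if (++tick >= 16) {
        tick = 0;
        run_scheduler();   /* the 250 us multi-tasking work */
    }
}
```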

How does the CLOCK control the order of events?

How does the clock keep various events (operations) occurring in the desired sequence? What is the significance of the clock cycle time (I've heard that many operations can be issued in a single clock cycle)?
Or, simply: how does the CPU control operation ordering?
CPUs have various processing units (float, vector, integer), and pipelines of different lengths for each unit.
The clock determines the speed at which an instruction moves through the operations in a pipeline, each operation taking a tick. Once it gets to the end, the result is sent back to cache/memory.
Multiple pipelines can be active at the same time.
That's all I can tell you.
Ars Technica used to have great articles about this, such as this one:
Understanding the Microprocessor
The clock does not control the sequence of instructions. The clock controls the number of times per second that the CPU "ticks." Each tick is referred to as a cycle, and consequently each cycle takes some amount of time to complete.
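As a small numeric illustration of what the cycle time means (3 GHz is just an example frequency):

```c
#include <stdio.h>

int main(void) {
    double freq_hz = 3e9;            /* example: a 3 GHz clock       */
    double cycle_s = 1.0 / freq_hz;  /* duration of one tick (cycle) */

    printf("ticks per second : %.0f\n", freq_hz);
    printf("one cycle        : %.3f ns\n", cycle_s * 1e9);
    return 0;
}
```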
The sequence of instructions is dictated by the running program. Modern CPUs also include optimisations that influence the exact sequence.
These optimisations also make the clock speed (the number of cycles per second) less significant. For example, a dual-core CPU is able to execute two instructions in the same cycle.
Yes, usually instructions complete in a couple of cycles, and compilers optimise programs to use costly instructions less.