I am learning about memory-mapped I/O. So far I have learned that the CPU reads and writes specific memory addresses in order to talk to a given I/O device. In the traditional way, the I/O device would generate an interrupt, and that is how the CPU knew the device had finished processing and had results ready to be consumed.
But in memory-mapped I/O there is no such provision, right? So, according to my (admittedly shaky) understanding, to find out whether an I/O device has produced a result, the CPU needs to go and read that memory location every time. Isn't that bad? It seems even worse than ordinary polling, because it adds the cycles needed to read from memory.
What am I missing here? Please help.
Take an example from my device:
MEMORY {
    IRAM     : origin = 0x0,        len = 0x30000
    CACHE_L2 : origin = 0x30000,    len = 0x10000
    SDRAM    : origin = 0x80000000, len = 0x1000000
    FPGA_A1  : origin = 0x90000000, len = 0x1000
    FPGA_A2  : origin = 0xA0000000, len = 0x1000
    WATCHDOG : origin = 0xB0000000, len = 0x1
}
This is for my C6713 DSP. It shares the FPGA_A1 and FPGA_A2 regions with two FPGAs, and it shares the SDRAM section with a PowerPC CPU.
This is what can be called an example of a memory-mapped device.
The two FPGAs basically handle the ADC conversion, pulsing, digital I/O and protections.
Let's stick to the ADC.
A C6713 on its own would process the ADC at about 50 µs per sample, which is also the rate at which we execute our control loop. Maybe we could push it to 25 µs, but we would be consuming more and more CPU, and the C6713 would be able to do less and less with each increase in sampling speed.
The FPGA, on the other hand, does an ADC conversion every 1 µs. So by the time the first control loop iteration starts, the FPGA has already churned out 50 values. Our control loop doesn't need them all; it can go to memory, do a single read operation, and have every value it needs.
So by offloading this onto a memory-mapped device, we have:
Freed the C6713 from the ADC operation (that is a lot of computing power freed).
Ensured that with just a memory read operation the C6713 gets a new value.
Now, to have your CPU read the value back, you can either schedule a read operation on your CPU, making sure it always gets a fresh value, or you can configure your FPGA to generate an interrupt, in this particular case every 50 µs, making your CPU interrupt-driven.
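As a minimal sketch of that memory read (only the FPGA_A1 base address comes from the memory map above; the register offset and names are hypothetical), picking up the latest conversion is a single volatile read:

#include <stdint.h>

/* FPGA_A1 base address from the linker memory map above.
   The result-register offset is a made-up example. */
#define FPGA_A1_BASE   0x90000000u
#define ADC_RESULT     (*(volatile uint16_t *)(FPGA_A1_BASE + 0x0u))

uint16_t read_adc_sample(void)
{
    /* One memory read returns the latest conversion; the FPGA
       keeps this register updated every 1 us on its own. */
    return ADC_RESULT;
}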
So in short, no, it is not bad: you gain a lot by offloading such tasks from the CPU and free up a lot of computing power. In the process you also simplify your system.
Hope this helps.
I am trying to learn how to debug an MCU non-intrusively using SWD and OpenOCD.
while (1)
{
my_count++;
HAL_GPIO_TogglePin(LD2_GPIO_Port,LD2_Pin);
HAL_Delay(750);
}
The code running on my MCU has a free-running counter "my_count". I want to sample/trace the data stored at the address holding "my_count" in real time:
I was doing it this way:
while (1) {                 // generic algorithm, no specific language
    mdw 0x00000000200000ac; // OpenOCD command to read a word from an address
}
0x200000ac is the address of the variable my_count from the .map file.
But this method is very slow and experiences data drops at high frequencies.
Is there any other way to trace the data at high frequencies without experiencing data drops?
I did some napkin math, and I have an idea that may work.
As per the Reference Manual, page 948, the maximum baud rate of the STM32F334's UART is 9 Mbit/s.
If we want to send the memory at the specific address, that is 32 bits. One bit takes 1/9 Mbps, or about 1.111·10^-7 s; multiply that by 32 bits and you get roughly 3.555 µs. Obviously, as I said, this is purely napkin math: there are start and stop bits involved. But we have a lot of wiggle room; you could easily fit 64 bits into a transmission too.
Now, I've checked on the internet, and it seems the ST-Link, based on an STM32F103, has a maximum baud rate of 4.5 Mbps. A bummer, but we simply need to double our timings: 3.555·2 ≈ 7.1 µs for a 32-bit and 14.2 µs for a 64-bit transmission. Even given some start/stop-bit overhead, we still seem to fit into our 25 µs time budget.
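To make that framing overhead concrete: with one start and one stop bit, each byte costs 10 bit-times, so a 32-bit value is 4 × 10 = 40 bit-times, i.e. 40 / 4.5 Mbps ≈ 8.9 µs at the ST-Link rate, and a 64-bit value is 80 bit-times ≈ 17.8 µs. Both still fit comfortably inside the 25 µs budget.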
So the suggestion is the following:
You have a timer set to a 25 µs period that fires an interrupt, which activates a DMA UART transmission. That way your MCU has very little overhead, since the DMA will autonomously handle the transmission while your MCU does whatever it wants in the meantime. Entering and exiting the timer ISR will in fact be the greatest part of the overhead, since in the ISR you literally flip a pair of bits to tell the DMA to send the data over the UART at 4.5 Mbps.
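A minimal sketch of that ISR using the STM32 HAL (the huart1 handle name is an assumption, and the timer is assumed to be configured elsewhere for a 25 µs period):

#include "stm32f3xx_hal.h"

extern UART_HandleTypeDef huart1;  /* hypothetical UART handle, DMA-enabled */
extern uint32_t my_count;          /* the variable being traced */

/* Invoked by the HAL from the timer interrupt. */
void HAL_TIM_PeriodElapsedCallback(TIM_HandleTypeDef *htim)
{
    (void)htim; /* only one timer uses this callback in this sketch */

    /* Start a 4-byte DMA transfer of the current counter value.
       The CPU leaves the ISR immediately; the DMA engine streams
       the bytes out at the configured baud rate on its own. */
    HAL_UART_Transmit_DMA(&huart1, (uint8_t *)&my_count, sizeof(my_count));
}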
I have a vertex buffer that is stored in device memory that is host visible and host coherent.
To write to the vertex buffer on the host side, I map it, memcpy to it, and unmap the device memory.
To read from it, I bind the vertex buffer in a command buffer while recording a render pass. These command buffers are submitted in a loop that acquires, submits, and presents, to draw each frame.
Currently I write once to the vertex buffer at program start up.
The vertex buffer then remains the same during the loop.
I'd like to modify the vertex buffer between each frame from the host side.
What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.
For each frame:
I wait on the fence.
I reset the fence.
The acquire signals semaphore #1.
The queue submit waits on semaphore #1, signals semaphore #2, and signals the fence.
The present waits on semaphore #2.
Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?
If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid stalling on GPU operations. So never wait on a fence for a batch that was just issued. The same goes for memory: never write to memory that is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
Each batch you submit reads from certain memory, so the fence for that batch will be signaled when the GPU has finished reading from that memory. If you want to write to that memory from the CPU, you cannot begin until the fence for the batch that reads it has been signaled.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame; it's the fence for the batch you submitted the frame before that. Since the GPU received that work some time ago, it is far less likely that the CPU will actually have to wait. That is, the fence should hopefully already be set.
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check whether it is set (vkGetFenceStatus), and if it isn't, go do something else useful with your time. But if you have nothing else useful to do, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
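Putting that together, a sketch of the per-frame write might look like this (two regions inside one persistently mapped allocation; all handles are assumed to be created elsewhere, and the names are hypothetical):

#include <string.h>
#include <vulkan/vulkan.h>

/* Region i of the vertex memory is used on frames where
   frameIndex % 2 == i; fences[i] guards the batch reading region i. */
void update_vertices(VkDevice device, VkFence fences[2], void *mapped,
                     VkDeviceSize regionSize, uint64_t frameIndex,
                     const void *newVerts)
{
    uint32_t region = (uint32_t)(frameIndex % 2);

    /* This fence was signaled by the batch submitted two frames ago,
       so it is usually already set and we rarely block here. */
    if (vkGetFenceStatus(device, fences[region]) == VK_NOT_READY)
        vkWaitForFences(device, 1, &fences[region], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &fences[region]);

    /* Host-coherent memory: a plain memcpy is visible to the GPU,
       provided the reading batch is submitted after this write. */
    memcpy((char *)mapped + region * regionSize, newVerts, (size_t)regionSize);

    /* ...then record/submit the batch that binds the vertex buffer at
       offset region * regionSize, signaling fences[region] on completion. */
}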
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
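For reference, a sketch of that event-based path (event, cmdBuf and device are assumed to be created elsewhere; the destination stage and access are chosen for a vertex-buffer read):

VkMemoryBarrier barrier = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_HOST_WRITE_BIT,            /* the host writes */
    .dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT, /* the GPU reads   */
};

/* Recorded into the batch, before the draw that reads the buffer: */
vkCmdWaitEvents(cmdBuf, 1, &event,
                VK_PIPELINE_STAGE_HOST_BIT,          /* who sets the event */
                VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,  /* who waits on it    */
                1, &barrier, 0, NULL, 0, NULL);

/* Later, on the CPU, once the memcpy has finished: */
vkSetEvent(device, event);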
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.
Just for fun, I'm trying to design a more complex Z80 CP/M system with a lot of peripheral devices. While reading the documentation I stumbled over an (undocumented?) behaviour of the Z80 CPU when accepting an interrupt in IM0.
When an interrupt occurs, the Z80 activates M1 and IORQ to signal the external device: "Hey, give me an opcode". All is well if the opcode is RST 00h or something like that. Now the documentation says that ANY opcode of any instruction can be given to the CPU, for instance a CALL.
But now comes the undocumented part: "The first byte of a multi-byte instruction is read during the interrupt acknowledge cycle. Subsequent bytes are read in by a normal memory read sequence."
A "normal memory read sequence". How can I determine, if the CPU wants to get a byte from memory or instead the next byte from the device?
EDIT: I think I found a (good?) solution: I can detect the start of the interrupt acknowledge cycle by analyzing IORQ and M1, and I can detect the next "normal" opcode fetch by analyzing MREQ and M1. This way I can install a flip-flop triggered by these two ANDed signals, i.e. the flip-flop is 1 as long as the CPU reads data from the I/O device. This 1 I can use to inhibit the bus drivers to and from the memory.
My intentions? I'm designing an interrupt controller with 8 prioritized inputs in a CPLD. Its registers hold a 16-bit address for each interrupt pin. Just for fun :-)
My understanding is that the peripheral device is required:
to know how many bytes it needs to feed;
to respond to normal read cycles following the IORQ cycle; and
to arrange that whatever would normally respond to memory read cycles does not do so for the duration.
Also the behaviour was documented by Zilog in an application note, from which your quote originates (presumably uncredited).
In practice I guess 99.99% of IM0 users just use an RST and 99.99% of the rest use a known-size instruction like CALL xxxx.
(Also, I'm aware of a few micros that effectively guaranteed not to put anything onto the bus during an interrupt cycle, thereby turning IM0 into a synonym for IM1: with open-collector outputs the undriven bus reads as 0xFF, which is RST 38h, exactly the instruction IM1 executes.)
The interrupt behavior is reasonably documented in the Z80 manual:
Interrupt modes: IM2 allows the device to supply the low byte of a 16-bit vector (the I register supplies the high byte), which gets you at least halfway to a full 16-bit direct address.
How to set the interrupt modes
My understanding is that the M1 + IORQ combination is used because there was no pin left for a dedicated interrupt-acknowledge response. A fun detail: the Zilog I/O chips such as the PIO, SIO and CTC watch for the RETI instruction (as the CPU fetches it) to learn that the CPU is ready to accept another interrupt.
I am using an STM32F103 chip with a Cortex-M3 core in a project. According to the manual, section 3.3.1 (Cortex-M3 instructions), loading a 32-bit word with a single LDR instruction takes 2 CPU cycles to complete (assuming the destination is not the PC).
My understanding is that this is only true for reads from internal memories (flash or internal SRAM).
When reading from an external SRAM via the FSMC, it must take more cycles to complete the read operation. During the read operation, does the CPU stall until the FSMC is able to put the data together? In other words, do I lose CPU cycles when accessing external memories?
Thank you.
Edit 1: Also assume all access are aligned 32bit access.
LDR and STR instructions are not interruptible. The FSMC is bridged from the AHB, and can run at a much slower rate, as you already know. For reads, the pipeline will stall until the data is ready, and this may cause increased worst-case interrupt latency. The write may or may not stall the pipe, depending on configuration. The reference manual says there is a two-word write buffer, but it appears that may only be used to buffer bursting memories. If you were using a CRAM (PSRAM) with a bursting interface, subsequent writes would likely not complete before the next instruction is executing, but a subsequent read would stall (longer) to allow the write to finish before initiating the read.
If using LDM and STM instructions to perform multiple reads or writes, these instructions are interruptible, and it is implementation-defined whether they restart from the beginning or continue where they left off when returned to. I haven't been able to find out how ST has chosen to implement this behavior. In either case, each individual bus transaction should not be interrupted.
In regards to LDRD and STRD for working on 64-bit values, I found this discussion which references the following from the ARM-ARM:
"... LDRD, ... STRD, ... instructions are executed as a sequence of
word-aligned word accesses. Each 32-bit word access is guaranteed to
be single-copy atomic. The architecture does not require subsequences
of two or more word accesses from the sequence to be single-copy
atomic."
So, it appears that LDRD and STRD are likely to function the same way LDM and STM function.
The STM32F1xx FSMC has programmable wait states; if those are not set to zero for your memory, reads will indeed take additional cycles. The data bus to the external memory is either 16 or 8 bits wide, so 32-bit accesses will also take additional cycles. The write FIFO can cause the insertion of wait states as well.
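For what it's worth, here is a sketch of how those wait states get programmed with the STM32F1 HAL (all timing numbers are illustrative, not tuned for any particular SRAM):

#include "stm32f1xx_hal.h"

SRAM_HandleTypeDef hsram1;

void fsmc_sram_init(void)
{
    FSMC_NORSRAM_TimingTypeDef timing = {0};

    hsram1.Instance = FSMC_NORSRAM_DEVICE;
    hsram1.Extended = FSMC_NORSRAM_EXTENDED_DEVICE;
    hsram1.Init.NSBank          = FSMC_NORSRAM_BANK1;
    hsram1.Init.MemoryType      = FSMC_MEMORY_TYPE_SRAM;
    hsram1.Init.MemoryDataWidth = FSMC_NORSRAM_MEM_BUS_WIDTH_16;
    hsram1.Init.WriteOperation  = FSMC_WRITE_OPERATION_ENABLE;

    timing.AddressSetupTime      = 2;  /* HCLK cycles before the data phase */
    timing.DataSetupTime         = 3;  /* data phase length: effectively the
                                          wait states discussed above */
    timing.BusTurnAroundDuration = 1;
    timing.AccessMode            = FSMC_ACCESS_MODE_A;

    HAL_SRAM_Init(&hsram1, &timing, NULL);
}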
On the other hand, the Cortex-M is a Harvard architecture core with different memories on different buses, so instruction and data fetches can occur simultaneously, minimising to some extent processor stalling.
I read that if DMA is available, the processor can route long read or write requests for disk blocks to the DMA controller and concentrate on other work. But the DMA-to-memory data/control channel is busy during this transfer. What else can the processor do during this time?
First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.
The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.
As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.
Nowadays, however, memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fairly large cache, so it can often execute many instructions without using main memory at all.
Well, the key point to note is that the DMA transfer uses only part of the CPU bus bandwidth; the rest remains free for other jobs/processes to use. This is the key advantage of DMA over programmed I/O. Hope this answered your question :-)
But, DMA to memory data/control channel is busy during this transfer.
Being busy doesn't mean being saturated and unable to handle other concurrent transfers. It's true the memory may be a bit less responsive than normal, but the CPU can still do useful work, and there are things it can do unimpeded: crunch data that's already in its cache, receive hardware interrupts, and so on. And it's not just about the quantity of data, but the rate at which it's generated: some devices produce data in hard real time and need it consumed promptly, otherwise it's overwritten and lost. To handle this without DMA, the software might have to pin itself to a CPU core and spin, reading in a loop, to avoid being swapped onto some other task for an entire scheduler time slice, even though most of the time no new data is even ready.
During a DMA transfer, the CPU is idle and has no control over the memory bus; the CPU's bus interface is put into a high-impedance state for the duration.