Embedded system: reading the same memory block with two different DMAs

I am new to embedded device programming. I have a task that reads a block of data from a DSP memory address and copies it to other addresses (where other peripherals are mapped). The copying is done by programming one of the DMA channels in the device.
I would like to have that data copied to an additional location, on top of the first copy.
Now my question is: if I use a second DMA channel and trigger its copy operation right after the first DMA starts doing its job, would the two DMA operations collide with each other in some way?

It depends, I'm sure, on what you're doing this on, but no, the DMA channels will most likely not 'collide', though one may preempt the other.
If you're doing this on one of Microchip's dsPIC33F devices, the point of the DMA is that the access is independent of the CPU. If you time it right, you can match your DMA timing to your clock timing and get atomic reads or writes. Also, you can have up to 8 unidirectional channels, which are ordered by priority.
On that platform, I believe (I don't know for certain) that two DMA channels will not operate simultaneously; they will operate one after the other, based on each channel's priority. The higher-priority channel will finish first, even if the lower-priority channel started first.
So, yes, you can copy your information to two different locations without using up CPU clocks, but it will take twice as long.
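As a rough illustration of kicking off two channels back to back, here is a minimal sketch against a generic memory-to-memory DMA controller. The register names and addresses (DMA0_SRC, DMA0_DST, and so on) are hypothetical placeholders, not the dsPIC33F register map; check your device's reference manual for the real layout and trigger options.

    /* Copy the same source block to two destinations using two DMA channels.
     * Both channels only read the source, so they cannot corrupt it; the
     * controller arbitrates bus access between them (by priority or round-robin).
     * All register names and addresses below are hypothetical. */
    #include <stdint.h>

    #define DMA_REG(addr)   (*(volatile uint32_t *)(addr))
    #define DMA0_SRC        DMA_REG(0x40010000u)
    #define DMA0_DST        DMA_REG(0x40010004u)
    #define DMA0_LEN        DMA_REG(0x40010008u)
    #define DMA0_CTRL       DMA_REG(0x4001000Cu)
    #define DMA1_SRC        DMA_REG(0x40010100u)
    #define DMA1_DST        DMA_REG(0x40010104u)
    #define DMA1_LEN        DMA_REG(0x40010108u)
    #define DMA1_CTRL       DMA_REG(0x4001010Cu)
    #define DMA_CTRL_START  (1u << 0)
    #define DMA_CTRL_BUSY   (1u << 1)

    static void dma_copy_twice(uint32_t src, uint32_t dst1, uint32_t dst2, uint32_t len)
    {
        DMA0_SRC = src;  DMA0_DST = dst1;  DMA0_LEN = len;
        DMA1_SRC = src;  DMA1_DST = dst2;  DMA1_LEN = len;

        DMA0_CTRL = DMA_CTRL_START;     /* start the first copy              */
        DMA1_CTRL = DMA_CTRL_START;     /* start the second copy right after */

        while ((DMA0_CTRL & DMA_CTRL_BUSY) || (DMA1_CTRL & DMA_CTRL_BUSY))
            ;                           /* wait for both channels to finish  */
    }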

Underlying hardware mapping of Vulkan queues

Vulkan is intended to be thin and explicit to the user, but queues are a big exception to this rule: queues may be multiplexed by the driver, and it's not always obvious whether using multiple queues from a family will improve performance or not.
After one of the driver updates, I got 2 transfer-only queues instead of one, but I'm pretty sure there will be no benefit in using them in parallel for data streaming compared to just using one of them (I will be happy to be proved wrong).
So why not just say "we have N separate hardware queues, and if you want to use some of them in parallel, just mutex it yourself"? As it stands, there seems to be no way to know how independent the queues in a family really are.
GPUs these days have to contend with a multi-processed world. Different programs can access the same hardware, and GPUs have to be able to deal with that. As such, having parallel input streams for a single piece of actual hardware is no different from being able to create more CPU threads than you have actual CPU cores.
That is, a queue from a family is probably not "mutexing" access to the actual hardware. At least, not in a CPU way. If multiple queues from a family are different paths to execute stuff on the same hardware, then the way that hardware gets populated from these multiple queues probably happens at the GPU level. That is, it's an actual hardware feature.
And you could never get performance equivalent to that hardware feature by "mutexing it yourself". For example:
I've got 2 transfer-only queues instead of one, but I'm pretty sure that there will be no benefit in using them in parallel for data streaming compared to just using one of them
Let's assume that there really is only one hardware DMA channel with a fixed bandwidth behind that transfer queue. This means that, at any one time, only one thing can be DMA'd from CPU memory to GPU memory.
Now, let's say you have some DMA work to do. You want to upload a bunch of stuff. But every now and then, you need to download some rendering product. And that download needs to complete ASAP, because you need to reuse the image that stores those bytes.
With prioritized queues, you can give the download transfer queue much higher priority than the upload queue. If the hardware permits it, then it can interrupt the upload to perform the download, then get back to the upload.
With your way, you'd have to upload each item one at a time, at regular intervals, in a process that must be able to be interrupted by a possible download. To do that, you'd basically have to have a recurring task that shows up to perform and submit a single upload to the transfer queue.
It'd be much more efficient to just throw the work at the GPU and let its priority system take care of it. Even if there is no priority system, then it'll probably perform operations round-robin, jumping back and forth between the input transfer queue operations rather than waiting for one queue to run dry before trying another.
But of course, this is all hypothetical. You'd need to do profiling work to make sure that these things pan out.
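For what it's worth, the relative priorities themselves are requested up front when the logical device is created. A minimal sketch (assuming transferFamilyIndex was found earlier via vkGetPhysicalDeviceQueueFamilyProperties() and exposes at least two queues; error handling and device features omitted). How the driver and hardware actually honor the two priority values is implementation-defined, which is exactly why the profiling caveat above applies.

    /* Create a device with two queues from one transfer-capable family:
     * index 0 at priority 1.0 for urgent downloads, index 1 at 0.5 for bulk uploads. */
    #include <vulkan/vulkan.h>

    VkDevice create_device_with_two_transfer_queues(VkPhysicalDevice physicalDevice,
                                                    uint32_t transferFamilyIndex,
                                                    VkQueue *downloadQueue,
                                                    VkQueue *uploadQueue)
    {
        float priorities[2] = { 1.0f, 0.5f };

        VkDeviceQueueCreateInfo queueInfo = {0};
        queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
        queueInfo.queueFamilyIndex = transferFamilyIndex;
        queueInfo.queueCount       = 2;
        queueInfo.pQueuePriorities = priorities;

        VkDeviceCreateInfo deviceInfo = {0};
        deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
        deviceInfo.queueCreateInfoCount = 1;
        deviceInfo.pQueueCreateInfos    = &queueInfo;

        VkDevice device = VK_NULL_HANDLE;
        if (vkCreateDevice(physicalDevice, &deviceInfo, NULL, &device) != VK_SUCCESS)
            return VK_NULL_HANDLE;

        vkGetDeviceQueue(device, transferFamilyIndex, 0, downloadQueue);
        vkGetDeviceQueue(device, transferFamilyIndex, 1, uploadQueue);
        return device;
    }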
The main issue with queues within families is that they sometimes represent distinct hardware with their own dedicated resources and sometimes they don't. AMD's hardware for example offers two transfer queues, but these actually use separate DMA channels. Granted, they probably still share the same overall bandwidth, but it's not a simple case of one queue having to wait to execute work until the other queue has executed a transfer command.

What happens at the receive part when we write with SPI?

When an SPI master writes to the slave, something is shifted into the receive buffer, right?
If yes, then is it normal for the "RXDATAAVAILABLE" flag to be set? That seems like nonsense: we send data, and when the data has been sent, we get notified that data has been received.
If all of my statements are correct, then how do we know which data in the RX FIFO is the correct data?
Suppose we send a two-byte frame. The first byte is the address and the second one is a dummy, in order to read the value at that address (in the slave). Then suppose we have a two-level RX FIFO. In that FIFO, instead of just the value read from the slave, we have two bytes: the first is who-knows-what, and the second is the value read from the slave.
So the question is: how do we manage to receive only what is necessary, without getting junk data during the write part of the frame?
SPI works like a simple 8-bit shift register. You shift out bytes on MOSI on each clock edge, and at the same time you shift in new data from MISO. Thus you send and receive at the same time, hence the names MOSI = Master Out Slave In and MISO = Master In Slave Out.
SPI peripherals on microcontrollers are more intricate than that though, and have separate data registers distinct from the actual hardware shift register, so that we can write data without worrying about the pending transmission. Some may even have multiple data buffers. But at the fundamental level, SPI always works with 8 bits.
When the microcontroller acting as SPI master writes something, there are usually two flags: one that says the data buffer is available again, and another that tells you the transmission is done.
When you are done sending, you are also done receiving, and you'll get some sort of flag set. This is assuming that all devices implement SPI as intended, which is often not the case.
Note that some devices implement a scheme where you first send x bytes of data, and after that receive x bytes of data. This seems to be the scenario you describe. Sending and receiving are not done at the same time for such a device, but in sequence. That means that during the first transmission you'll clock in garbage, and then, in order to receive data, you must clock out garbage. This is no fault of SPI, but of how the manufacturer of the specific device has specified things.
Note that SPI is very poorly standardized and therefore all manner of weird crap exists on the market. The manner of sending/receiving data may vary, the clock polarity and active edge may vary, and where the device clocks the data may vary. Some devices might need delays between data bytes. Some devices might need some obscure handling of the Slave Select pin in order to work. It's all one big mess, and the lack of international standardization is to blame.
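To make the "send and receive at the same time" point concrete, a single-byte transfer on a typical SPI master boils down to something like the sketch below. The register and flag names (SPI_SR, SPI_DR, TXE, RXNE) and the addresses are generic placeholders, not any particular vendor's map.

    /* Generic full-duplex SPI byte exchange; all register/flag names and
     * addresses below are hypothetical placeholders. */
    #include <stdint.h>

    #define SPI_SR   (*(volatile uint32_t *)0x40013008u)  /* status register */
    #define SPI_DR   (*(volatile uint32_t *)0x4001300Cu)  /* data register   */
    #define SPI_TXE  (1u << 1)   /* transmit buffer empty    */
    #define SPI_RXNE (1u << 0)   /* receive buffer not empty */

    static uint8_t spi_transfer(uint8_t out)
    {
        while (!(SPI_SR & SPI_TXE))  ;    /* wait until the data register is free      */
        SPI_DR = out;                     /* shifting 'out' onto MOSI ...              */
        while (!(SPI_SR & SPI_RXNE)) ;    /* ... simultaneously shifts a byte in       */
        return (uint8_t)SPI_DR;           /* whatever arrived on MISO during that byte */
    }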
An SPI master engine's received data available flag will be set as a simple result of the occurrence of a word's worth of clock cycles generated by the master itself. It tells you nothing about the operation or even existence of a peripheral on the bus.
When this flag is set, it is entirely up to you and your software to know if the contents of the received data register will have meaning or not.
If you have properly selected and interacted with an existent, operational peripheral in a read or transfer operation where it is documented to give a result, the contents will have meaning.
If you have performed a purely write operation to a peripheral for which no reply data is documented at the word position in question, the contents will be meaningless, effectively no different than reading some random legal memory location. Note that in most cases, a write operation is simply a transfer where the received data is to be ignored - at the implementation level there is usually no other difference.
If you have failed to address any existent peripheral, the contents will similarly be meaningless.
As with any other memory read operation, it is up to you to know whether the contents of the register in a particular situation are meaningful or not.
Since you know that the first byte contains "who knows what" while the second has meaning, write your software to ignore the first and use the second.
(As an aside, many, though by no means all, SPI peripherals are documented to shift out whatever constitutes their primary status register during the address phase, since that makes for a quick way to poll it)
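Applied to the two-byte frame from the question, the read helper simply throws away the byte clocked in during the address phase (or keeps it as a status byte, if the device documents it that way). A minimal sketch, assuming a full-duplex spi_transfer() primitive like the one above and hypothetical cs_assert()/cs_release() chip-select helpers; the 0x80 "read" flag is also just an example convention:

    /* Read one register from an SPI device that expects "address byte, then
     * dummy byte". Helper functions and the 0x80 read bit are assumptions. */
    #include <stdint.h>

    uint8_t spi_transfer(uint8_t out);   /* full-duplex single-byte exchange */
    void    cs_assert(void);             /* pull Slave Select low            */
    void    cs_release(void);            /* release Slave Select             */

    static uint8_t spi_read_reg(uint8_t addr)
    {
        cs_assert();
        (void)spi_transfer(addr | 0x80u);     /* junk (or status) comes back here */
        uint8_t value = spi_transfer(0x00u);  /* dummy byte out, real data in     */
        cs_release();
        return value;
    }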

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
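For reference, the blocking copy in question is essentially the string-move instruction wrapped in a function; a sketch using GCC-style inline assembly on x86-64 (real memcpy implementations layer alignment handling and size-dependent strategies on top of this):

    /* Synchronous blocking copy via the x86 string-move instruction.
     * The CPU is busy for the entire duration of the copy. */
    #include <stddef.h>

    static void *copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        __asm__ volatile ("rep movsb"            /* copy RCX bytes from [RSI] to [RDI] */
                          : "+D"(dst), "+S"(src), "+c"(n)
                          :
                          : "memory");
        return ret;
    }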
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of network copy, the copy from the kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
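A heavily simplified sketch of that idea, with an entirely made-up driver API (net_rx_acquire(), net_rx_release(), and parse_packet() are illustrative names, not a real interface): the driver owns a pool of DMA-capable buffers, the NIC fills them directly, and the application processes each buffer in place and hands it back, so no bytes are ever copied between "driver memory" and "application memory".

    /* Hypothetical zero-copy receive path; the API names are illustrative. */
    #include <stddef.h>
    #include <stdint.h>

    struct net_buf {
        uint8_t *data;   /* points into a buffer the NIC wrote directly */
        size_t   len;
    };

    struct net_buf *net_rx_acquire(void);              /* borrow the next filled buffer  */
    void            net_rx_release(struct net_buf *);  /* return it to the driver's pool */
    void            parse_packet(const uint8_t *, size_t);

    void process_packets(void)
    {
        struct net_buf *buf;
        while ((buf = net_rx_acquire()) != NULL) {
            parse_packet(buf->data, buf->len);         /* work on the data in place      */
            net_rx_release(buf);                       /* no memcpy anywhere             */
        }
    }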
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

How the CPU finds the ISR and distinguishes between devices

I should first share what I know - and it is complete chaos. There are several different questions on the topic here, so please don't get irritated :).
1) To find an ISR, the CPU is provided with an interrupt number. On x86 machines (286/386 and above) there is an IVT with pointers to the ISRs in it, each entry 4 bytes in size, so we multiply the interrupt number by 4 to find the ISR. So here is the first bunch of questions: I am completely confused about the mechanism by which the CPU receives the interrupt. To raise an interrupt, the device first asserts its IRQ line - then what? Does the interrupt number travel "on the IRQ line" towards the CPU? I also read something about the device putting the ISR address on the data bus; what is that about? And what is the concept of devices overriding the ISR? Can somebody give me a few examples of devices where the CPU polls for interrupts, and where it finds the ISRs for them?
2) If two devices share an IRQ (which is very much possible), how does the CPU distinguish between them? What if both devices raise an interrupt of the same priority simultaneously? I gather that interrupts of the same type and of lower priority get masked - but how does this communication happen between the CPU and the device controller? I studied the role of the PIC and APIC for this problem, but could not understand it.
Thanks for reading.
Thank you very much for answering.
CPUs don't poll for interrupts, at least not in a software sense. With respect to software, interrupts are asynchronous events.
What happens is that hardware within the CPU recognizes the interrupt request, which is an electrical input on an interrupt line, and in response, sets aside the normal execution of events to respond to the interrupt. In most modern CPUs, what happens next is determined by a hardware handshake particular to the type of CPU, but most of them receive a number of some kind from the interrupting device. That number can be 8 bits or 32 or whatever, depending on the design of the CPU. The CPU then uses this interrupt number to index into the interrupt vector table, to find an address to begin execution of the interrupt service routine. Once that address is determined, (and the current execution context is safely saved to the stack) the CPU begins executing the ISR.
When two devices share an interrupt request line, they can cause different ISRs to run by returning a different interrupt number during that handshaking process. If you have enough vector numbers available, each interrupting device can use its own interrupt vector.
But two devices can even share an interrupt request line and an interrupt vector, provided that the shared ISR is clever enough to go back to all the possible sources of the given interrupt, and check status registers to see which device requested service.
A little more detail
Suppose you have a system composed of a CPU, an interrupt controller, and an interrupting device. In the old days these would have been separate physical devices, but now all three might even reside in the same chip; all the signals are still there inside the ceramic case, though. I'm going to use a PowerPC (PPC) CPU with an integrated interrupt controller, connected to a device on a PCI bus, as an example that should serve nicely.
Let's say the device is a serial port that's transmitting some data. A typical serial port driver will load a bunch of data into the device's FIFO, and the CPU can do regular work while the device does its thing. Typically these devices can be configured to generate an interrupt request when the device is running low on data to transmit, so that the device driver can come back and feed more into it.
The hardware logic in the device will expect a PCI bus interrupt acknowledge, at which point, a couple of things can happen. Some devices use 'autovectoring', which means that they rely on the interrupt controller to see to it that the correct service routine gets selected. Others will have a register, which the device driver will pre-program, that contains an interrupt vector that the device will place on the data bus in response to the interrupt acknowledge, for the interrupt controller to pick up.
A PCI bus has only four interrupt request lines, so our serial device will have to assert one of those. (It doesn't matter which at the moment; it's usually somewhat slot-dependent.) Next in line is the interrupt controller (e.g. PIC/APIC), which will decide whether to acknowledge the interrupt based on mask bits that have been set in its own registers. Assuming it acknowledges the interrupt, it then either obtains the vector from the interrupting device (via the data bus lines), or may, if so programmed, use a 'canned' value provided by the APIC's own device driver. So far, the CPU has been blissfully unaware of all these goings-on, but that's about to change.
Now it's time for the interrupt controller to get the attention of the CPU core. The CPU will have its own interrupt mask bit(s) that may cause it to just ignore the request from the PIC. Assuming that the CPU is ready to take interrupts, it's now time for the real action to start. The current instruction usually has to be retired before the ISR can begin, so with pipelined processors this is a little complicated, but suffice it to say that at some point in the instruction stream, the processor context is saved off to the stack and the hardware-determined ISR takes over.
Some CPU cores have multiple request lines, and can start the process of narrowing down which ISR runs via hardware logic that jumps the CPU instruction pointer to one of a handful of top-level handlers. The old 68K, and possibly others, did it that way. The PowerPC (and, I believe, the x86) has a single interrupt request input. The x86 itself behaves a bit like a PIC, and can obtain a vector from the external PIC(s), but the PowerPC just jumps to a fixed address, 0x00000500.
In the PPC, the code at 0x0500 is probably just going to immediately jump out to somewhere in memory where there's room enough for some serious decision making code, but it's still the interrupt service routine. That routine will first go to the PIC and obtain the vector, and also ask the PIC to stop asserting the interrupt request into the CPU core. Once the vector is known, the top level ISR can case out to a more specific handler that will service all the devices known to be using that vector. The vector specific handler then walks down the list of devices assigned to that vector, checking interrupt status bits in those devices, to see which ones need service.
When a device, like the hypothetical serial port, is found wanting service, the ISR for that device takes appropriate actions, for example, loading the next FIFO's worth of data out of an operating system buffer into the port's transmit FIFO. Some devices will automatically drop their interrupt request in response to being accessed, for example, writing a byte into the transmit FIFO might cause the serial port device to de-assert the request line. Other devices will require a special control register bit to be toggled, set, cleared, what-have-you, in order to drop the request. There are zillions of different I/O devices and no two of them ever seem to do it the same way, so it's hard to generalize, but that's usually the way of it.
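In code, the "walk down the list of devices assigned to that vector" step looks roughly like the sketch below; the device structure, register layout, and handler hooks are all made up for illustration.

    /* Sketch of a shared-vector ISR: poll each device's status register and
     * service only the ones actually requesting attention. Names are illustrative. */
    #include <stdint.h>

    struct shared_dev {
        volatile uint32_t *status;        /* device interrupt status register */
        uint32_t           pending_mask;  /* "I need service" bit(s)          */
        void             (*service)(struct shared_dev *dev);
    };

    /* devs is a NULL-terminated list of the devices sharing this vector. */
    void shared_vector_isr(struct shared_dev **devs)
    {
        for (struct shared_dev **d = devs; *d != NULL; ++d) {
            if (*(*d)->status & (*d)->pending_mask)
                (*d)->service(*d);        /* servicing also drops the request */
        }
    }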
Now, obviously there's more to say - what about interrupt priorities? what happens in a multi-core processor? What about nested interrupt controllers? But I've burned enough space on the server. Hope any of this helps.
I came across this question about 3 years later... hope I can help ;)
The Intel 8259A, or simply the "PIC", has 8 pins, IRQ0-IRQ7, and every pin connects to a single device.
Let's suppose you pressed a key on the keyboard. The voltage on the IRQ1 pin, which is connected to the keyboard, goes high. After the CPU gets interrupted and acknowledges the interrupt and so on, the PIC simply adds its base offset of 8 to the IRQ line number, so IRQ1 gives 1 + 8 = 9.
So the CPU loads its CS and IP from the 9th entry in the vector table, and because the IVT is an array of 4-byte entries it just multiplies the entry number by 4 ;)
    CPU.CS = IVT[9].CS
    CPU.IP = IVT[9].IP
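In C-like terms, and assuming real-mode x86 where the IVT starts at physical address 0 and each entry is a 4-byte offset:segment pair, that lookup amounts to the sketch below (the CPU does this in hardware, of course, not in C):

    /* Real-mode IVT lookup for vector 9 (IRQ1 + the PIC's base offset of 8).
     * Each entry is 4 bytes: a 16-bit offset (new IP) then a 16-bit segment (new CS). */
    #include <stdint.h>

    struct ivt_entry {
        uint16_t offset;    /* new IP */
        uint16_t segment;   /* new CS */
    };

    #define IVT ((volatile struct ivt_entry *)0x00000000)  /* IVT base = physical 0 */

    static void lookup_vector9(uint16_t *isr_cs, uint16_t *isr_ip)
    {
        *isr_ip = IVT[9].offset;
        *isr_cs = IVT[9].segment;
    }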
The ISR then deals with the device through the I/O ports ;)
Sorry for my bad English.

How does firmware communicate with electronic devices to perform its operations?

Almost all electronic devices come with firmware. I know it is stored in ROM (read-only memory), so it is non-volatile (no power source is required to hold the contents, unlike RAM, which loses them).
What I want to know is: how does the firmware communicate with the electronics to perform its operations?
Let's say there is a small roller. On the press of a button, how does the firmware make it move?
Can someone please explain what is residing behind this to make it happen?
I think it may require a brief explanation to unwind it.
Also, what is the most popular language used for coding firmware?
Modern hardware like you're describing has a program stored in ROM and an all-purpose microcomputer (CPU) executing that program.
The CPU reads information from ROM by setting up addresses on its address bus and then asking the ROM to tell it the value stored at that location. There's something like a read pulse being raised (on a separate line) to tell the ROM to make the value accessible on the lines of the data bus. That, in a nutshell, is reading.
To get the hardware to do something, the CPU basically executes a kind of write operation. It puts a value, which is just a bunch of bits if you want to look at it that way, on the address bus to select a certain device and perhaps function on that device, then it raises another signal line saying "write!" The device that recognizes its address on the address bus responds to that signal by accepting the data from the data bus and then performing whatever its function is. Typically, one of the data bus bits will be connected within the output device to a power output stage, i.e. a transistor stronger than the ones used just for computation, and that transistor will connect some electrical device to current sufficient to make it move/glow/whatever.
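In C running on such a CPU, that whole address/data/write-strobe dance is hidden behind a plain pointer access to a memory-mapped register. A minimal sketch; the address, register name, and bit below are made up for illustration:

    /* Writing to a memory-mapped output register (hypothetical address and bit).
     * 'volatile' stops the compiler from optimizing the accesses away, because
     * this "memory" is really a hardware device. */
    #include <stdint.h>

    #define MOTOR_CTRL   (*(volatile uint32_t *)0x40020000u)
    #define MOTOR_ON_BIT (1u << 0)

    static void motor_start(void) { MOTOR_CTRL |=  MOTOR_ON_BIT; }
    static void motor_stop(void)  { MOTOR_CTRL &= ~MOTOR_ON_BIT; }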
Tiny, cheap devices are coded in assembly language to save costs on ROM; in industrial quantities, even small amounts of memory can affect the price. The assembly language is specific to the CPU; chips called "8051", "6502" and "Atmel (something or other)" are popular. Bigger devices with more complex requirements may have their firmware written in C or a C-like dialect, which makes programming a little easier than assembler. The biggest ones even run C++ code. Compiled, of course.
In most systems there are special memory addresses which are used for I/O. Reading and writing on such addresses executes some function instead of just moving data around. In x86 systems there are also special I/O instructions IN and OUT for that.
The simplest case is general-purpose I/O (GPIO), where you can read or write data directly from/to external electrical pins on the device. There are several memory addresses, called registers, where you can read data from the port (voltage near 0 = 0, near supply voltage = 1), where you can write data to the port, and where you can define whether a particular pin is an input (the corresponding bit is typically 0) or an output (the bit is 1). Every microcontroller has GPIO.
So in your example the button could be connected to a pin set to input, which the software could sense. It would typically do this every 10ms and only react if it has a stable value for several reads, this is called debouncing. Then it would write a 1 to some output, which via some transistor for amplification could drive a motor. If it senses that you release the switch it could turn the motor off again by writing a 0. And so on, this program would run until you turn the device off.
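A concrete (and simplified) sketch of that loop on a classic AVR-style part, with the button on PB0 and the motor-driver transistor on PB1; the pin assignment, the 1 MHz clock, and the crude counting debounce (in place of a proper timer-driven one) are all assumptions, and the register details differ slightly on the reduced-core ATtiny4/5/9/10 family mentioned below:

    /* Button on PB0 (input with pull-up), motor driver on PB1 (output).
     * Sample every ~10 ms and only act once the reading has been stable a while. */
    #define F_CPU 1000000UL          /* assume the 1 MHz factory-default clock */
    #include <avr/io.h>
    #include <util/delay.h>
    #include <stdint.h>

    int main(void)
    {
        DDRB  &= ~(1u << PB0);       /* PB0 as input                   */
        PORTB |=  (1u << PB0);       /* enable its internal pull-up    */
        DDRB  |=  (1u << PB1);       /* PB1 as output (motor off = 0)  */

        uint8_t stable = 0, last = 1;          /* with a pull-up, 1 = not pressed */
        for (;;) {
            uint8_t raw = (PINB >> PB0) & 1u;
            if (raw == last) { if (stable < 255) stable++; } else stable = 0;
            last = raw;

            if (stable >= 5) {                 /* ~50 ms of identical samples */
                if (raw == 0)  PORTB |=  (1u << PB1);   /* pressed: motor on   */
                else           PORTB &= ~(1u << PB1);   /* released: motor off */
            }
            _delay_ms(10);
        }
    }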
There are lots of other I/O devices for other purposes with typically hundreds of registers for controlling them. If you want to see more you could look into the data sheet of some microcontroller. For example, here is the data sheet of ATtiny4/5/9/10, a very small controller from the Atmel AVR family.
Today most firmware is written in C, except for the smallest devices and for a little special code for handling resets and interrupts, which is written in assembly language.