What does it mean when my CPU doesn't support unaligned memory access?

I just discovered that the ARM I'm writing code on (Cortex M0), doesn't support unaligned memory access.
Now in my code I use a lot of packed structures, and I never got any warnings or hardfaults, so how can the Cortex access members of these structures when it doesn't allow unaligned access?

Compilers such as gcc understand alignment and will issue the correct instructions to get around alignment issues. If you have a packed structure, you will have told the compiler about it, so it knows ahead of time how to handle alignment.
Let's say you're on a 32 bit architecture but have a struct that is packed like this:
struct __attribute__((packed)) foo {
    unsigned char bar;
    int baz;
};
When an access to baz is made, it will do the memory loads on a 32 bit boundary, and shift all the bits into position.
In this case it will probably do a 32 bit load at the address of bar and a 32 bit load at the address of bar + 4. Then it will apply a sequence of logical operations such as shift and logical or/and to end up with the correct value of baz in a 32 bit register. (Depending on the compiler and target, it may instead do four single-byte loads and combine them the same way.)
Have a look at the assembly output to see how this works. You'll notice that unaligned accesses will be less efficient than aligned accesses on these architectures.
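As a rough illustration of the kind of sequence a compiler can generate for the packed member, here is a minimal sketch in C. The helper names load_u32_le and load_u32_memcpy are made up for this example, int32_t stands in for int so the size is fixed, and a little-endian layout is assumed for the byte-wise version. It is not the exact code gcc emits for the Cortex M0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Packed layout from the answer: baz starts at offset 1, so it is unaligned.
 * int32_t is used instead of int so the size is the same on every host. */
struct __attribute__((packed)) foo {
    unsigned char bar;
    int32_t baz;
};

/* Hypothetical helper: assemble a little-endian 32-bit value from four
 * single-byte loads. Byte loads are always aligned, so this works at any
 * address, even on a CPU that faults on unaligned word loads. */
uint32_t load_u32_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: the portable way to do the same in source code.
 * memcpy leaves the choice of a safe load sequence to the compiler and
 * uses the host's native byte order. */
uint32_t load_u32_memcpy(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Calling load_u32_le on a buffer at an odd offset returns the same value an aligned load would, only with more instructions, which is the cost mentioned above.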

On many older 8-bit microprocessors, there were instructions to load (and store) registers which were larger than the width of the memory bus. Such an operation would be performed by loading half of the register from one address, and the other half from the next higher address. Even on systems where the memory bus is wider than 8 bits (say, 16 bits) it is often useful to regard memory as being an addressable collection of bytes. Loading a byte from any address will cause the processor to read half of a 16-bit memory location and ignore the other half. Reading a 16-bit value from an even address will cause the processor to read an entire 16-bit memory location and use the whole thing; the value will be the same as if one read two consecutive byte addresses and concatenated the result, but it will be read in one operation rather than two.
On some such systems, if one attempts to read a 16-bit value from an odd address, the processor will read two consecutive memory locations, using half of one value and half of the other, as though one had performed two single-byte reads and combined the results. This is called an unaligned memory access. On other systems, such an operation will result in a bus fault, which generally triggers some form of interrupt which may or may not be able to do something useful about it. Hardware to support unaligned accesses is rather complicated, and designing code to avoid unaligned accesses is generally not overly difficult. Thus, such hardware generally only exists either on processors that are already very complicated, or on processors which will be running code that was designed for processors that would assemble multi-byte registers from single-byte reads (e.g. on the 8088, every 16-bit read required two 8-bit memory fetches, and a lot of 8088 code was run on later Intel processors).

Related

Is there any connection between the segments created in memory by a microprocessor and the memory structure of a process in an operating system?

In the 8086 microprocessor, we divide the memory into segments of 64K each because of the 16 bit registers (since a 20 bit address cannot be stored in a 16 bit register). These segments are categorized as code segment, data segment, stack segment and extra segment. This structure is similar to the one created by a process in an operating system. Does that mean each process takes up memory equivalent to 4 segments, which would be 4*64K in the case of the 8086? And if this is true, then by doing some more math we can say that only 4 processes can be handled by an 8086 microprocessor at a time (i.e. one of the processes will be in the running state and the others will be in the blocked or ready state), since a maximum of 16 segments is possible (total memory size / size of each segment = 1MB/64K = 16).
I have just started studying this and saw this equivalence between processes and segments. Does any such connection between the segments of the memory and the memory structure of the process actually exist, or is it just my crazy imagination?

A little history helps. Early UNIX(tm) ran on the Digital PDP minicomputer family. The first circulated versions were V6 & V7, which were exclusive to the PDP-11 family. That family could support a whopping 256K of RAM, but the gp register set (used for address formation) was 16 bits wide. There was a limited memory protection scheme in the processor, which permitted the kernel (supervisor) to have a separate address space from user (user), and instructions (addresses generated by pc) to be separate from data (generated by other means). This will probably get edited into the dust by PDP-11 fanbois.
At around this time, Intel was rolling out what was to become the 8086. Contemporary 8-bit CPUs were already straining at a 64K address space limitation, and were using a concept called bank switching to increase that. In bank switching, some sub-ranges of the 64K address space could be re-pointed into a larger memory bank, so with care you could address much more memory. The Hitachi 64180 was one of the CPUs that incorporated this into its silicon; most used external memory controllers.
The 8086 addressing scheme was an amalgamation of these notions. You could produce an operating system which supported dynamically relocated processes and shared text with up to 64K instruction + 64K data. The general idea was that you take the segment registers out of the programming model, so if the OS has to relocate the process, it knows that the process has no saved copy of the old segment value. The commercial OS QNX 1.x, 2.x provided this as a model, the latter using the 286 extensions to protect against programs that played with the segment registers.
For programs that didn't care about such subtleties (Lotus 123, ...), you could use the segment registers to effectively create a 2^20 address space on the 8086. It is an ugly programming model in this mode because address formation is A=Seg*16+Base, so Seg=1,Base=0 and Seg=0,Base=16 resolve to the same address.
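The address formation above can be sketched in a few lines of C. The function names phys_addr and check_aliasing are invented for this illustration; the 20-bit mask models the wrap-around of the original 8086 address bus.

```c
#include <assert.h>
#include <stdint.h>

/* 8086 real-mode address formation: physical = segment * 16 + offset.
 * The result is truncated to 20 bits, matching the 8086's 20 address lines. */
uint32_t phys_addr(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFu;
}

/* Self-check mirroring the example in the text: two different
 * segment:offset pairs name the same physical byte. */
void check_aliasing(void)
{
    assert(phys_addr(1, 0) == phys_addr(0, 16));
}
```

Because many segment:offset pairs alias the same byte, comparing two far pointers for equality requires normalizing them first, which is part of why the model is considered ugly.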
So, you aren't hallucinating, it was quite intentional, if more than a little half-arsed.

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations which are platform dependent.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an mcu with 256KB FLASH memory and 64KB RAM.
How does it know how much available RAM there is from my program?

For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc in itself could be implemented in different ways indeed. Either you include a "header" together with each allocated segment, the header stating the allocated size and potentially the address of the next available free segment. Or you could implement it as a look-up table where each item is a pointer to the first element and the size.
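To make the "header" variant concrete, here is a minimal first-fit sketch in C over a statically reserved array. Everything in it (toy_malloc, toy_free, HEAP_SIZE, block_header) is hypothetical and chosen for illustration; a real bare metal toolchain would take the heap bounds from linker script symbols, and real allocators also merge adjacent free blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE 1024u  /* assumed size; normally fixed by the linker script */

/* Header stored in front of every block, allocated or free. */
typedef struct block_header {
    size_t size;                /* payload size in bytes */
    int is_free;                /* 1 if the block can be handed out */
    struct block_header *next;  /* next block in address order */
} block_header;

/* Statically reserved heap region, aligned for any header or payload. */
static _Alignas(max_align_t) unsigned char heap[HEAP_SIZE];
static block_header *first_block = NULL;

/* Round n up to a multiple of the strictest fundamental alignment. */
static size_t align_up(size_t n)
{
    size_t a = _Alignof(max_align_t);
    return (n + a - 1) / a * a;
}

void *toy_malloc(size_t n)
{
    size_t hdr = align_up(sizeof(block_header));

    if (first_block == NULL) {  /* lazy init: one big free block */
        first_block = (block_header *)heap;
        first_block->size = HEAP_SIZE - hdr;
        first_block->is_free = 1;
        first_block->next = NULL;
    }
    if (n == 0 || n > HEAP_SIZE)
        return NULL;
    n = align_up(n);

    /* First fit: walk the headers until a free block is large enough. */
    for (block_header *b = first_block; b != NULL; b = b->next) {
        if (!b->is_free || b->size < n)
            continue;
        /* Split only if the remainder can hold a header plus some payload. */
        if (b->size >= n + hdr + _Alignof(max_align_t)) {
            block_header *rest = (block_header *)((unsigned char *)b + hdr + n);
            rest->size = b->size - n - hdr;
            rest->is_free = 1;
            rest->next = b->next;
            b->size = n;
            b->next = rest;
        }
        b->is_free = 0;
        return (unsigned char *)b + hdr;
    }
    return NULL;  /* the fixed-size heap is exhausted */
}

void toy_free(void *p)
{
    if (p == NULL)
        return;
    assert((unsigned char *)p > heap && (unsigned char *)p < heap + HEAP_SIZE);
    /* The header sits directly in front of the payload. */
    block_header *b =
        (block_header *)((unsigned char *)p - align_up(sizeof(block_header)));
    b->is_free = 1;  /* no coalescing of neighbours in this sketch */
}
```

Note that the allocator never asks anyone how much RAM exists: it only knows the fixed region it was given, and returns NULL once that region cannot satisfy a request.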
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason is that it doesn't make any sense. You don't want arbitrary behavior, you want deterministic behavior. You want to allocate x amount of memory for the worst case, and if a heap were to be used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.

How to classify microprocessor

Hi, I am new to embedded systems. I do not know the true reason we classify microprocessors as 8 bit, 16 bit or 32 bit.
A document I read explained that it is because of the number of bits used to number the address of a register. But I think that is not true, because if we needed 32 bits to number the register addresses of a processor, we would have to have up to 2^32 registers. That seems like nonsense; it is far too many registers. So I think maybe it depends on the size of a register, or maybe the size of the bus, or the number of bits the microprocessor can work with at a time.
Please help me to clarify this issue.

It is clear that you have either misunderstood your reference, or it is poorly worded. It should presumably state:
... number of the bits used for an address register
This means that the address range of the processor is then 2^n, so perhaps your reference is referring to memory locations rather than registers.
i.e. it refers to the bit-width of a register, not the enumeration of a register.
However, I would suggest that data path width is the more common and useful measure of processor architecture by "bit-width". For example, 8-bit processors commonly have 16 bit address buses and 16 bit address registers. And 16-bit 8086 devices use two 16 bit registers (32 bits) to represent a 20-bit address, but the 8086 is neither a 20-bit nor a 32-bit processor. 32 and 64 bit processors tend to have equal address and data register widths, which may be the cause of this erroneous statement.
As described here, the natural size of an integer (i.e. the integer size that single machine instructions take as operands) is the usual method of classification in this context.

It isn't the address of a register but the width of the register.

Can a 32-bit processor load a 64-bit memory address using multiple blocks or registers?

I was doing a little reading on 32-bit microprocessors and I have learnt that:
1) A 32-bit microprocessor can only address 2^32 bytes of memory, which means that the memory pointer size should not exceed the 32-bit range, i.e. the pointer size should be equal to or less than 32 bits.
2) I also came to know that a CPU allocates multiple blocks of memory for things like storing numbers and text; that is up to the program and not related to the size of each address (Source: here). So is it possible that a CPU can use multiple blocks (registers) to store pointers more than 32 bits in size?

Processors can access an essentially unlimited amount of memory by using variations on a technique called bank switching. In a simple bank-switching scheme, the memory chips that are wired to a portion of the address space will have some address inputs fed by the processor and some from an external latching device. Historically, the IBM PC had a 1MB address space, but an expanded memory board would, IIRC, allow four 16KB regions of that space to be mapped to any of dozens or hundreds of 16KB blocks of memory contained thereon. Nowadays processors generally have a memory-management unit built in, which maps 4KB or 64KB blocks of memory to any address within a much larger space, and additional circuitry may, with OS support, expand things further.
The big difficulty with bank switching is that any given address might identify many different places in memory depending upon how the bank-switching hardware is configured, so accessing data from memories in a banked region will generally be more complicated than accessing data in directly-accessible memory and will only be possible from code which knows how the bank-switching hardware works. Nowadays it's more common to simply use a processor which can access all the memory one needs, but historically bank-switching was often a useful technique for going beyond processor limitations.
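A simple bank-switching scheme can be modelled in C as a latch that selects which block of a larger memory appears in a fixed window. This is only a software model under made-up parameters (WINDOW_SIZE, NUM_BANKS, and the function names are all hypothetical), not the register interface of any particular board.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW_SIZE 0x4000u /* one 16KB window in the CPU address space */
#define NUM_BANKS   8u      /* hypothetical number of banks on the board */

static uint8_t banked_ram[NUM_BANKS][WINDOW_SIZE]; /* memory behind the window */
static uint8_t bank_latch;                         /* external latch: selected bank */

/* Writing the latch changes which bank the window shows. */
void select_bank(uint8_t bank)
{
    assert(bank < NUM_BANKS);
    bank_latch = bank;
}

/* CPU-visible accesses: offset is the address within the window. */
void window_write(uint16_t offset, uint8_t value)
{
    assert(offset < WINDOW_SIZE);
    banked_ram[bank_latch][offset] = value;
}

uint8_t window_read(uint16_t offset)
{
    assert(offset < WINDOW_SIZE);
    return banked_ram[bank_latch][offset];
}
```

The same offset reads back different data depending on the latch, which is exactly the difficulty described above: an address alone no longer identifies a location, so every access has to be paired with knowledge of the current bank.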

You could store a 64 bit pointer using 2 separate locations in memory. But it probably wouldn't be useful since your processor can only use 32 bit pointers.

Physical Memory and Virtual Memory data allocation behavior

I'm interested in understanding how a computer allocates variables in physical memory versus files in virtual memory (such as on a hard drive), in terms of how the computer determines where to put data. It almost seems random in both storage types, but it can't be, because it simply can't put data at a memory address or sector (any location) of a hard drive that's already occupied or allocated for another process. When I was studying Norton's Speed Disk (a program that defragments files on hard drives) on my old W95 system, I noticed from the program's representation of the hard drive's data (a color-coded visual map of different data types, e.g. swap files were always first at the top) that it consisted of many files spread out all over the hard drive with empty unused areas in between. In addition, some of these areas showed a spotty pattern of what looked like a mix of data and empty space. I want to think it's random for that to happen. Likewise, when I was studying the memory addresses of a simple program I wrote in C, I noticed that each version of my program, recompiled after changes, showed different addresses for segments and offsets. I was expecting the computer to use the same address when I recompiled it. Sometimes the same address would be used, other times it was different. Again, I want to think memory locations are also chosen randomly by programs. I thought that memory allocation or file writing was based on the first empty space available, written in a contiguous manner.
So my question is: what is it in the logic of a common computer that decides where it writes its data in such an arbitrary manner for either type of location (physical RAM or disk)? What area of computer science (if not assembly language) would I need to study that would explain this almost random behavior?
Thanks in Advance

Something broader and directly from computer science would be a linked list. http://en.wikipedia.org/wiki/Linked_list
Imagine you had a linked list and simply added items to the end; these items might live linearly in memory or on disk or wherever. But as you remove items from the middle of the list, say by having item number 7 point at item number 9 to eliminate item number 8, holes appear in that storage. As with memory allocation for allocs, virtual memory, hard drive sector allocation, etc., how fast you fragment your storage depends on the algorithm you use for allocating the next item.
File systems can and do use a linked list type scheme to keep track of which sectors are tied to a single file. It is fast and easy to use the linked list, but you have to deal with fragmentation. A much slower method would be to have no fragmentation but be constantly copying/moving files around to keep them on linear sectors.
malloc() allocation schemes and MMU allocation schemes also fall under this category. This applies basically any time you take something, slice it up into fractions and put a virtual interface in front of those fractions to give the appearance to the programmer/user that they are linear. malloc() (not counting the virtual memory via the MMU) is the other way around: it allocates a number of linear chunks of those fractions to meet the alloc need, and has an alloc/free scheme that attempts to keep as many large chunks available as possible, just in case. A bad malloc system is one where you have half of your memory free but the maximum malloc that works without an out-of-memory error is a small fraction of that memory; say you have a gig free and can only allocate 4096 bytes.
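The "lots of memory free, but only a small malloc succeeds" situation can be measured with a small sketch in C. It assumes a made-up representation in which storage is an array of equal-sized slots, 0 meaning free and nonzero meaning used; total_free and largest_free_run are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Count all free slots in a map where 0 = free and nonzero = used. */
size_t total_free(const unsigned char *map, size_t n)
{
    size_t count = 0;
    assert(map != NULL);
    for (size_t i = 0; i < n; i++)
        if (map[i] == 0)
            count++;
    return count;
}

/* Length of the longest run of contiguous free slots. A request larger
 * than this fails in a linear allocator, no matter how much is free in total. */
size_t largest_free_run(const unsigned char *map, size_t n)
{
    size_t best = 0, run = 0;
    assert(map != NULL);
    for (size_t i = 0; i < n; i++) {
        run = (map[i] == 0) ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}
```

For a map that alternates used and free slots, half of the storage is free but the largest allocation that can succeed is a single slot.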

You should look at virtual memory and TLB (translation lookaside buffer) or paging.
It is not trivial to implement virtual memory and paging. The performance of your whole system depends on it. If it's not done properly your system will thrash.
It is early morning here so Wikipedia will have to do for now: https://en.m.wikipedia.org/wiki/Translation_lookaside_buffer
EDIT:
Those coloured spots you saw in your defrag were chunks on your HDD. Each chunk is of some specified size. Depending on how fragmented your HDD is, you might have portions of your HDD that look like this:
*-*-***-***-*
where * means full, and - means empty
This (above) could be part of one application/file or multiple files; I will assume one file is split across those to simplify my example. At the end of each * there is a pointer to the next location where the next * chunk is (this is called a linked list). The more fragmented your HDD is (or memory) the more of these pointers to next chunk you will have. This in turn uses more space for next pointers instead of using space for data and the result is more overhead when reading that data. If this is a file on disk, you will have multiple seeks (which are bad because they're slow) if your data is not grouped together (locality principle). When you use defrag, it moves and groups all chunks together (as best as it can).
*-*-***-***-*
becomes
*********----
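The before and after pictures can be reproduced with a tiny compaction routine in C. This is only a toy model of the idea, using the same '*' and '-' notation; the function name compact is made up, and a real defragmenter also has to preserve the order of each file's chunks and update the pointers between them.

```c
#include <assert.h>
#include <string.h>

/* Move every used chunk ('*') to the front of the map, leaving all free
 * chunks ('-') together at the end. Works in place on a writable string. */
void compact(char *disk)
{
    assert(disk != NULL);
    size_t n = strlen(disk);
    size_t write = 0;

    /* Copy used chunks forward; write never overtakes read. */
    for (size_t read = 0; read < n; read++)
        if (disk[read] == '*')
            disk[write++] = '*';

    /* Everything after the last used chunk is now free space. */
    while (write < n)
        disk[write++] = '-';
}
```

After compaction the file can be read with one seek followed by a sequential read, instead of one seek per fragment.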
The OS decides paging and virtual memory addressing (and such). The TLB is a piece of hardware (a cache) that aids this process: it caches recent translations from virtual memory addresses to physical memory addresses for fast lookup. The TLB is part of the MMU, which sits between the CPU and memory.
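As a rough sketch of what address translation with a TLB involves, the following C code models a tiny page table with a two-entry, direct-mapped TLB in front of it. All sizes, the page table contents, and the names (translate, tlb_hits, tlb_misses) are assumptions made for the example; real MMUs do this in hardware with multi-level tables.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define NUM_VPAGES  8u  /* hypothetical tiny virtual address space */
#define TLB_ENTRIES 2u

/* Page table set up by the "OS": virtual page number -> physical frame number. */
static const uint32_t page_table[NUM_VPAGES] = {5, 2, 7, 0, 3, 6, 1, 4};

struct tlb_entry {
    int valid;
    uint32_t vpn; /* virtual page number */
    uint32_t pfn; /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];
unsigned tlb_hits, tlb_misses;

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr / PAGE_SIZE;
    uint32_t offset = vaddr % PAGE_SIZE;
    assert(vpn < NUM_VPAGES);

    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES]; /* direct-mapped slot */
    if (e->valid && e->vpn == vpn) {
        tlb_hits++;   /* fast path: translation was cached */
    } else {
        tlb_misses++; /* slow path: look up the page table, then cache it */
        e->valid = 1;
        e->vpn = vpn;
        e->pfn = page_table[vpn];
    }
    return e->pfn * PAGE_SIZE + offset;
}
```

Two accesses to the same page cost one page table lookup and one TLB hit, which is why locality of reference matters so much for performance.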
To answer your questions
You should study operating systems.
Yes, the locations where your files are placed on the HDD are decided by the OS. If you deleted a file and downloaded it again, there is no guarantee it would be placed in the same location; most likely it would not.
A nice summary of how all these components and principles I mentioned here work: Click Here. It's a ppt with slides from a Real Time Operating Systems book (if I'm not mistaken, the same exact one I used).
