STM32F3DISCOVERY microcontroller (STM32F303VC) memory size - embedded

The STM32F3DISCOVERY board's data brief indicates it features:
STM32F303VCT6 microcontroller featuring 256‑Kbyte Flash memory and 48‑Kbyte RAM in an LQFP100 package
However, the reference manual (RM0316), which covers the STM32F303x6 among others, indicates only 16 Kbytes of SRAM (section 3.3) and 64 Kbytes of Flash memory (section 4.1) for the STM32F303x6. The 256 and 48 Kbyte values match the STM32F303xB/C, which is also what the board's data brief links to under "Target STM32" in Table 1, even though it says "STM32F303VCT6".
I don't understand why there appears to be a discrepancy. Am I missing or misunderstanding something?

The latest version on the ST website seems right: Reference Manual.
For the STM32F303VC:
STM32F303xB/C and STM32F358xC devices feature up to 48 Kbytes of static SRAM.
For the STM32F303x6:
STM32F303x6/8 and STM32F328x8 devices feature the same memory but only up to 16 Kbytes of static SRAM
Maybe you have an old copy with a typo?

My problem was that I did not realize x was a single-letter substitution. I thought that in STM32F303x6, the x replaced VCT. It does not: STM32F303VCT6 corresponds to STM32F303xC (the x replaces V), and T6 identifies a specific product in this line of chips.
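For reference, the suffix of STM32F303VCT6 decodes roughly like this under ST's general part-numbering scheme (worth double-checking against ST's own nomenclature notes):

STM32 F303 V C T 6
           V = pin count (100 pins)
           C = flash size (256 Kbytes)
           T = package (LQFP)
           6 = temperature range (-40 to 85 °C)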

Related

How to demonstrate a memory misalignment error in C on a macbook pro (Intel 64 bit processor)

In an attempt to understand C memory alignment or whatever the term is (data structure alignment?), I'm trying to write code that results in an alignment error. The original reason that brought me to learning about this is that I'm writing data parsing code that reads binary data received over the network. The data contains some uint32s, uint64s, floats, and doubles, and I'd like to make sure they are never corrupted due to errors in my parsing code.
An unsuccessful attempt at causing some problem due to misalignment:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    uint32_t integer = 1027;
    uint8_t *pointer = (uint8_t *)&integer;
    uint8_t *bytes = malloc(5);
    bytes[0] = 23; // extra byte to misalign the uint32_t data
    bytes[1] = pointer[0];
    bytes[2] = pointer[1];
    bytes[3] = pointer[2];
    bytes[4] = pointer[3];
    // deliberately misaligned (and, strictly speaking, undefined) read
    uint32_t integer2 = *(uint32_t *)(bytes + 1);
    printf("integer: %u\ninteger2: %u\n", integer, integer2);
    free(bytes);
    return 0;
}
On my machine both integers print out the same. (MacBook Pro with a 64-bit Intel processor; I'm not sure what exactly determines alignment behaviour: the architecture? the exact CPU model? or maybe the compiler? I use Xcode, so clang.)
I guess my processor/machine/setup supports unaligned reads, so it takes the above code without any problems.
What would be a case where parsing of, say, a uint32_t would fail because of code not taking alignment into account? Is there a way to make it fail on a modern 64-bit Intel system? Or am I safe from alignment errors when using simple datatypes like integers and floats (no structs)?
Edit: If anyone's reading this later, I found a similar question with interesting info: Mis-aligned pointers on x86
Normally, the x86 architecture doesn't have alignment requirements (except for some SIMD instructions like movdqa).
However, since you're trying to write code to cause such an exception ...
There is an alignment check exception bit that can be set in the x86 flags register. If you turn it on, an unaligned access will generate an exception which will show up (under Linux at least) as a bus error (i.e. SIGBUS)
See my answer here: any way to stop unaligned access from c++ standard library on x86_64? for details and some sample programs to generate an exception.
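For the record, here is a minimal sketch of that technique for Linux/x86-64 (assumptions: GCC or clang inline asm, and an OS that sets CR0.AM, which Linux does; note that once AC is set, the compiler or libc may themselves fault on legal-but-unaligned SIMD accesses, so build with optimizations off):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    // Set the AC (alignment check) bit, bit 18 of RFLAGS
    __asm__ volatile ("pushfq\n\t"
                      "orq $0x40000, (%%rsp)\n\t"
                      "popfq" ::: "cc", "memory");

    uint8_t *bytes = malloc(8);
    uint32_t value = 1027;
    memcpy(bytes + 1, &value, sizeof value); // byte-wise copy: always safe

    // Misaligned dereference: with AC set, this dies with SIGBUS
    uint32_t readback = *(uint32_t *)(bytes + 1);
    printf("%u\n", readback);
    return 0;
}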

NASM Prefetching

I ran across the below instructions in the NASM documentation, but I can't quite make heads or tails of them. Sadly, the Intel documentation on these instructions is also somewhat lacking.
PREFETCHNTA m8 ; 0F 18 /0 [KATMAI]
PREFETCHT0 m8 ; 0F 18 /1 [KATMAI]
PREFETCHT1 m8 ; 0F 18 /2 [KATMAI]
PREFETCHT2 m8 ; 0F 18 /3 [KATMAI]
Could anyone possibly provide a concise example of the instructions, say to cache 256 bytes at a given address? Thanks in advance!
These instructions are hints used to suggest that the CPU try to prefetch a cache line into the cache. Because they're hints, a CPU can ignore them completely.
If the CPU does support them, then the CPU will try to prefetch but will give up (and won't prefetch) if a TLB miss would be involved. This is where most people get it wrong (e.g. fail to do "preloading", where you insert a dummy read to force a TLB load so that prefetching isn't prevented from working).
The amount of data prefetched is 32 bytes or more, depending on the CPU. You can use CPUID to determine the actual size (CPUID function 0x00000004: the "System Coherency Line Size" is returned in EBX bits 0 to 11, as the line size minus one).
If you prefetch too late it doesn't help, and if you prefetch too early the data can be evicted from the cache before it's used (which also doesn't help). There's an appendix in Intel's "IA-32 Intel Architecture Optimisation Reference Manual" that describes how to calculate when to prefetch, called "Mathematics of Prefetch Scheduling Distance" that you should probably read.
Also don't forget that prefetching can decrease performance (e.g. cause data that is needed to be evicted to make room) and that if you don't prefetch anything the CPU has a hardware prefetcher that will probably do it for you anyway. You should probably also read about how this hardware prefetcher works (and when it doesn't). For example, for sequential reads (e.g. memcmp()) the hardware prefetcher does it for you and using explicit prefetches is mostly a waste of time. It's probably only worth bothering with explicit prefetches for "random" (non-sequential) accesses that the CPU's hardware prefetcher can't/won't predict.
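If it helps, that CPUID query can be done from C with GCC/clang's <cpuid.h> (a sketch; leaf 4 takes a subleaf in ECX, and EBX bits 11:0 hold the line size minus one):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    // CPUID leaf 4, subleaf 0: deterministic cache parameters
    if (__get_cpuid_count(4, 0, &eax, &ebx, &ecx, &edx))
        printf("system coherency line size: %u bytes\n", (ebx & 0xFFF) + 1);
    return 0;
}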
After sifting through some examples of heavily-optimized memcmp functions and the like, I've figured out how to use these instructions (somewhat) effectively.
These instructions prefetch whole cache lines, at minimum 32 bytes, something I missed originally. Thus, to cache a 256 byte buffer, the following sequence could be used:
prefetcht1 [buffer]
prefetcht1 [buffer+32]
prefetcht1 [buffer+64]
prefetcht1 [buffer+96]
prefetcht1 [buffer+128]
prefetcht1 [buffer+160]
prefetcht1 [buffer+192]
prefetcht1 [buffer+224]
The t0 hint instructs the CPU to prefetch the line into every level of the cache hierarchy.
t1 fetches into the second-level cache and higher, leaving L1 alone.
t2 continues this trend, fetching into the third level and higher.
The nta hint is the odd one out: it prefetches the data as "non-temporal", minimizing cache pollution (exactly which cache level it lands in varies by CPU), as opposed to filling the whole hierarchy. This can actually be quite useful in the case of incredibly large data structures, as cache pollution is avoided and more relevant data can instead stay cached.

Time each CPU core spends in C0 power state

Any help figuring out how to do this would be great: How much time each CPU core spent in the C0 power state over the past second.
This is for a Mac app, so Objective-C, Cocoa and C are needed.
OS X doesn't have any APIs that expose the C-state of the CPU. However, it seems like you can do this using the MWAIT/MONITOR instructions on Intel CPUs. Intel mentions that you can track C-state residency using this technique in section 14.4 of the reference manual:
Software should use CPUID to discover if a target processor supports the enumeration of MWAIT extensions. If CPUID.05H.ECX[Bit 0] = 1, the target processor supports MWAIT extensions and their enumeration (see Chapter 3, “Instruction Set Reference, A-M,” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).
If CPUID.05H.ECX[Bit 1] = 1, the target processor supports using interrupts as break-events for MWAIT, even when interrupts are disabled. Use this feature to measure C-state residency as follows:
Software can write to bit 0 in the MWAIT Extensions register (ECX) when issuing an MWAIT to enter into a processor-specific C-state or sub C-state.
When a processor comes out of an inactive C-state or sub C-state, software can read a timestamp before an interrupt service routine (ISR) is potentially executed.
You can find more info about the MWAIT instruction in the same manual. Good luck!
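Those CPUID checks are easy to run from C with GCC/clang's <cpuid.h> (a sketch; per the quote above, ECX bit 0 enumerates the MWAIT extensions and bit 1 the interrupt break-event support):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(5, &eax, &ebx, &ecx, &edx)) { // MONITOR/MWAIT leaf
        printf("MWAIT extensions enumerated: %s\n", (ecx & 1) ? "yes" : "no");
        printf("interrupt break-events:      %s\n", (ecx & 2) ? "yes" : "no");
    }
    return 0;
}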
Intel's Reference Manual
For getting C0 percentage you should do the following:
Read the following MSRs at the start and end points of the period you measure:
0x3FC (core c3), 0x3FD (core c6), 0x3FE (core c7), 0x10 (tsc)
Then do the following calculation:
Cx_ticks = (c3_after - c3_before) + (c6_after - c6_before) + (c7_after - c7_before)
total_ticks = (tsc_after - tsc_before)
Cx_percentage = Cx_ticks/total_ticks
C0_percentage = 100% - Cx_percentage
You can find more information in this document (go to Vol. 3C 35-95)
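Putting the above together, a sketch of the calculation in C. msr_read() here is a hypothetical placeholder for whatever privileged MSR access your platform provides (on Linux, a pread() on /dev/cpu/N/msr; stock OS X has no public interface, so a kernel extension would be needed):

#include <stdint.h>

#define MSR_CORE_C3_RESIDENCY 0x3FC
#define MSR_CORE_C6_RESIDENCY 0x3FD
#define MSR_CORE_C7_RESIDENCY 0x3FE
#define MSR_IA32_TSC          0x10

// Hypothetical platform-specific privileged MSR read for one core
uint64_t msr_read(int cpu, uint32_t msr);

// Call once for the "before" snapshot, then again (e.g. a second later)
// with the saved values to get the C0 residency over that interval.
double c0_percentage(int cpu, uint64_t c3_before, uint64_t c6_before,
                     uint64_t c7_before, uint64_t tsc_before)
{
    uint64_t cx_ticks = (msr_read(cpu, MSR_CORE_C3_RESIDENCY) - c3_before)
                      + (msr_read(cpu, MSR_CORE_C6_RESIDENCY) - c6_before)
                      + (msr_read(cpu, MSR_CORE_C7_RESIDENCY) - c7_before);
    uint64_t total_ticks = msr_read(cpu, MSR_IA32_TSC) - tsc_before;
    return 100.0 * (1.0 - (double)cx_ticks / (double)total_ticks);
}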

What is difference between MSP430 and MSP430X?

Any comparison table available?
The basic change for the 430X architecture was to introduce a 20-bit address range to allow addressing outside the 64K available on the original 430 devices. There is a new set of instructions that operate on 20-bit addresses in parallel with the old-style 16-bit instructions, e.g.
CALL ; takes a 16 bit address
CALLA ; takes a 20 bit address
PUSH ; Push the bottom 16 bits of a register onto the stack
PUSHA ; Push the full 20 bit register
The existing code compiled for a 430 based processor will run within the bottom 64K address space of the 430X processor. In the development tools (IAR and probably Rowley) you can specify a memory model so that the longer function calls and other 430X specific instructions are not generated if you ensure that your code does not cross the 64K boundary.
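GCC's MSP430 port exposes the same choice as a memory-model switch (a hedged example; check your toolchain's documentation for the exact options):

msp430-elf-gcc -msmall main.c   # stay within the lower 64K, plain CALL/PUSH
msp430-elf-gcc -mlarge main.c   # 20-bit model, emits CALLA/PUSHA where needed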
Wikipedia's usually good for this sort of thing. It looks like the point is to increase the address space from 64K on the regular 430 to 1 MB on the X.
http://en.wikipedia.org/wiki/MSP430#MSP430X_20-bit_extension
The MSP430X extension has a 20-bit address space, so CALLA takes a 20-bit address.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS, when we used floppy discs for all data transport, there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppy disc. When the user inserted the floppy disc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form". To battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and make sure that it was not infected.
Code to remove the virus from the harddrive if it was infected.
Code to duplicate the antivirus bootsector to another floppy if a special key was pressed.
Code to boot the harddrive if all was well and no infections were found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack exploiting how floating point values are stored, and it can execute faster (even doing a 1/result afterwards is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int *)&x;             // reinterpret the float's bits as an int
    i = 0x5f3759df - (i >> 1);      // now this is what you call a real magic number
    x = *(float *)&i;               // back to float
    x = x * (1.5f - xhalf * x * x); // one Newton-Raphson step to refine
    return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position divisible by 3, from a position that leaves remainder 1 when divided by 3, or from a position that leaves remainder 2; and the same three again, but reading the sequence in reverse.)
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" was a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 index registers (X and Y) are used to reference data and cannot be modified. Further, the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; and the stack pointer (S) is mapped to the graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I needed to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed: because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. The approach was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
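The trick reads like this sketch (the threshold N is compiler-specific, and the handlers and values here are purely illustrative):

int dispatch(int op)
{
    switch (op) {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    // padding cases that are never hit at runtime, added only to push
    // the case count past the compiler's jump-table threshold
    case 3: case 4: case 5: case 6: case 7: return 0;
    default: return -1;
    }
}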
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx (move with sign extension) is awesome for moving small signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values but you can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values: just rotate the registers by 16 bits to access the other values.
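The rotation trick looks roughly like this with GCC/clang inline assembly on x86 (a sketch, purely illustrative; the compiler normally handles register allocation for you):

#include <stdio.h>

// Keep two 16-bit values in one 32-bit register; a 16-bit rotate
// swaps which half sits in the directly-usable low position.
static inline unsigned swap_halves(unsigned reg)
{
    __asm__ ("roll $16, %0" : "+r"(reg));
    return reg;
}

int main(void)
{
    unsigned packed = (0xBEEFu << 16) | 0x1234u;
    printf("low half: %04x\n", packed & 0xFFFFu);
    packed = swap_halves(packed);
    printf("low half: %04x\n", packed & 0xFFFFu);
    return 0;
}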
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key wasn't the solution, but could not prove a key was the solution), which were then tested with more conventional code.
The FSG 2.0 packer, made by a Polish team, was designed specifically for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be almost as small as possible already), the loader it comes with is 158 bytes and fully functional. If you try packing an assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)