(STM32) Erasing flash and writing to flash gives HAL_FLASH_ERROR_PGP error (using HAL) - embedded

Trying to write to flash to store some configuration. I am using an STM32F446ze where I want to use the last 16kb sector as storage.
I specified VOLTAGE_RANGE_3 when I erased my sector. VOLTAGE_RANGE_3 is mapped to:
#define FLASH_VOLTAGE_RANGE_3 0x00000002U /*!< Device operating range: 2.7V to 3.6V */
I am getting an error when writing to flash when I use FLASH_TYPEPROGRAM_WORD. The error is HAL_FLASH_ERROR_PGP. Reading the reference manual I read that this has to do with using wrong parallelism/voltage levels.
From the reference manual I can read
Furthermore, in the reference manual I can read:
Programming errors
It is not allowed to program data to the Flash
memory that would cross the 128-bit row boundary. In such a case, the
write operation is not performed and a program alignment error flag
(PGAERR) is set in the FLASH_SR register. The write access type (byte,
half-word, word or double word) must correspond to the type of
parallelism chosen (x8, x16, x32 or x64). If not, the write operation
is not performed and a program parallelism error flag (PGPERR) is set
in the FLASH_SR register
So I thought:
I erased the sector in voltage range 3
That gives me 2.7 to 3.6v specification
That gives me x32 parallelism size
I should be able to write WORDs to flash.
But, this line give me an error (after unlocking the flash)
uint32_t sizeOfStorageType = ....; // Some uint I want to write to flash as test
HAL_StatusTypeDef flashStatus = HAL_FLASH_Program(TYPEPROGRAM_WORD, address++, (uint64_t) sizeOfStorageType);
auto err= HAL_FLASH_GetError(); // err == 4 == HAL_FLASH_ERROR_PGP: FLASH Programming Parallelism error flag
while (flashStatus != HAL_OK)
{
}
But when I start to write bytes instead, it goes fine.
uint8_t *arr = (uint8_t*) &sizeOfStorageType;
HAL_StatusTypeDef flashStatus;
for (uint8_t i=0; i<4; i++)
{
flashStatus = HAL_FLASH_Program(TYPEPROGRAM_BYTE, address++, (uint64_t) *(arr+i));
while (flashStatus != HAL_OK)
{
}
}
My questions:
Am I understanding it correctly that after erasing a sector, I can only write one TYPEPROGRAM? Thus, after erasing I can only write bytes, OR, half-words, OR, words, OR double words?
What am I missing / doing wrong in above context. Why can I only write bytes, while I erased with VOLTAGE_RANGE_3?

This looks like an data alignment error, but not the one related with 128-bit flash memory rows which is mentioned in the reference manual. That one is probably related with double word writes only, and is irrelevant in your case.
If you want to program 4 bytes at a time, your address needs to be word aligned, meaning that it needs to be divisible by 4. Also, address is not a uint32_t* (pointer), it's a raw uint32_t so address++ increments it by 1, not 4. As far as I know, Cortex M4 core converts unaligned accesses on the bus into multiple smaller size aligned accesses automatically, but this violates the flash parallelism rule.
BTW, it's perfectly valid to perform a mixture of byte, half-word and word writes as long as they are properly aligned. Also, unlike the flash hardware of F0, F1 and F3 series, you can try to overwrite a previously written location without causing an error. 0->1 bit changes are just ignored.

Related

RISC-V inline assembly using memory not behaving correctly

This system call code is not working at all. The compiler is optimizing things out and generally behaving strangely:
template <typename... Args>
inline void print(Args&&... args)
{
char buffer[1024];
auto res = strf::to(buffer) (std::forward<Args> (args)...);
const size_t size = res.ptr - buffer;
register const char* a0 asm("a0") = buffer;
register size_t a1 asm("a1") = size;
register long syscall_id asm("a7") = ECALL_WRITE;
register long a0_out asm("a0");
asm volatile ("ecall" : "=r"(a0_out)
: "m"(*(const char(*)[size]) a0), "r"(a1), "r"(syscall_id) : "memory");
}
This is a custom system call that takes a buffer and a length as arguments.
If I write this using global assembly it works as expected, but program code has generally been extraordinarily good if I write the wrappers inline.
A function that calls the print function with a constant string produces invalid machine code:
0000000000120f54 <start>:
start():
120f54: fa1ff06f j 120ef4 <public_donothing-0x5c>
-->
120ef4: 747367b7 lui a5,0x74736
120ef8: c0010113 addi sp,sp,-1024
120efc: 55478793 addi a5,a5,1364 # 74736554 <add_work+0x74615310>
120f00: 00f12023 sw a5,0(sp)
120f04: 00a00793 li a5,10
120f08: 00f10223 sb a5,4(sp)
120f0c: 000102a3 sb zero,5(sp)
120f10: 00500593 li a1,5
120f14: 06600893 li a7,102
120f18: 00000073 ecall
120f1c: 40010113 addi sp,sp,1024
120f20: 00008067 ret
It's not loading a0 with the buffer at sp.
What am I doing wrong?
It's not loading a0 with the buffer at sp.
Because you didn't ask for a pointer as an "r" input in a register. The one and only guaranteed/supported behaviour of T foo asm("a0") is to make an "r" constraint (including +r or =r) pick that register.
But you used "m" to let it pick an addressing mode for that buffer, not necessarily 0(a0), so it probably picked an SP-relative mode. If you add asm comments inside the template like "ecall # 0 = %0 1 = %1 2 = %2" you can look at the compiler's asm output and see what it picked. (With clang, use -no-integrated-as so asm comments in the template come through in the -S output.)
Wrapping a system call does need the pointer in a specific register, i.e. using "r" or +"r"
asm volatile ("ecall # 0=%0 1=%1 2=%2 3=%3 4=%4"
: "=r"(a0_out)
: "r"(a0), "r"(a1), "r"(syscall_id), "m"(*(const char(*)[size]) a0)
: // "memory" unneeded; the "m" input tells the compiler which memory is read
);
That "m" input can be used instead of the "memory" clobber, not instead of an "r" pointer input. (For write specifically, because it only reads that one area of pointed-to memory and has no other side-effects on memory user-space can see, only on kernel write write buffers and file-descriptor positions which aren't C objects this program can access directly. For a read call, you'd need the memory to be an output operand.)
With optimization disabled, compilers do typically pick another register as the base for the "m" input (e.g. 0(a5) for GCC), but with optimization enabled GCC picks 0(a0) so it doesn't cost extra instructions. Clang still picks 0(a2), wasting an instruction to set up that pointer, even though the "=r"(a0_out) is not early-clobber. (Godbolt, with a very cut-down version of the function that doesn't call strf::to, whatever that is, just copies a byte into the buffer.)
Interestingly, with optimization enabled for my cut-down stand-alone version of the function without fixing the bug, GCC and clang do happen to put a pointer to buffer into a0, picking 0(a0) as the template expansion for that operand (see the Godbolt link above). This seems to be a missed optimization vs. using 16(sp); I don't see why they'd need the buffer address in a register at all.
But without optimization, GCC picks ecall # 0 = a0 1 = 0(a5) 2 = a1. (In my simplified version of the function, it sets a5 with mv a5,a0, so it did actually have the address in a0 as well. So it's a good thing you had more code in your function to make it not happen to work by accident, so you could find the bug in your code.)

can't edit integer with .noinit attribute on STM32L476 with custom linker file

I'm doing my first attempt working with linker files. In the end i want to have a variable that keeps it's value after reset. I'm working with an STM32L476.
To achieve this i modified the Linker files: STM32L476JGYX_FLASH.ld and STM32L476JGYX_RAM.ld to include a partition called NOINT.
MEMORY
{
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 96K
RAM2 (xrw) : ORIGIN = 0x10000000, LENGTH = 32K
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 1024K -0x100
NOINIT (rwx) : ORIGIN = 0x8000000 + 1024K - 0x100, LENGTH = 0x100
}
/* Sections */
SECTIONS
{
...
/* Global data not cleared after reset. */
.noinit (NOLOAD): {
KEEP(*(*.noinit*))
} > NOINIT
...
In the main.c i initialize the variable reset_count as a global variable.
__attribute__((section(".noinit"))) volatile uint32_t reset_count = 0;
The =0 part is just for simplification. I actually want to set reset_count to zero somewhere in a function.
When i run the program and step through the initialization i would expect to see the value of reset_count as 0. But somehow i always get 0xFFFFFFFF. It seems like i can't edit the reset_count variable. Can anybody tell me how i can make this variable editable?
It is not clear from the question whether you want to have a variable that keeps its value when power is removed, or just while power stays on but hardware reset is pulsed.
If you want something that keeps its value when power is removed, then your linker script is ok to put the block in flash memory, but you need to use the functions HAL_FLASH_Program etc. to write to it, you can't just make an assignment. In addition, you could simplify the linker script by instead of creating the NOINIT output region, just putting >FLASH.
If you want a variable that just persists across reset wile power stays up then you need to put the variable into SRAM not FLASH, for example like this:
.noinit (NOLOAD) :
{
*(.noinit*)
}
> RAM2
Note that you don't need to use KEEP unless you want to link a section that is unreferenced, which will not be the case if you actually use the variables, and you don't need another * immediately before .noinit unless you section names don't start with a ., which they should.
You will not be able to write to the flash memory as simply as that. If you use ST HAL, there is a flash module that provides HAL_FLASH_Program() function.
Alternatively, if the data you are trying to store is 128 bytes or less and you have an RTC backup battery, you can use the RTC backup registers (RTC_BKPxR) to store your data.

Why is the pCode type const uint32_t*? (pCode in VkShaderModuleCreateInfo )?

I just read the Shader Modules Vulkan tutorial, and I didn't understand something.
Why is createInfo.pCode a uint32_t rather than unsigned char or uint8_t? Is it faster? (because moving pointer is now 4 bytes)?
VkShaderModule createShaderModule(const std::vector<char>& code) {
VkShaderModuleCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
createInfo.codeSize = code.size();
createInfo.pCode = reinterpret_cast<const uint32_t*>(code.data());
VkShaderModule shaderModule;
if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
throw std::runtime_error("failed to create shader module!");
}
}
A SPIR-V module is defined a stream of 32-bit words. Passing in data using a uint32_t pointer tells the driver that the data is 32-bit aligned and allows the shader compiler in the driver to directly access the data using aligned 32-bit loads.
This is usually faster (in most CPU designs) than random unaligned access, in particular for cases that cross cache lines.
This is also more portable C/C++. Using direct unaligned memory access instructions is possible in most CPU architectures, but not standard in the language. The portable alternative using a byte stream requires assembling 4 byte loads and merging them, which is less efficient than just making a direct aligned word load.
Note using reinterpret_cast here assumes the data is aligned correctly. For the base address of std::vector it will work (data allocated via new must be aligned enough for the largest supported primitive type), but it's one to watch out for if you change where the code comes from in future.
According to VUID-VkShaderModuleCreateInfo-pCode-parameter, the specification requires that:
pCode must be a valid pointer to an array of 4/codeSize uint32_t values
pCode being a uint32_t* allows this requirement to be partially validated by the compiler, making it one less thing to worry about for users of the API.
Because SPIR-V word is 32 bits.
It should not matter for performance. It is just a type.

How to demonstrate a memory misalignment error in C on a macbook pro (Intel 64 bit processor)

In an attempt to understand C memory alignment or whatever the term is (data structure alignment?), i'm trying to write code that results in a alignment error. The original reason that brought me to learning about this is that i'm writing data parsing code that reads binary data received over the network. The data contains some uint32s, uint64s, floats, and doubles, and i'd like to make sure they are never corrupted due to errors in my parsing code.
An unsuccessful attempt at causing some problem due to misalignment:
uint32_t integer = 1027;
uint8_t * pointer = (uint8_t *)&integer;
uint8_t * bytes = malloc(5);
bytes[0] = 23; // extra byte to misalign uint32_t data
bytes[1] = pointer[0];
bytes[2] = pointer[1];
bytes[3] = pointer[2];
bytes[4] = pointer[3];
uint32_t integer2 = *(uint32_t *)(bytes + 1);
printf("integer: %u\ninteger2: %u\n", integer, integer2);
On my machine both integers print out the same. (macbook pro with Intel 64 bit processor, not sure what exactly determines alignment behaviour, is it the architecture? or exact CPU model? or compiler maybe? i use Xcode so clang)
I guess my processor/machine/setup supports unaligned reads so it takes the above code without any problems.
What would a case where parsing of say an uint32_t would fail because of code not taking alignment in account? Is there a way to make it fail on a modern Intel 64 bit system? Or am i safe from alignment errors when using simple datatypes like integers and floats (no structs)?
Edit: If anyone's reading this later, i found a similar question with interesting info: Mis-aligned pointers on x86
Normally, the x86 architecture doesn't have alignment requirements [except for some SIMD instructions like movdqa).
However, since you're trying to write code to cause such an exception ...
There is an alignment check exception bit that can be set into the x86 flags register. If you turn in on, an unaligned access will generate an exception which will show up [under linux at least] as a bus error (i.e. SIGBUS)
See my answer here: any way to stop unaligned access from c++ standard library on x86_64? for details and some sample programs to generate an exception.

Assembly code for optimized bitshifting of a vector

i'm trying to write a routine that will logically bitshift by n positions to the right all elements of a vector in the most efficient way possible for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type depending on the type of processor (SSE2 supported, only MMX suppported, only standard instruction se supported). Therefore i need 12 functions in total.
I have already found by myself how to backup and restore the registers that i need, how to make a loop, how to copy data into regular registers or MMX registers and how to shift by 1 position logically.
Because i'm not familiar with assembly language that's about it.
Which registers should i use for each instruction set?
How will the availability of the large vector (an image) in L1 cache be optimized?
How do i find the next element of the vector (a pointer kind of thing), i know i can make a mov by address and i assume i have to increment the address by 1, 2 or 4 depending on my type of data?
Although i have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what i'm trying to do for MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
Sounds like you'd benefit from either using or looking into BitMagic's source. its entirely intrinsics based too, which makes its far more portable (though from the looks of it your using GCC, so it might have to get an MSVC to GCC intrinics mapping).