SRAM usage optimization in ARM devices

The relationship between the variable size and the data bus size was confusing for me so I decided to get to the bottom of it by examining the assembly code.
I compiled the source code below in STM32CubeIDE version 1.2.0.
#define BUFFER_SIZE ((uint8_t)0x20)
uint8_t aTxBuffer[BUFFER_SIZE];
int i;
for(i=0; i<BUFFER_SIZE; i++){
aTxBuffer[i]=0xFF; /* TxBuffer init */
}
Looking at the assembly code confirmed my suspicion. Unless I have grossly misunderstood it, this code allocates an array with a total size of BUFFER_SIZE * DATA_BUS_SIZE (which is 32 bits on Cortex-M), but only the least significant byte at each memory address is used.
for(i=0; i<BUFFER_SIZE; i++)
//reset i to 0
800051c: 4b09 ldr r3, [pc, #36] ; (8000544 <main+0x3c>)
800051e: 2200 movs r2, #0
8000520: 601a str r2, [r3, #0]
8000522: e009 b.n 8000538 <main+0x30>
{
//store 0xFF in each member of TxBuffer
aTxBuffer[i]=0xFF; /* TxBuffer init */
8000524: 4b07 ldr r3, [pc, #28] ; (8000544 <main+0x3c>)
8000526: 681b ldr r3, [r3, #0]
8000528: 4a07 ldr r2, [pc, #28] ; (8000548 <main+0x40>)
800052a: 21ff movs r1, #255 ; 0xff
800052c: 54d1 strb r1, [r2, r3]
for(i=0; i<BUFFER_SIZE; i++)
//increment i
800052e: 4b05 ldr r3, [pc, #20] ; (8000544 <main+0x3c>)
8000530: 681b ldr r3, [r3, #0]
8000532: 3301 adds r3, #1
8000534: 4a03 ldr r2, [pc, #12] ; (8000544 <main+0x3c>)
8000536: 6013 str r3, [r2, #0]
//compare i with 31; branch back to 8000524 if i is lower or same (i <= 31)
8000538: 4b02 ldr r3, [pc, #8] ; (8000544 <main+0x3c>)
800053a: 681b ldr r3, [r3, #0]
800053c: 2b1f cmp r3, #31
800053e: d9f1 bls.n 8000524 <main+0x1c>
//pointer to i in SRAM
8000544: 2000002c .word 0x2000002c
//pointer to TxBuffer in SRAM
8000548: 20000064 .word 0x20000064
As SRAM is at a premium in embedded devices, I believe there must be some clever ways to optimize its usage. One naive solution I can think of is to allocate the buffer as uint32_t and use bit shifting to access the higher bytes, but this seems costly from a speed perspective. What is the recommended practice here?

Bus size does not matter in this case. Memory usage will be the same.
Some Cortex cores do not allow unaligned access. What is unaligned access? An unaligned memory access occurs when you try to access (as a single operation) N bytes of data starting from an address that is not evenly divisible by N (i.e. addr % N != 0). In our case N can be 1, 2 or 4.
Your example should be analyzed with optimizations turned on.
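To make the definition concrete, here is a minimal sketch in C of the addr % N check. The helper name is_aligned is my own and not part of any CMSIS or ST API. It only models the arithmetic; it says nothing about what a particular core does when an access is unaligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when an access of `size` bytes starting at `addr` is
 * naturally aligned, i.e. addr % size == 0.
 * Assumption: `size` is 1, 2 or 4, as in the text above. */
static bool is_aligned(uintptr_t addr, size_t size)
{
    assert(size == 1 || size == 2 || size == 4);
    return (addr % size) == 0;
}
```

For example, is_aligned(0x20000064, 4) is true (this is the address of aTxBuffer in the question's listing), while is_aligned(0x20000066, 4) is false and is_aligned(0x20000066, 2) is true. Byte accesses (size 1) are always aligned.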
#define BUFFER_SIZE ((uint8_t)0x20)
uint8_t aTxBuffer[BUFFER_SIZE];
void init(uint8_t x)
{
for(int i=0; i<BUFFER_SIZE; i++)
{
aTxBuffer[i]=x;
}
}
The STM32F0, which does not allow unaligned access, has to store the data byte by byte:
init:
ldr r3, .L5
movs r2, r3
adds r2, r2, #32
.L2:
strb r0, [r3]
adds r3, r3, #1
cmp r3, r2
bne .L2
bx lr
.L5:
.word aTxBuffer
but the STM32F4 can store full 32-bit words (4 bytes) at a time, which is faster because it takes fewer operations:
init:
movs r3, #0
bfi r3, r0, #0, #8
bfi r3, r0, #8, #8
ldr r2, .L3
bfi r3, r0, #16, #8
bfi r3, r0, #24, #8
str r3, [r2] @ unaligned
str r3, [r2, #4] @ unaligned
str r3, [r2, #8] @ unaligned
str r3, [r2, #12] @ unaligned
str r3, [r2, #16] @ unaligned
str r3, [r2, #20] @ unaligned
str r3, [r2, #24] @ unaligned
str r3, [r2, #28] @ unaligned
bx lr
.L3:
.word aTxBuffer
The SRAM consumption is exactly the same in both cases.

The given code does not use more than BUFFER_SIZE*8 bits for aTxBuffer.
Note the following line in your assembly
800052c: 54d1 strb r1, [r2, r3]
Note the b suffix to the instruction here, indicating 'byte'.
In effect, the instruction translates to 'store 1 byte of value 0xFF (stored in r1) at aTxBuffer (stored in r2) + i (stored in r3)'.
So, while the assembly doesn't indicate the end of the buffer, it certainly accesses all bytes in the aTxBuffer array without any waste.
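If you want the compiler itself to confirm this, a small check like the following should do. This is only a sketch: tx_buffer_bytes and tx_buffer_stride are names made up for illustration, and static_assert assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFER_SIZE ((uint8_t)0x20)

uint8_t aTxBuffer[BUFFER_SIZE];

/* Fails at compile time if the array took more than one byte per element. */
static_assert(sizeof aTxBuffer == 0x20, "aTxBuffer should occupy 32 bytes");

/* Number of bytes the array occupies in RAM. */
size_t tx_buffer_bytes(void)
{
    return sizeof aTxBuffer;
}

/* Distance in bytes between two neighbouring elements. */
ptrdiff_t tx_buffer_stride(void)
{
    return (char *)&aTxBuffer[1] - (char *)&aTxBuffer[0];
}
```

With a 32-element uint8_t array, tx_buffer_bytes() returns 32 and tx_buffer_stride() returns 1, regardless of the width of the data bus.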
It's possible that your minimal example doesn't capture the problem you face in your actual code but I find it unlikely that the compiler will have such wasted bytes, especially one for an embedded device.
If you do find that to be the case, you can allocate a uint32_t array of the same total size in bits (or one element larger) and cast the address of its first element to a uint8_t pointer. You can then access the bytes through that pointer as normal.
Note that such programming should be avoided and is shown only as an example. Specifically, it makes pointer aliasing harder for the compiler to analyze, which hinders some optimizations. It also puts a burden on the programmer: careful memory management is required to avoid mistakes (for example, if the buffer were dynamically allocated, you would have to free it through only one of these pointers to avoid a double-free error).
Example:
#define BUFFSIZE 0x20
// number of elements in int32 will be BUFFSIZE / 4
#define BUFFSIZE_IN_INT_32 (BUFFSIZE >> 2)
// allocate the buffer
uint32_t uint32_array[BUFFSIZE_IN_INT_32];
// point to 1 byte sized elements
uint8_t * aTxBuffer = (uint8_t *)(uint32_array);
// use aTxBuffer as you like
Note that I assume BUFFSIZE is divisible by 4. If that is not the case, increase BUFFSIZE_IN_INT_32 by 1.
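One way to avoid handling the two cases separately is to round the division up. The following is a sketch under that assumption: the value 0x21 is a made-up size that is deliberately not a multiple of 4, and fill_tx_buffer is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFSIZE 0x21u /* hypothetical size, NOT divisible by 4 */
/* round up: (33 + 3) / 4 = 9 words, i.e. 36 bytes */
#define BUFFSIZE_IN_INT_32 ((BUFFSIZE + 3u) / 4u)

uint32_t uint32_array[BUFFSIZE_IN_INT_32];
uint8_t *aTxBuffer = (uint8_t *)uint32_array;

/* The word array must cover at least BUFFSIZE bytes. */
static_assert(sizeof uint32_array >= BUFFSIZE, "buffer too small");

/* Writes `value` to every one of the BUFFSIZE bytes. */
void fill_tx_buffer(uint8_t value)
{
    for (size_t i = 0; i < BUFFSIZE; i++) {
        aTxBuffer[i] = value;
    }
}
```

When BUFFSIZE is already a multiple of 4, the same macro gives exactly BUFFSIZE / 4, so no extra element is wasted.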

Related

Raspberry Pico Pi cmake

I have written code for the Raspberry Pi Pico. The program is about one LED and two buttons, where one button turns the LED on and the other turns it off. I am pretty new to the Raspberry Pi and don't know much. I am using a virtual machine for cmake and make, but unfortunately I can't turn my code into a uf2, because I have not defined my link_gpio_get function in the sdlink.c file and I don't know how to do so, so the build is failing due to an undefined reference...
.EQU LED_PIN1, 0
.EQU BUT_PIN1, 1
.EQU BUT_PIN2, 2
.EQU GPIO_IN, 0
.EQU GPIO_OUT, 1
.thumb_func
.global main
main:
MOV R0, #LED_PIN1
BL gpio_init
MOV R0, #LED_PIN1
MOV R1, #GPIO_OUT
BL link_gpio_set_dir @ Initialize PIN1
MOV R0, #BUT_PIN1
BL gpio_init
MOV R0, #BUT_PIN1
MOV R1, #GPIO_IN
BL link_gpio_set_dir
MOV R0, #BUT_PIN2
BL gpio_init
MOV R0, #BUT_PIN2
MOV R1, #GPIO_IN
BL link_gpio_set_dir
wait_on:
MOV R0, #BUT_PIN1 @ Wait for turn on button
BL link_gpio_get
CMP R0, #1
BEQ turn_on
B wait_on
turn_on:
MOV R0, #LED_PIN1
MOV R1, #1
BL link_gpio_put @ Turn on led
B wait_off
turn_off:
MOV R0, #LED_PIN1
MOV R1, #0
BL link_gpio_put @ Turn off led
B wait_on
wait_off:
MOV R0, #BUT_PIN2 @ Wait for off
BL link_gpio_get
CMP R0, #1
BEQ turn_off
B wait_off
Here is my sdlink.c file
/* C wrapper functions for the RP2040 SDK
* Inline functions gpio_set_dir and gpio_put.
*/
#include "hardware/gpio.h"
void link_gpio_set_dir(int pin, int dir)
{
gpio_set_dir(pin, dir);
}
void link_gpio_put(int pin, int value)
{
gpio_put(pin, value);
}
I've been working on producing a uf2 with cmake on Windows 10, and after watching a YouTube video, reviewing a Hackster article and making my own edits, I was able to get it working.
I'm not sure what OS you are using, but hopefully these links and my edit can help you identify the issue with your project.
https://www.youtube.com/watch?v=mUF9xjDtFfY
https://www.hackster.io/lawrence-wiznet-io/how-to-setup-raspberry-pi-pico-c-c-sdk-in-window10-f2b816
The following is the edit that allowed me to build and produce output. I hope it helps!
After you've cloned the pico-examples project, navigate to the pico-examples directory. I opened pico_sdk_import.cmake in a text editor and changed line 6 from if (DEFINED ENV{PICO_SDK_PATH} AND (NOT PICO_SDK_PATH)) to if (DEFINED ENV{PICO_SDK_PATH}).
If you can provide a link to where you obtained the code you posted, maybe I can help further to figure out what sdlink.c should contain.
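In the meantime, going only by the code in the question: the assembly calls link_gpio_get, but the posted sdlink.c defines only link_gpio_set_dir and link_gpio_put, which would explain the undefined reference. A wrapper along these lines might be what is missing. This is a sketch, assuming the SDK's gpio_get from hardware/gpio.h. The host-side stand-in (fake_pin_level and the replacement gpio_get) is hypothetical and exists only so the file can be compiled and tried without the SDK.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__has_include)
#  if __has_include("hardware/gpio.h")
#    include "hardware/gpio.h" /* real SDK header when building for the Pico */
#    define HAVE_PICO_GPIO 1
#  endif
#endif

#ifndef HAVE_PICO_GPIO
/* Hypothetical host-side stand-in, NOT part of the SDK:
 * pin levels are read from a plain array instead of the hardware. */
static bool fake_pin_level[30];

static bool gpio_get(unsigned gpio)
{
    assert(gpio < 30u);
    return fake_pin_level[gpio];
}
#endif

/* Non-inline wrapper that the assembly can reach with BL.
 * The result comes back in R0 as 0 or 1, matching the CMP R0, #1 checks. */
int link_gpio_get(int pin)
{
    return gpio_get((unsigned)pin) ? 1 : 0;
}
```

On the Pico itself only the link_gpio_get function would need to be added to sdlink.c, next to the two existing wrappers.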

Assembly variables

I am new to assembly and am confused about how some variables seem to obtain values from nowhere, as in this code (the program shifts every entered symbol by one ASCII code):
.model small
.stack 100h
.data
Enterr db 10, 13, "$"
buffer db 255
number db ?
symb db 255 dup (?)
.code
START:
MOV ax, @data
MOV ds, ax
MOV ah, 10
MOV dx, offset buffer
INT 21h
MOV ah, 9
MOV dx, offset ENTERR
INT 21h
MOV bx, offset symb
MOV cl, number
MOV ch, 0
CMP cx, 0
JE terminate
cycle:
INC byte ptr [bx]
INC bx
LOOP cycle
MOV byte ptr [bx], '$'
MOV ah, 9
MOV dx, offset symb
INT 21h
terminate:
MOV ah, 4Ch
MOV al, 0
INT 21h
END START
Just before the loop, cx holds the number of symbols entered, and the cycle takes place from there on. This value of cx was obtained when the variable "number" was copied to cl. How did the variable "number" obtain such a value? Replacing
MOV cl, number
with
MOV cl, [number]
does not affect the program. Why is that? Does every variable defined by
variable db ?
have the same value, i.e. the number of symbols entered? (I am using TASM.)
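One way to picture what is probably happening, assuming the usual layout of the DOS buffered-input call (INT 21h, AH=0Ah): the call treats the bytes at DS:DX as a single structure. The first byte holds the maximum length, the second byte receives the number of characters read (not counting the final carriage return), and the characters follow. Because buffer, number and symb are declared back to back, number happens to be that second byte, so DOS fills it in. A variable declared with db ? somewhere else would not be touched. In the MASM-style syntax that TASM uses by default, number and [number] both denote the memory operand, which would explain why the brackets change nothing. The C model below is only an illustration; fake_dos_read_line is a made-up stand-in for the interrupt, not a real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the three consecutive db lines from the question. */
struct dos_input_buffer {
    uint8_t max_len;    /* "buffer": capacity, set by the program (255)  */
    uint8_t count;      /* "number": written by DOS, excludes the CR     */
    uint8_t chars[255]; /* "symb":   the typed characters followed by CR */
};

static_assert(offsetof(struct dos_input_buffer, count) == 1,
              "count must sit directly after max_len");

/* Hypothetical stand-in for INT 21h / AH=0Ah: copies `typed` into the
 * buffer, appends a carriage return and stores the length in `count`. */
void fake_dos_read_line(struct dos_input_buffer *buf, const char *typed)
{
    assert(buf->max_len > 0);
    size_t room = (size_t)buf->max_len - 1u; /* keep one byte for the CR */
    size_t n = strlen(typed);
    if (n > room) {
        n = room;
    }
    memcpy(buf->chars, typed, n);
    buf->chars[n] = '\r';
    buf->count = (uint8_t)n;
}
```

After a simulated read of "abc", count is 3, which corresponds to the value the program later copies from number into cl.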

Missing symbols when linking a static library into a shared library

I have a problem with missing symbols when linking static libraries and .o files into a shared library. I have checked the symbol table of the static library, and the functions I need are listed there normally, like this:
...
00000000 g F .text 000000b0 av_int2dbl
...
000000b0 g F .text 00000060 av_int2flt
However, when I generate the shared library, av_int2dbl, av_int2flt and some other functions are missing, even though they are all listed normally in the static library's symbol table. As a crude workaround, I added a dummy function to a .o file that references the missing functions. The DYNAMIC SYMBOL TABLE of the shared library then gained some of the functions that were missing before, but strangely av_int2dbl and av_int2flt are still missing.
Could anybody tell me by what principle symbols are removed when a shared library is generated?
If ld removes all unreferenced symbols, why do functions defined in .o files (which are not referenced from anywhere else) still exist in the shared library? And why, although av_int2dbl and av_int2flt are called explicitly in the dummy function, are the calls to these two functions absent from the disassembly?
Below is the dummy function defined in the .o file:
int my_dummy_funcs(void)
{
av_rdft_init(0x01,0x1);
av_rdft_calc(NULL, NULL);
av_rdft_end(NULL);
av_int2dbl(1);
av_int2flt(1);
av_resample(NULL,NULL,NULL,NULL,0,0,0);
av_resample_close(NULL);
av_resample_init(0,0,0,0,0,1.0);
return 0;
}
The disassembly of the dummy function is as follows:
0008951c <my_dummy_funcs>:
8951c: e3a00001 mov r0, #1
89520: e92d40d0 push {r4, r6, r7, lr}
89524: e1a01000 mov r1, r0
89528: e3a04000 mov r4, #0
8952c: e24dd010 sub sp, sp, #16
89530: eb03a21e bl 171db0 <av_rdft_init>
89534: e1a01004 mov r1, r4
89538: e1a00004 mov r0, r4
8953c: e3a06000 mov r6, #0
89540: eb03a22f bl 171e04 <av_rdft_calc>
89544: e1a00004 mov r0, r4
89548: eb03a231 bl 171e14 <av_rdft_end>
8954c: e1a01004 mov r1, r4
89550: e1a02004 mov r2, r4
89554: e1a03004 mov r3, r4
89558: e1a00004 mov r0, r4
8955c: e58d4000 str r4, [sp]
89560: e58d4004 str r4, [sp, #4]
89564: e3a07000 mov r7, #0
89568: e58d4008 str r4, [sp, #8]
8956c: eb0e4a62 bl 41befc <av_resample>
89570: e1a00004 mov r0, r4
89574: e3437ff0 movt r7, #16368 ; 0x3ff0
89578: eb0e4a4a bl 41bea8 <av_resample_close>
8957c: e1a00004 mov r0, r4
89580: e1a01004 mov r1, r4
89584: e1a02004 mov r2, r4
89588: e1a03004 mov r3, r4
8958c: e58d4000 str r4, [sp]
89590: e1cd60f8 strd r6, [sp, #8]
89594: eb0e495f bl 41bb18 <av_resample_init>
89598: e1a00004 mov r0, r4
8959c: e28dd010 add sp, sp, #16
895a0: e8bd80d0 pop {r4, r6, r7, pc}
"when I generate the shared library, av_int2dbl, av_int2flt and some other functions are missing"
The most likely reason: they are marked as HIDDEN in the regular symbol table, and that tells the linker to not export them in the dynamic symbol table.
You can verify this hypothesis by running
readelf -s libfoo.a | grep av_int2dbl
(and learn to use readelf instead of objdump on ELF platforms).
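For reference, this is roughly how a symbol ends up HIDDEN in the first place. It is a sketch assuming GCC or Clang on an ELF target, and the function names are made up. Compiling with -fvisibility=hidden and marking only the public API as default has the same effect on every unmarked function.

```c
#include <assert.h> /* only needed for the quick check in the note below */

/* Usable inside the library, but not exported from the shared object. */
__attribute__((visibility("hidden")))
int hidden_helper(int x)
{
    return x + 1;
}

/* Explicitly exported: stays in the dynamic symbol table. */
__attribute__((visibility("default")))
int exported_api(int x)
{
    return hidden_helper(x) * 2;
}
```

Inside the library the call still works, e.g. assert(exported_api(1) == 4) holds. After linking into a shared library, readelf -s on the object file would be expected to show hidden_helper with HIDDEN in the Vis column, and the function would not appear in the library's .dynsym.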

How to optimize load and stores?

I'm trying to have a bunch of operations executed on different targets such as ARM, Bfin, etc., but every time I write simple C code and compile it, each operation gets something like two loads and one store, which is unnecessary:
ldr r2, [fp, #-24]
ldr r3, [fp, #-28]
add r3, r2, r3
str r3, [fp, #-20]
ldr r2, [fp, #-36]
ldr r3, [fp, #-40]
add r3, r2, r3
str r3, [fp, #-32]
ldr r2, [fp, #-44]
ldr r3, [fp, #-48]
add r3, r2, r3
str r3, [fp, #-20]
ldr r3, [fp, #-16]
add r3, r3, #1
str r3, [fp, #-16]
When I turn on any optimization options, even -O1, it simply calculates the result and stores it in the output:
subl $24, %esp
movl $4, 4(%esp)
movl $.LC0, (%esp)
Is there any way I can have the operations performed without fetching the same variable over and over again? I've tried gcc -fgcse-lm and -fgcse-sm, but that didn't work.
It depends on the operation. GCC can't figure out a high-level optimization for
int a(int b, int c)
{
b-=c;
c-=b;
b-=c;
c-=b;
b-=c;
c-=b;
return c;
}
If you want to do benchmarking and avoid constant folding and dead code elimination of the optimizer in gcc, you need to use non-constants as input and make sure the result goes somewhere.
For instance, instead of using
int main(int argc, char** argv) {
int a = 1;
int b = 2;
start_clock();
int c = a + b;
int d = c + a;
int e = d + b;
stop_clock();
output_time_needed();
return 0;
}
You should use something like
int main(int argc, char** argv) {
int a = argc;
int b = argc + 1;
start_clock();
int c = a + b;
int d = c + a;
int e = d + b;
stop_clock();
output_time_needed();
return e;
}
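start_clock(), stop_clock() and output_time_needed() are left undefined above. As a rough sketch of the same idea using only the standard clock() function, with a volatile sink so that every result has to be stored; add_chain and time_add_chain are names chosen here for illustration, and clock() is far too coarse to time a single pass, hence the loop:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* The three dependent additions from the example above. */
int add_chain(int a, int b)
{
    int c = a + b;
    int d = c + a;
    int e = d + b;
    return e;
}

/* Runs add_chain `iterations` times on inputs that change per iteration
 * and returns the elapsed CPU time in seconds. The last result is
 * written to *result so the caller can use it. */
double time_add_chain(int a, int b, long iterations, int *result)
{
    volatile int sink = 0; /* volatile: each store must really happen */

    assert(iterations > 0 && result != NULL);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        sink = add_chain(a + (int)(i & 0xFF), b);
    }
    clock_t stop = clock();

    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling it with values that are only known at run time, for example time_add_chain(argc, argc + 1, 1000000, &r), keeps the compiler from folding the additions into a constant.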

CPU not calling IRQ0?

I am writing an OS and trying to use the PIT. I have written a handler and an ISR entry for IRQ0 (interrupt 32), but the handler is not being called at all. I am pretty sure I am not setting up the ISR entry correctly. Any suggestions? Here is my ASM code
mov dword EAX, irq_common_stub
mov byte [_NATIVE_IDT_Contents + 0x100], AL
mov byte [_NATIVE_IDT_Contents + 0x101], AH
mov byte [_NATIVE_IDT_Contents + 0x102], 0x8
mov byte [_NATIVE_IDT_Contents + 0x105], 0x8E
shr dword EAX, 0x10
mov byte [_NATIVE_IDT_Contents + 0x106], AL
mov byte [_NATIVE_IDT_Contents + 0x107], AH
My code to init the PIT is
public static void PIT_Init(uint frequency)
{
uint divisor = 1193180 / frequency;
GruntyOS.IO.Ports.Outb(0x43, 0x36);
byte l = (byte)(divisor & 0xFF);
byte h = (byte)((divisor >> 8) & 0xFF);
GruntyOS.IO.Ports.Outb(0x40, l);
GruntyOS.IO.Ports.Outb(0x40, h);
}
The handler is
public static void HandlePIT()
{
GruntyOS.IO.Ports.Outb(0xA0, 0x20);
GruntyOS.IO.Ports.Outb(0x20, 0x20);
print("Tick: " + Tick.ToString());
Tick++;
}
Which is called from
irq_common_stub:
pusha
mov ax, ds
push eax
mov ax, 0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
call System_Void__GruntyOS_Entry_HandlePIT__
pop ebx
mov ds, bx
mov es, bx
mov fs, bx
mov gs, bx
popa
add esp, 8
sti
iret
Maybe this will help. It's a simple kernel that is capable of handling IRQs and exceptions.
http://www.osdever.net/bkerndev/Docs/irqs.htm
http://www.ni.com/white-paper/2874/en
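For comparing against the byte writes in the question, here is a sketch of the 8-byte layout of a 32-bit interrupt gate and of where entry 32 sits in the table. The struct and function names are my own. Note also that IRQ0 only arrives at vector 32 if the PIC has been remapped so that the master's base vector is 0x20, and the question does not show that step, so it may be worth checking too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One IDT entry (32-bit protected mode interrupt gate). */
struct idt_entry {
    uint16_t offset_low;  /* bytes 0-1: handler address bits 0..15   */
    uint16_t selector;    /* bytes 2-3: code segment selector        */
    uint8_t  zero;        /* byte 4:    always 0                     */
    uint8_t  type_attr;   /* byte 5:    0x8E = present, ring 0, gate */
    uint16_t offset_high; /* bytes 6-7: handler address bits 16..31  */
};

static_assert(sizeof(struct idt_entry) == 8, "IDT entries are 8 bytes");

/* Builds a gate equivalent to the byte-wise stores in the question. */
struct idt_entry make_interrupt_gate(uint32_t handler, uint16_t selector)
{
    struct idt_entry e;
    e.offset_low  = (uint16_t)(handler & 0xFFFFu);
    e.selector    = selector;
    e.zero        = 0;
    e.type_attr   = 0x8E;
    e.offset_high = (uint16_t)(handler >> 16);
    return e;
}

/* Byte offset of a vector's entry from the start of the IDT. */
size_t idt_byte_offset(unsigned vector)
{
    return vector * sizeof(struct idt_entry);
}
```

idt_byte_offset(32) gives 0x100, which matches the _NATIVE_IDT_Contents + 0x100 base used in the question. The question's code leaves bytes 0x103 and 0x104 unwritten, which is only correct if the table was zeroed beforehand.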