How to optimize or reduce RAM size in embedded system software? - embedded

i am working on embedded software projects in automotive domain. In one of my projects, the application software consumes almost 99% of RAM memory. Actual RAM size available is 12KB. we use TMS470R1B1 Titan F05 microcontroller. I have done some optimisation like finding unused messages in software and deleting them but its still not worth reducing RAM. could you please suggest some good ways to reduce the RAM by some software optimisation?

Unlike speed optimisation, RAM optimisation might be something that requires "a little bit here, a little bit there" all through the code. On the other hand, there may turn out to be some "low hanging fruit".
Arrays and Lookup Tables
Arrays and look-up tables can be good "low-hanging fruit". If you can get a memory map from the linker, check that for large items in RAM.
Check for look-up tables that haven't used the const declaration properly, which puts them in RAM instead of ROM. Especially look out for look-up tables of pointers, which need the const on the correct side of the *, or may need two const declarations. E.g.:
const my_struct_t * param_lookup[] = {...}; // Table is in RAM!
my_struct_t * const param_lookup[] = {...}; // In ROM
const char * const strings[] = {...}; // Two const may be needed; also in ROM
Stack and heap
Perhaps your linker config reserves large amounts of RAM for heap and stack, larger than necessary for your application.
If you don't use heap, you can possibly eliminate that.
If you measure your stack usage and it's well under the allocation, you may be able to reduce the allocation. For ARM processors, there can be several stacks, for several of the operating modes, and you may find that the stacks allocated for the exception or interrupt operating modes are larger than needed.
Other
If you've checked for the easy savings, and still need more, you might need to go through your code and save "here a little, there a little". You can check things like:
Global vs local variables
Check for unnecessary use of static or global variables, where a local variable (on the stack) can be used instead. I've seen code that needed a small temporary array in a function, which was declared static, evidently because "it would take too much stack space". If this happens enough times in the code, it would actually save total memory usage overall to make such variables local again. It might require an increase in the stack size, but will save more memory on reduced global/static variables. (As a side benefit, the functions are more likely to be re-entrant, thread-safe.)
Smaller variables
Variables that can be smaller, e.g. int16_t (short) or int8_t (char) instead of int32_t (int).
Enum variable size
enum variable size may be bigger than necessary. I can't remember what ARM compilers typically do, but some compilers I've used in the past by default made enum variables 2 bytes even though the enum definition really only required 1 byte to store its range. Check compiler settings.
Algorithm implementation
Rework your algorithms. Some algorithms have have a range of possible implementations with a speed/memory trade-off. E.g. AES encryption can use an on-the-fly key calculation which means you don't have to have the entire expanded key in memory. That saves memory, but it's slower.

Deleting unused string literals won't have any effect on RAM usage because they aren't stored in RAM but in ROM. The same goes for code.
What you need to do is cut back on actual variables and possibly the size of your stack/stacks. I'd look for arrays that can be resized and unused varaibles. Also, it's best to avoid dynamic allocation because of the danger of memory fragmentation.
Aside from that, you'll want to make sure that constant data such as lookup tables are stored in ROM. This can usually be achieved with the const keyword.

Make sure the linker produces a MAP file - it will show you where the RAM is used. Sometimes you can find things like string literals/constants that are kept in RAM. Sometimes you'll find there are unused arrays/variables put there by someone else.
IF you have the linker map file it's also easy to attack the modules which are using the most RAM first.

Here are the tricks I've used on the Cell:
Start with the obvious: squeeze 32-bit words into 16s where possible, rearrange structures to eliminate padding, cut down on slack in any arrays. If you've got any arrays of more than eight structures, it's worth using bitfields to pack them down tighter.
Do away with dynamic memory allocation and use static pools. A constant memory footprint is much easier to optimize and you'll be sure of having no leaks.
Scope local allocations tightly so that they don't stay on stack longer than they have to. Some compilers are very bad at recognizing when you're done with a variable, and will leave it on the stack until the function returns. This can be bad with large objects in outer functions that then eat up persistent memory they don't have to as the outer function calls deeper into the tree.
alloca() doesn't clean up until a function returns, so can waste stack longer than you expect.
Enable function body and constant merging in the compiler, so that if it sees eight different consts with the same value, it'll put just one in the text segment and alias them with the linker.
Optimize executable code for size. If you've got a hard realtime deadline, you know exactly how fast your code needs to run, so if you've any spare performance you can make speed/size tradeoffs until you hit that point. Roll loops, pull common code into functions, etc. In some cases you may actually get a space improvement by inlining some functions, if the prolog/epilog overhead is larger than the function body.
The last one is only relevant on architectures that store code in RAM, I guess.

w.r.t functions, following are the handles to optimise the RAM
Make sure that the number of parameters passed to a functions is deeply analysed. On ARM architectures as per AAPCS(ARM arch Procedure Call standard), maximum of 4 parameters can be passed using the registers and rest of the parameters would be pushed into the stack.
Also consider the case of using a global rather than passing the data to a function which is most frequently called with the same parameter.
The deeper the function calls, the heavier is the use of the stack. use any static analysis tool, to get to know worst cast function call path and look for venues to reduce it. When function A is calling function B, B is calling C, which in turn calls D, which in turn calls E and goes deeper. In this case registers can't be at all levels to pass the parameters and so obviously stack will be used.
Try for venues for clubbing the two parameters into one wherever applicable. remember that all the registers are of 32bit in ARM and so further optimisation is also possible.
void abc(bool a, bool b, uint16_t c, uint32_t d, uint8_t e)// makes use of registers and stack
void abc(uint8 ab, uint16_t c, uint32_t d, uint8_t e)//first 2 params can be clubbed. so total of 4 parameters can be passed using registers
Have a re-look on nested interrupt vectors. In any architecture, we use to have scratch-pad registers and preserved registers. Preserved registers are something which needs to be saved before the servicing the interrupt. In case of nested interrupts it will be needing huge stack space to back up the preserved registers to and from the stack.
if objects of type such as structure is passed to the function by value, then it pushes so much of data(depending on the struct size) which will eat up stack space easily. This can be changed to pass by reference.
regards
barani kumar venkatesan

Adding to the previous answers.
If you are running your program from RAM for faster execution, you can create a user defined section which contains all the initialization routines which you are sure that it wont run more than once after your system boots up. After all the initialization functions executed, you can re use the region for heap.
This can be applied to the data section which are identified as not helpful after a certain stage in your program.

Related

Does optimizing code in TI-BASIC actually make a difference?

I know in TI-BASIC, the convention is to optimize obsessively and to save as many bits as possible (which is pretty fun, I admit).
For example,
DelVar Z
Prompt X
If X=0
Then
Disp "X is zero"
End //28 bytes
would be cleaned up as
DelVar ZPrompt X
If not(X
"X is zero //20 bytes
But does optimizing code this way actually make a difference? Does it noticeably run faster or save memory?
Yes. Optimizing your TI-Basic code makes a difference, and that difference is much larger than you would find for most programming languages.
In my opinion, the most important optimization to TI-Basic programs is size (making them as small as possible). This is important to me since I have dozens of programs on my calculator, which only has 24 kB of user-accessible RAM. In this case, it isn't really necessary to spend lots of time trying to save a few bytes of space; instead, I simply advise learning the shortest and most efficient ways to do things, so that when you write programs, they will naturally tend to be small.
Additionally, TI-Basic programs should be optimized for speed. Examples off of the top of my head include the quirk with the unclosed For( loop, calculating a value once instead of calculating it in every iteration of a loop (if possible), and using quickly-accessed variables such as Ans and the finance variables whenever the variable must be accessed a large number of times (e.g. 1000+).
A third possible optimization is for run-time memory usage. Every loop, function call, etc. has an overhead that must be stored in the memory stack in order to return to the original location, calculate values, etc. during the program's execution. It is important to avoid memory leaks (such as breaking out of a loop with Goto).
It is up to you to decide how you balance these optimizations. I prefer to:
First and foremost, guarantee that there are no memory leaks or incorrectly nested loops in my program.
Take advantage of any size optimizations that have little or no impact on the program's speed.
Consider speed optimizations, and decide if the added speed is worth the increase in program size.
TI-BASIC is an interpreted language, which usually means there is a huge overhead on every single operation.
The way an interpreted language works is that instead of actually compiling the program into code that runs on the CPU directly, each operation is a function call to the interpreter that look at what needs to be done and then calls functions to complete those sub tasks. In most cases, the overhead is a factor or two in speed, and often also in stack memory usage. However, the memory for non-stack is usually the same.
In your above example you are doing the exact same number of operations, which should mean that they run exactly as fast. What you should optimize are things like i = i + 1, which is 4 operations into i++ which is 2 operations. (as an example, TI-BASIC doesn't support ++ operator).
This does not mean that all operations take the exact same time, internally a operation may be calling hundreds of other functions or it may be as simple as updating a single variable. The programmers of the interpreter may also have implemented various peephole optimizations that optimizes very specific language constructs, e.g. for(int i = 0; i < count; i++) could both be implemented as a collection of expensive interpreter functions that behave as if i is generic, or it could be optimized to a compiled loop where it just had to update the variable i and reevaluate the count.
Now, not all interpreted languages are doomed to this pale existence. For example, JavaScript used to be one, but these days all major js engines JIT compile the code to run directly on the CPU.
UPDATE: Clarified that not all operations are created equal.
Absolutely, it makes a difference. I wrote a full-scale color RPG for the TI-84+CSE, and let me tell you, without optimizing any of my code, the game would flat out not run. At present, on the CSE, Sorcery of Uvutu can only run if every other program is archived and all other memory is out of RAM. The programs and data storage alone takes up 20k bytes in RAM, or just 1kb under all of available user memory. With all the variables in use, the memory approaches dangerously low points. I had points in my development where due to poor optimizations, I couldn't even start the game without getting a "memory all gone" error. I had plans to implement various extra things, but due to space and speed concerns, it was impossible to do so. That's only the consideration to space.
In the speed department, the game became, and still is, slow in the overworld. Walking around in the overworld is painfully slow compared to other games, and that's because of what I have to do in that code; I have to check for collisions, check if the user is moving to a new map, check if they pressed a key that should illicit a response, check if a battle should go on, and more. I was able to make slight optimizations to the walking speed, but even then, I could blatantly tell I had made improvements. It still was pretty awfully slow (at least compared to every other port I've made), but I made it a little more tolerable.
In summary, through my own experiences crafting a large project, I can say that in TI-Basic, optimizing code does make a difference. Other answers mentioned this, but TI-Basic is an interpreted language. This means the code isn't compiled into faster, lower level code, but the stuff that you put in the program is read straight out as it executes, is interpreted by the interpreter, calls the subroutines and other stuff it needs to to execute the commands, and then returns back to read the next line. As a result of that, and the fact that the TI-84+ series CPU, the Zilog Z80, was designed in 1976, you get a rather slow interpreter, especially for this day and age. As such, the fewer the commands you run, and the more you take advantage of system weirdness such as Ans being the fastest variable that can also hold the most types of data (integers/floats, strings, lists, matrices, etc), the better the performance you're gonna get.
Sources: My own experiences, documented here: https://codewalr.us/index.php?topic=778.msg27190#msg27190
TI-84+CSE RAM numbers came from here: https://education.ti.com/en/products/calculators/graphing-calculators/ti-84-plus-c-se?category=specifications
Information about the Z80 came from here: http://segaretro.org/Zilog_Z80
Depends, if it's just a basic math program then no. For big games then YES. The TI-84 has only 3.5MB of space available and has the combo of an ancient Z80 processor and a whopping 128KB of RAM. TI-BASIC is also quite slow as it's interpreted (look it up for further information) so if you to make fast-running games then YES. Optimization is very important.

Usage of frame pointer optimizations

This is related but NOT the same as frame pointer omitting ? Any risk?
I am trying to follow this old (but still relevan article)
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (author writes)
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However in the discussion further down the page one user writes
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of question.
Spill code == Register spillage?
Is the author correct that FPO is generally considered a pain and
the gain doe not out-weigh the benefits.
Is FPO still relevant today in x64 architecture since there are a
LOT more registers o play with.
Do you use FPO? What for (if yes) and does it make a difference to
you?
Finally in this article
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with repect to Windows x64 calling convention].....
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That’s very handy for debugging, and perhaps necessary for x64′s stack metadata (another
point I’ll come back to another time). ...... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?
1.Spill code == Register spillage?
Almost. Stricly speaking, spill code is the code added by the compiler to implement a register spill. The spill itself is the decision to tag a live range as not able to be placed in a register.
2.Is the author correct that FPO is generally considered a pain and the gain doe not out-weigh the benefits.
The author is probably correct that in modern processor architectures, the kinds of functions where FPOs will generate a significant performance gain is a smaller set than in the past. Yet FPO's do make code smaller, reducing cache pressure. They do reduce register pressure. These can be important in some settings. They do speed up prolog and epilog code by a few instructions. The point about debuggers not working well without the FP is noteworthy. It means core dumps are less useful for post mortems on production-optimized code. You'd never want to use FPO during development except for final testing.
3.Is FPO still relevant today in x64 architecture since there are a LOT more registers o play with.
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4.Do you use FPO? What for (if yes) and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in run time overall under gcc. This was a native language extension to a scripting language that had to compile under both gcc and Visual C. Using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars. The decision would have been different.
5.Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation here that MS has defined the ABI to require home space for all data even if it's whole lifetime is spent in a register.
1) When you need to use a register and don't have any unused ones, you need to write code to save some register value on the stack and later restore it.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists, and is organized more efficiently (away from the hot code), so there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
Optimize excess away) Not sure what your question is. The space consumption for the home area isn't a problem. The benefit of not having to adjust any stack pointers is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack pointer gymnastics.

gfortran change/find out write buffer size

I have this molecular dynamics program that writes atom position and velocities to a file at every n steps of simulation. The actual writing is taking like 90% of the running time! (checked by eiminating the writes) So I desperately need to optimize that.
I see that some fortrans have an extension to change the write buffer size (called i/o block size) and the "number of blocks" at the OPEN statement, but it appears that gfortran doesn't. Also I read somewhere that gfortran uses 8192 bytes write buffer.
I even tried to do an FSTAT (right after opening, is that right?) to see what is the block size and number of blocks it is using but it returns -1 on both. (compiling for windows 64 bit)
Isn't there a way to enlarge the write buffer for a file in gfortran? Will it be diferent compiling for linux than for windows?
I'd really really rather stay in fortran but as a desperate measure isn't there a way to do so by adding some c routine?
thanks!
IanH question is key. Unformatted IO is MUCH faster than formatted. The conversion from base 2 to base 10 is very CPU intensive. If you don't need the values to be human readable, then use unformatted IO. If you want to be able to read the values in another language, then use access='stream'.
Another approach would be to add your own buffering. Replace the write statement with a call to a subroutine. Have that subroutine store values and write only when it has received M values. You'll also have to have a "flush" call to the subroutine to cause it to write the last values, if they are fewer them M.
If gcc C is faster at IO, you could mix Fortran and C with Fortran's ISO_C_Binding: https://stackoverflow.com/questions/tagged/fortran-iso-c-binding. There are examples of the use of the ISO C Binding in the gfortran manual under "Mixed Language Programming".
If you spend 90% of your runtime writing coords/vels every n timesteps, the obvious quick fix would be to instead write data every, say, n/100 timestep. But I'm sure you already thought of that yourself.
But yes, gfortran has a fixed 8k buffer, whose size cannot be changed except by modifying the libgfortran source and rebuilding it. The reason for the buffering is to amortize the syscall overhead; (simplistic) tests on Linux showed that 8k is sufficient and more than that goes far into diminishing returns territory. That being said, if you have some substantiated claims that bigger buffers are useful on some I/O patterns and/or OS, there's no reason why the buffer can't be made larger in a future release.
As for you performance issues, as already mentioned, unformatted is a lot faster than formatted I/O. Additionally, gfortran has rather high per-IO-statement overhead. You can amortize that by writing arrays (or, array sections) rather than individual elements (this matters mostly for unformatted, for formatted IO there is so much to do that this doesn't help that much).
I am thinking that if cost of IO is comparable or even larger than the effort of simulation, then it probably isn't such a good idea to store all these data to disk the first place. It is better to do whatever processing you intend to do directly during the simulation, instead of saving lots of intermediate data them later read them in again to do the processing.
Moreover, MD is an inherently highly parallelizable problem, and with IO you will severely cripple the efficiency of parallelization! I would avoid IO whenever possible.
For individual trajectories, normally you just need to store the initial condition of each trajectory, along with its key statistics, or important snapshots at a small number of time values. When you need one specific trajectory plotted you can regenerate the exact same trajectory or section of trajectory from the initial condition or the closest snapshot, and with similar cost as reading it from the disk.

CUDA: optimize latency due to iterative process

I have an iterative computation that involves a Fourier transform in each iteration.
in high level it looks like this:
// executed in host , calling functions that run on the device
B = image
L = 100
while(L--) {
A = FFT_2D(B)
A = SOME_PER_PIXEL_CALCULATION(A)
B = INVERSE_FFT_2D(A)
B = SOME_PER_PIXEL_CALCULATION(B)
}
I am using "cufft" library to do the transforms.
now the problem is that I am always working with global memory,
basically if there was a way of doing some of the work with shared memory it would be great,
but it seems like using FFT won't allow me to bypass this, given "cufft" library functions can only be called from the host, and stores input and output in global memory.
how should I tackle this?
thanks.
EDIT:
since there IS a data dependency. it would seem like I can't do much but optimize the 'per pixel' calculations...
the bottleneck is still due to the fact that the kernels pass the data via global memory .which seems unavoidable in this case.
so basically the fact that I have to do the transform an it's inverse is what keeps me from sharing intermidiate computation data.
currently I am exploring ways of doing most of the calculation in the frequency space.
( more of a math problem )
so does anyone has a good idea on how to approximate F{max(0,f(x,y))} given F{f(x,y)} ?
EDIT:
note that f(x,y) is in the time domain, and therefore is real valued,
f(x,y) is also processed before calculating pointwise max(0,f(x,y)), so it is indeed possible for negetiv values to appear.
Concerning the FFT/IFFT, I think you are wrongly assuming that the CUFFT routine does not internally use shared memory. Typical algorithms for FFT calculations split the entire FFT into smaller ones fitting one thread block and so probably they already internally exploit shared memory, see for example the paper.
Concerning the PER_PIXEL_CALCULATIONS, shared memory is typically used to make threads within a thread block cooperate each other. My question is: are the PER_PIXEL_CALCULATIONS independent each other? If so, perhaps thread cooperation is not needed and you would not need shared memory either and arrange the calculations by using only registers.
Anyway, to be more specific on the latter point, you should provide more information on what you actually need (by editing your original post). Is your code related to an implementation of the Gerchberg-Saxton algorithm?

Bit Fields vs Integer

I'm writing an API that gets information about the CPU (using CPUID). What I'm wondering is should I store the values from the bit field returned by calling CPUID in separate integer values, or should I just store the entire bit field in a value and write functions to get the different values on-the-fly?
What is preferable in this case? Memory usage or speed? If it's memory usage, I'll just store the entire bit field in a single variable. If it's speed, I'll store each value in a separate variable.
You're only going to query a CPU once. With modern computers having both huge amounts of memory and processing power, it would make no difference either way.
Just do what would make more sense for the next person who reads it.
Programs must be written for people to read, and only incidentally for machines to execute.
— The Structure and Interpretation of Computer Programs
I think it does not matter here, b/c you will not call your CPU-id code 10000 times per second.. will you?
I think you can define different interface (method) for different value. this is more clear and easy to use. a clear, accuracy & easy to use of interface should be the first thing to consider, then performance (memory usage & speed).