assembly language and optimization [closed]

Closed 9 years ago.
How can programming in assembly help in achieving optimization?

The most likely way programming in assembly can improve your code is by improving you: teaching you more about what is happening at a low level and getting the discipline of optimization can help you make good decisions in higher-level languages.
As far as actually helping one program: as others have noted it's rarely worth it. It's just possible you can use it as a kind of advanced profile-driven optimization: try many variations until you find one that's best on your particular problem.
To start with this: write a program in C or C++ or whatever compiled language you normally use, fire up your debugger, disassemble a small but nontrivial function, and have a think about why the compiler did what it did. Then try writing a small bit of inline assembler yourself. On modern systems assembly is most easily embedded within C rather than done from scratch.
Or alternatively, get a teeny machine like a PIC and make it flash an LED...

These days, you have to be very good at assembly to beat the compiler.
I can do it any day of the week, but only by viewing the compiler's output first.
And then, if it gains more than a couple of percentage points I'd be surprised.
These days, I only program in assembly when I'm doing something the compiler can't do.

In principle, you can write highly-optimized code in assembly because the compiler is limited to specific, general-purpose optimizations that should apply to many programs, while you can be creative and use your knowledge of this particular program.
To take a simple example, back when I was new to this business compilers were very limited in their ability to optimize register usage. You know that to perform any sort of arithmetic or logical operation, the CPU must generally load one of the values into a register, then perform the operation on the other, then save the result? Like to add two numbers together -- and I'll use a pseudo-assembler here because I don't know what assembly languages you know and I've forgotten most of the details myself -- you'd write something like this:
LOAD A,value1
ADD A,value2
STORE A,destination
Compilers used to generate the loads for every operation. So if your C program said:
x=x+y;
z=z+x;
The compiler would generate something like:
LOAD A,x
ADD A,y
STORE A,x
LOAD A,z
ADD A,x
STORE A,z
But a human could observe that by the time we get to the second statement, register A already contains x, and addition is commutative, so we could optimize this to:
LOAD A,x
ADD A,y
STORE A,x
ADD A,z
STORE A,z
Et cetera. One could go through all sorts of tiny micro-optimizations like this. I used to do that all the time back when I was young and the world was green.
But over the years compilers have gotten much smarter, and CPUs have gotten more powerful so the micro-optimizations don't matter as much.
Thus, I haven't written any assembly language code in, wow, probably 15 years. I used to read the assembly generated by the compiler when debugging; sometimes it would give a clue to a subtle problem, but I haven't done that in years now either.
I don't think compilers are even written in assembly any more. Instead, you write the first draft of the compiler in a high level language on some other computer, i.e. you write a cross-compiler to get yourself off the ground.
I suspect the only real use of assembly today is for extremely constrained environments, embedded systems and that sort of thing; and for programs that have to deal intimately with the hardware, like device drivers.
I'd be interested to hear if there are any assembly programmers on this forum who care to tell us why they are assembly programmers.

Programming in assembly won't, in and of itself, optimize your code. The main thing about assembly is that it allows you to have very low-level access and to choose exactly what instructions the processor executes.
Since you won't have some compiler generating the assembly for you, you can perform code optimizations when you write the program yourself, if you know how.

So, you think you are smarter than gcc optimizing compiler?
If not, then fughed aboud it (learning assembly for the sake of getting better at optimization). That would be akin to learning Scheme language for the sake of getting better at recursion :)

In general, the compiler will do a fairly good job at generating optimal code. There are, however, cases where writing your own assembly can result in even more optimized (in terms of space and/or speed) code.
Typically, this happens when there is something that you know about the target system that the compiler doesn't. Compilers are designed to work on a variety of systems; if you want to take advantage of something unique to your target system, sometimes you have to go in and do it yourself. Here's an example. A few months ago, I was writing some code for a MIPS-based embedded system. There are many different types of MIPS CPUs, and some support certain opcodes that others do not. My compiler would generate MIPS code using the set of assembly operations that all MIPS architectures support. However, I knew that my chip could do more. I had a subroutine that needed to count the number of leading zeroes in a 32-bit number. The compiler synthesized this into a loop that took about 10 lines of assembly to do. I re-wrote it in one line by using the CLZ opcode that was designed to do just this. I knew that my chip supported the opcode but the compiler didn't. Admittedly, situations like this aren't very common; when they do pop up, however, it's nice to have enough of a background in assembly to take advantage of them.
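To make the leading-zero example concrete, here is a rough sketch in C rather than MIPS assembly. The function names are made up for illustration, and the builtin path assumes GCC or Clang with a 32-bit unsigned int. The first function is the kind of loop a compiler might emit for a baseline target; the second asks the compiler for the dedicated instruction where the target has one.

```c
#include <stdint.h>

/* Portable fallback: shift left until the top bit is set, counting steps. */
unsigned clz32_loop(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* GCC/Clang builtin (assumes unsigned int is 32 bits wide).
   __builtin_clz is undefined for 0, so that case is handled separately. */
unsigned clz32_builtin(uint32_t x)
{
#if defined(__GNUC__)
    return x ? (unsigned)__builtin_clz(x) : 32u;
#else
    return clz32_loop(x);
#endif
}
```

Whether the builtin actually becomes a single CLZ-style instruction depends on the compiler and on the target options it is given, so it is worth checking the generated assembly.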

Sometimes one will need to perform a task which maps particularly well onto some CPU instructions, but does not fit well into any high-level-language constructs. For example, on many processors one may easily perform extended-precision arithmetic using something like:
add r0,r4
addc r1,r5
addc r2,r6
addc r3,r7
This will regard r3:r2:r1:r0 and r7:r6:r5:r4 as numbers four words long, adding the second to the first. Four nice easy instructions, and anyone who understands assembly would know what they do. I know of no way to perform the same task in C without it not only generating bigger and slower object code, but also being an incomprehensible mess of source code.
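For comparison, one way the same multi-word addition might be written in portable C is sketched below. It assumes 32-bit words and a compiler that provides a 64-bit integer type, which may not hold on the small targets the answer has in mind; the name add_words is made up for illustration. The carry has to be recovered by hand from a wider intermediate, which is the part the addc instruction does for free.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds src into dst; both are stored least-significant word first.
   Returns the carry out of the most significant word (0 or 1). */
uint32_t add_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t sum = (uint64_t)dst[i] + src[i] + carry;
        dst[i] = (uint32_t)sum;         /* keep the low 32 bits */
        carry = (uint32_t)(sum >> 32);  /* bit 32 is the carry into the next word */
    }
    return carry;
}
```

Calling add_words(a, b, 4) corresponds to the four-instruction sequence above. Whether a compiler turns the loop into an add/addc chain depends on the compiler and target.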
A somewhat more extreme but specialized real-world example: Given two arrays array1[0..63] and array2[0..63], compute array1[0]*array2[0] + array1[1]*array2[1] + array1[2]*array2[2] ... + array1[63]*array2[63]. On a DSP I used, the computation could be done in machine code in about 75 machine cycles (about 67 of which are a repeating MAC instruction). There's no way C code could come anywhere close.
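For reference, the computation itself is just a 64-element dot product. A plain C version might look like the sketch below; the 16-bit element type and the 64-bit accumulator are assumptions, since the answer does not say what the DSP's data types were.

```c
#include <stdint.h>

#define DOT_LEN 64

/* Multiply-accumulate over two 64-element arrays.
   Each 16x16 product fits in 32 bits; the sum is kept in 64 bits
   so that it cannot overflow. */
int64_t dot64(const int16_t *array1, const int16_t *array2)
{
    int64_t acc = 0;
    for (int i = 0; i < DOT_LEN; i++)
        acc += (int32_t)array1[i] * array2[i];
    return acc;
}
```

On a DSP with a repeating MAC instruction, each loop iteration here corresponds to roughly one machine cycle of the hand-written version; how close a given C compiler gets is something to check in its output.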

About the only time I can think of using assembly language for optimizing code is when you need something very specific, like you need a GPIO on a microcontroller to toggle between high and low exactly every 9 clock cycles. That's too short a time to manage with an interrupt, and higher-level language compilers don't normally offer this kind of control over the instruction stream.

Typically you wouldn't program in assembly. You would program in C, and then look at the generated assembly to see what optimizations (or not) the C compiler made automatically. Adjusting your C code (to allow for better vectorization, for example) will allow the compiler to re-arrange code better, which will give you optimized assembly.
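One small example of that kind of adjustment, as a sketch: the C99 restrict qualifier promises the compiler that the arrays do not overlap, which can make the loop easier to vectorize. The function name is made up, and whether it helps in a particular case is something to confirm by reading the generated assembly.

```c
#include <stddef.h>

/* Without restrict, the compiler has to allow for dst overlapping a or b
   and may be more conservative when vectorizing the loop. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The promise is the caller's responsibility: passing overlapping arrays to a function declared this way is undefined behavior.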

More likely than beating the compiler at writing assembly code, knowing how typical tasks translate to assembly may help you write better high-level language code.
Typically you do not resort to assembly for optimization purposes. If this is possible, usually someone will already have provided the essential code ready for you to call, for example in the form of a linear algebra library.
Likewise, assembly offers direct access to the processor (e.g. for atomicity, time measurement, I/O), but the important accesses will already have been made accessible for your high-level language.

Compilers do a good job of generating assembler.
However, there's a bad reason why hand-written assembler is faster. Since it's harder to write, you write less of it.
It would be nice if programmers could discipline themselves to get the same job done in minimal code, regardless of language.

When writing assembly, or even just the straight raw bytes the assembler outputs, you can write programs that use hardware-specific features or that do something in an otherwise very carefully specified way.
There might be really high benefits if your program does the optimized part far more often than it does anything else. Always set up benchmarks before attempting optimizations.
The downside is that your hand-written assembly works on less hardware. It may even end up limited to a particular hardware model and revision!
It's rare you ever can or need to write assembly routines because commonly written software must work on almost every hardware you find and your kitten.
There's one interesting application if you know assembly. You can then write programs that produce assembly routines. Though it's mostly only fun unless you keep it really small so you can port it easily.

Read the Graphics Programming Black Book by Michael Abrash

In most modern applications, it can't to any significant degree.
Inter-Process Communication Affects Application Response Time explains why algorithms are unlikely to be bottlenecks. (But always profile - never guess.)
In general, programming in assembly will increase time-to-market, bug density, and maintenance costs. Instead, strive for simplicity and readability in your code.
As poolie mentioned, the main benefit of learning assembly today is a deeper understanding of software and hardware. From that perspective, there's quite a bit of information on Steve Gibson's site.

If you understood why there is sometimes a need to write asm, you would appreciate both its strengths and its costs (headaches for you).


Analyzing the speed-up of Oracle's HotSpot versus other compilation techniques

I'm currently working on a project that must involve research of JIT techniques. I'm a complete beginner when it comes to anything related to compilers, but I did some research and learned about Java's HotSpot VM. I was hoping to do an analysis of the benefits (or downsides) of using HotSpot versus traditional compilers (for example, g++).
My initial idea was to create some sort of simple program that can be run through both compilers in order to compare compilation times but this brought up a number of questions:
From my understanding, Java source code is initially turned into bytecode by the javac compiler (creating .class files) and then, in turn, this bytecode can be run through HotSpot at runtime to execute the program. Given this, would it even be relevant to compare results with a traditional compiler that converts sources directly to machine code?
Another concern I'm facing is that the programs would be in different languages (ex. C++ vs Java). Although the functionality would be identical, could this skew results when attempting to compare?
Moving on, if the above two points are not a problem, my main question is:
How can I actually go about benchmarking the speed-up in one method versus the other?
I did some brief research about this but all I was able to find were ways to measure the efficiency of the program itself, not the compilation technique used to run it. Is what I'm trying to do possible? Are there methods to actually analyze the speed up of one compiler over another?
Any help is appreciated!
"How can I actually go about benchmarking the speed-up in one method versus the other?"
You first need to consider what you actually intend to measure. In other words, saying "the speed-up" is not sufficiently rigorous.
Are we talking about CPU cycles spent compiling? Or walltime from source code to running program? Or peak performance of a few critical methods in a micro benchmark? Overall steady-state program performance? Speed of program initialization? ...
In the end you're comparing two systems that made quite different trade-offs. You can find a few roughly comparable benchmarks already mentioned in the comments, but they mostly represent a specific type of throughput-bound task and not large applications. It's not like you can find an application such as Firefox written in both C and Java with identical feature sets and comparable code quality. So any comparison you do will be incomplete, because you'll have to use some limited proxy measurement of how comparable the two code bases are.
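As a sketch of what pinning down one of those measurements can look like, the C code below times only steady-state runs and discards a number of warm-up runs first. The names and the toy workload are hypothetical. Warm-up matters much more for a JIT such as HotSpot than for ahead-of-time compiled C, but the same harness shape would apply on the Java side.

```c
#include <time.h>

/* Hypothetical workload signature: takes a size, returns a result. */
typedef long (*workload_fn)(long);

/* Toy workload, for illustration only: sum of 1..n. */
long sum_to(long n)
{
    long s = 0;
    for (long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Returns the CPU seconds consumed by `runs` calls of fn(arg), measured
   after `warmup` untimed calls. The accumulated results are written to
   *sink so the calls are less likely to be optimized away. */
double time_steady_state(workload_fn fn, long arg, int warmup, int runs,
                         long *sink)
{
    long acc = 0;
    for (int i = 0; i < warmup; i++)
        acc += fn(arg);

    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        acc += fn(arg);
    clock_t end = clock();

    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Note that clock() reports CPU time, not wall-clock time, so start-up cost or time from source to running program would need a different measurement.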

Why aren't object oriented languages popular in the embedded world? [closed]

Closed 9 years ago.
I'm a firmware developer and I usually develop firmware in C or assembly. However, I came across a project completely implemented in C++ in our embedded library. Now I know object-oriented languages can be used on a hardware level, but I would like to know why they aren't that popular when developing embedded systems.
The real reason: because of conceptual complexity. C and assembly provide a simple mental model to track what is going on in the system. Object-oriented programs require a more complex model that makes it harder to reason about what is going on.
Embedded systems tend to be environments that call for very tight control about what is going on in the system versus the more open ended server and PC environment. This requires both simple and transparent programming constructs. Both C and assembly provide a high level of visibility on what is really going on in the system at the lowest hardware level.
Object oriented languages in general and C++ in particular, abstract away many of the details of what is going on in the system when code is executed, thus making it much harder to reason about the inner working of the system.
Here is an example to explain what I mean. Consider the following code snippet:
i++;
Seeing this in a C program gives us a mostly accurate idea about what this does and an order of magnitude about how many CPU cycles are used, how many registers are involved and so on.
Now what will the same line do in a C++ program? Well, it depends. It depends on what type i is and how the ++ operator has been overloaded. See what I mean?
None of this is to say that C++ or object-oriented programming is bad. It is not. It does require a much more complex mental model if one is interested in the minute details of what is really going on in the system, as many embedded developers feel they need to be.
From a technical point of view, embedded systems have limited resources. Object-oriented languages tend to create much bigger binaries than pure procedural ones, thus many would choose something that's as light as possible. For instance, I work for a smartcard company and my team is the one handling extremely low-cost cards, with RAM ranging only from 1.5 to 1.75 KB and EEPROM from 96 to 136 KB. For this kind of embedded environment, most object-oriented languages (especially the heavy ones such as Java) won't fit. We don't even use ANY standard C library; everything is written from scratch for size. C++ might fit with proper coding technique and compiler options (no RTTI, minimal VMTs, only stack-based objects, etc.), but that's just my guess.
"OO languages" is too broad. There are lots of object-oriented languages with radically different characteristics. "It can be done in C++" doesn't imply "it can be done in any OO language". Good luck writing a Python program for a less powerful AVR MCU, for example. The device has 2 kB of RAM and 32 kB of Flash memory; the Python interpreter itself doesn't even fit into them.
C++ is a language with high-level and low-level parts at the same time. It's object-oriented, but at the end, your nice OO code will be compiled down to raw machine code, just as if you wrote it in C or assembly directly. Some other object-oriented languages, which are considered "higher-level" (or high-level-only, rather), can't do the same thing. It's all about the implementation of a particular language, really.
One addition to what all others said:
We don't write much embedded C++ code because the customer demands C. In my field, code may need to be certified, and certification guidelines exist only for C, not C++.
Therefore the project has to be implemented in C even if C++ would lead to a better product.
Because many developers "think" that C++ is not suitable for embedded environments at all because of code size and performance, and this is not true in most cases!
I strongly recommend reading those slides, which discuss C++ usage for embedded systems and C++ myths like:
The "Bloat" Myth!
The "Poor Performance" Myth!
Many compiler vendors for embedded targets provide C++ compilers, like Keil, IAR, and CodeRed; processor manufacturers also provide toolchains with C++ compilers, e.g. Texas Instruments, Freescale, ...
Generally, the developer needs to consider C++ when starting a new project and decide whether to use it based on project needs and what OOP/C++ could provide to get the job done on time and on budget!

Is it possible to beat the Assembler today? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Apparently it used to be, according to this terrific account by Ed Nather. How about today? That is, is it possible, with enough knowledge of CPU/FPU/GPU/etc. architecture, to write machine code that is more efficient than what would be produced by a mainstream assembler (nasm, GAS, etc.), in any scenario? How about for GPU kernels?
EDIT: "Not constructive"? Please. This question produced #Pointy's answer, which was quite enlightening to anyone not that familiar with how assemblers work. Someone has favorited it. The fact that Pointy is, endearingly, one of the close-voters is a nice touch but hey if it's the best answer it gets accepted.
Two things:
There's no single thing called "assembly language". An assembler is a program that translates a textual encoding of instructions for a particular architecture into a form suitable for execution. Exactly what facilities a particular assembler exposes is up to its designer. Many CPU architectures have several assemblers available.
Because an assembler's job is to provide a "friendly" way for a person to request a precise sequence of machine instructions (and other aspects of a program, such as initialized memory locations, reserved blocks of storage, directives to the runtime executive, etc), if it's possible to produce a program by hand that can't be produced by some particular assembler then that really only means you've got an inadequate assembler. The assembler that Intel developed for the iAPX 86 series (not Microsoft's masm, which was a weak imitator) had a fairly typical macro facility, and it also had a sort of "micro macro" facility that would allow the dictionary of opcode mnemonics (things like MOV, ADD, BNE, etc) to be extended arbitrarily. With an assembler like that, it would clearly be possible to create any piece of code you desired.
The real topic for concern in programming is whether burdening the programmer with the responsibility for choosing a strategy for getting work done by a computer in extreme detail is worthwhile for performance. The question of course has no single answer, because there are many possible situations, many different computing devices, and mostly because things change all the time. In 1959, for example, the computing task of translating a higher-level language like FORTRAN into machine code was itself a significant workload for computers. Understanding of how programming languages should even work was in its infancy.
Today, then, the only reason to know "machine language" (and note that the word "language" isn't really accurate) is to create an instruction sequence when there's no available (or convenient) assembler. That's assuming that explicitly creating a particular instruction sequence is better than using a higher-level language for some reason. Even then, it's generally the case that if you were doing that now you'd be writing software in some high-level language to emit the chosen instruction sequence; that is, you'd effectively create a "domain-specific assembler" for some task. A good example would be the code in something like a virtual machine interpreter that builds machine language blocks on-the-fly, like a Java or JavaScript VM.
The assembler takes assembly language and turns it into machine code. Ideally, though this is not always the case, the assembly language has a one-to-one relationship with the machine code instructions. MOST of the time the translation from the assembly language syntax for an instruction to the machine code will be identical whether the assembler does it or it is done by hand. Naturally there are some don't-care bits from time to time, and the assembler and the human may choose different don't-care bits, so the result doesn't have to be a bit-for-bit match, but lengthwise and speedwise they will be identical, no difference at all.
The differences between the human and the assembler software will be, if any, where the assembly language is not a one to one relationship with the machine code, and/or for various reasons the programmer wants the assembler to take care of something. This could be pseudo instructions, or macros, or things having to do with externally defined variables.
Assembly language is a loaded term as it is defined by the particular assembler, you can have many different and non-compatible assembly languages for the same processor. And you can have assembly languages where there are instances where the language does not completely describe all the information needed to choose the specific instruction, near vs far jumps for example for some instruction sets with some assemblers.
So if you want to compare apples to apples, there will be no difference between hand-assembled code and software-assembled code. Apples to apples means the code in question is written precisely enough that both the software and the human assembler can assemble it. If you do find differences other than don't-care bits, they probably come from an optimization in which the human assembler changed the code; to make the comparison fair, the assembly language source can and should be changed to match. This difference would have nothing to do with human versus software assemblers, but with one programmer's program compared to another's. Basically, you could get the same result in assembly language with the software assembler.
A skilled assembly-language programmer who is targeting a very specific run-time environment can probably produce code which will run better than a compiler would produce. Depending upon the nature of the code, the performance improvement may or may not be significant relative to the amount of work required.
On the other hand, frameworks such as Java or .NET allow a programmer to compile software to an "intermediate" form which can, on demand, be translated into machine code which includes specific optimizations for the environment where it's actually running. Code which is compiled to run on "any platform", when run by a framework engine which was hand-tweaked for the platform it's running on, may not run as well as assembly code that was hand-tweaked for that platform, but better than code which was hand-tweaked to optimize performance on some other platform.
"Apparently it used to be, according to this terrific account by Ed Nather. "
I remember reading a story like that decades ago.
It's been doing the rounds for a long time.
The blackjack bit is an embellishment, but the code dropping out at the right point and the weeks spent trying to figure out how it worked before the penny finally dropped definitely ring a very old bell.

decompilation resources and theory

There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular HW platform. I'm looking for good books/research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has to a greater or lesser degree optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
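To illustrate why optimisation gets in the way, here is a small made-up example in C of loop-invariant hoisting. A decompiler working from optimised object code would tend to recover something shaped like the second function, and has no way of knowing that the author wrote the first. This particular transformation happens to be expressible in C; many others are not.

```c
#include <stddef.h>

/* As the programmer might have written it: scale * offset is
   recomputed on every iteration. */
void fill_original(int *out, size_t n, int scale, int offset)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i + scale * offset;
}

/* What the optimised code effectively does: the invariant product
   has been hoisted out of the loop into a temporary. */
void fill_hoisted(int *out, size_t n, int scale, int offset)
{
    const int k = scale * offset;
    for (size_t i = 0; i < n; i++)
        out[i] = (int)i + k;
}
```

Both functions produce the same output, so the information about which form was in the source is simply gone from the object code.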
In cases where the source is for a well defined VM, then often this optimisation is left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
As a counter, there are code obfuscators that deliberately perform transformations on the code that prevent simple restoration of the original source by scrambling names, changing the sequence in which code is generated (without changing its resulting meaning), and introducing constructs for which there is no source language equivalent.

How do you organize code in embedded projects?

Highly embedded (limited code and ram size) projects pose unique challenges for code organization.
I have seen quite a few projects with no organization at all. (Mostly by hardware engineers who, in my experience, are not typically concerned with non-functional aspects of code.)
However, I have been trying to organize my code accordingly:
hardware specific (drivers, initialization)
application specific (not likely to be reused)
reusable, hardware independent
For each module I try to keep the purpose to one of these three types.
Due to the limited size of embedded projects and the emphasis on performance, it is often difficult to keep this organization.
For some context, my current project is a limited DSP application on an MSP430 with 8k flash and 256 bytes RAM.
I've written and maintained multiple embedded products (30+ and counting) on a variety of target micros, including MSP430s. The "rules of thumb" I have been most successful with are:
Try to modularize generic concepts as much as possible (e.g. separate driver code from application code). -- It makes for easier maintenance and reuse/porting of a project to another target micro in the future.
DO NOT start by worrying about optimized code at the very beginning. Try to solve the domain's problem first and optimize second. -- Your target micro can handle a lot more "stuff" than you might expect.
Work to ensure readability. Although most embedded projects seem to have short development-cycles, the projects often live longer than you might expect and another developer will undoubtedly have to work with your code.
I've worked on 8-bit PIC processors with similar limitations.
One restriction you don't have is how many comments you make or what you choose to name your methods, variables, etc. Take advantage. Speed and size constraints do sometimes trump organization, but you can always explain.
Another tip is to break up a logical source file into even more pieces than you need, then bind them by #include-ing them in a compilation unit. This allows you to have lots of reusable code (even one routine per file) but combine it in whatever order you need. This is useful e.g. when trying to meet compilation unit size restrictions, or to pick and choose which common subroutines you need on the next project.
I try to organize it as if I had unlimited RAM and ROM, and it usually works out fine. As mentioned elsewhere, do not try to optimize it until you absolutely need to.
If you can get a pin-compatible processor that has more resources, it's better to get it working on that, concentrating on good structure and layout, then optimize for size later when you understand the code better.
Except under exceptional circumstances (see note), the organisation of your code will have no impact on the final product. (contents of the code are obviously a different matter)
So with that in mind you should organise your code as you would any other project.
With that said, the following are fairly typical:
If this is a processor that you've worked on before, or will be working on in the future, you will usually want to keep a dedicated hardware abstraction layer (HAL) that can be shared between projects in the future. Typically this module would contain items like routines for managing any UARTs, timers, etc.
Usually it's reasonable to maintain a set of platform-specific code for initialisation and setup that performs all of the configuration and initialisation up to the point where your executive takes over and runs your application. It will also include platform-specific HAL routines.
The executive/application is probably maintained as a separate module. All of the hardware-specific code should be hidden in the HAL (as mentioned above).
By splitting your code up like this you also have the option of compiling and running your application as a simulation, on a completely different platform, just by replacing the hardware specific code with routines that mimic the hardware.
This can be good for unit testing and for debugging any algorithmic problems you might have.
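A minimal sketch of that split in C is below. All the names (struct hal, uart_write, sim_hal and so on) are invented for illustration, and a real HAL would cover far more than one UART routine. The application code only talks to the HAL interface, so a simulated back end can stand in for the real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical HAL interface: the application sees only this table. */
struct hal {
    void (*uart_write)(uint8_t byte);
};

/* Application code: hardware independent, knows nothing about registers. */
void app_send_string(const struct hal *hal, const char *s)
{
    while (*s)
        hal->uart_write((uint8_t)*s++);
}

/* Simulation back end: captures bytes in a buffer instead of driving
   a real UART, so the application can run on a desktop machine. */
char sim_buffer[64];
size_t sim_len = 0;

void sim_uart_write(uint8_t byte)
{
    if (sim_len < sizeof sim_buffer - 1) {
        sim_buffer[sim_len++] = (char)byte;
        sim_buffer[sim_len] = '\0';
    }
}

const struct hal sim_hal = { sim_uart_write };
```

On the target, a second struct hal would point at routines that write to the actual UART registers; a test on the host calls app_send_string(&sim_hal, "ok") and then inspects sim_buffer.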
Note: exceptional circumstances are those that might be imposed by unusual compiler restrictions, e.g. I've come across some compilers that expect all interrupt service routines to be compiled within a single object file.
I've worked with some sensors like the Tmote Sky. I too have seen poor organization, and I have to admit I have contributed to it. Anyway, I'd say that some confusion is unavoidable, because loading too many modules or too many parts of a program will (imho) be a resource killer too, so try to be aware of the trade-off between organization and usability on low-resource devices.
Obviously this doesn't mean letting chaos reign, but, for example, take a look at the organization of the TinyOS source code and applications; it gives an idea of what I'm trying to say.
Although it is a bit painful, one organization technique that is somewhat common with embedded C libraries is to split every single function and variable into a separate C source file, and then aggregate the resulting collection of object (.o) files into a library file.
The motivation for doing this is that for most normal linkers the unit of linkage is an object file: for every object file you either get the whole thing or none of it. Since there is a 1-1 relationship between C files and object files, putting each symbol in its own C file gives each one its own object file. This in turn lets the linker pull in only that subset of functions and variables that are actually used.
This sort of game doesn't help at all for headers; they can happily be left as single files.