Is there any real use for self-modifying code?
I know that it can be used to build worms/viruses, but I was wondering whether there is some good reason a programmer may have to use self-modifying code.
Any ideas? Hypothetical situations are welcome too.
Turns out that the Wikipedia entry on "self-modifying code" has a great list:
Semi-automatic optimization of a state-dependent loop.
Runtime code generation, or specialization of an algorithm at runtime or load time (which is popular, for example, in the domain of real-time graphics), such as a general sort utility preparing code to perform the key comparison described in a specific invocation.
Altering of inlined state of an object, or simulating the high-level construction of closures.
Patching of subroutine address calling, as done usually at load time of dynamic libraries, or, on each invocation, patching the subroutine's internal references to its parameters so as to use their actual addresses. Whether this is regarded as 'self-modifying code' or not is a case of terminology.
Evolutionary computing systems such as genetic programming.
Hiding of code to prevent reverse engineering, as through use of a disassembler or debugger.
Hiding of code to evade detection by virus/spyware scanning software and the like.
Filling 100% of memory (in some architectures) with a rolling pattern of repeating opcodes, to erase all programs and data, or to burn-in hardware.
Compression of code to be decompressed and executed at runtime, e.g., when memory or disk space is limited.
Some very limited instruction sets leave no option but to use self-modifying code to achieve certain functionality. For example, a "One Instruction Set Computer" machine that uses only the subtract-and-branch-if-negative "instruction" cannot do an indirect copy (something like the equivalent of "*a = **b" in the C programming language) without using self-modifying code.
Altering instructions for fault-tolerance.
On the point about thwarting hackers using self-modifying code:
Over the course of several firmware updates, DirecTV slowly assembled a program on their smart card to destroy cards that had been hacked to illegally receive unpaid channels. See Jeff's Coding Horror article on the Black Sunday Hack for more information.
I've seen self-modifying code used for:
speed optimisation, by having the program write more code for itself on the fly
obfuscation, to make reverse engineering much harder
In former times, when RAM was limited, self-modifying code was used to save memory. Nowadays, executable compression utilities like UPX are used to decompress and modify the program's own code after loading a compressed image of the application.
Because the Commodore 64 doesn't have many registers and has a 1 MHz processor, when you need to read a memory address offset by a value it is easier to modify the source:
Reader:
LDA $C000      ; read from $C000 (this operand is what gets modified)
STA $D020      ; store the value in the border colour register
INC Reader+1   ; bump the low byte of the LDA operand to read the next address
JMP Reader
That's the last time I wrote self-modifying code anyway :-)
Artificial Intelligence?
Because it's really really cool, and sometimes that's reason enough.
1960s-era assembly languages used self-modifying code to implement function calls without a stack.
Knuth, v1, 1ed p.182:
MAX100 STJ EXIT ;Subroutine linkage
ENT3 100 ;M1. Initialize
JMP 2F
1H CMPA X,3 ;M3. Compare
JGE *+3
2H ENT2 0,3 ;M4. Change m
LDA X,3 ;(New maximum found)
DEC3 1 ;M5. Decrease k
J3P 1B ;M2. All tested?
EXIT JMP * ;Return to main program
In a larger program containing this coding as a subroutine, the single instruction "JMP MAX100" would cause register A to be set to the current maximum value of locations X + 1 through X + 100, and the position of the maximum would appear in rI2. Subroutine linkage in this case is achieved by the instructions "MAX100 STJ EXIT" and, later, "EXIT JMP *". Because of the way the J-register operates, the exit instruction will then jump to the location following the place where the original reference to MAX100 was made.
Edit: It may be hard to see what's going on, even with the brief explanation here. In the line MAX100 STJ EXIT, MAX100 is a label for the instruction (and thus for the procedure as a whole), STJ means STORE the jump register (where we just came from), and EXIT means the memory location labeled 'EXIT' is the target of the STORE. EXIT, we see later, is the label for the last instruction. So it's overwriting code! But many instructions (including STJ here) implicitly overwrite only the operand portion of the instruction word. So the JMP remains untouched, and the * is a dummy token, since there's really nothing meaningful to put there; it'd only get overwritten.
Self-modifying code is also used where register-indirect addressing is not available, and yet the address you need is sitting right there in the register. PDP-1 LISP:
dap .+1 ;deposit address part of accumulator in (IP+1)
lac xy ;load accumulator with (ADDRESS) [xy is a dummy symbol, just like * above]
These two instructions perform ACC := (ACC) by modifying the operand of the load instruction.
Modifications like these are relatively safe, and on antique architectures, they are necessary.
Lots of reasons. Off the top of my head:
Runtime class construction and meta programming. For example, having a class factory that takes a connection to an SQL table and generates a client class specialized for that table (with accessors for the columns, find methods, etc.).
Then of course there's the famous bitblt example, and the regexp analogs.
Dynamically optimizing based on runtime information, a la tracing JITs
Subtype specialization of Ada-style generic functions in an accretive environment.
-- MarkusQ
Dynamic linking is a kind of self-modification (patching absolute and/or relative jump locations) ... that's normally done by the O/S's program loader, though.
Neural networks are kind of self-modifying code.
Then there are evolutionary algorithms which modify themselves.
Mike Abrash described the Pixomatic code generator for Dr. Dobb's Journal a while back: http://www.ddj.com/architect/184405807 . That's a software 3D rasterizer, compatible with DirectX 7(?).
LOL - I've written self-modifying code on two occasions:
when first learning assembly language, before I understood indirect indexed access
accidentally, as pointer bugs in assembly language and C
I can imagine that there may be scenarios where self-modifying code would be more efficient than the alternatives, but nothing obvious leaps to mind. In general, this is something to avoid - debugging nightmare, etc. - unless you are deliberately trying to obfuscate as mentioned above.
Applications which implement their own scripting languages often do this. For example, database servers often compile stored procedures (or queries) this way.
Dynamic code generation in SwiftShader is a form of self modifying code that enables it to efficiently implement Direct3D 9 on the CPU.
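To make the general technique concrete, here is a minimal sketch of runtime code generation in C. It assumes x86-64 Linux and that mmap will grant writable-and-executable memory (hardened systems may refuse PROT_WRITE|PROT_EXEC together); it is not SwiftShader's actual code, just the bare idea of emitting instruction bytes and calling them.

/* Emit a tiny function at runtime that returns its argument plus a
 * constant chosen at runtime, then call it through a function pointer.
 * (Hypothetical example; x86-64 SysV ABI and Linux mmap assumed.) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*add_fn)(int);

int main(void) {
    int k = 42;  /* constant baked into the generated code */

    /* lea eax, [rdi + imm32] ; ret */
    unsigned char code[] = { 0x8D, 0x87, 0, 0, 0, 0, 0xC3 };
    memcpy(&code[2], &k, sizeof k);   /* patch the displacement (little-endian) */

    void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    memcpy(buf, code, sizeof code);

    add_fn f = (add_fn)buf;
    printf("%d\n", f(100));   /* prints 142 */

    munmap(buf, sizeof code);
    return 0;
}

In a more careful implementation you would map the buffer read/write, emit the code, and then mprotect it to read/execute before calling it.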
Modern example:
I have a script that needs a JWT token to work. To request a token, either an interactive login is needed, or a refresh token (which is issued along with a new JWT token) can be used. It would be nice to store the refresh token in the script itself and update it each time the script is executed.
Related
I've been thinking about this writing (apparently) by Mark Twain in which he starts off writing in English but throughout the text makes changes to the rules of spelling so that by the end he ends up with something probably best described as pseudo-German.
This made me wonder if there is interpreter for some established language in which one has access to the interpreter itself, so that you can change the syntax and structure of the language as you go along. For example, often an if clause is a keyword; is there a language that would let you change or redefine this on the fly? Imagine beginning a console session in one language, and by the end, working in another.
Clearly one could write an interpreter and run it, and perhaps there is no concrete distinction between doing this and modifying the interpreter. I'm not sure about this. Perhaps there are limits to the modifications you can make dynamically to any given interpreter?
These more open questions aside, I would simply like to know if there are any known interpreters that allow this at all? Or, perhaps, this ability is just a matter of extent and my question is badly posed.
There are certainly languages in which this kind of self-modifying behavior at the level of the language syntax itself is possible. Lisp programs can contain macros, which allow among other things the creation of new control constructs on the fly, to the extent that two Lisp programs that depend on extensive macro programming can look almost as if they are written in two different languages. Forth is somewhat similar in that a Forth interpreter provides a core set of just a dozen or so primitive operations on which a program must be built in the language of the problem domain (frequently some kind of real-world interaction that must be done precisely and programmatically, such as industrial robotics). A Forth programmer creates an interpreter that understands a language specific to the problem he or she is trying to solve, then writes higher-level programs in that language.
In general the common idea here is that of languages or systems that treat code and data as equivalent and give the user just as much power to modify one as the other. Every Lisp program is a Lisp data structure, for example. This is in contrast to a language such as Java, in which a sharp distinction is made between the program code and the data that it manipulates.
A related subject is that of self-modifying low-level code, which was a fairly common technique among assembly-language programmers in the days of minicomputers with complex instruction sets, and which spilled over somewhat into the early 8-bit and 16-bit microcomputer worlds. In this programming idiom, for purposes of speed or memory savings, a program would be written with the "awareness" of the location where its compiled or interpreted instructions would be stored in memory, and could alter in place the actual machine-level instructions byte by byte to affect its behavior on the fly.
Forth is the most obvious thing I can think of. It's concatenative and stack based, with the fundamental atom being a word. So you write a stream of words and they are performed in the order in which they're written with the stack being manipulated explicitly to effect parameter passing, results, etc. So a simple Forth program might look like:
6 3 + .
Which is the words 6, 3, + and .. The two numbers push their values onto the stack. The plus symbol pops the last two items from the stack, adds them and pushes the result. The full stop outputs whatever is at the top of the stack.
A fundamental part of Forth is that you define your own words. Since all words are first-class members of the runtime, in effect you build an application-specific grammar. Having defined the relevant words you might end up with code like:
red circle draw
That would draw a red circle.
Forth interprets each sequence of words when it encounters them. However it distinguishes between compile-time and ordinary words. Compile-time words do things like have a sequence of words compiled and stored as a new word. So that's the equivalent of defining subroutines in a classic procedural language. They're also the means by which control structures are implemented. But you can also define your own compile-time words.
As a net result a Forth program usually defines its entire grammar, including relevant control words.
You can read a basic introduction here.
Prolog is a homoiconic language, allowing meta-interpreters (MIs) to be implemented in a variety of ways. A meta-interpreter - interpreting the interpreter - is a common and useful native construct in Prolog.
See this page for an introduction to this topic. An interesting and practical technique illustrated there is partial execution:
The overhead incurred by implementing these things using MIs can be compiled away using partial evaluation techniques.
Apparently it used to be, according to this terrific account by Ed Nather. How about today? That is, is it possible, with enough knowledge of CPU/FPU/GPU/etc. architecture, to write machine code that is more efficient than what would be produced by a mainstream assembler (nasm, GAS, etc.), in any scenario? How about for GPU kernels?
EDIT: "Not constructive"? Please. This question produced #Pointy's answer, which was quite enlightening to anyone not that familiar with how assemblers work. Someone has favorited it. The fact that Pointy is, endearingly, one of the close-voters is a nice touch but hey if it's the best answer it gets accepted.
Two things:
There's no single thing called "assembly language". An assembler is a program that translates a textual encoding of instructions for a particular architecture into a form suitable for execution. Exactly what facilities a particular assembler exposes is up to its designer. Many CPU architectures have several assemblers available.
Because an assembler's job is to provide a "friendly" way for a person to request a precise sequence of machine instructions (and other aspects of a program, such as initialized memory locations, reserved blocks of storage, directives to the runtime executive, etc), if it's possible to produce a program by hand that can't be produced by some particular assembler then that really only means you've got an inadequate assembler. The assembler that Intel developed for the iAPX 86 series (not Microsoft's masm, which was a weak imitator) had a fairly typical macro facility, and it also had a sort of "micro macro" facility that would allow the dictionary of opcode mnemonics (things like MOV, ADD, BNE, etc) to be extended arbitrarily. With an assembler like that, it would clearly be possible to create any piece of code you desired.
The real topic for concern in programming is whether burdening the programmer with the responsibility for choosing a strategy for getting work done by a computer in extreme detail is worthwhile for performance. The question of course has no single answer, because there are many possible situations, many different computing devices, and mostly because things change all the time. In 1959, for example, the computing task of translating a higher-level language like FORTRAN into machine code was itself a significant workload for computers. Understanding of how programming languages should even work was in its infancy.
Today, then, the only reason to know "machine language" (and note that the word "language" isn't really accurate) is to create an instruction sequence when there's no available (or convenient) assembler. That's assuming that explicitly creating a particular instruction sequence is better than using a higher-level language for some reason. Even then, it's generally the case that if you were doing that now you'd be writing software in some high-level language to emit the chosen instruction sequence; that is, you'd effectively create a "domain-specific assembler" for some task. A good example would be the code in something like a virtual machine interpreter that builds machine language blocks on-the-fly, like a Java or JavaScript VM.
The assembler takes assembly language and turns it into machine code. Ideally, though it is not always the case, the assembly language has a one-to-one relationship with the machine code instructions. Most of the time the translation from the assembly-language syntax for an instruction to the machine code will be identical whether the assembler does it or it is done by hand. Naturally there are some don't-care bits from time to time, and the assembler and the human may choose different don't-care bits, so the result doesn't have to be a bit-for-bit match, but in length and speed they will be identical, no difference at all.
The differences between the human and the assembler software will be, if any, where the assembly language does not have a one-to-one relationship with the machine code, and/or where for various reasons the programmer wants the assembler to take care of something. This could be pseudo-instructions, or macros, or things having to do with externally defined variables.
Assembly language is a loaded term, as it is defined by the particular assembler; you can have many different and non-compatible assembly languages for the same processor. And you can have assembly languages where there are instances in which the language does not completely describe all the information needed to choose the specific instruction, near vs. far jumps for example, for some instruction sets with some assemblers.
So if you want to compare apples to apples, there will be no difference between hand-assembled code and software-assembled code. Apples to apples meaning the code in question is written properly, not vaguely, so that both the software and the human assembler can assemble it. If you do find differences other than don't-care bits, then it probably has to do with an optimization the human assembler made by changing the code; to make it fair, the matching assembly language can/should be changed to match. Such a difference would have nothing to do with human vs. software assemblers, but with one programmer's program as compared to another's. Basically, you could/would get the same result in assembly language with the software assembler.
A skilled assembly-language programmer who is targeting a very specific run-time environment can probably produce code which will run better than a compiler would produce. Depending upon the nature of the code, the performance improvement may or may not be significant relative to the amount of work required.
On the other hand, frameworks such as Java or .NET allow a programmer to compile software to an "intermediate" form which can, on demand, be translated into machine code which includes specific optimizations for the environment where it's actually running. Code which is compiled to run on "any platform", when run by framework engine which was hand-tweaked for the platform it's running on, may not run as well as assembly code that was hand-tweaked for that platform, but better than code which was hand-tweaked to optimize performance on some other platform.
"Apparently it used to be, according to this terrific account by Ed Nather. "
I remember reading a story like that decades ago.
It's been doing the rounds for a long time.
The blackjack bit is an embellishment, but the code dropping out at the right point, and the weeks spent trying to figure out how it worked before the penny finally dropped, definitely rings a very old bell.
I am designing a TTL serial computer, and I am struggling to choose an architecture more suitable for an LLVM compiler backend (I want to be able to run any C++ software there). There will be no MMU, no multiplication/division, no hardware stack, no interrupts.
I have 2 main options:
1) 8-bit memory, 8-bit ALU, 8-bit registers (~12-16). Memory address width 24 bit. So I will need to use 3 registers as IP and 3 registers for any memory location.
Needless to say, any address calculations would be pure pain to implement in the compiler.
2) 24-bit memory, 24-bit ALU, 24-bit registers (~6-8). Flat memory, nice. The drawback is that due to the serial nature of the design, each operation would take three times more clocks, even if we are operating on some booleans. 24-bit memory data width is expensive, and it's harder to implement in hardware in general.
The question is: do you think implementing all C++ features on this 8-bit, stack-less hardware is possible, or do I need more complex hardware to get generated code of reasonable quality and speed?
I second the suggestion to use LCC. I used it in this homebrew 16-bit RISC project: http://fpgacpu.org/xsoc/cc.html .
I don't think it should make much difference whether you build the 8-bit variant and use 3 add-with-carries to increment IP, or the 24-bit variant and do the whole thing in hardware. You can hide the difference in your assembler.
If you look at my article above, or an even simpler CPU here: http://fpgacpu.org/papers/soc-gr0040-paper.pdf you will see you really don't need that many operators / instructions to cover the integer C repertoire. In fact there is an lcc utility (ops) to print the min operator set for a given machine.
For more information see my article on porting lcc to a new machine here: http://www.fpgacpu.org/usenet/lcc.html
Once I had ported lcc, I wrote an assembler, and it synthesized a larger repertoire of instructions from the basic ones. For example, my machine had load-byte-unsigned but not load-byte-signed, so I emitted this sequence:
lbs rd,imm(rs) ->
lbu rd,imm(rs)
lea r1,0x80
xor rd,r1
sub rd,r1
(The xor/sub pair computes (x ^ 0x80) - 0x80, which sign-extends the unsigned byte.) So I think you can get by with this min cover of operations:
registers
load register with constant
load rd = *rs
store *rs1 = rs2
+ - (w/ and w/o carry) // actually can do + with - and ^ (see the check after this list)
>> 1 // << 1 is just +
& ^ // (synthesize ~ from ^, | from & and ^)
jump-and-link rd,rs // rd = pc, pc = rs
skip-z/nz/n/nn rs // skip next insn on rs==0, !=0, <0, >=0
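As a quick sanity check of the synthesized operations mentioned in the list (plain C, my sketch, not from the original answer):

/* Identities behind the minimal ALU:
 *   a + b == a - (0 - b)            // + from -
 *   ~a    == a ^ all-ones           // ~ from ^
 *   a | b == (a ^ b) ^ (a & b)      // | from & and ^
 */
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t a = 0x00F0F0F0u, b = 0x000FF00Fu;
    assert(a - (0u - b) == a + b);            /* add via subtract (modular arithmetic) */
    assert((a ^ 0xFFFFFFFFu) == ~a);          /* NOT via XOR with all-ones */
    assert(((a ^ b) ^ (a & b)) == (a | b));   /* OR via AND and XOR */
    return 0;
}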
Even simpler is to have no registers (or equivalently blur registers with memory -- all registers have a memory address).
Set aside a register for SP, and write the function prolog/epilog handler in the compiler and you won't have to worry about stack instructions. There's just code to store each of the callee save registers, adjust the SP by the frame size, and so forth.
Interrupts (and return from interrupts) are straightforward. All you need to do is force a jump-and-link instruction into the instruction register. If you choose the bit pattern for that to be something like 0, and put the right address into the source register rs (especially if it is r0), it can be done with a flip-flop reset input or an extra force-to-0 AND gate. I use a similar trick in the second paper above.
Interesting project. I see a TTL / 7400 contest is underway, and I was wondering myself how simple a machine you could get away with, and whether it would be cheating to add a 32 KB or 128 KB async SRAM to the machine to hold the code and data.
Anyway, happy hacking!
p.s.
1) You will want to decide how large each integral type is. You can certainly make char, short, int, long, long long, etc. the same size, one 24-bit word, if you wish, although it won't be compliant with the minimum representation ranges.
2) And although I focused on lcc here, you were asking about C++. I recommend pursuing C first. Once you have things figured out for C, including the *, /, % operators in software, etc., it should be more tractable to move to full-blown C++, whether in LLVM or GCC. The difference between C and C++ is "only" the extra vtables and RTTI tables and code sequences (entirely built up out of the primitive C integer operator repertoire) required to handle virtual function calls, pointer-to-member dereference, dynamic casts, static constructors, exception handling, etc.
IMHO, it is possible for a C compiler; I am not sure about C++, though.
LLVM/Clang could be a hard choice for an 8-bit computer.
Instead, first try lcc, then LLVM, etc. HTH.
Bill Buzbee succeeded in retargeting the lcc compiler for his Magic-1 (known as homebrewcpu):
Although the hardware design and construction of Magic-1 usually gets the most attention, the largest part of the project (by far) has been developing/porting the software. To this end, I've had to write an assembler and linker from scratch, retarget a C compiler, write and port the standard C libraries, write a simplified operating system and then port a more sophisticated one. It's been a challenge, but a fun one. I suppose I'm somewhat twisted, but I happen to enjoy debugging difficult problems. And, when the bug you're trying to track down could involve one or more of: hardware design flaw, loose or broken wire, loose or bad TTL chip, assembler bug, linker bug, compiler bug, C runtime library bug, or finally a bug in the program in question there's lot of opportunity for fun. Oh, and I also don't have the luxury of blaming the bugs on anyone else.
I'm continually amazed that the damn thing runs at all, much less runs as well as it does.
In my opinion, stackless hardware is already poorly suited for C and C++ code. If you have nested function calls, you will need to emulate a stack in software anyway, which of course is much slower.
When going the stackless route, you will probably allocate most of your variables as 'static', and have no re-entrant functions. In this case, 6502-style addressing modes can be effective. You could for example have these addressing modes:
Immediate address (24bit) as part of opcode
Immediate address (24bit) plus index register (8bit)
Indirect access: immediate 24bit address to memory, which contains the actual address
Indirect access: 24 bit address to memory, 8 bit index register added to value from memory.
The address modes outlined above would allow efficient access to arrays, structures and objects allocated at a constant address (static allocation). They would be less efficient (but still usable) for dynamically and stack-allocated objects.
You would also get some benefit from your serial design: the 24-bit + 8-bit addition does not usually need to take 24 cycles, because you can short-circuit the addition once the carry is 0.
Instead of mapping the IP as registers directly, you could allow changing it only through goto/branch instructions, using the same address modes as above. Jumps into dynamically computed addresses are quite rare so it makes more sense to give the whole 24-bit address directly in the opcode.
I think that if you design the CPU carefully, you can use many C++ features quite efficiently. However, do not expect that any random C++ code would run fast on such a limited CPU.
The implementation is certainly possible, but I doubt it will be usable (at least for C++ code). As was already noted, the first problem is the lack of a stack. Next, a bunch of C++ relies heavily on dynamic memory allocation, and C++'s "internal" structures are quite big.
So, as it seems to me, it will be better, if you:
Get rid of C++ requirement (or at least, limit yourself to some subset)
Use 24 bits, not 8 bits for everything (for registers as well)
Add hardware stack
You are not going to be able to run "any" C++ code there: fork(), system(), etc., for example, or anything that clearly relies on interrupts. You can get a long way there, sure.
Now do you mean any programs that can/have been written in C++ or are you limiting yourself to the language only and not the libraries that are commonly associated with C/C++? The language itself is a much easier rule to live with.
I think the easier question/answer is: why not just try? What have you tried so far? It could be argued that the x86 is an 8-bit machine: no regard for alignment and many 8-bit instructions. The msp430 was ported to LLVM to show how easily and quickly it could be done; I would like to see that platform with better support (not where my strengths lie, otherwise I would be doing it): a 16-bit platform, no MMU. It does have a stack and interrupts, sure, but you don't have to use them, and if you remove the library rules then what is left that needs an interrupt?
I would look at LLVM, but note that the documentation that shows how easy it is to port is dated and wrong, and you basically have to figure it out on your own from the compiler sources. lcc has a book; it's known for that, and it's not optimized. The sources don't compile well on modern computers; you're always having to go backwards in time to use it, and any time I go near it, after an evening of just trying to build it as-is I give up. vbcc is simple, clean, documented, and not unfriendly to smaller processors. Is it C++? I don't remember. Of all of them it's the easiest to get a compiler up and running, though. Of all of them LLVM is the most attractive and most useful when all is said and done. Don't go near gcc or even think about it: it's duct tape and bailing wire inside holding it together.
Have you invented your instruction set yet? Do you have a simulator and assembler yet? Look up lsasim on github to find my instruction set. You can write an LLVM backend for mine as practice for yours...grin...(my vbcc backend is horrible, I need to start over)...
You have to have some idea of how the high level will be implemented, but you really have to start with an instruction set, an instruction set simulator, and an assembler of some sort. Then start hand-converting C/C++ code into assembly for your instruction set; that should pretty quickly get you through "can I do this without a stack?", etc. In this process, define your calling convention and implement more C/C++ code by hand using your calling convention. THEN dig into a compiler and make a back end. I think you should consider vbcc as a stepping stone, then head for LLVM if it appears that it (the ISA) will work.
How can programming in assembly help in achieving optimization?
The most likely way programming in assembly can improve your code is by improving you: teaching you more about what is happening at a low level and getting the discipline of optimization can help you make good decisions in higher-level languages.
As far as actually helping one program: as others have noted it's rarely worth it. It's just possible you can use it as a kind of advanced profile-driven optimization: try many variations until you find one that's best on your particular problem.
To start with this: write a program in C or C++ or whatever compiled language you normally use, fire up your debugger, and disassemble a small but nontrivial function, and have a think about why the compiler did what it did. Then try writing a small bit of inline assembler yourself. On modern systems assembly is mostly easily embedded within C rather than done from scratch.
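As a taste of what "embedded within C" looks like, here is a minimal sketch using GCC/Clang extended inline assembly on x86-64 (the function name and the choice of instruction are mine, purely for illustration):

#include <stdio.h>

/* Add two ints with an explicit ADD instruction (AT&T syntax: addl src, dst). */
static int add_asm(int a, int b) {
    int result = a;
    __asm__("addl %1, %0" : "+r"(result) : "r"(b));
    return result;
}

int main(void) {
    printf("%d\n", add_asm(2, 3));   /* prints 5 */
    return 0;
}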
Or alternatively, get a teeny machine like a PIC and make it flash a LED...
These days, you have to be very good at assembly to beat the compiler.
I can do it any day of the week, but only by viewing the compiler's output first.
And then, if it gains more than a couple of percentage points I'd be surprised.
These days, I only program in assembly when I'm doing something the compiler can't do.
In principle, you can write highly-optimized code in assembly because the compiler is limited to specific, general-purpose optimizations that should apply to many programs, while you can be creative and use your knowledge of this particular program.
To take a simple example, back when I was new to this business compilers were very limited in their ability to optimize register usage. You know that to perform any sort of arithmetic or logical operation, the CPU must generally load one of the values into a register, then perform the operation on the other, then save the result? Like to add two numbers together -- and I'll use a pseudo-assembler here because I don't know what assembly languages you know and I've forgotten most of the details myself -- you'd write something like this:
LOAD A,value1
ADD A,value2
STORE a,destination
Compilers used to generate the loads for every operation. So if your C program said:
x=x+y;
z=z+x;
The compiler would generate something like:
LOAD A,x
ADD A,y
STORE A,x
LOAD A,z
ADD A,x
STORE A,z
But a human could observe that by the time we get to the second statement, register A already contains x, and addition is commutative, so we could optimize this to:
LOAD A,x
ADD A,y
STORE A,x
ADD A,z
STORE A,z
Et cetera. One could go through all sorts of tiny micro-optimizations like this. I used to do that all the time back when I was young and the world was green.
But over the years compilers have gotten much smarter, and CPUs have gotten more powerful so the micro-optimizations don't matter as much.
Thus, I haven't written any assembly language code in, wow, probably 15 years. I used to read the assembly generated by the compiler when debugging, sometimes it would give a clue to a subtle problem, but I haven't done that in years now either.
I don't think compilers are even written in assembly any more. Instead, you write the first draft of the compiler in a high level language on some other computer, i.e. you write a cross-compiler to get yourself off the ground.
I suspect the only real use of assembly today is for extremely constrained environments, embedded systems and that sort of thing; and for programs that have to deal intimately with the hardware, like device drivers.
I'd be interested to hear if there are any assembly programmers on this forum who care to tell us why they are assembly programmers.
Programming in assembly won't, in and of itself, optimize your code. The main thing about assembly is that it allows you to have very low-level access and to choose exactly what instructions the processor executes.
Since you won't have some compiler generating the assembly for you, you can perform code optimizations when you write the program yourself, if you know how.
So, you think you are smarter than the gcc optimizing compiler?
If not, then fughed aboud it (learning assembly for the sake of getting better at optimization). That would be akin to learning the Scheme language for the sake of getting better at recursion :)
In general, the compiler will do a fairly good job at generating optimal code. There are, however, cases where writing your own assembly can result in even more optimized (in terms of space and/or speed) code.
Typically, this happens when there is something that you know about the target system that the compiler doesn't. Compilers are designed to work on a variety of systems; if you want to take advantage of something unique to your target system, sometimes you have to go in and do it yourself. Here's an example. A few months ago, I was writing some code for a MIPS-based embedded system. There are many different types of MIPS CPUs, and some support certain opcodes that others do not. My compiler would generate MIPS code using the set of assembly operations that all MIPS architectures support. However, I knew that my chip could do more. I had a subroutine that needed to count the number of leading zeroes in a 32-bit number. The compiler synthesized this into a loop that took about 10 lines of assembly to do. I re-wrote it in one line by using the CLZ opcode that was designed to do just this. I knew that my chip supported the opcode but the compiler didn't. Admittedly, situations like this aren't very common; when they do pop up, however, it's nice to have enough of a background in assembly to take advantage of them.
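To make that concrete, here is a hedged sketch (not the original MIPS code): a portable leading-zero count of the kind a generic compiler might lower to a short loop, next to GCC/Clang's __builtin_clz, which can map to a single CLZ instruction on targets that have one.

#include <stdint.h>

/* Portable version: the sort of thing that becomes a ~10-line loop in assembly. */
int clz32_portable(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {   /* shift until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}

/* With GCC/Clang on a CLZ-capable target, the builtin compiles to the single
 * opcode (it is undefined for x == 0, so guard that case). */
int clz32_builtin(uint32_t x) {
    return x ? __builtin_clz(x) : 32;
}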
Sometimes one will need to perform a task which maps particularly well onto some CPU instructions, but does not fit well into any high-level-language constructs. For example, on many processors one may easily perform extended-precision arithmetic using something like:
add r0,r4
addc r1,r5
addc r2,r6
addc r3,r7
This will regard r3:r2:r1:r0 and r7:r6:r5:r4 as numbers four words long, adding the second to the first. Four nice easy instructions, and anyone who understands assembly would know what they do. I know of no way to perform the same task in C without it not only generating bigger and slower object code, but also being an incomprehensible mess of source code.
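For reference, here is one portable C formulation of that four-word add (my sketch, 32-bit words and a double-width intermediate assumed); the explicit carry handling is exactly what the compiler cannot usually turn back into the tight add/addc sequence above.

#include <stdint.h>

/* a[0..3] += b[0..3], least-significant word first, carry propagated by hand. */
void add4(uint32_t a[4], const uint32_t b[4]) {
    uint32_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        a[i]  = (uint32_t)sum;          /* low word of the partial sum */
        carry = (uint32_t)(sum >> 32);  /* carry into the next word */
    }
}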
A somewhat more extreme but specialized real-world example: Given two arrays array1[0..63] and array2[0..63], compute array1[0]*array2[0] + array1[1]*array2[1] + array1[2]*array2[2] ... + array1[63]*array2[63]. On a DSP I used, the computation could be done in machine code in about 75 machine cycles (about 67 of which are a repeating MAC instruction). There's no way C code could come anywhere close.
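For reference, the computation itself in plain C (my transcription of the formula above; the point of the answer is that the DSP's repeated MAC instruction does this far faster than the compiled C did):

#include <stdint.h>

/* Sum of products over the two 64-element arrays. */
int32_t dot64(const int16_t array1[64], const int16_t array2[64]) {
    int32_t acc = 0;
    for (int i = 0; i < 64; i++)
        acc += (int32_t)array1[i] * array2[i];
    return acc;
}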
About the only time I can think of using assembly language for optimizing code is when you need something very specific, like needing a GPIO on a microcontroller to toggle between high and low exactly every 9 clock cycles. That's too short a time to manage with an interrupt, and higher-level language compilers don't normally offer this kind of control over the instruction stream.
Typically you wouldn't program in assembly. You would program in C, and then look at the generated assembly to see what optimizations (or not) the C compiler made automatically. Adjusting your C code (to allow for better vectorization, for example) will allow the compiler to re-arrange code better, which will give you optimized assembly.
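One common example of such an adjustment (a sketch; whether it helps depends on your compiler and target): adding C99 restrict to tell the compiler the pointers don't alias, which often lets it vectorize the loop. Check the generated assembly before and after.

/* Scale-and-accumulate; restrict promises dst and src don't overlap. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += k * src[i];
}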
More likely than being able to beat the compiler at writing assembly code: knowing how typical tasks translate to assembly may help you write better high-level language code.
Typically you do not resort to assembly for optimization purposes. If this is possible, usually someone will already have provided the essential code ready for you to call, for example in the form of a linear algebra library.
Likewise, assembly offers direct access to the processor (e.g. for atomicity, time measurement, I/O), but the important accesses will already have been made accessible from your high-level language.
Compilers do a good job of generating assembler.
However, there's a bad reason why hand-written assembler is faster. Since it's harder to write, you write less of it.
It would be nice if programmers could discipline themselves to get the same job done in minimal code, regardless of language.
When writing assembly, or even just the straight raw bytes the assembler outputs, you can write programs that use hardware-specific features of the computer, or make something otherwise very carefully specified.
There might be big benefits if your program does the optimized part far more often than it does anything else. Always set up benchmarks before attempting optimizations.
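A minimal benchmarking harness along those lines (my sketch, POSIX clock_gettime assumed; the workload function is a placeholder for the code you actually care about):

#include <stdio.h>
#include <time.h>

static volatile unsigned sink;          /* keeps the placeholder work from being discarded */

static void routine_under_test(void) {  /* stand-in workload; substitute your own routine */
    unsigned x = 0;
    for (unsigned i = 0; i < 1000; i++)
        x += i * i;
    sink = x;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100000; i++)
        routine_under_test();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.3f s for 100000 iterations\n", secs);
    return 0;
}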
The downside is that your hand-written assembly works on less hardware. It may even end up being limited to one hardware model and revision!
It's rare that you ever can or need to write assembly routines, because commonly written software must work on almost every piece of hardware you find (and your kitten).
There's one interesting application if you know assembly: you can then write programs that produce assembly routines. Though it's mostly only fun, unless you keep it really small so you can port it easily.
Read the Graphics Programming Black Book by Michael Abrash
In most modern applications, it can't to any significant degree.
Inter-Process Communication Affects Application Response Time explains why algorithms are unlikely to be bottlenecks. (But always profile - never guess.)
In general, programming in assembly will increase time-to-market, bug density, and maintenance costs. Instead, strive for simplicity and readability in your code.
As poolie mentioned, the main benefit of learning assembly today is a deeper understanding of software and hardware. From that perspective, there's quite a bit of information on Steve Gibson's site.
If you understand why there is sometimes a need to use asm, you will appreciate both its strengths and its costs (headaches for you).
Highly embedded (limited code and ram size) projects pose unique challenges for code organization.
I have seen quite a few projects with no organization at all. (Mostly by hardware engineers who, in my experience are not typically concerned with non-functional aspects of code.)
However, I have been trying to organize my code accordingly:
hardware specific (drivers, initialization)
application specific (not likely to be reused)
reusable, hardware independent
For each module I try to keep the purpose to one of these three types.
Due to the limited size of embedded projects and the emphasis on performance, it is often difficult to keep to this organization.
For some context, my current project is a limited DSP application on a MSP430 with 8k flash and 256 bytes ram.
I've written and maintained multiple embedded products (30+ and counting) on a variety of target micros, including MSP430's. The "rules of thumb" I have been most successful with are:
Try to modularize generic concepts as much as possible (e.g. separate driver code from application code). -- It makes for easier maintenance and reuse/porting of a project to another target micro in the future.
DO NOT start by worrying about optimized code at the very beginning. Try to solve the domain's problem first and optimize second. -- Your target micro can handle a lot more "stuff" than you might expect.
Work to ensure readability. Although most embedded projects seem to have short development-cycles, the projects often live longer than you might expect and another developer will undoubtedly have to work with your code.
I've worked on 8-bit PIC processors with similar limitations.
One restriction you don't have is how many comments you make or what you choose to name your methods, variables, etc. Take advantage. Speed and size constraints do sometimes trump organization, but you can always explain.
Another tip is to break up a logical source file into even more pieces than you need, then bind them by #includeing them in a compilation unit. This allows you to have lots of reusable code (even one routine per file) but combine in whatever order you need. This is useful e.g. when trying to meet compilation unit size restrictions, or to pick and choose which common subroutines you need on the next project.
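A sketch of that pattern (the file names here are invented for illustration):

/* project_unit.c -- one compilation unit that pulls in only the pieces
 * this project needs; each included file holds a single routine. */
#include "crc8.c"
#include "ringbuf.c"
#include "uart_putc.c"   /* leave out what this project doesn't use */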
I try to organize it as if I had unlimited RAM and ROM, and it usually works out fine. As mentioned elsewhere, do not try to optimize it until you absolutely need to.
If you can get a pin-compatible processor that has more resources, it's better to get it working on that, concentrating on good structure and layout, then optimize for size later when you understand the code better.
Except under exceptional circumstances (see note), the organisation of your code will have no impact on the final product. (contents of the code are obviously a different matter)
So with that in mind you should organise your code as you would any other project.
With that said, the following are fairly typical:
If this is a processor that you've worked on before, or will be working on in the future, you will usually want to keep a dedicated hardware abstraction layer that can be shared between projects in the future. Typically this module would contain items like routines for managing any UARTs, timers, etc.
Usually it's reasonable to maintain a set of platform specific code for initialisation and setup that performs all of the configuration and initialisation up to the point where your executive takes over and runs your application. It will also include platform specific hal routines.
The executive/application is probably maintained as a separate module. All of the hardware specific code should be hidden in the hal (as mentioned above).
By splitting your code up like this you also have the option of compiling and running your application as a simulation, on a completely different platform, just by replacing the hardware specific code with routines that mimic the hardware.
This can be good for unit testing and debugging and algorithmic problems you might have.
Exceptional circumstances as might be imposed by unusual compiler restrictions. eg. I've come across some compilers that expect all interrupt service routines to be compiled within a single object file.
I've worked with some sensors like the Tmote Sky, and I too have seen poor organization; I have to admit I have contributed to it. Anyway, I'd say that some confusion is unavoidable, because loading too many modules or too many parts of the program will (IMHO) be resource-killing too, so try to be aware of the threshold between organization and usability on low-resource devices.
Obviously this doesn't mean letting chaos begin, but for example take a look at the organization of the TinyOS source code and applications; it gives an idea of what I'm trying to say.
Although it is a bit painful, one organization technique that is somewhat common with embedded C libraries is to split every single function and variable into a separate C source file, and then aggregate the resulting collection of O files into a library file.
The motivation for doing this is that for most normal linkers the unit of linkage is an object file: for every object you either get the whole object or none of it. Since there is a 1-1 relationship between C files and object files, putting each symbol in its own C file gives each one its own object. This in turn lets the linker pull in only the subset of functions and variables that are actually used.
This sort of game doesn't help at all for headers; they can happily be left as single files.