Are Smalltalk bytecode optimizations worth the effort? - smalltalk

Consider the following method in the Juicer class:
Juicer >> juiceOf: aString
| fruit juice |
fruit := self gather: aString.
juice := self extractJuiceFrom: fruit.
^juice withoutSeeds
It generates the following bytecodes
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3
29 self ; 4
30 pushTemp: 1 ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2 ; 6 <-
33 pushTemp: 2 ; 7 <-
34 send: withoutSeeds
35 returnTop
Now note that 32 and 33 cancel out:
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3 *
29 self ; 4 *
30 pushTemp: 1 ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6 <-
33 send: withoutSeeds
34 returnTop
Next consider 28, 29 and 30. They insert self below the result of gather. The same stack configuration could have been achieved by pushing self before sending the first message:
25 self ; 1 <-
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 popIntoTemp: 1 ; 4 <-
30 pushTemp: 1 ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6
33 send: withoutSeeds
34 returnTop
Now cancel out 29 and 30
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 storeIntoTemp: 1 ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2 ; 5
32 send: withoutSeeds
33 returnTop
Temporaries 1 and 2 are written but not read. So, except when debugging, they could be skipped leading to:
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop
This last version, which saves 4 out 7 stack operations, corresponds to the less expressive and clear source:
Juicer >> juiceOf: aString
^(self extractJuiceFrom: (self gather: aString)) withoutSeeds
Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining.) These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?
EDIT
An interesting advantage of using a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes it easier the manipulation of bytecodes for the sake of optimization. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., the one that requires de-optimization for supporting the debugger) is actually necessary or enough performance gains can be attained by non-destructive techniques.

The main annoyance when you start playing with such optimizations is debugger interface.
Historically and still currently in Squeak, the debugger is simulating the bytecode level and needs to map the bytecodes to corresponding Smalltalk instruction.
So I think the gain was too low for justifying complexification, or even worse degradation of debugging facility.
Pharo wants to change the debugger to operate at a higher level (Abstract Syntax Tree), but I don't know how they will end up at bytecode which is all the VM knows of.
IMO, this kind of optimization might better be implemented in the JIT compiler which transforms bytecode to machine native code.
EDIT
The greatest gains are in eliminating the sends themselves (by inlining) because they are much more expensive (x10) than the stack operations - there are 10 times more bytecodes executed per second than sends when you test 1 tinyBenchmarks (COG VM).
Interestingly, such optimizations could take place in the Smalltalk image, but only on hotspot detected by VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/
So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and work in progress)!
All the machinery for de-optimizing when the method has to be debugged still is one of the difficult points as I understand it.

I think that a broader question is worth answering: are bytecodes worth the effort? Bytecodes were thought as a compact and portable representation of code that is close the target machine. As such, they are easy to interpret, but slow to execute.
Bytecodes do not excel in any of these games, and that usually makes them not the best choice if you want to either write an interpreter or a fast VM. On one hand, AST nodes far easier to interpret (only a few node types vs lots of different bytecodes). On the other hand, with the advent of JIT compilers, it became clear that running native code instead is not only possible but also much faster.
If you look at the most efficient VM implementations of JavaScript (which can be considered the most modern compilers of today) and also Java (HotSpot, Graal), you'll see they all use a tiered compilation scheme. Methods are initially interpreted from the AST, and only jitted when they become a hot spot.
At the hardest tiers of compilation there are no bytecodes. The key component in a compiler is its intermediate representation, and bytecodes do not fulfill the required properties. The most optimizable IRs are much more fine grained: they are in SSA form, and allow specific representation of registers and memory. This allows for much better code analysis and optimization.
Then again, if you are interested in portable code, there isn't anything more portable than the AST. Besides, it's easier and more practical to implement AST-based debuggers and profilers than bytecode-based ones. The only remaining problem is compactness, but in any case you can implement something like ast-codes (coded asts, similar to bytecodes but representing the tree)
On the other hand, if you want full speed, then you'll go for a JIT with a good IR and no bytecodes. I think that bytecodes don't fill many gaps in today VMs, but still remain mostly for backwards compatibility (also there are many examples of hardware archiqutectures that directly execute Java bytecodes).
There are also some cool experiments with the Cog VM related with bytecodes. But from what I understand they transform the bytecode into another IR for optimizing, then they convert back to bytecodes. I'm not sure if there's a technical gain in the last conversion besides reusing the original JIT architecture, or if there actually is any optimization at the bytecode level.

Related

How to build a decompiler for Lua 5.1?

I am building a decompiler for Lua 5.1. (for study purposes only)
Here is the generated code:
main <test.lua:0,0> (12 instructions, 48 bytes at 008D0520)
0+ params, 2 slots, 0 upvalues, 0 locals, 6 constants, 0 functions
1 [1] LOADK 0 -2 ; 2
2 [1] SETGLOBAL 0 -1 ; plz_help_me
3 [2] LOADK 0 -4 ; 24
4 [2] SETGLOBAL 0 -3 ; oh_no
5 [3] GETGLOBAL 0 -1 ; plz_help_me
6 [3] GETGLOBAL 1 -3 ; oh_no
7 [3] ADD 0 0 1
8 [3] SETGLOBAL 0 -5 ; plz_work
9 [4] GETGLOBAL 0 -6 ; print
10 [4] GETGLOBAL 1 -5 ; plz_work
11 [4] CALL 0 2 1
12 [4] RETURN 0 1
constants (6) for 008D0520:
1 "plz_help_me"
2 2
3 "oh_no"
4 24
5 "plz_work"
6 "print"
locals (0) for 008D0520:
upvalues (0) for 008D0520:
Original Code:
plz_help_me = 2
oh_no = 24
plz_work = plz_help_me + oh_no
print(plz_work)
How to build a decompiler efficiently to generate this code? Should I use AST trees to map the behavior of the code? (opcodes in this case)
Lua VM is a register machine with a nearly unlimited supply of registers, which means you don't have to deal with the consequences of register allocation. It makes the whole thing much more bearable than decompiling, say, x86.
A very convenient intermediate representation for going up the abstraction level would have been an SSA. A trivial transform treating registers as local variable pointers and leaving memory loads as is, followed by an SSA-transform [1], will give you a code suitable for further analysis. Next step will be loop detection (done purely on CFG level), and, helped by SSA, detection of the loop variables and loop invariants. Once done, you'll see that only a handful of common patterns exist, that can be directly translated to higher level loops. Detecting if and other linear control flow sequences is even easier, once you're in an SSA.
One nice property of an SSA is that you can easily construct high level AST expressions out of it. You have a use count for every SSA variable, so you can simply substitute all the single-use variables (that were not produced by side-effect instructions) in place of their use (and side-effect ones too if you maintain their order). Only multi-use variables will remain.
Of course, you'll never get meaningful local variable names out of this procedure. Globals are preserved.
[1] https://pfalcon.github.io/ssabook/latest/
It looks like there is a lua decompiler for lua 5.1 written in C that can be used for study; https://github.com/viruscamp/luadec https://www.lua.org/wshop05/Muhammad.pdf describes how it works.
And on the extremely simple example you give, it comes back with the exactly same thing:
$ ./luadec ../tmp/luac.out
-- Decompiled using luadec 2.2 rev: 895d923 for Lua 5.1 from https://github.com/viruscamp/luadec
-- Command line: ../tmp/luac.out
-- params : ...
-- function num : 0
plz_help_me = 2
oh_no = 24
plz_work = plz_help_me + oh_no
print(plz_work)
There is a debug flag -d to show a little bit about how it works:
-- Decompiled using luadec 2.2 rev: 895d923 for Lua 5.1 from https://github.com/viruscamp/luadec
-- Command line: -d ../tmp/luac.out
LoopTree of function 0
FUNCTION_STMT=0x2e2f5ec0 prep=-1 start=0 body=0 end=11 out=12 block=0x2e2f5f10
----------------------------------------------
1 LOADK 0 1
SET_SIZE(tpend) = 0
next bool: 0
locals(0):
vpend(0):
tpend(1): 0{2}
Note: The Lua VM in version in 5.1 (and other versions up to the latest 5.4) seems to be stack oriented. This means that in order to compute plz_hep_me + oh_no, both values need to be pushed onto a stack and then when the ADD bytecode instruction is seen two entries are popped off of the stack and the result is pushed onto the stack. From the decompiler's standpoint, a tree fragment or "AST" fragment of the addition with its two operands is created.
In the output above, a constant is push onto stack tpend to which we see 0{2}. The stuff in brackets seems to hold the output representation. Later on, just before the ADD opcode you will see two entries on the stack. Entries are separate simply with a space in the debug output.
----------------------------------------------
2 SETGLOBAL 0 0
next bool: 0
locals(0):
vpend(1): -1{plz_help_me=2}
tpend(0):
----------------------------------------------
Seems to set variable plz_help_me from the top of the stack and push the assignment statement instead.
3 LOADK 0 3
SET_SIZE(tpend) = 0
next bool: 0
locals(0):
vpend(0):
tpend(1): 0{24}
...
At this point I guess the decompiler sees that the stack is 0 SET_SIZE(tpend)=0 or the first statement then is done.
Moving onto the next statement, the constant 24 is then loaded onto tpend.
In vpend and tpend, we see the output as it gets built up.
Looking at the C code, It looks like it loops over functions and within that loops over what it thinks are statements, and within that is driven by the a loop over bytecode instructions but "walking" them based on instruction type and information it saves.
It looks it does build an AST on the fly in a somewhat ad-hoc manner. (By "ad hoc" I simply mean programmer code builds the tree).
Although for compilation especially of optimized code Single Static Assignment (SSA) is a win, in the decompilation direction of high-level bytecode such as is found here, usually this is neither needed nor wanted. SSA maps to finite reusable names to a potentially infinite set.
In a decompiler we want to go the other direction from infinite registers (or even finite registers) back to names the programmer used and here this mapping is already given, so just use it. This is simpler too.
If the same name is used for two conceptually different computations, tracking that is probably wanted. Correcting or improving the style of the source code in the way that the decompiler writer feels might be nice, but usually isn't appreciated. (In fact, in my experience many programmers using a decompiler like this will call it a bug, when technically, it isn't.)
The luadec decompiler breaks things into functions and walks over disassembled instructions in that in an ad hoc way based on specific instructions for various opcodes. After this tree or list of tree nodes is built, an AST if you will (and that is the term this decompiler uses), there seem to be custom print routines for each AST node type.
Personally, I think there is a better way, and one that is slightly different from the one suggested in another answer.
I wrote about decompilation from this novel perspective here. I believe it applies to Lua as well. And older, longer more research oriented paper here.
As with many decompilers, logic instructions introduce control flow, and these strategies which are instruction based and that go down a path early on from the instructions, have a hard time shifting these around. So a pattern matching/parsing/term-rewrite approach where control flow is marked in the instructions such as with addition of pseudo Phi-function instructions (or its equivalent). Dominator regions from a control-flow graph also greatly help here.
Edit:
There is also unluac, a lua decompiler written in Java that can be studied. Although there are very few comments in the code describing how it works, the code it self seems well organized and the functions are pretty small and clear. It seems similar in operation in that there is a part that is bytecode operation based, and a part that is more expression-tree based. It seems to be currently worked on and is the only decompiler that supports lua versions from 5.0 up to 5.4.

Coolest multi operation tricks to exploit 64bit registers ? (No SIMD/SSE/AVX) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Im always eager to learn about the most extreme performance optimizations out there. Recently, ive been thinking alot about exploiting large registers. I feel guilty when ive got a one bit information sitting in a 64 bit register ... So i'd like to know tricks how I can do stuff like compare multiple 16 bits at once (e.g. Usefull when there is high probability none would match) or such. Ofc the most easiest example to check for whether atleast one element got a flag set is to xor a 0 register with those 64 bits and compare it for > 0. on the instruction level this ofc would exploit instr. Pipeline but anyways youre down to 2 instructions instead of 128 (each mov and cmp). Thats what I would call a whopping speedup!
Ofc Im aware that cache misses are what our cpus spend 95% of its time on, but lets assume the cache use would already be optimal.
Especially with trees, it would be usefull to compare multiple values without SIMD at once and get out a single childIndex to read next. In the end the instructions should be minimized and dont suffer too much from pipeline wait penaltys.
Other operations I could list up:
When put next to eachother with padding, e.g. 5x11 bits you do 5 additions, shifts, subtractions and bitwise operations in parallel. Or 7x 8bits. Ofc, one would need to store the data in that way and use the result efficiently to not pay a penalty on bitmask extraction/import.
This is a bit subjective, but the coolest trick I know is called bit slicing. It belongs to the category of Distributed Arithmetics and especially useful when the length of each arithmetic operation is small and/or not some common size such as uint8_t.
Instead of packing 5 x 11 bits in a register, one associates a register for each bit in a vector.
A[0] = 1 0 1 1 1 1 0 1 1 0 | 0 |
A[1] = 0 0 0 1 1 1 1 0 0 0 | 0 |
A[4] = 0 1 1 0 0 0 0 0 0 0 | 1 |
R_0
Packing all the least significant bits to a single register R_0, or generally bit N to R_N, one can perform arithmetics for 64 vectors A[n] in parallel.
Most arithmetic operations would require k*11 instructions to complete, but some would have a better best / average case, such as A[] == B[], or A++, or even A < B starting from the MSB.
// D = A == B
for (i = 0, D=~0; D && i < 11; i++) D &= ~(A[i] ^ B[i]);
The caveat is that currently with the introduction of SIMD and openCL, the technique is most likely useful only for obfuscation. In the 90's the state of the art MD5 cracker utilized that; a more recent version found from the net admits being 3x slower than a more conventional approach.

Software memory testing for bus failures

I have a board with quite a few flash chips, some of them are showing intermittent failures. Standard memory tests are not showing any specific problem addresses, other than certain chips are failing intermittently under mechanical and thermal stress.
Suspecting the actual connections and not the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.
There are some memory tests but they apply better to RAM rather than flash memory (http://www.ganssle.com/testingram.htm). Specifically, the parallel flash has a sequence of bus writes to write to each value; a write/verify failure could easily be the write operation which could be any pin on the bus.
Ideas welcome...
The typical memory tests are there to do that. I prefer a pseudo randomizer (deterministic using an lfsr) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as data bus test in two passes (repeat inverted). I say typical in the sense of wiggle the data bits and address bits both states each and vary the states of signals and their neighbors. The pounding on a ram to create thermal or other stresses, well you cant write very fast to a flash so you cant really do fast write/read cycles.
Flash creates another problem and that is writing then reading back isnt that interesting, you want to write the read back later, hours, days, weeks to determine if the part is actually holding data.
When you say thermal or stress do you mean only during the time it is above X degrees it fails, or do you mean that due to thermal stress it is broken all the time after the event. Likewise with mechanical, while vibrating or under mechanical stress the part fails, but when relieved of that stress it is okay, or the mechanical stress has done permanent damage that can be detected under stress or not.
Now although you cant do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems by constant reading of one block or location. Not necessarily something you have time to do for every location, but you might fill the ram with a pseudo random pattern and concentrate on one location for a while, (minutes, tens of minutes), if you have a part that you know is bad see if this accelerates the detection of the problem and if any location will work or only certain ones. then another thing is to read all the locations repetitively for hours/days or leave it sit for hours/days/weeks and then do a read pass without an erase or write and see if it has lost anything.
unfortunately as you probably know each new failure case takes its own research project and development of a new test.
First step to test a memory is data bus test0 0 0 0 0 0 0 • In this test, data bus wiring is properly tested to0 0 0 0 0 0 0 confirm that the value placed on data bus by processor0 0 0 0 0 0 0 is correctly received by memory device at the other end0 0 0 0 0 0 00 0 0 0 0 0 0 • An obvious way to test is to write all possible0 0 0 0 0 0 0 data values and verify 0 0 0 0 0 0 0 • Each bit can be tested independently• To perform walking 1s test, write the first data value given in the table, verify by reading it back, write the second value, verify and so on. • When you reach the end of the table, the test is complete
In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."
Since reading should be isolated from writing, testing the flash is easier. Perform the writing portion of the tests while the system is not under stress. Then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
If the system never fails when doing the above, you can then perform the whole tests while under stress. Any errors that appear are most likely write errors.
I've decided to design a memory pattern that I think I can deduce both data and address errors from. The concept is to use values significantly different as key indicators of possible read errors. The concept is also to detect a failure on one pin at a time.
The test will read alternately from only bottom and top addresses (0x000000 and 0x3FFFFF - my chip has 22 address lines). In those locations I will put 0xFF and 0x00 respectively (byte wide). The idea is to flip all address and data lines and see what happens. (All other values in the flash have at least 3 bits different from 0x00 and 0xFF)
There are 44 addresses that a single pin failure could send me to in error. In each address put one of 22 values to represent which of the 22 address pin was flipped. Each are 2 bits different from each other, and 3 bits different from 00 and FF. (I tried for 3 bits different from each other but 8 bits could only get 14 values)
07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
2F,34,38,3D,3E,43,49,4A,4F,52,58
The remaining addresses I put a nice pattern of six values 33,55,66,99,AA,CC. (3 bits different from all other values) value(address) = nicePattern[ sum of bits set in address % 6];
I tested this and have statistically collected 100s of intermittent failure incidents synchronized to the mechanical stress.
single bit errors detectable
double bit errors deducible (Explainable by a combination of frequent single bit errors)
3 or more bit errors (generally inconclusive)
Even though some of the chips had 3 failing pins, 70% of the incidents were single bit (they usually didn't fail at the same time)
The testing group is now using this to identify which specific connections are failing.

How do I process enormous numbers? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Most efficient implementation of a large number class
Suppose I needed to calculate 2^150000. Obviously that number is going to exceed the size of an int, float, or double. How can I make a data type that allows normal math functions but exceeds the basic number types?
If this is a "depends which language you use" kind of deal. I will say C#.
See
Most efficient implementation of a large number class
for some leads.
If C# is not cast in stone, and you want something that just works out of the box, then there are several options. The one I know best is Python, but I think that languages like Scheme and Ruby support large numbers, too.
Python: 2**150000. Prints the result after about 1 second.
If you want free mathematics software, look at Maxima or Sage.
You might also consider using Frink, which is a language with the native capability of dealing with measurement units.
It computes 2^150000 without difficulty, deals with fractions (e.g. 1/3+2/5 --> 11/15), computes 3 meters + 2 inch --> 3.0508 m and is a full programming language.
Frink - Copyright 2000-2008 Alan Eliasen, eliasen#mindspring.com
http://futureboy.us/frinkdocs/
Several languages have built in support for arbitrary large numbers. You could use Mathematica, for example. I tried your example in Mathematica, and the result has 45,155 digits. I tried the same example with bc on a Unix machine. bc supports extended precision, but not that extended; it bombed on the example.
Lisp is your friend. Default biginteger numbers.
I find it very frustrating to use a language without arbitrarily large numbers: it seems nonsensical to be able to use ordinary operators like addition on most numbers, but to have to switch to method calls on a BigInt instance simply because of its size.
A whole bunch of languages have more complete numeric towers, and seamlessly coerce when needed; e.g., Allegro Common Lisp evaluates and prints all 45,155 digits of (expt 2 150000) in 1ms.
cl-user(2): (time (expt 2 150000))
; cpu time (non-gc) 0 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 0 msec user, 0 msec system
; real time 1 msec
; space allocation:
; 2 cons cells, 18,784 other bytes, 0 static bytes
There is a product in C called calc which is an arbitrary precision calculator. I used it once when working as a researcher and found it fairly straightforward to use...
http://sourceforge.net/projects/calc/
It can be programmed for difficult or long calculations and can accept arguments from the command line. In interactive mode, it accepts one command at a time, and displays the answer.
Ordinarily the commands are simply expressions such as:
3 * (4 + 1)
and calc will print:
15
Calc does the arithmetic operators +, -, /, * as well as ^ (exponentiation), % (modulus) and // (integer divide).
For example:
3 * 19 ^ 43 - 1
will produce:
29075426613099201338473141505176993450849249622191102976
Calc values can be VERY large. For example:
2 ^ 23209 - 1
will print:
402874115778988778181873329071 ... loads of digits ... 3779264511
Hope this helps...
I don't know C# but I do know the Ruby programming language has the BigDemical class that seems to allow numbers of unlimited size.
Python has a bignum library. If you need to implement a bignum library in another language you can at least use the Python one as reference for validating your work. Note that bignums have a few implementation gotchas that aren't immediately obvious if you don't know what you're looking for.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and
make sure that it was not infected.
Code to remove the virus from the
harddrive if it was infected.
Code to duplicate the antivirus
bootsector to another floppy if a
special key was pressed.
Code to boot the harddrive if all was
well, and no infections was found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In c/c++ the code is: (sourced from Wikipedia)
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1); // Now this is what you call a real magic number
x = *(float*)&i;
x = x*(1.5f - xhalf*x*x);
return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching in a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here is the actual code fragments
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.
The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)