I am building a decompiler for Lua 5.1 (for study purposes only).
Here is the generated code:
main <test.lua:0,0> (12 instructions, 48 bytes at 008D0520)
0+ params, 2 slots, 0 upvalues, 0 locals, 6 constants, 0 functions
1 [1] LOADK 0 -2 ; 2
2 [1] SETGLOBAL 0 -1 ; plz_help_me
3 [2] LOADK 0 -4 ; 24
4 [2] SETGLOBAL 0 -3 ; oh_no
5 [3] GETGLOBAL 0 -1 ; plz_help_me
6 [3] GETGLOBAL 1 -3 ; oh_no
7 [3] ADD 0 0 1
8 [3] SETGLOBAL 0 -5 ; plz_work
9 [4] GETGLOBAL 0 -6 ; print
10 [4] GETGLOBAL 1 -5 ; plz_work
11 [4] CALL 0 2 1
12 [4] RETURN 0 1
constants (6) for 008D0520:
1 "plz_help_me"
2 2
3 "oh_no"
4 24
5 "plz_work"
6 "print"
locals (0) for 008D0520:
upvalues (0) for 008D0520:
Original Code:
plz_help_me = 2
oh_no = 24
plz_work = plz_help_me + oh_no
print(plz_work)
How can I efficiently build a decompiler that reconstructs this code? Should I use an AST to map the behavior of the code (the opcodes, in this case)?
The Lua VM is a register machine with a nearly unlimited supply of registers, which means you don't have to deal with the consequences of register allocation. That makes the whole thing much more bearable than decompiling, say, x86.
A very convenient intermediate representation for going up the abstraction level is SSA. A trivial transform that treats registers as pointers to local variables and leaves memory loads as they are, followed by an SSA transform [1], will give you code suitable for further analysis. The next step is loop detection (done purely at the CFG level) and, helped by SSA, detection of loop variables and loop invariants. Once that is done, you'll see that only a handful of common patterns exist, and they can be translated directly into higher-level loops. Detecting if and other linear control-flow constructs is even easier once you're in SSA form.
One nice property of SSA is that you can easily construct high-level AST expressions from it. You have a use count for every SSA variable, so you can simply substitute every single-use variable (that was not produced by a side-effecting instruction) in place of its use (and the side-effecting ones too, if you maintain their order). Only multi-use variables will remain.
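To make that substitution step concrete, here is a toy sketch in Go (the SSA value names and the string-based "IR" are invented for this answer, not part of any real Lua pipeline). Since every value in your example is used exactly once, folding definitions into their use sites collapses everything back into the original statements:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// SSA value -> expression it computes (pure, single-use definitions).
	defs := map[string]string{
		"r0_1": "2",
		"r0_2": "24",
		"r0_3": "plz_help_me",
		"r1_1": "oh_no",
		"r0_4": "r0_3 + r1_1",
		"r0_5": "print",
		"r1_2": "plz_work",
	}
	// Side-effecting statements (global stores and the call) in program order.
	stmts := []string{
		"plz_help_me = r0_1",
		"oh_no = r0_2",
		"plz_work = r0_4",
		"r0_5(r1_2)",
	}
	// Repeatedly fold single-use definitions into their use sites. A real
	// implementation would work on an expression tree and respect operator
	// precedence; plain text substitution is enough for this tiny example.
	for changed := true; changed; {
		changed = false
		for i, s := range stmts {
			for name, expr := range defs {
				if strings.Contains(s, name) {
					s = strings.Replace(s, name, expr, 1)
					changed = true
				}
			}
			stmts[i] = s
		}
	}
	for _, s := range stmts {
		fmt.Println(s) // prints the four lines of the original source
	}
}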
Of course, you'll never get meaningful local variable names out of this procedure. Globals are preserved.
[1] https://pfalcon.github.io/ssabook/latest/
It looks like there is a Lua 5.1 decompiler written in C that can be studied: https://github.com/viruscamp/luadec. The paper https://www.lua.org/wshop05/Muhammad.pdf describes how it works.
And on the extremely simple example you give, it comes back with exactly the same thing:
$ ./luadec ../tmp/luac.out
-- Decompiled using luadec 2.2 rev: 895d923 for Lua 5.1 from https://github.com/viruscamp/luadec
-- Command line: ../tmp/luac.out
-- params : ...
-- function num : 0
plz_help_me = 2
oh_no = 24
plz_work = plz_help_me + oh_no
print(plz_work)
There is a debug flag -d to show a little bit about how it works:
-- Decompiled using luadec 2.2 rev: 895d923 for Lua 5.1 from https://github.com/viruscamp/luadec
-- Command line: -d ../tmp/luac.out
LoopTree of function 0
FUNCTION_STMT=0x2e2f5ec0 prep=-1 start=0 body=0 end=11 out=12 block=0x2e2f5f10
----------------------------------------------
1 LOADK 0 1
SET_SIZE(tpend) = 0
next bool: 0
locals(0):
vpend(0):
tpend(1): 0{2}
Note: although the Lua VM in 5.1 (and in later versions up to the current 5.4) is register-based, luadec itself tracks pending values on stacks while it decompiles. To reconstruct plz_help_me + oh_no, the two operand expressions are pushed first, and when the ADD bytecode instruction is seen the two entries are popped and the result is pushed back. From the decompiler's standpoint, a tree fragment or "AST" fragment of the addition with its two operands is created.
In the output above, a constant is pushed onto the stack tpend, where we see 0{2}. The part in braces seems to hold the output representation. Later on, just before the ADD opcode, you will see two entries on the stack. Entries are separated simply by a space in the debug output.
----------------------------------------------
2 SETGLOBAL 0 0
next bool: 0
locals(0):
vpend(1): -1{plz_help_me=2}
tpend(0):
----------------------------------------------
This seems to set the variable plz_help_me from the top of the stack and record the assignment statement on vpend instead.
3 LOADK 0 3
SET_SIZE(tpend) = 0
next bool: 0
locals(0):
vpend(0):
tpend(1): 0{24}
...
At this point, I guess the decompiler sees that the stack is empty (SET_SIZE(tpend) = 0), so the first statement is done.
Moving on to the next statement, the constant 24 is then loaded onto tpend.
In vpend and tpend, we see the output as it gets built up.
Looking at the C code, it seems to loop over functions, within that loop over what it thinks are statements, and within that it is driven by a loop over bytecode instructions, "walking" them based on the instruction type and information it saves.
It looks like it builds an AST on the fly in a somewhat ad hoc manner. (By "ad hoc" I simply mean that hand-written code builds the tree.) A much-simplified sketch of this style of walk is below.
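This sketch is in Go and is not luadec's actual code: the opcode operands are simplified and the constant indices are 0-based. It keeps the expression built so far for each register, much like tpend, and emits a statement whenever an instruction has a side effect, much like vpend:

package main

import "fmt"

type ins struct {
	op   string
	a, b int
}

func main() {
	konst := []string{"plz_help_me", "2", "oh_no", "24", "plz_work", "print"}
	code := []ins{
		{"LOADK", 0, 1}, {"SETGLOBAL", 0, 0},
		{"LOADK", 0, 3}, {"SETGLOBAL", 0, 2},
		{"GETGLOBAL", 0, 0}, {"GETGLOBAL", 1, 2},
		{"ADD", 0, 1}, {"SETGLOBAL", 0, 4},
		{"GETGLOBAL", 0, 5}, {"GETGLOBAL", 1, 4},
		{"CALL", 0, 1},
	}
	reg := map[int]string{} // register -> expression fragment built so far (cf. tpend)
	for _, instr := range code {
		switch instr.op {
		case "LOADK", "GETGLOBAL":
			reg[instr.a] = konst[instr.b]
		case "ADD": // simplified operand form: reg[a] = reg[a] + reg[b]
			reg[instr.a] = reg[instr.a] + " + " + reg[instr.b]
		case "SETGLOBAL": // side effect: emit an assignment (cf. vpend)
			fmt.Printf("%s = %s\n", konst[instr.b], reg[instr.a])
		case "CALL": // side effect: emit a call with one argument
			fmt.Printf("%s(%s)\n", reg[instr.a], reg[instr.a+1])
		}
	}
}

Running it prints the four lines of the original source.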
Although Static Single Assignment (SSA) is a win for compilation, especially of optimized code, in the decompilation direction of high-level bytecode such as we have here it is usually neither needed nor wanted. SSA maps a finite set of reusable names onto a potentially infinite set of names.
In a decompiler we want to go in the other direction, from infinite (or even finite) registers back to the names the programmer used, and here that mapping is already given, so just use it. This is simpler too.
If the same name is used for two conceptually different computations, tracking that is probably wanted. Correcting or improving the style of the source code in whatever way the decompiler writer feels might be nice, but usually isn't appreciated. (In fact, in my experience many programmers using a decompiler like this will call it a bug when, technically, it isn't.)
The luadec decompiler breaks things into functions and walks over the disassembled instructions of each in an ad hoc way, dispatching on the specific opcodes. After this tree, or list of tree nodes, is built (an AST, if you will, and that is the term this decompiler uses), there seem to be custom print routines for each AST node type.
Personally, I think there is a better way, and one that is slightly different from the one suggested in another answer.
I wrote about decompilation from this novel perspective here. I believe it applies to Lua as well. An older, longer, more research-oriented paper is here.
As with many decompilers, logic instructions introduce control flow, and strategies that are instruction-based and commit to a path early on have a hard time shifting these around. So a pattern-matching/parsing/term-rewriting approach works better, one where control flow is marked in the instructions, such as by adding pseudo phi-function instructions (or their equivalent). Dominator regions from a control-flow graph also help greatly here.
Edit:
There is also unluac, a Lua decompiler written in Java that can be studied. Although there are very few comments in the code describing how it works, the code itself seems well organized, and the functions are pretty small and clear. It seems similar in operation in that there is a part that is bytecode-operation based and a part that is more expression-tree based. It seems to be actively worked on and is the only decompiler that supports Lua versions from 5.0 up to 5.4.
I'm looking at the print interface in the gota dataframe here:
https://github.com/kniren/gota/blob/master/dataframe/dataframe.go#L99
I see the default value is shortCols = true, given here.
When I print the data frame, how can I override this value so that it prints with shortCols = false when I call fmt.Println?
fmt.Println(fil)
E.g., I'd like to print all columns rather than just the first five; the call above produces the output below:
[31x16] DataFrame
valA valB valC valD valE ...
0: 578 8.30 491 7959 1.040000 ...
1: 577 8.30 291 7975 2.050000 ...
2: 466 16.7 179 6470 3.210000 ...
3: 592 9.03 194 8212 4.040000 ...
Without modifying the library there is nothing you can do.
If modifying the library is an option you have a few possibilities:
Change the name of the internal formatting function so it is exported and call that. This is a bit more work, since you need to explicitly call a function every time you want to print a DataFrame, but it is a reasonable option if you want to make minimal changes to the way the library works.
Basically change print to Print on lines 101 and 104 (I think those are the only occurrences of that function; if not the compiler will be happy to point out the others :P). A sketch of how you would then call it is shown below.
Change the arguments to df.print in the definition of df.String. This is positively trivial, but it has the effect of changing the default behaviour, which may or may not be a good thing.
For this option just change line 101 to return df.print(true, false, true, true, 10, 70, "DataFrame") or whatever combination fits your needs.
Add a new method for each printing format you want, and explicitly call these new methods. This is more work than #1 or #2, but some people may prefer it.
Personally, I would go with #1, but your question makes #2 sound more like what you want.
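For what it's worth, here is a hedged sketch of what option #1 looks like from the calling side. It assumes print has been renamed to Print inside dataframe.go; the argument list simply mirrors the df.print call quoted above and should be checked against the actual signature:

package main

import (
	"fmt"

	"github.com/kniren/gota/dataframe"
)

func main() {
	// Tiny stand-in for your 31x16 frame.
	fil := dataframe.LoadRecords([][]string{
		{"valA", "valB", "valC"},
		{"578", "8.30", "491"},
		{"577", "8.30", "291"},
	})

	// Default: String() goes through the unexported print with shortCols = true.
	fmt.Println(fil)

	// After option #1 (print renamed to Print in dataframe.go), the wide form
	// can be requested explicitly.
	fmt.Println(fil.Print(true, false, true, true, 10, 70, "DataFrame"))
}

With option #2 instead, the plain fmt.Println(fil) would print all columns by default.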
Consider the following method in the Juicer class:
Juicer >> juiceOf: aString
| fruit juice |
fruit := self gather: aString.
juice := self extractJuiceFrom: fruit.
^juice withoutSeeds
It generates the following bytecodes
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3
29 self ; 4
30 pushTemp: 1 ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2 ; 6 <-
33 pushTemp: 2 ; 7 <-
34 send: withoutSeeds
35 returnTop
Now note that 32 and 33 cancel out:
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3 *
29 self ; 4 *
30 pushTemp: 1 ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6 <-
33 send: withoutSeeds
34 returnTop
Next consider 28, 29 and 30. They insert self below the result of gather. The same stack configuration could have been achieved by pushing self before sending the first message:
25 self ; 1 <-
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 popIntoTemp: 1 ; 4 <-
30 pushTemp: 1 ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6
33 send: withoutSeeds
34 returnTop
Now cancel out 29 and 30
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 storeIntoTemp: 1 ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2 ; 5
32 send: withoutSeeds
33 returnTop
Temporaries 1 and 2 are written but not read. So, except when debugging, they could be skipped, leading to:
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop
This last version, which saves 4 out of 7 stack operations, corresponds to the less expressive and less clear source:
Juicer >> juiceOf: aString
^(self extractJuiceFrom: (self gather: aString)) withoutSeeds
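For what it's worth, the first rewrite above (cancelling an adjacent popIntoTemp:/pushTemp: pair into a storeIntoTemp:) is simple enough to express as a peephole pass. Here is a toy sketch in Go, purely illustrative and not Pharo/Squeak code, operating on the bytecodes as plain strings:

package main

import (
	"fmt"
	"strings"
)

// cancelPopPush replaces an adjacent "popIntoTemp: n" / "pushTemp: n" pair with
// a single "storeIntoTemp: n", which stores the top of stack without popping it.
func cancelPopPush(code []string) []string {
	var out []string
	for i := 0; i < len(code); i++ {
		if i+1 < len(code) && strings.HasPrefix(code[i], "popIntoTemp: ") {
			temp := strings.TrimPrefix(code[i], "popIntoTemp: ")
			if code[i+1] == "pushTemp: "+temp {
				out = append(out, "storeIntoTemp: "+temp)
				i++ // the following push is cancelled
				continue
			}
		}
		out = append(out, code[i])
	}
	return out
}

func main() {
	juiceOf := []string{
		"self", "pushTemp: 0", "send: gather:",
		"popIntoTemp: 1",
		"self", "pushTemp: 1", "send: extractJuiceFrom:",
		"popIntoTemp: 2", "pushTemp: 2",
		"send: withoutSeeds",
		"returnTop",
	}
	for _, b := range cancelPopPush(juiceOf) {
		fmt.Println(b) // matches the second listing above
	}
}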
Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining). These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?
EDIT
An interesting advantage of using a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes it easier to manipulate bytecodes for the sake of optimization. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., optimization that requires de-optimization to support the debugger) is actually necessary, or whether sufficient performance gains can be attained by non-destructive techniques.
The main annoyance when you start playing with such optimizations is the debugger interface.
Historically, and still currently in Squeak, the debugger simulates the bytecode level and needs to map the bytecodes to the corresponding Smalltalk instructions.
So I think the gain was too low to justify the added complexity, or, even worse, the degradation of the debugging facilities.
Pharo wants to change the debugger to operate at a higher level (the Abstract Syntax Tree), but I don't know how they will get back to the bytecode, which is all the VM knows about.
IMO, this kind of optimization might better be implemented in the JIT compiler which transforms bytecode to machine native code.
EDIT
The greatest gains are in eliminating the sends themselves (by inlining), because they are much more expensive (about 10x) than the stack operations: there are 10 times more bytecodes executed per second than sends when you run 1 tinyBenchmarks (Cog VM).
Interestingly, such optimizations could take place in the Smalltalk image, but only on hot spots detected by the VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/
So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and work in progress)!
All the machinery for de-optimizing when a method has to be debugged is still one of the difficult points, as I understand it.
I think a broader question is worth answering: are bytecodes worth the effort? Bytecodes were conceived as a compact and portable representation of code that is close to the target machine. As such, they are easy to interpret, but slow to execute.
Bytecodes do not excel at either of these games, and that usually makes them not the best choice whether you want to write an interpreter or a fast VM. On one hand, AST nodes are far easier to interpret (only a few node types vs. lots of different bytecodes). On the other hand, with the advent of JIT compilers, it became clear that running native code instead is not only possible but also much faster.
If you look at the most efficient VM implementations of JavaScript (which can be considered the most modern compilers of today) and also Java (HotSpot, Graal), you'll see they all use a tiered compilation scheme. Methods are initially interpreted from the AST, and only jitted when they become a hot spot.
At the highest tiers of compilation there are no bytecodes. The key component in a compiler is its intermediate representation, and bytecodes do not fulfill the required properties. The most optimizable IRs are much more fine-grained: they are in SSA form and allow specific representation of registers and memory. This allows for much better code analysis and optimization.
Then again, if you are interested in portable code, there isn't anything more portable than the AST. Besides, it's easier and more practical to implement AST-based debuggers and profilers than bytecode-based ones. The only remaining problem is compactness, but in any case you can implement something like ast-codes (coded ASTs, similar to bytecodes but representing the tree).
On the other hand, if you want full speed, then you'll go for a JIT with a good IR and no bytecodes. I think that bytecodes don't fill many gaps in today's VMs, but remain mostly for backwards compatibility (there are also some examples of hardware architectures that directly execute Java bytecodes).
There are also some cool experiments with the Cog VM related to bytecodes. But from what I understand, they transform the bytecode into another IR for optimizing, then convert back to bytecodes. I'm not sure whether there's a technical gain in the last conversion besides reusing the original JIT architecture, or whether there actually is any optimization at the bytecode level.
I'm trying to do some interesting orbit mechanics. I've found some related code in Fortran and I'm going through it line by line, moving it to Visual Basic. I can't understand what this is, though:
IF(ABS(EPW-TEMP2) .LE. E6A) GO TO 140
It's not a variable. I figure E6 might be 10^6, but what does the 'A' mean?
Thanks!
When I google that line of code, I end up with "Spacetrack Report No. 3", which includes the Fortran code listing. E6A is defined as 1.E-6 in the routine DRIVER (page 73):
DATA DE2RA,E6A,PI,PIO2,QO,SO,TOTHRD,TWOPI,X3PIO2,XJ2,XJ3,
1 XJ4,XKE,XKMPER,XMNPDA,AE/.174532925E-1,1.E-6,
2 3.14159265,1.57079633,120.0,78.0,.66666667,
4 6.2831853,4.71238898,1.082616E-3,-.253881E-5,
5 -1.65597E-6,.743669161E-1,6378.135,1440.,1.
I see this code has already been converted to Java and C, perhaps you should use those as reference.
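For reference, once E6A is known the quoted line is just a tolerance test; here is a minimal sketch in Go of what it amounts to (the variable names are kept from the Fortran, and the surrounding iteration and the GO TO target are omitted):

package main

import (
	"fmt"
	"math"
)

// E6A is simply the tolerance 1.0e-6 pulled out of the DATA statement above.
const E6A = 1.0e-6

// closeEnough mirrors IF(ABS(EPW-TEMP2) .LE. E6A) GO TO 140: the branch is
// taken once two successive values agree to within E6A.
func closeEnough(epw, temp2 float64) bool {
	return math.Abs(epw-temp2) <= E6A
}

func main() {
	fmt.Println(closeEnough(0.7853981, 0.7853984)) // true: difference is 3e-7
}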
I'm trying to understand the output of the gcov tool. Running it with no options makes sense, but I want to try to understand the branch coverage options. Unfortunately it's hard to make sense of what the branches do and why they aren't taken. Below is the output for a method (compiled using the latest LLVM/Clang build).
function -[TestCoverageAppDelegate loopThroughArray:] called 5 returned 100% blocks executed 88%
5: 30:- (NSInteger)loopThroughArray:(NSArray *)array {
5: 31: NSInteger i = 0;
22: 32: for (NSString *string in array) {
branch 0 taken 0
branch 1 taken 7
-: 33:
22: 34: }
branch 0 taken 4
branch 1 taken 3
branch 2 taken 0
branch 3 taken 3
5: 35: return i;
-: 36:}
I've run 5 tests through this, passing in nil, an empty array, an array with 1 object, an array with 2 objects, and an array with 4 objects. I can guess that in the first case branch 1 means "go into the loop", but I haven't a clue what branch 0 is. In the second case, branch 0 seems to be "loop again", branch 1 seems to be "end the loop", and branch 3 is "continue/exit the loop", but I have no idea what branch 2 is or why/when it would be executed.
If anyone knows how to decipher the branch info, or knows of any detailed documentation on what it all means, I'd appreciate the help.
Gcov works by instrumenting (at compile time) every basic block of machine instructions (think assembler). A basic block is a linear section of code with no branches and no labels inside it, so if you start executing a basic block, you will reach its end. Basic blocks are organized into a CFG (control flow graph, essentially a directed graph) that shows the relations between basic blocks (an edge from V1 to V2 means control can flow from V1 to V2). The profile-arcs mode of the compiler and gcov want an execution count for every line, and they get it by counting basic-block executions. Some edges in the CFG are instrumented and some are not, because there are algebraic relations between the counts of the blocks in the graph.
Your ObjC construct (for..in) is lowered (converted early in compilation) to several basic blocks. So gcov sees 4 branches, because it sees only the lowered basic blocks. It knows nothing about the lowering, but it knows which source line corresponds to every assembler instruction (this is debug info). So the branches are edges of the CFG.
If you want to see the basic blocks, do an assembler dump of the compiled program, disassemble the binary, or dump the CFG from the compiler. You can do this for both profile-arcs and non-profile-arcs modes and compare them.
The profile-arcs mode will have a lot of calls to and increments of something like "__llvm_gcov_ctr" or "__llvm_gcda_edge"; that is the actual instrumentation of the basic blocks.
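To make that concrete, here is a conceptual sketch in Go of what arc instrumentation amounts to for a loop like your for..in; the counters are made up for illustration and are not real gcov/LLVM output:

package main

import "fmt"

var edgeCounter [3]uint64 // hypothetical counters, one per instrumented CFG edge

func sum(values []int) int {
	edgeCounter[0]++ // edge: function entry -> loop header
	s := 0
	for _, v := range values {
		edgeCounter[1]++ // edge: loop header -> loop body ("branch taken")
		s += v
	}
	edgeCounter[2]++ // edge: loop header -> loop exit
	return s
}

func main() {
	sum([]int{1, 2, 3, 4})
	fmt.Println(edgeCounter) // [1 4 1]: entered once, looped 4 times, exited once
}

Gcov then reconstructs the per-line and per-branch counts you see in its output from edge counts like these.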