I'm working on my own toy programming language. For now I'm interpreting the source language directly from the AST, and I'm wondering what advantages compiling to bytecode and then interpreting that could provide me.
For now I have three things in mind:
Traversing the syntax tree hundreds of times may be slower than running instructions in an array, especially if the array supports O(1) random access (i.e. jumping 10 instructions up or down).
In a typed execution environment, I have some run-time costs because my AST is typed and I'm constantly traversing it (i.e. I have 10 kinds of nodes and I need to check which kind I'm on now in order to execute). Maybe compiling to an untyped bytecode could help, since after type-checking and compiling I would have untyped values and code.
Compiling to byte-code may provide better portability.
Are my points correct? What are some other motivations behind compiling to bytecode?
Speed is the main reason; interpreting ASTs is just too slow in practice.
Another reason to use bytecode is that it can be trivially serialized (stored on disk), so that you can distribute it. This is what Java does.
The point of generating byte code (or any other "easily interpreted" form such as threaded code) is essentially performance.
For an AST interpreter to decide what to do next, it needs to traverse the tree, inspect nodes, determine the types of nodes, check the types of any operands, verify legality, and decide which special case of the AST-designated operator applies (it says "+", but does it mean a 16-bit add or a string concatenation?), before it finally performs some action.
If one takes the final action and generates some kind of easily interpreted structure, then at "execution" time the interpreter can focus simply on performing actions without all that checking/special-case determination.
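To make that concrete, here is a minimal sketch in C (the Node type and NODE_* tags are made up for illustration) of the per-node checking an AST interpreter repeats on every visit:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { NODE_INT, NODE_ADD, NODE_CONCAT } NodeKind;

    typedef struct Node {
        NodeKind kind;              /* tag that must be inspected on every visit */
        int value;                  /* used when kind == NODE_INT                */
        struct Node *left, *right;  /* used by the binary operators              */
    } Node;

    int eval(const Node *n) {
        switch (n->kind) {          /* "which kind of node am I on?"             */
        case NODE_INT:
            return n->value;
        case NODE_ADD:              /* "+" found: still has to recurse and       */
            return eval(n->left)    /* re-check the kind of both operands        */
                 + eval(n->right);
        case NODE_CONCAT:           /* the string flavour of "+" would be yet    */
        default:                    /* another case to distinguish               */
            fprintf(stderr, "unsupported node kind %d\n", (int)n->kind);
            exit(1);
        }
    }

    int main(void) {
        Node two   = { NODE_INT, 2, NULL, NULL };
        Node three = { NODE_INT, 3, NULL, NULL };
        Node plus  = { NODE_ADD, 0, &two, &three };
        printf("%d\n", eval(&plus));   /* prints 5, after three tag checks */
        return 0;
    }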
Another recent excuse is that if you generate byte code for any of a number of well-known virtual machines (JVM, MSIL, Parrot, etc.) you don't even have to code the interpreter. For the JVM and MSIL, you also get the benefit of the JIT compilers associated with them, and with careful design of your language, compatibility with huge libraries, which are the real attraction of Java and C#.
Related to a previous question, where I define IR as a 3-address-code style representation (I realize that one can mean an AST representation by it as well).
It is my understanding that, when writing a best-practice compiler for an imperative language, code optimization happens both on the AST (probably best using a Visitor Pattern), and on the IR produced from the AST.
(a) Is that correct?
(b) Which type of optimization steps are best handled on the AST before even producing an IR? (reference to an article/a list online welcome too as long as it deals with an imperative language)
The compiler I'm working on is for Decaf (which some might know), which has a fairly deep CFG, up to (single) class inheritance; I'll add features that aren't part of it, such as type coercion. It will be completely hand-coded (using no tools whatsoever). This is not homework; I'm writing it for fun.
(a) Yes.
(b) Constant folding is one example; common-subexpression elimination (CSE) is another; in fact, almost anything to do with expression evaluation. IR-phase optimizations are more about what results from flow analysis.
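As a small illustration of the first of these (the Expr node type and fold function are invented for the example, not from any particular compiler), constant folding can be done as a bottom-up rewrite of the expression tree:

    #include <stdio.h>

    typedef enum { E_CONST, E_ADD, E_MUL } ExprKind;

    typedef struct Expr {
        ExprKind kind;
        int value;                   /* valid when kind == E_CONST */
        struct Expr *left, *right;
    } Expr;

    /* Fold bottom-up: if both children of an operator are constants,
       turn the operator node itself into a constant node.
       (Child nodes are simply abandoned in this sketch.) */
    Expr *fold(Expr *e) {
        if (e->kind == E_CONST)
            return e;
        e->left  = fold(e->left);
        e->right = fold(e->right);
        if (e->left->kind == E_CONST && e->right->kind == E_CONST) {
            int l = e->left->value, r = e->right->value;
            e->value = (e->kind == E_ADD) ? l + r : l * r;
            e->kind  = E_CONST;
            e->left  = e->right = NULL;
        }
        return e;
    }

    int main(void) {
        /* (2 * 3) + 4 */
        Expr two = { E_CONST, 2 }, three = { E_CONST, 3 }, four = { E_CONST, 4 };
        Expr mul = { E_MUL, 0, &two, &three };
        Expr add = { E_ADD, 0, &mul, &four };
        printf("%d\n", fold(&add)->value);   /* prints 10 */
        return 0;
    }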
IR is a form of AST (often it is "flattened", but there are deep tree IRs as well), and it may not be easy to distinguish one from the other, especially if the compiler is implemented as a sequence of very small rewrites from the original AST all the way down to a final IR suitable for instruction selection.
Optimisations may happen anywhere along this chain, but some representations are more suitable for a wide range of optimisations; most notably SSA form, which most modern compilers use for nearly all of their optimisations (in SSA every variable is assigned exactly once, so, for example, x = x + 1 becomes x2 = x1 + 1, which makes the data flow explicit).
It's never too early to optimise (to coin a phrase). So there are optimisations performed before and during AST creation, on the AST itself, on the IR (if you have one) and on the code as it is generated. In C-like languages and those that compile to machine code, the effort goes into the later stages. In compilers targeting a VM I think there is less room for improvements at that stage.
Some early optimisations obviously work better than others. I don't know much about Decaf, but there are the obvious things like constant folding and constant expression evaluation. If you get the whole program in tree form before you have to generate any code you can find common subexpressions, do code migration, eliminate dead code/dead stores, hoist invariants, eliminate tail recursion and some kinds of strength reduction.
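As a hand-written before/after illustration of two of those (not taken from any specific compiler), invariant hoisting and a simple strength reduction look like this when spelled out in C:

    #include <stddef.h>

    /* Before: x * factor is loop-invariant but recomputed every iteration. */
    void scale_before(int *out, const int *in, size_t n, int x, int factor) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] + x * factor;
    }

    /* After: the invariant is hoisted out of the loop, and the indexing is
       strength-reduced to pointer increments (additions instead of an
       index multiplication per element). */
    void scale_after(int *out, const int *in, size_t n, int x, int factor) {
        int k = x * factor;                    /* hoisted invariant */
        const int *p = in, *end = in + n;
        while (p < end)
            *out++ = *p++ + k;
    }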
A lot of it depends on how hard you want to work and what your target is. You didn't say much about that.
I've been thinking about this writing (apparently) by Mark Twain in which he starts off writing in English but throughout the text makes changes to the rules of spelling so that by the end he ends up with something probably best described as pseudo-German.
This made me wonder if there is an interpreter for some established language in which one has access to the interpreter itself, so that you can change the syntax and structure of the language as you go along. For example, if is often a keyword; is there a language that would let you change or redefine this on the fly? Imagine beginning a console session in one language and, by the end, working in another.
Clearly one could write an interpreter and run it, and perhaps there is no concrete distinction between doing this and modifying the interpreter. I'm not sure about this. Perhaps there are limits to the modifications you can make dynamically to any given interpreter?
These more open questions aside, I would simply like to know if there are any known interpreters that allow this at all? Or, perhaps, this ability is just a matter of extent and my question is badly posed.
There are certainly languages in which this kind of self-modifying behavior at the level of the language syntax itself is possible. Lisp programs can contain macros, which allow among other things the creation of new control constructs on the fly, to the extent that two Lisp programs that depend on extensive macro programming can look almost as if they are written in two different languages.

Forth is somewhat similar in that a Forth interpreter provides a core set of just a dozen or so primitive operations on which a program must be built in the language of the problem domain (frequently some kind of real-world interaction that must be done precisely and programmatically, such as industrial robotics). A Forth programmer creates an interpreter that understands a language specific to the problem he or she is trying to solve, then writes higher-level programs in that language.
In general the common idea here is that of languages or systems that treat code and data as equivalent and give the user just as much power to modify one as the other. Every Lisp program is a Lisp data structure, for example. This is in contrast to a language such as Java, in which a sharp distinction is made between the program code and the data that it manipulates.
A related subject is that of self-modifying low-level code, which was a fairly common technique among assembly-language programmers in the days of minicomputers with complex instruction sets, and which spilled over somewhat into the early 8-bit and 16-bit microcomputer worlds. In this programming idiom, for purposes of speed or memory savings, a program would be written with the "awareness" of the location where its compiled or interpreted instructions would be stored in memory, and could alter in place the actual machine-level instructions byte by byte to affect its behavior on the fly.
Forth is the most obvious thing I can think of. It's concatenative and stack based, with the fundamental atom being a word. So you write a stream of words and they are performed in the order in which they're written with the stack being manipulated explicitly to effect parameter passing, results, etc. So a simple Forth program might look like:
6 3 + .
That is the words 6, 3, + and . (the final full stop is itself a word). The two numbers push their values onto the stack. The plus symbol pops the top two items from the stack, adds them and pushes the result. The full stop outputs whatever is at the top of the stack.
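To make the stack manipulation explicit, here is a rough sketch in C of what those four words do to the stack (push, pop and the fixed-size stack array are illustrative, not real Forth internals):

    #include <stdio.h>

    static int stack[16];
    static int sp = 0;                    /* next free slot */

    static void push(int v) { stack[sp++] = v; }
    static int  pop(void)   { return stack[--sp]; }

    int main(void) {
        push(6);                          /* word: 6              */
        push(3);                          /* word: 3              */
        push(pop() + pop());              /* word: +              */
        printf("%d\n", pop());            /* word: .  (prints 9)  */
        return 0;
    }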
A fundamental part of Forth is that you define your own words. Since all words are first-class members of the runtime, in effect you build an application-specific grammar. Having defined the relevant words you might end up with code like:
red circle draw
That would draw a red circle.
Forth interprets each sequence of words as it encounters them. However, it distinguishes between compile-time words and ordinary words. Compile-time words do things like take a sequence of words, compile it and store it as a new word; that's the equivalent of defining subroutines in a classic procedural language. They're also the means by which control structures are implemented. But you can also define your own compile-time words.
As a net result a Forth program usually defines its entire grammar, including relevant control words.
You can read a basic introduction here.
Prolog is a homoiconic language, which allows meta-interpreters (MIs) to be written in a variety of ways. A meta-interpreter - an interpreter for the language written in the language itself - is a common and useful idiom in Prolog.
See this page for an introduction to the topic. An interesting and practical technique illustrated there is partial execution:
The overhead incurred by implementing these things using MIs can be compiled away using partial evaluation techniques.
This question is about definitions, semantics.
I understand the general concept of interpretation, translating source to machine code in real-time, or into an intermediate cache which is later "compiled" in real time or just before run time, etc.
Is there a semantic distinction made between the source-to-bytecode translation step and the bytecode-to-machine-code translation step? Do people typically refer to the first part as "interpretation" and the second step as "compilation"? Please don't misunderstand, I am not asking for a definition of compilation outside the scope of dynamic languages. That is another topic.
Additionally, is it futile to make a semantic distinction between these two steps, due to the large number of interpreters that implement so many different techniques?
Typically, interpretation means the execution of a program in an arbitrary form (plain source code, an abstract syntax tree (AST), bytecode, ...) by an interpreter.
Some virtual machines make heavy use of JITs (just in time compilers) which translate (compile) the intermediate representation of a program to native machine code. This is definitely a form of compilation.
Also, some VMs do several phases of compilation: At first, an AST is compiled to bytecode, which can later on be compiled to machine code.
I would say compilation basically means a transformation from one representation of a program to the next.
The steps an interpreter takes are usually programmed in a loop similar to the following (a concrete sketch in C is given below):
get next instruction
parse and interpret its components
dispatch its translation
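A concrete sketch of such a loop over a flat bytecode array might look like this in C (the opcodes and encoding are invented for illustration):

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    void run(const int *code) {
        int stack[64], sp = 0, pc = 0;
        for (;;) {
            int op = code[pc++];                     /* get next instruction */
            switch (op) {                            /* dispatch on it       */
            case OP_PUSH:  stack[sp++] = code[pc++]; break;   /* operand follows opcode */
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        /* roughly what "print 6 + 3" might compile to */
        int program[] = { OP_PUSH, 6, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);
        return 0;
    }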
The definitions and semantics of the language are only implemented by an interpreter; they are defined elsewhere.
The answer to your question lies in the formal, operational and axiomatic semantic definitions of the language being either interpreted or compiled. In both cases, the semantics of the formal language definition must be preserved and consistent for any interpretation or compilation regardless of the implementation techniques employed.
Implementations of languages such as interpreters and compilers are tested against test suites which test the implementation of each language construct in the language against its formal semantic definition.
A language designer generates the formal definition of a language in a symbolic form such as denotational semantics. This definition is very abstract from a mathematical point-of-view.
A compiler or interpreter implementer is more interested in the operational semantic definition of the language which is more directly related to building the compiler or interpreter to run on a target machine.
A user of the language is more interested in the axiomatic definition of the language which informs programmers how to use the language's constructs to create programs.
There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular HW platform. Looking for good books/research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target-language representation of the application. The usual difficulty is that the original compiler has, to a greater or lesser degree, optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
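A hand-made illustration of that last point (neither function is real decompiler output): if the original compiler performed common-subexpression elimination, the best a decompiler can usually reconstruct is something like the second version, invented temporary and parameter names included.

    /* What the programmer originally wrote: */
    int area_original(int w, int h, int border) {
        return (w + 2 * border) * (h + 2 * border);
    }

    /* What a decompiler might plausibly recover from the optimised object
       code: the shared subexpression 2 * border shows up as an explicit
       compiler-introduced temporary, and the original names are gone. */
    int area_decompiled(int param1, int param2, int param3) {
        int tmp0 = 2 * param3;
        return (param1 + tmp0) * (param2 + tmp0);
    }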
In cases where the source is for a well-defined VM, this optimisation is often left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code, allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. All of this helps to recover the original code.
As a counter, there are code obfuscators that deliberately transform the code to prevent simple restoration of the original source: they scramble names, change the sequence in which code is generated (without changing its resulting meaning) and introduce constructs for which there is no source-language equivalent.
What is Dynamic Code Analysis?
How is it different from Static Code Analysis (ie, what can it catch that can't be caught in static)?
I've heard of bounds checking and memory analysis - what are these?
What other things are checked using dynamic analysis?
-Adam
Simply put, static analysis collects information based on the source code, while dynamic analysis is based on executing the system, often using instrumentation.
Advantages of dynamic analysis
Is able to detect dependencies that cannot be detected by static analysis, e.g. dynamic dependencies introduced through reflection, dependency injection or polymorphism (see the sketch after this list).
Can collect temporal information.
Deals with real input data. During static analysis it is difficult or impossible to know what files will be passed as input, what web requests will arrive, what the user will click, etc.
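As a sketch of the reflection/dynamic-loading point above (the library and symbol names are hypothetical), a dependency loaded through dlopen/dlsym is invisible to a purely static view of the program, because the code to be run is chosen by a string at runtime:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        /* The name could come from a config file or user input, so a static
           analyser cannot tell which code will actually be loaded. */
        const char *name = "libplugin.so";                 /* hypothetical */
        void *handle = dlopen(name, RTLD_LAZY);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* Resolve a symbol by string at runtime, then call it. */
        void (*entry)(void) = (void (*)(void))dlsym(handle, "plugin_entry");
        if (entry)
            entry();
        dlclose(handle);
        return 0;
    }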
Disadvantages of dynamic analysis
May negatively impact the performance of the application.
Cannot guarantee full coverage of the source code, as its runs are based on user interaction or automated tests.
Resources
There are many dynamic analysis tools on the market, debuggers being the most notorious. On the other hand, it is still an academic research field: many researchers are studying how to use dynamic analysis for a better understanding of software systems, and there is an annual workshop dedicated to dependency analysis.
Basically, you instrument your code to analyze your software as it is running (dynamic) rather than just analyzing the software without running it (static). Also see this JavaOne presentation comparing the two. Valgrind is one example of a dynamic analysis tool for C. You could also use code coverage tools like Cobertura or EMMA for Java analysis.
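As a hand-rolled miniature of what "instrumenting" means (tools like Valgrind or the coverage tools above do this automatically and far more thoroughly; the counter below is purely illustrative):

    #include <stdio.h>

    static unsigned long call_count = 0;    /* fact recorded while running */

    static int square(int x) {
        call_count++;                       /* the instrumentation itself  */
        return x * x;
    }

    int main(void) {
        long total = 0;
        for (int i = 0; i < 1000; i++)
            total += square(i);
        printf("total=%ld, square() called %lu times\n", total, call_count);
        return 0;
    }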
From Wikipedia's definition of dynamic program analysis:
Dynamic program analysis is the analysis of computer software that is performed with executing programs built from that software on a real or virtual processor (analysis performed without executing programs is known as static code analysis). Dynamic program analysis tools may require loading of special libraries or even recompilation of program code.
You asked for a good explanation of "bounds checking and memory analysis" issues.
Our Memory Safety Check tool instruments your application to watch at runtime for memory access errors (buffer overruns, array subscript errors, bad pointers, alloc/free errors). The link contains a detailed explanation complete with examples. This SO answer shows two programs that have pointers into a dead stack frame, and how CheckPointer detects and reports the point of error in the source code.
A briefer example: C (and C++) infamously do not check accesses to arrays to see whether the access is inside the bounds of the array. The benefit: well-designed programs don't pay the cost of such a check in production mode. The downside: buggy programs can touch things outside the array, and this can cause behavior which is very hard to understand; thus the buggy program is difficult to debug.
What a dynamic instrumentation tool like the Memory Safety Checker does is associate some metadata with every pointer (e.g., the type of the thing to which the pointer points, and if it is an array, the array bounds), and then check at runtime any access via a pointer to an array to see whether the array bound is violated. The tool modifies the original program to collect the metadata where it is generated (e.g., on entry to scopes in which arrays are declared, or as the result of a malloc operation, etc.) and modifies the program at every array reference (written either as x[y], where one of x or y is an array pointer and the other value has some integral type, or as *(x+y)) to check the access. Now if the program runs and performs an out-of-bounds access, the check catches the error and it is reported at the first place where it could be detected. (If you think about it, you'll realize the instrumentation for metadata collection and checking has to be pretty clever to handle all the variant cases a language like C may have. It's actually hard to make this work completely.)
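A much-simplified sketch of that metadata idea (CheckedArray and checked_get are invented names for illustration, not the actual CheckPointer machinery):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int   *base;      /* where the array really lives          */
        size_t length;    /* metadata kept alongside the pointer   */
    } CheckedArray;

    static int checked_get(CheckedArray a, size_t i, const char *file, int line) {
        if (i >= a.length) {                         /* the runtime check */
            fprintf(stderr, "%s:%d: index %zu out of bounds (length %zu)\n",
                    file, line, i, a.length);
            abort();                                 /* report the first bad access */
        }
        return a.base[i];
    }

    int main(void) {
        int data[4] = { 10, 20, 30, 40 };
        CheckedArray a = { data, 4 };
        /* the instrumented program would have every a[i] rewritten into: */
        printf("%d\n", checked_get(a, 2, __FILE__, __LINE__));   /* fine   */
        printf("%d\n", checked_get(a, 7, __FILE__, __LINE__));   /* caught */
        return 0;
    }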
The good news is that such an access is now reported early, where it is easier to detect the problem and fix the program. Such a tool isn't intended for production use; one uses it during development and testing to help verify the absence of errors. If no errors are discovered, then one does a normal compile and runs the program without the checks.
This is an extremely good example of a dynamic analysis tool: the testing happens at runtime.
Bounds checking
This means runtime checks of array accesses. In contrast to C's laissez-faire approach to memory accesses and pointer arithmetic, other languages like Java or C# actually check whether a given array has the element one is trying to access.
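For a tiny illustration of the difference: C will compile and run the unchecked access below without complaint (it is undefined behaviour and may silently return garbage), whereas a Java or C# runtime would perform the equivalent of the commented check and raise an exception at exactly that point.

    #include <stdio.h>

    int main(void) {
        int a[4] = { 1, 2, 3, 4 };
        int i = 7;
        /* A checked language would effectively do:
           if (i < 0 || i >= 4) throw IndexOutOfBoundsException;  */
        printf("%d\n", a[i]);    /* C emits no such check */
        return 0;
    }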