Looking for a VBA/VB parser/compiler written in OCaml - vb.net

I am planning to write a compiler (including parser) in OCaml to parse and run VBA and/or VB programs. I have done this for simple imperative languages, but I am not sure how to handle the "object" features of VBA and/or VB...
Does anyone know of any existing work that I could draw inspiration from?

Not an OCaml solution (but OP asked):
Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery. It is intended to be a convenient foundation for custom software engineering tools for computer languages, with the goal being to help the tool engineer get his job done, rather than spending his time reinventing the wheel. In particular, many people think that getting a parser is the big part of the job. This is simply false. See Life After Parsing.
DMS has production front ends for many languages, both modern and legacy, including Visual Basic in its variety of dialects (VB6, VBA [essentially the same as VB6]) and VB.net.
By production I mean that they have been applied to real code systems of significant size and handle all the corresponding parsing issues. This is pretty hard for legacy languages, e.g., VB, especially the older dialects, because such languages are generally poorly documented (VB6 and VBA especially so). The only way to get this right is to build a draft parser, run it against reality, and revise until lots of code goes through sensibly. This often takes longer than doing the draft parser because it isn't easy to understand the errors (they're undocumented!); you have to decide whether they are real or the code base just has junk (more often than you'd think), guess what that means for the grammar, and try it all again.
These front ends as a minimum parse source code and build ASTs; they can also invert this process to regenerate legal compilable code with the comments back as source text files. The VisualBasic front ends do this. Some of our other front ends (C, C++, Java, COBOL) go further: name/type resolution, flow analysis, etc.; they do that by collecting key program facts from the language-specific AST and then apply DMS-supplied machinery to compute the results. This would be possible for VisualBasic, too, if such facts were useful.

For an example of a tiny OO language written in OCaml check out the source code for boa at: http://andrej.com/plzoo/.
The OO flavour is not class based though so I'm not sure how useful it will be.
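One common way an interpreter can represent "object" features without classes is to give every runtime object a table of slots and an optional parent to delegate to. The following is a minimal sketch of that idea, written in Python for brevity; it is not how boa or VBA is implemented, and the names `Obj`, `send`, `deposit`, and `account_class` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Obj:
    """A runtime object: a table of slots plus an optional parent to delegate to."""
    slots: Dict[str, object] = field(default_factory=dict)
    parent: Optional["Obj"] = None

    def lookup(self, name: str) -> object:
        # Walk the delegation chain until some object defines the slot.
        obj: Optional[Obj] = self
        while obj is not None:
            if name in obj.slots:
                return obj.slots[name]
            obj = obj.parent
        raise AttributeError(f"no member named {name!r}")


def send(receiver: Obj, name: str, *args: object) -> object:
    """Evaluate `receiver.name(args)`: find the member, then bind the receiver."""
    member = receiver.lookup(name)
    if callable(member):
        # Methods receive the original receiver, not the object that holds them.
        return member(receiver, *args)
    return member


def deposit(self: Obj, amount: int) -> int:
    # Writes always go to the receiver's own slots, so instances do not share state.
    self.slots["balance"] = self.lookup("balance") + amount
    return self.slots["balance"]


# A "class" is modelled as an ordinary object that holds defaults and methods.
account_class = Obj(slots={"balance": 0, "deposit": deposit})
# An instance is an empty object that delegates to it.
acct = Obj(parent=account_class)
```

Calling `send(acct, "deposit", 50)` stores the new balance on `acct` itself and leaves the default on `account_class` untouched. A class-based language such as VB would add a separate notion of class declarations on top of this, but method lookup can still be phrased as a walk up a chain.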

Related

interpreting a script through F#

I really like F# but I feel like it's not succinct and short enough. I want to go further. I do have an idea of how I'd like to improve it but I have no experience in making compilers so I thought I'd make it a scripting language. Then I realized that I could make it a scripting language and interpret it using F# but still get pretty much 100% performance thanks to F# having the inline option. Am I right? Is it really possible to make a script interpreter in F# that would go through my script and turn it into lots of functors and stuff and so get really good performance?
I really like F# but I feel like it's not succinct and short enough. I want to go further. I do have an idea of how I'd like to improve it but I have no experience in making compilers so I thought I'd make it a scripting language.
F# supports scripting scenarios via F# Interactive, so I'd recommend considering an internal DSL first, or suggesting features on the F# Language UserVoice page.
Then I realized that I could make it a scripting language and interpret it using F# but still get pretty much 100% performance thanks to F# having the inline option. Am I right?
Depending on the scenario, interpreted code may be fast enough, for example if 99% of your application's time is spent waiting on network, database or graphics rendering, the overall cost of interpreting the code may be negligible. This is less true for compute based operations. F#'s inline functions can help with performance tuning but are unlikely to provide a global panacea.
Is it really possible to make a script interpreter in F#
As a starting point, it is possible to write an interpreter for vanilla F# code. You could for example use F#'s quotation mechanism to get an abstract syntax tree (AST) for a code fragment or entire module and then evaluate it. Here's a small F# snippet that evaluates a small subset of F# code quotations: http://fssnip.net/h1
Alternatively you could design your own language from scratch...
Is it really possible to make a script interpreter in F# that would go through my script and turn it into lots of functors and stuff and so get really good performance?
Yes, you could design your own scripting language, defining an AST using the F# type system, then writing a parser that transforms script code into the AST representation, and finally interpreting the AST.
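The three steps above (define an AST, parse script text into it, interpret the AST) can be sketched for a tiny arithmetic language. This sketch is in Python rather than F#, uses a hand-written recursive-descent parser rather than any of the libraries mentioned in this answer, and all names (`Num`, `Var`, `BinOp`, `parse`, `evaluate`) are hypothetical.

```python
import operator
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Num:
    value: float


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, BinOp]


def tokenize(text: str) -> List[str]:
    # Simplification: characters that match no token are silently skipped.
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/()]", text)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self) -> Optional[str]:
        token = self.peek()
        self.pos += 1
        return token

    def expr(self) -> Expr:
        # expr := term (("+" | "-") term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = BinOp(op, node, self.term())
        return node

    def term(self) -> Expr:
        # term := atom (("*" | "/") atom)*
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = BinOp(op, node, self.atom())
        return node

    def atom(self) -> Expr:
        # atom := number | identifier | "(" expr ")"
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token[0].isdigit():
            return Num(float(token))
        if token[0].isalpha() or token[0] == "_":
            return Var(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text: str) -> Expr:
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(node: Expr, env: Dict[str, float]) -> float:
    """Interpret the AST by walking it recursively."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    return OPS[node.op](evaluate(node.left, env), evaluate(node.right, env))
```

With these definitions, `evaluate(parse("2 * (x + 3) - 4"), {"x": 1.0})` evaluates to `4.0`. Keeping the AST as plain data types is what lets the parser and the interpreter be swapped out independently later, for example by replacing the tree walker with a compiler.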
Parser
There are a number of options for parsing including:
active patterns & regex, for example evaluating cells in a spreadsheet
FsLex & FsYacc, for example to parse SQL
FParsec, a parser combinator library, for example to parse Small Basic
I'd recommend starting with FParsec, it's got a good tutorial, plenty of samples and gives basic error messages for free based on your code.
Small Examples
Here's a few simple example interpreters using FParsec to get you started:
Turtle - http://fssnip.net/nM
Minimal Logo language - http://fssnip.net/nN
Small Basic - http://fssnip.net/le
Fun Basic
A while back I wrote my own simple programming language with F#, based on Microsoft's Small Basic with interesting extensions like support for tuples and pattern matching. It's called Fun Basic, has an IDE with code completion and is available free on the Windows Store. The Windows Store version is interpreted (due to restrictions on emitting code) and the performance is adequate. There is also a compiler version for the desktop which runs on Windows, Mac and Linux.
Is it really possible to make a script interpreter in F#
So I guess the answer is YES. If you'd like to learn more, there's a free recording of a talk I did at NDC London last year on how to Write Your Own Compiler in 24 Hours.
I'd also recommend picking up Peter Sestoft's Programming Language Concepts book which has a chapter on building your own functional language.

Why do tools like yacc and ANTLR generate source code?

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.
There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied to end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and voilà, people build code-generating parser generators.
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
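To make the "library" alternative concrete, here is a minimal sketch of a grammar that is held as plain data and interpreted at parse time. It is a naive backtracking recognizer, not an Earley parser, it does not handle left-recursive rules, and the toy grammar and function names are hypothetical. Every rule lookup happens again on every parse, which is the kind of repeated work a generator does once, offline.

```python
from typing import Dict, Iterator, List, Sequence

Grammar = Dict[str, List[List[str]]]

# Hypothetical toy grammar: keys are nonterminals, anything else is a literal token.
GRAMMAR: Grammar = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["num", "*", "term"], ["num"]],
    "num": [["1"], ["2"], ["3"]],
}


def match(grammar: Grammar, symbol: str,
          tokens: Sequence[str], pos: int) -> Iterator[int]:
    """Yield every position where `symbol` can end if it starts at `pos`."""
    if symbol not in grammar:
        # Terminal: it matches only the identical token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    # Nonterminal: try each alternative in turn (this is the backtracking).
    for alternative in grammar[symbol]:
        yield from match_sequence(grammar, alternative, tokens, pos)


def match_sequence(grammar: Grammar, symbols: Sequence[str],
                   tokens: Sequence[str], pos: int) -> Iterator[int]:
    """Yield every end position for matching `symbols` one after another."""
    if not symbols:
        yield pos
        return
    for middle in match(grammar, symbols[0], tokens, pos):
        yield from match_sequence(grammar, symbols[1:], tokens, middle)


def recognizes(grammar: Grammar, start: str, tokens: Sequence[str]) -> bool:
    """True if the whole token sequence can be derived from `start`."""
    return any(end == len(tokens) for end in match(grammar, start, tokens, 0))
```

For example, `recognizes(GRAMMAR, "expr", "1 + 2 * 3".split())` is true, while the token list for `1 + * 3` is rejected.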
EDIT: Addendum. The historical reason is probably driven by space and time demands. Earlier systems had not a lot of room (32Kb in 1975), didn't run very fast (1 MIPS same time frame), and people had big source files already. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc.) and invoking actions (YYACCEPT, etc.). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd likely be into creating source code and compiling that into a DSO (dynamic shared object) or DLL (dynamic link library) and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit: The language will be non-English based, with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, ie compile to Java byte code.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
about semantic analysis: does it have to be manually written or are there some tools to create it?
how can the output from the lexer and parser be converted to Java byte code?
I know that there are libraries like ASM or BCEL but what is the exact procedure?
are there any frameworks for doing this? And if there is, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way: write an ad hoc parser (no ANTLR and the like), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and Clojure itself is then compiled for the JVM, so you get that for free).
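For the hand-written route, code generation for a stack machine such as the JVM usually comes down to a post-order walk of the AST: emit code for the operands, then emit the operator. The sketch below is only an illustration under stated assumptions. The instruction names (`PUSH`, `LOAD`, `ADD`, `MUL`) and the tuple-shaped AST are made up, and it runs on a toy interpreter rather than producing real class files, which would need a library such as ASM or BCEL.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical AST shape: ("num", 3), ("var", "x"), ("add", l, r), ("mul", l, r)
Node = tuple
Instr = Tuple[str, Optional[Union[int, str]]]


def compile_expr(node: Node, code: List[Instr]) -> List[Instr]:
    """Append stack-machine instructions for `node` to `code` (post-order walk)."""
    kind = node[0]
    if kind == "num":
        code.append(("PUSH", node[1]))
    elif kind == "var":
        code.append(("LOAD", node[1]))
    elif kind in ("add", "mul"):
        compile_expr(node[1], code)   # left operand ends up deeper on the stack
        compile_expr(node[2], code)   # right operand ends up on top
        code.append((kind.upper(), None))
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    return code


def run(code: List[Instr], variables: Dict[str, int]) -> int:
    """A toy virtual machine that executes the instructions above."""
    stack: List[int] = []
    for opcode, arg in code:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "LOAD":
            stack.append(variables[arg])
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    return stack.pop()


# x + 2 * 3
tree = ("add", ("var", "x"), ("mul", ("num", 2), ("num", 3)))
program = compile_expr(tree, [])
```

Here `program` is the instruction list `LOAD x, PUSH 2, PUSH 3, MUL, ADD`, and `run(program, {"x": 4})` returns `10`. The emitted sequence has the same shape as JVM instructions like `iload`, `imul` and `iadd`, which is why expression-heavy DSLs map onto the JVM fairly directly.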
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot which is clear that can be helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this means, go read about LL(1) grammars as an example.)
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct and analyze control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
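As a small illustration of the symbol-table and analysis items in this list, here is a minimal sketch of a hand-written semantic pass that records declarations in nested scopes and reports uses of undeclared names. The statement shapes (`"declare"`, `"use"`, `"block"`) and the names `Scope` and `check` are hypothetical; a real DSL would walk its own AST node types instead.

```python
from typing import Dict, List, Optional


class Scope:
    """One level of a symbol table; lookups fall back to the enclosing scope."""

    def __init__(self, parent: Optional["Scope"] = None) -> None:
        self.symbols: Dict[str, str] = {}
        self.parent = parent

    def declare(self, name: str, type_name: str) -> None:
        self.symbols[name] = type_name

    def resolve(self, name: str) -> Optional[str]:
        scope: Optional[Scope] = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None


def check(statements: List[tuple], scope: Scope, errors: List[str]) -> List[str]:
    """Walk the statements in order, filling the symbol table and collecting errors."""
    for stmt in statements:
        kind = stmt[0]
        if kind == "declare":            # ("declare", name, type_name)
            scope.declare(stmt[1], stmt[2])
        elif kind == "use":              # ("use", name)
            if scope.resolve(stmt[1]) is None:
                errors.append(f"undeclared variable: {stmt[1]}")
        elif kind == "block":            # ("block", [statements])
            # A nested block gets its own scope that can see the outer one.
            check(stmt[1], Scope(parent=scope), errors)
        else:
            raise ValueError(f"unknown statement kind: {kind!r}")
    return errors


example = [
    ("declare", "x", "int"),
    ("block", [("use", "x"), ("declare", "y", "int"), ("use", "y")]),
    ("use", "y"),                        # y was local to the block above
]
```

Running `check(example, Scope(), [])` returns a single error for the last use of `y`, because `y` was declared only inside the nested block.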
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

Can PMD be customized to fully support a new language?

Can PMD be customized to fully support a new language in a reasonable amount of time? I know that technically almost anything can be done, but I'm wondering whether this can be done in a reasonable amount of time, e.g. < 2 weeks.
This page mentions how to write a CPD parser http://pmd.sourceforge.net/cpd-parser-howto.html
But is this just for copy/paste detection? Does writing a CPD parser give me full support of PMD in terms of rule sets?
I would guess not, but I'm not a PMD expert (and I have my own bias, check my bio).
The issues are:
Can you define a syntax for your language quickly? (Maybe, depending on how good you are, how messy the language is, and the strength of the parsing machinery offered by PMD.)
Can you define the semantics of your language so that the "semantic checks" provided by PMD work? You have to do this, because syntax tells you (and a tool) literally nothing about the semantics of the syntax. I would guess that the PMD tool's "semantic checks" are pretty wired into the precise details of Java; if your language matched Java perfectly, this would be zero work. But it doesn't, or you wouldn't be asking the question. And language semantics differences, even minor ones, cause discontinuous changes to the interpretation of the code. Before you get to doing even "serious" semantics, you're likely to have to build a symbol table mapping identifiers in the code to declarations (and the "semantic" type) for those symbols. Based on the tool infrastructure I work with, this step alone takes 1-2 months for a real language.
Lastly, you are likely to have to code special PMD checks that are specific to your language. That takes time and energy, too.
I build generic compiler-type machinery (parsers, flow analyzers, style/error checkers) and get asked the equivalent of this question all the time with respect to our machinery. We try to have a lot of machinery available, try to make it easy to integrate new languages, and we've been working on trying to make this "convenient and fast" for 15+ years. It's still not convenient, and there's no way to do this with our tools in a few weeks. I doubt PMD is better.

Which scripting language to support in an existing codebase?

I'm looking at adding scripting functionality to an existing codebase and am weighing up the pros/cons of various packages. Lua is probably the most obvious choice, but I was wondering if people have any other suggestions based on their experience.
Scripts will be triggered upon certain events and may stay resident for a period of time. For example upon startup a script may define several options which the program presents to the user as a number of buttons. Upon selecting one of these buttons the program will notify the script where further events may occur.
These are the only real requirements;
Must be a cross-platform library that is compilable from source
Scripts must be able to call registered code-side functions
Code must be able to call script-side functions
Be used within a C/C++ codebase.
Based on my own experience:
Python. IMHO this is a good choice. We have a pretty big code base with a lot of users and they like it a lot.
Ruby. There are some really nice apps such as Google Sketchup that use this. I wrote a Sketchup plugin and thought it was pretty nice.
Tcl. This is the old-school embeddable scripting language of choice, but it doesn't have a lot of momentum these days. It's high quality though, they use it on the Hubble Space Telescope!
Lua. I've only done baby stuff with it but IIRC it only has a floating point numeric type, so make sure that's not a problem for the data you will be working with.
We're lucky to be living in the golden age of scripting, so it's hard to make a bad choice if you choose from any of the popular ones.
I have played around a little bit with SpiderMonkey. It seems like it would at least be worth a look in your situation. I have heard good things about Lua as well. The big argument for using JavaScript as the scripting language is that a lot of developers know it already and would probably be more comfortable from the get-go, whereas Lua most likely would have a bit of a learning curve.
I'm not completely positive, but I think that SpiderMonkey meets your 4 requirements.
I've used Python extensively for this purpose and have never regretted it.
Lua has the most straightforward C API for binding into a code base that I've ever used. In fact, I usually roll bindings for it quickly by hand, whereas for other languages you often wouldn't consider doing so without a generator like SWIG. Also, it's typically faster and more lightweight than the alternatives, and coroutines are a very useful feature that few other languages provide.
AngelScript
lets you call standard C functions and C++ methods with no need for proxy functions. The application simply registers the functions, objects, and methods that the scripts should be able to work with and nothing more has to be done with your code. The same functions used by the application internally can also be used by the scripting engine, which eliminates the need to duplicate functionality.
For the script writer the scripting language follows the widely known syntax of C/C++ (with minor changes), but without the need to worry about pointers and memory leaks.
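The register-then-call pattern described here, which also covers the question's two calling requirements, can be sketched independently of any particular engine. The following is a conceptual illustration in Python only, not AngelScript's API and not how a C/C++ host would embed a language. `ScriptHost`, `add_button`, `on_startup` and `on_button` are hypothetical names, and `exec` runs the script with no sandboxing at all.

```python
from typing import Callable, Dict, List


class ScriptHost:
    """Toy host: exposes registered functions to a script and calls back into it."""

    def __init__(self) -> None:
        self.exports: Dict[str, Callable] = {}   # host functions visible to scripts
        self.namespace: Dict[str, object] = {}   # globals of the loaded script
        self.buttons: List[str] = []             # host-side state scripts may change

    def register(self, name: str, func: Callable) -> None:
        """Make a host function callable from script code under `name`."""
        self.exports[name] = func

    def load(self, source: str) -> None:
        """Run the script once so that its function definitions take effect."""
        self.namespace = dict(self.exports)
        exec(source, self.namespace)

    def call(self, name: str, *args: object) -> object:
        """Call a function that the script defined, e.g. an event handler."""
        handler = self.namespace.get(name)
        if not callable(handler):
            raise LookupError(f"script does not define {name!r}")
        return handler(*args)


SCRIPT = '''
def on_startup():
    add_button("Open")      # script calls a registered host function
    add_button("Quit")

def on_button(label):
    return "selected " + label
'''

host = ScriptHost()
host.register("add_button", host.buttons.append)
host.load(SCRIPT)
host.call("on_startup")     # host calls a script-side function
```

After `on_startup` runs, `host.buttons` holds the two labels the script asked for, and `host.call("on_button", "Open")` returns the script's response. Because the loaded script's namespace stays alive between calls, this also matches the requirement that scripts stay resident across events.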
The original question described Tcl to a "T".
Tcl was designed from the beginning to be an embedded scripting language. It has evolved to be a first-class dynamic language in its own right but is still used all over the world as an embedded language. It is available under the BSD license, so it is just about as free as it gets. It also compiles on pretty much any modern platform, and many not-so-modern ones. And not only does it work on desktop systems, there are variations available for mobile platforms.
Tcl excels as a "glue" language, where you can write performance-intensive functions in C while still benefiting from the advantages of a scripting language for less performance critical parts of the application.
Tcl also comes with a first class GUI toolkit (Tk) that is arguably one of the easiest cross platform GUI toolkits available. It also interfaces very nicely with SQLite and other databases, and has had built-in support for unicode for quite some time.
If the scripting interface will be made available to your customers (as opposed to simply enabling your own engineers to work at the scripting level), Tcl is extremely easy to learn as there are a total of only 12 rules that govern the entire language (as of tcl 8.6). In fact, Tcl shines as a way to invent domain specific languages which is often how it is used as an end-user scripting solution.
There were some excellent suggestions already, but I just wanted to mention that Perl can also be called / can call to C/C++.
You probably could use any modern scripting / bytecode language.
If you're willing to put up with the growing pains of a new product, you could use the Parrot VM, which has support for many, if not all, of the languages listed on this page. Unfortunately it's not done yet, but that hasn't stopped some people from using it in a production environment.
I think most people are probably mentioning the scripting language that they are most familiar with. From my perspective, Tcl was designed specifically to interface with C, so your problem domain is tailor-made for the language. However, I'm sure Python, Perl, or Lua would be fine. You should probably choose the language that is most familiar to your current team, since that will reduce the learning time.