I am trying to design a compiler for a language like C# in ANTLR. But I don't fully comprehend the proper order of steps that should be undertaken.
This is how I see it:
First I define Lexer tokens
Then grammar rules (with rewrite rules to build AST) with actions that gather informations about classes and methods declarations (so that I can resolve method invocations in the next step)
Finally, I create "tree grammar" which traverse AST tree and invokes rules that generate the opcodes of (virtual) machine language.
Is this correct? Is the second step's role reading methods' declarations and building AST?
How can I resolve overloaded methods' declarations without build AST ? (backpatching?)
Have a look at Language Implementation Patterns it explains how to create your own languages (both interpreted and byte-code/VM-like). At the moment, your questions are too broad, and I don't think anyone is able to post an answer in a forum that explains all the details of how to create your own language from start to finish.
Feel free to ask more specific questions when you have them, of course.
Good luck!
Related
I have an ANTLR ParseTree for an sql grammar.
Example :
My goal is to edit this tree so that i can delete all the middle (booleanExpression, predicated, valueExpression, primaryExpression ) nodes inbetween.
I have explored visitor and listenors but they don’t generate the tree for me. And i'd like to not touch the grammar since its the official source one.
So how can i do it?
Thanks
There is no such feature in ANTLR's API (removing/mutating the ParseTree). You'll have to walk the tree yourself and create a copy of the ParseTree and ignore certain sub-trees you do not want/need.
I’ve usually found the ParseTree to be too cumbersome (for exactly the reason you’ve shown). I don’t know of any automated way to alter the tree in memory. ( Could be an interesting project to attempt.)
I a couple of implementations I’ve written some generalized approach to make the process easier. The “best” one was where I wrote a small DSL to define the structure I wanted. It generated the classes as well as ANTLR style visitors and listeners for those. I then used a listener to transform the ANTLR ParseTree to my ideal tree, and wrote the rest of my code using that structure with that, much simpler, tree. You can have a listener/visitor generate a tree of your own design as an artifact of processing the ParseTree, but that’s as close as I’ve come.
It was actually a relatively minor effort to set that up early in the process and accounted for a quite small percentage of the overall effort of implementing the language (so, well worth the effort).
A few questions scratch an itch around perl6 grammars and raster (binary in general) data. For what I understand, the text approach is to work at the grapheme-level trhough grammars, may we approach raster data that way ? Can we make custom grapheme definition to approach raster data or a basic unit of binary data to parse them using Grammars ?
Seeing that perl6 is defined by perl6 grammars, can we define similar grammars as kind of "validation" test with a basic case being if the grammar can parse the data, the data is well-formed and is structurally validated ? Using this approach for text data, it is kind of obvious with grammars as the basic unit are text-oriented but can we customize those back-end definition (by example, it's possible to overwrite the :sigspace to make rules and tokens parse with a another separator for grapheme) to enable the power of grammars in the binary data territory ?
Thanks!
For the background part:
During the past few weeks, I begin to learn-ish Perl6 by personal interest. After seeing this talk at FOSDEM 2019 and I begin to ask myself (and the people around me) about using using grammars to inspect/parse binary data. My usecase will be for example to replicate the Cloud Optimized Geotiff validator without the support of a GDAL binding (I didn't see one yet in perl6). It's clearly a learning project for me.
The Spec for Cloud Optimized Geotiff
For now, the basic idea is to parse the binary structure with the help of perl6 grammars if it possible as a first basic step, hoping to be able to inspect the data and metadata as a main goal.
Note : Not native speaker, if some parts need rewriting/precisions feel free to point out.
As only comments where posted, I will summarize all the answers I got from the comments here, my further research and the #perl6 IRC chatroom.
Concerning the support of binding for X library (in the test case, it was GDAL), the strategy inside the perl6 community is to either leverage :
Use the Inline::Foo modules aiming at launching and accessing the ecosystem of the Foo language (by example : Inline::Perl5, Inline::Python and so on). List of Inline::X modules from the Perl6 Module Directory ;
Use or write a binding using NativeCall to bind to dynamic libraries who follow the C Calling Covention ;
Use or write a native perl6 module.
Concerning the parsing of binary data, I'll split the subject in two parts :
Generally speaking ;
Leveraging Grammars ;
1. Generally speaking
Leveraging the P5pack module or using Inline::Perl5 to use the unpack/pack is actually (with perl6.c) the best to parse binary data structure (the former seems favoraed as it's native module).
Go to see first comment from #raiph to a SO anwser showing a basic use case.
2. Leveraging the grammars
With perl6.c, grammars can only parse text.
However, the question about parsing binary data seems to be moderatly hot (based on feedbacks seen on the #perl6 irc channel) and a few to document, yet not implemetend, seems to pave the way with a hope to see it happens in a future (near or distant?).
The last part of the #raiph's anwser list a lot of ressources aiming at that direction. Moreover, in the Synopses 05 - Regexes and Rules : line 432 a :bytes modifier is evoked. We will have to see at which point those modifiers will be implemented and what is missing to bring them to the language.
On the #perl6 irc channel, MasterDuke said « also, i think the nqp binary reading/writing ops that jnthn recently specced and nine implemented were a prerequisite for anything further ». I still have to investigated what exactly he is talking about but it seems to go to the good direction.
One of the main point, IMO, is the related to the grapheme definition based on the UTF-8 one. If we were able to overwrite the grapheme definition to a custom one for specialized grammar as we can for now overwrite the :sigspace modifier to affect what is the separators for rulesand tokens, we will access a new way to operate around data structure and grammars. For now, the grapheme is defined in the string-level not the grammar-level or meta. See #timotimo comments linking to the UTF-8 document describing the Grapheme Cluster Boundary Rules.
A way to bend the rules was linked by #jjmererlo : Parsing GFX3 format with perl6 grammars.
For syntax there is the EBNF ISO 14977 standard.
for runtime we have CLI ISO 23271 standard
see also Simple definition of "semantics" as it is commonly used in relation to programming languages/APIs?
but how to describe the transition from EBNF to CLI specs in declarative way?
i.e. is it enough to use the S-attributed grammar? Which standard define the syntax of such grammar?
There are many ways to define the semantics of a language. All of them have to express somehow the relationship between the program text and "what it computes".
A short but incomplete list of basic techniques:
Define an interpreter ("operational semantics")
Define a map from the source code to an enriched lambda calculus ("denotational semantics")
Define a map from the source code to another well-defined language ("transformational semantics")
Essentially, these are computations defined over the source text of a program instance.
You can implement these computations in many different ways. One way to implement them might be "S-attributed" grammars, although why you would want to restrict yourself to only S-attributes rather than a standard attributed grammar with inherited attributes is beyond me.
Given that there are so many ways to do this, I doubt you are going to find a standard. Certainly the programming langauge committees aren't using one. Heck, they won't even use a standard for BNF.
I am doing a little research on source to source compilation but now that I am getting an understanding of Source to Source compilation. I am wondering are there any examples of API's for these source to source compilers.
I mean an Interface Descriptor to pass the source code of one programming language to another compiler to be compile? Please if so can you point me to these examples or could you give me tips (Just pure explanation) on writing one am still in research okay.
Oh I should note I am researching this for several days an I have came across things such as ROSE, DMS and LLVM. As said its purely research so I dont know whats the best approach I know I wouldn't use ROSE for it is only for C/C++. LLVMs' seems promising but I am new to LLVM. Oh my aim is to create a transpiler for 4 language support (Is that feasible). Which is why I just need expert Advice :)
Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific the language being translated; too much of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
You can instead define them using a program transformation system (PTS) using source to source transformations based on surface syntax; these are patterns written using the notation of your to-be-compiled language, and your target-language, in the form of "if you see this, then replace it by that", operating on syntax trees not text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.
One such translation rule might look like:
rule tranlate_add_to(t: access_path, u: access_path):COBOL -> Java
" add \t to \u "
-> " \object_for\(\u\).\u += \object_for\(\t\).\t; ";
with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing corresponding Java code (that). This rule uses a helper function "object_for" to decide where in a target Java program, a global variable in a the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating Java to COBOL. You can argue about how sophisticated). In practice, the way such a rule works is the pattern ASTs of each side are constructed, and then the patterns are matched against a parsed AST; a match causes the corresponding subtree to be spliced into place where the match was found. (All this low level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS).
In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
Many PTS are purely based on source-to-source surface syntax rewrites. We have found that combining both PTS and a procedural API, and making it possible to segue between them makes for very nice tool: you can use the rewrites where convenient, and procedural APIs where they don't work so well (the "object_for" function suggested above is easier to code as a procedure).
See lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is code in DMS style), in a language agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API as OP requested with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and only code as a little as necessary agains the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.
Don't underestimate the amount of effort to build such translators, even with a lot of good technical machinery as a foundation. Langauges tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.
I have been using ROSE compiler framework to write a source to source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformation and analysis on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.
If ROSE does not support parsing your input language, you can write your own parser while utilizing ROSE's SageBuilder API to build the AST. If your target language is one of the languages which ROSE supports, then you can rely on ROSE's unparser to get the target code. But if ROSE does not support your target language, then you can write your own unparser as well using different AST traversal mechanism provided by ROSE.
I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit :The language will be Non-English based with a very simple syntax .
I am thinking of targeting the Java Virtual Machine, ie compile to Java byte code.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
about semantic analysis: does it have to be manually written or are there some tools to create it?
how can the output from the lexer and parser be converted to Java byte code?
I know that there are libraries like ASM or BCEL but what is the exact procedure?
are there any frameworks for doing this? And if there is, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way - write an ad hoc parser (no antlr and alike), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrary complex DSL on top of Clojure macros (and Clojure itself is then translated into JVM, you'll get it for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target languate share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!). (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot which is clear that can be helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced
by the weakness of the parsing machinery? (If you don't know what this is, go read about LL(1) grammmars as an example).
Automatic construction of a representation (e.g, an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require
informatoin from "far away" in the tree, to influence the translation at a particular point in the tree
Ability to reogranize the AST easily to achieve local optimizations
Ability to consturct/analysis control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't have the moon). You can see a complete, all-in-one-"page", simple DSL example that exercises some ineresting parts of this machinery.