I am trying to implement an algorithm for the procedural generation of buildings for my thesis. I've been reading a lot about shape/split grammars. Here's the link to the most popular paper covering that topic.
I managed to implement a very basic grammar without attributes and guard conditions. It can generate primitives and do geometric transformations. I'm using flex and bison to parse shape-grammar files and to generate objects (Symbols, Rules, etc.) that represent the given grammar in an object-oriented manner and can later be called to generate geometry.
But now I'm stuck with the attribute part. For example:
fac(h) : h > 9; floor(h/3) floor(h/3) floor(h/3)
I am clueless about how to represent the grammar so that it contains information about attributes, how to pass values to the symbols on the left, and how to evaluate the condition.
Can anyone help me with this, please? I'm using C++.
Note: I have some knowledge of grammars and parsers, and I know how to implement a top-down parser with attributes using recursive descent, but that approach is useless here. I can't generate source code for the functions, because the grammar files will be interpreted in the same runtime as the application. Even if I could, this is not a parser but a generator of sentences, and there are many more problems, such as condition evaluation and production selection.
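To make the question more concrete, here is a minimal sketch of the kind of runtime representation I imagine (all class names are placeholders I made up): attribute expressions are kept as small expression trees built while parsing the grammar file, and evaluated against the actual parameter values whenever a rule is applied.

    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Hypothetical sketch: expressions such as h/3 are stored as small trees
    // and evaluated only when a rule is applied, so nothing has to be compiled.
    struct Expr {
        virtual ~Expr() = default;
        // 'env' maps formal parameter names (e.g. "h") to the actual values
        // of the symbol currently being rewritten.
        virtual double eval(const std::map<std::string, double>& env) const = 0;
    };

    struct Const : Expr {
        double v;
        explicit Const(double v) : v(v) {}
        double eval(const std::map<std::string, double>&) const override { return v; }
    };

    struct Param : Expr {
        std::string name;
        explicit Param(std::string n) : name(std::move(n)) {}
        double eval(const std::map<std::string, double>& env) const override {
            return env.at(name);
        }
    };

    struct Div : Expr {  // one binary operator shown; the others look the same
        std::unique_ptr<Expr> lhs, rhs;
        Div(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
            : lhs(std::move(l)), rhs(std::move(r)) {}
        double eval(const std::map<std::string, double>& env) const override {
            return lhs->eval(env) / rhs->eval(env);
        }
    };

    // A successor such as floor(h/3): a symbol name plus argument expressions.
    struct SymbolTemplate {
        std::string name;
        std::vector<std::unique_ptr<Expr>> args;
    };

    // A rule such as  fac(h) : h > 9 ; floor(h/3) ...
    struct Rule {
        std::string predecessor;                   // "fac"
        std::vector<std::string> params;           // {"h"}
        std::unique_ptr<Expr> guardLhs, guardRhs;  // h > 9 (only '>' handled here)
        std::vector<SymbolTemplate> successors;

        bool guardHolds(const std::map<std::string, double>& env) const {
            return guardLhs->eval(env) > guardRhs->eval(env);
        }
    };

    int main() {
        // Build  fac(h) : h > 9 ; floor(h/3)  by hand (bison would normally do this).
        Rule r;
        r.predecessor = "fac";
        r.params = {"h"};
        r.guardLhs = std::make_unique<Param>("h");
        r.guardRhs = std::make_unique<Const>(9);
        SymbolTemplate s;
        s.name = "floor";
        s.args.push_back(std::make_unique<Div>(std::make_unique<Param>("h"),
                                               std::make_unique<Const>(3)));
        r.successors.push_back(std::move(s));

        // Apply the rule to an instance fac(12): bind h = 12, check the guard,
        // then evaluate the successor's argument expressions.
        std::map<std::string, double> env{{"h", 12.0}};
        if (r.guardHolds(env))
            for (const auto& succ : r.successors)
                std::cout << succ.name << "(" << succ.args[0]->eval(env) << ")\n";
    }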
For syntax, there is the EBNF standard, ISO 14977.
For runtime, we have the CLI standard, ISO 23271.
See also: Simple definition of "semantics" as it is commonly used in relation to programming languages/APIs?
But how do we describe the transition from the EBNF to the CLI specs in a declarative way?
That is, is it enough to use an S-attributed grammar? Which standard defines the syntax of such a grammar?
There are many ways to define the semantics of a language. All of them have to express somehow the relationship between the program text and "what it computes".
A short but incomplete list of basic techniques:
Define an interpreter ("operational semantics")
Define a map from the source code to an enriched lambda calculus ("denotational semantics")
Define a map from the source code to another well-defined language ("transformational semantics")
Essentially, these are computations defined over the source text of a program instance.
You can implement these computations in many different ways. One way to implement them might be "S-attributed" grammars, although why you would want to restrict yourself to only S-attributes rather than a standard attributed grammar with inherited attributes is beyond me.
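For illustration only (a toy sketch, not tied to any standard), "S-attributed" means that every attribute is synthesized, i.e. computed purely from the attributes of a node's children, so a single bottom-up pass over the tree suffices:

    #include <memory>
    #include <vector>

    // Toy sketch of S-attributed evaluation over an AST.
    struct Node {
        enum Kind { Num, Add, Mul } kind;
        int value = 0;                               // used when kind == Num
        std::vector<std::unique_ptr<Node>> kids;
    };

    // The synthesized attribute of a node depends only on its children.
    int synthesize(const Node& n) {
        switch (n.kind) {
            case Node::Num: return n.value;
            case Node::Add: return synthesize(*n.kids[0]) + synthesize(*n.kids[1]);
            case Node::Mul: return synthesize(*n.kids[0]) * synthesize(*n.kids[1]);
        }
        return 0;
    }
    // With inherited attributes, information also flows downward, e.g. a type
    // environment or enclosing scope passed as an extra argument to this function.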
Given that there are so many ways to do this, I doubt you are going to find a standard. Certainly the programming language committees aren't using one. Heck, they won't even use a standard for BNF.
I am doing a little research on source-to-source compilation, and now that I am getting an understanding of it, I am wondering whether there are any examples of APIs for these source-to-source compilers.
I mean an interface descriptor for passing the source code of one programming language to another compiler to be compiled. If so, can you please point me to such examples, or give me tips (just pure explanation) on writing one? I am still in the research stage.
I should note that I have been researching this for several days and have come across things such as ROSE, DMS, and LLVM. As said, it is purely research, so I don't know what the best approach is. I know I wouldn't use ROSE, since it is only for C/C++. LLVM seems promising, but I am new to it. My aim is to create a transpiler with support for 4 languages (is that feasible?), which is why I just need expert advice. :)
Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific to the language being translated; too many of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
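As a rough illustration only (the names below are invented and do not belong to any particular tool), such a procedural API tends to boil down to something like:

    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <ostream>
    #include <string>
    #include <vector>

    // Rough illustration only; all names are invented, not any real tool's API.
    struct AstNode {
        std::string kind;                                  // e.g. "assign", "id"
        std::string text;                                  // token text for leaves
        std::vector<std::shared_ptr<AstNode>> children;
    };

    // "parse file to AST" -- would sit on top of a real parser; stubbed here.
    std::shared_ptr<AstNode> parseFile(const std::string& /*path*/) {
        return std::make_shared<AstNode>();
    }

    // "visit tree nodes" (preorder traversal)
    void visit(const std::shared_ptr<AstNode>& n,
               const std::function<void(AstNode&)>& fn) {
        if (!n) return;
        fn(*n);
        for (auto& child : n->children) visit(child, fn);
    }

    // "inspect tree nodes"
    bool isKind(const AstNode& n, const std::string& kind) { return n.kind == kind; }

    // "modify tree nodes"
    void replaceChild(AstNode& parent, std::size_t i, std::shared_ptr<AstNode> repl) {
        parent.children[i] = std::move(repl);
    }

    // "spit out text"
    void prettyPrint(const AstNode& n, std::ostream& out) {
        if (!n.text.empty()) out << n.text << ' ';
        for (const auto& child : n.children) prettyPrint(*child, out);
    }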
You can instead define them using a program transformation system (PTS), using source-to-source transformations based on surface syntax; these are patterns written using the notation of your to-be-compiled language and your target language, in the form of "if you see this, then replace it by that", operating on syntax trees rather than text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.
One such translation rule might look like:
rule translate_add_to(t: access_path, u: access_path):COBOL -> Java
" add \t to \u "
-> " \object_for\(\u\).\u += \object_for\(\t\).\t; ";
with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing corresponding Java code (that). This rule uses a helper function "object_for" to decide where, in the target Java program, a global variable from the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating COBOL to Java; you can argue about how sophisticated it needs to be.) In practice, the way such a rule works is that the pattern ASTs of each side are constructed, and then the patterns are matched against a parsed AST; a match causes the corresponding subtree to be spliced into place where the match was found. (All this low-level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS.)
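As a toy model of those low-level mechanics (not how any particular PTS actually implements them): pattern variables like \t and \u bind to whole subtrees during matching, and the instantiated right-hand side is then spliced in where the match occurred.

    #include <cstddef>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Toy model of match-and-splice; not any real PTS's implementation.
    struct Tree {
        std::string label;                             // operator or token text
        bool isPatternVar = false;                     // true for \t, \u, ...
        std::vector<std::shared_ptr<Tree>> kids;
    };
    using TreePtr = std::shared_ptr<Tree>;
    using Bindings = std::map<std::string, TreePtr>;

    // Match 'pattern' against 'subject'; pattern variables capture whole subtrees.
    bool match(const TreePtr& pattern, const TreePtr& subject, Bindings& b) {
        if (pattern->isPatternVar) { b[pattern->label] = subject; return true; }
        if (pattern->label != subject->label ||
            pattern->kids.size() != subject->kids.size()) return false;
        for (std::size_t i = 0; i < pattern->kids.size(); ++i)
            if (!match(pattern->kids[i], subject->kids[i], b)) return false;
        return true;
    }

    // Copy the right-hand-side pattern, substituting bound subtrees for variables;
    // the caller then splices the result into the parse tree where the match was.
    TreePtr instantiate(const TreePtr& rhs, const Bindings& b) {
        if (rhs->isPatternVar) return b.at(rhs->label);
        auto copy = std::make_shared<Tree>();
        copy->label = rhs->label;
        for (const auto& k : rhs->kids) copy->kids.push_back(instantiate(k, b));
        return copy;
    }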
In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
Many PTSs are purely based on source-to-source surface-syntax rewrites. We have found that combining both a PTS and a procedural API, and making it possible to segue between them, makes for a very nice tool: you can use the rewrites where convenient, and the procedural APIs where they don't work so well (the "object_for" function suggested above is easier to code as a procedure).
See a lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is coded in DMS style) in a language-agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API, as OP requested, with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and to code only as little as necessary against the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.
Don't underestimate the amount of effort it takes to build such translators, even with a lot of good technical machinery as a foundation. Languages tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.
I have been using the ROSE compiler framework to write a source-to-source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformation and analysis on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.
If ROSE does not support parsing your input language, you can write your own parser while utilizing ROSE's SageBuilder API to build the AST. If your target language is one of the languages which ROSE supports, then you can rely on ROSE's unparser to get the target code. But if ROSE does not support your target language, then you can write your own unparser as well using different AST traversal mechanism provided by ROSE.
I want to use some tools (free is better) or languages to help me do the following two tasks:
Task 1:
1. Read the specification file (a text file) the user gives as input. The format of the specification file is designed by me, and the user must follow it.
2. Use the specification input to generate an AST (abstract syntax tree).
3. Transform the AST into another AST by applying some optimization techniques such as loop optimization, blocking, or any other optimization I want. (Optional step.)
4. Export the transformed AST to a source code file (C program file).
Task 2:
1. Read a source code file (C program file) and generate an AST to represent it.
2. Transform the AST into another AST by applying some optimization techniques such as loop optimization, blocking, or any other optimization I want. (Some optimizations I can parameterize, such as the loop unroll depth; a rough sketch of the pipeline I have in mind follows this task list.)
3. Export the transformed AST to another optimized source file (C program file).
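To make the two tasks concrete, the rough shape of the pipeline I have in mind is something like the following (all type and function names are placeholders, not taken from any specific tool):

    #include <memory>
    #include <string>

    // Placeholder sketch only; none of these names come from a real tool.
    struct Ast;                                       // whatever tree the tool builds

    // One AST-to-AST transformation; some passes carry parameters.
    struct Pass {
        virtual ~Pass() = default;
        virtual std::unique_ptr<Ast> run(std::unique_ptr<Ast> in) = 0;
    };

    struct LoopUnrollPass : Pass {
        int unrollDepth;                              // the parameter from Task 2
        explicit LoopUnrollPass(int depth) : unrollDepth(depth) {}
        std::unique_ptr<Ast> run(std::unique_ptr<Ast> in) override;
    };

    // Task 1: my spec file -> AST -> (optional passes) -> C source file.
    std::unique_ptr<Ast> parseSpec(const std::string& specPath);
    // Task 2: C source file -> AST -> passes -> optimized C source file.
    std::unique_ptr<Ast> parseC(const std::string& cPath);
    void emitC(const Ast& ast, const std::string& outPath);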
What OP wants in general is a program transformation system (PTS). PTS are generally capable of accepting an arbitrary syntax specification, building a parser producing ASTs from that syntax, applying source-to-source transforms to map the parsed AST to other ASTs, and then regenerating source text from the final AST.
A specific issue for OP is parsing/unparsing C source code. Almost none of the PTSs available do this for production C code (ANSI, GCC, MSStudio) and it is quite a lot of work to get this right. Nor do they provide auxiliary analyses which are needed to do interesting transformations, such as symbol tables, control or data flow analysis.
To my knowledge, only our DMS Software Reengineering Toolkit, and Rose Compiler, have specific support like this for C.
Rose, however, isn't designed to accept a DSL; it violates the PTS model by not allowing arbitrary syntax definitions. Instead, it uses the EDG parser front end (I think this means it also accepts C++14). But it can't handle OP's first request easily. Rose also does "source-to-source" transformations, but does so by hand-written procedural code that crawls the AST. It is focused on scientific computing, so they have done specific work on blocking loops, etc.
DMS is designed to accept arbitrary grammars (and handles C as well as C++14), and in fact can handle more than one at the same time, so it will support OP's first task directly. DMS does surface-syntax (written using C syntax directly) source-to-source rewrites as well as procedural ones. It has not been used for loop blocking, but DMS has been used to build vector extensions of C++ with code generation for SIMD instructions including appropriate loop optimizations.
The POET (Parameterized Optimization for Empirical Tuning, http://www.cs.uccs.edu/~qyi/poet) script language is one candidate. Are there any other tools or languages?
I want to write a translator from a DSL to Java using ANTLR. So, I wrote the lexer and the parser using two different grammars. Now I have to write the tree grammar and I want to know which are the best practices (or recommended practices) for obtaining my result. More exactly, I would like to know which are the best ways to do stuff like: enrich the tree with attributes (for example, adding types) and optimizations.
Should I write different tree grammars for identifying types and for optimizations, and then call them serially after the parser and before the final code-generation tree grammar? Is there another way which is easier to maintain? I also thought about manually walking the tree generated by the parser in order to identify the types, but this is quite hard to maintain.
Thank you.
There are no real best practices: just common sense and personal preference.
However, it is more logical to handle the adding of certain attributes to nodes and the optimization actions (e.g. rewriting ^(* 0 ^(...)) to 0) in separate passes over the AST. Don't worry too much about performance: tree walking is pretty fast, and most of the time is usually spent during parsing. And with ANTLR 3.2's addition of tree pattern matching, you can write pretty small tree grammars to perform a very specific operation on your AST (easy to maintain!).
Also see this previous Q&A that is about manually walking the AST or using a tree grammar for it:
Systematic way to generate ANTLR tree grammar?
I've got a BNF and an EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea of how to use the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources to convert an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't go over converting them into a recursive-descent parser. We did go over converting BNFs into a recursive-descent parser.
The reason I'm asking is because the EBNF is more compact.
From looking at EBNFs in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive-descent parsers. How they do it is exactly the answer to your question.
(It's pretty straightforward, but it is good to understand the details.)
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too and the output of both! There's an astonishing moment that you come to if you go build one of these, where you realize how the metacompiler compiles itself using its own grammar. This moment got me hooked on compilers back in about 1970 when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system, based on these metacompilers) has a great online MetaII tutorial, for those of you who don't want the do-it-from-scratch experience. (I have nothing to do with this except that Neighbors and I were undergraduates together.)
Both ways are a fine way to learn about metacompilers and generating parsers from EBNF.
The key ideas are these: the left-hand side of a rule creates a function that parses that nonterminal; it returns true and advances the input stream if there is a match, and returns false without advancing the input stream if there is not. The contents of the function are determined by the right-hand side. Literal tokens are matched directly. Nonterminals cause calls to the functions generated for the other rules. Kleene* maps to while loops, and alternations map to conditional branches. What EBNF doesn't address, and the metacompilers do, is how parsing does anything other than saying "matched" or not. The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
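As a sketch of what a metacompiler emits (hand-written C++ here, with a simplified token stream standing in for the real input machinery), a rule like expr = term { "+" term } becomes:

    #include <cctype>
    #include <string>
    #include <vector>

    // Hand-written C++ standing in for generated code; the token stream and
    // output operation are simplified stand-ins.
    static std::vector<std::string> tokens;            // the input token stream
    static std::size_t pos = 0;                        // current input position

    bool matchLiteral(const std::string& lit) {        // literal tokens matched directly
        if (pos < tokens.size() && tokens[pos] == lit) { ++pos; return true; }
        return false;
    }

    bool parseTerm();                                  // function generated for 'term'

    // expr = term { "+" term }   -- the { } (zero or more) becomes a while loop
    bool parseExpr() {
        if (!parseTerm()) return false;                // no match, input not advanced
        while (true) {
            std::size_t save = pos;
            if (matchLiteral("+") && parseTerm()) {
                // a MetaII-style output operation would be woven in here,
                // e.g. emit("ADD");
                continue;
            }
            pos = save;                                // undo the partial match
            break;
        }
        return true;                                   // matched, input advanced
    }

    bool parseTerm() {                                 // trivial stand-in: a number token
        if (pos < tokens.size() && !tokens[pos].empty() &&
            std::isdigit(static_cast<unsigned char>(tokens[pos][0]))) { ++pos; return true; }
        return false;
    }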
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use the English... "zero or more" is a simple while loop as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
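In the same sketchy style (a tiny character-level example, all names invented), the other two constructs look like this:

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Tiny character-level sketch; 'input' and 'p' stand in for real parser state.
    static std::string input;
    static std::size_t p = 0;

    bool parseDigit() {                   // stand-in for some nonterminal
        if (p < input.size() &&
            std::isdigit(static_cast<unsigned char>(input[p]))) { ++p; return true; }
        return false;
    }

    // "one or more" digits:  digit { digit }  -- one required, then zero or more
    bool parseNumber() {
        if (!parseDigit()) return false;
        while (parseDigit()) {}
        return true;
    }

    // "zero or one" sign:  [ "-" ]  -- a simple if; the rule always succeeds
    bool parseOptionalSign() {
        if (p < input.size() && input[p] == '-') ++p;
        return true;
    }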
The early metacompilers META II and TREEMETA and their kin are not exactly recursive-descent parsers. They were stated as using recursive functions; that just meant the functions could call themselves.
We do not call C a recursive language. A C or C++ function is recursive in the same way the early metacompilers are recursive.
Recursion can be used, but they were programming languages. Recursion is generally used only when analyzing nested language constructs, for example parenthesized expressions and nested blocks.
They are more of an LR/recursive-descent combination. CWIC, the last documented one, has extensive backtracking and lookahead features. The '-' (not) operator can be applied to any language construct and inverts its success or failure: -term fails if a term is matched, for example, and the input is never advanced. The '?' operator looks ahead and can be applied to any language construct: ?expr, for example, would try to parse an expr. The construct matched by the lookahead '?' is not kept, nor is the input advanced.
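A rough C++ sketch of how the '?' lookahead and '-' negation behave (save the input position, attempt the parse, then restore it; invented names):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Rough sketch only; 'tokens' and 'pos' stand in for whatever parser state exists.
    static std::vector<std::string> tokens;
    static std::size_t pos = 0;

    // '?' : try the construct, report success or failure, never consume input.
    bool lookahead(const std::function<bool()>& parse) {
        std::size_t save = pos;
        bool ok = parse();
        pos = save;                   // matched construct is not kept, input not advanced
        return ok;
    }

    // '-' : succeeds exactly when the construct does NOT match; never consumes input.
    bool notMatch(const std::function<bool()>& parse) {
        return !lookahead(parse);
    }

    bool parseTerm() {                // trivial stand-in nonterminal
        if (pos < tokens.size() && tokens[pos] == "term") { ++pos; return true; }
        return false;
    }

    // "-term" fails if a term is matched, and the input never advances:
    bool notTerm() { return notMatch(parseTerm); }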