Writing a TemplateLanguage/VewEngine - antlr

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?

The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.

Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.

There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.

Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.

Related

How does `Yacc` identifies function calls?

I am trying to figure out how yacc identifies function calls in a C code. For Example: if there is a function call like my_fun(a,b); then which rules does this statement reduces to.
I am using the cGrammar present in : C Grammar
Following the Grammar given over there manually; I figured out that we only have two choices in translation unit. Everything has to either be a function definition or a declaration. Now all declaration starts type_specifiers, storage_class_specifier etc but none of them starts with IDENTIFIER
Now in case of a function call the name would be IDENTIFIER. This leaves me unclear as to how it will be parsed and which rules will be used exactly?
According to the official yacc specification specified here yacc, everything is handled by user given routines. When you have a function call the name of course is IDENTIFIER.It is parsed using the user defined procedures.According to the specifications, the user can specify his input in terms of individual input characters, or in terms of higher level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.
Do have a look.By the way you are supposed to do a thorough research before putting questions here.

Derive string from const enum

I have the following in my constants file:
typedef enum
{
AnimalTypeBear,
AnimalTypeCamel,
AnimalTypeCow,
AnimalTypeCount
}
AnimalType;
If I declare an AnimalType variable somewhere in my code like following and set it to AnimalTypeBear:
AnimalType animalType = 0;
Is there away to somehow derive the string "Bear" from that animalType variable or just in general to access the string of its corresponding constant type (in this case AnimalTypeBear).
Enums are constant expressions like #define. Enums at compile time will be "translated" into the code as constants (while #define will be evaluated before compilation). So basically it is not possible to reference the enum string in this way.
As suggested by others you can use a string array.
You cannot do this without code in (Objective-)C. If you want to be able to use actual enumeration literals as strings in your code, or during I/O, with language support then you need to use a language with enumeration type support such as Pascal or Ada.
If you are keen to have this and don't mind work as long as it is reusable then you need to learn about reading the symbol tables structures from a binary and make sure that the information is not stripped from your application. You'll see the debugger can show the correct literals, also if you use Xcode's "Product > Generate Output > Assembly File" menu item you'll see the literals are in there as strings. It will be a lot of work for you, but would be reusable once done.
After that give up and write some code - a simple static array of labels and an index operation. Yes, it's a maintenance headache if you ever change your enumeration.
Alternatively you can write some different code, say in Ruby... Xcode supports adding your own file "types" and running scripts to (pre-)process them. So you could define, say, a file type ".enum" and use a Ruby script to convert that into a C enumeration definition and code to provide the strings. Apple has examples of using Ruby to pre-process files in this way. Once you have your script Xcode will do the rest, on each compilation it will run your script to convert your ".enum" into ".m" (or ".c") and the compile the result. This approach is usually best though for files which contain only one thing, e.g. localised string file processing, you don't usually write your enum declarations in their own files.

Generate java classes from DSL grammar file

I'm looking for a way to generate a parser from a grammar file (BNF/BNF-like) that will populate an AST. However, I'm also looking to automatically generate the various AST classes in a way that is developer-readable.
Example:
For the following grammar file
expressions = expression+;
expression = CONST | math_expression;
math_expression = add_expression | substract_expression;
add_expression = expression PLUS expression;
substract_expression = expression MINUS expression;
CONST: ('0'..'9')+;
PLUS: '+';
MINUS: '-';
I would like to have the following Java classes generated (with example of what I expect their fields to be):
class Expressions {List<Expression> expression};
class Expression {String const; MathExpression mathExpression;} //only one should be filled.
class MathExpression {AddExpression addExpression; SubstractExpression substractExpression;}
class AddExpression {Expression expression1; Expression expression2;}
class SubstractExpression {Expression expression1; Expression expression2;}
And, in runtime, I would like the expression "1+1-2" to generate the following object graph to represent the AST:
Expressions(Expression(MathExpression(AddExpression(1, SubstractExpression(1, 2)))))
(never mind operator precedence).
I've been exploring DSL parser generators (JavaCC/ANTLR and friends) and the closest thing I could find was using ANTLR to generate a listener class with "enterExpression" and "leaveExpression" style methods. I found a somewhat similar code generated using JavaCC and jjtree using "multi" - but it's extremely awkward and difficult to use.
My grammar needs are somewhat simple - and I would like to automate the AST object graph creation as much as possible.
Any hints?
If you want a lot of support for DSL construction, ANTLR and JavaCC probably aren't the way to go. They provide parsing, some support of building trees... and after that you're on your own. But, as you've figured out, its a lot of work to design your own trees, work out the details, and you're hardly done with the DSL at that point; you still can't use it.
There are more complete solutions out there: JetBrains MPS, Xtext, Spoofax, DMS. All of them provide ways to define a DSL, convert it to an internal form ("build trees"), and provide support for code generation. The first three have integrated IDE support and are intended for "small" DSLs; DMS does not, but handles real languages like C++ as well as DSLs. I think the first three are open source; DMS is commercial (I'm the party behind DMS).
Markus Voelter has just released an online book on DSL Engineering, available for your idea of a donation. He goes into great detail on MPS, XText, Spoofax but none on DMS. He tells you what you need to know and what you need to do; based on my skim of the book, it is pretty extensive. You probably are not going to get off on "simple"; DSLs have a lot of semantic complexity and the supporting machinery is difficult.
I know the author, have huge respect for his skills in this arena, and have co-lectured at technical summer skills with him including having some nice beer. Otherwise I have nothing to do this book.

What are the steps I need to do to complete this programming assignment?

I'm having a hard time understanding what I'm supposed to do. The only thing I've figured out is I need to use yacc on the cminus.y file. I'm totally confused about everything after that. Can someone explain this to me differently so that I can understand what I need to do?
INTRODUCTION:
We will use lex/flex and yacc/Bison to generate an LALR parser. I will give you a file called cminus.y. This is a yacc format grammar file for a simple C-like language called C-minus, from the book Compiler Construction by Kenneth C. Louden. I think the grammar should be fairly obvious.
The Yahoo group has links to several descriptions of how to use yacc. Now that you know flex it should be fairly easy to learn yacc. The only base type is int. An int is 4 bytes. Booleans are handled as ints, as in C. (Actually the grammar allows you to declare a variable as a type void, but let's not do that.) You can have one-dimensional arrays.
There are no pointers, but references to array elements should be treated as pointers (as in C).
The language provides for assignment, IF-ELSE, WHILE, and function calls and returns.
We want our compiler to output MIPS assembly code, and then we will be able to run it on SPIM. For a simple compiler like this with no optimization, an IR should not be necessary. We can output assembly code directly in one pass. However, our first step is to generate a symbol table.
SYMBOL TABLE:
I like Dr. Barrett’s approach here, which uses a lot of pointers to handle objects of different types. In essence the elements of the symbol table are identifier, type and pointer to an attribute object. The structure of the attribute object will differ according to the type. We only have a small number of types to deal with. I suggest using a linear search to find symbols in the table, at least to start. You can change it to hashing later if you want better performance. (If you want to keep in C, you can do dynamic allocation of objects using malloc.)
First you need to make a list of all the different types of symbols that there are—there are not many—and what attributes would be necessary for each. Be sure to allow for new attributes to be added, because we
have not covered all the issues yet. Looking at the grammar, the question of parameter lists for functions is a place where some thought needs to be put into the design. I suggest more symbol table entries and pointers.
TESTING:
The grammar is correct, so taking the existing grammar as it is and generating a parser, the parser will accept a correct C-minus program but it won’t produce any output, because there are no code snippets associated with the rules.
We want to add code snippets to build the symbol table and print information as it does so.
When an identifier is declared, you should print the information being entered into the symbol table. If a previous declaration of the same symbol in the same scope is found, an error message should be printed.
When an identifier is referenced, you should look it up in the table to make sure it is there. An error message should be printed if it has not been declared in the current scope.
When closing a scope, warnings should be generated for unreferenced identifiers.
Your test input should be a correctly formed C-minus program, but at this point nothing much will happen on most of the production rules.
SCOPING:
The most basic approach has a global scope and a scope for each function declared.
The language allows declarations within any compound statement, i.e. scope nesting. Implementing this will require some kind of scope numbering or stacking scheme. (Stacking works best for a one-pass
compiler, which is what we are building.)
(disclaimer) I don't have much experience with compiler classes (as in school courses on compilers) but here's what I understand:
1) You need to use the tools mentioned to create a parser which, when given input will tell the user if the input is a correct program as to the grammar defined in cminus.y. I've never used yacc/bison so I don't know how it is done, but this is what seems to be done:
(input) file-of-some-sort which represents output to be parsed
(output) reply-of-some-sort which tells if the (input) is correct with respect to the provided grammar.
2) It also seems that the output needs to check for variable consistency (ie, you can't use a variable you haven't declared same as any programming language), which is done via a symbol table. In short, every time something is declared you add it to the symbol table. When you encounter an identifier, if it is not one of the language identifiers (like if or while or for), you'll look it up in the symbol table to determine if it has been declared. If it is there, go on. If it's not - print some-sort-of-error
Note: point(2) there is a simplified take on a symbol table; in reality there's more to them than I just wrote but that should get you started.
I'd start with yacc examples - see what yacc can do and how it does it. I guess there must be some big example-complete-with-symbol-table out there which you can read to understand further.
Example:
Let's take input A:
int main()
{
int a;
a = 5;
return 0;
}
And input B:
int main()
{
int a;
b = 5;
return 0;
}
and assume we're using C syntax for parsing. Your parser should deem Input A all right, but should yell "b is undeclared" for Input B.

Write a compiler for a language that looks ahead and multiple files?

In my language I can use a class variable in my method when the definition appears below the method. It can also call methods below my method and etc. There are no 'headers'. Take this C# example.
class A
{
public void callMethods() { print(); B b; b.notYetSeen();
public void print() { Console.Write("v = {0}", v); }
int v=9;
}
class B
{
public void notYetSeen() { Console.Write("notYetSeen()\n"); }
}
How should I compile that? what i was thinking is:
pass1: convert everything to an AST
pass2: go through all classes and build a list of define classes/variable/etc
pass3: go through code and check if there's any errors such as undefined variable, wrong use etc and create my output
But it seems like for this to work I have to do pass 1 and 2 for ALL files before doing pass3. Also it feels like a lot of work to do until I find a syntax error (other than the obvious that can be done at parse time such as forgetting to close a brace or writing 0xLETTERS instead of a hex value). My gut says there is some other way.
Note: I am using bison/flex to generate my compiler.
My understanding of languages that handle forward references is that they typically just use the first pass to build a list of valid names. Something along the lines of just putting an entry in a table (without filling out the definition) so you have something to point to later when you do your real pass to generate the definitions.
If you try to actually build full definitions as you go, you would end up having to rescan repatedly, each time saving any references to undefined things until the next pass. Even that would fail if there are circular references.
I would go through on pass one and collect all of your class/method/field names and types, ignoring the method bodies. Then in pass two check the method bodies only.
I don't know that there can be any other way than traversing all the files in the source.
I think that you can get it down to two passes - on the first pass, build the AST and whenever you find a variable name, add it to a list that contains that blocks' symbols (it would probably be useful to add that list to the corresponding scope in the tree). Step two is to linearly traverse the tree and make sure that each symbol used references a symbol in that scope or a scope above it.
My description is oversimplified but the basic answer is -- lookahead requires at least two passes.
The usual approach is to save B as "unknown". It's probably some kind of type (because of the place where you encountered it). So you can just reserve the memory (a pointer) for it even though you have no idea what it really is.
For the method call, you can't do much. In a dynamic language, you'd just save the name of the method somewhere and check whether it exists at runtime. In a static language, you can save it in under "unknown methods" somewhere in your compiler along with the unknown type B. Since method calls eventually translate to a memory address, you can again reserve the memory.
Then, when you encounter B and the method, you can clear up your unknowns. Since you know a bit about them, you can say whether they behave like they should or if the first usage is now a syntax error.
So you don't have to read all files twice but it surely makes things more simple.
Alternatively, you can generate these header files as you encounter the sources and save them somewhere where you can find them again. This way, you can speed up the compilation (since you won't have to consider unchanged files in the next compilation run).
Lastly, if you write a new language, you shouldn't use bison and flex anymore. There are much better tools by now. ANTLR, for example, can produce a parser that can recover after an error, so you can still parse the whole file. Or check this Wikipedia article for more options.