Generate classes from grammar rules, objects on parse - objective-c

Is it possible to generate .m and .h's for any grammar/ rules so that during parsing it creates an object that represents that rule.
So some grammar
coolObjName = Word;
could generate a class that is named coolObjName (or some variation) and has a field for the word, and generates the action:
coolObjName = Word{
CoolObjName* newName = [[CoolObjName alloc] initWithWord:POP_STR()];
PUSH(newName);
};
Then a higher level rule such as:
myhigherlevel = coolObjName Number;
would create a myHigherLevel class that has a coolObjName member and a number, which then adds the action:
myhigherlevel = coolObjName Number{
double num = POP_DOUBLE();
coolObjName* name = POP();
MyHigherLevel* higherLevel = [[MyHigherLevel alloc] init];
higherLevel.number = num;
higherLevel.name = name;
PUSH(higherLevel);
};
Empty tags turn to empty objects and * and + result in arrays.
Is there a tool that can do this or where would I go to create such. (seems super useful and awesome)

Creator of PEGKit here.
There's nothing currently in PEGKit which will inspect your rules and auto-generate ObjC AST or model classes which might match your intent. For now, PEGKit can only produce source code for your parser, but not your related AST or model classes.
It is probably very likely that you will be building an Abstract Syntax Tree (or Intermediate Representation) in your Actions or Parser Delegate Callbacks. ANLTR has some wonderful high-level Tree Rewriting features which allow you to specify in the grammar how to build your AST. Although looking at the docs it seems this might have changed significantly since ANTLR 3? Not sure. It used to look like something like this (maybe it still does, I'm not an ANTLR expert):
addExpr : lhs=NUM op='+' rhs=NUM -> ^(op lhs rhs);
The -> means rewrite to tree like…
And the ^(op …) means tree rooted by a plus operator containing children….
I would love to add tree rewriting features like this to PEGKit, for the express purpose of building ASTs (but not other model objects). But honestly, it's a huge task, and it is not likely to appear soon.

Related

How should I implement Obj C-like headers in Lua?

Now I have an import(a) function, that in short words dofile's header in .framework like this:
import("<Kakao/KARect>") => dofile("/System/Library/Frameworks/Kakao.framework/Headers/KARect.lua")
And in KARect.lua for example I have:
KARect = {}
function KARect:new(_x, _y, _width, _height, _colorBack)
local new = setmetatable({}, {__index = self})
new.id = KAEntities:generateID()
...
return new
end
function KARect:draw()
...
end
After some time I thought about reworking this system and making "headers" work like typical Lua modules with advanced require() so function will do e.g.:
import("<Kakao/KARect>") => package.path = "/System/Library/Frameworks/Kakao.framework/Headers/?.lua"; KARect = require("KARect")
and file will contain:
local KARect = {}
...
return KARect
Because headers should not contain anything but only classes with their names? I'm getting confused when thinking about it, as I never used Obj C :s
I never used Obj C
Then why are you trying to implement its headers in a language, that does not use headers at all?
Header! What is a header?
Header files in C-like languages store more than just a name. They store constants and macro commands, function and class method argument and return types, structure and class fields. In essence, the contents of the header file are forward declarations. They came into existence due to the need to perform the same forward-declarations across many files.
I don't know what additional rules and functions were added to header files in Obj-C, but you can get general understanding of what they do in the following links: 1, 2, 3, 4 with the last one being the most spot-on.
Answer to the question present
Lua is dynamically-typed interpreted language. It does not do compile time type checks and, typically, Lua programs can and should be structured in a way that does not need forward declarations across files. So there is no meaningful way for a programmer to create and for lua bytecode generator and interpreter to use header files.
Lua does not have classes at all. The code you've posted is a syntactic sugar for an assignment of a function with a slightly different signature to a table which imitates class:
KARect.new = function( first_arg_is_self, _x, _y, _width, _height, _colorBack)
local new = setmetatable({}, {__index = first_arg_is_self})
return new
end
There is no declarations here, only generation of an anonymous function and its assignment to a field in a table. Other parts of program do not need to know anything about a particular field, variable or function (which is stored in variable) in advance (unlike C).
So, no declaration means nothing to separate from implementation. You of course can first list fields of the class-table and do dummy assignments to them, but, again, Lua will have no use for those. If you want to give hints to humans, it is probably better to write a dedicated manual or put comments in the implementation.
Lua has situations where forward declarations are needed to reference local functions. But this situation does not arise in object oriented code, as all methods are accessed through reference to the object, and by the time first object is created, the class itself is usually fully constructed.

Serialization of ANTLR ParseTree

I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate, lets call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParseRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
return a+b;
}
int mul(int a, int b) {
return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects each method text and store it. The parser starts the process at the methodList rule.
void visitMethodList(MethodListContext ctx) {
if(ctx.methodDeclaration() != null) {
String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
// store methodStr for later parsing
}
// visit next method list item, if any
if(ctx.methodList() != null) {
visit(ctx.methodList());
}
}
The second run launch the parsing of each method declaration (in a separate thread for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
// parse the method block
}
The reason why the text of a methodDeclaration rule is formatted if because calling directly ctx.methodDeclaration().getText() would combine the text of all child nodes AntLR doc, possibly making it unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
StringBuilder builder = new StringBuilder();
for(int i = 0; i < tree.getChildCount(); i ++) {
ParseTree child = tree.getChild(i);
if(child instanceof TerminalNode) {
builder.append(child.getText());
builder.append(separator);
} else if(child instanceof RuleContext) {
builder.append(formatParseTree(child, separator));
}
}
return builder.toString();
}

how to use the generic void pointer of parser tree in ANTLR3

I'm writing a grammar for a simple language using ANTLR3 C target. I want to attach some data
to the AST generated by ANTLR. As the data to attach is small, using the generic void pointer in ANTLR3_BASE_TREE_struct is quite straightforward. The ANTLR3c documentation states that
void * u
Generic void pointer allows the grammar programmer to attach any structure they like to
a tree node, in many cases saving the need to create their own tree and tree adaptors.
Then I wrote the grammar as below(only a piece of grammar is shown here)
datum
#declarations {
parser_data_t* data;
}
#init{
data = HPS_MALLOC(sizeof(parser_data_t));
HPS_RT_ASSERT(data, HPS_ERR_NOMEM);
}
: simpleDatum
{
data->candidate = 0;
$datum.tree->u = data;
}
-> ^(I_DATUM simpleDatum)
| compoundDatum
{
data->candidate = 1;
$datum.tree->u = data;
}
-> ^(I_DATUM compoundDatum)
;
then I generate the lexer/parser, compiled it and linked it to a main function. When I ran the program, a segment fault occurred. I checked the parser generated by ANTLR and found the problem. The assignment $datum.tree->u = data caused the segment fault because
the data structure $datum.tree hadn't been assigned a valid value, so $datum.tree->u is an invalid reference.
What I want is to attach some data to the tree node representing the non-terminal datum, anyone could tell me how to achieve that? I've searched in google and StackOverflow but no answer found.
Thank you!

Generate java classes from DSL grammar file

I'm looking for a way to generate a parser from a grammar file (BNF/BNF-like) that will populate an AST. However, I'm also looking to automatically generate the various AST classes in a way that is developer-readable.
Example:
For the following grammar file
expressions = expression+;
expression = CONST | math_expression;
math_expression = add_expression | substract_expression;
add_expression = expression PLUS expression;
substract_expression = expression MINUS expression;
CONST: ('0'..'9')+;
PLUS: '+';
MINUS: '-';
I would like to have the following Java classes generated (with example of what I expect their fields to be):
class Expressions {List<Expression> expression};
class Expression {String const; MathExpression mathExpression;} //only one should be filled.
class MathExpression {AddExpression addExpression; SubstractExpression substractExpression;}
class AddExpression {Expression expression1; Expression expression2;}
class SubstractExpression {Expression expression1; Expression expression2;}
And, in runtime, I would like the expression "1+1-2" to generate the following object graph to represent the AST:
Expressions(Expression(MathExpression(AddExpression(1, SubstractExpression(1, 2)))))
(never mind operator precedence).
I've been exploring DSL parser generators (JavaCC/ANTLR and friends) and the closest thing I could find was using ANTLR to generate a listener class with "enterExpression" and "leaveExpression" style methods. I found a somewhat similar code generated using JavaCC and jjtree using "multi" - but it's extremely awkward and difficult to use.
My grammar needs are somewhat simple - and I would like to automate the AST object graph creation as much as possible.
Any hints?
If you want a lot of support for DSL construction, ANTLR and JavaCC probably aren't the way to go. They provide parsing, some support of building trees... and after that you're on your own. But, as you've figured out, its a lot of work to design your own trees, work out the details, and you're hardly done with the DSL at that point; you still can't use it.
There are more complete solutions out there: JetBrains MPS, Xtext, Spoofax, DMS. All of them provide ways to define a DSL, convert it to an internal form ("build trees"), and provide support for code generation. The first three have integrated IDE support and are intended for "small" DSLs; DMS does not, but handles real languages like C++ as well as DSLs. I think the first three are open source; DMS is commercial (I'm the party behind DMS).
Markus Voelter has just released an online book on DSL Engineering, available for your idea of a donation. He goes into great detail on MPS, XText, Spoofax but none on DMS. He tells you what you need to know and what you need to do; based on my skim of the book, it is pretty extensive. You probably are not going to get off on "simple"; DSLs have a lot of semantic complexity and the supporting machinery is difficult.
I know the author, have huge respect for his skills in this arena, and have co-lectured at technical summer skills with him including having some nice beer. Otherwise I have nothing to do this book.

Writing a TemplateLanguage/VewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.