Boolean expression parser - lark-parser

I'd like to implement a parser that helps me working on more complex boolean logic expressions for designing digital circuits. But I seem to be dumb.
start:equation
equation: SYMBOL
| expression
expression: nottk
nottk: "!" SYMBOL
and: expression "*" expression
SYMBOL: /[A-Za-z]/
This is my grammar for now. I know that it is incomplete, but I first wanted to get the surrounding code working in principle!
For now I use a Transformer to convert symbols into sympy smbols. I am using sympy for working with the expressions, because I don't fancy implementing operations on logic expressions on my own and sympy has most of everything I need.
class SymbolTransformer(Transformer):
def SYMBOL(self, token: Token):
"""Converts every symbol into asympy symbol"""
return token.update(value=sympy.symbols(f'{token.value}'))
Just for good measure this is my Transformer.
I tried to implement everything as a Transformer, but it wasn't working, since Non-terminals work differently.
So I am asking, How would you change my grammar, or what would you use to implement generating sympy expressions from a Tree.

Related

How does `Yacc` identifies function calls?

I am trying to figure out how yacc identifies function calls in a C code. For Example: if there is a function call like my_fun(a,b); then which rules does this statement reduces to.
I am using the cGrammar present in : C Grammar
Following the Grammar given over there manually; I figured out that we only have two choices in translation unit. Everything has to either be a function definition or a declaration. Now all declaration starts type_specifiers, storage_class_specifier etc but none of them starts with IDENTIFIER
Now in case of a function call the name would be IDENTIFIER. This leaves me unclear as to how it will be parsed and which rules will be used exactly?
According to the official yacc specification specified here yacc, everything is handled by user given routines. When you have a function call the name of course is IDENTIFIER.It is parsed using the user defined procedures.According to the specifications, the user can specify his input in terms of individual input characters, or in terms of higher level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.
Do have a look.By the way you are supposed to do a thorough research before putting questions here.

How are functions defined for mathematical operators?

Since '+', '-', and the rest of the arithmetic operators must be, at heart, just function calls (I think), how are they defined? More specifically, how are they written such that they can look for arguments that precede them and follow them?
For example, the function '*', or multiply, in the expression 7 * 9 must look for the first argument 7 before it's called, and then the 9, it's second argument, which appears to be in the right place.
Most languages (not OCaml) require parentheses around arguments -- how have they gotten around this requirement in languages that do?
It sounds like you're asking about operator overloading, or at least reading into operator overloading will give you a pretty solid understanding on how arithmetic operations are defined in some languages. Here's a link to a nice tutorial on C++ operator overloading:
http://www.tutorialspoint.com/cplusplus/cpp_overloading.htm
In object-oriented world, this is called "operator overloading". It is just a kind of syntactic sugar, that enables you to call a + b, instead of a.+(b), that is actually the same method call.

Is "Implicit token definition in parser rule" something to worry about?

I'm creating my first grammar with ANTLR and ANTLRWorks 2. I have mostly finished the grammar itself (it recognizes the code written in the described language and builds correct parse trees), but I haven't started anything beyond that.
What worries me is that every first occurrence of a token in a parser rule is underlined with a yellow squiggle saying "Implicit token definition in parser rule".
For example, in this rule, the 'var' has that squiggle:
variableDeclaration: 'var' IDENTIFIER ('=' expression)?;
How it looks exactly:
The odd thing is that ANTLR itself doesn't seem to mind these rules (when doing test rig test, I can't see any of these warning in the parser generator output, just something about incorrect Java version being installed on my machine), so it's just ANTLRWorks complaining.
Is it something to worry about or should I ignore these warnings? Should I declare all the tokens explicitly in lexer rules? Most exaples in the official bible The Defintive ANTLR Reference seem to be done exactly the way I write the code.
I highly recommend correcting all instances of this warning in code of any importance.
This warning was created (by me actually) to alert you to situations like the following:
shiftExpr : ID (('<<' | '>>') ID)?;
Since ANTLR 4 encourages action code be written in separate files in the target language instead of embedding them directly in the grammar, it's important to be able to distinguish between << and >>. If tokens were not explicitly created for these operators, they will be assigned arbitrary types and no named constants will be available for referencing them.
This warning also helps avoid the following problematic situations:
A parser rule contains a misspelled token reference. Without the warning, this could lead to silent creation of an additional token that may never be matched.
A parser rule contains an unintentional token reference, such as the following:
number : zero | INTEGER;
zero : '0'; // <-- this implicit definition causes 0 to get its own token
If you're writing lexer grammar which wouldn't be used across multiple parser grammmar(s) then you can ignore this warning shown by ANTLRWorks2.

Generate java classes from DSL grammar file

I'm looking for a way to generate a parser from a grammar file (BNF/BNF-like) that will populate an AST. However, I'm also looking to automatically generate the various AST classes in a way that is developer-readable.
Example:
For the following grammar file
expressions = expression+;
expression = CONST | math_expression;
math_expression = add_expression | substract_expression;
add_expression = expression PLUS expression;
substract_expression = expression MINUS expression;
CONST: ('0'..'9')+;
PLUS: '+';
MINUS: '-';
I would like to have the following Java classes generated (with example of what I expect their fields to be):
class Expressions {List<Expression> expression};
class Expression {String const; MathExpression mathExpression;} //only one should be filled.
class MathExpression {AddExpression addExpression; SubstractExpression substractExpression;}
class AddExpression {Expression expression1; Expression expression2;}
class SubstractExpression {Expression expression1; Expression expression2;}
And, in runtime, I would like the expression "1+1-2" to generate the following object graph to represent the AST:
Expressions(Expression(MathExpression(AddExpression(1, SubstractExpression(1, 2)))))
(never mind operator precedence).
I've been exploring DSL parser generators (JavaCC/ANTLR and friends) and the closest thing I could find was using ANTLR to generate a listener class with "enterExpression" and "leaveExpression" style methods. I found a somewhat similar code generated using JavaCC and jjtree using "multi" - but it's extremely awkward and difficult to use.
My grammar needs are somewhat simple - and I would like to automate the AST object graph creation as much as possible.
Any hints?
If you want a lot of support for DSL construction, ANTLR and JavaCC probably aren't the way to go. They provide parsing, some support of building trees... and after that you're on your own. But, as you've figured out, its a lot of work to design your own trees, work out the details, and you're hardly done with the DSL at that point; you still can't use it.
There are more complete solutions out there: JetBrains MPS, Xtext, Spoofax, DMS. All of them provide ways to define a DSL, convert it to an internal form ("build trees"), and provide support for code generation. The first three have integrated IDE support and are intended for "small" DSLs; DMS does not, but handles real languages like C++ as well as DSLs. I think the first three are open source; DMS is commercial (I'm the party behind DMS).
Markus Voelter has just released an online book on DSL Engineering, available for your idea of a donation. He goes into great detail on MPS, XText, Spoofax but none on DMS. He tells you what you need to know and what you need to do; based on my skim of the book, it is pretty extensive. You probably are not going to get off on "simple"; DSLs have a lot of semantic complexity and the supporting machinery is difficult.
I know the author, have huge respect for his skills in this arena, and have co-lectured at technical summer skills with him including having some nice beer. Otherwise I have nothing to do this book.

Writing a TemplateLanguage/VewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.