ANTLR4 Arithmetic Grammar Is Ignoring Order of Precedence (PEMDAS)

Grammar Definition
As I understand it, ANTLR4 supports left recursion in a way that respects the order of precedence for arithmetic. With that said, here's the grammar:
grammar Arithmetic;
arithmetic: arithmeticExpression;
arithmeticExpression:
LPARAN inner = arithmeticExpression RPARAN # Parentheses
| left = arithmeticExpression POW right = arithmeticExpression # Power
| left = arithmeticExpression MUL right = arithmeticExpression # Multiplication
| left = arithmeticExpression DIV right = arithmeticExpression # Division
| left = arithmeticExpression ADD right = arithmeticExpression # Addition
| left = arithmeticExpression SUB right = arithmeticExpression # Subtraction
| arithmeticExpressionInput # ArithmeticInput;
arithmeticExpressionInput: NUMBER;
number: NUMBER;
/* Operators */
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
/* Data Types */
NUMBER: '-'? [0-9]+;
/* Whitespace & End of Lines */
EOL: '\r'? '\n';
WS: [ \t]+ -> channel(HIDDEN);
Note: I've simplified the grammar for testing.
Input
5 + 21 / 7 * 3
Output Parse Tree
Problem
In the output parse tree, starting at the arithmetic rule, you can see that the order of precedence does not follow PEMDAS even though it is defined via left recursion in the grammar. The same behaviour shows up when debugging the visitor code generated by ANTLR, where the function being called is VisitAddition.
I've googled this and I can't see what I'm doing wrong compared to the examples I found, as they all look the same.
Environment
ANTLR Version: 4.11.1
Build Target: CSharp
.NET Project Packages:
Antlr4BuildTasks@11.1.0
Antlr4.Runtime.Standard@4.11.1

As mentioned in the comments by @sepp2k (thank you).
The issue is in the grammar: multiplication, division, addition and subtraction were each written as a separate alternative, and ANTLR4 gives alternatives listed earlier a higher precedence. That essentially creates P, E, M, D, A, S as six distinct precedence levels when it should be PE(MD)(AS).
Here's an example of the fixed grammar:
arithmeticExpression:
arithmeticExpressionInput # ArithmeticInput
| LPARAN inner = arithmeticExpression RPARAN # Parentheses
| left = arithmeticExpression operator = POW right = arithmeticExpression # Power
| left = arithmeticExpression operator = (MUL|DIV) right = arithmeticExpression # MultiplicationOrDivision
| left = arithmeticExpression operator = (ADD|SUB) right = arithmeticExpression # AdditionOrSubtraction;
With this change the output parse tree is much cleaner.
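To see why the grouping matters, here is a minimal sketch of a hand-written precedence-climbing evaluator. It is not ANTLR and not the generated parser; the names `evaluate`, `SEPARATE` and `GROUPED` are made up for illustration, and every operator is treated as left-associative. It only models the effect of giving each operator its own precedence level versus sharing a level between `*` and `/` and between `+` and `-`.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "^": operator.pow}

def evaluate(text, levels):
    """Evaluate text; levels lists operator groups from loosest to tightest binding."""
    tokens = re.findall(r"\d+|[-+*/^()]", text)
    prec = {op: i for i, group in enumerate(levels) for op in group}
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr(0)
            pos += 1  # skip the closing ")"
            return value
        return int(tok)

    def expr(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in prec and prec[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            # Parse the right operand with strictly tighter operators only,
            # which makes operators on the same level left-associative.
            right = expr(prec[op] + 1)
            left = OPS[op](left, right)
        return left

    return expr(0)

# One level per operator, mirroring the alternative order of the original grammar
# (POW tightest, then MUL, DIV, ADD, SUB).
SEPARATE = [["-"], ["+"], ["/"], ["*"], ["^"]]
# Shared levels, mirroring the fixed grammar.
GROUPED = [["+", "-"], ["*", "/"], ["^"]]

print(evaluate("5 + 21 / 7 * 3", SEPARATE))  # 21 / (7 * 3) is evaluated first
print(evaluate("5 + 21 / 7 * 3", GROUPED))   # (21 / 7) * 3 is evaluated first
```

With separate levels, `*` binds tighter than `/`, so the input is read as 5 + 21 / (7 * 3); with shared levels it is read as 5 + (21 / 7) * 3.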

Related

Is it possible to do pattern binding a la Haskell in Idris?

An example would be:
fib: Stream Integer
fib@(1::tfib) = 1 :: 1 :: [ a+b | (a,b) <- zip fib tfib]
But this generates the error:
50 | fib@(1::tfib) = 1 :: 1 :: [ a+b | (a,b) <- zip fib tfib]
| ^
unexpected "@(1::tfib)"
expecting "<==", "using", "with", ':', argument expression, constraint argument, expression, function right hand side, implementation
block, implicit function argument, or with pattern
This doesn't look promising given that it doesn't recognize @ at the likely position.
Note that the related concept of as-patterns works the same in Haskell and Idris:
growHead : List a -> List a
growHead nnl@(x::_) = x::nnl
growHead ([]) = []

Constructing a linear grammar for the language

I find it difficult to construct grammars for languages, especially linear grammars.
Can anyone please give me some basic tips or a methodology for constructing a grammar for any language? Thanks in advance.
I am unsure whether the answer to this question, "Construct a linear grammar for the language", is right:
L ={a^n b c^n | n belongs to Natural numbers}
Solution:
Right-Linear Grammar :
S--> aS | bA
A--> cA | ^
Left-Linear Grammar:
S--> Sc | Ab
A--> Aa | ^
As pointed out in the comments, these grammars are wrong since they generate strings not in the language. Here's a derivation of abcc in both grammars:
S -> aS -> abA -> abcA -> abccA -> abcc
S -> Sc -> Scc -> Abcc -> Aabcc -> abcc
Also as pointed out in the comments, there is a simple linear grammar for this language, where a linear grammar is defined as having at most one nonterminal symbol in the RHS of any production:
S -> aSc | b
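As a quick sanity check, the following sketch mirrors the production S -> aSc | b with a recursive recognizer and compares it with the set of strings the proposed right-linear grammar generates, which is a*bc*. The function names are hypothetical and chosen only for this example.

```python
import re

def derive(n):
    """Apply S -> aSc n times, then finish with S -> b."""
    sentential = "S"
    for _ in range(n):
        sentential = sentential.replace("S", "aSc")
    return sentential.replace("S", "b")

def in_language(s):
    """Recognizer that mirrors S -> aSc | b directly."""
    if s == "b":
        return True
    return len(s) >= 3 and s[0] == "a" and s[-1] == "c" and in_language(s[1:-1])

def right_linear_accepts(s):
    """S -> aS | bA, A -> cA | ^ generates exactly the regular language a*bc*."""
    return re.fullmatch(r"a*bc*", s) is not None

print(derive(2))                     # a string with matching counts of a and c
print(in_language("abcc"))           # rejected by the linear grammar
print(right_linear_accepts("abcc"))  # accepted by the incorrect right-linear grammar
```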
There are some general rules for constructing grammars for languages. These are either obvious simple rules or rules derived from closure properties and the way grammars work. For instance:
1. If L = {a} for an alphabet symbol a, then S -> a is a grammar for L.
2. If L = {e} for the empty string e, then S -> e is a grammar for L.
3. If L = R U T for languages R and T, then S -> S' | S'' along with the grammars for R and T are a grammar for L if S' is the start symbol of the grammar for R and S'' is the start symbol of the grammar for T.
4. If L = RT for languages R and T, then S -> S'S'' along with the grammars for R and T are a grammar for L if S' is the start symbol of the grammar for R and S'' is the start symbol of the grammar for T.
5. If L = R* for language R, then S -> S'S | e along with the grammar for R is a grammar for L if S' is the start symbol of the grammar for R.
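The union, concatenation and star constructions can be sketched as small combinators over grammars. This is only an illustration under stated assumptions: a grammar is represented as a pair (start symbol, dict from nonterminal to a list of right-hand-side tuples), the two input grammars are assumed to use disjoint nonterminal names, and the helper names `union`, `concat`, `star` and `is_linear` are invented for this example.

```python
def union(g1, g2, start):
    """S -> S' | S'' plus the productions of both grammars."""
    (s1, p1), (s2, p2) = g1, g2
    return start, {**p1, **p2, start: [(s1,), (s2,)]}

def concat(g1, g2, start):
    """S -> S'S'' plus the productions of both grammars."""
    (s1, p1), (s2, p2) = g1, g2
    return start, {**p1, **p2, start: [(s1, s2)]}

def star(g, start):
    """S -> S'S | e plus the productions of the grammar; () is the empty string."""
    s1, p1 = g
    return start, {**p1, start: [(s1, start), ()]}

def is_linear(grammar):
    """True if every right-hand side contains at most one nonterminal."""
    _, productions = grammar
    return all(
        sum(symbol in productions for symbol in rhs) <= 1
        for alternatives in productions.values()
        for rhs in alternatives
    )

R = ("R", {"R": [("a", "R", "b"), ("a", "b")]})
T = ("T", {"T": [("c", "T", "d"), ("c", "d")]})

print(is_linear(union(R, T, "S")))   # union keeps one nonterminal per right-hand side
print(is_linear(concat(R, T, "S")))  # S -> R T has two nonterminals
print(is_linear(star(R, "S")))       # S -> R S has two nonterminals
```

The check only inspects the shape of the productions; it says nothing about whether some other linear grammar for the same language exists.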
Rules 4 and 5, as written, do not preserve linearity. Linearity can be preserved for left-linear and right-linear grammars (since those grammars describe regular languages, and regular languages are closed under these kinds of operations); but linearity cannot be preserved in general. To prove this, an example suffices:
R -> aRb | ab
T -> cTd | cd
L = RT = a^n b^n c^m d^m, 0 < n, m
L' = R* = (a^n b^n)*, 0 < n
Suppose there were a linear grammar for L. We must have a production for the start symbol S that produces something. To produce something, we require a string of terminal and nonterminal symbols. To be linear, we must have at most one nonterminal symbol. That is, our production must be of the form
S := xYz
where x is a string of terminals, Y is a single nonterminal, and z is a string of terminals. If x is non-empty, reflection shows the only useful choice is a; anything else fails to derive known strings in the language. Similarly, if z is non-empty, the only useful choice is d. This gives four cases:
1. x empty, z empty. This is useless, since we now have the same problem to solve for nonterminal Y as we had for S.
2. x = a, z empty. Y must now generate exactly a^n' b^n' b c^m d^m where n' = n - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
3. x empty, z = d. Y must now generate exactly a^n b^n c c^m' d^m' where m' = m - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
4. x = a, z = d. Y must now generate exactly a^n' b^n' bc c^m' d^m' where n' and m' are as in cases 2 and 3. But then the exact same argument applies to the grammar whose start symbol is Y.
None of the possible choices for a useful production for S is actually useful in getting us closer to a string in the language. Therefore, no strings are derived, a contradiction, meaning that the grammar for L cannot be linear.
Suppose there were a grammar for L'. Then that grammar has to generate all the strings in (a^n b^n)R(a^m b^m), plus those in e + R. But it can't generate the ones in the former by the argument used above: any production useful for that purpose would get us no closer to a string in the language.

With respect to grammars, when are epsilon production rules allowed?

I'm trying to understand a concept with respect to grammar and Production Rules.
According to most material on this subject:
1) Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
However, taking a grammar:
G = { T,N,P,S }
Where:
T = {a,b}
N = {S,S1}
S = {S}
P {
S -> aSb
S -> ab
S1 -> SS1
S1 -> E //Please note, using E to represent Epsilon.
}
Where, the language of the grammar is:
L(G) = { a^n b^n | n >= 1 }
In this case, a production rule containing Epsilon exists (derived from S1) but S1 also forms part of a RHS of another production rule (S1 -> SS1).
Doesn't this violate point 1?
Your statement:
Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
would be better stated as
A non-terminal may have an epsilon production rule if that non-terminal does not appear on the right-hand side of any production rule.
In Chomsky's original hierarchy, epsilon productions were banned for all but Type 0 (unrestricted) grammars. If all epsilon productions are banned, then it is impossible for the grammar to produce the empty string. I believe this was not a concern for Chomsky; consequently, most modern formulations allow the start symbol to have an empty right-hand side as long as the start symbol itself does not appear on the right-hand side of any production.
As it happens, the restriction on epsilon-productions is somewhat stronger than is necessary. In the case of both context-free grammars and regular grammars (Chomsky type 2 and type 3 grammars), it is always possible to create a weakly-equivalent grammar without epsilon productions (except possibly the single production S → ε if the grammar can produce the empty string.) It is also possible to remove a number of other anomalies which complicate grammar analysis: unreachable symbols, unproductive symbols, and cyclic productions. The result of the combination of all these eliminations is a "proper context-free grammar".
Consequently, most modern formulations of context-free grammars do not require the right-hand sides to be non-empty.
Your grammar G = {T, N, S, P} with
T = {a, b}
N = {S, S1}
S = {S}
P {
S → a S b
S → a b
S1 → S S1
S1 → ε
}
contains an unreachable symbol, S1. We can easily eliminate it, producing the equivalent grammar G' = { T, N', S, P' }:
N' = {S}
P' {
S → a S b
S → a b
}
G' does not contain any epsilon productions (but even if it had, they could have been eliminated).
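The elimination of unreachable symbols can be sketched as a simple graph search from the start symbol. The dictionary representation and the function names below are assumptions made for this example; the empty tuple stands for ε.

```python
def reachable(productions, start):
    """Return the set of nonterminals reachable from the start symbol."""
    seen = {start}
    stack = [start]
    while stack:
        nonterminal = stack.pop()
        for rhs in productions.get(nonterminal, []):
            for symbol in rhs:
                # Only nonterminals (keys of the dict) are followed.
                if symbol in productions and symbol not in seen:
                    seen.add(symbol)
                    stack.append(symbol)
    return seen

def remove_unreachable(productions, start):
    """Drop every production whose left-hand side cannot be reached."""
    keep = reachable(productions, start)
    return {nt: rhss for nt, rhss in productions.items() if nt in keep}

# The grammar G from the question.
P = {
    "S": [("a", "S", "b"), ("a", "b")],
    "S1": [("S", "S1"), ()],
}

print(reachable(P, "S"))
print(remove_unreachable(P, "S"))
```

Since S1 never appears in a production reachable from S, its epsilon production is dropped together with it.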

How to deal with variable references in yacc/bison (with ocaml)

I was wondering how to deal with variable references inside statements while writing grammars with ocamlyacc and ocamllex.
The problem is that statements of the form
var x = y + z
var b = true | f;
should be both correct but in the first case variable refers to numbers while in the second case f is a boolean variable.
In the grammar I'm writing I have got this:
numeric_exp_val:
| nint { Syntax.Int $1 }
| FLOAT { Syntax.Float $1 }
| LPAREN; ne = numeric_exp; RPAREN { ne }
| INCR; r = numeric_var_ref { Syntax.VarIncr (r,1) }
| DECR; r = numeric_var_ref { Syntax.VarIncr (r,-1) }
| var_ref { $1 }
;
boolean_exp_val:
| BOOL { Syntax.Bool $1 }
| LPAREN; be = boolean_exp; RPAREN { be }
| var_ref { $1 }
;
which obviously can't work, since both var_ref non terminals reduce to the same (reduce/reduce conflict). But I would like to have type checking that is mostly statically done (with respect to variable references) during the parsing phase itself.
That's why I'm wondering which is the best way to have variable references and keep this structure. Just as an additional info I have functions that compile the syntax tree by translating it into a byte code similar to this one:
let rec compile_numeric_exp exp =
match exp with
Int i -> [Push (Types.I.Int i)]
| Float f -> [Push (Types.I.Float f)]
| Bop (BNSum,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Plus]
| Bop (BNSub,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Minus]
| Bop (BNMul,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Times]
| Bop (BNDiv,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Div]
| Bop (BNOr,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Or]
| VarRef n -> [Types.I.MemoryGet (Memory.index_for_name n)]
| VarIncr ((VarRef n) as vr,i) -> (compile_numeric_exp vr) @ [Push (Types.I.Int i);Types.I.Plus;Types.I.Dupe] @ (compile_assignment_to n)
| _ -> []
Parsing is simply not the right place to do type-checking. I don't understand why you insist on doing this in this pass. You would have much clearer code and greater expressive power by doing it in a separate pass.
Is it for efficiency reasons? I'm confident you could devise efficient incremental-typing routines elsewhere, to be called from the grammar production (but I'm not sure you'll win that much). This looks like premature optimization.
There has been work on writing type systems as attribute grammars (which could be seen as a declarative way to express typing derivations), but I don't think it is meant to conflate parsing and typing in a single pass.
If you really want to go further in this direction, I would advise you to use a simple lexical differentiation between num-typed and bool-typed variables. This sounds ugly but is simple.
If you want to treat numeric expressions and boolean expressions as different syntactic categories, then consider how you must parse var x = ( ( y + z ) ). You don't know which type of expression you're parsing until you hit the +. Therefore, you need to eat up several tokens before you know whether you are seeing a numeric_exp_val or a boolean_exp_val: you need some unbounded lookahead. Yacc does not provide such lookahead (Yacc only provides a restricted form of lookahead, roughly described as LALR, which puts bounds on parsing time and memory requirements). There is even an ambiguous case that makes your grammar context-sensitive: with a definition like var x = y, you need to look up the type of y.
You can solve this last ambiguity by feeding back the type information into the lexer, and you can solve the need for lookahead by using a parser generator that supports unbounded lookahead. However, both of these techniques will push your parser towards a point where it can't easily evolve if you want to expand the language later on (for example to distinguish between integer and floating-point numbers, to add strings or lists, etc.).
If you want a simple but constraining fix with a low technological overhead, I'll second gasche's suggestion of adding a syntactic distinguisher for numeric and boolean variable definitions, something like bvar b = … and nvar x = …. There again, this will make it difficult to support other types later on.
You will have an easier time overall if you separate the type checking from the parsing. Once you've built an abstract syntax tree, do a pass of type checking (in which you will infer the type of variables).
type numeric_expression = Nconst of float | Nplus of numeric_expression * numeric_expression | …
and boolean_expression = Bconst of bool | Bor of boolean_expression * boolean_expression | …
type typed_expression = Tnum of numeric_expression | Tbool of boolean_expression
type typed_statement = Tvar of string * typed_expression
let rec type_expression : Syntax.expression -> typed_expression = function
| Syntax.Float x -> Tnum (Nconst x)
| Syntax.Plus (e1, e2) ->
begin match type_expression e1, type_expression e2 with
| Tnum n1, Tnum n2 -> Tnum (Nplus (n1, n2))
| _, (Tbool _ as t2) -> raise (Invalid_argument_type ("+", t2))
| (Tbool _ as t1), _ -> raise (Invalid_argument_type ("+", t1))
end
| …
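The same idea, a separate pass that walks an untyped syntax tree and infers variable types, can be sketched in Python. This is not a translation of the OCaml above; the tuple-based tree encoding and the names `type_of` and `check_program` are assumptions made for illustration.

```python
def type_of(expr, env):
    """Return "num" or "bool" for an expression, or raise TypeError."""
    tag = expr[0]
    if tag == "num":
        return "num"
    if tag == "bool":
        return "bool"
    if tag == "var":
        name = expr[1]
        if name not in env:
            raise TypeError(f"undefined variable {name}")
        return env[name]
    if tag in ("plus", "or"):
        expected = "num" if tag == "plus" else "bool"
        for operand in expr[1:]:
            actual = type_of(operand, env)
            if actual != expected:
                raise TypeError(f"{tag} expects {expected}, got {actual}")
        return expected
    raise ValueError(f"unknown node {tag}")

def check_program(statements, env=None):
    """Each statement is (name, expr), standing for `var name = expr`."""
    env = dict(env or {})
    for name, expr in statements:
        env[name] = type_of(expr, env)  # the variable's type is inferred here
    return env

program = [
    ("y", ("num", 1.0)),
    ("z", ("num", 2.0)),
    ("x", ("plus", ("var", "y"), ("var", "z"))),
    ("f", ("bool", False)),
    ("b", ("or", ("bool", True), ("var", "f"))),
]
print(check_program(program))
```

Because the parser only builds the tree, a single `var_ref` rule suffices in the grammar, and the reduce/reduce conflict does not arise.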

What is a 'semantic predicate' in ANTLR?

What is a semantic predicate in ANTLR?
ANTLR 4
For predicates in ANTLR 4, check out these Stack Overflow Q&As:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
ANTLR 3
A semantic predicate is a way to enforce extra (semantic) rules upon grammar
actions using plain code.
There are 3 types of semantic predicates:
validating semantic predicates;
gated semantic predicates;
disambiguating semantic predicates.
Example grammar
Let's say you have a block of text consisting of only numbers separated by
commas, ignoring any white spaces. You would like to parse this input making
sure that the numbers are at most 3 digits "long" (at most 999). The following
grammar (Numbers.g) would do such a thing:
grammar Numbers;
// entry point of this parser: it parses an input string consisting of at least
// one number, optionally followed by zero or more commas and numbers
parse
: number (',' number)* EOF
;
// matches a number that is between 1 and 3 digits long
number
: Digit Digit Digit
| Digit Digit
| Digit
;
// matches a single digit
Digit
: '0'..'9'
;
// ignore spaces
WhiteSpace
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Testing
The grammar can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("123, 456, 7 , 89");
NumbersLexer lexer = new NumbersLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
NumbersParser parser = new NumbersParser(tokens);
parser.parse();
}
}
Test it by generating the lexer and parser, compiling all .java files and
running the Main class:
java -cp antlr-3.2.jar org.antlr.Tool Numbers.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
When doing so, nothing is printed to the console, which indicates that nothing
went wrong. Try changing:
ANTLRStringStream in = new ANTLRStringStream("123, 456, 7 , 89");
into:
ANTLRStringStream in = new ANTLRStringStream("123, 456, 7777 , 89");
and do the test again: you will see an error appearing on the console right after the string 777.
Semantic Predicates
This brings us to the semantic predicates. Let's say you want to parse
numbers between 1 and 10 digits long. A rule like:
number
: Digit Digit Digit Digit Digit Digit Digit Digit Digit Digit
| Digit Digit Digit Digit Digit Digit Digit Digit Digit
/* ... */
| Digit Digit Digit
| Digit Digit
| Digit
;
would become cumbersome. Semantic predicates can help simplify this type of rule.
1. Validating Semantic Predicates
A validating semantic predicate is nothing
more than a block of code followed by a question mark:
RULE { /* a boolean expression in here */ }?
To solve the problem above using a validating
semantic predicate, change the number rule in the grammar into:
number
@init { int N = 0; }
: (Digit { N++; } )+ { N <= 10 }?
;
The parts { int N = 0; } and { N++; } are plain Java statements, the first of
which is executed when the parser "enters" the number rule. The actual
predicate is { N <= 10 }?, which causes the parser to throw a
FailedPredicateException
whenever a number is more than 10 digits long.
Test it by using the following ANTLRStringStream:
// all equal or less than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,1234567890");
which produces no exception, while the following does throw an exception:
// '12345678901' is more than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,12345678901");
2. Gated Semantic Predicates
A gated semantic predicate is similar to a validating semantic predicate,
only the gated version produces a syntax error instead of a FailedPredicateException.
The syntax of a gated semantic predicate is:
{ /* a boolean expression in here */ }?=> RULE
To instead solve the above problem using gated predicates to match numbers up to 10 digits long you would write:
number
@init { int N = 1; }
: ( { N <= 10 }?=> Digit { N++; } )+
;
Test it again with both:
// all equal or less than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,1234567890");
and:
// '12345678901' is more than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,12345678901");
and you will see the last one will produce an error.
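The difference between the two kinds of predicate can be modelled outside ANTLR with two hand-written matchers. This is only a rough sketch of the behaviour described above, not ANTLR's generated code; `FailedPredicate`, `match_number_validating` and `match_number_gated` are hypothetical names.

```python
DIGITS = "0123456789"

class FailedPredicate(Exception):
    """Stands in for ANTLR 3's FailedPredicateException in this sketch."""

def match_number_validating(text, pos, limit=10):
    """Consume every digit first, then validate the count afterwards."""
    end = pos
    while end < len(text) and text[end] in DIGITS:
        end += 1
    if end == pos:
        raise SyntaxError("expected a digit")
    if end - pos > limit:
        raise FailedPredicate(f"number has more than {limit} digits")
    return end  # index just past the number

def match_number_gated(text, pos, limit=10):
    """Check the gate before each digit; stop consuming once it closes."""
    end = pos
    n = 1
    while end < len(text) and text[end] in DIGITS and n <= limit:
        end += 1
        n += 1
    if end == pos:
        raise SyntaxError("expected a digit")
    return end  # a leftover digit is left for the caller to reject

print(match_number_validating("1234567890", 0))
print(match_number_gated("12345678901", 0))
```

In the gated version the eleventh digit is simply not consumed, so the enclosing rule, which expects a comma or the end of input next, is the one that reports the error.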
3. Disambiguating Semantic Predicates
The final type of predicate is a disambiguating semantic predicate, which looks a bit like a validating predicate ({boolean-expression}?), but acts more like a gated semantic predicate (no exception is thrown when the boolean expression evaluates to false). You can use it at the start of a rule to check some property of a rule and let the parser match said rule or not.
Let's say the example grammar creates Number tokens (a lexer rule instead of a parser rule) that will match numbers in the range of 0..999. Now in the parser, you'd like to make a distinction between low and high numbers (low: 0..500, high: 501..999). This could be done using a disambiguating semantic predicate where you inspect the token next in the stream (input.LT(1)) to check if it's either low or high.
A demo:
grammar Numbers;
parse
: atom (',' atom)* EOF
;
atom
: low {System.out.println("low = " + $low.text);}
| high {System.out.println("high = " + $high.text);}
;
low
: {Integer.valueOf(input.LT(1).getText()) <= 500}? Number
;
high
: Number
;
Number
: Digit Digit Digit
| Digit Digit
| Digit
;
fragment Digit
: '0'..'9'
;
WhiteSpace
: (' ' | '\t' | '\r' | '\n') {skip();}
;
If you now parse the string "123, 999, 456, 700, 89, 0", you'd see the following output:
low = 123
high = 999
low = 456
high = 700
low = 89
low = 0
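The decision the predicate makes can be mimicked in a few lines of Python: look at the next token's text and pick the `low` alternative only when its value is at most 500. This is a simplified model with a hypothetical `classify` function, not the parser ANTLR generates.

```python
import re

def classify(text):
    """Split on commas and label each 1-3 digit number as low or high."""
    results = []
    for raw in text.split(","):
        token = raw.strip()
        if re.fullmatch(r"[0-9]{1,3}", token) is None:
            raise SyntaxError(f"not a Number token: {token!r}")
        # The predicate: inspect the upcoming token before choosing a rule.
        kind = "low" if int(token) <= 500 else "high"
        results.append((kind, token))
    return results

for kind, token in classify("123, 999, 456, 700, 89, 0"):
    print(f"{kind} = {token}")
```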
I've always used the terse reference to ANTLR predicates on wincent.com as my guide.