What is a semantic predicate in ANTLR?

ANTLR 4
For predicates in ANTLR 4, check out these Stack Overflow Q&As:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
ANTLR 3
A semantic predicate is a way to enforce extra (semantic) rules upon grammar
actions using plain code.
There are 3 types of semantic predicates:
validating semantic predicates;
gated semantic predicates;
disambiguating semantic predicates.
Example grammar
Let's say you have a block of text consisting of only numbers separated by
commas, ignoring any white space. You would like to parse this input, making
sure that the numbers are at most 3 digits "long" (at most 999). The following
grammar (Numbers.g) would do such a thing:
grammar Numbers;
// entry point of this parser: it parses an input string consisting of at least
// one number, followed by zero or more comma-separated numbers
parse
: number (',' number)* EOF
;
// matches a number that is between 1 and 3 digits long
number
: Digit Digit Digit
| Digit Digit
| Digit
;
// matches a single digit
Digit
: '0'..'9'
;
// ignore spaces
WhiteSpace
: (' ' | '\t' | '\r' | '\n') {skip();}
;
Testing
The grammar can be tested with the following class:
import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("123, 456, 7 , 89");
        NumbersLexer lexer = new NumbersLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        NumbersParser parser = new NumbersParser(tokens);
        parser.parse();
    }
}
Test it by generating the lexer and parser, compiling all .java files and
running the Main class:
java -cp antlr-3.2.jar org.antlr.Tool Numbers.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
When doing so, nothing is printed to the console, which indicates that nothing
went wrong. Try changing:
ANTLRStringStream in = new ANTLRStringStream("123, 456, 7 , 89");
into:
ANTLRStringStream in = new ANTLRStringStream("123, 456, 7777 , 89");
and do the test again: you will see an error appearing on the console right after the string 777.
Semantic Predicates
This brings us to the semantic predicates. Let's say you want to parse
numbers between 1 and 10 digits long. A rule like:
number
: Digit Digit Digit Digit Digit Digit Digit Digit Digit Digit
| Digit Digit Digit Digit Digit Digit Digit Digit Digit
/* ... */
| Digit Digit Digit
| Digit Digit
| Digit
;
would become cumbersome. Semantic predicates can help simplify this type of rule.
1. Validating Semantic Predicates
A validating semantic predicate is nothing
more than a block of code followed by a question mark:
RULE { /* a boolean expression in here */ }?
To solve the problem above using a validating
semantic predicate, change the number rule in the grammar into:
number
@init { int N = 0; }
: (Digit { N++; } )+ { N <= 10 }?
;
The parts { int N = 0; } and { N++; } are plain Java statements, the first of
which is executed when the parser "enters" the number rule. The actual
predicate is: { N <= 10 }?, which causes the parser to throw a
FailedPredicateException
whenever a number is more than 10 digits long.
Test it by using the following ANTLRStringStream:
// all equal or less than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,1234567890");
which produces no exception, while the following does throw an exception:
// '12345678901' is more than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,12345678901");
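The constraint the validating predicate enforces can also be mirrored outside the generated parser. Here is a minimal plain-Java sketch (the class and method names are made up for illustration, this is not generated ANTLR code):

```java
public class NumberCheck {
    // Mirrors the validating predicate: every character must be a digit
    // and the token may be at most 10 digits long.
    static boolean isValidNumber(String s) {
        if (s.isEmpty() || s.length() > 10) {
            return false;
        }
        for (char c : s.toCharArray()) {
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidNumber("1234567890"));  // 10 digits -> true
        System.out.println(isValidNumber("12345678901")); // 11 digits -> false
    }
}
```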
2. Gated Semantic Predicates
A gated semantic predicate is similar to a validating semantic predicate,
only the gated version produces a syntax error instead of a FailedPredicateException.
The syntax of a gated semantic predicate is:
{ /* a boolean expression in here */ }?=> RULE
To solve the above problem with a gated predicate instead, matching numbers up to 10 digits long, you would write:
number
@init { int N = 1; }
: ( { N <= 10 }?=> Digit { N++; } )+
;
Test it again with both:
// all equal or less than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,1234567890");
and:
// '12345678901' is more than 10 digits
ANTLRStringStream in = new ANTLRStringStream("1,23,12345678901");
and you will see that the last one throws an error.
3. Disambiguating Semantic Predicates
The final type of predicate is a disambiguating semantic predicate, which looks a bit like a validating predicate ({boolean-expression}?), but acts more like a gated semantic predicate (no exception is thrown when the boolean expression evaluates to false). You can use it at the start of a rule to check some property of a rule and let the parser match said rule or not.
Let's say the example grammar creates Number tokens (a lexer rule instead of a parser rule) that match numbers in the range 0..999. Now, in the parser, you'd like to make a distinction between low and high numbers (low: 0..500, high: 501..999). This could be done using a disambiguating semantic predicate that inspects the next token in the stream (input.LT(1)) to check whether it's low or high.
A demo:
grammar Numbers;
parse
: atom (',' atom)* EOF
;
atom
: low {System.out.println("low = " + $low.text);}
| high {System.out.println("high = " + $high.text);}
;
low
: {Integer.valueOf(input.LT(1).getText()) <= 500}? Number
;
high
: Number
;
Number
: Digit Digit Digit
| Digit Digit
| Digit
;
fragment Digit
: '0'..'9'
;
WhiteSpace
: (' ' | '\t' | '\r' | '\n') {skip();}
;
If you now parse the string "123, 999, 456, 700, 89, 0", you'd see the following output:
low = 123
high = 999
low = 456
high = 700
low = 89
low = 0
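The decision made by the predicate in the low rule can be mirrored in plain Java (a hypothetical helper for illustration, not generated parser code):

```java
public class LowHigh {
    // Mirrors the predicate in the `low` rule: numbers up to 500 are "low",
    // everything else in the 0..999 token range is "high".
    static String classify(String numberToken) {
        return Integer.valueOf(numberToken) <= 500 ? "low" : "high";
    }

    public static void main(String[] args) {
        // Same tokens as the parsed string "123, 999, 456, 700, 89, 0".
        for (String t : new String[] {"123", "999", "456", "700", "89", "0"}) {
            System.out.println(classify(t) + " = " + t);
        }
    }
}
```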

I've always used the terse reference to ANTLR predicates on wincent.com as my guide.

Related

Antlr4 Arithmetic Grammar Is Ignoring Order of Precedence (PEMDAS)

Grammar Definition
In my understanding, ANTLR4 supports left recursion to respect the order of precedence for arithmetic. With that said, here's the grammar:
grammar Arithmetic;
arithmetic: arithmeticExpression;
arithmeticExpression:
LPARAN inner = arithmeticExpression RPARAN # Parentheses
| left = arithmeticExpression POW right = arithmeticExpression # Power
| left = arithmeticExpression MUL right = arithmeticExpression # Multiplication
| left = arithmeticExpression DIV right = arithmeticExpression # Division
| left = arithmeticExpression ADD right = arithmeticExpression # Addition
| left = arithmeticExpression SUB right = arithmeticExpression # Subtraction
| arithmeticExpressionInput # ArithmeticInput;
arithmeticExpressionInput: NUMBER;
number: NUMBER;
/* Operators */
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
/* Data Types */
NUMBER: '-'? [0-9]+;
/* Whitespace & End of Lines */
EOL: '\r'? '\n';
WS: [ \t]+ -> channel(HIDDEN);
Note: I've simplified the grammar for testing.
Input
5 + 21 / 7 * 3
Output Parse Tree
Problem
In the outputted parse tree, starting at the arithmetic rule, you can see that the order of precedence is not following PEMDAS, even though it's defined via left recursion in the grammar. This is also observed when debugging the visitor code generated by ANTLR, with the function call being VisitAddition.
I've googled this, and I can't see what I'm doing wrong compared to the examples; they all look the same.
Environment
ANTLR Version: 4.11.1
Build Target: CSharp
.NET Project Packages:
Antlr4BuildTasks#11.1.0
Antlr4.Runtime.Standard#4.11.1
As mentioned in the comments by @sepp2k (thank you):
The issue is with the grammar: multiplication, division, addition and subtraction had each been given a separate alternative, essentially creating P-E-M-D-A-S precedence when it should be PE(MD)(AS), i.e. multiplication and division share one precedence level, and addition and subtraction share another.
Here's an example of the fixed grammar:
arithmeticExpression:
    arithmeticExpressionInput # ArithmeticInput
    | LPARAN inner = arithmeticExpression RPARAN # Parentheses
    | left = arithmeticExpression operator = POW right = arithmeticExpression # Power
    | left = arithmeticExpression operator = (MUL|DIV) right = arithmeticExpression # MultiplicationOrDivision
    | left = arithmeticExpression operator = (ADD|SUB) right = arithmeticExpression # AdditionOrSubtraction;
and now the outputted parse tree is much cleaner.
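The PE(MD)(AS) grouping the corrected grammar encodes can be cross-checked with a small hand-written precedence-climbing evaluator. This is a standalone sketch, independent of the ANTLR-generated code; it only handles the operators above and left-associates all of them, including ^, just as the left-recursive rule does:

```java
public class Eval {
    private final String src;
    private int pos;

    Eval(String src) { this.src = src.replace(" ", ""); }

    // Precedence table: ^ binds tighter than * and /, which bind
    // tighter than + and -, i.e. PE(MD)(AS).
    static int prec(char op) {
        switch (op) {
            case '^': return 3;
            case '*': case '/': return 2;
            case '+': case '-': return 1;
            default: return 0; // not an operator (digit or parenthesis)
        }
    }

    double parseExpr(int minPrec) {
        double left = parseAtom();
        while (pos < src.length() && prec(src.charAt(pos)) >= minPrec) {
            char op = src.charAt(pos++);
            // Left-associative: the right operand may only use strictly
            // tighter-binding operators.
            double right = parseExpr(prec(op) + 1);
            switch (op) {
                case '^': left = Math.pow(left, right); break;
                case '*': left *= right; break;
                case '/': left /= right; break;
                case '+': left += right; break;
                case '-': left -= right; break;
            }
        }
        return left;
    }

    double parseAtom() {
        if (src.charAt(pos) == '(') {
            pos++; // consume '('
            double v = parseExpr(1);
            pos++; // consume ')'
            return v;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Double.parseDouble(src.substring(start, pos));
    }

    static double eval(String s) { return new Eval(s).parseExpr(1); }

    public static void main(String[] args) {
        // 5 + ((21 / 7) * 3) = 14, matching PE(MD)(AS) grouping.
        System.out.println(eval("5 + 21 / 7 * 3")); // 14.0
    }
}
```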

Squeak Smalltalk, does +, -, *, / have more precedence over power?

I understand that in Smalltalk numerical calculation, without round brackets, everything is calculated from left to right. Nothing follows the rule of multiplication and division having more precedence than addition and subtraction.
Like the following code:
3 + 3 * 2
The print output is 12, while in mathematics we get 9.
But when I started to try power calculation, like
91 raisedTo: 3 + 1.
I thought the answer should be 753572 (91^3 + 1).
What I actually get is 68574961 (91^4).
Why's that?
Is it because +, -, *, / have more precedence than power?
Smalltalk does not have operators with precedence. Instead, there are three different kinds of messages. Each kind has its own precedence.
They are:
unary messages, which consist of a single identifier and do not take parameters, such as squared or asString in 3 squared or order asString;
binary messages, which have a selector composed of the symbols !%&*+,/<=>?@\~- and take one parameter, such as + and -> in 3 + 4 or key -> value;
keyword messages, which take one or more parameters and have a selector with a colon before each parameter, such as raisedTo: and to:by:do: in 4 raisedTo: 3 and 1 to: 10 by: 3 do: [ … ].
Unary messages have precedence over binary and both of them have precedence over keyword messages. In other words:
unary > binary > keyword
So, for example,
5 raisedTo: 7 - 2 squared
evaluates to 125, because first the unary 2 squared is evaluated, resulting in 4; then the binary 7 - 4 is evaluated, resulting in 3; and finally the keyword 5 raisedTo: 3 evaluates to 125.
Of course, parentheses have the highest precedence of everything.
To simplify the understanding of this concept, don't think about numbers and math: all the numbers are objects, and all the operators are messages. The reason for this is that a + b * c does not mean that a, b, and c are numbers. They can be humans, cars, online store articles. And they can define their own + and * methods, but this does not mean that * (which is not a "multiplication", it's just a "star message") should happen before +.
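The strict left-to-right evaluation of binary messages can be contrasted with conventional precedence in a small sketch (plain Java, purely for illustration; Smalltalk itself is not involved):

```java
public class LeftToRight {
    // Evaluates a chain of +, -, *, / strictly left to right,
    // the way Smalltalk sends binary messages.
    static int evalLeftToRight(String expr) {
        String[] tokens = expr.trim().split("\\s+");
        int acc = Integer.parseInt(tokens[0]);
        for (int i = 1; i < tokens.length; i += 2) {
            int arg = Integer.parseInt(tokens[i + 1]);
            switch (tokens[i]) {
                case "+": acc += arg; break;
                case "-": acc -= arg; break;
                case "*": acc *= arg; break;
                case "/": acc /= arg; break;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        // Smalltalk-style: (3 + 3) * 2 = 12, not 3 + (3 * 2) = 9.
        System.out.println(evalLeftToRight("3 + 3 * 2")); // 12
    }
}
```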
Yes, +, -, *, / have more precedence than raisedTo:, and the interesting aspect of this is the reason why this happens.
In Smalltalk there are three types of messages: unary, binary and keyword. In our case, +, -, * and / are examples of binary messages, while raisedTo: is a keyword one. You can tell because binary selectors are made of characters that are not letters or numbers, unlike unary and keyword selectors, which start with a letter or underscore and continue with letters, numbers or underscores. You can also tell a selector is unary because it does not end with a colon. Thus, raisedTo: is a keyword message because it ends with a colon (and is not made of non-letter, non-numeric symbols).
So, the expression 91 raisedTo: 3 + 1 includes two selectors, one binary (+) and one keyword (raisedTo:) and the precedence rule says:
first evaluate unary messages, then binary ones and finally those with keywords
This is why 3 + 1 gets evaluated first. Of course, you can always change the precedence using parenthesis. For example:
(91 raisedTo: 3) + 1
will evaluate first raisedTo: and then +. Note that you could write
91 raisedTo: (3 + 1)
too. But this is usually not done because Smalltalk precedence rules are so easy to remember that you don't need to emphasize them.
Commonly used binary selectors
@ the Point creation message for x @ y
>= greater or equal, etc.
-> the Association message for key -> value
==> production transformation used by PetitParser
= equal
== identical (very same object)
~= not equal
~~ not identical
\\ remainder
// quotient
and a lot more. Of course, you are always entitled to create your own.

With respect to grammars, when are epsilon production rules allowed?

I'm trying to understand a concept with respect to grammar and Production Rules.
According to most material on this subject:
1) Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
However, taking a grammar:
G = { T,N,P,S }
Where:
T = {a,b}
N = {S,S1}
S = {S}
P {
S -> aSb
S -> ab
S1 -> SS1
S1 -> E //Please note, using E to represent Epsilon.
}
Where, the language of the grammar is:
L(G) = { a^n b^n | n >= 1 }
In this case, a production rule containing Epsilon exists (derived from S1) but S1 also forms part of a RHS of another production rule (S1 -> SS1).
Doesn't this violate point 1?
Your statement:
Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
would be better stated as
A non-terminal may have an epsilon production rules if that non-terminal does not appear on the right-hand side of any production rule.
In Chomsky's original hierarchy, epsilon productions were banned for all but Type 0 (unrestricted) grammars. If all epsilon productions are banned, then it is impossible for the grammar to produce the empty string. I believe this was not a concern for Chomsky; consequently, most modern formulations allow the start symbol to have an empty right-hand side as long as the start symbol itself does not appear on the right-hand side of any production.
As it happens, the restriction on epsilon-productions is somewhat stronger than is necessary. In the case of both context-free grammars and regular grammars (Chomsky type 2 and type 3 grammars), it is always possible to create a weakly-equivalent grammar without epsilon productions (except possibly the single production S → ε if the grammar can produce the empty string.) It is also possible to remove a number of other anomalies which complicate grammar analysis: unreachable symbols, unproductive symbols, and cyclic productions. The result of the combination of all these eliminations is a "proper context-free grammar".
Consequently, most modern formulations of context-free grammars do not require the right-hand sides to be non-empty.
Your grammar G = {T, N, S, P} with
T = {a, b}
N = {S, S1}
S = {S}
P {
S → a S b
S → a b
S1 → S S1
S1 → ε
}
contains an unreachable symbol, S1. We can easily eliminate it, producing the equivalent grammar G' = { T, N', S, P' }:
N' = {S}
P' {
S → a S b
S → a b
}
G' does not contain any epsilon productions (but even if it had, they could have been eliminated).
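The unreachable-symbol elimination used above can be sketched as a fixed-point computation over the productions. This is a minimal sketch; the grammar is represented as a map from nonterminals to lists of right-hand sides, with ε as the empty list:

```java
import java.util.*;

public class Reachable {
    // Computes the set of nonterminals reachable from the start symbol by
    // repeatedly following the right-hand sides of their productions.
    static Set<String> reachable(Map<String, List<List<String>>> productions, String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            String nt = work.pop();
            if (!seen.add(nt)) continue; // already processed
            for (List<String> rhs : productions.getOrDefault(nt, List.of())) {
                for (String sym : rhs) {
                    // Only nonterminals (keys of the map) go on the worklist.
                    if (productions.containsKey(sym) && !seen.contains(sym)) {
                        work.push(sym);
                    }
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // The grammar from the question: S1 never occurs in any sentential
        // form derived from S, so it is unreachable.
        Map<String, List<List<String>>> p = Map.of(
            "S",  List.of(List.of("a", "S", "b"), List.of("a", "b")),
            "S1", List.of(List.of("S", "S1"), List.<String>of())
        );
        System.out.println(reachable(p, "S")); // [S]
    }
}
```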

Context-free grammar for { (^n a )^n + (^(n-1) a )^(n-1) + ... + (a) | n >= 1 }

I need help with building a context-free grammar for the following language:
{ (^n a )^n + (^(n-1) a )^(n-1) + ... + (^1 a )^1 | n >= 1 }
where (^k and )^k denote k opening and k closing parentheses; n can be any natural number.
So the language contains strings like:
(a)
((a))+(a)
(((a)))+((a))+(a)
... and so on.
I tried to build a grammar and got stuck: it seems that we need to store information about the number of parentheses somehow.
My questions are:
Is the language context-free at all?
If it is not, how can that be proved? But if the language is context-free after all, then what productions can produce its strings?
Thanks!

How to deal with variable references in yacc/bison (with ocaml)

I was wondering how to deal with variable references inside statements while writing grammars with ocamlyacc and ocamllex.
The problem is that statements of the form
var x = y + z
var b = true | f;
should be both correct but in the first case variable refers to numbers while in the second case f is a boolean variable.
In the grammar I'm writing I have got this:
numeric_exp_val:
| nint { Syntax.Int $1 }
| FLOAT { Syntax.Float $1 }
| LPAREN; ne = numeric_exp; RPAREN { ne }
| INCR; r = numeric_var_ref { Syntax.VarIncr (r,1) }
| DECR; r = numeric_var_ref { Syntax.VarIncr (r,-1) }
| var_ref { $1 }
;
boolean_exp_val:
| BOOL { Syntax.Bool $1 }
| LPAREN; be = boolean_exp; RPAREN { be }
| var_ref { $1 }
;
which obviously can't work, since both var_ref nonterminals reduce to the same thing (a reduce/reduce conflict). But I would like to have type checking that is mostly done statically (with respect to variable references) during the parsing phase itself.
That's why I'm wondering which is the best way to have variable references and keep this structure. Just as an additional info I have functions that compile the syntax tree by translating it into a byte code similar to this one:
let rec compile_numeric_exp exp =
  match exp with
    Int i -> [Push (Types.I.Int i)]
  | Float f -> [Push (Types.I.Float f)]
  | Bop (BNSum,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Plus]
  | Bop (BNSub,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Minus]
  | Bop (BNMul,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Times]
  | Bop (BNDiv,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Div]
  | Bop (BNOr,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Or]
  | VarRef n -> [Types.I.MemoryGet (Memory.index_for_name n)]
  | VarIncr ((VarRef n) as vr,i) -> (compile_numeric_exp vr) @ [Push (Types.I.Int i); Types.I.Plus; Types.I.Dupe] @ (compile_assignment_to n)
  | _ -> []
Parsing is simply not the right place to do type-checking. I don't understand why you insist on doing this in this pass. You would have much clearer code and greater expressive power by doing it in a separate pass.
Is it for efficiency reasons? I'm confident you could devise efficient incremental-typing routines elsewhere, to be called from the grammar production (but I'm not sure you'll win that much). This looks like premature optimization.
There has been work on writing type systems as attribute grammars (which could be seen as a declarative way to express typing derivations), but I don't think it is meant to conflate parsing and typing in a single pass.
If you really want to go further in this direction, I would advise you to use a simple lexical differentiation between num-typed and bool-typed variables. This sounds ugly but is simple.
If you want to treat numeric expressions and boolean expressions as different syntactic categories, then consider how you must parse var x = ( ( y + z ) ). You don't know which type of expression you're parsing until you hit the +. Therefore, you need to eat up several tokens before you know whether you are seeing a numeric_exp_val or a boolean_exp_val: you need some unbounded lookahead. Yacc does not provide such lookahead (Yacc only provides a restricted form of lookahead, roughly described as LALR, which puts bounds on parsing time and memory requirements). There is even an ambiguous case that makes your grammar context-sensitive: with a definition like var x = y, you need to look up the type of y.
You can solve this last ambiguity by feeding back the type information into the lexer, and you can solve the need for lookahead by using a parser generator that supports unbounded lookahead. However, both of these techniques will push your parser towards a point where it can't easily evolve if you want to expand the language later on (for example to distinguish between integer and floating-point numbers, to add strings or lists, etc.).
If you want a simple but constraining fix with a low technological overhead, I'll second gasche's suggestion of adding a syntactic distinguisher for numeric and boolean variable definitions, something like bvar b = … and nvar x = …. There again, this will make it difficult to support other types later on.
You will have an easier time overall if you separate the type checking from the parsing. Once you've built an abstract syntax tree, do a pass of type checking (in which you will infer the types of variables):
type numeric_expression = Nconst of float | Nplus of numeric_expression * numeric_expression | …
and boolean_expression = Bconst of bool | Bor of boolean_expression * boolean_expression | …
type typed_expression = Tnum of numeric_expression | Tbool of boolean_expression
type typed_statement = Tvar of string * typed_expression
let rec type_expression : Syntax.expression -> typed_expression = function
  | Syntax.Float x -> Tnum (Nconst x)
  | Syntax.Plus (e1, e2) ->
      begin match type_expression e1, type_expression e2 with
      | Tnum n1, Tnum n2 -> Tnum (Nplus (n1, n2))
      | _, (Tbool _ as t2) -> raise (Invalid_argument_type ("+", t2))
      | (Tbool _ as t1), _ -> raise (Invalid_argument_type ("+", t1))
      end
  | …