yacc/bison grammar: what's the difference between two rules?

I have to write some sort of parser and it's quite easy with tools like yacc and bison. But I have the following question:
NotRec: /*empty*/
| NotRec T_NOT
| NotRec T_MINUS
;
Expr:
| Term /*1*/
| NotRec Term /*2*/
;
What is the difference between rule 1 and 2?
In my opinion, NotRec can be empty (because it has an empty branch), and therefore Term should match the same thing as NotRec Term. But if I remove the first rule, I get different results!

As written, the grammar is ambiguous, as NotRec will match 0 or more T_NOT or T_MINUS tokens, so if you have an Expr which has no such tokens before the Term, it can be matched by either rule 1 or rule 2.
If you remove the NotRec: /*empty*/ rule, then NotRec becomes useless, as it no longer matches any finite sequence of tokens. That changes the language: any Expr with a prefix of T_NOT/T_MINUS tokens is no longer accepted.
If you remove the Expr: Term rule instead, that gets rid of the ambiguity without changing the language.
If you use this grammar as-is with yacc or bison, you'll get a shift/reduce conflict due to the ambiguity. The default conflict resolution (preferring the shift) will resolve the ambiguity without changing the language -- as long as there are no other conflicting uses of those tokens in the part of the grammar you left out. It will use rule 1 for any Expr with no T_NOT/T_MINUS tokens and rule 2 for any Expr with one or more such tokens. This is equivalent to changing the NotRec rules to
NotRec: T_NOT
| T_MINUS
| NotRec T_NOT
| NotRec T_MINUS
;
This makes NotRec match one or more tokens instead of zero or more.
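To make the second option above concrete, here is a minimal sketch of the disambiguated grammar: it keeps the empty NotRec and drops the plain Term alternative, so every Expr goes through NotRec (an illustration, not code from the question):
NotRec: /*empty*/
| NotRec T_NOT
| NotRec T_MINUS
;
Expr: NotRec Term
;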

Related

What is the meaning of the ANTLR syntax in this grammar file?

I am trying to parse a file using ANTLR4 via Python. I am following a tutorial (https://faun.pub/introduction-to-antlr-python-af8a3c603d23); I am able to execute the code and get responses like the ones shown in the tutorial, but I'm failing to understand the logic of the grammar file.
grammar MyGrammer;
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
From my understanding, the constants (what I call them since they are all capitals) HELLO, BYE, INT, and WS define rules for what that set of text can contain. I think they relate to functions somehow, but I am not sure. So the HELLO function will be executed if the parser encounters something that says either 'hello' or 'hi'. The expr rule is what is confusing me.
When I run the command
antlr4 -Dlanguage=Python3 MyGrammer.g4 -visitor -o dist
it produces many files but the main one contains InfixExpr, NumberExpr, ParenExpr, HelloExpr, and ByeExpr. I can see that somehow the author knows that he is doing something with the constants HELLO, BYE, etc. Is there any documentation on the expr piece above and what do the keywords atom, left, right mean?
Any rule that begins with a capital letter (often the entire rule name is capitalized to make it obvious) is a lexer rule.
Rules that begin with lower case letters are parser rules.
It’s VERY important to understand the difference and the flow of your input all the way through to a parse tree.
Your input stream of characters is first processed by the Lexer (using the Lexer rules) to produce a stream of tokens for the parser to act upon. It’s important to understand that the parser has NO impact on how the Lexer interprets the input.
When multiple Lexer rules could match your input, two “tie breakers” come into play.
1 - if a rule matches more characters of your input stream than the other rules, then that is the rule used to produce a token.
2 - if multiple Lexer rules tie by matching the same sequence of input characters, then the Lexer rule that appears first in your grammar is used to generate a token.
Your parser rules are evaluated using a recursive descent approach beginning with whatever startRule you specify. ANTLR uses several techniques to do its best to recognize your input; that includes trying alternatives until one is found that matches, ignoring a token (and producing an error) if that allows the parser to continue, and inserting a missing token (and producing an error) if that allows the parser to continue.
re: the expr portion:
The rule says that there are 6 possible ways to recognize an expr
left=expr op=('*'|'/') right=expr (which will create an InfixExprContext node in the parse tree)
left=expr op=('+'|'-') right=expr (InfixExprContext (also))
atom=INT (NumberExprContext)
'(' expr ')' (ParenExprContext)
atom=HELLO (HelloExprContext)
atom=BYE (ByeExprContext)
The benefit of the labels (ex: # InfixExpr) is that, by creating a Context more specific than an ExprContext, you will have visitInfixExpr, visitNumberExpr, (etc.) methods that you can override in your Visitor instead of just a visitExpr method that contains all the alternatives. A similar thing happens for the enterXX and exitXX methods of your Listener classes.
In the left=expr op=('*'|'/') right=expr rule, the left, op and right names will generate accessors that make it easier to access those parts of your parse tree in your *Context class (without them you'd just have an array of expr, for example, where expr[0] would be the first expr and expr[1] would be the second). It's probably a good idea to look at the generated code with and without the names and labels to see the difference. Both make it MUCH easier to write the logic in your visitors/listeners.
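As an illustration of how those generated names show up on the Python side, here is a minimal, hedged sketch of a visitor. It assumes the default file and class names ANTLR generates for grammar MyGrammer with the Python3 target, and that expr is the intended start rule (HelloExpr/ByeExpr are left to the default visitChildren for brevity):
# Run from the dist/ output directory (or add it to sys.path).
from antlr4 import InputStream, CommonTokenStream
from MyGrammerLexer import MyGrammerLexer
from MyGrammerParser import MyGrammerParser
from MyGrammerVisitor import MyGrammerVisitor

class EvalVisitor(MyGrammerVisitor):
    # One visit method per labeled alternative (# InfixExpr, # NumberExpr, ...).
    def visitInfixExpr(self, ctx):
        # ctx.left, ctx.op and ctx.right exist because of the left=/op=/right= labels.
        left = self.visit(ctx.left)
        right = self.visit(ctx.right)
        op = ctx.op.text
        if op == '*':
            return left * right
        if op == '/':
            return left / right
        if op == '+':
            return left + right
        return left - right

    def visitNumberExpr(self, ctx):
        # ctx.atom is the INT token captured by atom=INT.
        return int(ctx.atom.text)

    def visitParenExpr(self, ctx):
        return self.visit(ctx.expr())

lexer = MyGrammerLexer(InputStream("1 + 2 * 3"))
parser = MyGrammerParser(CommonTokenStream(lexer))
print(EvalVisitor().visit(parser.expr()))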

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it to not be recursive? Second, when I try to compile and run it, it freezes the IDEA test instance when trying to parse this syntax; it looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
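If you ever do need the precise parse, a hedged sketch of that two-level split could look like this (same notation as above; literal stays as in the question):
expr ::= term (('+' | '-') term)*
term ::= unary (('*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )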
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens:
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is have only ID or TEXT tokens and distinguish them with a label, like this:
grammar Question;
question
@init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.
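To illustrate that embedded-action idea, here is a hedged sketch for the Java target; the @members block, the UNITS set and the error message are assumptions for illustration, not part of either answer:
@members {
    // Hypothetical helper: the known unit names.
    java.util.Set<String> UNITS =
        new java.util.HashSet<>(java.util.Arrays.asList("in", "meters"));
}

conversion
    : NUMBER unit1=ID convert=ID unit2=ID
      {
        // Validate the unit tokens here instead of in the lexer.
        if (!UNITS.contains($unit1.text) || !UNITS.contains($unit2.text))
            notifyErrorListeners("unknown unit in conversion");
      }
    ;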

shift/reduce error in yacc

I know this part of my grammar causes an error, but I don't know how to fix it. I even used %left and %right, but it didn't help. Can anybody please help me find out what the problem with this grammar is?
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
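As a concrete illustration of a %prec marker (a textbook-style sketch, not taken from the question's grammar; UMINUS is just a pseudo-token introduced for its precedence):
%token NUM
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
expr: expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | '-' expr %prec UMINUS   /* reduce with UMINUS's precedence, not '-''s */
    | NUM
    ;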
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators, (a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators.) But if you are using = as an equality operator, it might not actually be what you intended.

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work correctly. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non-greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash chars (plus the escape sequence). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto-generated ones) but you can also better control the match order.
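One way to apply that advice is sketched below (hedged: the rule names text and WORD are illustrative, and moving the multi-word sequence up to the parser level is an assumption about what you want):
expression : expression AND expression
           | expression OR expression
           | text
           ;
text       : WORD+ ;               // "a b c d f" becomes several WORD tokens
AND        : 'AND' ;               // declared before WORD, so it wins ties
OR         : 'OR' ;
WORD       : ~[ \t\r\n]+ ;         // no whitespace inside a word
WS         : [ \t\r\n]+ -> skip ;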
Another, rather naive, solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
This gets a bit more complex with whitespace rules. Another solution would be semantic predicates that look ahead and prevent keywords from being read.