Reading EBNF Grammar - grammar

I need help reading this EBNF grammar. I'm new to it and don't really understand the first rule. I understand the second one, but I don't understand how it ties in with the first.
Term ::= Primary { (T_STAR|T_SLASH) Term }
Primary ::= T_ICONST | T_SCONST | T_ID | T_LPAREN Expr T_RPAREN

In EBNF, an expression enclosed in curly braces may be omitted or repeated any number of times,
so we can expand the grammar step by step (let's assume T_STAR represents the character "*" and T_SLASH represents "/"):
// Term ::= Primary { (T_STAR|T_SLASH) Term }
// Primary ::= T_ICONST | T_SCONST | T_ID | T_LPAREN Expr T_RPAREN
Term ::= Primary * Term
Term ::= Primary * Primary / Term
Term ::= Primary * Primary / Primary
Term ::= ThisIsT_ID * ThisIsT_ID / ThisIsT_ID
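The same rule can also be read as a small recursive-descent parser, where the curly braces become a loop. The following is only a sketch under stated assumptions: tokens are plain strings, any token other than "*" or "/" counts as a Primary, the parenthesised Expr alternative is left out, and the function names are made up for illustration.

```python
def parse_term(tokens, pos=0):
    """Term ::= Primary { (T_STAR|T_SLASH) Term }"""
    node, pos = parse_primary(tokens, pos)
    # The braces mean "zero or more times", which maps onto a loop.
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        # The right-hand side is a Term again, so the rule recurses here.
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_primary(tokens, pos):
    """Primary ::= T_ICONST | T_SCONST | T_ID (parenthesised Expr omitted)."""
    if pos >= len(tokens) or tokens[pos] in ("*", "/"):
        raise SyntaxError(f"expected a primary at position {pos}")
    return tokens[pos], pos + 1


tree, _ = parse_term(["a", "*", "b", "/", "c"])
print(tree)  # ('*', 'a', ('/', 'b', 'c'))
```

Because the right-hand side of the operator is Term rather than Primary, the tree nests to the right, which matches the expansion order shown above.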

Related

How do I force the parser to match content as an ID rather than a token?

I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But since BEACON_ANTENNA_TRAIN is a lexer token, even though the rule states that I expect an ID, the parser matches the token and raises the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser to match the content as an ID rather than a token?
(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue.)
In this case, I've had to add the following lexer rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. (As posted, your token stream is as follows and could not generate the error you show.)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at your input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
Select the rule that matches the longest sequence of input characters.
If multiple lexer rules match the same sequence of input characters, use the rule that appears first (that's why I had to reorder the rules to get your error).
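The two selection rules can be imitated in a few lines of Python. This is a rough sketch, not how ANTLR is implemented: the regular expressions are hand translations of the lexer rules above, and EQ is a made-up name for the implicit '=' token.

```python
import re

# Order matters only when two rules match the same number of characters.
RULES = [
    ("INIT_ABSCISSA", r"INIT_ABSCISSA"),
    ("BEACON_ANTENNA_TRAIN", r"BEACON_ANTENNA_TRAIN"),
    ("OPERATOR1", r"[*/]"),
    ("OPERATOR2", r"[+-]"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("INT_VALUE", r"[-+]?[0-9]+"),
    ("EQ", r"="),  # hypothetical name for the implicit '=' token
    ("WS", r"[ \t\r\n]+"),
]


def tokenize(text, rules=RULES):
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        if best_name != "WS":  # mimic '-> skip'
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


for token in tokenize("INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN"):
    print(token)
```

With this rule order the last token comes out as BEACON_ANTENNA_TRAIN. Moving the ID entry to the front of the list turns it into an ID, because both rules match the same 20 characters and the first one listed wins.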
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand, starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what guides whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put an example together that assumes that it's an ID if it's on the right hand side (rhs) of the elem_course rule.
Then this grammar:
grammar IDG
;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: It's possible that, depending on what drives your decision, you might be able to leverage lexer modes, but there's nothing in your example that leaves that impression.
This is the well-known keyword-as-identifier problem and Mike Cargal gave you a working solution. I just want to add that the general approach for this problem is to add all keywords that should be matched as an id to a parser id rule. To restrict which keywords are allowed in certain grammar positions, you can use multiple id rules. For example, the MySQL grammar uses this approach to a large extent to define keywords that can be used as an identifier in general, or only as a label, for role names, etc.
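The idea of several id rules can be pictured as sets of token types that the parser accepts in an identifier position. The sketch below is purely illustrative: the set names and the stricter LABEL_TOKENS set are invented here, and tokens are assumed to be (type, text) pairs.

```python
# Token types accepted wherever a general identifier is allowed.
IDENTIFIER_TOKENS = {"ID", "BEACON_ANTENNA_TRAIN", "INIT_ABSCISSA"}
# Hypothetical stricter position that admits fewer keywords.
LABEL_TOKENS = {"ID", "BEACON_ANTENNA_TRAIN"}


def parse_identifier(token, allowed=IDENTIFIER_TOKENS):
    """Accept a keyword token as an identifier if this position allows it."""
    token_type, text = token
    if token_type not in allowed:
        raise SyntaxError(f"{text!r} cannot be used as an identifier here")
    return text


print(parse_identifier(("BEACON_ANTENNA_TRAIN", "BEACON_ANTENNA_TRAIN")))
```

The lexer still decides the token type on its own; the parser merely widens the set of types it is willing to treat as a name.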

How can I correctly express in BNF this condition?

I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);
There are several approaches and they usually rely on the capabilities of the parser to indicate precedence and associativity. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators by using the hierarchy provided by the BNF (or, in this case, pseudo-BNF) structure.
(In the examples below, CONDITIONAL_OP stands for the likes of <, != etc. and LOGICAL_OP for &&, || etc.)
Something along the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
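One common "special measure" is to fold the repeated part to the left inside the loop that implements the `(...)*`. A minimal sketch, assuming tokens are plain strings, that anything which is not an operator or a parenthesis is a primary, and that the function names are free to choose:

```python
CONDITIONAL_OPS = {"=", "!=", "<=", ">=", "<", ">"}
LOGICAL_OPS = {"&&", "||"}


def parse_logical(tokens, pos=0):
    """logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*"""
    left, pos = parse_conditional(tokens, pos)
    while pos < len(tokens) and tokens[pos] in LOGICAL_OPS:
        op = tokens[pos]
        right, pos = parse_conditional(tokens, pos + 1)
        left = (op, left, right)  # fold to the left on every iteration
    return left, pos


def parse_conditional(tokens, pos):
    """conditionalExpr: primary (CONDITIONAL_OP primary)*"""
    left, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] in CONDITIONAL_OPS:
        op = tokens[pos]
        right, pos = parse_primary(tokens, pos + 1)
        left = (op, left, right)
    return left, pos


def parse_primary(tokens, pos):
    """primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_logical(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if tok == ")" or tok in CONDITIONAL_OPS or tok in LOGICAL_OPS:
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok, pos + 1


tree, _ = parse_logical(["cars", "<", "3", "&&", "cars", ">", "1"])
print(tree)  # ('&&', ('<', 'cars', '3'), ('>', 'cars', '1'))
```

Since `left` is rebuilt on each pass through the loop, a chain such as `a || b || c` comes out grouped as `(a || b) || c`.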
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
Hope this points you in the right direction.

What does this ANTLR4 notation mean?

I have a question regarding the notation of a UCB Logo grammar that I found was generated for ANTLR4. There are some notations I can't make out, so I thought I'd ask. If anyone is willing to clarify, I will be grateful.
Here are the notations I don't quite understand:
WORD
: {listDepth > 0}? ~[ \t\r\n\[\];] ( ~[ \t\r\n\];~] | LINE_CONTINUATION | '\\' ( [ \t\[\]();~] | LINE_BREAK ) )*
| {arrayDepth > 0}? ~[ \t\r\n{};] ( ~[ \t\r\n};~] | LINE_CONTINUATION | '\\' ( [ \t{}();~] | LINE_BREAK ) )*;
array
: '{' ( ~( '{' | '}' ) | array )* '}';
NAME
: ~[-+*/=<> \t\r\n\[\]()":{}] ( ~[-+*/=<> \t\r\n\[\](){}] | LINE_CONTINUATION | '\\' [-+*/=<> \t\r\n\[\]();~{}] )*;
I guess the array means that it can start with { and have an arbitrary number of levels, but has to end with }.
I take it that the others are some form of regular expressions?
To my knowledge, regex is different for different programming languages.
Did I get that right?
Antlr does not do regular expressions. It does implement some of the same operators, but that is where the similarity largely ends.
The first sub-terms ({listDepth > 0}? and {arrayDepth > 0}?) in the WORD rule are semantic predicates, with no relation to anything in the regular expression world. They are defined in the Antlr documentation and explained in detail in TDAR (The Definitive ANTLR 4 Reference).
Your understanding of the array rule is essentially correct.
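To make the recursion in the array rule concrete, here is a small sketch of the same shape in Python. It is a simplification: the real rule is a parser rule working on tokens, whereas this version treats every character as one token, and the function name is invented.

```python
def match_array(text, pos=0):
    """array : '{' ( ~('{' | '}') | array )* '}'

    Returns the index just past the array that starts at pos.
    """
    if pos >= len(text) or text[pos] != "{":
        raise SyntaxError(f"expected '{{' at position {pos}")
    pos += 1
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            pos = match_array(text, pos)  # nested array, any depth
        else:
            pos += 1  # anything that is not a brace
    if pos >= len(text):
        raise SyntaxError("missing closing '}'")
    return pos + 1


print(match_array("{a {b c} d}"))  # 11
```

Each nested '{' starts a fresh call, so every opening brace must be closed before the outer array can end.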

Why can't I have operator associativity with precedence in Antlr v4?

I'm using antlr v4 (eliminates direct left recursion).
My grammar's non-terminals are: and, or, id.
and has higher priority than or, and both of them are left associative.
According to Antlr4 reference if I put and before or it will have higher precedence.
So I have written this simple grammar:
expr : 'id'
| expr BINOP expr
;
BINOP: 'and'<assoc=left> //higher precedence
| 'or'<assoc=left> //lower precedence
;
But when it parses the string id and id or id and id the associativity is ok but
precedence is not ok: ((id and id) or id) and id.
If I turn BINOP into a parser rule:
binop: 'and'<assoc=left>
| 'or'<assoc=left>
;
neither associativity nor precedence work correctly:
id and (id or (id and id))
However when I implement BINOP inside expr parser rule:
expr : 'id'
| expr 'and'<assoc=left> expr
| expr 'or'<assoc=left> expr
;
everything works fine and I get the desired parse tree:
(id and id) or (id and id)
I googled a lot about the problem but I couldn't find anything.
I would be very glad if anyone could tell me where the problem is, and how
can I get correct associativity and precedence by having a separate rule for BINOP.
Thanks for your time.
assoc=left is the default, so it does nothing here. Also, precedence works at the alternative level, and you have put both operators into the same alternative:
expr BINOP expr
Ter

ANTLR ambiguity in DeCaf - professor unsure where error is

I'm working on a project for school with converting a BNF form Decaf spec into a context-free grammar and building it in ANTLR. I've been working on it for a few weeks and been going to the professor when I've become stuck, but I finally ran into something that he says should not be causing an error. Here's the isolated part of my grammar, expr is the starting point. Before I do that I have one question.
Does it matter if my lexer rules appear before my parser rules in my grammar, or if they're mixed in intermittently through my grammar file?
calloutarg: expr | STRING;
expr: multexpr ((PLUS|MINUS) multexpr)* ;
multexpr : atom ((MULT|DIVISION) atom)*
;
atom : OPENPAR expr CLOSEPAR | ID ((OPENBRACKET expr CLOSEBRACKET)? | OPENPAR ((expr (COMMA)* )+)? CLOSEPAR)|
CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR | constant;
constant: INT | CHAR | boolconstant;
boolconstant: TRUE|FALSE;
The ugly formatting is because part of his advice for debugging was to take individual rules and break them down where the ambiguity is to see where the errors are starting. In this case, it's saying the problem is in the long ID portion, that OPENBRACKET and OPENPAR are the cause. If you have any ideas at all, I am deeply appreciative. Thank you, and sorry for how nasty the formatting is on the code I posted.
Does it matter if my lexer rules appear before my parser rules in my grammar ...
No, that does not matter.
The problem is that inside your atom rule, ANTLR cannot make a choice between these three variants:
ID ( ...
ID [ ...
ID
without (possibly) resorting to backtracking. You could resolve it by using syntactic predicates (which look like: (...)=> ...). A syntactic predicate is nothing more than a "look ahead": if the "look ahead" succeeds, the parser chooses that particular path.
Your current atom rule can be rewritten as follows:
atom
: OPENPAR expr CLOSEPAR
| ID OPENPAR ((expr (COMMA)* )+)? CLOSEPAR
| ID OPENBRACKET expr CLOSEBRACKET
| ID
| CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR
| constant
;
And with the predicates it will look like:
atom
: OPENPAR expr CLOSEPAR
| (ID OPENPAR)=> ID OPENPAR ((expr (COMMA)* )+)? CLOSEPAR
| (ID OPENBRACKET)=> ID OPENBRACKET expr CLOSEBRACKET
| ID
| CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR
| constant
;
which should do the trick.
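What the two predicates do can be pictured as peeking one token past the ID before committing to a branch. The following Python sketch only illustrates that idea and is not what ANTLR generates; token names are plain strings and the returned labels are made up.

```python
def choose_id_branch(tokens, pos):
    """Decide which ID alternative of atom applies by peeking one token ahead."""
    if pos >= len(tokens) or tokens[pos] != "ID":
        return "not an ID alternative"
    nxt = tokens[pos + 1] if pos + 1 < len(tokens) else None
    if nxt == "OPENPAR":       # (ID OPENPAR)=> ...
        return "call"
    if nxt == "OPENBRACKET":   # (ID OPENBRACKET)=> ...
        return "index"
    return "plain ID"          # fallback when neither look ahead succeeds


print(choose_id_branch(["ID", "OPENBRACKET", "INT", "CLOSEBRACKET"], 0))  # index
```

The order of the checks mirrors the order of the alternatives: the plain ID branch is only taken after both look-aheads have failed.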
Note: do not use ANTLRWorks to generate or test the parser! It cannot handle predicates (well). Best do it on the command line.
Also see: https://wincent.com/wiki/ANTLR_predicates
EDIT
Let's label the six different "branches" from your atom rule from A to F:
atom // branch
: OPENPAR expr CLOSEPAR // A
| ID OPENBRACKET expr CLOSEBRACKET // B
| ID OPENPAR ((expr COMMA*)+)? CLOSEPAR // C
| ID // D
| CALLOUT OPENPAR STRING (COMMA calloutarg+ COMMA)? CLOSEPAR // E
| constant // F
;
Now, when the (future) parser should handle input like this:
ID OPENPAR expr CLOSEPAR
ANTLR does not know how the parser should handle it. It could be parsed in two different ways:
branch D followed by branch A
branch C
Which is the source of the ambiguity ANTLR is complaining about. If you were to comment out one of the branches A, C or D, the error would disappear.
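The two readings can be counted mechanically. The sketch below uses a heavily simplified grammar, so treat it as an illustration only: it keeps branches A, C and D, replaces expr inside the parentheses of branch A by a single atom, and lets atoms follow each other directly, which the original rule permits inside an argument list because COMMA* can match zero commas. All names are invented.

```python
from functools import lru_cache


def count_parses(tokens):
    """Count the ways tokens can be read as one or more adjacent atoms."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def atom(i, j):
        n = 0
        if j - i == 1 and tokens[i] == "ID":
            n += 1  # branch D: ID
        if j - i >= 3 and tokens[j - 1] == ")":
            if tokens[i] == "(":
                n += atom(i + 1, j - 1)  # branch A: ( atom )
            if tokens[i] == "ID" and tokens[i + 1] == "(":
                # branch C: ID ( args? )
                n += 1 if j - i == 3 else seq(i + 2, j - 1)
        return n

    @lru_cache(maxsize=None)
    def seq(i, j):
        total = 0
        for k in range(i + 1, j + 1):
            first = atom(i, k)
            if first:
                total += first * (1 if k == j else seq(k, j))
        return total

    return seq(0, len(tokens))


print(count_parses(["ID", "(", "ID", ")"]))  # 2
```

The count of 2 corresponds to the two readings listed above: branch D followed by branch A, or branch C on its own.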
Hope that helps.