Remove warnings instead of using backtrack option - antlr

I am not sure how to solve this problem without using backtrack=true;.
My sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression
;
expression : binaryExpression
| tupleExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/'|'div'|'inter') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| sumExpression
| '(' expression ')'
;
sumExpression : 'sum'|'div'|'inter' expression
;
tupleExpression : ('<' expression '>' (',' '<' expression '>')*)
;
literalExpression : INT
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
Is there a way to fix the grammar such a way that no warnings can happen? Let's assume I want to choose both alternatives depending on the case.
Thank you in advance!

Note that:
sumExpression : 'sum'|'div'|'inter' expression
;
gets interpreted as:
sumExpression : 'sum' /* nothing */
| 'div' /* nothing */
| 'inter' expression
;
since the | has a low precedence. You probably want:
sumExpression : ('sum'|'div'|'inter') expression
;
Let's assume I want to choose both alternatives depending on the case.
That is not possible: you cannot let the parser choose both (or more) alternatives, it can only choose one.
I assume you know why the grammar is ambiguous? If not, here's why: the input A div B can be parsed in two ways:
alternative 1
unaryExpression 'div' unaryExpression
| |
A B
alternative 2
id sumExpression
| | \
A 'div' B
It looks like you want 'sum', 'div' and 'inter' to be some sort of unary operator, in which case you could just merge them into your unaryExpression rule:
unaryExpression : '!' unaryExpression
| '-' unaryExpression
| 'sum' unaryExpression
| 'div' unaryExpression
| 'inter' unaryExpression
| primitiveElement
;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
That way you don't have any ambiguity. Note that A div B will now be parsed as a multiplyingExpression and A div sum B as:
multiplyingExpression
/ \
'div' unaryExpression
/ / \
A 'sum' B

Related

ANTLR4 - How to close "longest-match-wins" and use first match rule?

Orignial question:
My code to parse:
N100G1M4
What I expcted: N100 G1 M4
But ANTLR can not idetify this because ANTLR always match longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-Code txt and get keywords from a file stream, which is usually used to control a machine and drive motors to move.
The G-Code rule is :
// Define a grammar called Hello
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression ('+' | '-') multiplyingExpression
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces, tabs, newlines
Sample program(it works well except the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 to
-> N100 G1 X100 Z[VAR1+100]
-> or it will be better to split the node X100 to two subnode X 100:
I am trying to use ANTLR, but ANTLR always take the rule "longest match wins". N100G1X100 is identified to a word.
Append question:
What's the best tool to finish the task?
ANTLR has a strict separation between pasrer and lexer, and therefor the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
Scannerless Parser Generators

antlr grammar definition

I am relatively new to compilers theory and I just wanted to create a grammar to parse some comparisons in order to evaluate them later. I found antlr which is a powerful tool to specify the grammar. From what I have learned in the theory I know that operators with higher precedence must be declard in deeper levels than operators with lower precedence. Additionally if I want some rule to be left associative I know that I have to set the recursivity to the left of the rule. Knowing that I have created a basic grammar to use &&, ||, !=, ==, <, >, <=, >=, (,) and !
start
: orExpr
;
orExpr
: orExpr OR andExpr
| andExpr
;
andExpr
: andExpr AND eqNotEqExpr
| eqNotEqExpr
;
eqNotEqExpr
: eqNotEqExpr NEQ compExpr
| eqNotEqExpr EQ compExpr
| compExpr
;
compExpr
: compExpr LT compExpr
| compExpr GT compExpr
| compExpr LTEQ compExpr
| compExpr GTEQ compExpr
| notExpr
;
notExpr
: NOT notExpr
| parExpr
;
parExpr
: OPAR orExpr CPAR
| id
;
id
: INT
| FLOAT
| TRUE
| FALSE
| ID
| STRING
| NULL
;
However searching in internet I have found a different way to specify above grammar which does not follow the above rules I mentioned regarding operator precedence and left associativity:
start
: expr
;
expr
: NOT expr //notExpr
| expr op=(LTEQ | GTEQ | LT | GT) expr //relationalExpr
| expr op=(EQ | NEQ) expr //equalityExpr
| expr AND expr //andExpr
| expr OR expr //orExpr
| atom //atomExpr
;
atom
: OPAR expr CPAR //parExpr
| (INT | FLOAT) //numberAtom
| (TRUE | FALSE) //booleanAtom
| ID //idAtom
| STRING //stringAtom
| NULL //nullAtom
;
Can someone explain why this way of definig the grammar also works? Is it because some specific treatment of antlr or another type of grammar definition?
Below there are the operators and ids defined for the grammar:
OR : '||';
AND : '&&';
EQ : '==';
NEQ : '!=';
GT : '>';
LT : '<';
GTEQ : '>=';
LTEQ : '<=';
NOT : '!';
OPAR : '(';
CPAR : ')';
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
COMMENT
: '//' ~[\r\n]* -> skip
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
This is specific to ANTLR v4.
Under the hood, a rule like this one will be rewritten to something equivalent to what you have done manually as part of the left-recursion elimination step. ANTLR does this as a convenience because LL grammars cannot contain left-recursive rules, as a direct conversion of such a rule into parser code would produce an infinite recursion in code (a function which unconditionnally calls itself).
There is more info and a transformation example in the docs page about left-recursion.

fatal error in grammar - piecewise definition

I am translating a grammar from LALR to ANTLR and I am having trouble with translating this one rule, piecewise expression.
Attached is the sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression ';'
;
expression : binaryExpression
| piecesExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
piecesExpression : 'piecewise' '{' piece expression '}' ('(' expression ',' expression ')')? expression?
;
piece : expression '->' expression ';' (expression '->' expression ';')*
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
ANTLR v3.5 is complaining about the piecesExpression rule. It has 2 fatal errors and I would rather not use backtrack option.
Expected results:
piecewise {t -> s; t -> x; 100}
piecewise {t -> s; t -> x; 100} (0, x+1)
piecewise {t -> s; t -> x; 100} (0, x+1) y+5
How can piecesExpression be able to capture the above results?
Thanks in advance!
ANTLR has problems determining which alternatives to take in (at least) 2 cases:
piece starts with a expression but inside the piecewise{...}, it should also end with an expression
piecesExpression ends with '(' expression ... but also has an optional trailing expression (and an primitiveElement also matches '(' expression ... in its turn)
There's no need to use global backtracking, but without rewriting many rules, you do need to add some predicates (the (...)=> in the example below) to fix the two issues outlined above.
Try this:
piecesExpression
: 'piecewise' '{' ((expression '->')=> piece)+ expression '}'
( ('(' expression ',')=> '(' expression ',' expression ')' expression?
| expression
)
;
piece
: expression '->' expression ';'
;

How to create a rule to match 2^3 to create a power operator?

Given that I have the following grammar how would I add a rule to match something like 2^3 to create a power operator?
negation : '!'* term ;
unary : ('+'!|'-'^)* negation ;
mult : unary (('*' | '/' | ('%'|'mod') ) unary)* ;
add : mult (('+' | '-') mult)* ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (('&&' | '||') relation)* ;
// LEXER ================================================================
HEX_NUMBER : '0x' HEX_DIGIT+;
fragment
FLOAT: ;
INTEGER : DIGIT+ ({input.LA(1)=='.' && input.LA(2)>='0' && input.LA(2)<='9'}?=> '.' DIGIT+ {$type=FLOAT;})? ;
fragment
HEX_DIGIT : (DIGIT|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9') ;
What I have tried:
I tried something like power : ('+' | '-') unary'^' unary but that doesn't seem to work.
I also tried mult : unary (('*' | '/' | ('%'|'mod') | '^' ) unary)* ; but that doesn't work either.
To give ^ higher precedence than negation, do this:
pow : term ('^' term)* ;
negation : '!' negation | pow ;
unary : ('+'! | '-'^)* negation ;
If you want to consider the right-associativity already in the grammar, you can also use recursion:
pow : term ('^'^ pow)?
;
negation : '!'* pow;
...

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
parse 1
parse 2
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.