ANTLR4 grammar rule that cannot be reached from start rule affects language - antlr

The following minimal grammar shows the issue:
grammar test;
call : exp LP exp RP ;
exp : exp LP exp RP | ID;
ID : [a-z] ;
LP : '(' ;
RP : ')' ;
Newline : '\r\n' | '\n' ;
If I use call as the start rule, then the generated parser will gladly parse the following input:
f(x)
(tried it in ANTLR lab, which probably uses the java target, and locally using a C++ target with ANTLR 4.9.3).
If I now add the following rule to the grammar, but keep call as the start rule, then the same input does not match call anymore.
callWithNewline : call Newline;
Why does callWithNewline affect whether call matches?
If I change the input to have a newline character after it will suddenly match call in ANTLR lab (even though the newline is not part of the match of course), but not in the C++ target, so the targets have slightly different behavior here.
I ran into this behavior while unit testing subrules, it does appear that parsing a full grammar which contains this kind of subgrammar somewhere lower in the hierarchy does not lead to issues.
Edit
The issue still occurs if I remove the ambiguity
grammar test;
callWithNewline : call Newline ;
call : exp LP ID RP ;
exp : exp LP ID RP | ID;
ID : [a-z] ;
LP : '(' ;
RP : ')' ;
Newline : '\r\n' | '\n' ;

Related

ANTLR proper ordering of grammar rules

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| WORD #wordAtom
;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.
ANTLR's lexer tries to match as much characters as possible, so both <<WORD>> and <WORD> are matched by the lexer rul WORD. Therefor, there in these cases the tokens << and >> (or < and > for that matter) will not be created.
You can see what tokens are being created by running these lines of code:
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
which will print:
WORD <word>
WORD <<word>>
EOF <EOF>
What you could do is something like this:
func
: '<<' WORD '>>'
;
atom
: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| word #wordAtom
;
word
: WORD
| '<' WORD '>'
;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If use this grammar, and use the sample text: abc and d I get a weird tree with unexpected structure as shown below(using IntelliJ and ANTLR4 plug in):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
#lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLWorks' debugger (press CTRL+D to launch is). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be a caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be the port is already in use (see: File -> Preferences -> Debugger-tab).

enterDecision(int) in the type DebugEventListener is not applicable for the arguments (int, boolean)?

I am using ANTLR 3.1.3 to generate the parser. After importing the generated testParser, I found there is several errors like
try { dbg.enterDecision(2, decisionCanBacktrack[2]);
Description Resource Path Location Type
The method enterDecision(int) in the type DebugEventListener is not applicable for the arguments (int, boolean) testParser.java /ANTLRTest/src line 280 Java Problem
If I changed to dbg.enterDecision(2), then everything is fine.
The grammar is as follows,
grammar Test;
options {output=AST;}
expr : mexpr (PLUS^ mexpr)* SEMI! ;
mexpr : atom (STAR^ atom)* ;
atom: INT ;
//class csharpTestLexer extends Lexer;
WS : (' ' | '\t' | '\n' | '\r') { $channel = HIDDEN; } ;
LPAREN: '(' ;
RPAREN: ')' ;
STAR: '*' ;
PLUS: '+' ;
SEMI: ';' ;
DIGIT : '0'..'9' ;
INT : (DIGIT)+ ;
I am using ANTLRWorks 1.4.3 to generate lexer and parser.
JDK 1.6
Any reason to this error?
It looks like you've generated a lexer and parser with an ANTLR version that is different than the one you added to Eclipse's classpath.
If you generate a lexer and/or parser with ANTLRWorks 1.4.3 (which contains ANTLR 3.4), you should also add ANTLR 3.4 to your project's build path in Eclipse and remove ANTLR 3.1.3 from it.
BTW, I get error for 2 + 2 * 3, do you know why? anything wrong with the above grammar.
That is because single digit numbers are being tokenized as DIGIT tokens. Either make DIGIT a fragment:
fragment DIGIT : '0'..'9' ;
INT : (DIGIT)+ ;
or remove it:
INT : '0'..'9'+ ;
See: What does "fragment" mean in ANTLR?