ANTLR4 problem with balanced parentheses grammar - antlr

I'm doing some experiments with ANTLR4 with this grammar:
srule
: '(' srule ')'
| srule srule
| '(' ')';
this grammar is for the language of balanced parentheses.
The problem is that when I run antlr with this string: (()))(
This string is obviously wrong but antlr simply return this AST:
It seems to stop when it finds the wrong parenthesis, but no error message returns. I would like to know more about this behavior. Thank you

The parser recognises (())) and then stops. If you want to force the parser to consume all tokens, "anchor" your test rule with the EOF token:
parse_all
: srule EOF
;
Btw, it's always a good idea to include the EOF token in the entry point (entry rule) of your grammar.

Related

ANTLR4 No Viable Alternative At Input

I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If i try to parse the input "Hello", everything is OK, BUT if if modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

ANTLR and Empty Strings Contradictory Behavior

I fundamentally don't understand how antlr works. Using the following grammar:
blockcomment : '/\*' ANYCHARS '\*/';
ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
I get a warning message when I compile the grammar file that says:
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
Fine. I want it to be able to match empty strings as: "/\*\*/" is perfectly valid. But when I run "/\*\*/" in the TestRig I get:
missing ANYCHARS at '*/'
Obviously I could just change it so that '/**/' is handled as a special case:
blockcomment : '/\*' ANYCHARS '\*/' | '/**/';
But that doesn't really address the underlying issue. Can someone please explain to me what I am doing wrong? How can ANTLR raise a warning about matching empty strings and then not match them at the same time?
add "fragment" to ANYCHARS? It will then do what you want.
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
The error message hints you to make ANYCHARS fragment.
Empty string cannot be matched as a token, that would end up with infinitely many empty tokens anywhere in the source.
You want to make the ANYCHARS part of the BLOCKCOMMENT token, rather than a separate token.
That is basically what fragments are good for - they simplify the lexer rules, but don't produce tokens.
BLOCKCOMMENT : '/*' ANYCHARS '*/';
fragment ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
EDIT: switched parser rule blockcomment to lexer rule BLOCKCOMMENT to enable fragment usage

Antlr: Unintended behavior

Why this simple grammar
grammar Test;
expr
: Int | expr '+' expr;
Int
: [0-9]+;
doesn't match the input 1+1 ? It says "No method for rule expr or it has arguments" but in my opition it should be matched.
It looks like I haven't used ANTLR for a while... ANTLRv3 did not support left-recursive rules, but ANTLRv4 does support immediate left recursion. It also supports the regex-like character class syntax you used in your post. I tested this version and it works in ANTLRWorks2 (running on ANTLR4):
grammar Test;
start : expr
;
expr : expr '+' expr
| INT
;
INT : [0-9]+
;
If you add the start rule then ANTLR is able to infer that EOF goes at the end of that rule. It doesn't seem to be able to infer EOF for more complex rules like expr and expr2 since they're recursive...
There are a lot of comments below, so here is (co-author of ANTLR4) Sam Harwell's response (emphasis added):
You still want to include an explicit EOF in the start rule. The problem the OP faced with using expr directly is ANTLR 4 internally rewrote it to be expr[int _p] (it does so for all left recursive rules), and the included TestRig is not able to directly execute rules with parameters. Adding a start rule resolves the problem because TestRig is able to execute that rule. :)
I've posted a follow-up question with regard to EOF: When is EOF needed in ANTLR 4?
If your command looks like this:
grun MYGRAMMAR xxx -tokens
And this exception is thrown:
No method for rule xxx or it has arguments
Then this exception will get thrown with the rule you specified in the command above. It means the rule probably doesn't exist.
System.err.println("No method for rule "+startRuleName+" or it has arguments");
So startRuleName here, should print xxx if it's not the first (start) rule in the grammar. Put xxx as the first rule in your grammar to prevent this.