I am writing an ANTLR parser that attempts to recognize GDB backtrace output from a given input string.
I'm ignoring newlines with the following lexer rule:
RETURN : ('\r' | '\n' | '\r\n') { skip(); };
However, when I run the parser against some input, ANTLR gives the following lexer error:
line 20:21 no viable alternative at character '\n'
line 23:14 no viable alternative at character '\n'
line 30:21 no viable alternative at character '\n'
line 33:31 no viable alternative at character '\n'
I am not sure why this would ever happen, since I have already specified '\n' in the lexer.
Does anybody have any ideas? Thanks.
It looks like the problem is elsewhere in your grammar: the lexer is still in the middle of matching a different token that has not yet ended, and it unexpectedly encounters the end of the line while it is still expecting to finish that token.
I am new to writing Antlr grammars and I'm having some trouble. I have simplified my issue down to the following:
Here is my grammar:
grammar Test;
prog : stmt* ;
stmt : INT NEWLINE ;
INT : [0-9]+ ;
NEWLINE : '\r'? '\n' ;
My input.txt is simply 0\r\n (not the literals '\', 'r', '\', 'n', but the ASCII codes 0x0D 0x0A).
I get the following error:
line 1:1 extraneous input '\r\n' expecting {<EOF>, INT}
Why does it expect another INT without first matching a NEWLINE when the stmt rule requires that the INT be followed by a NEWLINE?
If it's relevant, I'm using PyCharm and the Antlr4 plugin to do the grammar generation.
As it turns out, the issue was related to Python's behavior when running modules that import other modules that have since been altered (hint: they are not automatically reloaded, as one might expect). The solution was simply to restart the interpreter whenever the grammar changes.
I'm sorry that the original question didn't have enough information for anyone else to figure out that answer.
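The caching behavior described above can be reproduced in a few lines. This is only a sketch: `generated_parser` is a hypothetical stand-in for a module regenerated from a grammar, and `importlib.reload` is shown as an alternative to restarting the interpreter, not as what the poster did.

```python
import importlib
import os
import sys
import tempfile

# Keep cached bytecode out of the picture for this small demonstration.
sys.dont_write_bytecode = True

# 'generated_parser' is a hypothetical module name, standing in for generated code.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "generated_parser.py")

with open(module_path, "w") as f:
    f.write("VERSION = 1\n")

sys.path.insert(0, workdir)
importlib.invalidate_caches()
import generated_parser

first = generated_parser.VERSION

# Simulate regenerating the parser after a grammar change.
with open(module_path, "w") as f:
    f.write("VERSION = 2\n")

# A second import statement returns the cached module; the file is not re-read.
import generated_parser
after_reimport = generated_parser.VERSION

# importlib.reload re-executes the module source in the running process.
importlib.reload(generated_parser)
after_reload = generated_parser.VERSION
```

Note that `importlib.reload` does not update names that other modules already imported with `from ... import ...`, so restarting the interpreter remains the simpler and safer fix.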
I fundamentally don't understand how antlr works. Using the following grammar:
blockcomment : '/*' ANYCHARS '*/';
ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
I get a warning message when I compile the grammar file that says:
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
Fine. I want it to be able to match the empty string, since "/**/" is perfectly valid. But when I run "/**/" in the TestRig I get:
missing ANYCHARS at '*/'
Obviously I could just change it so that '/**/' is handled as a special case:
blockcomment : '/*' ANYCHARS '*/' | '/**/';
But that doesn't really address the underlying issue. Can someone please explain to me what I am doing wrong? How can ANTLR raise a warning about matching empty strings and then not match them at the same time?
Add "fragment" to ANYCHARS. It will then do what you want.
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
The warning message hints that you should make ANYCHARS a fragment.
An empty string cannot be matched as a token; that would produce infinitely many empty tokens anywhere in the source.
You want to make the ANYCHARS part of the BLOCKCOMMENT token, rather than a separate token.
That is basically what fragments are good for - they simplify the lexer rules, but don't produce tokens.
BLOCKCOMMENT : '/*' ANYCHARS '*/';
fragment ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
EDIT: switched parser rule blockcomment to lexer rule BLOCKCOMMENT to enable fragment usage
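To see why an empty token is a problem, here is a minimal sketch that translates the rules into Python regular expressions. The translation is an assumption made for illustration; it is not how ANTLR works internally.

```python
import re

# Regex equivalent of the ANYCHARS rule from the question (illustrative only).
ANYCHARS = re.compile(r"[a-z\nr 0-9]*")

# Regex equivalent of the combined BLOCKCOMMENT token with ANYCHARS inlined.
BLOCKCOMMENT = re.compile(r"/\*[a-z\nr 0-9]*\*/")

source = "/**/"
pos = 2  # position just after the opening '/*'

# As a standalone token, ANYCHARS matches here but consumes nothing,
# so a lexer emitting it would never advance past this position.
match = ANYCHARS.match(source, pos)
consumed = match.end() - pos

# As part of one larger token, the empty body is harmless.
whole_comment = BLOCKCOMMENT.fullmatch(source) is not None
```

Here `consumed` is 0, while `whole_comment` is true: the zero-length body only makes sense inside a token that consumes the surrounding delimiters.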
I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.
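The tie-breaking rule can be illustrated with a toy lexer. This is a sketch of the "longest match, then earliest rule" policy written with Python regular expressions; `tokenize` and the rule lists are hypothetical names, and this is not ANTLR's actual implementation.

```python
import re

def tokenize(text, rules):
    """Toy lexer. rules is a list of (name, pattern) pairs in grammar order."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: on a tie in length, the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
        if best_name != "NEWLINE":  # emulate 'NEWLINE: [\n\r] -> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

word_first = [("WORD", r"[a-zA-Z]+"), ("TEXT", r"[a-zA-Z]+"), ("NEWLINE", r"[\n\r]")]
text_first = [("TEXT", r"[a-zA-Z]+"), ("WORD", r"[a-zA-Z]+"), ("NEWLINE", r"[\n\r]")]

print(tokenize("aa\n", word_first))  # [('WORD', 'aa')]
print(tokenize("aa\n", text_first))  # [('TEXT', 'aa')]
```

With the original rule order the input is always typed as WORD, which is why the parser rule expecting TEXT reports a mismatch.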
I want to parse a language in which statements are separated by EOLs. I tried this in the lexer grammar (copied from an example in the docs):
EOL : ('\r'? '\n')+ ; // any number of consecutive linefeeds counts as a single EOL
and then used this in the parser grammar:
stmt_sequence : (stmt EOL)* ;
The parser rejected code with statements separated by one or more blank lines.
However, this was successful:
EOL : '\r'? '\n' ;
stmt_sequence : (stmt EOL+)* ;
I'm an ANTLR newbie. It seems like both should work. Is there something about greedy/nongreedy lexer scanning that I don't understand?
I tried this with both 3.2 and 3.4; I'm running the ANTLR IDE in Eclipse Indigo on OS X 10.6.
Thanks.
The error was not in the original grammar, but in the input data. I was using an editor (in Eclipse) that automatically inserted tabs after an EOL, so my "blank lines" were not really blank.
I modified the grammar as follows:
fragment SPACE: ' ' | '\t';
EOL : ( '\r'? '\n' SPACE* )+;
This grammar works as expected.
The lesson here is that one must be careful with whitespace. The lexer may see whitespace in the input that the parser does not see (because it has already been sent to the hidden channel).
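The difference between the two EOL rules can be checked with a small sketch. The regular expressions below are hand translations of the two rules, and the sample input is an assumed example of a "blank" line that actually contains a tab.

```python
import re

# (\r? \n)+ : the original rule, which stops at the first non-newline character.
STRICT_EOL = re.compile(r"(?:\r?\n)+")

# (\r? \n SPACE*)+ : the revised rule, which absorbs trailing spaces and tabs.
LENIENT_EOL = re.compile(r"(?:\r?\n[ \t]*)+")

# 'a', newline, tab, newline, 'b': the middle line looks blank but holds a tab.
source = "a\n\t\nb"
pos = 1  # position of the first newline

strict_len = STRICT_EOL.match(source, pos).end() - pos
lenient_len = LENIENT_EOL.match(source, pos).end() - pos

print(strict_len)   # 1
print(lenient_len)  # 3
```

The strict rule produces an EOL for the first newline only, so the following newline becomes a second EOL token once the tab is set aside, and `(stmt EOL)*` rejects it. The lenient rule folds the whole run into one token.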
My question concerns running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is for SEMICOLON and NEWLINE never to become real tokens themselves, so that they always end up as part of a STMTEND. You can do that by making them so-called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.
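As a rough check of that claim, the sketch below inlines the two fragments into a Python regular expression for STMTEND and lists the token types produced for a few inputs. The helper `token_types` is a hypothetical name, and the regex translation is an approximation of the grammar rather than generated ANTLR code.

```python
import re

# Fragments: reusable pieces that never become tokens on their own.
NEWLINE = r"(?:\r\n|\n|\r)"
SEMICOLON = r";"

# Only these two rules can produce tokens.
TOKEN_RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("STMTEND", re.compile(rf"(?:{SEMICOLON}{NEWLINE}*|{NEWLINE}+)")),
]

def token_types(text):
    """Return the sequence of token type names for text."""
    types = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                types.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
    return types

print(token_types("45;"))        # ['INT', 'STMTEND']
print(token_types(";\n\n\n"))    # ['STMTEND']
print(token_types("456\n"))      # ['INT', 'STMTEND']
```

A semicolon followed by any number of newlines collapses into a single STMTEND, which matches the "one termination, not four or five" requirement from the question.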