Grammar not working (NoViableAltException) - antlr

I'm new to ANTLR and I'm trying to play with it. This is the simplest grammar I could think of, and it still fails with a NoViableAltException when I parse the variable "id123", although it works for "abc1", "ab", and "c1d2f3".
I'm using ANTLR 3.1.3 and ANTLRWorks 1.4.
language = 'CSharp2';
output = AST;
assign : variable '=' value;
value : (variable|constant);
constant: (STRING|INTEGER);
DIGIT : '0'..'9';
LETTER : ('a'..'z'|'A'..'Z');
CR : '\r' { $channel = HIDDEN; };
LF : '\n' { $channel = HIDDEN; };
CRLF : CR LF { $channel = HIDDEN; };
SPACE : (' '|'\t') { $channel = HIDDEN; };
STRING : '"' (~'"')* '"';

ANTLR's lexer tries to match as much as possible. Whenever two (or more) rules match the same number of characters, the rule defined first will "win". So, whenever the lexer stumbles upon a single digit, a DIGIT token is created, because it is defined before NATURAL:
DIGIT : '0'..'9';
but for the input "id123" the lexer produced the following 3 tokens:
because the lexer matches greedily, and therefore a NATURAL is created, and not three DIGIT tokens.
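The two tie-breaking rules described above can be imitated with a few lines of Python. This is only a simplified model of the behaviour, not how ANTLR is implemented, and the NATURAL rule is an assumption based on the description (its definition is not shown in the grammar above):

```python
import re

# Hypothetical rule set, in definition order. NATURAL is assumed to be
# "one or more digits"; it is not copied from the original grammar.
RULES = [
    ("DIGIT", r"[0-9]"),
    ("NATURAL", r"[0-9]+"),
    ("LETTER", r"[a-zA-Z]"),
]

def tokenize(text):
    """Pick the longest match at each position; on a tie the earliest rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: an equally long later rule never replaces an earlier one.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("abc1"))   # the lone digit is a DIGIT (tie with NATURAL, DIGIT is first)
print(tokenize("id123"))  # the digit run is one NATURAL (longest match)
```

A parser rule written in terms of LETTER and DIGIT then has nothing to match against the NATURAL token, which is where the NoViableAltException comes from.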
What you should do is make a lexer rule of variable instead:
assign : VARIABLE '=' value;
value : (VARIABLE | constant);
constant : (STRING | INTEGER | REAL);
SPACE : (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; };
STRING : '"' (~'"')* '"';
fragment NATURAL : (DIGIT)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
Also note that I made a couple of lexer rules fragments. This means that the lexer will never produce NATURAL, DIGIT or LETTER tokens. These fragment rules can only be used by other lexer rules. In other words, your lexer will only ever produce VARIABLE, INTEGER, REAL, and STRING tokens* (so these are the only ones you can use in your parser rules!).
* and '=' token, of course...
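To illustrate the idea of fragments outside of ANTLR, here is a rough Python sketch in which the fragments are plain building blocks and only the composed rules ever yield tokens. The bodies of VARIABLE, INTEGER and REAL are assumptions made for this example, since their definitions are not shown above, and the EQ name for '=' is made up:

```python
import re

# "Fragments": reusable pieces that never become tokens on their own.
DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"
NATURAL = rf"{DIGIT}+"

# Token rules in definition order. VARIABLE, REAL and INTEGER are assumed
# definitions for illustration only; EQ is a hypothetical name for '='.
TOKEN_RULES = [
    ("VARIABLE", rf"{LETTER}(?:{LETTER}|{DIGIT})*"),
    ("REAL", rf"{NATURAL}\.{NATURAL}"),
    ("INTEGER", NATURAL),
    ("STRING", r'"[^"]*"'),
    ("EQ", r"="),
    ("SPACE", r"[ \t\r\n]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        # SPACE is dropped here, standing in for the HIDDEN channel.
        if m.lastgroup != "SPACE":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("id123 = 42"))
```

No DIGIT, LETTER or NATURAL token can ever appear in the result, because those names are not in the token rule list.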


ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
STRING : ('\''.*?'\'') ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated parser on 'working = yes;' I get the error messages:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that whitespace is not significant to your grammar (i.e. it has no semantic meaning apart from separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
In both cases the lexer still recognizes the whitespace, but the parser never has to deal with it: with -> skip the matched text is discarded, and with the HIDDEN channel the tokens are kept on a channel the parser does not read. This lets you keep your rules simple, without having to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.
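The effect can be shown with a small stand-alone Python model (not ANTLR itself). The WORD pattern below is an assumption, since the WORD rule is not shown in the question, and the EQ and SEMI names are made up for the two literals:

```python
import re

# Assumed token shapes for illustration; WORD is not shown in the question.
TOKEN_RE = re.compile(
    r"(?P<WORD>[A-Za-z0-9_]+)|(?P<EQ>=)|(?P<SEMI>;)|(?P<WS>[ \t\r\n]+)"
)

def lex(text, skip_whitespace):
    """Return the token types the parser would see."""
    types, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if not (skip_whitespace and m.lastgroup == "WS"):
            types.append(m.lastgroup)
        pos = m.end()
    return types

def matches_field_def(types):
    # field_def : WORD '=' WORD ';' ;
    return types == ["WORD", "EQ", "WORD", "SEMI"]

print(lex("working = yes;", skip_whitespace=False))
print(matches_field_def(lex("working = yes;", skip_whitespace=False)))  # False
print(matches_field_def(lex("working = yes;", skip_whitespace=True)))   # True
```

Without skipping, the whitespace tokens sit between WORD and '=' and the rule cannot match, which corresponds to the "extraneous input ' '" messages.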

Why am I getting "mismatched input 'addr' expecting {<EOF>, 'addr'}"

Given this g4 grammar:
grammar smaller;
: ( componentDefinition )* EOF;
: Addr
: Num
| Id
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
// Whitespace and Comments
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your rule ExprElem is a lexer rule, not a parser rule (it begins with an uppercase letter), and it is masking the Addr rule, so, no Addr :(
Also, ExprElem is a lexer rule that relies on the Id and Num rules. Consequently, when an Id is found, the ANTLR lexer gives it the ExprElem token type and not the Id token type.
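The naming convention is easy to check mechanically. A minimal Python sketch (the helper name rule_kind is made up for this example):

```python
def rule_kind(name):
    """Classify an ANTLR rule name by the case of its first letter."""
    if not name or not name[0].isalpha():
        raise ValueError(f"not a valid rule name: {name!r}")
    # Lexer rule names start with an uppercase letter, parser rules with lowercase.
    return "lexer" if name[0].isupper() else "parser"

for name in ("root", "componentDefinition", "ExprElem", "exprElem", "Addr", "Id"):
    print(name, "->", rule_kind(name))
```

Running something like this over the rule names of a grammar is a quick way to spot a rule that was meant for the parser but ended up in the lexer.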
So, two things, you can either rewrite your ExprElem rule to exprElem (assuming you want a parser rule):
exprElem : Num | Id;
or you can use the Id token as part of your ExprElem rule, but then you need something that differentiates ExprElem from Id (example below, but I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
: Sharp Num // This token use others but defines its own 'pattern'
| Sharp Id
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
I suppose this is definitely not what you want; I just put it here to illustrate how a lexer rule can reuse others.
When you are in doubt about what your tokens are, do not hesitate to display the recognized tokens. Here is the Java code fragment I often use (I named your grammar Test in this case):
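import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        String txt =
            "addr Basic {\n"
            + "\n"
            + "};";
        TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        // Load every token from the lexer before listing them.
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            // Print each recognized token (type, text and position).
            System.out.println(t);
        }
    }
}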
NOTE: by the way, Num will never be recognized, because the Id rule is defined first and can match the same input. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-Z_];
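A quick way to convince yourself of the shadowing is to model "first matching rule wins" in Python. This is a simplified sketch that treats the whole input as one token; the regular expressions are approximations of the rules, written with A-Z:

```python
import re

def winning_rule(rules, text):
    """Return the first rule (in definition order) that matches all of text."""
    for name, pattern in rules:
        if re.fullmatch(pattern, text):
            return name
    return None

# Id as in the question: digits alone are also a valid Id.
ORIGINAL = [("Id", r"[a-zA-Z0-9_]+"), ("Num", r"[0-9]+")]
# Id as suggested in the note: it must start with a letter or underscore.
REVISED = [("Id", r"[a-zA-Z_][a-zA-Z_0-9]*"), ("Num", r"[0-9]+")]

print(winning_rule(ORIGINAL, "42"))    # Id
print(winning_rule(REVISED, "42"))     # Num
print(winning_rule(REVISED, "basic"))  # Id
```

With the original pair, every input that Num accepts is claimed by Id first; once Id requires a leading letter, the two rules no longer overlap.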

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using Python (a grammar that can accept either a list of numbers or a list of IDs), and yet when I input a string such as December 12 1965, ANTLR runs on the file and shows me no errors with the following code (all of the Python code that I'm using is embedded via @main):
grammar ParserLang;
options {
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
@main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
program : idList EOF
| integerList EOF
idList : ID whitespace idList
| ID
integerList : INTEGER whitespace integerList
whitespace : (WHITESPACE | COMMENT) +;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar and input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange, since that could be the start of an idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefore not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. It is better to define a single-line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule anyway.
The parser cannot match an idList (or an integerList, for that matter) because of the whitespace rule, so an error is produced pointing at the very first token ('December').
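The point about the single-line comment can be checked with ordinary regular expressions. The two Python patterns below are rough equivalents of the two ways of writing the rule, not ANTLR syntax:

```python
import re

# Requires a trailing line break, like '//' .* '\n'.
WITH_NEWLINE = re.compile(r"//.*\n")
# Stops before the line break, like '//' ~('\r' | '\n')*.
UNTIL_LINE_END = re.compile(r"//[^\r\n]*")

last_line = "// trailing comment"  # end of input, no line break

print(WITH_NEWLINE.match(last_line))            # None: the comment is not recognized
print(UNTIL_LINE_END.match(last_line).group())  # // trailing comment
print(repr(UNTIL_LINE_END.match("// c\nx").group()))  # '// c', line break left for WHITESPACE
```

The second form works both in the middle of a file and on the very last line.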
Here's a grammar that works (as expected):
grammar ParserLang;
options {
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
@main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
program : idList EOF
| integerList EOF
idList : ID+
integerList : INTEGER+
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.
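For comparison, here is a tiny hand-written recognizer in Python that follows the same program rule. It is only a sketch: the ID and INTEGER patterns are assumptions (those lexer rules are not shown above), it handles single-line input only, and the messages merely imitate the style of ANTLR's:

```python
import re

# Assumed token shapes; the real ID and INTEGER rules are not shown above.
TOKEN_RE = re.compile(r"(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<INTEGER>[0-9]+)|(?P<WS>\s+)")

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "WS":  # whitespace is hidden from the parser
            tokens.append((m.lastgroup, m.group(), pos))
        pos = m.end()
    tokens.append(("EOF", "<EOF>", len(text)))
    return tokens

def parse_program(text):
    """program : idList EOF | integerList EOF ;  with idList : ID+ and integerList : INTEGER+"""
    tokens = lex(text)
    first_kind, first_text, first_col = tokens[0]
    if first_kind not in ("ID", "INTEGER"):
        return f"line 1:{first_col} no viable alternative at input {first_text!r}"
    i = 0
    while tokens[i][0] == first_kind:  # consume ID+ or INTEGER+
        i += 1
    kind, value, col = tokens[i]
    if kind != "EOF":
        return f"line 1:{col} missing EOF at {value!r}"
    return "ok"

print(parse_program("December 12 1965"))  # line 1:9 missing EOF at '12'
print(parse_program("12 1965"))           # ok
```

The list of IDs ends after 'December', so the next thing the rule allows is EOF, and the first INTEGER is reported instead.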

Using different case keywords in ANTLR grammar

Some keywords (string constant) in my grammar contain capital letters
PREV_VALUE : 'PreviousValue';
This causes strange parsing behavior: other tokens that contain the same capital letters ('P', 'V') are parsed incorrectly.
Here's a simplified version of the lexer grammar:
lexer grammar ExpressionLexer;
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
PREV_VALUE : 'PreviousValue';
fragment DIGIT : ('0'..'9');
fragment LETTER : ('a'..'z'|'A'..'Z'|'_');
fragment TAB : ('\t') ;
fragment NEWLINE : ('\r'|'\n') ;
fragment SPACE : (' ') ;
When I try parsing an expression like this:
var expression = "P"; //Capital 'P' which included to the keyword 'PreviousValue'
var stringReader = new StringReader(expression);
var input = new ANTLRReaderStream(stringReader);
var expressionLexer = new ExpressionLexer(input);
var tokens = new CommonTokenStream(expressionLexer);
tokens._tokens collection contains one value
[0] = {[#0,1:1='<EOF>',<-1>,1:1]}
It's incorrect.
If I change the expression to 'p' (lowercase letter), the
tokens._tokens collection contains two values
[0] = {[#0,0:0='p',<0>,1:0]}
[1] = {[#1,1:1='<EOF>',<-1>,1:1]}
It's correct.
When the line PREV_VALUE : 'PreviousValue'; is removed from the grammar, both expressions are parsed correctly.
Is it possible to use different case in keywords?
Is there any example of using such keywords in ANTLR grammar?
I find it hard to believe a p token is created based on the grammar you posted. Lexer rules that have fragment in front of them will not produce tokens: these rules are only used by other lexer rules.
A simple demo shows this:
lexer grammar ExpressionLexer;
@lexer::members {
public static void main(String[] args) throws Exception {
ExpressionLexer lexer = new ExpressionLexer(new ANTLRStringStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // remove this line when using ANTLR 3.2 or an older version
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
PREV_VALUE : 'PreviousValue';
fragment DIGIT : ('0'..'9');
fragment LETTER : ('a'..'z'|'A'..'Z'|'_');
fragment TAB : ('\t') ;
fragment NEWLINE : ('\r'|'\n') ;
fragment SPACE : (' ') ;
Now generate the lexer and compile the .java source file:
java -cp antlr-3.3.jar org.antlr.Tool ExpressionLexer.g
javac -cp antlr-3.3.jar *.java
and run a few tests:
java -cp .:antlr-3.3.jar ExpressionLexer p
line 1:0 no viable alternative at character 'p'
which is correct since there is no (non-fragment) rule that starts with, or matches, a "p".
java -cp .:antlr-3.3.jar ExpressionLexer P
line 1:1 mismatched character '<EOF>' expecting 'r'
which is correct since the only (non-fragment) rule that starts with a "P" expects an "r" to be the next character (which isn't there).
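What happens for the input "P" can be pictured as matching a literal character by character. The following Python function is a simplified model (the name match_literal is made up, and the message only imitates the shape of the lexer's output):

```python
def match_literal(text, literal):
    """Match text against a keyword literal, reporting the first mismatch."""
    for i, expected in enumerate(literal):
        if i >= len(text):
            # Input ended while the literal still expects more characters.
            return f"line 1:{i} mismatched character '<EOF>' expecting {expected!r}"
        if text[i] != expected:
            return f"line 1:{i} mismatched character {text[i]!r} expecting {expected!r}"
    return "matched"

print(match_literal("P", "PreviousValue"))
print(match_literal("PreviousValue", "PreviousValue"))  # matched
```

Once the first character selects PREV_VALUE as the only candidate rule, every following character has to agree with the literal.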

ANTLR parsing Java Properties

I'm trying to pick up ANTLR and writing a grammar for Java Properties. I'm hitting an issue here and will appreciate some help.
Java Properties has somewhat strange escape handling. For example,
results in key-value pairs in Java runtime as
=== =====
key1 1=Key1
key=2 =
So far, this is the best I can do to mimic it: folding the '=' and the value into one single token.
grammar Prop;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : (~('='|'\r'|'\n') | '\\=')* ;
VALUE : '=' (~('\r'|'\n'))*;
: ('\r'|'\n') + {$channel=HIDDEN;}
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
Are there any suggestions for how I could implement this better?
Thanks a lot
It's not as easy as that. You can't handle much at the lexing level because many things depend on a certain context. So at the lexing level, you can only match single characters and construct keys and values in parser rules. Also, the = and : as possible key-value separators, and the fact that these characters can be the start of a value, make them a pain in the butt to translate into a grammar. The easiest approach would be to include these (possible) separator chars in your value rule and, after matching the separator and value together, strip the separator chars from it.
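The "match first, strip afterwards" step is plain string processing. Here is a minimal Python sketch of that post-processing on its own, assuming the parser has already handed over the raw key text and the raw separator-plus-value text (the function names are made up):

```python
import re

def clean_key(raw_key):
    """Replace escaped ':' and '=' in a key with the plain characters."""
    return raw_key.replace("\\:", ":").replace("\\=", "=")

def clean_value(separator_and_value):
    """Strip the leading separator and join escaped line continuations."""
    # Remove one optional separator (':' or '=') with surrounding whitespace.
    value = re.sub(r"^\s*[:=]\s*", "", separator_and_value)
    # Remove backslash + line break + leading whitespace of the next line.
    value = re.sub(r"\\(\r?\n|\r)[ \t\f]*", "", value)
    return value.strip()

print(clean_key("ke\\:\\=y4"))      # ke:=y4
print(clean_value(" = value 1"))    # value 1
print(clean_value("=="))            # =
```

Only the first separator is removed, so a value that itself starts with = or : survives intact.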
A small demo:
grammar JavaProperties;
: line* EOF
: Space* keyValue
| Space* Comment eol
| Space* LineBreak
: key separatorAndValue eol
// Replace all escaped `=` and `:`
String k = $key.text.replace("\\:", ":").replace("\\=", "=");
// Remove the separator, if it exists
String v = $separatorAndValue.text.replaceAll("^\\s*[:=]\\s*", "");
// Remove all escaped line breaks with trailing spaces
v = v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
System.out.println("\nkey : `" + k + "`");
System.out.println("value : `" + v + "`");
: keyChar+
: AlphaNum
| Backslash (Colon | Equals)
: (Space | Colon | Equals) valueChar+
: AlphaNum
| Space
| Backslash LineBreak
| Equals
| Colon
: LineBreak
Backslash : '\\';
Colon : ':';
Equals : '=';
: ('!' | '#') ~('\r' | '\n')*
: '\r'? '\n'
| '\r'
: ' '
| '\t'
| '\f'
: 'a'..'z'
| 'A'..'Z'
| '0'..'9'
The grammar above can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRFileStream("");
JavaPropertiesLexer lexer = new JavaPropertiesLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaPropertiesParser parser = new JavaPropertiesParser(tokens);
and the input file:
key1 = value 1
key2:value 2
key3 :value3
e 4
key6 value6
to produce the following output:
key : `key1`
value : `value 1`
key : `key2`
value : `value 2`
key : `key3`
value : `value3`
key : `ke:=y4`
value : `value 4`
key : `key=5`
value : `=`
key : `key6`
value : `value6`
Realize that my grammar is just an example: it does not account for all valid properties files (sometimes backslashes should be ignored, there's no Unicode escapes, many characters are missing in the key and value). For a complete specification of properties files, see: