ANTLR No viable alternative at input query

I am designing a parser for a small language - the grammar I have is as follows:
grammar Test;
r : specification+;
specification : MODULE module_def EOF;
module_def : ID EQ classExp;
classExp : basicClassExp
| altClassExp
;
basicClassExp : CLASS (classCode)* END;
altClassExp : ALTCLASS (classCode)* END;
classCode : 'classCode';
EXCLAMATION : '!';
EQ : '=';
MODULE : 'module';
END : 'end';
CLASS : 'class';
ALTCLASS : 'altclass';
ID
: JavaLetter JavaLetterOrDigit*
;
fragment
JavaLetter
: [a-zA-Z$_]
| ~[\u0000-\u007F\uD800-\uDBFF]
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
;
fragment
JavaLetterOrDigit
: [a-zA-Z0-9$_]
| ~[\u0000-\u007F\uD800-\uDBFF]
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
;
WS : [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(HIDDEN)
;
An example of correct code would be:
module T = class end
When I run it on the following code:
module ! T ! = !
class !
end !
I expect it to report errors where the exclamation marks are. Instead I get:
line 1:7 extraneous input '!' expecting ID
line 1:11 extraneous input '!' expecting '='
line 1:15 no viable alternative at input '!'
When I remove the altClassExp rule alternative, it reports all errors, i.e.
line 1:7 extraneous input '!' expecting ID
line 1:11 extraneous input '!' expecting '='
line 1:15 extraneous input '!' expecting 'class'
line 2:10 extraneous input '!' expecting {'classCode', 'end'}
line 3:6 extraneous input '!' expecting <EOF>
What do I need to change so I can keep my altClassExp rule and also report all the extraneous exclamation marks?
Thanks

Related

Problems defining an ANTLR parser for template file

I have started working with ANTLR4 to create a syntax parser for a self defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Each part's body is surrounded by braces.
I have created a sample file and adapted an existing ANTLR .g4 file to parse it. This works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser and extend the #region header by #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I expected the parser not to concatenate the whole line when comparing. What am I doing wrong?
Thanks.
There are a couple of problems here:
1. Whatever NOSPACE matches is also matched by TEXT.
2. TEXT is way too greedy.
Issue 1
ANTLR's lexer works independently of the parser, and it matches as many characters as possible.
When two (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
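To make the two rules above concrete, here is a rough Python sketch of "longest match wins, ties go to the rule defined first". It is not ANTLR itself: the regexes are stand-ins for the TEXT and NOSPACE rules from the question, and next_token is a made-up helper for illustration.

```python
import re

# Stand-ins for the two lexer rules, listed in definition order.
RULES = [
    ("TEXT", re.compile(r"[^\x00-\x1f]+")),     # (~[\u0000-\u001F])+
    ("NOSPACE", re.compile(r"[^\x00-\x20]+")),  # (~[\u0000-\u0020])+
]

def next_token(text, pos, rules=RULES):
    """Return (rule name, lexeme) for the longest match at pos.

    A later rule only replaces the current best if it matches strictly
    more characters, so on a tie the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("Foo", 0))                        # ('TEXT', 'Foo')
print(next_token("Foo", 0, list(reversed(RULES)))) # ('NOSPACE', 'Foo')
```

Swapping the rule order changes the winner for Foo, but for input containing a space TEXT would still win because it matches more characters.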
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token, while the parser expects a '#region' token at that point.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
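As an illustration of why a less greedy TEXT helps, here is a small Python simulation of the tokenisation (a sketch only, not ANTLR). GREEDY_TEXT mirrors the TEXT rule from the question; NARROW_TEXT is a hypothetical narrower rule chosen just for this example, and tokenize is a made-up helper.

```python
import re

# Literal tokens from the parser rules; ANTLR defines these before the named lexer rules.
LITERALS = ["#settings", "#region", "{", "}", "(", ")", ","]

GREEDY_TEXT = r"[^\x00-\x1f]+"  # the question's TEXT rule: (~[\u0000-\u001F])+
NARROW_TEXT = r"[A-Za-z0-9]+"   # hypothetical narrower rule, for illustration only

def tokenize(line, text_pattern):
    """Longest match wins; on a tie the rule listed first wins. WS tokens are skipped."""
    rules = [(repr(lit), re.compile(re.escape(lit))) for lit in LITERALS]
    rules.append(("TEXT", re.compile(text_pattern)))
    rules.append(("WS", re.compile(r"[ \t\r\n]+")))
    tokens, pos = [], 0
    while pos < len(line):
        best = None  # (rule name, end offset)
        for name, pattern in rules:
            m = pattern.match(line, pos)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end = best
        if name != "WS":
            tokens.append((name, line[pos:end]))
        pos = end
    return tokens

line = "#region name (attributes, moreAttributes)"
print(tokenize(line, GREEDY_TEXT))
# [('TEXT', '#region name (attributes, moreAttributes)')]
print([lexeme for _, lexeme in tokenize(line, NARROW_TEXT)])
# ['#region', 'name', '(', 'attributes', ',', 'moreAttributes', ')']
```

With the greedy rule the '#region' literal never gets a chance, because TEXT matches the whole line; with the narrower rule the literal and the punctuation come out as separate tokens.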
Bart,
thank you very much for clarifying this for me. The key phrase was that the lexer will match as many characters as possible. This is a behavior I still need to get used to. I redesigned my lexer and parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated parser on 'working = yes;' I get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that whitespace is not significant to your grammar (i.e. it has no semantic meaning apart from separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
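With -> skip in ANTLR 4, the lexer still matches the whitespace but discards it, so no token reaches the parser. With $channel = HIDDEN in ANTLR 3 (or -> channel(HIDDEN) in ANTLR 4), the tokens are kept but placed on a channel the parser does not read. Either way, the parser behaves as if the whitespace were not there, which keeps your rules simple: you don't need to add optional whitespace everywhere.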
Your example has whitespace but your field_def isn't accounting for it.
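For illustration, here is a small Python sketch (not ANTLR) of what the parser sees for 'working = yes;' with and without skipping. The token names EQ and SEMI and the helper token_types are made up for this example; only WORD and WHITESPACE come from the grammar in the question.

```python
import re

# One named group per token type; WORD and WHITESPACE mirror the question's rules.
TOKEN_RE = re.compile(
    r"(?P<WORD>[A-Za-z_]+)|(?P<WHITESPACE>[ \t]+)|(?P<EQ>=)|(?P<SEMI>;)"
)

# Token sequence required by: field_def : WORD '=' WORD ';' ;
FIELD_DEF = ["WORD", "EQ", "WORD", "SEMI"]

def token_types(text, skip_whitespace):
    """Return the token types the parser would receive for text."""
    types, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if not (skip_whitespace and m.lastgroup == "WHITESPACE"):
            types.append(m.lastgroup)
        pos = m.end()
    return types

print(token_types("working = yes;", skip_whitespace=False))
# ['WORD', 'WHITESPACE', 'EQ', 'WHITESPACE', 'WORD', 'SEMI']
print(token_types("working = yes;", skip_whitespace=True) == FIELD_DEF)
# True
```

Without skipping, the WHITESPACE tokens sit exactly where the two "extraneous input ' '" errors were reported.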

Ignore some part of input when parsing with ANTLR

I'm trying to parse a language with ANTLR (ANTLRWorks 3.5.2). The goal is to feed in the complete input, but have ANTLR build a parse tree only for the parts defined in the grammar and ignore the rest of the input. For example, this is my grammar:
grammar asap;
project : '/begin PROJECT' name module+ '/end PROJECT';
module : '/begin MODULE'name '/end MODULE';
name : IDENT ;
IDENT : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*;
Given input:
/begin PROJECT HybridSailboat_2
/begin MODULE engine
/begin A2ML
/include XCP_common_v1_0.aml
"XCP" struct {
taggedstruct Common_Parameters ;
};
/end A2ML
/end MODULE
/end PROJECT
For this input, I want the parse tree to contain only project and module, not the A2ML part.
Is it possible to make ANTLR ignore some parts of the input?
Can I specify the start and end points of the unimportant parts in the grammar?
Simply match the A2ML part as a single token in the lexer and skip() it:
grammar asap;
project
: BEGIN_PROJECT name module* END_PROJECT EOF
;
module
: BEGIN_MODULE name END_MODULE
;
name
: IDENT
;
IDENT
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*
;
BEGIN_PROJECT
: '/begin' S 'PROJECT'
;
END_PROJECT
: '/end' S 'PROJECT'
;
BEGIN_MODULE
: '/begin' S 'MODULE'
;
END_MODULE
: '/end' S 'MODULE'
;
A2ML
: '/begin' S 'A2ML' .* '/end' S 'A2ML' {skip();}
;
SPACES
: S {skip();}
;
fragment S
: (' ' | '\t' | '\r' | '\n')+
;

Lexer to handle lines with line number prefix

I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits) followed by an identifier or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and not present in most cases).
My current grammar looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*;
However, with this grammar every line that starts with a line number is tokenized as an IDENTIFIER instead of a LINE_NUM followed by the identifier or keyword.
Is there a way to properly tokenize this input using an ANTLR grammar?
You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
: {getCharPositionInLine() != 0
|| _input.LA(1) != 'L'
|| !Character.isDigit(_input.LA(2))}?
[a-zA-Z_] [a-zA-Z0-9_]*
;
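The idea behind the predicate is that 'L' followed by digits counts as a line number only at column 0; anywhere else the same text is an ordinary identifier. A rough Python sketch of that logic (not generated ANTLR code; tokenize_line and the whitespace handling are assumptions made for this example):

```python
import re

LINE_NUM = re.compile(r"L[0-9]+")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Keyword spellings taken from the question's KEYWORD_A / KEYWORD_B rules.
KEYWORDS = {"someKeyword": "KEYWORD_A", "reservedWord": "KEYWORD_B"}

def tokenize_line(line):
    """Tokenize one line (without its newline) into (type, lexeme) pairs."""
    tokens, pos = [], 0
    m = LINE_NUM.match(line)  # only ever tried at column 0
    if m:
        tokens.append(("LINE_NUM", m.group()))
        pos = m.end()
    while pos < len(line):
        if line[pos] in " \t":  # optional spaces are ignored
            pos += 1
            continue
        m = IDENTIFIER.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character at column {pos}")
        word = m.group()
        tokens.append((KEYWORDS.get(word, "IDENTIFIER"), word))
        pos = m.end()
    return tokens

print(tokenize_line("L250foo"))         # [('LINE_NUM', 'L250'), ('IDENTIFIER', 'foo')]
print(tokenize_line("L10someKeyword"))  # [('LINE_NUM', 'L10'), ('KEYWORD_A', 'someKeyword')]
print(tokenize_line("Label"))           # [('IDENTIFIER', 'Label')]
```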
You could also avoid semantic predicates by using lexer modes (modes are only available in a lexer grammar, not in a combined grammar).
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;
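For readers unfamiliar with modes, here is a simplified Python model of the mode stack used above (a sketch, not ANTLR). It leaves out the keywords and any whitespace handling, and the MODES table and tokenize helper are made up for illustration.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
NEWLINE = r"\r\n?|\n"

# mode -> ordered rules: (token type, compiled pattern, mode action)
MODES = {
    "DEFAULT": [
        ("LINE_NUM", re.compile(r"L[0-9]+"), "push"),
        # An identifier that can never start with 'L' followed by a digit.
        ("IDENTIFIER", re.compile(r"L(?:" + IDENT + r")?|[A-KM-Za-z_][A-Za-z0-9_]*"), "push"),
        ("NL", re.compile(NEWLINE), None),
    ],
    "NotBeginningOfLine": [
        ("NL", re.compile(NEWLINE), "pop"),
        ("IDENTIFIER", re.compile(IDENT), None),
    ],
}

def tokenize(text):
    """Longest match within the rules of the mode on top of the stack."""
    stack, pos, tokens = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None  # (match, token type, action)
        for ttype, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[0].end()):
                best = (m, ttype, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        m, ttype, action = best
        tokens.append((ttype, m.group()))
        if action == "push":
            stack.append("NotBeginningOfLine")
        elif action == "pop":
            stack.pop()
        pos = m.end()
    return tokens

print([t for t, _ in tokenize("L250foo\nbar\nL1L2\n")])
# ['LINE_NUM', 'IDENTIFIER', 'NL', 'IDENTIFIER', 'NL', 'LINE_NUM', 'IDENTIFIER', 'NL']
```

Note that L2 in the last line comes out as an IDENTIFIER, because LINE_NUM only exists in the default mode, which is active at the start of a line.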

extraneous input error using ANTLR4

I am an ANTLR newbie and struggling with some of the errors that I am getting. Below I have included the grammar that I am using, the input file, and the error that I am getting.
My Antlr Grammar file is as follows:
grammar Simple;
#header
{
package simple;
}
PARSER
program :
anylinebefore+
processline
anylineafter+
'MSEND' NEWLINE
'.' EOF
;
anylinebefore: CH* NEWLINE | commentline;
anylineafter: statement | commentline;
statement: movestatement ;
movestatement : 'MOVE' arg ('to' | 'TO') ID '.' NEWLINE ;
arg : ID|STRING;
processline: PROCESSLITERAL NEWLINE;
commentline: '!' CH* NEWLINE;
LEXER
WS : [ \t]+ -> skip ;
STRING : '\'' (~['])* '\'';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
TO : ('to' | 'TO');
CH : [\u0000-\uFFFE];
PROCESSLITERAL : 'PROCESS SOURCE FOLLOWS';
NEWLINE : '\r'? '\n' ;
My input file is as follows:
MODIFY
PROCESS SOURCE FOLLOWS
MOVE 'WSFRED' TO AGRPASSEDTWO.
MSEND
.
The error that I get is:
showtree:
[java] line 1:0 extraneous input 'MODIFY' expecting {'!', CH, NEWLINE}
I don't understand why this isn't matching anylinebefore in the grammar.
Any help would be appreciated.
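"MODIFY" is tokenized as a single ID token, because the lexer prefers the longest match: ID matches all six characters while CH matches only one. anylinebefore accepts only CH tokens followed by NEWLINE, or a commentline starting with '!', so an ID at that position cannot be matched by anylinebefore+.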
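A minimal Python sketch of that longest-match decision (not ANTLR; only the ID and CH rules are modelled, and first_token is a made-up helper):

```python
import re

# ID and CH from the question's grammar, in definition order.
RULES = [
    ("ID", re.compile(r"[A-Za-z]+")),        # ('a'..'z'|'A'..'Z')+
    ("CH", re.compile(r"[\u0000-\uFFFE]")),  # a single character
]

# Token types that can start anylinebefore, per the reported error message.
ANYLINEBEFORE_START = {"'!'", "CH", "NEWLINE"}

def first_token(text):
    """Return (rule name, lexeme) of the longest match at the start of text."""
    best = None  # (rule name, end offset)
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return (best[0], text[:best[1]])

name, lexeme = first_token("MODIFY\n")
print(name, lexeme)                 # ID MODIFY
print(name in ANYLINEBEFORE_START)  # False
```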