I have a task to write a simple parser generator, so I wrote an ANTLR-like grammar and tried to parse a simple file containing "foo:bar;", but got the following output:
[#0,0:2='foo',<1>,1:0]
[#1,3:3=':',<16>,1:3]
[#2,4:6='bar',<1>,1:4]
[#3,7:7=';',<18>,1:7]
[#4,8:7='<EOF>',<-1>,1:8]
line 1:0 no viable alternative at input 'foo'
(rule foo : bar ;)
My grammar looks like this:
grammar parsGen;
gram : rule SEMICOLON (NEWLINE+ rule SEMICOLON)* ;
rule : lRule | pRule ;
lRule : LRULEID COLON lRule1 ;
lRule1 : (((LRULEID | STRING | SET) | LBRACE lRule1 PIPE lRule1 RBRACE) modificator? SPACE+)+ ;
pRule : PRULEID COLON pRule1 ;
pRule1 : (((LRULEID | PRULEID) | LBRACE lRule1 PIPE lRule1 RBRACE) modificator? SPACE+)+ ;
modificator : PLUS | ASTERISK | QUESTION ;
ID : LRULEID | PRULEID ;
LRULEID : UPPERLETTER (UPPERLETTER | LOWERLETTER | DIGIT)* ;
PRULEID : LOWERLETTER (UPPERLETTER | LOWERLETTER | DIGIT)* ;
STRING : ('\''.*?'\'') ;
SET : '\''.*?'\'..\''.*?'\'' ;
UPPERLETTER : [A-Z] ;
LOWERLETTER : [a-z] ;
DIGIT : [0-9] ;
NEWLINE : '\r\n'|'\n'|'\r' ;
PLUS : '+' ;
ASTERISK : '*' ;
QUESTION : '?' ;
LBRACE : '(' ;
RBRACE : ')' ;
SPACE : ' ' ;
COLON : ':' ;
PIPE : '|' ;
SEMICOLON : ';' ;
So where did I make a mistake? I searched everywhere (Google, SO, etc.) for the error "no viable alternative", but it didn't really help me.
ANTLR lexers fully assign unambiguous token types before the parser is ever used. When more than one lexer rule can match the same text, the rule appearing first in the grammar is the one that is used. For your grammar, a token cannot have the type ID and the type PRULEID at the same time. Since the input foo matches both of these lexer rules, the first appearing in the grammar is used, so your tokens are: ID, COLON, ID, SEMICOLON, <EOF>.
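To see the tie-breaking rule in action, here is a minimal Python sketch of a "longest match wins, first rule wins on a tie" lexer. It is not ANTLR's implementation: the tokenize helper and the regular expressions standing in for the lexer rules are assumptions made for illustration, and the single-character rules (UPPERLETTER, SPACE and so on) are left out.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule never displaces an earlier one on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex stand-ins for the lexer rules, in the order used in the question.
QUESTION_RULES = [
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("LRULEID", r"[A-Z][A-Za-z0-9]*"),
    ("PRULEID", r"[a-z][A-Za-z0-9]*"),
    ("COLON", r":"),
    ("SEMICOLON", r";"),
]
WITHOUT_ID = QUESTION_RULES[1:]  # same rules with ID removed

with_id = [name for name, _ in tokenize("foo:bar;", QUESTION_RULES)]
without_id = [name for name, _ in tokenize("foo:bar;", WITHOUT_ID)]
print(with_id)     # ['ID', 'COLON', 'ID', 'SEMICOLON']
print(without_id)  # ['PRULEID', 'COLON', 'PRULEID', 'SEMICOLON']
```

With ID listed first, the parser never sees a PRULEID token, which is why neither lRule nor pRule can start.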
Since the ID token is never actually referenced in the parser, I suggest one of the following changes. Either of these options will resolve the problem you have described, so the choice is entirely your preference for how the final grammar looks.
Foreword
You need to change the space references from SPACE+ to SPACE*, or the rule will require at least one space character between bar and ;.
Option 1
Remove the ID lexer rule altogether.
Option 2
Change ID to a parser rule so it's not trying to assign token type ID to all of your identifiers.
id : LRULEID | PRULEID;
Update the pRule1 rule to reference id.
pRule1 : ((id | LBRACE lRule1 PIPE lRule1 RBRACE) modificator? SPACE+)+ ;
Unrelated Side Note
Your grammar might be easier to read if you remove the outermost + closure inside the lRule1 and pRule1 rules, and instead add it to the rule references themselves, like this. Note that I changed the SPACE references as described in the foreword.
lRule : LRULEID COLON lRule1+ ;
lRule1 : ((LRULEID | STRING | SET) | LBRACE lRule1 PIPE lRule1 RBRACE) modificator? SPACE* ;
pRule : PRULEID COLON pRule1+ ;
pRule1 : ((LRULEID | PRULEID) | LBRACE lRule1 PIPE lRule1 RBRACE) modificator? SPACE* ;
Also, from http://www.antlr.org/api/Java/org/antlr/v4/runtime/NoViableAltException.html:
Indicates that the parser could not decide which of two or more paths to take based upon the remaining input. It tracks the starting token of the offending input and also knows where the parser was in the various paths when the error occurred.
In my case, I was calling lexer.nextToken() before parsing, for debugging purposes. That consumed the input, so without a lexer.reset() afterwards the parser failed with the "no viable alternative at input EOF" error.
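The effect can be pictured with a toy model. The ToyLexer class below is a hypothetical stand-in written for this illustration, not the ANTLR runtime API; it only assumes that a lexer hands out each token once and then reports EOF until it is reset.

```python
class ToyLexer:
    """Hypothetical stand-in for a lexer: yields each token once, then EOF forever."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0

    def next_token(self):
        if self._pos < len(self._tokens):
            tok = self._tokens[self._pos]
            self._pos += 1
            return tok
        return "<EOF>"

    def reset(self):
        """Rewind to the start of the input."""
        self._pos = 0


def drain(lexer):
    """Pull tokens until EOF, the way a debug dump or a parser would."""
    seen = []
    while True:
        tok = lexer.next_token()
        seen.append(tok)
        if tok == "<EOF>":
            return seen


lexer = ToyLexer(["foo", ":", "bar", ";"])
debug_dump = drain(lexer)       # debugging pass consumes everything
without_reset = drain(lexer)    # what the parser would see next
lexer.reset()
after_reset = drain(lexer)      # what the parser sees after a reset

print(debug_dump)     # ['foo', ':', 'bar', ';', '<EOF>']
print(without_reset)  # ['<EOF>']
print(after_reset)    # ['foo', ':', 'bar', ';', '<EOF>']
```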
Original question:
My code to parse:
N100G1M4
What I expected: N100 G1 M4
But ANTLR cannot identify this. Is that because ANTLR always matches the longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-Code txt and get keywords from a file stream, which is usually used to control a machine and drive motors to move.
The G-Code rule is :
// Define a grammar called GCode
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression ('+' | '-') multiplyingExpression
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces and tabs
Sample program (it works well except for the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 into
-> N100 G1 X100 Z[VAR1+100]
-> or, even better, split the node X100 into two subnodes, X and 100.
I am trying to use ANTLR, but ANTLR always applies the "longest match wins" rule, so N100G1X100 is identified as a single word.
Append question:
What's the best tool to finish the task?
ANTLR has a strict separation between parser and lexer, and therefore the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
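As a rough illustration of why removing the identifier rule helps, here is a small Python sketch of a longest-match lexer. The tokenize helper and the regular expressions are assumptions standing in for four of the question's lexer rules (LINENUMEXPR, GCODEEXPR, MCODEEXPR, VARNAME); it is not how ANTLR is implemented, and the rest of the grammar is ignored.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex stand-ins for four lexer rules from the G-code grammar.
GCODE_RULES = [
    ("LINENUMEXPR", r"N[0-9]+"),
    ("GCODEEXPR", r"G[0-9]+(?:\.[0-9]+)*"),
    ("MCODEEXPR", r"M[0-9]+"),
    ("VARNAME", r"[a-zA-Z_][a-zA-Z_0-9]*"),
]

with_varname = tokenize("N100G1M4", GCODE_RULES)
without_varname = tokenize("N100G1M4", GCODE_RULES[:-1])
print(with_varname)     # [('VARNAME', 'N100G1M4')]
print(without_varname)  # [('LINENUMEXPR', 'N100'), ('GCODEEXPR', 'G1'), ('MCODEEXPR', 'M4')]
```

VARNAME matches all eight characters while LINENUMEXPR matches only four, so VARNAME wins as long as it is in the grammar.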
Scannerless Parser Generators
It looks like I have a problem understanding a too greedy rule match. I'm trying to lex a .g4 file for syntax coloring. Here is a minimum (simplified) extract for making this problem reproducible:
lexer grammar ANTLRv4Lexer;
Range
: '[' RangeChar+ ']'
;
fragment EscapedChar
: '\\' ~[u]
| '\\u' EscapedCharHex EscapedCharHex EscapedCharHex EscapedCharHex
;
fragment EscapedCharHex
: [0-9A-Fa-f]
;
fragment RangeChar
: ~']'
| EscapedChar
;
Punctuation
: [:;()+\->*[\]~|]
;
Identifier
: [a-zA-Z0-9]+
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
LineComment
: '//' ~[\r\n]*
;
The (incomplete) test file is following:
: (~ [\]\\] | EscAny)+ -> more
;
// ------
fragment Id
: NameStartChar NameChar*
;
String2Part
: ( ~['\\]
| EscapeSequence
)+
;
I don't understand why it matches Range so greedily:
[#0,3:3=':',<Punctuation>,1:3]
[#1,5:5='(',<Punctuation>,1:5]
[#2,6:6='~',<Punctuation>,1:6]
[#3,8:135='[\]\\] | EscAny)+ -> more\r\n ;\r\n\r\n // ------\r\n\r\nfragment Id\r\n : NameStartChar NameChar*\r\n ;\r\n\r\n\r\nString2Part\r\n\t: ( ~['\\]',<Range>,1:8]
[#4,141:141='|',<Punctuation>,13:3]
[#5,143:156='EscapeSequence',<Identifier>,13:5]
[#6,162:162=')',<Punctuation>,14:3]
[#7,163:163='+',<Punctuation>,14:4]
[#8,167:167=';',<Punctuation>,15:1]
[#9,170:169='<EOF>',<EOF>,16:0]
I understand why it matches [, \] and \\ in the first line, but why does it then treat ] as a RangeChar?
Your lexer matches the first \ in \\] using the ~']' alternative and then matches the remaining \] as an EscapedChar. It does this because that interpretation leads to a longer match than the one where \\ is the EscapedChar and ] ends the range. When there are multiple valid ways to match a lexer rule, ANTLR always chooses the longest one (except when a non-greedy operator such as *? is involved).
To fix this, you should change RangeChar, so that backslashes are only allowed as part of escape sequences, i.e. replace ~']' with ~[\]\\].
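The difference between the two versions of RangeChar can be checked with a small simulation. The sketch below is an assumption-laden model, not ANTLR itself: it explores every way of matching '[' RangeChar+ ']' and keeps the longest, with a flag (bare_backslash_ok, a name made up here) that switches between the original ~']' alternative and the suggested ~[\]\\].

```python
HEX = set("0123456789abcdefABCDEF")

def range_char_steps(text, p, bare_backslash_ok):
    """Every position reachable from p by consuming exactly one RangeChar."""
    nxt = set()
    if p >= len(text):
        return nxt
    c = text[p]
    # Alternative 1: any char except ']' (the fixed rule also excludes '\').
    if c != "]" and (bare_backslash_ok or c != "\\"):
        nxt.add(p + 1)
    # Alternative 2: EscapedChar, i.e. '\' + non-'u', or '\u' + 4 hex digits.
    if c == "\\" and p + 1 < len(text):
        if text[p + 1] != "u":
            nxt.add(p + 2)
        elif p + 6 <= len(text) and all(h in HEX for h in text[p + 2:p + 6]):
            nxt.add(p + 6)
    return nxt

def longest_range(text, start, bare_backslash_ok):
    """Longest match of '[' RangeChar+ ']' beginning at start, or None."""
    if start >= len(text) or text[start] != "[":
        return None
    frontier = range_char_steps(text, start + 1, bare_backslash_ok)
    seen, best_end = set(), None
    while frontier:
        p = frontier.pop()
        if p in seen:
            continue
        seen.add(p)
        if p < len(text) and text[p] == "]":  # a closing bracket may end the token here
            best_end = max(best_end or 0, p + 1)
        frontier |= range_char_steps(text, p, bare_backslash_ok)
    return text[start:best_end] if best_end else None

sample = r"[\]\\] | EscAny ~['\\] | x"
original_rule = longest_range(sample, 0, bare_backslash_ok=True)
fixed_rule = longest_range(sample, 0, bare_backslash_ok=False)
print(original_rule)  # [\]\\] | EscAny ~['\\]
print(fixed_rule)     # [\]\\]
```

With the original rule a backslash can be read either as a plain character or as the start of an escape, so a longer path exists that swallows the first closing bracket. Once bare backslashes are disallowed there is only one way to read the input.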
This may be a newbie question, since I don't have a lot of ANTLR experience, but I've done a lot of research and troubleshooting and have not found a solution, so I'm resorting to asking. I am trying to write a parser for a very odd file format (PCGEN, an open source role-playing game character editor) that I plan to use for several purposes, not the least of which is learning ANTLR. I am at the point where I have everything I want working in the lexer and parser, except that it stops parsing when it hits blank lines. I know I could add a rule to throw away all whitespace, but the file format is such that strings are not really quoted and white space is usually important, so the only white space that should be ignored is a totally blank line. When I run the lexer it gives the tokens for the entire file, so I thought the parser would process the tokens without concern for where they came from, so I am missing something simple. Here is the beginning of my input:
PCGVERSION:2.0
# System Information
CAMPAIGN:Advanced Player's Guide|CAMPAIGN:Ultimate Magic|CAMPAIGN:Ultimate Combat
VERSION:6.07.05
ROLLMETHOD:3|EXPRESSION:2d6+6
PURCHASEPOINTS:N
And this is my current grammar:
grammar PCG;
pcgFile : lines=line+;
line : statement (NEWLINE | EOF)
;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | '\r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
COMMENT : '#' ~[\r\n]*->skip ;
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Any help would be appreciated.
You need to allow for more than one newline between statements. Do that by removing the line rule and rewriting pcgFile like this:
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
The main problem is that your lexer matches # System Information as a TEXT token. Whenever 2 or more rules match the same number of characters, the rule defined first will "win" *. Here, that is TEXT. When you place COMMENT before TEXT, it will work:
grammar PCG;
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | '\r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
COMMENT : '#' ~[\r\n]* ->skip ;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Keep in mind that ~(':'|'|'|'\r'|'\n'|'['|']')+ is dangerous: it matches any run of characters other than :, |, [, ] and line breaks, so it can easily swallow input you intended for other tokens.
* Because the lexer works like this, input like 12 will never be tokenised as a VERSIONNUM token, since INT matches it too and occurs before VERSIONNUM. Fix it by doing something like this:
statement : ...
| KEYWORD ASSIGN versionnum
| ...
;
versionnum : VERSIONNUM
| INT
;
...
INT : [0-9]+;
...
VERSIONNUM : [0-9]* '.' [0-9]+ ('.' [0-9]+)?
;
...
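The ordering effects described in this answer can be reproduced with a short Python sketch. The first_token helper is hypothetical and the regular expressions are stand-ins for the lexer rules; VERSIONNUM is simplified to the first alternative of the rule in the question.

```python
import re

def first_token(text, rules):
    """Return (rule name, lexeme) for the first token:
    longest match wins, and the earliest rule wins on a tie."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.group() and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Regex stand-ins for the lexer rules under discussion.
TEXT = ("TEXT", r"[^:|\r\n\[\]]+")
COMMENT = ("COMMENT", r"#[^\r\n]*")
INT = ("INT", r"[0-9]+")
VERSIONNUM = ("VERSIONNUM", r"[0-9]+(?:\.[0-9]+)?")  # simplified: first alternative only

line = "# System Information"
text_first = first_token(line, [TEXT, COMMENT])
comment_first = first_token(line, [COMMENT, TEXT])
twelve = first_token("12", [INT, VERSIONNUM])

print(text_first)     # ('TEXT', '# System Information')
print(comment_first)  # ('COMMENT', '# System Information')
print(twelve)         # ('INT', '12')
```

Both rules match the whole comment line, so only their order in the grammar decides which token type is produced.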
I have to parse the following query using ANTLR:
sys_nameLIKEvalue
Here sys_name is a variable which consists of lower case letters and underscores.
LIKE is a fixed keyword.
value is a variable which can contain lower case letters, upper case letters and digits.
Below is the grammar I am using:
expression : parameter 'LIKE' values EOF;
parameter : (ID);
ID : (LOWERCASE) (LOWERCASE | UNDERSCORE)* ;
values : (VALUE);
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
NUMBER : '0'..'9' ;
UNDERSCORE : '_' ;
Test Case 1
Input : sys_nameLIKEabc
error thrown : line 1:8 missing 'LIKE' at 'LIKEabc'
Test Case 2
Input : sysnameLIKEabc
error thrown : line 1:0 mismatched input 'sysnameLIKEabc' expecting ID
A literal token inside your parser rule will be translated into a plain lexer rule. So, your grammar really looks like this:
expression : parameter LIKE values EOF;
parameter : ID;
values : VALUE;
LIKE : 'LIKE';
ID : LOWERCASE (LOWERCASE | UNDERSCORE)* ;
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
// Fragment rules will never become tokens of their own: good practice!
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment NUMBER : '0'..'9' ;
fragment UNDERSCORE : '_' ;
Since lexer rules are greedy, and if two or more lexer rules match the same number of characters the first will "win", your input is tokenized as follows:
Input: sys_nameLIKEabc, 2 tokens:
sys_name: ID
LIKEabc: VALUE
Input: sysnameLIKEabc, 1 token:
sysnameLIKEabc: VALUE
So, the token LIKE will never be created with your test input, and none of your parser rules will ever match. It also seems a bit odd to parse input without any delimiters, like spaces.
To fix your issue, you will either have to introduce delimiters, or disallow uppercase letters in VALUE.
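The two tokenizations listed above can be reproduced with a minimal Python sketch. It only models "longest match wins, first rule wins on a tie"; the tokenize helper and the regular expressions are assumptions standing in for the LIKE, ID and VALUE rules, not ANTLR's actual machinery.

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex stand-ins for the three token rules, in grammar order.
LIKE_RULES = [
    ("LIKE", r"LIKE"),
    ("ID", r"[a-z][a-z_]*"),
    ("VALUE", r"[a-zA-Z0-9]+"),
]

case1 = tokenize("sys_nameLIKEabc", LIKE_RULES)
case2 = tokenize("sysnameLIKEabc", LIKE_RULES)
print(case1)  # [('ID', 'sys_name'), ('VALUE', 'LIKEabc')]
print(case2)  # [('VALUE', 'sysnameLIKEabc')]
```

In the first case LIKE does match four characters at position 8, but VALUE matches seven from the same position, so the LIKE token is never produced.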
I'm looking for a grammar for analyzing two types of sentences, that is, words separated by white space:
ID1: sentences with words not beginning with numbers
ID2: sentences with words not beginning with numbers and numbers
Basically, the structure of the grammar should look like
ID1 separator ID2
ID1: Word can contain number like Var1234 but not start with a number
ID2: Same as above but 1234 is allowed
separator: e. g. '='
@Bart
I just tried to add two tokens, '_' and '"', as lexer rule Special for later use in lexer rule Word.
Even though I haven't used Special in the following grammar, I get the following error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?
grammar Sentence1b1;
tokens
{
TCUnderscore = '_' ;
TCQuote = '"' ;
}
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: ( Word | Int )+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
;
Special
: ( TCUnderscore | TCQuote )
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
fragment Digit
: '0'..'9'
;
Lexer-rule Special shall then be used in lexer-rule Word:
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
I'd go for something like this:
grammar Sentence;
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: (Word | Int)+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Digit
: '0'..'9'
;
which will parse the input:
Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed
as follows:
EDIT
To keep the lexer rules nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation):
// wrong!
Special : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote : '"';
Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:
// wrong!
TCUnderscore : '_';
TCQuote : '"';
Special : (TCUnderscore | TCQuote);
the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:
The following token definitions can never be matched because prior tokens match the same input: ...
If you make TCUnderscore and TCQuote fragment rules, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:
// good!
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).
// wrong!
parse : TCUnderscore;
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
I'm not sure if this fits your needs, but with Bart's help in my post
ANTLR - identifier with whitespace
I came up with this grammar:
grammar PropertyAssignment;
assignment
: id_nodigitstart '=' id_digitstart EOF
;
id_nodigitstart
: ID_NODIGITSTART+
;
id_digitstart
: (ID_DIGITSTART|ID_NODIGITSTART)+
;
ID_NODIGITSTART
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
ID_DIGITSTART
: ('0'..'9'|'a'..'z'|'A'..'Z')+
;
WS : (' ')+ {skip();}
;
"a name = my 4value" works while "4a name = my 4value" causes an exception.
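A rough Python model shows why the second input is rejected. The token_types and is_assignment helpers are hypothetical names, the regular expressions stand in for the lexer rules, and the check only mirrors the shape of the assignment rule; it is a sketch, not generated ANTLR code.

```python
import re

# Regex stand-ins for the lexer rules, in grammar order ('=' is the literal token).
RULES = [
    ("ID_NODIGITSTART", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("ID_DIGITSTART", r"[0-9a-zA-Z]+"),
    ("EQ", r"="),
    ("WS", r" +"),
]

def token_types(text):
    """Token types for text: longest match wins, earliest rule wins ties, WS is skipped."""
    types, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            types.append(best_name)
        pos += best_len
    return types

def is_assignment(text):
    """Mirror of: id_nodigitstart '=' id_digitstart EOF."""
    types = token_types(text)
    if types.count("EQ") != 1:
        return False
    eq = types.index("EQ")
    left, right = types[:eq], types[eq + 1:]
    # Left side: one or more ID_NODIGITSTART; right side: one or more of either ID type.
    return bool(left) and all(t == "ID_NODIGITSTART" for t in left) and bool(right)

good = is_assignment("a name = my 4value")
bad = is_assignment("4a name = my 4value")
print(good)  # True
print(bad)   # False
```

The word 4a can only be lexed as ID_DIGITSTART, which the left-hand side of the assignment does not accept.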