antlr4 two lexer rule match the same string - antlr

I'm currently using antlr4 to build a parser, but I encountered a problem which I tried my best but didn't figure out. Can you help me to explaain and solve it ?
# grammer file : PluginDoc.g4:
grammer PluginDoc
pluginDef : pluginName | pluginDesc;
pluginName : PluginName IDENTIFIER;
pluginDesc : PluginDesc TEXT;
PluginName '#pluginName'
PluginDesc '#pluginDesc'
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
input content is:
#pluginName kafka
#pluginDesc abc
If I put IDENTIFIER before TEXT, I will get "mismatched input 'abc' expecting TEXT"
If I put TEXT before IDENTIFIER, I will get "mismatched input 'kafka' expecting IDENTIFIER"
Looks like both IDENTIFIER and TEXT are matched, how can I only match IDENTIFIER in pluginName and only match TEXT in pluginDesc ?

First of all, you have several errors in the grammar that you posted:
The header of the file should specify grammar, not grammer. Your Lexer tokens PluginName and PluginDesc do not have a colon in front of them and semicolon to terminate them. It is also an (unwritten?) rule to write your parser rules as all lower-case and your lexer rules as all upper-case.
grammar PluginDoc;
pluginDef : pluginName | pluginDesc;
pluginName : PLUGIN_NAME IDENTIFIER;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME : '#pluginName';
PLUGIN_DESC : '#pluginDesc';
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
Some of the problems that I encountered while testing your grammar were due to the unhandled whitespace. First of all, you should include a Lexer rule to skip the whitespace at the end of the file after all of the other Lexer rules.
WS: [ \n\t\r]+ -> skip;
Next, there is a problem with your TEXT and IDENTIFIER clashing with each other. When the character stream is tokenized by the Lexer, kafka and abc can be both IDENTIFIER and TEXT token. Since the Lexer lexes in a top-down fashion, they are both tokenized as whateve Lexer rule comes first in your grammar. This causes the error that you encounter - whatever you define as the second rule cannot be matched in the parser because it was not sent in as a token.
As suggested by Lucas, you should probably match both of these as TEXT and do the subsequent checking for validity of the input in your Listener/Visitor.
grammar PluginDoc;
pluginDef : (pluginName | pluginDesc)* EOF;
pluginName : PLUGIN_NAME TEXT;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME: '#pluginName';
PLUGIN_DESC: '#pluginDesc';
TEXT : ~[ \r\n\t]+;
WS: [ \r\n\t]+ -> skip;
I also changed the pluginDef Parser rule to
pluginDef : (pluginName | pluginDesc)* EOF;
since it was my impression that you want to input both #pluginName X and #pluginDesc Y at once and identify them. If this is not the case, feel free to change back to what you had before.
The resulting AST produced by the modified grammar above onyour sample input:
You can also run this with a text file as an input.

Related

ANTLR proper ordering of grammar rules

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| WORD #wordAtom
;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.
ANTLR's lexer tries to match as much characters as possible, so both <<WORD>> and <WORD> are matched by the lexer rul WORD. Therefor, there in these cases the tokens << and >> (or < and > for that matter) will not be created.
You can see what tokens are being created by running these lines of code:
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
which will print:
WORD <word>
WORD <<word>>
EOF <EOF>
What you could do is something like this:
func
: '<<' WORD '>>'
;
atom
: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| word #wordAtom
;
word
: WORD
| '<' WORD '>'
;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If i try to parse the input "Hello", everything is OK, BUT if if modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

ANTLR and Empty Strings Contradictory Behavior

I fundamentally don't understand how antlr works. Using the following grammar:
blockcomment : '/\*' ANYCHARS '\*/';
ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
I get a warning message when I compile the grammar file that says:
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
Fine. I want it to be able to match empty strings as: "/\*\*/" is perfectly valid. But when I run "/\*\*/" in the TestRig I get:
missing ANYCHARS at '*/'
Obviously I could just change it so that '/**/' is handled as a special case:
blockcomment : '/\*' ANYCHARS '\*/' | '/**/';
But that doesn't really address the underlying issue. Can someone please explain to me what I am doing wrong? How can ANTLR raise a warning about matching empty strings and then not match them at the same time?
add "fragment" to ANYCHARS? It will then do what you want.
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
The error message hints you to make ANYCHARS fragment.
Empty string cannot be matched as a token, that would end up with infinitely many empty tokens anywhere in the source.
You want to make the ANYCHARS part of the BLOCKCOMMENT token, rather than a separate token.
That is basically what fragments are good for - they simplify the lexer rules, but don't produce tokens.
BLOCKCOMMENT : '/*' ANYCHARS '*/';
fragment ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
EDIT: switched parser rule blockcomment to lexer rule BLOCKCOMMENT to enable fragment usage

ANTLR: mismatched input

I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
#lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLWorks' debugger (press CTRL+D to launch is). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be a caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be the port is already in use (see: File -> Preferences -> Debugger-tab).