ANTLR rule works on its own, but fails when included in another rule - antlr

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
@lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set ANTLRWorks to parse just the second line (commenting out the first), and I get the proper defConfigStatement token emitted with the proper expression following it. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), configOptions comes back as an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?

When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
[image: parse tree exported from the ANTLRWorks debugger]
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLRWorks' debugger (press CTRL+D to launch it). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be caused by posting a minimal grammar). You could also try changing the port number the debugger connects to, since the port may already be in use (see: File -> Preferences -> Debugger-tab).

Related

ANTLR Exclude keywords while parsing a string

I'm trying to make the grammar for a rather simple language using ANTLR4. It's supposed to process some theater-related text. There are just 3 rules.
1 - Any text that starts with a tab (\t), should be just printed out.
	It was a rather warm
	Summer day.
2 - In case the text doesn't start with a tab, it'll most likely contain a character name. For example:
Captain Go forth, my minions!
It would be perfect to grab character name and text they're saying separately.
3 - And there are commands, that also start with a tab, followed by a keyword and some arguments, kind of like this:
	lights ON
	curtain OPEN
This is my grammar:
grammar Theater;
module: statement+ EOF;
statement: function | print | print_with_name;
function: '\t' command NL;
command: lights | curtain;
lights: 'lights' WS ('ON' | 'OFF');
curtain: 'curtain' WS ('OPEN' | 'CLOSE');
print: PRINT;
PRINT: '\t' .*? NL NL;
print_with_name: PRINT_WITH_NAME;
PRINT_WITH_NAME: ~[ \t\r\n] .*? NL NL;
NL: '\r\n' | '\r' | '\n';
WS: [ \t]+?;
I run this on the following test file:
	It was a rather warm
	Summer day.
Captain Go forth, my minions!
	lights ON
	curtain OPEN
And these are tokens I get:
[#0,0:22='\tIt was a rather warm\r\n',<PRINT>,1:0]
[#1,23:36='\tSummer day.\r\n',<PRINT>,2:0]
[#2,37:67='Captain Go forth, my minions!\r\n',<PRINT_WITH_NAME>,3:0]
[#3,68:79='\tlights ON\r\n',<PRINT>,4:0]
[#4,80:94='\tcurtain OPEN\r\n',<PRINT>,5:0]
[#5,95:94='<EOF>',<EOF>,6:0]
print and print with name both work as expected. Commands, on the other hand, are being treated as print. I guess, this is because those are lexer rules, but commands are parser rules.
Is there any way I can make it work without converting all commands to lexer rules? I tried hard to write something like "treat all text as Print, except when it starts with one of the keywords". But couldn't really find anything that would work. I'm only starting with antlr, so I must be missing something.
I don't expect you to write the grammar for me. Just mentioning a feature I should use would be perfect.
Lexer modes can be helpful here, which is a way to nudge the lexer in the right direction (make it a bit context sensitive).
To use lexer modes, you must divide the lexer- and parser-grammar into separate files. Here is TheaterLexer.g4:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(DialogMode);
K_Lights : '\tlights' -> mode(CommandMode);
K_Curtain : '\tcurtain' -> mode(CommandMode);
Tab : '\t' -> skip, mode(TabMode);
mode DialogMode;
DialogText : ~[\r\n]+;
DialogNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode CommandMode;
CommandText : ~[\r\n]+;
CommandNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode TabMode;
LiteralText : ~[\r\n]+;
LiteralNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
And the parser part (put it in TheaterParser.g4):
parser grammar TheaterParser;
options { tokenVocab=TheaterLexer; }
parse
: file EOF
;
file
: atom*
;
atom
: literal
| dialog
| command
;
literal
: LiteralText+
;
dialog
: Name DialogText+
;
command
: K_Lights CommandText+
| K_Curtain CommandText+
;
If you now generate the lexer and parser classes and run the following Java code:
String source =
"\tIt was a rather warm\n" +
"\tSummer day.\n" +
"Captain Go forth, my minions!\n" +
"\tlights ON\n" +
"\tcurtain OPEN";
TheaterLexer lexer = new TheaterLexer(CharStreams.fromString(source));
TheaterParser parser = new TheaterParser(new CommonTokenStream(lexer));
ParseTree root = parser.parse();
System.out.println(root.toStringTree(parser));
the following will be printed to your console:
(parse
(file
(atom
(literal It was a rather warm Summer day.))
(atom
(dialog Captain Go forth, my minions!))
(atom
(command \tlights ON))
(atom
(command \tcurtain OPEN))) <EOF>)
(the indentation is added for readability)
Note that you can use just a single mode, but I assumed you'd want to treat the tokens differently in the different modes. If this is not the case, you could just do:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(Step2Mode);
K_Lights : '\tlights' -> mode(Step2Mode);
K_Curtain : '\tcurtain' -> mode(Step2Mode);
Tab : '\t' -> skip, mode(Step2Mode);
mode Step2Mode;
Text : ~[\r\n]+;
NewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
and change the parser rules accordingly.
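To make the mode mechanism more concrete without generating anything, here is a small hand-written Java sketch that mimics the single-mode lexer above. In the default mode only the start of a line decides the token type; the sketch then switches to a second mode that reads the rest of the line as text and switches back at the newline. The class ModeSketch, its Mode enum and the "Name(...)" / "Text(...)" output format are made up for illustration and are not part of ANTLR. One simplification is assumed: Name also stops at a line break.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of lexer modes; NOT code generated by ANTLR.
final class ModeSketch {
    enum Mode { DEFAULT, TEXT }

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < src.length()) {
            if (mode == Mode.DEFAULT) {
                // DEFAULT mode: only the beginning of the line is inspected.
                if (src.startsWith("\tlights", i)) {
                    tokens.add("K_Lights");
                    i += "\tlights".length();
                } else if (src.startsWith("\tcurtain", i)) {
                    tokens.add("K_Curtain");
                    i += "\tcurtain".length();
                } else if (src.charAt(i) == '\t') {
                    i++; // Tab: skipped, produces no token
                } else {
                    // Name: run of characters up to the first space, tab or
                    // line break (stopping at line breaks is a simplification).
                    int start = i;
                    while (i < src.length() && !isBlankOrNewline(src.charAt(i))) i++;
                    if (i > start) tokens.add("Name(" + src.substring(start, i) + ")");
                }
                mode = Mode.TEXT; // corresponds to "-> mode(Step2Mode)"
            } else {
                // TEXT mode: everything up to the end of the line is one Text token.
                int start = i;
                while (i < src.length() && !isNewline(src.charAt(i))) i++;
                if (i > start) tokens.add("Text(" + src.substring(start, i) + ")");
                // NewLine: skipped, then back to the default mode.
                while (i < src.length() && isNewline(src.charAt(i))) i++;
                mode = Mode.DEFAULT;
            }
        }
        return tokens;
    }

    private static boolean isNewline(char c) {
        return c == '\n' || c == '\r';
    }

    private static boolean isBlankOrNewline(char c) {
        return c == ' ' || c == '\t' || isNewline(c);
    }
}
```

Calling ModeSketch.tokenize("\tlights ON\nCaptain Hi") yields K_Lights, Text( ON), Name(Captain), Text( Hi): the same characters end up in different token types depending on the mode the lexer is in when it reaches them.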

ANTLR4 grammar rule that cannot be reached from start rule affects language

The following minimal grammar shows the issue:
grammar test;
call : exp LP exp RP ;
exp : exp LP exp RP | ID;
ID : [a-z] ;
LP : '(' ;
RP : ')' ;
Newline : '\r\n' | '\n' ;
If I use call as the start rule, then the generated parser will gladly parse the following input:
f(x)
(I tried it in ANTLR lab, which probably uses the Java target, and locally using the C++ target with ANTLR 4.9.3.)
If I now add the following rule to the grammar, but keep call as the start rule, then the same input does not match call anymore.
callWithNewline : call Newline;
Why does callWithNewline affect whether call matches?
If I change the input to have a newline character after it, it suddenly matches call in ANTLR lab (even though the newline is of course not part of the match), but not with the C++ target, so the targets behave slightly differently here.
I ran into this behavior while unit testing subrules. Parsing with a full grammar that contains this kind of subgrammar somewhere lower in the hierarchy does not appear to lead to issues.
Edit
The issue still occurs if I remove the ambiguity
grammar test;
callWithNewline : call Newline ;
call : exp LP ID RP ;
exp : exp LP ID RP | ID;
ID : [a-z] ;
LP : '(' ;
RP : ')' ;
Newline : '\r\n' | '\n' ;

Parsing letter ranges with ANTLR

I have the following parser rules:
defDirective : defType whiteSpace letterSpec (whiteSpace? COMMA whiteSpace? letterSpec)*;
defType :
DEFBOOL | DEFBYTE | DEFINT | DEFLNG | DEFLNGLNG | DEFLNGPTR | DEFCUR |
DEFSNG | DEFDBL | DEFDATE |
DEFSTR | DEFOBJ | DEFVAR
;
letterSpec : universalLetterRange | letterRange | singleLetter;
singleLetter : RESTRICTED_LETTER;
universalLetterRange : upperCaseA whiteSpace? MINUS whiteSpace? upperCaseZ;
upperCaseA : {_input.Lt(1).Text.Equals("A")}? RESTRICTED_LETTER;
upperCaseZ : {_input.Lt(1).Text.Equals("Z")}? RESTRICTED_LETTER;
letterRange : firstLetter whiteSpace? MINUS whiteSpace? lastLetter;
firstLetter : RESTRICTED_LETTER;
lastLetter : RESTRICTED_LETTER;
whiteSpace : (WS | LINE_CONTINUATION)+;
with the relevant Lexer Rules:
RESTRICTED_LETTER : [a-zA-Z];
MINUS : '-';
COMMA : ',';
WS : [ \t];
LINE_CONTINUATION : [ \t]* UNDERSCORE [ \t]* '\r'? '\n';
and the DefTypes matching their camel-case spelling.
Now when I try to test this on the following inputs, it works exactly as expected:
DefInt I,J,K
DefBool A-Z
It does not work, however, on arbitrary letter ranges (see rule letterRange). When I use the input DefByte B-F, I get the error message "line 1:8 mismatched input 'B' expecting RESTRICTED_LETTER".
I've tried expressing RESTRICTED_LETTER as a range ('A'..'Z'|'a'..'z'), but that didn't change the error message.
When changing the first whiteSpace in defDirective to whiteSpace+ the error message gets a little longer (now including WS and LINE_CONTINUATION in the expected alternatives).
Also the parse-tree generated by the IntelliJ ANTLR Plugin suddenly starts recognizing the F as a singleLetter, which it previously didn't.
This behaviour seems to be consistent between the target languages Java and CSharp.
Previously the rule used to be a lot more relaxed, but that led to incorrect parse-trees, so I kinda want to fix this.
How can I correctly recognize letterRange here?
So ... @BartKiers had the right suspicion. The lexer rules given above weren't all the rules involved in the process.
The full grammar contains a lexer rule B_CHAR : B that's used in a special case of an unrelated grammar rule. That B_CHAR took precedence over RESTRICTED_LETTER when lexing the input stream.
The grammar rules presented are correct (and work fine), but the B_CHAR rule needs to be removed so that the lexer no longer produces that token.

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issues there is the fact that strings and transpose both use single quotes. The solution I was thinking of was to define the STRING lexer rule roughly as follows:
(if the previous token is not ')', '}', ']' or [a-zA-Z0-9]) then match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note that I do not want to consume the previous token if the condition is true).
Does anyone know a workaround for this problem, given that ANTLR4 does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1), which returns the previously consumed character (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
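As a rough sketch of the condition such a check would evaluate, the plain Java helper below decides whether a single quote is a transpose operator by looking at the character immediately before it. QuoteLookbehind and isTranspose are hypothetical names, not ANTLR API; inside a lexer predicate the previous character would come from _input.LA(-1) instead of a string index. Note that this inspects the previous character rather than the previous token, so whitespace between an identifier and the quote changes the answer.

```java
// Hypothetical helper; mirrors the test a lexer predicate could make on _input.LA(-1).
final class QuoteLookbehind {

    // Returns true if the quote at quoteIndex follows ')', '}', ']' or an
    // ASCII letter or digit, i.e. it should be read as transpose, not as
    // the start of a string.
    static boolean isTranspose(String src, int quoteIndex) {
        if (quoteIndex == 0) {
            return false; // nothing precedes the quote, so it starts a string
        }
        char prev = src.charAt(quoteIndex - 1);
        return prev == ')' || prev == '}' || prev == ']'
                || (prev >= 'a' && prev <= 'z')
                || (prev >= 'A' && prev <= 'Z')
                || (prev >= '0' && prev <= '9');
    }

    public static void main(String[] args) {
        System.out.println(isTranspose("a'", 1));       // true: quote follows a letter
        System.out.println(isTranspose("x = 'hi'", 4)); // false: quote follows a space
    }
}
```

The STRING rule would then only be allowed to match when this condition is false.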
You can also use lexer modes to deal with this kind of thing, but your lexer has to be defined in its own file. The idea is to go from a state that can match some tokens to another state that can match different ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

How to use similar lexers

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the difference between CMD_NAME, ARG and FILE_NAME is not large: CMD_NAME must be lower case letters, ARG can also have upper case letters, digits and "_", and FILE_NAME can additionally have ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on such technique as predict to deal with this? Is there a pure EBNF solution other than relying on host programming language?
Thanks.
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs, and the input abc qualifies as both. When several lexer rules match the same input, the lexer picks the one listed first, so (as you suspected) CMD_NAME wins.
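The tie-break can be imitated with a few regular expressions in plain Java. This is only a sketch under the assumption that the lexer prefers the longest match and, among equally long matches, the rule listed first; ANTLR's actual lexer is more involved. RulePriority and winningRule are invented names, and the patterns are hand translations of the lexer rules in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: longest match wins, ties go to the rule listed first.
final class RulePriority {
    // LinkedHashMap keeps the rules in the order they appear in the grammar.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("INCLUDE", Pattern.compile("#include"));
        RULES.put("DOUBLE_QUOTE", Pattern.compile("\""));
        RULES.put("CMD_NAME", Pattern.compile("[a-z]*"));
        RULES.put("ARG", Pattern.compile("[a-zA-Z0-9_]+"));
        RULES.put("FILE_NAME", Pattern.compile("[a-zA-Z0-9_.]+"));
    }

    // Returns the name of the rule that would claim the start of the input,
    // or null if no rule matches at least one character.
    static String winningRule(String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly greater: a later rule only wins with a LONGER match.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }
}
```

With these rules, winningRule("abc") returns CMD_NAME even though ARG and FILE_NAME match the same three characters, while winningRule("abc.txt") returns FILE_NAME because only that rule can match the whole input.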
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
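The difference between the two quantifiers is easy to check with Java's regex engine, which uses the same * and + meaning. EmptyMatch and matchesEmpty are illustrative names only.

```java
import java.util.regex.Pattern;

// Shows that "zero or more" accepts the empty string while "one or more" does not.
final class EmptyMatch {

    // True if the whole (empty) input is accepted by the given regex.
    static boolean matchesEmpty(String regex) {
        return Pattern.matches(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(matchesEmpty("[a-z]*")); // true: zero letters are enough
        System.out.println(matchesEmpty("[a-z]+")); // false: at least one letter is required
    }
}
```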
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).
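To see why the terminator removes the guesswork, here is a small Java sketch that groups whitespace-separated words into commands purely by splitting on ';'. CommandSplitter and split are hypothetical names; this is not how the generated parser works internally, it only shows that the grouping is unambiguous once each command ends with SEMI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: groups words into commands using ';' as the terminator.
final class CommandSplitter {

    // Each inner list is one command: the first word is the command name,
    // the remaining words are its arguments.
    static List<List<String>> split(String input) {
        List<List<String>> commands = new ArrayList<>();
        for (String part : input.split(";")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty pieces, e.g. after a trailing ';'
            }
            commands.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return commands;
    }
}
```

Here split("a b c d;") gives one command with three arguments, and split("a b; c d;") gives two commands with one argument each.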