ANTLR Exclude keywords while parsing a string

I'm trying to make the grammar for a rather simple language using ANTLR4. It's supposed to process some theater-related text. There are just 3 rules.
1 - Any text that starts with a tab (\t), should be just printed out.
	It was a rather warm
	Summer day.
2 - In case the text doesn't start with a tab, it'll most likely contain a character name. For example:
Captain Go forth, my minions!
It would be perfect to grab character name and text they're saying separately.
3 - And there are commands, that also start with a tab, followed by a keyword and some arguments, kind of like this:
	lights ON
	curtain OPEN
This is my grammar:
grammar Theater;
module: statement+ EOF;
statement: function | print | print_with_name;
function: '\t' command NL;
command: lights | curtain;
lights: 'lights' WS ('ON' | 'OFF');
curtain: 'curtain' WS ('OPEN' | 'CLOSE');
print: PRINT;
PRINT: '\t' .*? NL NL;
print_with_name: PRINT_WITH_NAME;
PRINT_WITH_NAME: ~[ \t\r\n] .*? NL NL;
NL: '\r\n' | '\r' | '\n';
WS: [ \t]+?;
I run this on the following test file:
	It was a rather warm
	Summer day.
Captain Go forth, my minions!
	lights ON
	curtain OPEN
And these are tokens I get:
[#0,0:22='\tIt was a rather warm\r\n',<PRINT>,1:0]
[#1,23:36='\tSummer day.\r\n',<PRINT>,2:0]
[#2,37:67='Captain Go forth, my minions!\r\n',<PRINT_WITH_NAME>,3:0]
[#3,68:79='\tlights ON\r\n',<PRINT>,4:0]
[#4,80:94='\tcurtain OPEN\r\n',<PRINT>,5:0]
[#5,95:94='<EOF>',<EOF>,6:0]
print and print with name both work as expected. Commands, on the other hand, are being treated as print. I guess this is because print and print with name are lexer rules, while commands are parser rules.
Is there any way I can make it work without converting all commands to lexer rules? I tried hard to write something like "treat all text as Print, except when it starts with one of the keywords". But couldn't really find anything that would work. I'm only starting with antlr, so I must be missing something.
I don't expect you to write the grammar for me. Just mentioning a feature I should use would be perfect.

Lexer modes can be helpful here: they are a way to nudge the lexer in the right direction by making it a bit context-sensitive.
To use lexer modes, you must split the lexer and parser grammars into separate files. Here is TheaterLexer.g4:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(DialogMode);
K_Lights : '\tlights' -> mode(CommandMode);
K_Curtain : '\tcurtain' -> mode(CommandMode);
Tab : '\t' -> skip, mode(TabMode);
mode DialogMode;
DialogText : ~[\r\n]+;
DialogNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode CommandMode;
CommandText : ~[\r\n]+;
CommandNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode TabMode;
LiteralText : ~[\r\n]+;
LiteralNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
And the parser part (put it in TheaterParser.g4):
parser grammar TheaterParser;
options { tokenVocab=TheaterLexer; }
parse
: file EOF
;
file
: atom*
;
atom
: literal
| dialog
| command
;
literal
: LiteralText+
;
dialog
: Name DialogText+
;
command
: K_Lights CommandText+
| K_Curtain CommandText+
;
If you now generate the lexer and parser classes and run the following Java code:
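import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class Main {
    public static void main(String[] args) {
        String source =
            "\tIt was a rather warm\n" +
            "\tSummer day.\n" +
            "Captain Go forth, my minions!\n" +
            "\tlights ON\n" +
            "\tcurtain OPEN";
        // TheaterLexer and TheaterParser are the classes ANTLR generates from the two grammars
        TheaterLexer lexer = new TheaterLexer(CharStreams.fromString(source));
        TheaterParser parser = new TheaterParser(new CommonTokenStream(lexer));
        ParseTree root = parser.parse();
        System.out.println(root.toStringTree(parser));
    }
}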
the following will be printed to your console:
(parse
  (file
    (atom
      (literal It was a rather warm Summer day.))
    (atom
      (dialog Captain Go forth, my minions!))
    (atom
      (command \tlights ON))
    (atom
      (command \tcurtain OPEN))) <EOF>)
(the indentation is added for readability)
Note that you can use just a single mode, but I assumed you'd want to treat the tokens differently in the different modes. If this is not the case, you could just do:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(Step2Mode);
K_Lights : '\tlights' -> mode(Step2Mode);
K_Curtain : '\tcurtain' -> mode(Step2Mode);
Tab : '\t' -> skip, mode(Step2Mode);
mode Step2Mode;
Text : ~[\r\n]+;
NewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
and change the parser rules accordingly.
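To make the effect of the modes more concrete, here is a minimal plain-Java sketch of the decision the default mode makes for each line. It does not use ANTLR at all; the class name TheaterLineSketch, the method names and the bracketed output format are made up for this illustration, and it only mimics the three cases (keyword after a tab, tab alone, anything else).

```java
import java.util.ArrayList;
import java.util.List;

class TheaterLineSketch {

    // Assumption: only the two keywords from the grammar above exist.
    private static final String[] KEYWORDS = {"lights", "curtain"};

    /** Classifies one line the way the default mode picks a follow-up mode. */
    public static String classify(String line) {
        if (line.startsWith("\t")) {
            String rest = line.substring(1);
            for (String keyword : KEYWORDS) {
                if (rest.startsWith(keyword)) {
                    // Like K_Lights / K_Curtain: the rest of the line is command text.
                    String argument = rest.substring(keyword.length()).trim();
                    return "command[" + keyword + "|" + argument + "]";
                }
            }
            // Like Tab: the tab itself is dropped, the rest is literal text.
            return "literal[" + rest + "]";
        }
        // Like Name: everything up to the first space or tab is the speaker.
        int end = 0;
        while (end < line.length() && line.charAt(end) != ' ' && line.charAt(end) != '\t') {
            end++;
        }
        return "dialog[" + line.substring(0, end) + "|" + line.substring(end).trim() + "]";
    }

    /** Classifies every non-empty line of the source text. */
    public static List<String> classifyAll(String source) {
        List<String> result = new ArrayList<>();
        for (String line : source.split("\\R")) {
            if (!line.isEmpty()) {
                result.add(classify(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "\tIt was a rather warm\n\tSummer day.\n"
                + "Captain Go forth, my minions!\n\tlights ON\n\tcurtain OPEN";
        classifyAll(source).forEach(System.out::println);
    }
}
```

Like the lexer rules above, this sketch checks only the prefix, so a tab-indented line starting with a word such as "lightsaber" would also be taken as a lights command. If that matters, the keyword would need a following whitespace check in both the sketch and the grammar.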

Related

String Interpolation in Antlr4

I have a grammar that uses modes to do string interpolation:
Something along the lines of:
lexer grammar Example;
//default mode tokens
LBRACE: '{' -> pushMode(DEFAULT_MODE);
RBRACE: '}' -> popMode;
OPEN_STRING: '"' -> pushMode(STRING);
mode STRING;
ID_INTERPOLATION: '$' IDEN;
OPEN_EXPR_INTERPOLATION: '${' -> pushMode(DEFAULT_MODE);
TEXT: '$' | (~[$\r\n])+;
CLOSE_STRING: '"' -> popMode;
parser grammar ExampleParser;
options {tokenVocab = Example;}
test: string* EOF;
string: OPEN_STRING string_part* CLOSE_STRING;
string_part: TEXT | ID_INTERPOLATION | OPEN_EXPR_INTERPOLATION expr RBRACE;
//more rules that use LBRACE and RBRACE
Now this works and tokenizes everything mostly how I want it, but it does have 2 flaws.
1 - If there are too many RBRACE tokens, the lexer pops the initial default mode, which can glitch out the IDE instead of just showing an error.
2 - The token for closing a block and closing an interpolation is the same, so I cannot highlight them differently (this is the main one).
My IDE highlights based on tokens only, so this is a problem, I'd like to be able to highlight them differently. So basically I'd like a solution for this that makes the RBRACE a different token when it's in a string.
I'd prefer to do it without semantic predicates because I don't want to tie it down to a language, but if needed, I'm ok with it, I just might need a little more explanation because I haven't used them that much.
Thank you @sepp2k for helping me solve my issue.
It's a bit of a hack, but it does exactly what I need it to.
I solved it by changing my popMode on RBRACE to the following:
RBRACE: '}' {
if(_modeStack.size() > 0) {
popMode();
if(_mode != DEFAULT_MODE) {
setType(EXPR_INTERPOLATION);
}
}
};
I also changed my parser to be
string_part: TEXT | ID_INTERPOLATION | EXPR_INTERPOLATION expr EXPR_INTERPOLATION;
I know it's pretty hacky to change the token type under a specific circumstance, but it got the job done for me, so I'm gonna keep it unless I find a less hacky way to do this.
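For readers who have not used lexer actions much, the idea behind that action can be sketched in plain Java without ANTLR. This is only an illustration under assumptions: BraceModeSketch, its Mode enum and the CHAR token are invented here, and string text is emitted one character at a time instead of as a TEXT token. What it keeps from the action is the guard against popping an empty stack and the retagging of a closing brace that returns into a string.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BraceModeSketch {

    enum Mode { DEFAULT, STRING }

    /** Tokenizes a tiny subset: braces, quotes and "${"; every other character becomes CHAR. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modeStack = new ArrayDeque<>(); // saved modes, playing the role of _modeStack
        Mode mode = Mode.DEFAULT;                   // current mode, playing the role of _mode
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (mode == Mode.STRING) {
                if (input.startsWith("${", i)) {
                    tokens.add("EXPR_INTERPOLATION");
                    modeStack.push(mode);
                    mode = Mode.DEFAULT;
                    i += 2;
                    continue;
                }
                if (c == '"') {
                    tokens.add("CLOSE_STRING");
                    mode = modeStack.pop(); // never empty here: entering STRING always pushes
                } else {
                    tokens.add("CHAR");
                }
            } else if (c == '{') {
                tokens.add("LBRACE");
                modeStack.push(mode);
                mode = Mode.DEFAULT;
            } else if (c == '}') {
                String type = "RBRACE";
                if (!modeStack.isEmpty()) {      // guard: never pop the initial mode
                    mode = modeStack.pop();
                    if (mode != Mode.DEFAULT) {  // we returned into a string
                        type = "EXPR_INTERPOLATION";
                    }
                }
                tokens.add(type);
            } else if (c == '"') {
                tokens.add("OPEN_STRING");
                modeStack.push(mode);
                mode = Mode.STRING;
            } else {
                tokens.add("CHAR");
            }
            i++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{}"));
        System.out.println(tokenize("\"${x}\""));
        System.out.println(tokenize("}"));
    }
}
```

In this sketch a stray closing brace simply stays an RBRACE and leaves the mode alone, so the parser can report it as an ordinary syntax error.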
So I set out to implement an interpolated string parser using only ANTLR code (no host language code blocks). I found that this works well, including nested interpolated strings:
lexer grammar Lexer;
LeftBrace: '{';
RightBrace: '}' -> popMode;
Backtick: '`' -> pushMode(InterpolatedString);
Integer: [0-9]+;
Plus: '+';
mode InterpolatedString;
EscapedLeftBrace: '\\{' -> type(Grapheme);
EscapedBacktick: '\\`' -> type(Grapheme);
ExprStart: '{' -> type(LeftBrace), pushMode(DEFAULT_MODE);
End: '`' -> type(Backtick), popMode;
Grapheme: ~('{' | '`');
parser grammar Parser;
options {
tokenVocab = Lexer;
}
startRule: expression EOF;
interpolatedString: Backtick (Grapheme | interpolatedStringExpression)* Backtick;
interpolatedStringExpression: LeftBrace expression RightBrace;
expression
: expression Plus expression
| atom
;
atom: Integer | interpolatedString;
You can test it with input
`{`{`{`{`{`{`{`hello world`}`}`}`}`}`}`}`

ANTLR proper ordering of grammar rules

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| WORD #wordAtom
;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.
ANTLR's lexer tries to match as many characters as possible, so both <<WORD>> and <WORD> are matched by the lexer rule WORD. Therefore, in these cases the tokens << and >> (or < and >, for that matter) will not be created.
You can see what tokens are being created by running these lines of code:
// needs: import org.antlr.v4.runtime.*;
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
which will print:
WORD <word>
WORD <<word>>
EOF <EOF>
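The longest-match behaviour can also be reproduced without ANTLR. The following is a rough sketch using java.util.regex; the class LongestMatchSketch and the regular expressions are my own approximation of the lexer rules in the question (WORD is written as "anything except space, |, {, }, newline, carriage return and tab", which is what CHAR+ adds up to), not output of the ANTLR tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {

    // Insertion order is kept so that, on a tie, the rule defined first wins.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'<<'", Pattern.compile("<<"));
        RULES.put("'>>'", Pattern.compile(">>"));
        RULES.put("WS", Pattern.compile("[ \\t]"));
        RULES.put("NEWLINE", Pattern.compile("[\\n\\r]+"));
        RULES.put("PUNCT", Pattern.compile("[.,?!]+"));
        RULES.put("WORD", Pattern.compile("[^ |{}\\n\\r\\t]+"));
    }

    /** Repeatedly picks the rule that matches the most characters at the current position. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer only, so earlier rules keep priority on equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestRule + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("<word> <<word>>").forEach(System.out::println);
    }
}
```

Even though '<<' is listed first, it only matches two characters at the start of <<word>>, while WORD matches all eight, so WORD wins. Rule order only breaks ties between matches of equal length.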
What you could do is something like this:
func
: '<<' WORD '>>'
;
atom
: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| word #wordAtom
;
word
: WORD
| '<' WORD '>'
;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.

ANTLR trying to create a lexer rule that goes up to, but not including, some symbols

I'm using ANTLR4 to parse text adventure game dialogue files written in Yarn, so mostly free form text and loads of island grammars, and for the most part things are going smoothly but I am having an issue excluding certain text inside the Shortcut mode (when presenting options for the player to choose from).
Basically I need to write a rule to match anything except #, newline or <<. When it hits a << it needs to move into a new mode for handling expressions of various kinds or to just leave the current mode so that the << will get picked up by the already existing rules.
A cut down version of my lexer (ignoring rules for expressions):
lexer grammar YarnLexer;
NEWLINE : ('\n') -> skip;
CMD : '<<' -> pushMode(Command);
SHORTCUT : '->' -> pushMode(Shortcut);
HASHTAG : '#' ;
LINE_GOBBLE : . -> more, pushMode(Line);
mode Line;
LINE : ~('\n'|'#')* -> popMode;
mode Shortcut ;
TEXT : CHAR+ -> popMode;
fragment CHAR : ~('#'|'\n'|'<');
mode Command ;
CMD_EXIT : '>>' -> popMode;
// RULES FOR OPERATORS/IDs/NUMBERS/KEYWORDS/etc
CMD_TEXT : ~('>')+ ;
And the parser grammar (again ignoring all the rules for expressions):
parser grammar YarnParser;
options { tokenVocab=YarnLexer; }
dialogue: statement+ EOF;
statement : line_statement | shortcut_statement | command_statement ;
hashtag : HASHTAG LINE ;
line_statement : LINE hashtag? ;
shortcut_statement : SHORTCUT TEXT command_statement? hashtag?;
command_statement : CMD expression CMD_EXIT;
expression : CMD_TEXT ;
I have tested the Command mode when it is by itself and everything inside there is working fine, but when I try to parse my example input:
Where should we go?
-> the park
-> the zoo
-> Peter's house <<if $metPeter == true >>
ok shall we take the bus?
-> :<
-> ok
<<set $daySpent = true>>
my issue is the line:
-> Peter's house <<if $metPeter == true >>
gets matched completely as TEXT, and the CMD rule just gets ignored in favour of the far longer TEXT.
My first thought was to add < to the set but then I can't have text like:
-> :<
which should be perfectly valid. Any idea how to do this?
Adding a single left angle bracket to the exclusion list creates a single corner case that is easily handled:
TEXT : CHAR+ ;
CMD : '<<' -> pushMode(Command);
LAB : '<' -> type(TEXT) ;
fragment CHAR : ~('\n' | '#' | '<') ;
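To see what these rules do to the two problem lines, here is a small hand-written Java sketch of the same scanning logic. It is not generated by ANTLR; the class ShortcutTextSketch and the "TYPE text" output strings are made up for the illustration, and the scanner simply stops at <<, # or a newline where the real lexer would hand over to other rules or modes.

```java
import java.util.ArrayList;
import java.util.List;

class ShortcutTextSketch {

    /** Splits shortcut text into TEXT and CMD tokens; stops at '#', a newline or after "<<". */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == '#' || c == '\n') {
                break; // left for the other rules of the grammar
            }
            if (input.startsWith("<<", pos)) {
                tokens.add("CMD <<"); // the real lexer would push the Command mode here
                break;
            }
            if (c == '<') {
                tokens.add("TEXT <"); // LAB: a lone '<', retyped as TEXT
                pos++;
                continue;
            }
            int start = pos;
            while (pos < input.length() && "#\n<".indexOf(input.charAt(pos)) < 0) {
                pos++; // CHAR+ : everything up to '#', newline or '<'
            }
            tokens.add("TEXT " + input.substring(start, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Peter's house <<if $metPeter == true >>"));
        System.out.println(tokenize(":<"));
    }
}
```

One consequence worth noting: a lone < becomes its own token, so ":<" arrives as two TEXT tokens rather than one. The shortcut_statement rule in the question expects a single TEXT, so it would presumably need TEXT+ there.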

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
@lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLRWorks' debugger (press CTRL+D to launch it). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be that the port is already in use (see: File -> Preferences -> Debugger-tab).

ANTLR treating multiple EOLs as one?

I want to parse a language in which statements are separated by EOLs. I tried this in the lexer grammar (copied from an example in the docs):
EOL : ('\r'? '\n')+ ; // any number of consecutive linefeeds counts as a single EOL
and then used this in the parser grammar:
stmt_sequence : (stmt EOL)* ;
The parser rejected code with statements separated by one or more blank lines.
However, this was successful:
EOL : '\r'? '\n' ;
stmt_sequence : (stmt EOL+)* ;
I'm an ANTLR newbie. It seems like both should work. Is there something about greedy/nongreedy lexer scanning that I don't understand?
I tried this with both 3.2 and 3.4; I'm running the ANTLR IDE in Eclipse Indigo on OS X 10.6.
Thanks.
The error was not in the original grammar but in the input data. I was using an editor (in Eclipse) that automatically inserted tabs after an EOL, so my "blank lines" were not really blank.
I modified the grammar as follows:
fragment SPACE: ' ' | '\t';
EOL : ( '\r'? '\n' SPACE* )+;
This grammar works as expected.
The lesson here is that one must be careful with whitespace. The lexer may see whitespace in the input that the parser does not see (because it has already been sent to the hidden channel).
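The difference between the two EOL rules can be checked quickly with regular expressions. This is only a sketch: the class EolSketch and its method are invented here, and the two patterns are my translation of the lexer rules into java.util.regex, not something produced by ANTLR.

```java
import java.util.regex.Pattern;

class EolSketch {

    // Translation of: EOL : ('\r'? '\n')+ ;
    public static final Pattern STRICT_EOL = Pattern.compile("(\\r?\\n)+");

    // Translation of: EOL : ( '\r'? '\n' SPACE* )+ ;  with SPACE : ' ' | '\t'
    public static final Pattern LENIENT_EOL = Pattern.compile("(\\r?\\n[ \\t]*)+");

    /** True if the whole separator between two statements would form one EOL token. */
    public static boolean isSingleEol(Pattern eol, String separator) {
        return eol.matcher(separator).matches();
    }

    public static void main(String[] args) {
        String trulyBlank = "\n\n";     // a real blank line
        String withTab = "\n\t\n";      // "blank" line containing an auto-inserted tab
        System.out.println(isSingleEol(STRICT_EOL, trulyBlank));
        System.out.println(isSingleEol(STRICT_EOL, withTab));
        System.out.println(isSingleEol(LENIENT_EOL, withTab));
    }
}
```

With the strict pattern, the tab splits the separator into two EOL tokens, which (stmt EOL)* cannot accept but (stmt EOL+)* can. The lenient pattern absorbs the tab, at the cost of also swallowing any indentation in front of the next statement.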