ANTLR proper ordering of grammar rules

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| WORD #wordAtom
;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.

ANTLR's lexer tries to match as many characters as possible, so both <<word>> and <word> are matched in their entirety by the lexer rule WORD. Therefore, in these cases the tokens << and >> (or < and >, for that matter) are never created.
You can see what tokens are being created by running these lines of code:
import org.antlr.v4.runtime.*;

// testLexer is the class ANTLR generates from the grammar "test"
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
which will print:
WORD <word>
WORD <<word>>
EOF <EOF>
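To see why the longest match wins without generating a lexer, here is a minimal sketch in plain Java. It is only a rough model of the behaviour described above, not ANTLR's actual implementation: the class LongestMatchDemo and its regex-based rule table are made up for illustration, and the WORD pattern is my reading of CHAR+ from the original grammar (everything except space, |, {, }, tab and newlines).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a generated lexer; not ANTLR's real algorithm.
class LongestMatchDemo {

    // Rules in declaration order. The literals '<<' and '>>' come first,
    // WORD mirrors CHAR+ where SYMB still allows '<' and '>'.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'<<'", Pattern.compile("<<"));
        RULES.put("'>>'", Pattern.compile(">>"));
        RULES.put("WORD", Pattern.compile("[^ |{}\\n\\r\\t]+"));
        RULES.put("WS", Pattern.compile("[ \\t]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestName + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("<word> <<word>>").forEach(System.out::println);
    }
}
```

In this model '<<' only becomes a token of its own when nothing longer matches at that position, for example when it is followed by a space as in << word >>.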
What you could do is something like this:
func
: '<<' WORD '>>'
;
atom
: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| word #wordAtom
;
word
: WORD
| '<' WORD '>'
;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.

Related

ANTLR Exclude keywords while parsing a string

I'm trying to make the grammar for a rather simple language using ANTLR4. It's supposed to process some theater-related text. There are just 3 rules.
1 - Any text that starts with a tab (\t), should be just printed out.
It was a rather warm
Summer day.
2 - In case the text doesn't start with a tab, it'll most likely contain a character name. For example:
Captain Go forth, my minions!
It would be perfect to grab character name and text they're saying separately.
3 - And there are commands, that also start with a tab, followed by a keyword and some arguments, kind of like this:
lights ON
curtain OPEN
This is my grammar:
grammar Theater;
module: statement+ EOF;
statement: function | print | print_with_name;
function: '\t' command NL;
command: lights | curtain;
lights: 'lights' WS ('ON' | 'OFF');
curtain: 'curtain' WS ('OPEN' | 'CLOSE');
print: PRINT;
PRINT: '\t' .*? NL NL;
print_with_name: PRINT_WITH_NAME;
PRINT_WITH_NAME: ~[ \t\r\n] .*? NL NL;
NL: '\r\n' | '\r' | '\n';
WS: [ \t]+?;
I run this on the following test file:
It was a rather warm
Summer day.
Captain Go forth, my minions!
lights ON
curtain OPEN
And these are tokens I get:
[#0,0:22='\tIt was a rather warm\r\n',<PRINT>,1:0]
[#1,23:36='\tSummer day.\r\n',<PRINT>,2:0]
[#2,37:67='Captain Go forth, my minions!\r\n',<PRINT_WITH_NAME>,3:0]
[#3,68:79='\tlights ON\r\n',<PRINT>,4:0]
[#4,80:94='\tcurtain OPEN\r\n',<PRINT>,5:0]
[#5,95:94='<EOF>',<EOF>,6:0]
print and print with name both work as expected. Commands, on the other hand, are being treated as print. I guess, this is because those are lexer rules, but commands are parser rules.
Is there any way I can make it work without converting all commands to lexer rules? I tried hard to write something like "treat all text as Print, except when it starts with one of the keywords". But couldn't really find anything that would work. I'm only starting with antlr, so I must be missing something.
I don't expect you to write the grammar for me. Just mentioning a feature I should use would be perfect.
Lexer modes can help here: they are a way to nudge the lexer in the right direction by making it somewhat context-sensitive.
To use lexer modes, you must split the lexer and parser grammars into separate files. Here is TheaterLexer.g4:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(DialogMode);
K_Lights : '\tlights' -> mode(CommandMode);
K_Curtain : '\tcurtain' -> mode(CommandMode);
Tab : '\t' -> skip, mode(TabMode);
mode DialogMode;
DialogText : ~[\r\n]+;
DialogNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode CommandMode;
CommandText : ~[\r\n]+;
CommandNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
mode TabMode;
LiteralText : ~[\r\n]+;
LiteralNewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
And the parser part (put it in TheaterParser.g4):
parser grammar TheaterParser;
options { tokenVocab=TheaterLexer; }
parse
: file EOF
;
file
: atom*
;
atom
: literal
| dialog
| command
;
literal
: LiteralText+
;
dialog
: Name DialogText+
;
command
: K_Lights CommandText+
| K_Curtain CommandText+
;
If you now generate the lexer and parser classes and run the following Java code:
String source =
"\tIt was a rather warm\n" +
"\tSummer day.\n" +
"Captain Go forth, my minions!\n" +
"\tlights ON\n" +
"\tcurtain OPEN";
TheaterLexer lexer = new TheaterLexer(CharStreams.fromString(source));
TheaterParser parser = new TheaterParser(new CommonTokenStream(lexer));
ParseTree root = parser.parse();
System.out.println(root.toStringTree(parser));
the following will be printed to your console:
(parse
(file
(atom
(literal It was a rather warm Summer day.))
(atom
(dialog Captain Go forth, my minions!))
(atom
(command \tlights ON))
(atom
(command \tcurtain OPEN))) <EOF>)
(the indentation is added for readability)
Note that you can use just a single mode, but I assumed you'd want to treat the tokens differently in the different modes. If this is not the case, you could just do:
lexer grammar TheaterLexer;
Name : ~[ \t]+ -> mode(Step2Mode);
K_Lights : '\tlights' -> mode(Step2Mode);
K_Curtain : '\tcurtain' -> mode(Step2Mode);
Tab : '\t' -> skip, mode(Step2Mode);
mode Step2Mode;
Text : ~[\r\n]+;
NewLine : [\r\n]+ -> skip, mode(DEFAULT_MODE);
and change the parser rules accordingly.
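If the mode switching feels abstract, the following plain-Java sketch imitates what the multi-mode TheaterLexer does: the first token on a line decides the mode, and the mode decides how the rest of the line is tokenized. It is a hand-written approximation for illustration only (the class LexerModeSketch and the "Type:text" output format are made up here), not the code ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner that imitates the mode switches of the
// TheaterLexer grammar; it is not the code ANTLR generates.
class LexerModeSketch {

    enum Mode { DEFAULT, DIALOG, COMMAND, TAB }

    static boolean isNewline(char c) {
        return c == '\r' || c == '\n';
    }

    static String textTokenName(Mode mode) {
        switch (mode) {
            case DIALOG:  return "DialogText";
            case COMMAND: return "CommandText";
            default:      return "LiteralText";
        }
    }

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < src.length()) {
            if (mode == Mode.DEFAULT) {
                if (src.startsWith("\tlights", i)) {         // K_Lights -> mode(CommandMode)
                    tokens.add("K_Lights");
                    i += "\tlights".length();
                    mode = Mode.COMMAND;
                } else if (src.startsWith("\tcurtain", i)) { // K_Curtain -> mode(CommandMode)
                    tokens.add("K_Curtain");
                    i += "\tcurtain".length();
                    mode = Mode.COMMAND;
                } else if (src.charAt(i) == '\t') {          // Tab -> skip, mode(TabMode)
                    i++;
                    mode = Mode.TAB;
                } else {                                     // Name -> mode(DialogMode)
                    int end = i;
                    while (end < src.length() && src.charAt(end) != ' ' && src.charAt(end) != '\t') {
                        end++;
                    }
                    if (end == i) {
                        throw new IllegalArgumentException("No rule matches at index " + i);
                    }
                    tokens.add("Name:" + src.substring(i, end));
                    i = end;
                    mode = Mode.DIALOG;
                }
            } else {
                int end = i;
                if (isNewline(src.charAt(i))) {
                    // ...NewLine -> skip, mode(DEFAULT_MODE)
                    while (end < src.length() && isNewline(src.charAt(end))) {
                        end++;
                    }
                    mode = Mode.DEFAULT;
                } else {
                    // Rest of the line becomes one text token named after the current mode.
                    while (end < src.length() && !isNewline(src.charAt(end))) {
                        end++;
                    }
                    tokens.add(textTokenName(mode) + ":" + src.substring(i, end));
                }
                i = end;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "\tIt was a rather warm\n\tSummer day.\n"
                + "Captain Go forth, my minions!\n\tlights ON\n\tcurtain OPEN";
        tokenize(source).forEach(System.out::println);
    }
}
```

Note that the text tokens keep their leading space (for example " ON"), just as ~[\r\n]+ would in the grammar; trimming is left to whatever consumes the tokens.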

Handle strings starting with whitespaces

I'm trying to create an ANTLR v4 grammar with the following set of rules:
1. In case a line starts with #, it is considered a label:
#label
2. In case the line starts with cmd, it is treated as a command:
cmd param1 param2
3. If a line starts with whitespace, it is considered a string. All the text should be extracted. Strings can be multiline, so they end with an empty line:
A long string with multiline support
and any special characters one can imagine.
<-empty line here->
4. Lastly, in case a line starts with anything but whitespace, # and cmd, its first word should be considered a heading:
Heading A long string with multiline support
and any special characters one can imagine.
<-empty line here->
It was easy to handle labels and commands. But I am clueless about strings and headings.
What is the best way to separate whitespace word whitespace whatever doubleNewline and whatever doubleNewline? I've seen a lot of samples with whitespaces, but none of them works with both random text and newlines. I don't expect you to write actual code for me. Suggesting an approach will do.
Something like this should do the trick:
lexer grammar DemoLexer;
LABEL
: '#' [a-zA-Z]+
;
CMD
: 'cmd' ~[\r\n]+
;
STRING
: ' ' .*? NL NL
;
HEADING
: ( ~[# \t\r\nc] | 'c' ~'m' | 'cm' ~'d' ).*? NL NL
;
SPACE
: [ \t\r\n] -> skip
;
OTHER
: .
;
fragment NL
: '\r'? '\n'
| '\r'
;
This does not enforce the "beginning of the line" requirement. If that is something you want, you'll have to add semantic predicates to your grammar, which ties it to a target language. For Java, that would look like this:
LABEL
: {getCharPositionInLine() == 0}? '#' [a-zA-Z]+
;
See:
Semantic predicates in ANTLR4?
https://github.com/antlr/antlr4/blob/master/doc/predicates.md
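The first part of the HEADING rule, ( ~[# \t\r\nc] | 'c' ~'m' | 'cm' ~'d' ), is a way of saying "does not start with #, whitespace or cmd" without a negative lookahead. The sketch below spells out the same three alternatives as a plain Java check; the class and method names are made up for illustration and are not part of any generated code.

```java
// Hypothetical helper that mirrors the three prefix alternatives of HEADING.
class HeadingPrefixSketch {

    static boolean startsLikeHeading(String s) {
        if (s.isEmpty()) {
            return false;
        }
        char first = s.charAt(0);
        if ("# \t\r\nc".indexOf(first) < 0) {
            return true;                                   // ~[# \t\r\nc]
        }
        if (first != 'c') {
            return false;                                  // '#', whitespace or a line break
        }
        if (s.length() >= 2 && s.charAt(1) != 'm') {
            return true;                                   // 'c' ~'m'
        }
        return s.length() >= 3 && s.charAt(1) == 'm' && s.charAt(2) != 'd'; // 'cm' ~'d'
    }

    public static void main(String[] args) {
        String[] samples = {"Heading text", "cat food", "cmx", "cmd param1", "#label", " a string"};
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" -> " + startsLikeHeading(sample));
        }
    }
}
```

One consequence worth knowing: a first word that merely begins with cmd (say, cmdr) is rejected by this prefix as well, so in the grammar it would be picked up by the CMD rule rather than HEADING.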

Getting inconsistent results

I'm using ANTLR 4.6 and I was trying to do some clean up on my grammar and ended up breaking it. I found out that it's because I had made the following change that I assumed would have been equivalent. Can someone explain why they are different?
First try
DIGIT : [0-9] ;
LETTER : [a-zA-Z] ;
ident : ('_'|LETTER) ('_'|LETTER|DIGIT)* ;
Second try
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
ident : LETTER (LETTER | DIGIT)* ;
Both produce different results than this
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
IDENT : LETTER (LETTER | DIGIT)* ;
In both tries you changed ident from a lexer rule to a parser rule, because you wrote its name in lower case. Since that is the only difference between your second try and the last version, I assume that's the problem. Lexer rules define the tokens that are handed to the parser; parser rules define the way your AST is constructed from those tokens. Beware that changes like this can make a great difference in the way your AST is constructed.

ANTLR4 disambiguation of terminal tokens

This is my grammar in ANTLR4:
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
When I type in something like:
hello buddy
I got the following error message:
line 1 missing WORD at 'hello'
But if I change the grammar to
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [1-9]+ ;
WS : [ \t\r\n]+ -> skip ;
where now WORD is a number, everything is ok.
I strongly suspect that since the first grammar has two terminals with the same regex, the parser doesn't know which terminal a given word corresponds to.
Am I wrong in thinking this? If not, how would you solve this issue while keeping more than one terminal with the same regex?
You cannot usefully have two terminals that match the same pattern: the lexer will only ever produce the one that is defined first.
If your grammar actually needs to match [a-z]+ twice, then use a production like
r : WORD WORD ;
and the discrimination will be done at the parser / tree traversal level.
If either WORD or ID can be restricted to a fixed list, you could declare all the possible words as terminals then use them to define e.g. what a WORD can be.
"where now WORD is a number, everything is ok."
Not really:
$ alias
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Hello r -tokens data.txt
[#0,0:4='hello',<ID>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
line 1:0 missing WORD at 'hello'
When two lexer rules can match the same input, there is an ambiguity, and the lexer chooses the rule that is defined first. With a hello buddy input, the lexer produces two ID tokens under both grammars:
with the first grammar, because the match is ambiguous and ID comes first;
with the second grammar, because the input can only be matched as ID WS ID.
You can disambiguate with a predicate in the lexer rule like so:
grammar Question;
/* Ambiguous input */
file
: HELLO ID
;
HELLO
: [a-z]+ {getText().equals("hello")}? ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Execution:
$ grun Question file -tokens data.txt
[#0,0:4='hello',<HELLO>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
More on semantic predicates in The Definitive ANTLR Reference.
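As a rough mental model of what the predicate does, here is a small plain-Java sketch. It is a simplification and not how ANTLR evaluates predicates internally: each hypothetical Rule carries a regex and a predicate on the matched text, and a rule only counts as a candidate when its predicate accepts that text. The rule order (HELLO, ID, WS) is taken from the Question grammar; everything else is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredicateLexerSketch {

    // Hypothetical rule type: a name, a pattern, and a predicate on the matched text.
    static final class Rule {
        final String name;
        final Pattern pattern;
        final Predicate<String> predicate;

        Rule(String name, String regex, Predicate<String> predicate) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.predicate = predicate;
        }
    }

    // Same order as in the Question grammar: HELLO, ID, WS.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("HELLO", "[a-z]+", text -> text.equals("hello")),
            new Rule("ID", "[a-z]+", text -> true),
            new Rule("WS", "[ \\t\\r\\n]+", text -> true));

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule rule : RULES) {
                Matcher m = rule.pattern.matcher(input).region(pos, input.length());
                // Candidate only if it matches, is strictly longer than the current
                // best (so earlier rules win ties), and its predicate accepts the text.
                if (m.lookingAt() && m.end() > bestEnd && rule.predicate.test(m.group())) {
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!best.name.equals("WS")) { // WS -> skip
                tokens.add(best.name + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("hello buddy").forEach(System.out::println);
    }
}
```

Without the predicate on HELLO, this model would label every lowercase word HELLO, because HELLO is listed first and ties go to the earlier rule.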

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If I use this grammar with the sample text abc and d, I get a weird tree with an unexpected structure, as shown below (using IntelliJ and the ANTLR4 plugin):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;
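The effect of the rule order can be tried out with a small plain-Java sketch. This is only a model of the tie-breaking described above, under a couple of assumptions: LETTER and DIGIT are folded into the ID regex as if they were fragments, and the class RuleOrderSketch and its {name, regex} rule arrays are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderSketch {

    // Each rule is {name, regex}. LETTER and DIGIT are inlined into ID (assumption).
    static final String[] AND = {"AND", "and"};
    static final String[] ID = {"ID", "[a-zA-Z_][a-zA-Z_0-9]*"};
    static final String[] WS = {"WS", "[ \\r\\n\\t]+"};

    // Returns the token types for the input, trying the rules in the given order.
    static List<String> tokenTypes(String input, String[]... rules) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (String[] rule : rules) {
                Matcher m = Pattern.compile(rule[1]).matcher(input).region(pos, input.length());
                // Longest match wins; on equal length the rule listed first is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule[0];
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // whitespace is not shown (hidden channel)
                types.add(bestName);
            }
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println("ID before AND: " + tokenTypes("abc and d", ID, AND, WS));
        System.out.println("AND before ID: " + tokenTypes("abc and d", AND, ID, WS));
        System.out.println("AND before ID: " + tokenTypes("android", AND, ID, WS));
    }
}
```

With ID first, all three words come out as ID; with AND first, the middle one becomes AND. The last line shows that putting AND first does not break identifiers that merely start with "and", since the longer ID match still wins there.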