Match lowercase with ANTLR - antlr

I use ANTLRWorks for a simple grammar:
grammar boolean;
// [...]
lowercase_string
: ('a'..'z')+ ;
However, the lowercase_string doesn't match foobar according to the Interpreter (MismatchedSetException(10!={}). Ideas?

You can't use the .. operator inside parser rules like that. To match the range 'a' to 'z', create a lexer rule for it (lexer rules start with a capital).
Try it like this:
lowercase_string
: Lower+
;
Lower
: 'a'..'z'
;
or:
lowercase_string
: Lower
;
Lower
: 'a'..'z'+
;
Also see this previous Q&A: Practical difference between parser rules and lexer rules in ANTLR?

Related

Getting inconsistent results

I'm using ANTLR 4.6 and I was trying to do some clean up on my grammar and ended up breaking it. I found out that it's because I had made the following change that I assumed would have been equivalent. Can someone explain why they are different?
First try
DIGIT : [0-9] ;
LETTER : [a-zA-Z] ;
ident : ('_'|LETTER) ('_'|LETTER|DIGIT)* ;
Second try
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
ident : LETTER (LETTER | DIGIT)* ;
Both produce different results than this
DIGIT : [0-9] ;
LETTER : [a-zA-Z_] ;
IDENT : LETTER (LETTER | DIGIT)* ;
In both your tries you changed your ident rule from a lexer rule to a parser rule since you wrote it in lower case and since it's the only difference from the second try I assume that's the problem. The lexer rules are for defining tokens for parsing, parsing rules define the way you construct your AST. Beware that making changes like that can result in great differences in the way the your AST is constructed.

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If i try to parse the input "Hello", everything is OK, BUT if if modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

Antlr: Unintended behavior

Why this simple grammar
grammar Test;
expr
: Int | expr '+' expr;
Int
: [0-9]+;
doesn't match the input 1+1 ? It says "No method for rule expr or it has arguments" but in my opition it should be matched.
It looks like I haven't used ANTLR for a while... ANTLRv3 did not support left-recursive rules, but ANTLRv4 does support immediate left recursion. It also supports the regex-like character class syntax you used in your post. I tested this version and it works in ANTLRWorks2 (running on ANTLR4):
grammar Test;
start : expr
;
expr : expr '+' expr
| INT
;
INT : [0-9]+
;
If you add the start rule then ANTLR is able to infer that EOF goes at the end of that rule. It doesn't seem to be able to infer EOF for more complex rules like expr and expr2 since they're recursive...
There are a lot of comments below, so here is (co-author of ANTLR4) Sam Harwell's response (emphasis added):
You still want to include an explicit EOF in the start rule. The problem the OP faced with using expr directly is ANTLR 4 internally rewrote it to be expr[int _p] (it does so for all left recursive rules), and the included TestRig is not able to directly execute rules with parameters. Adding a start rule resolves the problem because TestRig is able to execute that rule. :)
I've posted a follow-up question with regard to EOF: When is EOF needed in ANTLR 4?
If your command looks like this:
grun MYGRAMMAR xxx -tokens
And this exception is thrown:
No method for rule xxx or it has arguments
Then this exception will get thrown with the rule you specified in the command above. It means the rule probably doesn't exist.
System.err.println("No method for rule "+startRuleName+" or it has arguments");
So startRuleName here, should print xxx if it's not the first (start) rule in the grammar. Put xxx as the first rule in your grammar to prevent this.

ANTLR: Lexer Throwing NoViableAltException

After a break of a few weeks, it's time to fight with ANTLR again...
Anyhow, I have the following Lexer tokens defined:
fragment EQ: '=';
fragment NE: '<>';
BOOLEAN_FIELD
: ('ISTRAINED'|'ISCITIZEN')
;
BOOLEAN_CONSTANT
: ('TRUE'|'FALSE'|'Y'|'N')
;
BOOLEAN_LOGICAL
: BOOLEAN_FIELD (EQ|NE) (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;
Unfortunately, the BOOLEAN_LOGICAL token is throwning NoViableAltException on simple terms such as "ISTRAINED = ISTRAINED".
I know some of the responses are going to be "This should be in the parser". It WAS previously in the parser, however, I'm trying to offload some simple items into the lexer since I just need a "Yes/No, is this text block valid?"
Any help is appreciated.
BOOLEAN_LOGICAL should not be a lexer rule. A lexer rule must (or should) be a single token. As a lexer rule, there cannot be any spaces between BOOLEAN_FIELD and (EQ|NE) (you might have skipped spaces during lexing, but that will only cause spaces to be skipped from inside parser rules!).
Do this instead:
boolean_logical
: BOOLEAN_FIELD (EQ|NE) (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;
which would also mean that EQ and NE cannot be fragment rules anymore:
EQ : '=';
NE : '<>';
This does look like it should be a parser rule. However, if you want to keep it as a lexer rule, you need to allow whitespace.
BOOLEAN_LOGICAL
: BOOLEAN_FIELD WS+ (EQ|NE) WS+ (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;