ANTLR and Empty Strings Contradictory Behavior

I fundamentally don't understand how antlr works. Using the following grammar:
blockcomment : '/*' ANYCHARS '*/';
ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
I get a warning message when I compile the grammar file that says:
"non-fragment lexer rule 'ANYCHARS' can match the empty string"
Fine. I want it to be able to match empty strings, since "/**/" is perfectly valid. But when I run "/**/" in the TestRig I get:
missing ANYCHARS at '*/'
Obviously I could just change it so that '/**/' is handled as a special case:
blockcomment : '/*' ANYCHARS '*/' | '/**/';
But that doesn't really address the underlying issue. Can someone please explain to me what I am doing wrong? How can ANTLR raise a warning about matching empty strings and then not match them at the same time?

add "fragment" to ANYCHARS? It will then do what you want.

"non-fragment lexer rule 'ANYCHARS' can match the empty string"
The error message hints that you should make ANYCHARS a fragment.
An empty string cannot be matched as a token; that would end up producing infinitely many empty tokens anywhere in the source.
You want to make the ANYCHARS part of the BLOCKCOMMENT token, rather than a separate token.
That is basically what fragments are good for - they simplify the lexer rules, but don't produce tokens.
BLOCKCOMMENT : '/*' ANYCHARS '*/';
fragment ANYCHARS : ('a'..'z' | '\n' | 'r' | ' ' | '0'..'9')* ;
EDIT: switched parser rule blockcomment to lexer rule BLOCKCOMMENT to enable fragment usage
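For reference, a complete grammar you could feed to the TestRig might look like the sketch below. The grammar name, the file start rule, the WS rule, and the '\r' (in place of the original 'r', which looks like a typo) are my additions, not part of the original post:
grammar Comments;
file : BLOCKCOMMENT* EOF;   // assumed start rule so the TestRig has something to parse
BLOCKCOMMENT : '/*' ANYCHARS '*/';
fragment ANYCHARS : ('a'..'z' | '\n' | '\r' | ' ' | '0'..'9')* ;
WS : [ \r\n\t]+ -> skip;    // assumed, so whitespace outside comments doesn't cause token errors
With that, something like grun Comments file -tokens should report a single BLOCKCOMMENT token for the input /**/.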

Parsing letter ranges with ANTLR

I have the following parser rules:
defDirective : defType whiteSpace letterSpec (whiteSpace? COMMA whiteSpace? letterSpec)*;
defType :
DEFBOOL | DEFBYTE | DEFINT | DEFLNG | DEFLNGLNG | DEFLNGPTR | DEFCUR |
DEFSNG | DEFDBL | DEFDATE |
DEFSTR | DEFOBJ | DEFVAR
;
letterSpec : universalLetterRange | letterRange | singleLetter;
singleLetter : RESTRICTED_LETTER;
universalLetterRange : upperCaseA whiteSpace? MINUS whiteSpace? upperCaseZ;
upperCaseA : {_input.Lt(1).Text.Equals("A")}? RESTRICTED_LETTER;
upperCaseZ : {_input.Lt(1).Text.Equals("Z")}? RESTRICTED_LETTER;
letterRange : firstLetter whiteSpace? MINUS whiteSpace? lastLetter;
firstLetter : RESTRICTED_LETTER;
lastLetter : RESTRICTED_LETTER;
whiteSpace : (WS | LINE_CONTINUATION)+;
with the relevant Lexer Rules:
RESTRICTED_LETTER : [a-zA-Z];
MINUS : '-';
COMMA : ',';
WS : [ \t];
LINE_CONTINUATION : [ \t]* UNDERSCORE [ \t]* '\r'? '\n';
and the DefTypes matching their camel-case spelling.
Now when I try to test this on the following inputs, it works exactly as expected:
DefInt I,J,K
DefBool A-Z
It does not work, however, on arbitrary letter ranges (see rule letterRange). When I use the input DefByte B-F, I get the error message "line 1:8 mismatched input 'B' expecting RESTRICTED_LETTER"
I've tried expressing RESTRICTED_LETTER as a range ('A'..'Z'|'a'..'z'), but that didn't change anything about the error message.
When changing the first whiteSpace in defDirective to whiteSpace+ the error message gets a little longer (now including WS and LINE_CONTINUATION in the expected alternatives).
Also the parse-tree generated by the IntelliJ ANTLR Plugin suddenly starts recognizing the F as a singleLetter, which it previously didn't.
This behaviour seems to be consistent between the target languages Java and C#.
Previously the rule used to be a lot more relaxed, but that led to incorrect parse-trees, so I kinda want to fix this.
How can I correctly recognize letterRange here?
So ... @BartKiers had the right suspicion. The given lexer rules weren't all the rules involved in the process.
The full grammar contains a lexer rule B_CHAR : B that's used in a special case of an unrelated grammar rule. That B_CHAR took precedence over RESTRICTED_LETTER when lexing the input stream.
The grammar rules presented are correct (and work fine), but the B_CHAR token needs to be removed from the tokens the lexer emits.
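For illustration, a minimal sketch of that kind of clash, assuming B_CHAR matches the literal 'B' and is defined before RESTRICTED_LETTER (the full grammar isn't shown here):
// Both rules match the single character 'B'; since B_CHAR is listed first,
// a lone 'B' is always emitted as a B_CHAR token, never as RESTRICTED_LETTER.
B_CHAR            : 'B';
RESTRICTED_LETTER : [a-zA-Z];
So letterRange sees a B_CHAR token where it expects RESTRICTED_LETTER, which is exactly the mismatched-input error above; removing B_CHAR (or only emitting it where it is actually needed) resolves it.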

antlr4 two lexer rule match the same string

I'm currently using ANTLR 4 to build a parser, but I've run into a problem that I tried my best to solve and still couldn't figure out. Can you help me understand and solve it?
# grammer file : PluginDoc.g4:
grammer PluginDoc
pluginDef : pluginName | pluginDesc;
pluginName : PluginName IDENTIFIER;
pluginDesc : PluginDesc TEXT;
PluginName '#pluginName'
PluginDesc '#pluginDesc'
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
input content is:
#pluginName kafka
#pluginDesc abc
If I put IDENTIFIER before TEXT, I will get "mismatched input 'abc' expecting TEXT"
If I put TEXT before IDENTIFIER, I will get "mismatched input 'kafka' expecting IDENTIFIER"
Looks like both IDENTIFIER and TEXT can match. How can I match only IDENTIFIER in pluginName and only TEXT in pluginDesc?
First of all, you have several errors in the grammar that you posted:
The header of the file should specify grammar, not grammer. Your lexer rules PluginName and PluginDesc are missing the colon after the rule name and the semicolon to terminate them. Also, lexer rule names must start with an upper-case letter and parser rule names with a lower-case letter; the usual convention is to write lexer rules in all upper-case and parser rules in all lower-case.
grammar PluginDoc;
pluginDef : pluginName | pluginDesc;
pluginName : PLUGIN_NAME IDENTIFIER;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME : '#pluginName';
PLUGIN_DESC : '#pluginDesc';
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
Some of the problems that I encountered while testing your grammar were due to unhandled whitespace. First of all, you should include a lexer rule that skips whitespace, placed at the end of the file after all of the other lexer rules.
WS: [ \n\t\r]+ -> skip;
Next, there is a problem with your TEXT and IDENTIFIER clashing with each other. When the character stream is tokenized by the lexer, kafka and abc can each be matched by both the IDENTIFIER and the TEXT rule. Since the lexer lexes in a top-down fashion, both are tokenized as whichever lexer rule comes first in your grammar. This causes the error that you encounter: whatever you define as the second rule cannot be matched in the parser because no such token was ever produced.
As suggested by Lucas, you should probably match both of these as TEXT and do the subsequent checking for validity of the input in your Listener/Visitor.
grammar PluginDoc;
pluginDef : (pluginName | pluginDesc)* EOF;
pluginName : PLUGIN_NAME TEXT;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME: '#pluginName';
PLUGIN_DESC: '#pluginDesc';
TEXT : ~[ \r\n\t]+;
WS: [ \r\n\t]+ -> skip;
I also changed the pluginDef Parser rule to
pluginDef : (pluginName | pluginDesc)* EOF;
since it was my impression that you want to input both #pluginName X and #pluginDesc Y at once and identify them. If this is not the case, feel free to change back to what you had before.
The resulting parse tree produced by the modified grammar above on your sample input looks as expected. You can also run this with a text file as an input.
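If you do want the lexer itself to produce different token types depending on the preceding keyword, rather than validating in a listener, lexer modes are the usual ANTLR 4 tool for that. A rough sketch, split into lexer and parser grammars because modes are only allowed in a pure lexer grammar (the file names and mode names are mine):
// PluginDocLexer.g4
lexer grammar PluginDocLexer;
PLUGIN_NAME : '#pluginName' -> pushMode(NAME_MODE);
PLUGIN_DESC : '#pluginDesc' -> pushMode(DESC_MODE);
WS          : [ \r\n\t]+ -> skip;
mode NAME_MODE;
IDENTIFIER  : [a-zA-Z_]+ -> popMode;   // one identifier, then back to the default mode
NAME_WS     : [ \t]+ -> skip;
mode DESC_MODE;
TEXT        : ~[ \r\n\t]+ -> popMode;  // one word of text, then back to the default mode
DESC_WS     : [ \t]+ -> skip;
// PluginDocParser.g4
parser grammar PluginDocParser;
options { tokenVocab=PluginDocLexer; }
pluginDef  : (pluginName | pluginDesc)* EOF;
pluginName : PLUGIN_NAME IDENTIFIER;
pluginDesc : PLUGIN_DESC TEXT;
This keeps kafka and abc from clashing at all, at the cost of a second grammar file; for a format this small, the single TEXT rule above is simpler.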

ANTLR: mismatched input

I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.
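For example, reversing the order in the grammar from the question makes the original text parser rule work, because the lexer now produces TEXT tokens:
grammar Bug;
text: TEXT;
TEXT: ('a'..'z' | 'A'..'Z')+ ;     // listed first, so 'aa' is now a TEXT token
WORD: ('a'..'z' | 'A'..'Z')+ ;     // identical pattern defined later: this rule can never produce a token
NEWLINE: [\n\r] -> skip ;
In practice you would simply delete whichever of the two identical rules you do not need.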

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it, even if it is wrong, and how you can use semantic predicates or lookahead to remove ambiguity. But nothing I have read has helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain places as a regular identifier, you will have to include it in the identifier rule.
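A rough sketch of what that could look like (ANTLR 3 syntax, Python target as in the question; I'm relying on the usual behaviour that when two lexer rules match the same text, the one listed first wins, so the keyword stays a keyword):
PCT_CONTAINS : 'pct_contains';
FIELDSEGMENT : ('a'..'z'|'A'..'Z')+ ('_' ('a'..'z'|'A'..'Z')+)*;  // replaces the fieldsegment parser rule; ICHAR and USCORE tokens are no longer needed
DOT          : '.';
WS           : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
fieldsegments : fieldsegment (DOT fieldsegment)*;
fieldsegment  : FIELDSEGMENT | PCT_CONTAINS;   // include the keyword here only if it should also be usable as a field segment
With a full-token FIELDSEGMENT rule the lexer no longer commits to the keyword after seeing 'pct_', so 'attributes.pct_vac' should lex as FIELDSEGMENT DOT FIELDSEGMENT.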

Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is for SEMICOLON and NEWLINE to never become tokens of their own, but always be part of a STMTEND. You can do that by making them so-called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only usable by other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create INT or STMTEND tokens.
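As for treating end-of-file as a valid statement terminator: EOF is a predefined token that you can reference in ANTLR 3 parser rules, so one possible (untested) variation is to accept it as an alternative terminator after an INT:
program   : statement+;
statement : STMTEND
          | INT (STMTEND | EOF)   // a trailing statement may end at end-of-file instead of a STMTEND
          ;
That way input like "456" with no trailing semicolon or newline is still a complete statement.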