Antlr4 lexer takes wrong rule [duplicate] - antlr

This question already has answers here:
mismatched Input when lexing and parsing with modes
(2 answers)
Closed 7 years ago.
My language has commands that can be parameter-less or with parameters, and an "if" keyword:
cmd1 // parameter-less command
cmd2 a word // with parameter: "a word" - it starts with first non-WS char
if cmd3 // if, not a command, followed by parameter-less command
cmd4 if text // command with parameter: "if text"
"if" is recognized as if only if it's the first non-WS string in the line (let's ignore comments for now...)
These are my grammer rules:
grammar TestFlow;
// Parser Rules:
root: (lineComment | ifStat | cmd )* EOF;
lineComment : LC;
ifStat : IF;
cmd : CMD;
// Lexer Rules:
LC : '//' ~([\n\r\u2028\u2029])* -> channel(HIDDEN); // line comment
IF : 'if';
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need.
What's my mistake? how to fix it?

There doesn't appear to be a parser rule in your example that defines the grammar. Meaning there is no rule indicating to look for an 'if' AND a command.
What is happening in your words:
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need
The first alternative in the lexer rule CMD looks for one or more characters ("if"), followed by a space ' ', followed by a LINE (cmd3).
So, with the input "if cmd3" it matches the entire line, which is exactly what you told it to do!
I can tell you from personal experience that for even a simple language, you'll learn a lot and very quickly by taking a step back and review some example grammars, which is what I would do if I were you now to avoid frustration. I highly recommend the Antlr4 REference book from www.pragprog.com as well as the antlr website.
UPDATED
I think this is what you may be interested in:
grammar myGrammar;
root : statement NEWLINE
| comment NEWLINE
;
statement : ifStat (LC)?
| cmdStat (LC)?
;
ifStat : IF cmdStat;
cmdStat : cmd (args)*;
cmd : CMD;
args : LINE;
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
NEWLINE : ('\r')?'\n';
Again, I must say, if you read the book (which I did), this may give you the expected response from your parser (not lexer).
The ifStat is optional (may ormay not be there, based on your test cases), there will always be a cmd and there may or may not be a line comment following it. Try this out and see if it is helpful. Good luck!

Just little tiny line, made everything perfect: in my MyParser.g4, just had to enter:
options { tokenVocab = MyLexer; }
right after the parser grammar MYParser;...
So much time was wasted just to find this little detail... :-(
(few of the) Other posts of people not knowing what was going on, just to finally reach this solution:
ANTLR: Lexer does not recognize token
mismatched Input when lexing and parsing with modes

Related

Can a parser fail silently?

May ANTLR generated parsers fail silently? That
is, can they omit diagnosing when not recognising?
Using a very small grammar for a demonstration and using defaults only for ANTLR, these are the contrasting observations:
When sending input to the usual test rig for the grammar below, I am
noticing two things:
the parsers recognize valid input (actions show that), o.K.;
however, the recognisers seem to “accept” certain invalid(?) inputs, meaning there is no
diagnosis. V3 and v4 parsers behave similarly. The issue—if there is
an issue—appears when there are characters ('1') missing
at the front of an input for stat, provided that prior to this input another input of
just a NEWLINE had been sent.
This is the v4 grammar:
grammar Simp;
prog : stat+ ;
stat : '1' '+' '1' NEWLINE
| NEWLINE
;
NEWLINE : [\r]?[\n] ;
The v3 grammar is the same, mutatis mutandis.
Some runs using v4; class TestSimp4 is the usual test rig as in the book(s),
see below:
% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%
The first two invocations' results I had expected. I had expected the last invocation to visibly fail, though. Correct?
Looking at the generated SimpParser.java, the silent exit seems
consequential, as outlined below. But should it be that way? I am thinking that ANTLR just
stops before recognising invalid input here, but it shouldn't just stop.
Question: Is this silent failure rather to be expected? Have I
overlooked something like a greedyness setting for grammar tokens with a
+ suffix?
Some code analysis.
Referring to the loop that calls stat() (in the
prog() procedure):
The v3 parser sets a counter variable to >= 1 on sucessfully matching
the initial NEWLINE. The effect is that EarlyExitException is then
not being thrown on later inputs, it just breaks the loop.
The v4 parser similarly calls _input.LA(1) and then just terminates
the loop whenever that call’s result cannot be at the start of stat.
(So no recovery?)
The test rig:
class TestSimp4 {
public static void main(String[] args) throws Exception {
final CharStream subject = CharStreams.fromStream(System.in);
final TokenSource tknzr = new SimpLexer(subject);
final CommonTokenStream ts = new CommonTokenStream(tknzr);
final SimpParser parser = new SimpParser(ts);
parser.prog();
}
}
So another paraphrase of my question would be: “How does one
create ANTLR parsers such that they will always say YES or NO?”
Your 3rd test input, \n+1\n, does not produce an error because you're telling it to recognize the production/rule stat once or more. And prog successfully matches the input \n and then stops. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:
prog : stat+ EOF;

Simple Island Grammar in ANTLR 4: Token Recognition Error

apparently, I wasn't able to deduce the answers to my problem from exiting posts on token recognition errors with Island Grammars here, so I hope someone can give me an advice on how to do this correctly.
Basically, I am trying to write a language that contains proprocessor directives. I narrowed my problem down to a very simple example. In my example lanuage, the following should be valid syntax:
##some preprocessor text
PRINT some regular text
When parsing the code, I want to be able to identify the tokens "some preprocessor text", "PRINT" and "some regular text".
This is the parser grammar:
parser grammar myp;
root: (preprocessor | command)*;
preprocessor: PREPROC PREPROCLINE;
command: PRINT STRINGLINE;
This is the lexer grammar:
lexer grammar myl;
PREPROC: '##' -> pushMode(PREPROC_MODE);
PRINT: 'PRINT' -> pushMode(STRING_MODE);
WS: [ \t\r\n] -> skip;
mode PREPROC_MODE;
PREPROCLINE: (~[\r\n])*[\r\n]+ -> popMode;
mode STRING_MODE;
STRINGLINE: (~[\r\n])*[\r\n]+ -> popMode;
When I parse the above example code, I get the following error:
line 1:2 extraneous input 'some preprocessor text\r\n' expecting
PREPROCLINE line 2:5 token recognition error at: ' some regular text'
This error occurs regardless of whether the line "WS: [ \t\r\n] -> skip;" is included in the lexer grammar or not. I guess that if I introduced quotes to the tokens PREPROCLINE and STRINGLINE instead of the line endings, it would work (at least I suceesfully implemented regular strings in other languages). But in this particular language, I really want to have the strings without the quotes.
Any help on why this error is occurring or how to implement a preprocessor language with unquoted strings is very appreciated.
Thanks
Updated: First, the recognition errors are because your parser needs to reference the lexer tokens. Add the options block to your parser:
options {
tokenVocab=MyLexer;
}
Second, when you generate your lexer/parser, be aware that the warnings usually need to be considered and corrected before proceeding.
Finally, these are all working alternatives, once you add the options block.
XXXX: (~[\r\n])*[\r\n]+ -> popMode;
is a bit cleaner as:
XXXX: .*? '\r'? '\n' -> popMode;
To not include the line endings, try
XXXX: .*? ~[\r\n] -> popMode;

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....

Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.