Can a parser fail silently? - antlr

May ANTLR generated parsers fail silently? That
is, can they omit diagnosing when not recognising?
Using a very small grammar for a demonstration and using defaults only for ANTLR, these are the contrasting observations:
When sending input to the usual test rig for the grammar below, I am
noticing two things:
the parsers recognize valid input (actions show that), o.K.;
however, the recognisers seem to “accept” certain invalid(?) inputs, meaning there is no
diagnosis. V3 and v4 parsers behave similarly. The issue—if there is
an issue—appears when there are characters ('1') missing
at the front of an input for stat, provided that prior to this input another input of
just a NEWLINE had been sent.
This is the v4 grammar:
grammar Simp;
prog : stat+ ;
stat : '1' '+' '1' NEWLINE
| NEWLINE
;
NEWLINE : [\r]?[\n] ;
The v3 grammar is the same, mutatis mutandis.
Some runs using v4; class TestSimp4 is the usual test rig as in the book(s),
see below:
% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%
The first two invocations' results I had expected. I had expected the last invocation to visibly fail, though. Correct?
Looking at the generated SimpParser.java, the silent exit seems
consequential, as outlined below. But should it be that way? I am thinking that ANTLR just
stops before recognising invalid input here, but it shouldn't just stop.
Question: Is this silent failure rather to be expected? Have I
overlooked something like a greediness setting for grammar tokens with a
+ suffix?
Some code analysis.
Referring to the loop that calls stat() (in the
prog() procedure):
The v3 parser sets a counter variable to >= 1 on successfully matching
the initial NEWLINE. The effect is that EarlyExitException is then
not being thrown on later inputs, it just breaks the loop.
The v4 parser similarly calls _input.LA(1) and then just terminates
the loop whenever that call’s result cannot be at the start of stat.
(So no recovery?)
The test rig:
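import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenSource;

class TestSimp4 {
    public static void main(String[] args) throws Exception {
        // Read all of standard input and feed it through lexer and parser.
        final CharStream subject = CharStreams.fromStream(System.in);
        final TokenSource tknzr = new SimpLexer(subject);
        final CommonTokenStream ts = new CommonTokenStream(tknzr);
        final SimpParser parser = new SimpParser(ts);
        parser.prog();  // start rule; syntax errors go to stderr by default
    }
}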
So another paraphrase of my question would be: “How does one
create ANTLR parsers such that they will always say YES or NO?”

Your 3rd test input, \n+1\n, does not produce an error because you're telling it to recognize the production/rule stat once or more. And prog successfully matches the input \n and then stops. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:
prog : stat+ EOF;
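The difference between the two versions of prog can be sketched with a toy recogniser. This is a minimal hand-written model in Python, not ANTLR-generated code, and it does not imitate ANTLR's error recovery; the names match_stat and prog and the token spelling "NL" are assumptions made for the illustration.

```python
# Toy model of the Simp grammar; tokens are plain strings, "NL" is NEWLINE.

def match_stat(tokens, pos):
    """Return the position after one stat, or None if no stat starts at pos."""
    if tokens[pos:pos + 4] == ["1", "+", "1", "NL"]:
        return pos + 4
    if tokens[pos:pos + 1] == ["NL"]:
        return pos + 1
    return None

def prog(tokens, require_eof):
    """Recognise stat+ and, optionally, insist that nothing is left over."""
    pos = match_stat(tokens, 0)
    if pos is None:
        return False                      # not even one stat
    while (nxt := match_stat(tokens, pos)) is not None:
        pos = nxt                         # keep looping while a stat matches
    if require_eof:
        return pos == len(tokens)         # models  prog : stat+ EOF ;
    return True                           # models  prog : stat+ ;  (leftovers ignored)

bad = ["NL", "+", "1", "NL"]              # the input "\n+1\n"
print(prog(bad, require_eof=False))       # True
print(prog(bad, require_eof=True))        # False
```

Without the end-of-input check, the loop simply stops at the first token that cannot start a stat and the leftover tokens are never examined.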

Related

Antlr4: Mismatch input

My grammar:
qwe.g4
grammar qwe;
query
: COLUMN OPERATOR value EOF
;
COLUMN
: [a-z_]+
;
OPERATOR
: ('='|'>'|'<')
;
SCALAR
: [a-z_]+
;
value
: SCALAR
;
WS : [ \t\r\n]+ -> skip ;
Handling in Python code:
qwe.py
from antlr4 import InputStream, CommonTokenStream
from qweLexer import qweLexer
from qweParser import qweParser
conditions = 'total_sales>qwe'
lexer = qweLexer(InputStream(conditions))
stream = CommonTokenStream(lexer)
parser = qweParser(stream)
tree = parser.query()
for child in tree.getChildren():
print(child, '==>', type(child))
Running qwe.py outputs error when parsing (lexing?) value:
How to fix that?
I read some and suppose that there is something to do with COLUMN rule that also matches value...
Your COLUMN and SCALAR lexer rules are identical. When the lexer matches two rules, the rule that recognizes the longest token wins. If the token lengths are the same (as they are here), then the first rule wins.
Your token stream will be COLUMN OPERATOR COLUMN.
That's why the query rule won't match.
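The tie-breaking rule can be tried out without ANTLR. The following is a rough Python sketch that mimics "longest match wins, first rule wins on a tie" with regular expressions; the RULES table and the tokenize function are assumptions written for this illustration, not part of the ANTLR runtime.

```python
import re

# Rules in grammar order, mirroring qwe.g4; WS is skipped.
RULES = [
    ("COLUMN", r"[a-z_]+"),
    ("OPERATOR", r"[=<>]"),
    ("SCALAR", r"[a-z_]+"),
    ("WS", r"[ \t\r\n]+"),
]

def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

print(tokenize("total_sales>qwe"))
# [('COLUMN', 'total_sales'), ('OPERATOR', '>'), ('COLUMN', 'qwe')]
```

SCALAR can never be produced here, because COLUMN matches exactly the same text and is listed first.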
As a general practice, it's good to use the grun alias (that the setup tutorial will tell you how to set up) to dump the token stream.
grun qwe tokens -tokens < sampleInputFile
Once that gives you the expected output, you'll probably want to use the grun tool to display parse trees of your input to verify that is correct. All of this can be done without hooking up the generated code into your target language, and helps ensure that your grammar is basically correct before you wire things up.

Erratic parser. Same grammar, same input, cycles through different results. What am I missing?

I'm writing a basic parser that reads from stdin and prints results to stdout. The problem is that I'm having trouble with this grammar:
%token WORD NUM TERM
%%
stmt: /* empty */
| word word term { printf("[stmt]\n"); }
| word number term { printf("[stmt]\n"); }
| word term
| number term
;
word: WORD { printf("[word]\n"); }
;
number: NUM { printf("[number]\n"); }
;
term: TERM { printf("[term]\n"); /* \n */}
;
%%
When I run the program and type: hello world\n The output is (as I expected) [word] [word] [term] [stmt]. So far, so good, but then if I type: hello world\n (again), I get syntax error [word][term].
When I type hello world\n (for the third time) it works, then it fails again, then it works, and so on and so forth.
Am I missing something obvious here?
(I have some experience with hand-rolled compilers, but I've not used lex/yacc et al.)
This is the main func:
int main() {
    do {
        yyparse();
    } while (!feof(yyin));
    return 0;
}
Any help would be appreciated. Thanks!
Your grammar recognises a single stmt. Yacc/bison expect the grammar to describe the entire input, so after the statement is recognised, the parser waits for an end-of-input indication. But it doesn't get one, since you typed a second statement. That causes the parser to report a syntax error. But note that it has now read the first token in the second line.
You are calling yyparse() in a loop and not stopping when you get a syntax error return value. So when you call yyparse() again, it will continue where the last one left off, which is just before the second token in the second line. What remains is just a single word, which it then correctly parses.
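The token that gets lost between calls can be seen in a toy model. This is a deliberately simplified Python stand-in for yyparse(), not Bison's actual LALR algorithm; the function parse_one and the shared token iterator are assumptions made for the illustration, and only the token names come from the grammar in the question.

```python
VALID = (["WORD", "WORD", "TERM"], ["WORD", "NUM", "TERM"],
         ["WORD", "TERM"], ["NUM", "TERM"], [])

def parse_one(tokens):
    """Toy stand-in for yyparse(): one stmt, then end of input.

    `tokens` is an iterator shared between calls, like a lexer on stdin.
    Returns (tokens seen for the stmt, status); status is 0 on success
    and 1 on a syntax error.
    """
    seen = []
    for tok in tokens:
        seen.append(tok)
        if tok == "TERM":
            break
    if seen not in VALID:
        return seen, 1
    # The grammar ends after one stmt, so one more token is requested
    # and it must be the end of input. A real token here is consumed.
    lookahead = next(tokens, None)
    return seen, (0 if lookahead is None else 1)

line = ["WORD", "WORD", "TERM"]           # "hello world\n"
stream = iter(line * 3)                   # the same line typed three times
results = [parse_one(stream) for _ in range(3)]
for seen, status in results:
    print(seen, status)
```

In this model the second and third calls only ever see WORD TERM, because the first WORD of each of those lines was already consumed as the lookahead of the previous call.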
What you probably should do is write your parser so that it accepts any number of statements, and perhaps so that it does not die when it hits an error. That would look something like this:
%%
prog: %empty
| prog line
line: stmt '\n' { puts("Got a statement"); }
| error '\n' { yyerrok; /* Simple error recovery */ }
...
Note that I print a message for a statement only after I know that the line was correctly parsed. That usually turns out to be less confusing. But the best solution is not to use printf's at all, but rather to use Bison's trace facility, which is as simple as putting -t on the bison command line and setting the global variable yydebug = 1;. See Tracing your parser.

Why don't parser combinators backtrack in case of failure?

I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error-reporting meaningful for the user
def value: Parser[Any] =
obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism used in these parsers. On one hand, the Artima guide makes sense in saying that if all productions fail, the parser will arrive at failure("illegal start of value"), which is returned to the user. It does not make sense, however, once we understand that the grammar is not a list of value alternatives but a tree. That is, the value parser is a node that is called when a value is expected in the input. This means that the calling parser, which is also the parent, detects the failure of value parsing and proceeds with value's sibling alternative. Suppose that all alternatives to value also fail. The grandparent parser will then try its alternatives. If those fail in turn, the process unwinds upward until the starting-symbol parser fails. So what will the error message be? It seems that the last alternative of the topmost parser is reported as erroneous.
To figure out, who is right, I have created a demo where program is the topmost (starting symbol) parser
import scala.util.parsing.combinator._
object ExprParserTest extends App with JavaTokenParsers {
  // Grammar
  val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
  val term = wholeNumber | ident
  lazy val expr: Parser[_] = term ~ rep ("+" ~ expr)
  lazy val statement: Parser[_] = ident ~ " = " ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
  val program = rep(declaration) ~ rep(statement)
  // Test
  println(parseAll(program, "1 to 2")) // OK
  println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
  println(parseAll(program, "abc")) // OK
}
It fails with 1 to '2 due to the extra ' tick. Yes, it seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators are actually doing. I wonder: what is the real logic behind rule sensing, backtracking and error reporting in parser combinators? Why does the error message suggest that no backtracking to declaration -> ident | failure(), nor to statements, occurred? What is the point of the Artima guide suggesting to place failure() at the end if, as we see, it is either not reached or ignored, as the backtracking logic would require anyway?
Isn't a parser combinator just a plain dumb PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the starting-symbol parser would return all failed branches, and I wonder why/how the actual parser manages to select the most appropriate failure.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the 1st successful 'or' item and not backtrack. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.
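The effect of committing to the first successful alternative can be shown with a very small combinator sketch. This is a toy written in Python for illustration only, not scala.util.parsing.combinator; the helper names lit, seq, alt and eof are assumptions.

```python
# A parser is a function: (text, pos) -> new position, or None on failure.

def lit(s):
    """Match the literal string s at the current position."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def seq(*parsers):
    """Run the parsers one after another; fail if any of them fails."""
    def run(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return run

def alt(*parsers):
    """Ordered choice: commit to the first alternative that succeeds."""
    def run(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end        # later alternatives are never tried
        return None
    return run

def eof(text, pos):
    """Succeed only at the end of the input."""
    return pos if pos == len(text) else None

short_first = seq(alt(lit("a"), lit("ab")), eof)
long_first = seq(alt(lit("ab"), lit("a")), eof)

print(short_first("ab", 0))   # None
print(long_first("ab", 0))    # 2
```

With the short alternative first, alt succeeds on "a", the following eof fails, and the "ab" alternative is never revisited; putting the longer alternative first avoids this.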

Antlr4 lexer takes wrong rule [duplicate]

This question already has answers here:
mismatched Input when lexing and parsing with modes
(2 answers)
Closed 7 years ago.
My language has commands that can be parameter-less or with parameters, and an "if" keyword:
cmd1 // parameter-less command
cmd2 a word // with parameter: "a word" - it starts with first non-WS char
if cmd3 // if, not a command, followed by parameter-less command
cmd4 if text // command with parameter: "if text"
"if" is recognized as if only if it's the first non-WS string in the line (let's ignore comments for now...)
These are my grammar rules:
grammar TestFlow;
// Parser Rules:
root: (lineComment | ifStat | cmd )* EOF;
lineComment : LC;
ifStat : IF;
cmd : CMD;
// Lexer Rules:
LC : '//' ~([\n\r\u2028\u2029])* -> channel(HIDDEN); // line comment
IF : 'if';
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need.
What's my mistake? How do I fix it?
There doesn't appear to be a parser rule in your example that defines the grammar. Meaning there is no rule indicating to look for an 'if' AND a command.
What is happening in your words:
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need
The first alternative in the lexer rule CMD looks for one or more characters ("if"), followed by a space ' ', followed by a LINE (cmd3).
So, with the input "if cmd3" it matches the entire line, which is exactly what you told it to do!
I can tell you from personal experience that, even for a simple language, you'll learn a lot very quickly by taking a step back and reviewing some example grammars, which is what I would do if I were you now to avoid frustration. I highly recommend The Definitive ANTLR 4 Reference from www.pragprog.com as well as the ANTLR website.
UPDATED
I think this is what you may be interested in:
grammar myGrammar;
root : statement NEWLINE
| comment NEWLINE
;
statement : ifStat (LC)?
| cmdStat (LC)?
;
ifStat : IF cmdStat;
cmdStat : cmd (args)*;
cmd : CMD;
args : LINE;
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
NEWLINE : ('\r')?'\n';
Again, I must say, if you read the book (which I did), this may give you the expected response from your parser (not lexer).
The ifStat is optional (it may or may not be there, based on your test cases), there will always be a cmd, and there may or may not be a line comment following it. Try this out and see if it is helpful. Good luck!
Just one tiny line made everything perfect: in my MyParser.g4, I just had to enter:
options { tokenVocab = MyLexer; }
right after the parser grammar MyParser;...
So much time was wasted just to find this little detail... :-(
(few of the) Other posts of people not knowing what was going on, just to finally reach this solution:
ANTLR: Lexer does not recognize token
mismatched Input when lexing and parsing with modes

How do I exclude characters / symbols using ANTLR grammar?

I'm trying to write a grammar for various time formats (12:30, 0945, 1:30-2:45, ...) using ANTLR. So far it works like a charm as long as I don't type in characters that haven't been defined in the grammar file.
I'm using the following JUnit test for example:
final CharStream stream = new ANTLRStringStream("12:40-1300,15:123-18:59");
final TimeGrammarLexer lexer = new TimeGrammarLexer(stream);
final CommonTokenStream tokenStream = new CommonTokenStream(lexer);
final TimeGrammarParser parser = new TimeGrammarParser(tokenStream);
try {
final timeGrammar_return tree = parser.timeGrammar();
fail();
} catch (final Exception e) {
assertNotNull(e);
}
An Exception gets thrown (as expected) because "15:123" isn't valid.
If I try ("15:23a") though, no exception gets thrown and ANTLR treats it like a valid input.
Now if I define characters in my grammar, ANTLR seems to notice them and I once again get the exception I want:
CHAR: ('a'..'z')|('A'..'Z');
But how do I exclude umlauts, symbols and other stuff a user is able to type in (äöü{%&<>!)? So basically I'm looking for some kind of syntax that says: match everything BUT "0..9,:-"
...
So basically I'm looking for some kind of syntax that says: match everything BUT "0..9,:-"
The following rule matches any single character except a digit, ,, : and -:
Foo
: ~('0'..'9' | ',' | ':' | '-')
;
(the ~ negates single characters inside lexer-rules)
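For comparison, the same idea expressed as a negated character class in a Python regular expression. This is only an analogue for experimenting with which characters the rule would catch, not ANTLR syntax; the names NOT_ALLOWED and unexpected_chars are assumptions.

```python
import re

# Python analogue of  ~('0'..'9' | ',' | ':' | '-')
NOT_ALLOWED = re.compile(r"[^0-9,:\-]")

def unexpected_chars(text):
    """Return (offset, char) for every character outside 0-9 , : -"""
    return [(m.start(), m.group()) for m in NOT_ALLOWED.finditer(text)]

print(unexpected_chars("12:40-1300,15:123-18:59"))  # []
print(unexpected_chars("15:23a"))                    # [(5, 'a')]
print(unexpected_chars("1ä:30"))                     # [(1, 'ä')]
```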
But you might want to post your entire grammar: I get the impression there are some other things you're not doing as they should have been done. Your call.
You can define a token that matches all the characters that you do not want. If this token is not used in any of your parser rules, ANTLR will throw a NoViableAltException.
For Unicode this could look like this:
UTF8 : ('\u0000'..'\u002A' // ! to *
| '\u002E'..'\u002F' // . /
| '\u003B'..'\u00FF' // ; < = > ? # as well as letters brackets and stuff
)
;