I have a little ANTLR v4 grammer and I am implementing a visitor on it.
Lets say it is a simple calculator and every input must be terminated with a ";"
e.g. x=4+5;
If I do not put the ; at the end, then it is working too but I get a output the teminal.
line 1:56 missing ';' at '<EOF>'
Seems it can find the rule and more or less ignores the missing terminal ";".
I would prefer a strict error or an exception instead of this soft information.
The output is generated by the line
ParseTree tree = parser.input ()
Is there a way I can intensify the error-handling and check for that kind of error?
Yes, you can. Like you, I wanted a 100% perfect parse from user-submitted text and so created a strict error handler that prevents recovery from even simple errors.
The first step is in removing the default error listeners and adding your own STRICT error handler:
AntlrInputStream inputStream = new AntlrInputStream(stream);
BailLexer lexer = new BailLexer(inputStream); // TALK ABOUT THIS AT BOTTOM
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
LISBASICParser parser = new LISBASICParser(tokenStream);
parser.RemoveErrorListeners(); // UNHOOK ERROR HANDLER
parser.ErrorHandler = new StrictErrorStrategy(); // REPLACE WITH YOUR OWN
LISBASICParser.CalculationContext context = parser.calculation();
CalculationVisitor visitor = new CalculationVisitor();
visitor.VisitCalculation(context);
Here's my StrictErrorStrategy class. It inherits from the DefaultErrorStrategy class and overrides the two 'recovery' methods that are letting small errors like your semicolon error be recoverable:
public class StrictErrorStrategy : DefaultErrorStrategy
{
public override void Recover(Parser recognizer, RecognitionException e)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, e);
}
public override IToken RecoverInline(Parser recognizer)
{
IToken token = recognizer.CurrentToken;
string message = string.Format("parse error at line {0}, position {1} right before {2} ", token.Line, token.Column, GetTokenErrorDisplay(token));
throw new Exception(message, new InputMismatchException(recognizer));
}
public override void Sync(Parser recognizer) { }
}
Overriding these two methods allows you to stop (in this case with an exception that is caught elsewhere) on ANY parser error. And making the Sync method empty prevents the normal 're-sync after error' behavior from happening.
The final step is in catching all LEXER errors. You do this by creating a new class that inherits from your main lexer class; it overrides the Recover() method like so:
public class BailLexer : LISBASICLexer
{
public BailLexer(ICharStream input) : base(input) { }
public override void Recover(LexerNoViableAltException e)
{
string message = string.Format("lex error after token {0} at position {1}", _lasttoken.Text, e.StartIndex);
BasicEnvironment.SyntaxError = message;
BasicEnvironment.ErrorStartIndex = e.StartIndex;
throw new ParseCanceledException(BasicEnvironment.SyntaxError);
}
}
(Edit: In this code, BasicEnvironment is a high-level context object I used in the application to hold settings, errors, results, etc. So if you decide to use this, either do as another reader commented below, or substitute your own context/container.)
With this in place, even small errors during the lexing step will be caught as well. With these two overridden classes in place, the user of my app must supply absolutely perfect syntax to get a successful execution. There you go!
Because my ANTLR is in Java I add the answer here too. But it is the same idea as the accepted answer.
TempParser parser = new TempParser (tokens);
parser.removeErrorListeners ();
parser.addErrorListener (new BaseErrorListener ()
{
#Override
public void syntaxError (final Recognizer <?,?> recognizer, Object sym, int line, int pos, String msg, RecognitionException e)
{
throw new AssertionError ("ANTLR - syntax-error - line: " + line + ", position: " + pos + ", message: " + msg);
}
});
Related
I'd like to run my ANTLR parser with an ExecutorService so I can call Future.cancel() on it after a timeout. AIUI, I need the parser to check Thread.isInterrupted(); is there a mechanism in the parser interface for this kind of instrumentation?
In the cases where this is relevant, the parser seems to be deep in PredictionContext recursion.
There is a ParseCancellationException (https://javadoc.io/doc/org.antlr/antlr4-runtime/latest/index.html).
Per the docs: “This exception is thrown to cancel a parsing operation. This exception does not extend RecognitionException, allowing it to bypass the standard error recovery mechanisms. BailErrorStrategy throws this exception in response to a parse error.”
You can attach a Listener that overrides enterEveryRule() to your parser. You could have a method on that listener to set a flag to throw the ParseCancellationException the next time the parser enters a rule (which happens very frequently).
Here's a short example of what the listener might look like:
public class CancelListener extends YourBaseListener {
public boolean cancel = false;
#Override
public void enterEveryRule(ParserRuleContext ctx) {
if (cancel) {
throw new ParseCancellationException("gotta go");
}
super.enterEveryRule(ctx);
}
}
then you can add that listener to your parser:
parser.addParseListener(cancelListener);
And then:
cancelListener.cancel = true
I am using Lucene 4.6, and am apparently unclear on how to reuse a TokenStream, because I get the exception:
java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
at the start of the second pass. I've read the Javadoc, but I'm still missing something. Here is a simple example that throws the above exception:
#Test
public void list() throws Exception {
String text = "here are some words";
TokenStream ts = new StandardTokenizer(Version.LUCENE_46, new StringReader(text));
listTokens(ts);
listTokens(ts);
}
public static void listTokens(TokenStream ts) throws Exception {
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
try {
ts.reset();
while (ts.incrementToken()) {
System.out.println("token text: " + termAtt.toString());
}
ts.end();
}
finally {
ts.close();
}
}
I've tried not calling TokenStream.end() or TokenStream.close() thinking maybe they should only be called at the very end, but I get the same exception.
Can anyone offer a suggestion?
The Exception lists, as a possible issue, calling reset() multiple times, which you are doing. This is explicitly not allowed in the implementation of Tokenizer. Since the the java.io.Reader api does not guarantee support of the reset() operation by all subclasses, the Tokenizer can't assume that the Reader passed in can be reset, after all.
You may simply construct a new TokenStream, or I believe you could call Tokenizer.setReader(Reader) (in which case you certainly must close() it first).
Basically, I've extended the BaseErrorListener, and I need to know when the error is semantic and when it's syntactic. So I want the following to give me a failed predicate exception, but I'm getting a NoViableAltException instead (I know the counting is working, because I can print out the value of things, and it's correct). Is there a way I can re-work it to do what I want? In my example below, I want there to be a failed predicate exception if we don't end up with 6 things.
grammar Test;
#parser::members {
int things = 0;
}
.
.
.
samplerule : THING { things++; } ;
.
.
.
// Want this to be a failed predicate instead of NoViableAltException
anotherrule : ENDTOKEN { things == 6 }? ;
.
.
.
I'm already properly getting failed predicate exceptions with the following (for a different scenario):
somerule : { Integer.valueOf(getCurrentToken().getText()) < 256 }? NUMBER ;
.
.
.
NUMBER : [0-9]+ ;
In ANTLR 4, predicates should only be used in cases where your input leads to two different possible parse trees (ambiguous grammar) and the default handling is producing the wrong parse tree. You should create a listener or visitor implementation containing your logic for semantic validation of the source.
Due to 280Z28's answer and the apparent fact that predicates should not be used for what I was trying to do, I went a different route.
If you know what you're looking for, ANTLR4's documentation is actually pretty useful--visit Parser.getCurrentToken()'s documentation and poke around further to see what more you can do with my implementation below.
My driver ended up looking something like the following:
// NameOfMyGrammar.java
public class NameOfMyGrammar {
public static void main(String[] args) throws Exception {
String inputFile = args[0];
try {
ANTLRInputStream input = new ANTLRFileStream(inputFile);
NameOfMyGrammarLexer lexer = new NameOfMyGrammarLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MyCustomParser parser = new MyCustomParser(tokens);
try {
// begin parsing at "start" rule
ParseTree tree = parser.start();
// can print out parse tree if you want..
} catch (RuntimeException e) {
// Handle errors if you want..
}
} catch (IOException e) {
System.err.println("Error: " + e);
}
}
// extend ANTLR-generated parser
private static class MyCustomParser extends NameOfMyGrammarParser {
// Constructor (my understanding is that you DO need this)
public MyCustomParser(TokenStream input) {
super(input);
}
#Override
public Token getCurrentToken() {
// Do your semantic checking as you encounter tokens here..
// Depending on how you want to handle your errors, you can
// throw exceptions, print out errors, etc.
// Make sure you end by returning the current token
return _input.LT(1);
}
}
}
I am using Antlr4, and here is a simplified grammar I wrote:
grammar BooleanExpression;
/*******************************
* Parser Rules
*******************************/
booleanTerm
: booleanLiteral (KW_OR booleanLiteral)+
| booleanLiteral
;
id
: IDENTIFIER
;
booleanLiteral
: KW_TRUE
| KW_FALSE
;
/*******************************
* Lexer Rules
*******************************/
KW_TRUE
: 'true'
;
KW_FALSE
: 'false'
;
KW_OR
: 'or'
;
IDENTIFIER
: (SIMPLE_LATIN)+
;
fragment
SIMPLE_LATIN
: 'A' .. 'Z'
| 'a' .. 'z'
;
WHITESPACE
: [ \t\n\r]+ -> skip
;
I used a BailErrorStategy and BailLexer like below:
public class BailErrorStrategy extends DefaultErrorStrategy {
/**
* Instead of recovering from exception e, rethrow it wrapped in a generic
* IllegalArgumentException so it is not caught by the rule function catches.
* Exception e is the "cause" of the IllegalArgumentException.
*/
#Override
public void recover(Parser recognizer, RecognitionException e) {
throw new IllegalArgumentException(e);
}
/**
* Make sure we don't attempt to recover inline; if the parser successfully
* recovers, it won't throw an exception.
*/
#Override
public Token recoverInline(Parser recognizer) throws RecognitionException {
throw new IllegalArgumentException(new InputMismatchException(recognizer));
}
/** Make sure we don't attempt to recover from problems in subrules. */
#Override
public void sync(Parser recognizer) {
}
#Override
protected Token getMissingSymbol(Parser recognizer) {
throw new IllegalArgumentException(new InputMismatchException(recognizer));
}
}
public class BailLexer extends BooleanExpressionLexer {
public BailLexer(CharStream input) {
super(input);
//removeErrorListeners();
//addErrorListener(new ConsoleErrorListener());
}
#Override
public void recover(LexerNoViableAltException e) {
throw new IllegalArgumentException(e); // Bail out
}
#Override
public void recover(RecognitionException re) {
throw new IllegalArgumentException(re); // Bail out
}
}
Everything works okay except one case. I tried the following expression:
true OR false
I expect this expression to be rejected and an IllegalArgumentException is thrown because the 'or' token should be lower case instead of upper case. But it turned out Antlr4 didn't reject this expression and the expression is tokenized into "KW_TRUE IDENTIFIER KW_FALSE" (which is expected, upper case 'OR' will be considered as an IDENTIFIER), but the parser didn't throw an error during processing this token stream and parsed it into a tree containing only "true" and discarded the remaining "IDENTIFIER KW_FALSE" tokens. I tried different prediction modes but all of them worked like above. I have no idea why it works like this and did some debugging, and it eventually led to to this piece of code in Antlr:
ATNConfigSet reach = computeReachSet(previous, t, false);
if ( reach==null ) {
// if any configs in previous dipped into outer context, that
// means that input up to t actually finished entry rule
// at least for SLL decision. Full LL doesn't dip into outer
// so don't need special case.
// We will get an error no matter what so delay until after
// decision; better error message. Also, no reachable target
// ATN states in SLL implies LL will also get nowhere.
// If conflict in states that dip out, choose min since we
// will get error no matter what.
int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);
if ( alt!=ATN.INVALID_ALT_NUMBER ) {
// return w/o altering DFA
return alt;
}
throw noViableAlt(input, outerContext, previous, startIndex);
}
The code "int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);" returned the second alternative in booleanTerm (because "true" matches the second alternative "booleanLiteral") but since it is not equal to ATN.INVALID_ALT_NUMBER, noViableAlt is not thrown immediately. According to the Java comments there, "We will get an error no matter what, so delay until after decision" but it seems no error was thrown eventually.
I really have no idea how to make Antlr reports an error in this case, could some one shed me some light on this? Any help is appreciated, thanks.
If your top-level rule does not end with an explicit EOF, then ANTLR is not required to parse to the end of the input sequence. Rather than throw an exception, it simply parsed the valid portion of the sequence you gave it.
The following start rule would force it to parse the entire input sequence as a single booleanTerm.
start : booleanTerm EOF;
Also, BailErrorStrategy is provided by the ANTLR 4 runtime, and throws a more informative ParseCancellationException than the one shown in your example.
I've got a rule like this:
declaration returns [RuntimeObject obj]:
DECLARE label value { $obj = new RuntimeObject($label.text, $value.text); };
Unfortunately, it throws an exception in the RuntimeObject constructor because $label.text is null. Examining the debug output and some other things reveals that the match against "label" actually failed, but the Antlr runtime "helpfully" continues with the match for the purpose of giving a more helpful error message (http://www.antlr.org/blog/antlr3/error.handling.tml).
Okay, I can see how this would be useful for some situations, but how can I tell Antlr to stop doing that? The defaultErrorHandler=false option from v2 seems to be gone.
I don't know much about Antlr, so this may be way off base, but the section entitled "Error Handling" on this migration page looks helpful.
It suggests you can either use #rulecatch { } to disable error handling entirely, or override the mismatch() method of the BaseRecogniser with your own implementation that doesn't attempt to recover. From your problem description, the example on that page seems like it does exactly what you want.
You could also override the reportError(RecognitionException) method, to make it rethrow the exception instead of print it, like so:
#parser::members {
#Override
public void reportError(RecognitionException e) {
throw new RuntimeException(e);
}
}
However, I'm not sure you want this (or the solution by ire_and_curses), because you will only get one error per parse attempt, which you can then fix, just to find the next error. If you try to recover (ANTLR does it okay) you can get multiple errors in one try, and fix all of them.
You need to override the mismatch and recoverFromMismatchedSet methods to ensure an exception is thrown immediately (examples are for Java):
#members {
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException {
throw e;
}
}
then you need to change how the parser deals with those exceptions so they're not swallowed:
#rulecatch {
catch (RecognitionException e) {
throw e;
}
}
(The bodies of all the rule-matching methods in your parser will be enclosed in try blocks, with this as the catch block.)
For comparison, the default implementation of recoverFromMismatchedSet inherited from BaseRecognizer:
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException {
if (mismatchIsMissingToken(input, follow)) {
reportError(e);
return getMissingSymbol(input, e, Token.INVALID_TOKEN_TYPE, follow);
}
throw e;
}
and the default rulecatch:
catch (RecognitionException re) {
reportError(re);
recover(input,re);
}