Syntax error recovery in ANTLR to produce valid parse tree - antlr

My simplified grammar is similar to the following:
parseExpression: expression EOF;
identifier: IDENT;
expression
: identifier
| identifier (+|-) identifier
;
When I try to parse an expression like abc +, a NoViableAlternativeException is thrown and ANTLR will recover by producing a ExpressionContext containing ErrorNode children.
I would like to recover from such a syntax error by ignoring the + token or by generating a synthetic token that is valid based on the expected tokens.
The error reporting will still be done through other means, but I need a valid parse tree that I can walk.
Any idea how this can be done in a custom ANTLRErrorListener or ANTLRErrorStrategy? I was trying to understand what happens in these classes when errors are encountered, but I couldn't figure out so far how to do this recovery.

Related

ANTLR4: parse number as identifier instead as numeric literal

I have this situation, of having to treat integer as identifier.
Underlying language syntax (unfortunately) allows this.
grammar excerpt:
grammar Alang;
...
NLITERAL : [0-9]+ ;
...
IDENTIFIER : [a-zA-Z0-9_]+ ;
Example code, that has to be dealt with:
/** declaration block **/
Method 465;
...
In above code example, because NLITERAL has to be placed before IDENTIFIER, parser picks 465 as NLITERAL.
What is a good way to deal with such a situations?
(Ideally, avoiding application code within grammar, to keep it runtime agnostic)
I found similar questions on SO, not exactly helpful though.
There's no good way to make 465 produce either an NLITERAL token or an IDENTIFIER token depending on context (you might be able to use lexer modes, but that's probably not a good fit for your needs).
What you can do rather easily though, is to allow NLITERALs in addition to IDENTIFIERS in certain places. So you could define a parser rule
methodName: IDENTIFIER | NLITERAL;
and then use that rule instead of IDENTIFIER where appropriate.

Antlr4 extremely simple grammar failing

Antlr4 has always been a kind of love-hate relationship for me, but I am currently a bit perplexed. I have started creating a grammar to my best knowledge and then wanted to test it and it didnt work at all. I then reduced it a lot to just a bare minimum example and I managed to make it not work. This is my grammar:
grammar SwiftMtComponentFormat;
separator : ~ZERO EOF;
ZERO : '0';
In my understanding it should anything except a '0' and then expect the end of the file. I have been testing it with the single character input '1' which I had expected to work. However this is what happens:
If i change the ~ZEROto ZERO and change my input from 1 to 0 it actually perfectly matches... For some reason the simple negation does not seem to work. I am failing to understand what the reason here is...
In a parser rule ~ZERO matches any token that is not a ZERO token. The problem in your case is that ZERO is the only type of token that you defined at all, so any other input will lead to a token recognition error and not get to the parser at all. So if you enter the input 1, the lexer will discard the 1 with a token recognition error and the parser will only see an empty token stream.
To fix this, you can simply define a lexer rule OTHER that matches any character not matched by previous lexer rules:
OTHER: .;
Note that this definition has to go after the definition of ZERO - otherwise it would match 0 as well.
Now the input 1 will produce an OTHER token and ~ZERO will match that token. Of course, you could now replace ~ZERO with OTHER and it wouldn't change anything, but once you add additional tokens, ~ZERO will match those as well whereas OTHER would not.

Another implicit token error - how to tweak definitions to address it

I am aware what implicit token definition error in parser means, but am having difficulty getting rid of it. (v4)
stripped down statements:
enum_decl : GTYPE_ENUM ID LSQUARE STRING STRING* RSQUARE SEMI ;
string_decl: GTYPE_STRING ID (COMMA ID)* SEMI ;
In string_decl, that error appears on SEMI
In enum_decl the same error is on RSQUARE
GTYPE_ENUM, ID, etc. all are defined / accepted correctly, in the Lexer section.
Have you type in that little tiny section trying to find a small test case that doesn't work? Without a grammar to test there's nothing we can do. Is either a bug or a problem with your grammar.

ANTLR reports error and I think it should be able to resolve input with backtracking

I have a simple grammar that works for the most part, but at one place it reports error and I think it shouldn't, because it can be resolved using backtracking.
Here is the portion that is problematic.
command: object message_chain;
object: ID;
message_chain: unary_message_chain keyword_message?
| binary_message_chain keyword_message?
| keyword_message;
unary_message_chain: unary_message+;
binary_message_chain: binary_message+;
unary_message: ID;
binary_message: BINARY_OPERATOR object;
keyword_message: (ID ':' object)+;
This is simplified version, object is more complex (it can be result of other command, raw value and so on, but that part works fine). Problem is in message_chain, in first alternative. For input like obj unary1 unary2 it works fine, but for intput like obj unary1 unary2 keyword1:obj2 is trys to match keyword1 as unary message and fails when it reaches :. I would think that it this situation parser would backtrack and figure that there is : and recognize that that is keyword message.
If I make keyword message non-optional it works fine, but I need keyword message to be optional.
Parser finds keyword message if it is in second alternative (binary_message) and third alternative (just keyword_message). So something like this gives good results: 1 + 2 + 3 Keyword1:Value
What am I missing? Backtracking is set to true in options and it works fine in other cases in the same grammar.
Thanks.
This is not really a case for PEG-style backtracking, because upon failure that returns to decision points in uncompleted derivations only. For input obj unary1 unary2 keyword1:obj2, with a single token lookahead, keyword1 could be consumed by unary_message_chain. The failure may not occur before keyword_message, and next to be tried would be the second alternative of message_chain, i.e. binary_message_chain, thus missing the correct parse.
However as this grammar is LL(2), it should be possible to extend lookahead to avoid consuming keyword1 from within unary_message_chain. Have you tried explicitly setting k=2, without backtracking?

Xtext rule consisting of Terminals not working

As part of a larger grammar I'm trying to define rules to describe "method calls". I ran into trouble and I think I reduced the problem to my lack of knowledge regarding Terminals.
Here's a simple grammar describing my problem:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model: methodCalls+=MethodCall*;
MethodCall: 'call' ID '.' ID;
With that grammar I can write something like
call variable.method
call foo.bar
Now I would like to allow wildcard characters in the method name. I changed the MethodCall-rule to
MethodCall: 'call' ID '.' WildcardName;
and to the end of the grammar I added
terminal WildcardName : ('a'..'z'|'A'..'Z'|'_'|'*'|'?') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'*'|'?')*;
Trying
call variable.method
call foo.bar
again I got the error messages:
mismatched input 'foo' expecting RULE_ID
mismatched input 'variable' expecting RULE_ID
Why are 'foo' and 'variable' not being matched by Terminal ID? And more importantly, why does even adding the new Terminal without actually using it cause this error message?
the parsing is done in two steps: lexing and parsing. terminal rules are done in the lexing phase => at the places you expect an ID a WildcardName is recognized
=> you have to use a Datatype rule for this as well
WildcardName : (ID | '*')+;