Another implicit token error - how to tweak definitions to address it - antlr

I am aware what implicit token definition error in parser means, but am having difficulty getting rid of it. (v4)
stripped down statements:
enum_decl : GTYPE_ENUM ID LSQUARE STRING STRING* RSQUARE SEMI ;
string_decl: GTYPE_STRING ID (COMMA ID)* SEMI ;
In string_decl, that error appears on SEMI
In enum_decl the same error is on RSQUARE
GTYPE_ENUM, ID, etc. all are defined / accepted correctly, in the Lexer section.

Have you type in that little tiny section trying to find a small test case that doesn't work? Without a grammar to test there's nothing we can do. Is either a bug or a problem with your grammar.

Related

Syntax error recovery in ANTLR to produce valid parse tree

My simplified grammar is similar to the following:
parseExpression: expression EOF;
identifier: IDENT;
expression
: identifier
| identifier (+|-) identifier
;
When I try to parse an expression like abc +, a NoViableAlternativeException is thrown and ANTLR will recover by producing a ExpressionContext containing ErrorNode children.
I would like to recover from such a syntax error by ignoring the + token or by generating a synthetic token that is valid based on the expected tokens.
The error reporting will still be done through other means, but I need a valid parse tree that I can walk.
Any idea how this can be done in a custom ANTLRErrorListener or ANTLRErrorStrategy? I was trying to understand what happens in these classes when errors are encountered, but I couldn't figure out so far how to do this recovery.

Antlr4 extremely simple grammar failing

Antlr4 has always been a kind of love-hate relationship for me, but I am currently a bit perplexed. I have started creating a grammar to my best knowledge and then wanted to test it and it didnt work at all. I then reduced it a lot to just a bare minimum example and I managed to make it not work. This is my grammar:
grammar SwiftMtComponentFormat;
separator : ~ZERO EOF;
ZERO : '0';
In my understanding it should anything except a '0' and then expect the end of the file. I have been testing it with the single character input '1' which I had expected to work. However this is what happens:
If i change the ~ZEROto ZERO and change my input from 1 to 0 it actually perfectly matches... For some reason the simple negation does not seem to work. I am failing to understand what the reason here is...
In a parser rule ~ZERO matches any token that is not a ZERO token. The problem in your case is that ZERO is the only type of token that you defined at all, so any other input will lead to a token recognition error and not get to the parser at all. So if you enter the input 1, the lexer will discard the 1 with a token recognition error and the parser will only see an empty token stream.
To fix this, you can simply define a lexer rule OTHER that matches any character not matched by previous lexer rules:
OTHER: .;
Note that this definition has to go after the definition of ZERO - otherwise it would match 0 as well.
Now the input 1 will produce an OTHER token and ~ZERO will match that token. Of course, you could now replace ~ZERO with OTHER and it wouldn't change anything, but once you add additional tokens, ~ZERO will match those as well whereas OTHER would not.

antlr2 to antlr4 class specifier, options, TOKENS and more

I need to rewrite a grammar file from antlr2 syntax to antlr4 syntax and have the following questions.
1) Bart Kiers states there is a strict order: grammar, options, tokens, #header, #members in this SO post. This antlr2.org post disagrees stating header is before options. Is there a resource that states the correct order (if one exists) for antlr4?
2) The same antlr2.org post states: "The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:
class MyParser extends Parser;
options { k=2; }
However, when running with antlr4, any class specifier creates this error:
syntax error: missing COLON at 'MyParser' while matching a rule
3) What happened to options in antlr4? says there are no rule-level options at the time.
warning(83): MyGrammar.g4:4:4: unsupported option k
warning(83): MyGrammar.g4:5:4: unsupported option exportVocab
warning(83): MyGrammar.g4:6:4: unsupported option codeGenMakeSwitchThreshold
warning(83): MyGrammar.g4:7:4: unsupported option codeGenBitsetTestThreshold
warning(83): MyGrammar.g4:8:4: unsupported option defaultErrorHandler
warning(83): MyGrammar.g4:9:4: unsupported option buildAST
i.) does antlr4's adaptive LL(*) parsing algorithm no longer require k token lookhead?
ii.) is there an equivalent in antlr4 for exportVocab?
iii.) are there equivalents in antlr4 for optimizations codeGenMakeSwitchThreshold and codeGenBitsetTestThreshold or have they become obsolete?
iv.) is there an equivalent for defaultErrorHandler ?
v.) I know antlr4 no longer builds AST. I'm still trying to get a grasp of how this will affect what uses the currently generated *Parser.java and *Lexer.java.
4) My current grammar file specifies a TOKENS section
tokens {
ROOT; FOO; BAR; TRUE="true"; FALSE="false"; NULL="null";
}
I changed the double quotes to single quotes and the semi-colons to commas and the equal sign to a colon to try and get rid of each syntax error but have this error:
mismatched input ':' expecting RBRACE
along with others. Rewritten looks like:
tokens {
ROOT; FOO; BAR; TRUE:'true'; FALSE:'false' ...
}
so I removed :'true' and :'false' and TRUE and FALSE will appear in the generated MyGrammar.tokens but I'm not sure if it will function the same as before.
Thanks!
Just look at the ultimate source for the syntax: the ANTLR4 grammar. As you can see the order plays no role in the prequel section (which includes named actions, options and the like, you can even have more than one option section). The only condition is that the prequel section must appear before any rule.
The error is about a wrong option. Remove that and the error will go away.
Many (actually most of the old) options are no longer needed and supported in ANTLR4.
i.) ANTLR4 uses unlimited lookahead (hence the * in ALL(*)). You cannot specify any other lookahead.
ii.) The exportVocab has long gone (not even ANTLR3 supports it). It only specifies a name for the .tokens file. Use the default instead.
iii.) Nothing like that is needed nor supported anymore. The prediction algorithm has completely changed in ANTLR4.
iv.) You use an error listener instead. There are many examples how to do that (also here at SO).
v.) Is that a question or just thinking loudly? Hint: ANTLR4 based parsers generate a parse tree.
I'm not 100% sure about this one, but I believe you can no longer specify the value a token should match in the tokens section. Instead this is only for virtual tokens and everything else must be specified as normal lexer tokens.
To sum up: most of the special options and tricks required for older ANTLR grammars are not needed anymore and must be removed. The new parsing algorithm can deal with all the ambiquities automatically, which former versions had trouble with and needed guidance from the user for.

ANTLR4 Mutual left recursion

I just ran into a strange problem with ANTLR 4.2.2:
Consider a (simplified) java grammar. This does not compile:
classOrInterfaceType
: (classOrInterfaceType) '.' Identifier
| Identifier
;
ANTLR outputs the following error:
error(119): Java.g4::: The following sets of rules are mutually left-recursive [classOrInterfaceType]
Yes, I also see a left recursion. But I do not see a mutual left recursion, only a usual one.
When I remove the parenthesis around (classOrInterfaceType), then it compiles fine. Of course, the parenthesis are superfluous, but the grammar is generated automatically and the code generator always inserts parenthesis in some situations. So what is the problem here?
It has been confirmed that this is a bug. The fix is scheduled for the next milestone 4.x.
See https://github.com/antlr/antlr4/issues/564

ANTLR reports error and I think it should be able to resolve input with backtracking

I have a simple grammar that works for the most part, but at one place it reports error and I think it shouldn't, because it can be resolved using backtracking.
Here is the portion that is problematic.
command: object message_chain;
object: ID;
message_chain: unary_message_chain keyword_message?
| binary_message_chain keyword_message?
| keyword_message;
unary_message_chain: unary_message+;
binary_message_chain: binary_message+;
unary_message: ID;
binary_message: BINARY_OPERATOR object;
keyword_message: (ID ':' object)+;
This is simplified version, object is more complex (it can be result of other command, raw value and so on, but that part works fine). Problem is in message_chain, in first alternative. For input like obj unary1 unary2 it works fine, but for intput like obj unary1 unary2 keyword1:obj2 is trys to match keyword1 as unary message and fails when it reaches :. I would think that it this situation parser would backtrack and figure that there is : and recognize that that is keyword message.
If I make keyword message non-optional it works fine, but I need keyword message to be optional.
Parser finds keyword message if it is in second alternative (binary_message) and third alternative (just keyword_message). So something like this gives good results: 1 + 2 + 3 Keyword1:Value
What am I missing? Backtracking is set to true in options and it works fine in other cases in the same grammar.
Thanks.
This is not really a case for PEG-style backtracking, because upon failure that returns to decision points in uncompleted derivations only. For input obj unary1 unary2 keyword1:obj2, with a single token lookahead, keyword1 could be consumed by unary_message_chain. The failure may not occur before keyword_message, and next to be tried would be the second alternative of message_chain, i.e. binary_message_chain, thus missing the correct parse.
However as this grammar is LL(2), it should be possible to extend lookahead to avoid consuming keyword1 from within unary_message_chain. Have you tried explicitly setting k=2, without backtracking?