Ignore MSGTokenError in JAVACC - error-handling

I use JAVACC to parse some string defined by a bnf grammar with initial non-terminal G.
I would like to catch errors thrown by TokenMgrError.
In particular, I want to handle the following two cases:
If some prefix of the input satisfies G, but not all of the symbols are read from the input, consider this case as normal and return AST for found prefix by a call to G().
If the input has no prefix satisfying G, return null from G().
Currently I'm getting TokenMgrError 's in each of this case instead.
I started to modify the generated files (i.e, to change Error to Exception and add appropriate try/catch/throws statements), but I found it to be tedious. In addition, automatic generation of the modified files produced by JAVACC does not work. Is there a smarter way to accomplish this?

You can always eliminate all TokenMgrErrors by including
<*> TOKEN : { <UNEXPECTED: ~[] > }
as the final rule. This pushes all you issues to the grammar level where you can generally deal with them more easily.

Related

ANTLR4 - replace op boundaries error|How to use TokenStreamRewriter to transform text from two listener events on overlapping tokens in original AST?

Hello ANTLR creators/users,
Some context - I am using PlSql ANTLR4 parser to do some lightweight transpiling of some queries from oracle sql to, let's say, spark sql. I have my listener class setup which extends the base listener.
Example of an issue -
Let's say the input is something like -
SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101') from xyz;
Now, I'd like to replace || with CONCAT and to_char with CAST as STRING, so that the final query looks like -
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
In my listener class, I am overriding two functions from base listener to do this - concatenation and string_function. In those, I am using a tokenStreamRewriter's replace to make the necessary transformation. Since tokenStreamRewriter is evaluated lazily, I am running to issue ->
java.lang.IllegalArgumentException: replace op boundaries of
<ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..[#53,276:276=')',
<2214>,3:63]:"CAST (to_number(substr(ATTRIBUTE_VALUE,1,4))-3 as STRING)">
overlap with previous <ReplaceOp#[#38,228:234='to_char',<2193>,3:15]..
[#56,279:284=''0101'',<2209>,3:66]:"CONCAT
(to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3),'0101')">
Clearly, the issue is my two listener functions attempting to replace/transform text on overlapping boundaries.
Is there any work around for territory overlap kind of issues for ANTLR4? I'm sure folks run into such stuff all the time probably.
I'd appreciate any workarounds, even dirty ones at this point of time :)
I did realize that ANTLR4 does not allow us to modify original AST, otherwise this would have been a little bit easier to solve.
Thanks!
A look at how tokenstreamrewriter works leads to the following understanding:
first, a list of all modification operations are built
then, you invoke getText()
here, there is a reduction of modification operations. The idea for example is to merge multiple insert together in one reduction. Its role is also to avoid multiple replace on same data (but i will expand on this point later).
every token is then read, in the case there is a modification listed for the said token index, TokenStreamRewriter do the operation, otherwise it just pop the read token.
Let's have a look on how modification operations are implemented:
for insert, tokenstream rewriter basically just adds the string to be added at the current token index, and then do an index+1, effectively going to next token
for replace, tokenstream rewriter replace a range of tokens with the new string, and set the new index to the end of this range.
So, for tokenstreamrewriter, overlapping replaces are not possible, as when you replace you jump to the end of the range of tokens to be replaced. Especially, in the case you remove the checks of overlapping, then only the first replace will be operated, as afterwards, the token index is past the other replaces.
Basically, this has been done because there is no way to tell easily what tokens should be replaced while using overlapping replaces. You would need for that symbol recognition and matching.
So, what you are trying to do is the following (for each step, the part between '*' is what is modified):
*SELECT to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)||'0101')* from xyz;
|
V
CONCAT (*to_char(to_number(substr(ATTRIBUTE_VALUE,1,4))-3)*,'0101') from xyz;
|
V
SELECT CONCAT(CAST(to_number(substr(ATTRIBUTE_VALUE,1,4))-3) as STRING),'0101') from xyz;
to achieve your transformation, you could do so a replace of :
'to_char' -> 'CONCAT(CAST'
'||' -> ' as STRING),'
And, by using a bit of intelligence while parsing your tokens, like is there a '||' in my tokens to know if it's string, you would know what to replace.
regards
The way I solve it in multiple projects based on ANTLR is this: I translated ANTLR parse-tree to an AST written using Kolasu, an open-source library we developed at Strumenta.
Kolasu has all sort of utilities to process and mutate ASTs. For all non-trivial projects I end up doing transformations on the AST.
Kolasu

antlr2 to antlr4 class specifier, options, TOKENS and more

I need to rewrite a grammar file from antlr2 syntax to antlr4 syntax and have the following questions.
1) Bart Kiers states there is a strict order: grammar, options, tokens, #header, #members in this SO post. This antlr2.org post disagrees stating header is before options. Is there a resource that states the correct order (if one exists) for antlr4?
2) The same antlr2.org post states: "The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:
class MyParser extends Parser;
options { k=2; }
However, when running with antlr4, any class specifier creates this error:
syntax error: missing COLON at 'MyParser' while matching a rule
3) What happened to options in antlr4? says there are no rule-level options at the time.
warning(83): MyGrammar.g4:4:4: unsupported option k
warning(83): MyGrammar.g4:5:4: unsupported option exportVocab
warning(83): MyGrammar.g4:6:4: unsupported option codeGenMakeSwitchThreshold
warning(83): MyGrammar.g4:7:4: unsupported option codeGenBitsetTestThreshold
warning(83): MyGrammar.g4:8:4: unsupported option defaultErrorHandler
warning(83): MyGrammar.g4:9:4: unsupported option buildAST
i.) does antlr4's adaptive LL(*) parsing algorithm no longer require k token lookhead?
ii.) is there an equivalent in antlr4 for exportVocab?
iii.) are there equivalents in antlr4 for optimizations codeGenMakeSwitchThreshold and codeGenBitsetTestThreshold or have they become obsolete?
iv.) is there an equivalent for defaultErrorHandler ?
v.) I know antlr4 no longer builds AST. I'm still trying to get a grasp of how this will affect what uses the currently generated *Parser.java and *Lexer.java.
4) My current grammar file specifies a TOKENS section
tokens {
ROOT; FOO; BAR; TRUE="true"; FALSE="false"; NULL="null";
}
I changed the double quotes to single quotes and the semi-colons to commas and the equal sign to a colon to try and get rid of each syntax error but have this error:
mismatched input ':' expecting RBRACE
along with others. Rewritten looks like:
tokens {
ROOT; FOO; BAR; TRUE:'true'; FALSE:'false' ...
}
so I removed :'true' and :'false' and TRUE and FALSE will appear in the generated MyGrammar.tokens but I'm not sure if it will function the same as before.
Thanks!
Just look at the ultimate source for the syntax: the ANTLR4 grammar. As you can see the order plays no role in the prequel section (which includes named actions, options and the like, you can even have more than one option section). The only condition is that the prequel section must appear before any rule.
The error is about a wrong option. Remove that and the error will go away.
Many (actually most of the old) options are no longer needed and supported in ANTLR4.
i.) ANTLR4 uses unlimited lookahead (hence the * in ALL(*)). You cannot specify any other lookahead.
ii.) The exportVocab has long gone (not even ANTLR3 supports it). It only specifies a name for the .tokens file. Use the default instead.
iii.) Nothing like that is needed nor supported anymore. The prediction algorithm has completely changed in ANTLR4.
iv.) You use an error listener instead. There are many examples how to do that (also here at SO).
v.) Is that a question or just thinking loudly? Hint: ANTLR4 based parsers generate a parse tree.
I'm not 100% sure about this one, but I believe you can no longer specify the value a token should match in the tokens section. Instead this is only for virtual tokens and everything else must be specified as normal lexer tokens.
To sum up: most of the special options and tricks required for older ANTLR grammars are not needed anymore and must be removed. The new parsing algorithm can deal with all the ambiquities automatically, which former versions had trouble with and needed guidance from the user for.

Semantic predicates fail but don't go to the next one

I tried to use ANTLR4 to identify a range notation like <1..100>, and here is my attempt:
#parser::members {
def evalRange(self, minnum, maxnum, num):
if minnum <= num <= maxnum:
return True
return False
}
range_1_100 : INT { self.evalRange(1, 100, $INT.int) }? ;
But it does not work for more than one range like:
some_rule : range_1_100 | range_200_300 ;
When I input a number (200), it just stops at the first rule:
200
line 3:0 rule range_1_100 failed predicate: { self.evalRange(1, 100, $INT.int) }?
(top (range_1_100 200))
It is not as I expected. How can I make the token match the next rule (range_200_300)?
Here's an excerpt from the docs (emphasis mine):
Predicates can appear anywhere within a parser rule just like actions can, but only those appearing on the left edge of alternatives can affect prediction (choosing between alternatives).
[...]
ANTLR's general decision-making strategy is to find all viable alternatives and then ignore the alternatives guarded with predicates that currently evaluate to false. (A viable alternative is one that matches the current input.) If more than one viable alternative remains, the parser chooses the alternative specified first in the decision.
Which basically means your predicate must be the first item in the alternation to be taken into account during the prediction phase.
Of course, you won't be able to use $INT as it wasn't matched yet at this point, but you can replace it with something like _input.LA(1) instead (lookahead of one token) - the exact syntax depends on your language target.
As a side note, I'd advise you to not validate the input through the grammar, it's easier and better to perform a separate validation pass after the parse. Let the grammar handle the syntax, not the semantics.

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require smaller tokens to be preferred (In this case MONTH would never be matched). You need a backtracking behaviour dependent on the text being matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However other parser generators are specialized on problems like yours. Packrat Parsers (PEG) are backtracking and allow tokenization on the fly. Try out parboiled for this purpose.
Appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

Stripping actions from ANTLR grammar changes its parsing algorithm

I have a grammar Foo.xtext (too complex to include it here). Xtext generates InternalFoo.g from it. After some tweaking it also generates DebugInternalFoo.g which claims to be the same thing without actions. Now, I strip off actions with ANTLR directly
java -cp antlr-3.4.jar org.antlr.tool.Strip Internal.g > Stripped.g
I'd expect the three grammars to behave the same way when I check them. But here is what I experienced
InternalFoo.g - error, rule assignment has non-LL(*) decision
DebugInternalFoo.g - no problem, parses fine
Stripped.g - warnings at rule assignment, decision can match using multiple alternatives. It fails to parse properly.
Is it possible that a grammar parses a text differently with or without actions? Or is it a bug in any of the action-remover tools? (The rule in question has syntactic predicates, and without them, it would really have a non-LL(*) decision.)
UPDATE:
I partly found what caused the problem. The rule in question was like this
trickyRule:
({ some complex action})
(expression '=')=>...
Stripping with Antlr removed the action, but left an empty group there:
// Stripped.g
trickyRule:
() (expression '=')=>...
The generation of the debug grammar removes both the action, and the now empty group around it:
// DebugInternalFoo.g
trickyRule:
(expression '=')=>...
So the lesson learned is: an empty group before a syntactic predicate is not the same as nothing at all.
Is it possible that a grammar parses a text differently with or without actions?
Yes, that is possible. org.antlr.tool.Strip leaves syntactic predicates1, but removes validating2- and gated3 semantic predicates (and member sections that these semantic predicates might use).
For example, the following rules would only match an A_TOKEN:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: {input.LT(1).getType() == A_TOKEN}? .
;
but if you use the Strip tool on it, it leaves the following:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: /*{input.LT(1).getType() == A_TOKEN}?*/ .
;
making it match any token.
In other words, Strip could change the behavior of the generated lexer or parser.
1 syntactic predicate: ( ... )=>
2 validating semantic predicate { ... }?
3 gated semantic predicate { ... }?=>