I just started with ANTLR, and I am using 4.2. I assumed the basics would be the same as in ANTLR 3, so I followed the accepted answer of this question (but I replaced Exp with Java, since I want to parse Java). Everything is fine until I try to compile the ANTLRDemo.java example.
When I compile that, I get 4 errors:
ANTLRStringStream in = new ANTLRStringStream("some random text");
JavaLexer lexer = new JavaLexer(in);
first error: constructor JavaLexer in class JavaLexer cannot be applied to
given types; JavaLexer lexer = new JavaLexer(in); required:
CharStream found: ANTLRStringStream reason: actual argument
ANTLRStringStream cannot be converted to CharStream by method
invocation conversion (I know what this error is ;-)
CommonTokenStream tokens = new CommonTokenStream( lexer);
JavaParser parser = new JavaParser(tokens);
System.out.println(parser.eval());
To keep it short, every line has its own similar error. For example, "parser" does not have an "eval()" method.
What am I missing? I guess ANTLR 4 does not work like ANTLR 3. Any ideas? Please consider my beginner status.
In ANTLR 4, use ANTLRInputStream instead of the old ANTLRStringStream from ANTLR 3. (From ANTLR 4.7 onward, ANTLRInputStream is itself deprecated in favor of CharStreams.fromString.)
The eval() method exists when you have a parser rule in the grammar named eval. One such method is created for each rule in the grammar. If you do not intend to start parsing at rule eval, then you should replace that call with the name of the start rule for your particular grammar.
I'm using the Java syntax defined at https://github.com/antlr/grammars-v4/tree/master/java/java
My users are free to input anything, for example
assert image != null;
,
public Color[][] smooth(Color[][] image, int neighberhoodSize)
{
...
}
,
package myapplication.mylibrary;
, and
import static java.lang.System.out; //'out' is a static field in java.lang.System
import static screen.ColorName.*;
My program should tell which syntax the input matches.
What I have so far is
var stream = CharStreams.fromString(input);
ITokenSource lexer = new JavaLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
JavaParser parser = new JavaParser(tokens); // declared as JavaParser so that rule methods such as statement() are visible
parser.ErrorHandler = new BailErrorStrategy();
try
{
var tree = parser.statement();
Console.WriteLine("The input is a statement");
}
catch (Exception e)
{
Console.WriteLine("The input is not a statement");
}
Is there a better way to check whether the input matches any of the 100 rules?
No, there's no other way than trial-and-error. Note that your generated parser has the property:
public static final String[] ruleNames
which you can use in combination with reflection to call all parser rules automatically instead of trying them manually.
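A minimal sketch of the reflection idea, in Java. ToyParser is a hypothetical stand-in for the generated parser, not ANTLR output: it only mimics the relevant shape (a public static ruleNames array plus one no-argument method per rule, which throws when the input does not match). With a real generated parser and BailErrorStrategy, the failure would surface as a ParseCancellationException wrapped in the InvocationTargetException, and each attempt would need a fresh or reset token stream.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated parser: exposes ruleNames and one
// public no-argument method per rule.
class ToyParser {
    public static final String[] ruleNames = {
        "packageDeclaration", "importDeclaration", "assertStatement"
    };

    private final String input;

    ToyParser(String input) { this.input = input.trim(); }

    public Object packageDeclaration() { return expect("package "); }
    public Object importDeclaration()  { return expect("import "); }
    public Object assertStatement()    { return expect("assert "); }

    // Toy "parse": accept only input with the given prefix and a trailing ';'.
    private Object expect(String prefix) {
        if (input.startsWith(prefix) && input.endsWith(";")) return input;
        throw new IllegalStateException("no match");
    }
}

class RuleMatcher {
    // Invoke every rule method listed in ruleNames; keep those that do not throw.
    static List<String> matchingRules(String input) {
        List<String> matches = new ArrayList<>();
        for (String rule : ToyParser.ruleNames) {
            try {
                Method m = ToyParser.class.getMethod(rule);
                m.invoke(new ToyParser(input)); // fresh parser for each attempt
                matches.add(rule);
            } catch (InvocationTargetException e) {
                // The rule method threw, so the input does not match this rule.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("no callable method for rule " + rule, e);
            }
        }
        return matches;
    }
}
```

Calling RuleMatcher.matchingRules("package myapplication.mylibrary;") returns a list containing only "packageDeclaration" for this toy parser.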
Also, trying parser.statement() might not be enough: the input String s = "mu"; FUBAR could be properly parsed by parser.statement() and leave the trailing Identifier (FUBAR) in the token stream. After all, the statement rule probably does not end with an EOF token forcing the parser to consume all tokens. You'll probably have to manually check if all tokens are consumed before determining the input was successfully parsed by a certain parser rule. Also see this Q&A: How to test ANTLR translation without adding EOF to every rule
Unless you really mean that your users can enter anything (and I suspect that, with some thought, that’s not really the case), you could add a parser rule that includes alternatives for each construct your users could enter. You might have to take a little care with the order of the alternatives.
Since parser rules are evaluated by recursive descent, a new rule that isn’t referenced by any other rule has no impact on the rest of the grammar.
Could be worth a shot.
I'm writing a code indenter using ANTLR 4 and Java. I have successfully generated the lexer and the parser, and the approach I am using is to walk the generated parse tree.
ParseTreeWalker mywalker = new ParseTreeWalker();
mywalker.walk(myListener, myTree);
The auto-generated *BaseListener has methods like below...
@Override public void enterEveryRule(ParserRuleContext ctx) { }
I'm very new to ANTLR, but as I understand it, I need to extend *BaseListener, override the relevant methods, and write the indenting code there. So my question is: which methods should I override to indent the input code file? Or, if there is an alternative approach I should take, please let me know.
Thanks!
None. You don't need a parser for this task, and by requiring one you limit yourself to valid code (you cannot reformat code that contains a syntax error). Instead, take the lexer and iterate over all tokens. Keep some state to know where you are (a block, a function, whatever) and indent accordingly.
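A minimal sketch of that token-based approach, in Java. The tokenize method is a hand-rolled assumption standing in for the generated lexer, and the only state kept is the brace depth; a real indenter would also need to handle string literals, comments and constructs such as for (;;) headers.

```java
import java.util.ArrayList;
import java.util.List;

class TokenIndenter {
    // Hand-rolled tokenizer (an assumption, standing in for a generated lexer):
    // '{', '}' and ';' are single tokens; other non-whitespace runs are words.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : src.toCharArray()) {
            boolean punct = c == '{' || c == '}' || c == ';';
            if (punct || Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (punct) tokens.add(String.valueOf(c));
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) tokens.add(word.toString());
        return tokens;
    }

    // Walk the token list, using the current brace depth as the only state.
    static String indent(String src) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean lineStart = true;
        for (String t : tokenize(src)) {
            if (t.equals("}")) depth = Math.max(0, depth - 1); // tolerate unbalanced input
            if (lineStart) {
                for (int i = 0; i < depth; i++) out.append("    ");
            } else if (!t.equals(";")) {
                out.append(' ');
            }
            out.append(t);
            // '{', '}' and ';' each end the current output line.
            lineStart = t.equals("{") || t.equals("}") || t.equals(";");
            if (lineStart) out.append('\n');
            if (t.equals("{")) depth++;
        }
        if (!lineStart) out.append('\n');
        return out.toString();
    }
}
```

Because no parser is involved, input with a stray closing brace such as "if (x) { a = 1; } }" is still reformatted rather than rejected.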
I'm creating my first grammar with ANTLR and ANTLRWorks 2. I have mostly finished the grammar itself (it recognizes the code written in the described language and builds correct parse trees), but I haven't started anything beyond that.
What worries me is that every first occurrence of a token in a parser rule is underlined with a yellow squiggle saying "Implicit token definition in parser rule".
For example, in this rule, the 'var' has that squiggle:
variableDeclaration: 'var' IDENTIFIER ('=' expression)?;
The odd thing is that ANTLR itself doesn't seem to mind these rules (when running the test rig, I can't see any of these warnings in the parser generator output, just something about an incorrect Java version being installed on my machine), so it's just ANTLRWorks complaining.
Is it something to worry about, or should I ignore these warnings? Should I declare all the tokens explicitly in lexer rules? Most examples in the official bible, The Definitive ANTLR Reference, seem to be written exactly the way I write my code.
I highly recommend correcting all instances of this warning in code of any importance.
This warning was created (by me actually) to alert you to situations like the following:
shiftExpr : ID (('<<' | '>>') ID)?;
Since ANTLR 4 encourages action code be written in separate files in the target language instead of embedding them directly in the grammar, it's important to be able to distinguish between << and >>. If tokens were not explicitly created for these operators, they will be assigned arbitrary types and no named constants will be available for referencing them.
This warning also helps avoid the following problematic situations:
A parser rule contains a misspelled token reference. Without the warning, this could lead to silent creation of an additional token that may never be matched.
A parser rule contains an unintentional token reference, such as the following:
number : zero | INTEGER;
zero : '0'; // <-- this implicit definition causes 0 to get its own token
If you're writing a lexer grammar that won't be used across multiple parser grammars, then you can ignore this warning shown by ANTLRWorks 2.
Let's suppose I have two grammars (and that there is a Lexer defined somewhere), ParserA and ParserB.
In ParserA I have the following code:
parser grammar ParserA;
classDeclaration
scope {
ST mList;
}
...
ParserB is something like:
parser grammar ParserB;
import ParserA;
methodDeclaration : something something { $classDeclaration::mList.add(...) };
The code in the action will fail to compile (by javac) since classDeclaration is in a different class (and file). Any tips on how to fix it?
No, there's (AFAIK) no ANTLR shortcut here: there's no communication possible between imported grammars (either by using scopes or by providing parameters to imported grammar rules).
I have an ANTLR-generated Java parser that uses the C target, and it works quite well. The problem is that I also want it to parse erroneous code and produce a meaningful AST. If I feed it a minimal Java class with one import that is missing its semicolon, it produces two "Tree Error Node" objects where the "import" token and the tokens for the imported class should be.
But since it parses the following code correctly and produces the correct nodes for this code it must recover from the error by adding the semicolon or by resyncing. Is there a way to make antlr reflect this fixed input it produces internally in the AST? Or can I at least get the tokens/text that produced the "Tree Node Errors" somehow?
In the C target's antlr3commontreeadaptor.c, around line 200, the following fragment indicates that the C target only creates dummy error nodes so far:
static pANTLR3_BASE_TREE
errorNode (pANTLR3_BASE_TREE_ADAPTOR adaptor, pANTLR3_TOKEN_STREAM ctnstream, pANTLR3_COMMON_TOKEN startToken, pANTLR3_COMMON_TOKEN stopToken, pANTLR3_EXCEPTION e)
{
// Use the supplied common tree node stream to get another tree from the factory
// TODO: Look at creating the erronode as in Java, but this is complicated by the
// need to track and free the memory allocated to it, so for now, we just
// want something in the tree that isn't a NULL pointer.
//
return adaptor->createTypeText(adaptor, ANTLR3_TOKEN_INVALID, (pANTLR3_UINT8)"Tree Error Node");
}
Am I out of luck here and only the error nodes the Java target produces would allow me to retrieve the text of the erroneous nodes?
I haven't used antlr much, but typically the way you handle this type of error is to add rules for matching wrong syntax, make them produce error nodes, and try to fix up after errors so that you can keep parsing. Fixing up afterwards is the problem because you don't want one error to trigger more and more errors for each new token until the end.
I solved the problem by adding new alternative rules to the grammar for all possible erroneous statements.
For example, each Java import statement gets translated to an AST subtree with the artificial symbol IMPORT as its root. To make sure that I can differentiate between ASTs from correct and erroneous code, the rules for the erroneous statements rewrite them to an AST whose root symbol has the prefix ERR_, so in the example of the import statement the artificial root symbol would be ERR_IMPORT.
More different root symbols could be used to encode more detailed information about the parse error.
My parser is now as error tolerant as I need it to be and it's very easy to add rules for new kinds of erroneous input whenever I need to do so. You have to watch out to not introduce any ambiguities into your grammar, though.
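The idea can be illustrated outside ANTLR with a small hand-written sketch in Java. This is not the grammar described above: ImportNode and TolerantImportParser are hypothetical names, and the toy parser only understands import statements. It shows the same technique, though: an extra alternative accepts the erroneous form (missing semicolon) and tags the resulting node with an ERR_ root instead of failing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node: a root symbol plus the imported name.
class ImportNode {
    final String root; // "IMPORT" or "ERR_IMPORT"
    final String name;

    ImportNode(String root, String name) {
        this.root = root;
        this.name = name;
    }
}

class TolerantImportParser {
    // Accepts two alternatives:
    //   import <name> ;   -> IMPORT      (well-formed)
    //   import <name>     -> ERR_IMPORT  (error alternative: ';' missing)
    static List<ImportNode> parse(String src) {
        // Put spaces around ';' so that it becomes its own token.
        String[] tokens = src.replace(";", " ; ").trim().split("\\s+");
        List<ImportNode> nodes = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (tokens[i].equals("import") && i + 1 < tokens.length) {
                String name = tokens[i + 1];
                i += 2;
                if (i < tokens.length && tokens[i].equals(";")) {
                    nodes.add(new ImportNode("IMPORT", name));
                    i++;
                } else {
                    // Do not consume the next token, so parsing resumes cleanly
                    // and one error does not cascade into further errors.
                    nodes.add(new ImportNode("ERR_IMPORT", name));
                }
            } else {
                i++; // skip anything this toy parser does not understand
            }
        }
        return nodes;
    }
}
```

For the input "import a.b import c.d;" this yields an ERR_IMPORT node for a.b followed by a regular IMPORT node for c.d, so later passes can tell the repaired statement apart while still seeing its text.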