I'm very new to ANTLR4 and am trying to build my own language. So my grammar starts at
program: <EOF> | statement | functionDef | statement program | functionDef program;
and my statement is
statement: selectionStatement | compoundStatement | ...;
and
selectionStatement
: If LeftParen expression RightParen compoundStatement (Else compoundStatement)?
| Switch LeftParen expression RightParen compoundStatement
;
compoundStatement
: LeftBrace statement* RightBrace;
Now the problem is, that when I test a piece of code against selectionStatement or statement it passes the test, but when I test it against program it fails to recognize. Can anyone help me on this? Thank you very much
edit: the code I use to test is the following:
if (x == 2) {}
It passes the test against selectionStatement and statement but fails at program. It appears that program only accepts if...else
if (x == 2) {} else {}
Edit 2:
The error message I received was
<unknown>: Incorrect error: no viable alternative at input 'if(x==2){}'
Cannot answer your question given the incomplete information provided: the statement rule is partial and the compoundStatement rule is missing.
Nonetheless, there are two techniques you should be using to answer this kind of question yourself (in addition to unit tests).
First, ensure that the lexer is working as expected. This answer shows how to dump the token stream directly.
Second, use a custom ErrorListener to provide a meaningful/detailed description of its parse path to every encountered error. An example:
public class JavaErrorListener extends BaseErrorListener {
public int lastError = -1;
#Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
String msg, RecognitionException e) {
Parser parser = (Parser) recognizer;
String name = parser.getSourceName();
TokenStream tokens = parser.getInputStream();
Token offSymbol = (Token) offendingSymbol;
int thisError = offSymbol.getTokenIndex();
if (offSymbol.getType() == -1 && thisError == tokens.size() - 1) {
Log.debug(this, name + ": Incorrect error: " + msg);
return;
}
String offSymName = JavaLexer.VOCABULARY.getSymbolicName(offSymbol.getType());
List<String> stack = parser.getRuleInvocationStack();
// Collections.reverse(stack);
Log.error(this, name);
Log.error(this, "Rule stack: " + stack);
Log.error(this, "At line " + line + ":" + charPositionInLine + " at " + offSymName + ": " + msg);
if (thisError > lastError + 10) {
lastError = thisError - 10;
}
for (int idx = lastError + 1; idx <= thisError; idx++) {
Token token = tokens.get(idx);
if (token.getChannel() != Token.HIDDEN_CHANNEL) Log.error(this, token.toString());
}
lastError = thisError;
}
}
Note: adjust the Log statements to whatever logging package you are using.
Finally, Antlr doesn't do 'weird' things - just things that you don't understand.
Related
I'm building a language primarily used for calculation purposes. It is a small language with C like syntax but extremely limited functionality. For the past few days, I've been trying to generate code that is encapsulated in curly braces however whenever I enter expressions in curly braces, the code generated is always for the last expression entered. It is supposed to work on a while loop.
For example:
while( true )
{
// some expressions (not using any variables for simplicity)
5 + 9;
8 - 10;
4 * 6;
}
However the code generated only takes into account the last expression (4 * 6) in this case.
The link to the code:
https://codeshare.io/GL0xRk
And also, the code snippet for handling curly braces and some other relative code:
calcul returns [String code]
#init
{
$code = new String();
}
#after
{
System.out.print($code);
for( int i = 0; i < getvarg_count(); ++i )
{
System.out.println("POP");
}
System.out.println("HALT");
}
: (decl
{
// declaration
$code += $decl.code;
})*
NEWLINE*
{
$code += "";
}
(instruction
{
// instruction, eg. x = 5; 7 * 4;
$code += $instruction.code;
System.err.println("instruction found");
})*
;
whileStat returns [String code]
: WHILE '(' condition ')' NEWLINE* block
{
int cur_label = nextLabel();
$code = "LABEL " + cur_label + "\n";
$code += $condition.code;
$code += "JUMPF " + (cur_label + 1) + "\n";
$code += $block.code;
$code += "JUMP " + cur_label + "\n";
$code += "LABEL " + (cur_label + 1) + "\n";
}
;
block returns [String code]
#init
{
$code = new String();
}
: '{' instruction* '}' NEWLINE*
{
System.err.println("block found");
$code += $instruction.code;
System.err.println("curly braces for while found");
}
;
And the compiler code generated:
while(true)
{
5+9;
8-10;
4*6;
}
block found
curly braces for while found
instruction found
LABEL 0
PUSHI 1
JUMPF 1
PUSHI 4
PUSHI 6
MUL
POP
JUMP 0
LABEL 1
HALT
I have a feeling that the $code is always reinitialized. Or maybe it's because I have instruction* in two different rules. I'm not sure how else to handle this problem. All help is much appreciated.
Thank you
Anyway, it looks like your problem is that $instruction in block's action only refers to the last instruction because the block is outside of the *, so the action only gets run once.
You can either move the action inside the * like you did in the calcul rule or you can put all the instructions in a list with instructions+=instruction* and then use $instructions in the action (or better: a listener or visitor).
PS: I strongly recommend to use a listener or visitor instead of having actions all over your grammar. They make the grammar very hard to read.
I asked a question a couple of weeks ago about my ANTLR grammar (My simple ANTLR grammar is not working as expected). Since asking that question, I've done more digging and debugging and gotten most of the kinks out. I am left with one issue, though.
My generated parser code is not picking up invalid tokens in one particular part of the text that is processed. The lexer is properly breaking things into tokens, but the parser does not kick out invalid tokens in some cases. In particular, when the invalid token is at the end of a phrase like "A and "B", the parser ignores it - it's like the token isn't even there.
Some specific examples:
"A and B" - perfectly valid
"A# and B" - parser properly picks up the invalid # token
"A and #B" - parser properly picks up the invalid # token
"A and B#" - here's the mystery - the lexer finds the # token and the parser IGNORES it (!)
"(A and B#) or C" - further mystery - the lexer finds the # token and the parser IGNORES it (!)
Here is my grammar:
grammar QvidianPlaybooks;
options{ language=CSharp3; output=AST; ASTLabelType = CommonTree; }
public parse
: expression
;
LPAREN : '(' ;
RPAREN : ')' ;
ANDOR : 'AND'|'and'|'OR'|'or';
NAME : ('A'..'Z');
WS : ' ' { $channel = Hidden; };
THEREST : .;
// ***************** parser rules:
expression : anexpression EOF!;
anexpression : atom (ANDOR^ atom)*;
atom : NAME | LPAREN! anexpression RPAREN!;
The code that then processes the resulting tree looks like this:
... from the main program
QvidianPlaybooksLexer lexer = new QvidianPlaybooksLexer(new ANTLRStringStream(src));
QvidianPlaybooksParser parser = new QvidianPlaybooksParser(new CommonTokenStream(lexer));
parser.TreeAdaptor = new CommonTreeAdaptor();
CommonTree tree = (CommonTree)parser.parse().Tree;
ValidateTree(tree, 0, iValidIdentifierCount);
// recursive code that walks the tree
public static RuleLogicValidationResult ValidateTree(ITree Tree, int depth, int conditionCount)
{
RuleLogicValidationResult rlvr = null;
if (Tree != null)
{
CommonErrorNode commonErrorNode = Tree as CommonErrorNode;
if (null != commonErrorNode)
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
Console.WriteLine(rlvr.ToString());
}
else
{
string strTree = Tree.ToString();
strTree = strTree.Trim();
strTree = strTree.ToUpper();
if ((Tree.ChildCount != 0) && (Tree.ChildCount != 2))
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
rlvr.InvalidIdentifier = strTree;
rlvr.ErrorPosition = 0;
Console.WriteLine(String.Format("CHILD COUNT of {0} = {1}", strTree, tree.ChildCount));
}
// if the current node is valid, then validate the two child nodes
if (null == rlvr || rlvr.IsValid)
{
// output the tree node
for (int i = 0; i < depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
rlvr = ValidateTree(Tree.GetChild(0), depth + 1, conditionCount);
if (rlvr.IsValid)
{
rlvr = ValidateTree(Tree.GetChild(1), depth + 1, conditionCount);
}
}
else
{
Console.WriteLine(rlvr.ToString());
}
}
}
else
{
// this tree is null, return a "it's valid" result
rlvr = new RuleLogicValidationResult();
rlvr.ErrorType = LogicValidationErrorType.None;
rlvr.IsValid = true;
}
return rlvr;
}
Add EOF to the end of your start rule. :)
I'm trying to implement something like a Code Contracts feature for JavaScript as an assignment for one of my courses.
The problem I'm having is that I can't seem to find a way to output the source file directly to the console without modifying the entire grammar.
Does anybody knows a way to achieve this?
Thanks in advance.
Here's an example of what I'm trying to do:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
Contract.Requires(num < 1000);
Contract.Requires<TypeError>(arr instanceOf Array);
Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9);
Contract.Requires<ReferenceError>(text != null);
Contract.Ensures<RangeError>(text.length === 0);
// method body
[...]
return text;
}
function DoClear(num, arr, text){
if (!(num > 0))
throw RangeError;
if (!(num < 1000))
throw Error;
if (!(arr instanceOf Array))
throw TypeError;
if (!(arr.length > 0 && arr.length <= 9))
throw RangeError;
if (!(text != null))
throw ReferenceError
// method body
[...]
if (!(text.length === 0))
throw RangeError
else
return text;
}
There are a few (minor) things you'll want to consider:
ignore string literals that might contain your special contract-syntax;
ignore multi- and single line comments that might contain your special Contract syntax;
ignore code like this: var Requires = "Contract.Requires<RangeError>"; (i.e. regular JavaScript code that "looks like" your contract-syntax);
It's pretty straight forward to take the points above into account and also simply create single tokens for an entire contract-line. You'll be making your life hard when tokenizing the following into 4 different tokens Contract.Requires<RangeError>(num > 0):
Contract
Requires
<RangeError>
(num > 0)
So it's easiest to create a single token from it, and at the parsing phase, split the token on ".", "<" or ">" with a maximum of 4 tokens (leaving expressions containing ".", "<" or ">" as they are).
A quick demo of what I described above might look like this:
grammar CCJS;
parse
: atom+ EOF
;
atom
: code_contract
| (Comment | String | Any) {System.out.print($text);}
;
code_contract
: Contract
{
String[] tokens = $text.split("[.<>]", 4);
System.out.print("if (!" + tokens[3] + ") throw " + tokens[2]);
}
;
Contract
#init{
boolean hasType = false;
}
#after{
if(!hasType) {
// inject a generic Error if this contract has no type
setText(getText().replaceFirst("\\(", "<Error>("));
}
}
: 'Contract.' ('Requires' | 'Ensures') ('<' ('a'..'z' | 'A'..'Z')+ '>' {hasType=true;})? '(' ~';'+
;
Comment
: '//' ~('\r' | '\n')*
| '/*' .* '*/'
;
String
: '"' (~('\\' | '"' | '\r' | '\n') | '\\' . )* '"'
;
Any
: .
;
which you can test with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src =
"/* \n" +
" Contract.Requires to be ignored \n" +
"*/ \n" +
"function DoClear(num, arr, text){ \n" +
" Contract.Requires<RangeError>(num > 0); \n" +
" Contract.Requires(num < 1000); \n" +
" Contract.Requires<TypeError>(arr instanceOf Array); \n" +
" Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9); \n" +
" Contract.Requires<ReferenceError>(text != null); \n" +
" Contract.Ensures<RangeError>(text.length === 0); \n" +
" \n" +
" // method body \n" +
" // and ignore single line comments, Contract.Ensures \n" +
" var s = \"Contract.Requires\"; // also ignore strings \n" +
" \n" +
" return text; \n" +
"} \n";
CCJSLexer lexer = new CCJSLexer(new ANTLRStringStream(src));
CCJSParser parser = new CCJSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If you run the Main class above, the following will be printed to the console:
/*
Contract.Requires to be ignored
*/
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
if (!(num < 1000)) throw Error;
if (!(arr instanceOf Array)) throw TypeError;
if (!(arr.length > 0 && arr.length <= 9)) throw RangeError;
if (!(text != null)) throw ReferenceError;
if (!(text.length === 0)) throw RangeError;
// method body
// and ignore single line comments, Contract.Ensures
var s = "Contract.Requires"; // also ignore strings
return text;
}
BUT ...
... I realize that it isn't what you're exactly looking for: the RangeError is not placed at the end of your function. And that's going to be tough one: a function might have multiple returns, and is likely to have multiple code blocks { ... } making it difficult to know where the } is that ends the function. So you don't know where exactly to inject this RangeError-check. At least, not with a naive approach as I demonstrated.
The only reliable way to implement such a thing is to get a decent JavaScript grammar, add your own contract-rules to it, rewrite the AST the parser produces, and finally emit the new AST in a friendly-formatted way: not a trivial task, to say the least!
There are various ECMA/JS grammars on the ANTLR Wiki, but tread with care: they are user-committed grammars and may contain errors (probably will in this case[1]!).
If you choose to place the RangeError there where it should be rewritte, like so:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
...
// method body
...
Contract.Ensures<RangeError>(text.length === 0);
return text;
}
which would result in:
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
...
// method body
...
if (!(text.length === 0))
throw RangeError
return text;
}
then you need not parse the entire method body, and you might get away with a hack as I proposed.
Best of luck!
[1] the last time I checked these ECMA/JS script grammars, none of them handled regex literals, /pattern/, properly, making them in my opinion suspect.
The java code generated from ANTLR is one rule, one method in most times. But for the following rule:
switchBlockLabels[ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts]
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel[_entity, _method, _preStmts]* switchDefaultLabel? switchCaseLabel*)
;
it generates a submethod named synpred125_TreeParserStage3_fragment(), in which mehod switchCaseLabel(_entity, _method, _preStmts) is called:
synpred125_TreeParserStage3_fragment(){
......
switchCaseLabel(_entity, _method, _preStmts);//variable not found error
......
}
switchBlockLabels(ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts){
......
synpred125_TreeParserStage3_fragment();
......
}
The problem is switchCaseLabel has parameters and the parameters come from the parameters of switchBlockLabels() method, so "variable not found error" occurs.
How can I solve this problem?
My guess is that you've enabled global backtracking in your grammar like this:
options {
backtrack=true;
}
in which case you can't pass parameters to ambiguous rules. In order to communicate between ambiguous rules when you have enabled global backtracking, you must use rule scopes. The "predicate-methods" do have access to rule scopes variables.
A demo
Let's say we have this ambiguous grammar:
grammar Scope;
options {
backtrack=true;
}
parse
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName
: Number
| Name
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
(for the record, the atom+ and numberOrName+ make it ambiguous)
If you now want to pass information between the parse and numberOrName rule, say an integer n, something like this will fail (which is the way you tried it):
grammar Scope;
options {
backtrack=true;
}
parse
#init{int n = 0;}
: (atom[++n])+ EOF
;
atom[int n]
: (numberOrName[n])+
;
numberOrName[int n]
: Number {System.out.println(n + " = " + $Number.text);}
| Name {System.out.println(n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
In order to do this using rule scopes, you could do it like this:
grammar Scope;
options {
backtrack=true;
}
parse
scope{int n; /* define the scoped variable */ }
#init{$parse::n = 0; /* important: initialize the variable! */ }
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName /* increment and print the scoped variable from the parse rule */
: Number {System.out.println(++$parse::n + " = " + $Number.text);}
| Name {System.out.println(++$parse::n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Test
If you now run the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "foo 42 Bar 666";
ScopeLexer lexer = new ScopeLexer(new ANTLRStringStream(src));
ScopeParser parser = new ScopeParser(new CommonTokenStream(lexer));
parser.parse();
}
}
you will see the following being printed to the console:
1 = foo
2 = 42
3 = Bar
4 = 666
P.S.
I don't know what language you're parsing, but enabling global backtracking is usually overkill and can have quite an impact on the performance of your parser. Computer languages often are ambiguous in just a few cases. Instead of enabling global backtracking, you really should look into adding syntactic predicates, or enabling backtracking on those rules that are ambiguous. See The Definitive ANTLR Reference for more info.
I am writing an ANTRL grammar for translating one language to another but the documentation on using the HIDDEN channel is very scarce. I cannot find an example anywhere. The only thing I have found is the FAQ on www.antlr.org which tells you how to access the hidden channel but not how best to use this functionality. The target language is Java.
In my grammar file, I pass whitespace and comments through like so:
// Send runs of space and tab characters to the hidden channel.
WHITESPACE
: (SPACE | TAB)+ { $channel = HIDDEN; }
;
// Single-line comments begin with --
SINGLE_COMMENT
: ('--' COMMENT_CHARS NEWLINE) {
$channel=HIDDEN;
}
;
fragment COMMENT_CHARS
: ~('\r' | '\n')*
;
// Treat runs of newline characters as a single NEWLINE token.
NEWLINE
: ('\r'? '\n')+ { $channel = HIDDEN; }
;
In my members section I have defined a method for writing hidden channel tokens to my output StringStream...
#members {
private int savedIndex = 0;
void ProcessHiddenChannel(TokenStream input) {
List<Token> tokens = ((CommonTokenStream)input).getTokens(savedIndex, input.index());
for(Token token: tokens) {
if(token.getChannel() == token.HIDDEN_CHANNEL) {
output.append(token.getText());
}
}
savedIndex = input.index();
}
}
Now to use this, I have to call the method after every single token in my grammar.
myParserRule
: MYTOKEN1 { ProcessHiddenChannel(input); }
MYTOKEN2 { ProcessHiddenChannel(input); }
;
Surely there must be a better way?
EDIT: This is an example of the input language:
-- -----------------------------------------------------------------
--
--
-- Name Description
-- ==================================
-- IFM1/183 Freq Spectrum Inversion
--
-- -----------------------------------------------------------------
PROCEDURE IFM1/183
TITLE "Freq Spectrum Inversion";
HELP
Freq Spectrum Inversion
ENDHELP;
PRIVILEGE CTRL;
WINDOW MANDATORY;
INPUT
$Input : #NO_YES
DEFAULT select %YES when /IFMS1/183.VALUE = %NO;
%NO otherwise
endselect
PROMPT "Spec Inv";
$Forced_Cmd : BOOLEAN
Default FALSE
Prompt "Forced Commanding";
DEFINE
&RetCode : #PSTATUS := %OK;
&msg : STRING;
&Input : BOOLEAN;
REQUIRE AVAILABLE(/IFMS1)
MSG "IFMS1 not available";
REQUIRE /IFMS1/001.VALUE = %MON_AND_CTRL
MSG "IFMS1 not in control mode";
BEGIN -- Procedure Body --
&msg := "IFMS1/183 -> " + toString($Input) + " : ";
-- pre-check
IF /IFMS1/183.VALUE = $Input
AND $Forced_Cmd = FALSE THEN
EXIT (%OK, MSG &msg + "already set");
ENDIF;
-- command
IF $Input = %YES THEN &Input:= TRUE;
ELSE &Input:= FALSE;
ENDIF;
SET &RetCode := SEND IFMS1.FREQPLAN
( $FreqSpecInv := &Input);
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "command failed");
ENDIF;
-- verify
SET &RetCode := VERIFY /IFMS1/183.VALUE = $Input TIMEOUT '10';
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "verification failed");
ELSE
EXIT (&RetCode, MSG &msg + "verified");
ENDIF;
END
Look into inheriting CommonTokenStream and feeding an instance of your subclass into ANTLR. From the code example that you give, I suspect that you might be interested in taking a look at the filter and the rewrite options available in version 3.
Also, take a look at this other related stack overflow question.
I have just been going through some of my old questions and thought it was worth responding with the final solution that worked the best. In the end, the best way to translate a language was to use StringTemplate. This takes care of re-indenting the output for you. There is a very good example called 'cminus' in the ANTLR example pack that shows how to use it.