ANTLR Source to Output - antlr

I'm trying to implement something like a Code Contracts feature for JavaScript as an assignment for one of my courses.
The problem I'm having is that I can't seem to find a way to output the source file directly to the console without modifying the entire grammar.
Does anybody know a way to achieve this?
Thanks in advance.
Here's an example of what I'm trying to do:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
Contract.Requires(num < 1000);
Contract.Requires<TypeError>(arr instanceOf Array);
Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9);
Contract.Requires<ReferenceError>(text != null);
Contract.Ensures<RangeError>(text.length === 0);
// method body
[...]
return text;
}
function DoClear(num, arr, text){
if (!(num > 0))
throw RangeError;
if (!(num < 1000))
throw Error;
if (!(arr instanceOf Array))
throw TypeError;
if (!(arr.length > 0 && arr.length <= 9))
throw RangeError;
if (!(text != null))
throw ReferenceError
// method body
[...]
if (!(text.length === 0))
throw RangeError
else
return text;
}

There are a few (minor) things you'll want to consider:
ignore string literals that might contain your special contract-syntax;
ignore multi- and single line comments that might contain your special Contract syntax;
ignore code like this: var Requires = "Contract.Requires<RangeError>"; (i.e. regular JavaScript code that "looks like" your contract-syntax);
It's pretty straightforward to take the points above into account and to simply create a single token for an entire contract line. You'll be making your life hard if you tokenize Contract.Requires<RangeError>(num > 0) into the following 4 tokens:
Contract
Requires
<RangeError>
(num > 0)
So it's easiest to create a single token from it, and at the parsing phase, split the token on ".", "<" or ">" with a maximum of 4 tokens (leaving expressions containing ".", "<" or ">" as they are).
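The split-with-a-limit idea can be tried on its own in plain Java, outside any grammar. This is only a sketch: the class and method names (ContractRewrite, rewrite) are made up for illustration, and it assumes the contract text arrives without its trailing semicolon. An untyped contract falls back to a generic Error, matching the expected output in the question.

```java
final class ContractRewrite {
    // Turns one contract (without the trailing ';') into an if/throw guard.
    static String rewrite(String contract) {
        // No explicit <Type> after Requires/Ensures: inject a generic Error.
        if (!contract.matches("(?s)Contract\\.(Requires|Ensures)<.*")) {
            contract = contract.replaceFirst("\\(", "<Error>(");
        }
        // A limit of 4 means at most three splits on '.', '<' or '>',
        // so any '.', '<' or '>' inside the expression is left alone.
        String[] parts = contract.split("[.<>]", 4);
        return "if (!" + parts[3] + ") throw " + parts[2];
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9)"));
        System.out.println(rewrite("Contract.Requires(num < 1000)"));
    }
}
```

The limit argument is what keeps arr.length and <= intact: after the third split, the remainder of the string is returned as one piece.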
A quick demo of what I described above might look like this:
grammar CCJS;
parse
: atom+ EOF
;
atom
: code_contract
| (Comment | String | Any) {System.out.print($text);}
;
code_contract
: Contract
{
String[] tokens = $text.split("[.<>]", 4);
System.out.print("if (!" + tokens[3] + ") throw " + tokens[2]);
}
;
Contract
@init{
boolean hasType = false;
}
@after{
if(!hasType) {
// inject a generic Error if this contract has no type
setText(getText().replaceFirst("\\(", "<Error>("));
}
}
: 'Contract.' ('Requires' | 'Ensures') ('<' ('a'..'z' | 'A'..'Z')+ '>' {hasType=true;})? '(' ~';'+
;
Comment
: '//' ~('\r' | '\n')*
| '/*' .* '*/'
;
String
: '"' (~('\\' | '"' | '\r' | '\n') | '\\' . )* '"'
;
Any
: .
;
which you can test with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src =
"/* \n" +
" Contract.Requires to be ignored \n" +
"*/ \n" +
"function DoClear(num, arr, text){ \n" +
" Contract.Requires<RangeError>(num > 0); \n" +
" Contract.Requires(num < 1000); \n" +
" Contract.Requires<TypeError>(arr instanceOf Array); \n" +
" Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9); \n" +
" Contract.Requires<ReferenceError>(text != null); \n" +
" Contract.Ensures<RangeError>(text.length === 0); \n" +
" \n" +
" // method body \n" +
" // and ignore single line comments, Contract.Ensures \n" +
" var s = \"Contract.Requires\"; // also ignore strings \n" +
" \n" +
" return text; \n" +
"} \n";
CCJSLexer lexer = new CCJSLexer(new ANTLRStringStream(src));
CCJSParser parser = new CCJSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If you run the Main class above, the following will be printed to the console:
/*
Contract.Requires to be ignored
*/
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
if (!(num < 1000)) throw Error;
if (!(arr instanceOf Array)) throw TypeError;
if (!(arr.length > 0 && arr.length <= 9)) throw RangeError;
if (!(text != null)) throw ReferenceError;
if (!(text.length === 0)) throw RangeError;
// method body
// and ignore single line comments, Contract.Ensures
var s = "Contract.Requires"; // also ignore strings
return text;
}
BUT ...
... I realize that it isn't exactly what you're looking for: the RangeError check is not placed at the end of your function. And that's going to be a tough one: a function might have multiple returns, and is likely to have multiple code blocks { ... }, making it difficult to know which } ends the function. So you don't know where exactly to inject this RangeError check. At least, not with a naive approach like the one I demonstrated.
The only reliable way to implement such a thing is to get a decent JavaScript grammar, add your own contract-rules to it, rewrite the AST the parser produces, and finally emit the new AST in a friendly-formatted way: not a trivial task, to say the least!
There are various ECMA/JS grammars on the ANTLR Wiki, but tread with care: they are user-committed grammars and may contain errors (probably will in this case[1]!).
If you choose to place the RangeError contract at the spot where it should be rewritten, like so:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
...
// method body
...
Contract.Ensures<RangeError>(text.length === 0);
return text;
}
which would result in:
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
...
// method body
...
if (!(text.length === 0))
throw RangeError
return text;
}
then you need not parse the entire method body, and you might get away with a hack as I proposed.
Best of luck!
[1] the last time I checked these ECMA/JS script grammars, none of them handled regex literals, /pattern/, properly, making them in my opinion suspect.

Related

ANTLR4 generating code for the last expression entered in curly braces

I'm building a language primarily used for calculation purposes. It is a small language with C like syntax but extremely limited functionality. For the past few days, I've been trying to generate code that is encapsulated in curly braces however whenever I enter expressions in curly braces, the code generated is always for the last expression entered. It is supposed to work on a while loop.
For example:
while( true )
{
// some expressions (not using any variables for simplicity)
5 + 9;
8 - 10;
4 * 6;
}
However the code generated only takes into account the last expression (4 * 6) in this case.
The link to the code:
https://codeshare.io/GL0xRk
And also, the code snippet for handling curly braces and some other relative code:
calcul returns [String code]
@init
{
$code = new String();
}
@after
{
System.out.print($code);
for( int i = 0; i < getvarg_count(); ++i )
{
System.out.println("POP");
}
System.out.println("HALT");
}
: (decl
{
// declaration
$code += $decl.code;
})*
NEWLINE*
{
$code += "";
}
(instruction
{
// instruction, eg. x = 5; 7 * 4;
$code += $instruction.code;
System.err.println("instruction found");
})*
;
whileStat returns [String code]
: WHILE '(' condition ')' NEWLINE* block
{
int cur_label = nextLabel();
$code = "LABEL " + cur_label + "\n";
$code += $condition.code;
$code += "JUMPF " + (cur_label + 1) + "\n";
$code += $block.code;
$code += "JUMP " + cur_label + "\n";
$code += "LABEL " + (cur_label + 1) + "\n";
}
;
block returns [String code]
@init
{
$code = new String();
}
: '{' instruction* '}' NEWLINE*
{
System.err.println("block found");
$code += $instruction.code;
System.err.println("curly braces for while found");
}
;
And the compiler code generated:
while(true)
{
5+9;
8-10;
4*6;
}
block found
curly braces for while found
instruction found
LABEL 0
PUSHI 1
JUMPF 1
PUSHI 4
PUSHI 6
MUL
POP
JUMP 0
LABEL 1
HALT
I have a feeling that the $code is always reinitialized. Or maybe it's because I have instruction* in two different rules. I'm not sure how else to handle this problem. All help is much appreciated.
Thank you
Anyway, it looks like your problem is that $instruction in block's action only refers to the last instruction because the block is outside of the *, so the action only gets run once.
You can either move the action inside the * like you did in the calcul rule or you can put all the instructions in a list with instructions+=instruction* and then use $instructions in the action (or better: a listener or visitor).
PS: I strongly recommend to use a listener or visitor instead of having actions all over your grammar. They make the grammar very hard to read.
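To make the difference concrete, here is a plain-Java sketch (no ANTLR involved) of what the two placements of the action amount to. The names BlockCodeGen, lastOnly and allInstructions are hypothetical, and the ADD and SUB mnemonics are assumed by analogy with the MUL shown in the question's output.

```java
import java.util.Arrays;
import java.util.List;

final class BlockCodeGen {
    // Mimics an action placed after `instruction*`: only the last match is visible.
    static String lastOnly(List<String> instructionCode) {
        return instructionCode.isEmpty() ? "" : instructionCode.get(instructionCode.size() - 1);
    }

    // Mimics an action that runs once per instruction, or one that walks a collected list.
    static String allInstructions(List<String> instructionCode) {
        StringBuilder code = new StringBuilder();
        for (String c : instructionCode) {
            code.append(c);
        }
        return code.toString();
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
            "PUSHI 5\nPUSHI 9\nADD\nPOP\n",   // 5 + 9;  (ADD is an assumed mnemonic)
            "PUSHI 8\nPUSHI 10\nSUB\nPOP\n",  // 8 - 10; (SUB is an assumed mnemonic)
            "PUSHI 4\nPUSHI 6\nMUL\nPOP\n");  // 4 * 6;
        System.out.print(lastOnly(body));         // code for 4 * 6 only
        System.out.println("----");
        System.out.print(allInstructions(body));  // code for all three expressions
    }
}
```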

ANTLR4: Unexpected behavior that I can't understand

I'm very new to ANTLR4 and am trying to build my own language. So my grammar starts at
program: <EOF> | statement | functionDef | statement program | functionDef program;
and my statement is
statement: selectionStatement | compoundStatement | ...;
and
selectionStatement
: If LeftParen expression RightParen compoundStatement (Else compoundStatement)?
| Switch LeftParen expression RightParen compoundStatement
;
compoundStatement
: LeftBrace statement* RightBrace;
Now the problem is, that when I test a piece of code against selectionStatement or statement it passes the test, but when I test it against program it fails to recognize. Can anyone help me on this? Thank you very much
edit: the code I use to test is the following:
if (x == 2) {}
It passes the test against selectionStatement and statement but fails at program. It appears that program only accepts if...else
if (x == 2) {} else {}
Edit 2:
The error message I received was
<unknown>: Incorrect error: no viable alternative at input 'if(x==2){}'
Cannot answer your question given the incomplete information provided: the statement rule is partial and the compoundStatement rule is missing.
Nonetheless, there are two techniques you should be using to answer this kind of question yourself (in addition to unit tests).
First, ensure that the lexer is working as expected. This answer shows how to dump the token stream directly.
Second, use a custom ErrorListener to provide a meaningful/detailed description of its parse path to every encountered error. An example:
public class JavaErrorListener extends BaseErrorListener {
public int lastError = -1;
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
String msg, RecognitionException e) {
Parser parser = (Parser) recognizer;
String name = parser.getSourceName();
TokenStream tokens = parser.getInputStream();
Token offSymbol = (Token) offendingSymbol;
int thisError = offSymbol.getTokenIndex();
if (offSymbol.getType() == -1 && thisError == tokens.size() - 1) {
Log.debug(this, name + ": Incorrect error: " + msg);
return;
}
String offSymName = JavaLexer.VOCABULARY.getSymbolicName(offSymbol.getType());
List<String> stack = parser.getRuleInvocationStack();
// Collections.reverse(stack);
Log.error(this, name);
Log.error(this, "Rule stack: " + stack);
Log.error(this, "At line " + line + ":" + charPositionInLine + " at " + offSymName + ": " + msg);
if (thisError > lastError + 10) {
lastError = thisError - 10;
}
for (int idx = lastError + 1; idx <= thisError; idx++) {
Token token = tokens.get(idx);
if (token.getChannel() != Token.HIDDEN_CHANNEL) Log.error(this, token.toString());
}
lastError = thisError;
}
}
Note: adjust the Log statements to whatever logging package you are using.
Finally, Antlr doesn't do 'weird' things - just things that you don't understand.

How to rewrite token stream more than once using ANTLR4

I am implementing a simple preprocessor using the great ANTLR4 library. The program itself runs in several iterations; in each iteration the future output is modified slightly.
Currently I use TokenStreamRewriter and its methods delete, insertAfter, replace and getText.
Unfortunately, I can't manage to rewrite tokens that were already rewritten (I get an IllegalArgumentException). This is not a bug: according to the source code, multiple replacements of the same tokens can't be achieved in any way.
I suppose that a proper solution exists, as this appears to be a common problem. Could anyone please give me a hint? I'd rather use an existing, tested solution than reimplement the rewriter itself.
Maybe the rewriter isn't the right tool to use.
Thanks for help
Good evening
Here is a more dynamic version of the code for the same problem. First, the token stream and the rewriter must be visible in your listener class.
Here is the constructor of my VB6MYListener class:
class VB6MYListener : public VB6ParserListener {
public: string FicName;
int numero ; // rule number
wstring BaseFile;
CommonTokenStream* TOK ;
TokenStreamRewriter* Rewriter ;
// Translation functions: the listener's void functions created by ANTLR4 (contexts)
VB6MYListener( CommonTokenStream* tok , wstring baseFile, TokenStreamRewriter* rewriter , string Name)
{
TOK = tok; // token stream
BaseFile = baseFile; // VB6 input file
Rewriter = rewriter;
FicName = Name; // name of the current file, for tracking
}
Here is a context I visit with the listener. The token stream TOK is visible to all of the listener's void functions:
std::string retourchaine;
std::vector<std::string> TAB{};
for (int i = ctx->start->getTokenIndex(); i <= ctx->stop->getTokenIndex(); i++)
{
TAB.push_back(TOK->get(i)->getText()); // HERE TOK
}
for (auto element : TAB)
{
if (element == "=") { element = ":="; }
if (element != "As" && element != "Private" && element != "Public")
{
std::cout << element << std::endl;
retourchaine += element ; // string returned to the context
}
}
retourchaine = retourchaine + " ;";
Rewriter->replace(ctx->start, ctx->stop, retourchaine );
This is a workaround I use because I need to make replacements inside a token range, and the TokenStreamRewriter does not do the job correctly when there are multiple replacements in one context.
In each context I can make the token stream visible. I copy all the tokens of the context into an array, build a string with the replacements applied, and then call Rewriter->replace( ctx->start , ctx->stop , tokentext ) ;
Some code for a context:
string TAB[265];
string tokentext = "";
for (int i = ctx->start->getTokenIndex(); i <= ctx->stop->getTokenIndex(); i++)
{
TAB[i] = TOK->get(i)->getText();
// if (TOK->get(i)->getText() != "As" && TOK->get(i)->getText() != "Private" && TOK->get(i)->getText() != "Public")
//if (TOK->get(i)->getText() == "=")
//{
if (TAB[i] == "=") { TAB[i] = ":="; }
// if (TAB[i] == "=") { TAB[i] = "="; } // other changes
if (TAB[i] != "As" && TAB[i] != "Private" && TAB[i] != "Public") { tokentext += TAB[i]; }
cout << "nombre de tokens du contexte" << endl;
cout << i << endl;
}
tokentext = tokentext + " ;";
cout << tokentext << endl;
Rewriter->replace(ctx->start, ctx->stop, tokentext);
This is the basic code I use to make the job robust. I hope it is useful.
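The core of this workaround, building one replacement string for the whole context and then issuing a single replace, does not depend on the C++ runtime. Below is a rough Java rendering of just the string-building step. It works on plain token texts rather than real ANTLR tokens, and the names ContextRewrite and buildReplacement are invented for this sketch.

```java
import java.util.Arrays;
import java.util.List;

final class ContextRewrite {
    // Builds the single replacement string for one context from its token texts.
    static String buildReplacement(List<String> tokenTexts) {
        StringBuilder out = new StringBuilder();
        for (String text : tokenTexts) {
            if (text.equals("=")) {
                text = ":=";   // assignment operator of the target language
            }
            if (text.equals("As") || text.equals("Private") || text.equals("Public")) {
                continue;      // keywords dropped from the output
            }
            out.append(text);
        }
        return out.append(" ;").toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Private", " ", "x", " ", "=", " ", "1");
        System.out.println(buildReplacement(tokens));
    }
}
```

Because every change inside the context is folded into one string, the rewriter only ever sees one replace operation per token range, which is what sidesteps the multiple-replacement problem.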
I think that rewriting the token stream is not a good idea, because you can't handle the general case of a tree. ANTLR's TokenStreamRewriter is not useful for this. If you use a listener, you can't change the parse tree and the contexts created by ANTLR; you must use a BufferedWriter to write the contexts you change locally to the final output file of your translation.
Thanks to Ewa Hechsman and her program on GitHub, a transpiler from Pascal to Python.
I think it's a real solution for a professional project.
So I agree with Ira Baxter: we need tree rewriting.

My simple ANTLR grammar ignores certain invalid tokens when parsing

I asked a question a couple of weeks ago about my ANTLR grammar (My simple ANTLR grammar is not working as expected). Since asking that question, I've done more digging and debugging and gotten most of the kinks out. I am left with one issue, though.
My generated parser code is not picking up invalid tokens in one particular part of the text that is processed. The lexer is properly breaking things into tokens, but the parser does not kick out invalid tokens in some cases. In particular, when the invalid token is at the end of a phrase like "A and B", the parser ignores it, as if the token weren't even there.
Some specific examples:
"A and B" - perfectly valid
"A# and B" - parser properly picks up the invalid # token
"A and #B" - parser properly picks up the invalid # token
"A and B#" - here's the mystery - the lexer finds the # token and the parser IGNORES it (!)
"(A and B#) or C" - further mystery - the lexer finds the # token and the parser IGNORES it (!)
Here is my grammar:
grammar QvidianPlaybooks;
options{ language=CSharp3; output=AST; ASTLabelType = CommonTree; }
public parse
: expression
;
LPAREN : '(' ;
RPAREN : ')' ;
ANDOR : 'AND'|'and'|'OR'|'or';
NAME : ('A'..'Z');
WS : ' ' { $channel = Hidden; };
THEREST : .;
// ***************** parser rules:
expression : anexpression EOF!;
anexpression : atom (ANDOR^ atom)*;
atom : NAME | LPAREN! anexpression RPAREN!;
The code that then processes the resulting tree looks like this:
... from the main program
QvidianPlaybooksLexer lexer = new QvidianPlaybooksLexer(new ANTLRStringStream(src));
QvidianPlaybooksParser parser = new QvidianPlaybooksParser(new CommonTokenStream(lexer));
parser.TreeAdaptor = new CommonTreeAdaptor();
CommonTree tree = (CommonTree)parser.parse().Tree;
ValidateTree(tree, 0, iValidIdentifierCount);
// recursive code that walks the tree
public static RuleLogicValidationResult ValidateTree(ITree Tree, int depth, int conditionCount)
{
RuleLogicValidationResult rlvr = null;
if (Tree != null)
{
CommonErrorNode commonErrorNode = Tree as CommonErrorNode;
if (null != commonErrorNode)
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
Console.WriteLine(rlvr.ToString());
}
else
{
string strTree = Tree.ToString();
strTree = strTree.Trim();
strTree = strTree.ToUpper();
if ((Tree.ChildCount != 0) && (Tree.ChildCount != 2))
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
rlvr.InvalidIdentifier = strTree;
rlvr.ErrorPosition = 0;
Console.WriteLine(String.Format("CHILD COUNT of {0} = {1}", strTree, Tree.ChildCount));
}
// if the current node is valid, then validate the two child nodes
if (null == rlvr || rlvr.IsValid)
{
// output the tree node
for (int i = 0; i < depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
rlvr = ValidateTree(Tree.GetChild(0), depth + 1, conditionCount);
if (rlvr.IsValid)
{
rlvr = ValidateTree(Tree.GetChild(1), depth + 1, conditionCount);
}
}
else
{
Console.WriteLine(rlvr.ToString());
}
}
}
else
{
// this tree is null, return a "it's valid" result
rlvr = new RuleLogicValidationResult();
rlvr.ErrorType = LogicValidationErrorType.None;
rlvr.IsValid = true;
}
return rlvr;
}
Add EOF to the end of your start rule. :)
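Why the end-of-input check matters can be seen in a small hand-written recursive-descent parser for the same two rules. This is only an illustration in plain Java, not ANTLR-generated code, and it does not model ANTLR's error recovery; the class name EofDemo and the requireEof flag are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EofDemo {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    EofDemo(String src) {
        // Crude lexer: runs of letters, or any other single non-space character.
        Matcher m = Pattern.compile("[A-Za-z]+|\\S").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    // Current token, or "" once the input is exhausted.
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private static boolean isAndOr(String t) {
        return t.equals("AND") || t.equals("and") || t.equals("OR") || t.equals("or");
    }

    // anexpression : atom (ANDOR atom)*
    private void anexpression() {
        atom();
        while (isAndOr(peek())) { pos++; atom(); }
    }

    // atom : NAME | '(' anexpression ')'
    private void atom() {
        String t = peek();
        if (t.matches("[A-Z]")) {
            pos++;
        } else if (t.equals("(")) {
            pos++;
            anexpression();
            if (!peek().equals(")")) throw new IllegalStateException("expected ')'");
            pos++;
        } else {
            throw new IllegalStateException("unexpected token: " + t);
        }
    }

    // With requireEof == false the parser stops as soon as the rule is satisfied.
    static boolean accepts(String src, boolean requireEof) {
        EofDemo p = new EofDemo(src);
        try {
            p.anexpression();
            return !requireEof || p.peek().isEmpty();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("A and B#", false)); // trailing '#' is never looked at
        System.out.println(accepts("A and B#", true));  // trailing '#' is rejected
    }
}
```

Without the end-of-input check, the rule is complete after B, so nothing ever asks what the next token is.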

Handle hidden channel in antlr 3

I am writing an ANTLR grammar for translating one language to another, but the documentation on using the HIDDEN channel is very scarce. I cannot find an example anywhere. The only thing I have found is the FAQ on www.antlr.org, which tells you how to access the hidden channel but not how best to use this functionality. The target language is Java.
In my grammar file, I pass whitespace and comments through like so:
// Send runs of space and tab characters to the hidden channel.
WHITESPACE
: (SPACE | TAB)+ { $channel = HIDDEN; }
;
// Single-line comments begin with --
SINGLE_COMMENT
: ('--' COMMENT_CHARS NEWLINE) {
$channel=HIDDEN;
}
;
fragment COMMENT_CHARS
: ~('\r' | '\n')*
;
// Treat runs of newline characters as a single NEWLINE token.
NEWLINE
: ('\r'? '\n')+ { $channel = HIDDEN; }
;
In my members section I have defined a method for writing hidden channel tokens to my output StringStream...
@members {
private int savedIndex = 0;
void ProcessHiddenChannel(TokenStream input) {
List<Token> tokens = ((CommonTokenStream)input).getTokens(savedIndex, input.index());
for(Token token: tokens) {
if(token.getChannel() == Token.HIDDEN_CHANNEL) {
output.append(token.getText());
}
}
savedIndex = input.index();
}
}
Now to use this, I have to call the method after every single token in my grammar.
myParserRule
: MYTOKEN1 { ProcessHiddenChannel(input); }
MYTOKEN2 { ProcessHiddenChannel(input); }
;
Surely there must be a better way?
EDIT: This is an example of the input language:
-- -----------------------------------------------------------------
--
--
-- Name Description
-- ==================================
-- IFM1/183 Freq Spectrum Inversion
--
-- -----------------------------------------------------------------
PROCEDURE IFM1/183
TITLE "Freq Spectrum Inversion";
HELP
Freq Spectrum Inversion
ENDHELP;
PRIVILEGE CTRL;
WINDOW MANDATORY;
INPUT
$Input : #NO_YES
DEFAULT select %YES when /IFMS1/183.VALUE = %NO;
%NO otherwise
endselect
PROMPT "Spec Inv";
$Forced_Cmd : BOOLEAN
Default FALSE
Prompt "Forced Commanding";
DEFINE
&RetCode : #PSTATUS := %OK;
&msg : STRING;
&Input : BOOLEAN;
REQUIRE AVAILABLE(/IFMS1)
MSG "IFMS1 not available";
REQUIRE /IFMS1/001.VALUE = %MON_AND_CTRL
MSG "IFMS1 not in control mode";
BEGIN -- Procedure Body --
&msg := "IFMS1/183 -> " + toString($Input) + " : ";
-- pre-check
IF /IFMS1/183.VALUE = $Input
AND $Forced_Cmd = FALSE THEN
EXIT (%OK, MSG &msg + "already set");
ENDIF;
-- command
IF $Input = %YES THEN &Input:= TRUE;
ELSE &Input:= FALSE;
ENDIF;
SET &RetCode := SEND IFMS1.FREQPLAN
( $FreqSpecInv := &Input);
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "command failed");
ENDIF;
-- verify
SET &RetCode := VERIFY /IFMS1/183.VALUE = $Input TIMEOUT '10';
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "verification failed");
ELSE
EXIT (&RetCode, MSG &msg + "verified");
ENDIF;
END
Look into inheriting CommonTokenStream and feeding an instance of your subclass into ANTLR. From the code example that you give, I suspect that you might be interested in taking a look at the filter and the rewrite options available in version 3.
Also, take a look at this other related stack overflow question.
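The idea behind a stream subclass is to flush hidden tokens in one place, whenever the parser asks for the next visible token, instead of calling a helper after every token in the grammar. The following is a plain-Java sketch of that idea only. It does not use the ANTLR runtime: HiddenAwareStream, Tok, nextVisible and the channel constants are all hypothetical stand-ins, and lower-casing stands in for the real translation.

```java
import java.util.Arrays;
import java.util.List;

final class HiddenAwareStream {
    // Arbitrary channel numbers for this sketch, not taken from the ANTLR runtime.
    static final int DEFAULT = 0;
    static final int HIDDEN = 1;

    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    private final List<Tok> tokens;
    private final StringBuilder output = new StringBuilder();
    private int index = 0;

    HiddenAwareStream(List<Tok> tokens) { this.tokens = tokens; }

    // Returns the next visible token, or null at the end of input.
    // Hidden tokens skipped on the way are copied to the output unchanged.
    Tok nextVisible() {
        while (index < tokens.size()) {
            Tok t = tokens.get(index++);
            if (t.channel == HIDDEN) {
                output.append(t.text);
            } else {
                return t;
            }
        }
        return null;
    }

    static String translate(List<Tok> tokens) {
        HiddenAwareStream s = new HiddenAwareStream(tokens);
        for (Tok t = s.nextVisible(); t != null; t = s.nextVisible()) {
            s.output.append(t.text.toLowerCase()); // stand-in for the real translation
        }
        return s.output.toString();
    }

    public static void main(String[] args) {
        List<Tok> toks = Arrays.asList(
            new Tok("-- pre-check\n", HIDDEN),
            new Tok("BEGIN", DEFAULT),
            new Tok(" ", HIDDEN),
            new Tok("END", DEFAULT),
            new Tok("\n", HIDDEN));
        System.out.print(translate(toks));
    }
}
```

Comments and whitespace keep their original position relative to the translated tokens, and the grammar rules no longer need a call after each token.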
I have just been going through some of my old questions and thought it was worth responding with the final solution that worked the best. In the end, the best way to translate a language was to use StringTemplate. This takes care of re-indenting the output for you. There is a very good example called 'cminus' in the ANTLR example pack that shows how to use it.