I am implementing a simple preprocessor using the great ANTLR4 library. The program runs in several iterations, and each iteration modifies the future output slightly.
Currently I use TokenStreamRewriter and its methods delete, insertAfter, replace and getText.
Unfortunately I can't rewrite tokens that were already rewritten before (I get an IllegalArgumentException). This is not a bug: according to the source code, multiple replacements of the same tokens can't be achieved in any way.
I suppose that a proper solution exists, as this appears to be a common problem. Could anyone give me a hint? I'd rather use an existing, tested solution than reimplement the rewriter myself.
Maybe the rewriter isn't the right tool for this.
Thanks for your help.
Good evening,
Here is a more dynamic version of the code for the same problem. First, the token stream and the rewriter must be visible in your listener class.
Here is the constructor of my VB6MYListener class:
class VB6MYListener : public VB6ParserListener {
public:
    string FicName;                 // name of the current file, for tracing
    int numero;                     // rule number
    wstring BaseFile;               // VB6 input file
    CommonTokenStream* TOK;         // token stream
    TokenStreamRewriter* Rewriter;

    // The translation functions are the void listener functions
    // created by ANTLR4 (one per context).
    VB6MYListener(CommonTokenStream* tok, wstring baseFile, TokenStreamRewriter* rewriter, string Name)
    {
        TOK = tok;
        BaseFile = baseFile;
        Rewriter = rewriter;
        FicName = Name;
    }
    // ... listener functions ...
};
Here is a context I visit with the listener. The token stream TOK is visible to all the void listener functions:
std::string retourchaine;
std::vector<std::string> TAB{};
for (int i = ctx->start->getTokenIndex(); i <= ctx->stop->getTokenIndex(); i++)
{
    TAB.push_back(TOK->get(i)->getText()); // TOK is used here
}
for (auto element : TAB)
{
    if (element == "=") { element = ":="; }
    if (element != "As" && element != "Private" && element != "Public")
    {
        std::cout << element << std::endl;
        retourchaine += element; // text returned to the context
    }
}
retourchaine = retourchaine + " ;";
Rewriter->replace(ctx->start, ctx->stop, retourchaine);
This is a workaround I am using because I need to make replacements inside a context, and the TokenStreamRewriter does not do the job correctly when there are multiple replacements in one context.
In each context I can access the token stream, so I copy all the tokens of the context into an array, build a string containing the replacements, and then call Rewriter->replace(ctx->start, ctx->stop, tokentext);
Here is some code for one context:
vector<string> TAB; // grows as needed; a fixed array indexed by token index could overflow
string tokentext = "";
for (int i = ctx->start->getTokenIndex(); i <= ctx->stop->getTokenIndex(); i++)
{
    TAB.push_back(TOK->get(i)->getText());
    string& text = TAB.back();
    if (text == "=") { text = ":="; }
    // other replacements can be added here
    if (text != "As" && text != "Private" && text != "Public") { tokentext += text; }
    cout << "token index in the context" << endl;
    cout << i << endl;
}
tokentext = tokentext + " ;";
cout << tokentext << endl;
Rewriter->replace(ctx->start, ctx->stop, tokentext);
It's basic code, but it makes the job robust. I hope this will be useful.
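The transformation itself can be separated from ANTLR so that it can be tested on its own. The following is only a minimal sketch under that assumption: buildReplacement is a hypothetical helper name, not part of the ANTLR runtime, and it takes the token texts of one context as plain strings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not an ANTLR API): builds the replacement text for one
// context from the texts of its tokens, in stream order.
std::string buildReplacement(const std::vector<std::string>& tokens)
{
    std::string result;
    for (const std::string& original : tokens)
    {
        // Rewrite the assignment operator.
        const std::string text = (original == "=") ? ":=" : original;
        // Drop the keywords that must not appear in the output.
        if (text != "As" && text != "Private" && text != "Public")
        {
            result += text;
        }
    }
    return result + " ;";
}
```

In the listener, the result would then be passed to Rewriter->replace(ctx->start, ctx->stop, ...) exactly as in the code above.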
I think that rewriting the token stream is not a good idea, because you can't treat the general case of a tree. ANTLR's TokenStreamRewriter is of little use for this. If you use a listener, you can't change the AST and the contexts created by ANTLR. You must use a BufferedWriter to write the contexts you change locally into the final file of your translation.
Thanks to Ewa Hechsman and her program on GitHub, a transpiler from Pascal to Python.
I think it's a real solution for a professional project.
So I agree with Ira Baxter: we need a tree rewriter.
Related
I'm currently doing a traineeship and have two programs to work on. The more important one was requested yesterday, and I'm stuck on the failure of its main feature: saving passwords.
The application is developed in C++/CLR using Visual Studio 2013 (I couldn't install the MFC libraries; the installation kept failing and crashing even after multiple reboots) and aims to generate a password from a seed provided by the user. The generated password is saved to a .txt file. If the seed has already been used, the previously generated password is shown instead.
Unfortunately I can't save the password and seed to the file, although I can write the seed if I don't reach the end of the document. I went for "if the line is empty, write this to the document", but it doesn't work and I can't find out why. Reading the passwords, however, works without any problem.
Here's the interesting part of the source:
int seed;
char genRandom() {
static const char letters[] =
"0123456789"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz";
int stringLength = sizeof(letters) - 1;
return letters[rand() % stringLength];
}
System::Void OK_Click(System::Object^ sender, System::EventArgs^ e) {
fstream passwords;
if (!(passwords.is_open())) {
passwords.open("passwords.txt", ios::in | ios::out);
}
string gen = msclr::interop::marshal_as<std::string>(GENERATOR->Text), line, genf = gen;
bool empty_line_found = false;
while (empty_line_found == false) {
getline(passwords, line);
if (gen == line) {
getline(passwords, line);
PASSWORD->Text = msclr::interop::marshal_as<System::String^>(line);
break;
}
if (line.empty()) {
for (unsigned int i = 0; i < gen.length(); i++) {
seed += gen[i];
}
srand(seed);
string pass;
for (int i = 0; i < 10; ++i) {
pass += genRandom();
}
passwords << pass << endl << gen << "";
PASSWORD->Text = msclr::interop::marshal_as<System::String^>(pass);
empty_line_found = true;
}
}
}
I've also tried replacing ios::in with ios::app, and that doesn't work either. And yes, I have included fstream, iostream, etc.
Thanks in advance!
[EDIT]
Just solved this problem. Thanks to Rook for putting me on the right track. It feels like a silly way to do it, but I now close the file and reopen it with ios::app to write at the end of it. I also fixed a stupid mistake: I was writing the password before the seed and not ending with a final newline, which the main loop needs to keep working. Here's the code in case someone ends up with the same problem:
int seed;
char genRandom() {
static const char letters[] =
"0123456789"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz";
int stringLength = sizeof(letters) - 1;
return letters[rand() % stringLength];
}
System::Void OK_Click(System::Object^ sender, System::EventArgs^ e) {
fstream passwords;
if (!(passwords.is_open())) {
passwords.open("passwords.txt", ios::in | ios::out);
}
string gen = msclr::interop::marshal_as<std::string>(GENERATOR->Text), line, genf = gen;
bool empty_line_found = false;
while (empty_line_found == false) {
getline(passwords, line);
if (gen == line) {
getline(passwords, line);
PASSWORD->Text = msclr::interop::marshal_as<System::String^>(line);
break;
}
if (line.empty()) {
passwords.close();
passwords.open("passwords.txt", ios::app);
for (unsigned int i = 0; i < gen.length(); i++) {
seed += gen[i];
}
srand(seed);
string pass;
for (int i = 0; i < 10; ++i) {
pass += genRandom();
}
passwords << gen << endl << pass << endl << "";
PASSWORD->Text = msclr::interop::marshal_as<System::String^>(pass);
empty_line_found = true;
}
}
passwords.close();
}
So, here's an interesting thing:
passwords << pass << endl << gen << "";
You're not ending that with a newline. This means the very end of your file could be missing a newline too. This has an interesting effect when you do this on the final line:
getline(passwords, line);
getline will read until it sees a line ending, or an EOF. If there's no newline, it'll hit that EOF and then set the EOF bit on the stream. That means the next time you try to do this:
passwords << pass << endl << gen << "";
the stream will refuse to write anything, because it is in an eof state. There are various things you can do here, but the simplest would be to do passwords.clear() to remove any error flags like eof. I'd be very cautious about accidentally clearing genuine error flags though; read the docs for fstream carefully.
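As a rough illustration of the clear() approach, here is a minimal sketch in plain standard C++ (no CLR). The helper name appendAfterReading is made up for this example. It only clears the flags when the read loop really stopped at end-of-file, so that a genuine error is not hidden.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reads every line of `path`, then appends `extra`
// through the same stream. Returns true when the write succeeded.
bool appendAfterReading(const std::string& path, const std::string& extra)
{
    std::fstream file(path, std::ios::in | std::ios::out);
    if (!file.is_open()) return false;

    std::string line;
    while (std::getline(file, line)) { /* look at each line here */ }

    // The loop ends when getline fails. At end-of-file, eofbit is set
    // (together with failbit), and any write attempted now is ignored.
    if (!file.eof()) return false;   // stopped for another reason: keep the error

    file.clear();                    // reset eofbit and failbit
    file.seekp(0, std::ios::end);    // reposition before switching to writing
    file << extra << '\n';           // end with a newline for the next run
    file.flush();
    return file.good();
}
```

If the existing file does not end with a newline, the appended text continues the last line, which is one more reason to always finish each write with a newline.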
I also reiterate my comment about C++/CLR being a glue language, and not a great language for general-purpose development, which would be best done using C++ or a .NET language such as C#. If you're absolutely wedded to C++/CLR for some reason, you may as well make use of the extensive .NET library so you don't have to pointlessly marshal managed types back and forth. See System::IO::FileStream for example.
I'm very new to ANTLR4 and am trying to build my own language. So my grammar starts at
program: <EOF> | statement | functionDef | statement program | functionDef program;
and my statement is
statement: selectionStatement | compoundStatement | ...;
and
selectionStatement
: If LeftParen expression RightParen compoundStatement (Else compoundStatement)?
| Switch LeftParen expression RightParen compoundStatement
;
compoundStatement
: LeftBrace statement* RightBrace;
Now the problem is that when I test a piece of code against selectionStatement or statement it passes, but when I test it against program it is not recognized. Can anyone help me with this? Thank you very much.
edit: the code I use to test is the following:
if (x == 2) {}
It passes the test against selectionStatement and statement but fails at program. It appears that program only accepts if...else
if (x == 2) {} else {}
Edit 2:
The error message I received was
<unknown>: Incorrect error: no viable alternative at input 'if(x==2){}'
Cannot answer your question given the incomplete information provided: the statement rule is partial and the compoundStatement rule is missing.
Nonetheless, there are two techniques you should be using to answer this kind of question yourself (in addition to unit tests).
First, ensure that the lexer is working as expected. This answer shows how to dump the token stream directly.
Second, use a custom ErrorListener to provide a meaningful/detailed description of its parse path to every encountered error. An example:
public class JavaErrorListener extends BaseErrorListener {
public int lastError = -1;
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
String msg, RecognitionException e) {
Parser parser = (Parser) recognizer;
String name = parser.getSourceName();
TokenStream tokens = parser.getInputStream();
Token offSymbol = (Token) offendingSymbol;
int thisError = offSymbol.getTokenIndex();
if (offSymbol.getType() == -1 && thisError == tokens.size() - 1) {
Log.debug(this, name + ": Incorrect error: " + msg);
return;
}
String offSymName = JavaLexer.VOCABULARY.getSymbolicName(offSymbol.getType());
List<String> stack = parser.getRuleInvocationStack();
// Collections.reverse(stack);
Log.error(this, name);
Log.error(this, "Rule stack: " + stack);
Log.error(this, "At line " + line + ":" + charPositionInLine + " at " + offSymName + ": " + msg);
if (thisError > lastError + 10) {
lastError = thisError - 10;
}
for (int idx = lastError + 1; idx <= thisError; idx++) {
Token token = tokens.get(idx);
if (token.getChannel() != Token.HIDDEN_CHANNEL) Log.error(this, token.toString());
}
lastError = thisError;
}
}
Note: adjust the Log statements to whatever logging package you are using.
Finally, Antlr doesn't do 'weird' things - just things that you don't understand.
I am working with SAPI 5.4. Here is one of my grammar rules:
<RULE ID="FIRST_TRANSMISSION" TOPLEVEL="ACTIVE">
<P><RULEREF REFID="BATTERY"/></P>
<P><RULEREF REFID="FO"/></P>
<P><RULEREF REFID="MISSION"/></P>
</RULE>
I use C++ code to get the recognized words; here is the relevant piece of my code. My rule ID is 256:
case 256:
{
if (SUCCEEDED (hr))
{
hr = pISpRecoResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &pwszText, NULL);
}
char ch[260];
char DefChar = ' ';
WideCharToMultiByte(CP_ACP,0,pwszText,-1, ch,260,&DefChar, NULL);
string ss(ch);
str.append(ss);
break;
}
Now I want to get the recognized words according to the sub-rules.
For example, I want to get the words recognized for the <P><RULEREF REFID="FO"/></P> phrase in the grammar file. How can I do this?
You need to use ISpRecoResult::GetPhrase to retrieve the SPPHRASE associated with the recognition. Then you can use the Rule field of the SPPHRASE to traverse the rules associated with the recognition until you find the one with the id 'FO'. Once you've found the appropriate SPPHRASERULE, the rule has the word indexes associated with it, and you can call ISpRecoResult::GetText as before.
The code would look something like this: (Note - I haven't actually compiled this, so there are likely going to be errors.)
SPPHRASE* pPhrase;
hr = pRecoResult->GetPhrase(&pPhrase);
if (SUCCEEDED(hr))
{
ULONG ulFirstElement = 0;
ULONG ulCountOfElements = 0;
hr = FindRuleMatching(&pPhrase->Rule, L"FO", &ulFirstElement, &ulCountOfElements);
if (SUCCEEDED(hr))
{
LPWSTR pwszText;
hr = pRecoResult->GetText(ulFirstElement, ulCountOfElements, TRUE, &pwszText, NULL);
if (SUCCEEDED(hr))
{
// do stuff
::CoTaskMemFree(pwszText);
}
}
::CoTaskMemFree(pPhrase);
}
HRESULT
FindRuleMatching(const SPPHRASERULE* pRule, LPCWSTR szRuleName, ULONG* pulFirst, ULONG* pulCount)
{
if (pRule == NULL)
{
return E_FAIL;
}
// depth-first search.
if (wcscmp(pRule->pszName, szRuleName) == 0)
{
*pulFirst = pRule->ulFirstElement;
*pulCount = pRule->ulCountOfElements;
return S_OK;
}
else if (SUCCEEDED(FindRuleMatching(pRule->pFirstChild, szRuleName, pulFirst, pulCount)))
{
return S_OK;
}
else return FindRuleMatching(pRule->pNextSibling, szRuleName, pulFirst, pulCount);
}
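The traversal can be tried without SAPI on a small stand-in structure. This is only a sketch: RuleNode and findRule are hypothetical names, and RuleNode keeps just the SPPHRASERULE fields that the search above uses. Unlike the code above, it also skips nodes whose name pointer is null instead of passing them to wcscmp.

```cpp
#include <cwchar>

// Hypothetical stand-in for SPPHRASERULE, reduced to the fields used here.
struct RuleNode
{
    const wchar_t* pszName;
    unsigned long ulFirstElement;
    unsigned long ulCountOfElements;
    const RuleNode* pNextSibling;
    const RuleNode* pFirstChild;
};

// Depth-first search over a first-child / next-sibling tree.
// Returns the first node named `name`, or nullptr when there is none.
const RuleNode* findRule(const RuleNode* rule, const wchar_t* name)
{
    if (rule == nullptr) return nullptr;
    if (rule->pszName != nullptr && std::wcscmp(rule->pszName, name) == 0) return rule;
    // Search the children first, then continue with the siblings.
    if (const RuleNode* hit = findRule(rule->pFirstChild, name)) return hit;
    return findRule(rule->pNextSibling, name);
}
```

Returning the node rather than an HRESULT is a choice made for this sketch; the element range is then read from ulFirstElement and ulCountOfElements of the returned node.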
I asked a question a couple of weeks ago about my ANTLR grammar (My simple ANTLR grammar is not working as expected). Since asking that question, I've done more digging and debugging and gotten most of the kinks out. I am left with one issue, though.
My generated parser code is not picking up invalid tokens in one particular part of the text that is processed. The lexer properly breaks things into tokens, but the parser does not reject invalid tokens in some cases. In particular, when the invalid token is at the end of a phrase like "A and B", the parser ignores it, as if the token weren't even there.
Some specific examples:
"A and B" - perfectly valid
"A# and B" - parser properly picks up the invalid # token
"A and #B" - parser properly picks up the invalid # token
"A and B#" - here's the mystery - the lexer finds the # token and the parser IGNORES it (!)
"(A and B#) or C" - further mystery - the lexer finds the # token and the parser IGNORES it (!)
Here is my grammar:
grammar QvidianPlaybooks;
options{ language=CSharp3; output=AST; ASTLabelType = CommonTree; }
public parse
: expression
;
LPAREN : '(' ;
RPAREN : ')' ;
ANDOR : 'AND'|'and'|'OR'|'or';
NAME : ('A'..'Z');
WS : ' ' { $channel = Hidden; };
THEREST : .;
// ***************** parser rules:
expression : anexpression EOF!;
anexpression : atom (ANDOR^ atom)*;
atom : NAME | LPAREN! anexpression RPAREN!;
The code that then processes the resulting tree looks like this:
... from the main program
QvidianPlaybooksLexer lexer = new QvidianPlaybooksLexer(new ANTLRStringStream(src));
QvidianPlaybooksParser parser = new QvidianPlaybooksParser(new CommonTokenStream(lexer));
parser.TreeAdaptor = new CommonTreeAdaptor();
CommonTree tree = (CommonTree)parser.parse().Tree;
ValidateTree(tree, 0, iValidIdentifierCount);
// recursive code that walks the tree
public static RuleLogicValidationResult ValidateTree(ITree Tree, int depth, int conditionCount)
{
RuleLogicValidationResult rlvr = null;
if (Tree != null)
{
CommonErrorNode commonErrorNode = Tree as CommonErrorNode;
if (null != commonErrorNode)
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
Console.WriteLine(rlvr.ToString());
}
else
{
string strTree = Tree.ToString();
strTree = strTree.Trim();
strTree = strTree.ToUpper();
if ((Tree.ChildCount != 0) && (Tree.ChildCount != 2))
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
rlvr.InvalidIdentifier = strTree;
rlvr.ErrorPosition = 0;
Console.WriteLine(String.Format("CHILD COUNT of {0} = {1}", strTree, Tree.ChildCount));
}
// if the current node is valid, then validate the two child nodes
if (null == rlvr || rlvr.IsValid)
{
// output the tree node
for (int i = 0; i < depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
rlvr = ValidateTree(Tree.GetChild(0), depth + 1, conditionCount);
if (rlvr.IsValid)
{
rlvr = ValidateTree(Tree.GetChild(1), depth + 1, conditionCount);
}
}
else
{
Console.WriteLine(rlvr.ToString());
}
}
}
else
{
// this tree is null, return a "it's valid" result
rlvr = new RuleLogicValidationResult();
rlvr.ErrorType = LogicValidationErrorType.None;
rlvr.IsValid = true;
}
return rlvr;
}
Add EOF to the end of your start rule. :)
I'm trying to implement something like a Code Contracts feature for JavaScript as an assignment for one of my courses.
The problem I'm having is that I can't seem to find a way to output the source file directly to the console without modifying the entire grammar.
Does anybody know a way to achieve this?
Thanks in advance.
Here's an example of what I'm trying to do:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
Contract.Requires(num < 1000);
Contract.Requires<TypeError>(arr instanceOf Array);
Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9);
Contract.Requires<ReferenceError>(text != null);
Contract.Ensures<RangeError>(text.length === 0);
// method body
[...]
return text;
}
which should be turned into:
function DoClear(num, arr, text){
if (!(num > 0))
throw RangeError;
if (!(num < 1000))
throw Error;
if (!(arr instanceOf Array))
throw TypeError;
if (!(arr.length > 0 && arr.length <= 9))
throw RangeError;
if (!(text != null))
throw ReferenceError
// method body
[...]
if (!(text.length === 0))
throw RangeError
else
return text;
}
There are a few (minor) things you'll want to consider:
ignore string literals that might contain your special contract-syntax;
ignore multi- and single line comments that might contain your special Contract syntax;
ignore code like this: var Requires = "Contract.Requires<RangeError>"; (i.e. regular JavaScript code that "looks like" your contract-syntax);
It's pretty straightforward to take the points above into account and to simply create a single token for an entire contract line. You'll make your life hard if you tokenize Contract.Requires<RangeError>(num > 0) into 4 different tokens:
Contract
Requires
<RangeError>
(num > 0)
So it's easiest to create a single token from it, and at the parsing phase, split the token on ".", "<" or ">" with a maximum of 4 tokens (leaving expressions containing ".", "<" or ">" as they are).
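To make the splitting step concrete outside of ANTLR, here is a small sketch in C++. Both function names are invented for this illustration. It assumes the token already carries a type, such as <RangeError>; an untyped contract would first need a default type injected, and is otherwise returned unchanged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` at any character of `delims` into at most `limit` parts.
// The last part keeps the rest of the text, later delimiters included.
std::vector<std::string> splitLimited(const std::string& text,
                                      const std::string& delims,
                                      std::size_t limit)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (parts.size() + 1 < limit)
    {
        const std::size_t pos = text.find_first_of(delims, start);
        if (pos == std::string::npos) break;
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    parts.push_back(text.substr(start));
    return parts;
}

// Turns one typed contract token into the check it stands for.
std::string rewriteContract(const std::string& token)
{
    const std::vector<std::string> parts = splitLimited(token, ".<>", 4);
    if (parts.size() != 4) return token;   // unexpected shape: leave it alone
    return "if (!" + parts[3] + ") throw " + parts[2];
}
```

Because of the limit, the "." and "<" characters inside the condition (as in arr.length <= 9) stay in the fourth part untouched.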
A quick demo of what I described above might look like this:
grammar CCJS;
parse
: atom+ EOF
;
atom
: code_contract
| (Comment | String | Any) {System.out.print($text);}
;
code_contract
: Contract
{
String[] tokens = $text.split("[.<>]", 4);
System.out.print("if (!" + tokens[3] + ") throw " + tokens[2]);
}
;
Contract
@init{
boolean hasType = false;
}
@after{
if(!hasType) {
// inject a generic Error if this contract has no type
setText(getText().replaceFirst("\\(", "<Error>("));
}
}
: 'Contract.' ('Requires' | 'Ensures') ('<' ('a'..'z' | 'A'..'Z')+ '>' {hasType=true;})? '(' ~';'+
;
Comment
: '//' ~('\r' | '\n')*
| '/*' .* '*/'
;
String
: '"' (~('\\' | '"' | '\r' | '\n') | '\\' . )* '"'
;
Any
: .
;
which you can test with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src =
"/* \n" +
" Contract.Requires to be ignored \n" +
"*/ \n" +
"function DoClear(num, arr, text){ \n" +
" Contract.Requires<RangeError>(num > 0); \n" +
" Contract.Requires(num < 1000); \n" +
" Contract.Requires<TypeError>(arr instanceOf Array); \n" +
" Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9); \n" +
" Contract.Requires<ReferenceError>(text != null); \n" +
" Contract.Ensures<RangeError>(text.length === 0); \n" +
" \n" +
" // method body \n" +
" // and ignore single line comments, Contract.Ensures \n" +
" var s = \"Contract.Requires\"; // also ignore strings \n" +
" \n" +
" return text; \n" +
"} \n";
CCJSLexer lexer = new CCJSLexer(new ANTLRStringStream(src));
CCJSParser parser = new CCJSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If you run the Main class above, the following will be printed to the console:
/*
Contract.Requires to be ignored
*/
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
if (!(num < 1000)) throw Error;
if (!(arr instanceOf Array)) throw TypeError;
if (!(arr.length > 0 && arr.length <= 9)) throw RangeError;
if (!(text != null)) throw ReferenceError;
if (!(text.length === 0)) throw RangeError;
// method body
// and ignore single line comments, Contract.Ensures
var s = "Contract.Requires"; // also ignore strings
return text;
}
BUT ...
... I realize that it isn't exactly what you're looking for: the RangeError check is not placed at the end of your function. And that's going to be a tough one: a function might have multiple returns, and is likely to have multiple code blocks { ... }, making it difficult to know which } ends the function. So you don't know where exactly to inject this RangeError check. At least, not with a naive approach like the one I demonstrated.
The only reliable way to implement such a thing is to get a decent JavaScript grammar, add your own contract-rules to it, rewrite the AST the parser produces, and finally emit the new AST in a friendly-formatted way: not a trivial task, to say the least!
There are various ECMA/JS grammars on the ANTLR Wiki, but tread with care: they are user-committed grammars and may contain errors (probably will in this case[1]!).
If you choose to place the RangeError contract at the spot where the rewritten check should end up, like so:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
...
// method body
...
Contract.Ensures<RangeError>(text.length === 0);
return text;
}
which would result in:
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
...
// method body
...
if (!(text.length === 0))
throw RangeError
return text;
}
then you need not parse the entire method body, and you might get away with a hack as I proposed.
Best of luck!
[1] the last time I checked these ECMA/JS script grammars, none of them handled regex literals, /pattern/, properly, making them in my opinion suspect.