My simple ANTLR grammar ignores certain invalid tokens when parsing


I asked a question a couple of weeks ago about my ANTLR grammar (My simple ANTLR grammar is not working as expected). Since asking that question, I've done more digging and debugging and gotten most of the kinks out. I am left with one issue, though.
My generated parser code is not picking up invalid tokens in one particular part of the text that is processed. The lexer is properly breaking things into tokens, but the parser does not kick out invalid tokens in some cases. In particular, when the invalid token is at the end of a phrase like "A and B", the parser ignores it - it's as if the token isn't even there.
Some specific examples:
"A and B" - perfectly valid
"A# and B" - parser properly picks up the invalid # token
"A and #B" - parser properly picks up the invalid # token
"A and B#" - here's the mystery - the lexer finds the # token and the parser IGNORES it (!)
"(A and B#) or C" - further mystery - the lexer finds the # token and the parser IGNORES it (!)
Here is my grammar:
grammar QvidianPlaybooks;
options{ language=CSharp3; output=AST; ASTLabelType = CommonTree; }
public parse
: expression
;
LPAREN : '(' ;
RPAREN : ')' ;
ANDOR : 'AND'|'and'|'OR'|'or';
NAME : ('A'..'Z');
WS : ' ' { $channel = Hidden; };
THEREST : .;
// ***************** parser rules:
expression : anexpression EOF!;
anexpression : atom (ANDOR^ atom)*;
atom : NAME | LPAREN! anexpression RPAREN!;
The code that then processes the resulting tree looks like this:
... from the main program
QvidianPlaybooksLexer lexer = new QvidianPlaybooksLexer(new ANTLRStringStream(src));
QvidianPlaybooksParser parser = new QvidianPlaybooksParser(new CommonTokenStream(lexer));
parser.TreeAdaptor = new CommonTreeAdaptor();
CommonTree tree = (CommonTree)parser.parse().Tree;
ValidateTree(tree, 0, iValidIdentifierCount);
// recursive code that walks the tree
public static RuleLogicValidationResult ValidateTree(ITree Tree, int depth, int conditionCount)
{
RuleLogicValidationResult rlvr = null;
if (Tree != null)
{
CommonErrorNode commonErrorNode = Tree as CommonErrorNode;
if (null != commonErrorNode)
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
Console.WriteLine(rlvr.ToString());
}
else
{
string strTree = Tree.ToString();
strTree = strTree.Trim();
strTree = strTree.ToUpper();
if ((Tree.ChildCount != 0) && (Tree.ChildCount != 2))
{
rlvr = new RuleLogicValidationResult();
rlvr.IsValid = false;
rlvr.ErrorType = LogicValidationErrorType.Other;
rlvr.InvalidIdentifier = strTree;
rlvr.ErrorPosition = 0;
Console.WriteLine(String.Format("CHILD COUNT of {0} = {1}", strTree, Tree.ChildCount));
}
// if the current node is valid, then validate the two child nodes
if (null == rlvr || rlvr.IsValid)
{
// output the tree node
for (int i = 0; i < depth; i++)
{
Console.Write(" ");
}
Console.WriteLine(Tree);
rlvr = ValidateTree(Tree.GetChild(0), depth + 1, conditionCount);
if (rlvr.IsValid)
{
rlvr = ValidateTree(Tree.GetChild(1), depth + 1, conditionCount);
}
}
else
{
Console.WriteLine(rlvr.ToString());
}
}
}
else
{
// this tree is null, return a "it's valid" result
rlvr = new RuleLogicValidationResult();
rlvr.ErrorType = LogicValidationErrorType.None;
rlvr.IsValid = true;
}
return rlvr;
}

Add EOF to the end of your start rule. :)
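To see why the missing EOF matters, here is a minimal sketch in plain Java. It is a hand-written recursive-descent stand-in for the anexpression and atom rules, not the code ANTLR generates, and it has none of ANTLR's error recovery. The class name EofDemo and the requireEof switch are made up for illustration. Without the EOF check the parser stops as soon as the (ANDOR atom)* loop can no longer continue and never looks at what follows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for: anexpression : atom (ANDOR atom)* ; atom : NAME | '(' anexpression ')' ; */
final class EofDemo {
    private final List<String> tokens;
    private int pos = 0;

    EofDemo(List<String> tokens) { this.tokens = tokens; }

    /** Turns the input into a list of token type names, always ending with "EOF". */
    static List<String> lex(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ') { i++; }  // WS is hidden from the parser
            else if (src.startsWith("AND", i) || src.startsWith("and", i)) { out.add("ANDOR"); i += 3; }
            else if (src.startsWith("OR", i) || src.startsWith("or", i)) { out.add("ANDOR"); i += 2; }
            else if (c >= 'A' && c <= 'Z') { out.add("NAME"); i++; }
            else if (c == '(') { out.add("LPAREN"); i++; }
            else if (c == ')') { out.add("RPAREN"); i++; }
            else { out.add("THEREST"); i++; }  // catch-all, like THEREST : . ;
        }
        out.add("EOF");
        return out;
    }

    private boolean peek(String type) { return tokens.get(pos).equals(type); }

    private boolean atom() {
        if (peek("NAME")) { pos++; return true; }
        if (peek("LPAREN")) {
            pos++;
            if (!anexpression() || !peek("RPAREN")) return false;
            pos++;
            return true;
        }
        return false;
    }

    private boolean anexpression() {
        if (!atom()) return false;
        while (peek("ANDOR")) {
            pos++;
            if (!atom()) return false;
        }
        return true;  // returns quietly at the first token that cannot extend the loop
    }

    /** With requireEof=false the start rule behaves like "expression : anexpression ;". */
    static boolean accepts(String src, boolean requireEof) {
        EofDemo p = new EofDemo(lex(src));
        return p.anexpression() && (!requireEof || p.peek("EOF"));
    }

    public static void main(String[] args) {
        System.out.println(accepts("A and B", true));    // true
        System.out.println(accepts("A and B#", false));  // true: the trailing # is never examined
        System.out.println(accepts("A and B#", true));   // false: # is found where EOF is required
    }
}
```

The only difference between the last two calls is whether the start rule insists on reaching EOF, which is the role EOF plays at the end of a start rule.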

Related

ANTLR4: Unexpected behavior that I can't understand

I'm very new to ANTLR4 and am trying to build my own language. So my grammar starts at
program: <EOF> | statement | functionDef | statement program | functionDef program;
and my statement is
statement: selectionStatement | compoundStatement | ...;
and
selectionStatement
: If LeftParen expression RightParen compoundStatement (Else compoundStatement)?
| Switch LeftParen expression RightParen compoundStatement
;
compoundStatement
: LeftBrace statement* RightBrace;
Now the problem is, that when I test a piece of code against selectionStatement or statement it passes the test, but when I test it against program it fails to recognize. Can anyone help me on this? Thank you very much
edit: the code I use to test is the following:
if (x == 2) {}
It passes the test against selectionStatement and statement but fails at program. It appears that program only accepts if...else
if (x == 2) {} else {}
Edit 2:
The error message I received was
<unknown>: Incorrect error: no viable alternative at input 'if(x==2){}'

Cannot answer your question given the incomplete information provided: the statement rule is partial and the compoundStatement rule is missing.
Nonetheless, there are two techniques you should be using to answer this kind of question yourself (in addition to unit tests).
First, ensure that the lexer is working as expected. This answer shows how to dump the token stream directly.
Second, use a custom ErrorListener that reports a meaningful, detailed description of the parse path for every error encountered. An example:
public class JavaErrorListener extends BaseErrorListener {
public int lastError = -1;
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine,
String msg, RecognitionException e) {
Parser parser = (Parser) recognizer;
String name = parser.getSourceName();
TokenStream tokens = parser.getInputStream();
Token offSymbol = (Token) offendingSymbol;
int thisError = offSymbol.getTokenIndex();
if (offSymbol.getType() == -1 && thisError == tokens.size() - 1) {
Log.debug(this, name + ": Incorrect error: " + msg);
return;
}
String offSymName = JavaLexer.VOCABULARY.getSymbolicName(offSymbol.getType());
List<String> stack = parser.getRuleInvocationStack();
// Collections.reverse(stack);
Log.error(this, name);
Log.error(this, "Rule stack: " + stack);
Log.error(this, "At line " + line + ":" + charPositionInLine + " at " + offSymName + ": " + msg);
if (thisError > lastError + 10) {
lastError = thisError - 10;
}
for (int idx = lastError + 1; idx <= thisError; idx++) {
Token token = tokens.get(idx);
if (token.getChannel() != Token.HIDDEN_CHANNEL) Log.error(this, token.toString());
}
lastError = thisError;
}
}
Note: adjust the Log statements to whatever logging package you are using.
Finally, ANTLR doesn't do 'weird' things - just things that you don't understand.

PEGJS predicate grammar

I need to create a grammar with the help of predicate. The below grammar fails for the given case.
startRule = a:namespace DOT b:id OPEN_BRACE CLOSE_BRACE {return {"namespace": a, "name": b}}
namespace = id (DOT id)*
DOT = '.';
OPEN_BRACE = '(';
CLOSE_BRACE = ')';
id = [a-zA-Z]+;
It fails for the given input as
com.mytest.create();
which should have given "create" as value of "name" key in the result part.
Any help would be great.

There are several things here.
The most important is that PEG is greedy. That means that your (DOT id)* rule matches ALL the DOT id sequences, including the one that startRule expects to match as DOT b:id, and a PEG parser does not backtrack into the repetition to give it back.
That can be solved using lookahead.
The other thing is that you must remember to use join, since by default [a-zA-Z]+ returns each matched character as a separate array element.
I also added a rule for semicolons.
Try this:
start =
namespace:namespace DOT name:string OPEN_BRACE CLOSE_BRACE SM nl?
{
return { namespace : namespace, name : name };
}
/* Here I'm using the lookahead: (member !OPEN_BRACE)* */
namespace =
first:string rest:(member !OPEN_BRACE)*
{
rest = rest.map(function (x) { return x[0]; });
rest.unshift(first);
return rest;
}
member =
DOT str:string
{ return str; }
DOT =
'.'
OPEN_BRACE =
'('
CLOSE_BRACE =
')'
SM =
';'
nl =
"\n"
string =
str:[a-zA-Z]+
{ return str.join(''); }
And as far as I can tell, it parses that line correctly.
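The greedy behaviour and the lookahead fix can be sketched outside PEG.js with a small hand-written matcher in Java. This is only an illustration of the idea, not how PEG.js is implemented. The class GreedyPegDemo and the useLookahead switch are hypothetical names.

```java
/** Hypothetical hand-written matcher for: id (DOT id)* DOT id "(" ")" */
final class GreedyPegDemo {
    private final String in;
    private int pos = 0;

    GreedyPegDemo(String in) { this.in = in; }

    /** id = [a-zA-Z]+ ; returns the matched text, or null if no letter is found. */
    private String id() {
        int start = pos;
        while (pos < in.length()) {
            char c = in.charAt(pos);
            if (!((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))) break;
            pos++;
        }
        return pos > start ? in.substring(start, pos) : null;
    }

    /** member = DOT id ; restores the position when it fails. */
    private String member() {
        int start = pos;
        if (pos < in.length() && in.charAt(pos) == '.') {
            pos++;
            String name = id();
            if (name != null) return name;
        }
        pos = start;
        return null;
    }

    /** Returns the function name, or null when the input does not match. */
    static String parse(String input, boolean useLookahead) {
        GreedyPegDemo p = new GreedyPegDemo(input);
        if (p.id() == null) return null;
        while (true) {
            int before = p.pos;
            if (p.member() == null) break;           // greedy: take every ".id" available
            if (useLookahead && p.in.startsWith("(", p.pos)) {
                p.pos = before;                      // like (member !OPEN_BRACE): reject this member
                break;
            }
        }
        String name = p.member();                    // the final DOT id before "()"
        if (name == null || !p.in.startsWith("()", p.pos)) return null;
        return name;
    }

    public static void main(String[] args) {
        System.out.println(parse("com.mytest.create();", false)); // null
        System.out.println(parse("com.mytest.create();", true));  // create
    }
}
```

Without the lookahead the loop has already consumed ".create" by the time the final DOT id is attempted, so the match fails. With it, the last member is handed back because an opening brace follows.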

ANTLR3 lexer returns one token when expecting to return 5 tokens

Hello, I'm trying to build a simple lexer to tokenize lines starting with a ';' character.
This is my lexer grammar:
lexer grammar TestLex;
options {
language = Java;
filter = true;
}
@header {
package com.ualberta.slmyers.cmput415.assign1;
}
IR : LINE+
;
LINE : SEMICOLON (~NEWLINE)* NEWLINE
;
SEMICOLON : ';'
;
NEWLINE : '\n'
;
WS : (' ' | '\t')+
{$channel = HIDDEN;}
;
And here is my java class to run my lexer:
package com.ualberta.slmyers.cmput415.assign1;
import java.io.IOException;
import org.antlr.runtime.*;
public class Test {
public static void main(String[] args) throws RecognitionException,
IOException {
// create an instance of the lexer
TestLex lexer = new TestLex(
new ANTLRFileStream(
"/home/linux/workspace/Cmput415Assign1/src/com/ualberta/slmyers/cmput415/assign1/test3.s"));
// wrap a token-stream around the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// when using ANTLR v3.3 or v3.4, un-comment the next line:
tokens.fill();
// traverse the tokens and print them to see if the correct tokens are
// created
int n = 1;
for (Object o : tokens.getTokens()) {
CommonToken token = (CommonToken) o;
System.out.println("token(" + n + ") = "
+ token.getText().replace("\n", "\\n"));
n++;
}
}
}
credits to: http://bkiers.blogspot.ca/2011/03/2-introduction-to-antlr.html
for the adapted code above.
This is my test file:
; token 1
; token 2
; token 3
; token 4
Note there is a newline character after the last '4'.
This is my output:
token(1) = ; token 1\n; token 2\n; token 3\n; token 4\n
token(2) = <EOF>
I'm expecting this as my output:
token(1) = ; token 1\n
token(2) = ; token 2\n
token(3) = ; token 3\n
token(4) = ; token 4\n
token(5) = <EOF>

OK, I figured it out. The problem was this rule:
IR : LINE+
;
Because IR is a lexer rule that matches one or more LINEs, the lexer returned a single token made up of all the lines.
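The effect can be reproduced with a regular-expression analogy in Java. This is a sketch, not ANTLR's lexer: the pattern ;[^\n]*\n stands in for LINE, and wrapping it in (?:...)+ stands in for IR. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: a greedy "one or more lines" rule swallows every line into one match. */
final class LineTokenDemo {
    /** Collects every successive match of the given pattern, like a lexer emitting tokens. */
    static List<String> tokenize(String input, String rulePattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(rulePattern).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "; token 1\n; token 2\n; token 3\n; token 4\n";
        String line = ";[^\\n]*\\n";                 // stand-in for LINE
        String manyLines = "(?:" + line + ")+";      // stand-in for IR : LINE+

        System.out.println(tokenize(input, manyLines).size()); // 1
        System.out.println(tokenize(input, line).size());      // 4
    }
}
```

Both the regex engine and the lexer prefer the longest match available, so as long as the LINE+ rule exists it wins over a single LINE.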

Is there a way to detect if an optional (? operator) tree grammar rule executed in an action?

path[Scope sc] returns [Path p]
@init{
List<String> parts = new ArrayList<String>();
}
: ^(PATH (id=IDENT{parts.add($id.text);})+ pathIndex? )
{// ACTION CODE
// need to check if pathIndex has executed before running this code.
if ($pathIndex.index >=0 ){
p = new Path($sc, parts, $pathIndex.index);
}else if($pathIndex.pathKey != ""){
p = new Path($sc, parts, $pathIndex.pathKey);
}
}
;
Is there a way to detect if pathIndex was executed? In my action code, I tried testing $pathIndex == null, but ANTLR doesn't let you do that. ANTLRWorks gives a syntax error saying "Missing attribute access on rule scope: pathIndex."
The reason why I need to do this is because in my action code I do:
$pathIndex.index
which returns 0 if the variable that $pathIndex is translated to is null. When you access an attribute, ANTLR generates pathIndex7!=null?pathIndex7.index:0. This causes a problem with an object, because it changes a value I have preset to -1 as an error flag to 0.

There are a couple of options:
1. Put your code inside the optional pathIndex:
rule
: ^(PATH (id=IDENT{parts.add($id.text);})+ (pathIndex {/*pathIndex cannot be null here!*/} )? )
;
2. Use a boolean flag to denote the presence (or absence) of pathIndex:
rule
@init{boolean flag = false;}
: ^(PATH (id=IDENT{parts.add($id.text);})+ (pathIndex {flag = true;} )? )
{
if(flag) {
// ...
}
}
;
EDIT
You could also make pathIndex match nothing so that you don't need to make it optional inside path:
path[Scope sc] returns [Path p]
: ^(PATH (id=IDENT{parts.add($id.text);})+ pathIndex)
{
// code
}
;
pathIndex returns [int index, String pathKey]
@init {
$index = -1;
$pathKey = "";
}
: ( /* some rules here */ )?
;
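PS. Note that the expression $pathIndex.pathKey != "" compares object references, not string contents, so it will most likely not do what you intend: it can evaluate to true even when pathKey is an empty string. To compare the contents of strings in Java, use their equals(...) method instead: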
!$pathIndex.pathKey.equals("")
or, if $pathIndex.pathKey can be null, you can avoid a NullPointerException by doing:
!"".equals($pathIndex.pathKey)
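A small plain-Java sketch of the difference, independent of ANTLR. The class and method names are made up, and new String("") is used only to force a string object that is distinct from the "" literal.

```java
/** Hypothetical demo of reference comparison versus content comparison for strings. */
final class StringCompareDemo {
    /** The comparison used in the action code: compares references, not characters. */
    static boolean nonEmptyByReference(String key) {
        return key != "";
    }

    /** Null-safe content comparison: calling equals on the literal never throws. */
    static boolean nonEmptyByEquals(String key) {
        return !"".equals(key);
    }

    public static void main(String[] args) {
        String empty = new String("");                    // empty, but a different object than ""
        System.out.println(nonEmptyByReference(empty));   // true, although the string is empty
        System.out.println(nonEmptyByEquals(empty));      // false
        System.out.println(nonEmptyByEquals("key"));      // true
        System.out.println(nonEmptyByEquals(null));       // true, and no NullPointerException
    }
}
```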

More information would have been helpful. However, if I understand correctly, when a value for the index is not present in the input you want to test for $pathIndex.index == null. This code does that using the pathIndex rule to return the Integer $index to the path rule:
path
: ^(PATH IDENT+ pathIndex?)
{ if ($pathIndex.index == null)
System.out.println("path index is null");
else
System.out.println("path index = " + $pathIndex.index); }
;
pathIndex returns [Integer index]
: DIGIT
{ $index = Integer.parseInt($DIGIT.getText()); }
;
For testing, I created these simple parser and lexer rules:
path : 'path' IDENT+ pathIndex? -> ^(PATH IDENT+ pathIndex?)
;
pathIndex : DIGIT
;
/** lexer rules **/
DIGIT : '0'..'9' ;
IDENT : LETTER+ ;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
When the index is present in the input, as in path a b c 5, the output is:
Tree = (PATH a b c 5)
path index = 5
When the index is not present in the input, as in path a b c, the output is:
Tree = (PATH a b c)
path index is null

ANTLR Source to Output

I'm trying to implement something like a Code Contracts feature for JavaScript as an assignment for one of my courses.
The problem I'm having is that I can't seem to find a way to output the source file directly to the console without modifying the entire grammar.
Does anybody know a way to achieve this?
Thanks in advance.
Here's an example of what I'm trying to do:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
Contract.Requires(num < 1000);
Contract.Requires<TypeError>(arr instanceOf Array);
Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9);
Contract.Requires<ReferenceError>(text != null);
Contract.Ensures<RangeError>(text.length === 0);
// method body
[...]
return text;
}
should be transformed into:
function DoClear(num, arr, text){
if (!(num > 0))
throw RangeError;
if (!(num < 1000))
throw Error;
if (!(arr instanceOf Array))
throw TypeError;
if (!(arr.length > 0 && arr.length <= 9))
throw RangeError;
if (!(text != null))
throw ReferenceError
// method body
[...]
if (!(text.length === 0))
throw RangeError
else
return text;
}

There are a few (minor) things you'll want to consider:
ignore string literals that might contain your special contract-syntax;
ignore multi- and single line comments that might contain your special Contract syntax;
ignore code like this: var Requires = "Contract.Requires<RangeError>"; (i.e. regular JavaScript code that "looks like" your contract-syntax);
It's pretty straightforward to take the points above into account and also simply create a single token for an entire contract line. You'll be making your life hard if you tokenize Contract.Requires<RangeError>(num > 0) into 4 different tokens:
Contract
Requires
<RangeError>
(num > 0)
So it's easiest to create a single token from it, and at the parsing phase, split the token on ".", "<" or ">" with a maximum of 4 tokens (leaving expressions containing ".", "<" or ">" as they are).
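The splitting step on its own can be tried in plain Java. The sketch below assumes the contract text has already been isolated as one string without the trailing semicolon. The class ContractSplitDemo and its rewrite method are made-up names, and the regex used to detect a missing type is an assumption for this illustration.

```java
/** Hypothetical demo: split one contract string into at most 4 parts and rebuild it as a check. */
final class ContractSplitDemo {
    static String rewrite(String contract) {
        // If no <Type> follows Requires/Ensures, inject a generic Error before the first "(".
        if (!contract.matches("Contract\\.\\w+<[a-zA-Z]+>\\(.*")) {
            contract = contract.replaceFirst("\\(", "<Error>(");
        }
        // Limit 4: the pattern is applied at most 3 times, so any '.', '<' or '>'
        // inside the condition stays untouched in the last part.
        String[] parts = contract.split("[.<>]", 4);
        return "if (!" + parts[3] + ") throw " + parts[2];
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9)"));
        // if (!(arr.length > 0 && arr.length <= 9)) throw RangeError
        System.out.println(rewrite("Contract.Requires(num < 1000)"));
        // if (!(num < 1000)) throw Error
    }
}
```

The limit argument is what keeps arr.length and the comparison operators in the condition from being split further.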
A quick demo of what I described above might look like this:
grammar CCJS;
parse
: atom+ EOF
;
atom
: code_contract
| (Comment | String | Any) {System.out.print($text);}
;
code_contract
: Contract
{
String[] tokens = $text.split("[.<>]", 4);
System.out.print("if (!" + tokens[3] + ") throw " + tokens[2]);
}
;
Contract
@init{
boolean hasType = false;
}
@after{
if(!hasType) {
// inject a generic Error if this contract has no type
setText(getText().replaceFirst("\\(", "<Error>("));
}
}
: 'Contract.' ('Requires' | 'Ensures') ('<' ('a'..'z' | 'A'..'Z')+ '>' {hasType=true;})? '(' ~';'+
;
Comment
: '//' ~('\r' | '\n')*
| '/*' .* '*/'
;
String
: '"' (~('\\' | '"' | '\r' | '\n') | '\\' . )* '"'
;
Any
: .
;
which you can test with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src =
"/* \n" +
" Contract.Requires to be ignored \n" +
"*/ \n" +
"function DoClear(num, arr, text){ \n" +
" Contract.Requires<RangeError>(num > 0); \n" +
" Contract.Requires(num < 1000); \n" +
" Contract.Requires<TypeError>(arr instanceOf Array); \n" +
" Contract.Requires<RangeError>(arr.length > 0 && arr.length <= 9); \n" +
" Contract.Requires<ReferenceError>(text != null); \n" +
" Contract.Ensures<RangeError>(text.length === 0); \n" +
" \n" +
" // method body \n" +
" // and ignore single line comments, Contract.Ensures \n" +
" var s = \"Contract.Requires\"; // also ignore strings \n" +
" \n" +
" return text; \n" +
"} \n";
CCJSLexer lexer = new CCJSLexer(new ANTLRStringStream(src));
CCJSParser parser = new CCJSParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If you run the Main class above, the following will be printed to the console:
/*
Contract.Requires to be ignored
*/
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
if (!(num < 1000)) throw Error;
if (!(arr instanceOf Array)) throw TypeError;
if (!(arr.length > 0 && arr.length <= 9)) throw RangeError;
if (!(text != null)) throw ReferenceError;
if (!(text.length === 0)) throw RangeError;
// method body
// and ignore single line comments, Contract.Ensures
var s = "Contract.Requires"; // also ignore strings
return text;
}
BUT ...
... I realize that it isn't exactly what you're looking for: the RangeError check for Contract.Ensures is not placed at the end of your function. And that's going to be a tough one: a function might have multiple returns, and is likely to have multiple code blocks { ... }, making it difficult to know where the } is that ends the function. So you don't know where exactly to inject this RangeError check. At least, not with a naive approach like the one I demonstrated.
The only reliable way to implement such a thing is to get a decent JavaScript grammar, add your own contract-rules to it, rewrite the AST the parser produces, and finally emit the new AST in a friendly-formatted way: not a trivial task, to say the least!
There are various ECMA/JS grammars on the ANTLR Wiki, but tread with care: they are user-committed grammars and may contain errors (probably will in this case[1]!).
If you choose to place the RangeError check where it should be rewritten, like so:
function DoClear(num, arr, text){
Contract.Requires<RangeError>(num > 0);
...
// method body
...
Contract.Ensures<RangeError>(text.length === 0);
return text;
}
which would result in:
function DoClear(num, arr, text){
if (!(num > 0)) throw RangeError;
...
// method body
...
if (!(text.length === 0))
throw RangeError
return text;
}
then you need not parse the entire method body, and you might get away with a hack as I proposed.
Best of luck!
[1] the last time I checked these ECMA/JS script grammars, none of them handled regex literals, /pattern/, properly, making them in my opinion suspect.
