error recovery of ANTLR4 leaves loop

I am trying to write a parser for a subset of the ABAP language, but I run into problems when the input contains misspelled or unknown statements.
In this example, one of the PERFORM statements is spelled PERFOR. I expected the parser to consume tokens until it re-syncs, and then proceed with the following PERFORM statements.
FUNCTION-POOL test.
FUNCTION z_angebot_01.
PERFORM x.
LOOP AT mytable.
PERFORM test.
PERFOR test.
PERFORM test.
PERFORM test.
ENDLOOP.
ENDFUNCTION.
Instead, the parser seems to attempt token insertion and leaves the LOOP. Later it complains about an extraneous ENDLOOP.
Output messages:
line 11:4 extraneous input 'PERFOR' expecting {ENDLOOP, LOOP_AT, PERFORM}
line 15:2 extraneous input 'ENDLOOP' expecting {ENDFUNCTION, LOOP_AT, PERFORM}
While debugging the generated code, I noticed that no error is raised at the PERFOR statement itself: the parser stays inside the loop as long as LOOP_AT or PERFORM is found, and anything else exits the loop.
But how can I treat misspelled or unknown statements as syntax errors that are skipped up to the next EOC token?
I use a separate lexer and parser, so this is my current approach:
AbapLexer.g4:
lexer grammar AbapLexer;
@lexer::header {
package generated;
}
WS : [ \t\r\n] -> skip;
EOC : '.' ;
ENDFUNCTION : [Ee][Nn][Dd][Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn];
ENDLOOP : [Ee][Nn][Dd][Ll][Oo][Oo][Pp];
FUNCTION : [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn];
FUNCTION_POOL : [Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn] '-' [Pp][Oo][Oo][Ll];
LOOP_AT: [Ll][Oo][Oo][Pp] WHITESPACE [Aa][Tt];
PERFORM : [Pp][Ee][Rr][Ff][Oo][Rr][Mm];
IDENTIFIER: [_a-zA-Z] [_0-9a-zA-Z]* ;
fragment WHITESPACE: [ \t\r\n]+;
AbapParser.g4:
parser grammar AbapParser;
options { tokenVocab=AbapLexer; }
@parser::header {
package generated;
}
report: (FUNCTION_POOL) IDENTIFIER EOC
(functionStatement)*
;
block:
(
loopStatement EOC
| performStatement EOC
)+
;
loopStatement:
loopStatementStart EOC
block?
ENDLOOP
;
loopStatementStart:
LOOP_AT IDENTIFIER
;
performStatement:
PERFORM IDENTIFIER
;
functionStatement
:
FUNCTION functionname = IDENTIFIER EOC
block?
ENDFUNCTION EOC
;
Any hints appreciated!
Thank you
Peter
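The behaviour asked for above (treat an unknown statement as one error, skip to the next EOC, and stay inside the LOOP) can be pictured with a small hand-written sketch. It is plain Python, not ANTLR-generated code and not ANTLR's own recovery mechanism; the function names tokenize, skip_to_eoc and parse_block are made up for illustration, and LOOP AT is handled as two separate words for simplicity.

```python
import re

def tokenize(text):
    # Words and the end-of-command dot; whitespace is dropped.
    return re.findall(r"[A-Za-z_][A-Za-z_0-9-]*|\.", text)

def skip_to_eoc(tokens, pos):
    # Panic-mode recovery: discard tokens up to and including the next ".".
    while pos < len(tokens) and tokens[pos] != ".":
        pos += 1
    return min(pos + 1, len(tokens))

def parse_block(tokens, pos, terminator, errors):
    """Parse statements until `terminator`; return (statements, next position)."""
    stmts = []
    while pos < len(tokens):
        word = tokens[pos].upper()
        if word == terminator:
            return stmts, skip_to_eoc(tokens, pos)
        if word == "PERFORM":
            stmts.append(("perform", tokens[pos + 1]))
            pos = skip_to_eoc(tokens, pos)
        elif word == "LOOP":
            table = tokens[pos + 2]  # LOOP AT <table> .
            body, pos = parse_block(tokens, skip_to_eoc(tokens, pos), "ENDLOOP", errors)
            stmts.append(("loop", table, body))
        else:
            # Unknown statement: report once, resync at the next EOC, stay in the block.
            errors.append("unknown statement '%s'" % tokens[pos])
            pos = skip_to_eoc(tokens, pos)
    errors.append("missing " + terminator)
    return stmts, pos

source = """PERFORM x. LOOP AT mytable. PERFORM test. PERFOR test.
PERFORM test. PERFORM test. ENDLOOP. ENDFUNCTION."""
errors = []
tree, end = parse_block(tokenize(source), 0, "ENDFUNCTION", errors)
print(tree)
print(errors)
```

The key design choice is that the unknown-statement branch lives inside the block loop, so a bad statement costs one error message and never terminates the enclosing LOOP.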

Related

Yacc/bison: what's wrong with my syntax equations?

I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
: char-head general-text character-insides { PopChoices(); }
;
character-insides
: LEFTBRACKET options RIGHTBRACKET
;
char-head
: char-namesWT opt-imgsWT char-desc opt-cond
;
char-desc
: general-text { SetText($1); }
;
char-namesWT
: DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
;
opt-cond
: %empty
| condition
;
condition
: condition-reason condition-main general-text
{ AddCondition($1, $2, $3); }
;
condition-reason
: DOTU { $$ = 'u'; }
| DOTV { $$ = 'v'; }
;
condition-main
: money-conditionWT
| have-conditionWT
| moves-conditionWT
| flag-conditionWT
;
have-conditionWT
: PERCENT_SLASH opt-bang ID
{ $$ = MkCondID($1, $2, $3) ; expect(EXP_TEXT); }
;
opt-bang
: %empty { $$ = TRUE; }
| BANG { $$ = FALSE; }
;
ID: WORD
;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphameric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
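To make the "expect" mechanism described above concrete, here is a minimal sketch of a lexer whose next token depends on a one-shot expectation. It is Python rather than the C++ recognizer from the question, the class name ModalLexer is invented, only DOTC, WORD and TEXT are modelled, and the "not starting with certain keywords" restriction on TEXT is left out.

```python
EXP_NORMAL, EXP_TEXT = "normal", "text"

class ModalLexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.expectation = EXP_NORMAL

    def expect(self, what):
        # Called by the parser to influence how the NEXT token is scanned.
        self.expectation = what

    def next_token(self):
        # The expectation applies to one token only, then falls back to normal.
        mode, self.expectation = self.expectation, EXP_NORMAL
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.source):
            return ("EOF", "")
        if mode == EXP_TEXT:
            # TEXT: everything from here to the end of the line.
            end = self.source.find("\n", self.pos)
            if end == -1:
                end = len(self.source)
            text = self.source[self.pos:end]
            self.pos = end
            return ("TEXT", text)
        start = self.pos
        while self.pos < len(self.source) and not self.source[self.pos].isspace():
            self.pos += 1
        word = self.source[start:self.pos]
        return ("DOTC" if word == ".c" else "WORD", word)

lexer = ModalLexer(".c Wendy wendy\nOK, now you caught me\n")
head = [lexer.next_token() for _ in range(3)]
lexer.expect(EXP_TEXT)          # what the grammar action would request
text_token = lexer.next_token()
print(head, text_token)
```

Note that the sketch calls expect() by hand at exactly the right moment; in a generated parser the timing depends on when the action runs relative to the tokens already requested from the lexer.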
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
Here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the equations above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
Brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?
To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 in the top of your grammar file. The debugging code is controlled by the global variable yydebug -- set to non-zero to turn on the trace and zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
extern int yydebug;
if (char *p = getenv("YYDEBUG"))
yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.
I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.

Capturing content which can start with Parser keywords in Xtext

The following is the simplified version of my actual grammar :-
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
With the above grammar, if my input is:
<hi><hello>world</hello>
then I get a syntax error saying: mismatched character 'i' expecting 'e' at column 2.
My requirement is that AnyContent should match "<hi>". Can anyone guide me on how to achieve that?
If you want to do this with Xtext, I advise you to split your problem in two. The first problem is syntactic: you need to parse your file. The second problem is semantic: you want to give a "sense" to your objects and decide which element contains which. Defining the containers and the containment for XML can't be done inside your grammar.
Define a custom Ecore model and write a simple grammar with start and end tags. You don't really care about the names of your tags.
Example :
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source hands the tokens to your parser; by overriding it you can change the type of a token, or merge or split tokens.
The idea here is to merge all tokens that lie between ">" and "</".
These tokens represent a Text, so you can create a single token for everything contained between these delimiters. Example:
class CustomTokenSource extends XtextTokenStream{
new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
super(tokenSource,tokenDefProvider)
}
override LT(int k) {
var Token token = super.LT(k)
if(token != null && token.text != null) token.tokenOverride(k);
token
}
}
In this example you need to add your custom code in the method "tokenOverride".
Add your custom token source to your parser:
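As an illustration of the merging idea only (this is not Xtext or ANTLR API), the following Python sketch collapses every run of tokens that follows a ">" into one TEXT token. The function name merge_text_tokens and the (type, text) tuples are assumptions made for the example; it also stops a run at "<" as well as "</", which goes slightly beyond the description above so that nested start tags survive.

```python
def merge_text_tokens(tokens):
    """Collapse each run of tokens between '>' and the next tag opener into one TEXT token."""
    merged, run, inside = [], [], False
    for tok in tokens:
        if inside and tok not in ("<", "</"):
            run.append(tok)          # still inside element content
            continue
        if run:
            merged.append(("TEXT", "".join(run)))
            run = []
        inside = False
        merged.append(("TOKEN", tok))
        if tok == ">":
            inside = True            # content starts after a closing angle bracket
    if run:
        merged.append(("TEXT", "".join(run)))
    return merged

result = merge_text_tokens(["<", "hello", ">", "big", " ", "world", "</", "hello", ">"])
print(result)
```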
class XDSLParser extends DSLParser{
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new CustomTokenSource(tokenSource, getTokenDefProvider());
}
}
Compute the containment: the containment of your elements can be computed after parsing. Once parsing is done, you can get your model and change it as you wish. To do so, override the method "doParse" of your parser "XDSLParser" as follows:
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
var IParseResult result = super.doParse( ruleName, in, nodeModelBuilder, initialLookAhead)
// Gives you the model
result.rootASTElement;
return result
}
Note: the model that you obtain after parsing will be flat. The XmlFile object will contain all the elements in the right order. You need to write an algorithm to build the containment on your AST model.
This will require a lot of tweaking in the grammar due to the nature of the ANTLR lexer that is used by Xtext. The lexer will not roll back for the keyword <hello>: as soon as it sees a < followed by an h, it will try to consume the hello token. Something along these lines could work though:
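A common way to rebuild containment from a flat, ordered list of start tags, end tags and text is a stack. The sketch below shows that idea in Python with plain dictionaries; it does not use EMF or the generated model classes, and build_tree and the ("start" | "end" | "text", value) tuples are hypothetical stand-ins for the StartTag, EndTag and Text objects.

```python
def build_tree(flat):
    """Turn a flat list of ('start'|'end'|'text', value) items into nested dicts."""
    root = {"name": None, "children": []}
    stack = [root]                      # stack[-1] is the currently open element
    for kind, value in flat:
        if kind == "start":
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            if len(stack) == 1 or stack[-1]["name"] != value:
                raise ValueError("unexpected end tag </%s>" % value)
            stack.pop()
        else:
            stack[-1]["children"].append(value)
    if len(stack) != 1:
        raise ValueError("unclosed tag <%s>" % stack[-1]["name"])
    return root

flat = [("start", "a"), ("text", "x"), ("start", "b"), ("end", "b"), ("end", "a")]
tree = build_tree(flat)
print(tree)
```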
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello' '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
This approach won't scale to real-world grammars, but maybe it helps to get you on a working track.

skipping parts of a matched lexical element or token

I would like to match a "{NUM}" and then have the lexer rule return "NUM". So I tried:
NUM : ('{' { skip(); }) 'NUM' ('}' { skip(); });
But that seems to skip everything and return empty on a match. Would it be possible to skip only parts of a lexer match?
antlr 3.4
Invoking skip() anywhere in your rule will remove the entire token from the lexer, not just certain characters.
What you could do is this:
NUM
: '{NUM}' {setText("NUM");}
;
Or, if NUM is variable, do:
NUM
: '{' 'A'..'Z'+ '}' {setText($text.substring(1, $text.length() - 1));}
;
which removes the first and last char from the token.
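The same "match the delimiters, then trim the token text" idea, shown outside ANTLR as a small Python sketch. The regular expression and the function name lex are assumptions for illustration; input that does not match the pattern is silently ignored here.

```python
import re

BRACED = re.compile(r"\{[A-Z]+\}")

def lex(source):
    """Return (type, text) pairs; the braces are matched but dropped from the text."""
    tokens = []
    for m in BRACED.finditer(source):
        matched = m.group(0)                   # e.g. "{NUM}"
        tokens.append(("NUM", matched[1:-1]))  # same idea as substring(1, length - 1)
    return tokens

print(lex("{NUM} {ABC}"))
```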
EDIT
smartnut007 wrote:
Is there an equivalent way to do this for Tokens ?
If you mean how to change the text of tokens inside parser rules, try this:
parser_rule
: LEXER_RULE {$LEXER_RULE.setText("new-text");}
;
LEXER_RULE
: 'old-text'
;

ANTLR: removing clutter

I'm learning ANTLR right now. Let's say I have some VHDL code and would like to do some processing on the PROCESS blocks. The rest should be completely ignored. I don't want to describe the whole VHDL language, since I'm interested only in the process blocks. So I could write a rule that matches process blocks. But how do I tell ANTLR to match only the process block rule and ignore anything else?
I know next to no VHDL, so let's say you want to replace all single line comments in a (Java) source file with multi-line comments:
//foo
should become:
/* foo */
You need to let the lexer match single line comments, of course. But you should also make sure it recognizes multi-line comments because you don't want //bar to be recognized as a single line comment in:
/*
//bar
*/
The same goes for string literals:
String s = "no // comment";
Finally, you should create some sort of catch-all rule in the lexer that will match any character.
A quick demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Str
: '"' ('\\' . | ~('\\' | '"'))* '"'
;
MLComment
: '/*' .* '*/'
;
SLComment
: '//' ~('\r' | '\n')*
{
setText("/* " + getText().substring(2) + " */");
}
;
Any
: . // fall through rule, matches any character
;
If you now parse input like this:
//comment 1
class Foo {
//comment 2
/*
* not // a comment
*/
String s = "not // a // comment"; //comment 3
}
the following will be printed to your console:
/* comment 1 */
class Foo {
/* comment 2 */
/*
* not // a comment
*/
String s = "not // a // comment"; /* comment 3 */
}
Note that this is just a quick demo: a string literal in Java could contain Unicode escapes, which my demo doesn't support, and my demo also does not handle char-literals (the char literal char c = '"'; would break it). All of these things are quite easy to fix, of course.
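For comparison, the same four-way tokenization (string literal, multi-line comment, single-line comment, catch-all) can be sketched with a single regular expression. This is a Python illustration of the technique rather than a translation of the generated lexer; convert_comments is a made-up name, and it shares the demo's limitations regarding Unicode escapes and char literals.

```python
import re

TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'   # string literal, with backslash escapes
    r'|/\*.*?\*/'          # multi-line comment (non-greedy)
    r'|//[^\r\n]*'         # single-line comment
    r'|.',                 # catch-all: any single character
    re.DOTALL,
)

def convert_comments(source):
    out = []
    for m in TOKEN.finditer(source):
        text = m.group(0)
        # Only a single-line comment token can start with "//".
        if text.startswith("//"):
            text = "/* " + text[2:] + " */"
        out.append(text)
    return "".join(out)

print(convert_comments('String s = "not // a // comment"; //comment 3'))
```

Because the alternatives are tried in order at each position, a "//" inside a string literal or a multi-line comment is consumed as part of that larger token and never reaches the single-line comment branch.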
In the upcoming ANTLR v4, you can do fuzzy parsing. Take a look at
http://www.antlr.org/wiki/display/ANTLR4/Wildcard+Operator+and+Nongreedy+Subrules
You can get the beta software here:
http://antlr.org/download/antlr-4.0b3-complete.jar
Terence

Processing an n-ary ANTLR AST one child at a time

I currently have a compiler that uses an AST where all children of a code block are on the same level (ie, block.children == {stm1, stm2, stm3, etc...}). I am trying to do liveness analysis on this tree, which means that I need to take the value returned from the processing of stm1 and then pass it to stm2, then take the value returned by stm2 and pass it to stm3, and so on. I do not see a way of executing the child rules in this fashion when the AST is structured this way.
Is there a way to allow me to chain the execution of the child grammar items with my given AST, or am I going to have to go through the painful process of refactoring the parser to generate a nested structure and updating the rest of the compiler to work with the new AST?
Example ANTLR grammar fragment:
block
: ^(BLOCK statement*)
;
statement
: // stuff
;
What I hope I don't have to go to:
block
: ^(BLOCK statementList)
;
statementList
: ^(StmLst statement statement+)
| ^(StmLst statement)
;
statement
: // stuff
;
Parser (or lexer) rules can take parameter values and can return a value. So, in your case, you can do something like:
block
@init {Object o = null; /* initialize the value being passed through */ }
: ^(BLOCK (s=statement[o] {o = $s.returnValue; /* re-assign 'o' */ } )*)
;
statement [Object parameter] returns [Object returnValue]
: // do something with 'parameter' and 'returnValue'
;
Here's a very simple example that you can use to play around with:
grammar Test;
@members{
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("1;2;3;4;");
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.parse();
}
}
parse
: block EOF
;
block
@init{int temp = 0;}
: (i=statement[temp] {temp = $i.ret;} ';')+
;
statement [int param] returns [int ret]
: Number {$ret = $param + Integer.parseInt($Number.text);}
{System.out.printf("param=\%d, Number=\%s, ret=\%d\n", $param, $Number.text, $ret);}
;
Number
: '0'..'9'+
;
When you've generated a parser and lexer from it and compiled these classes, execute the TestParser class and you'll see the following printed to your console:
param=0, Number=1, ret=1
param=1, Number=2, ret=3
param=3, Number=3, ret=6
param=6, Number=4, ret=10
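The chaining pattern itself (feed the result of one child into the next) does not depend on ANTLR. As a rough sketch, here is the same computation in plain Python; the functions statement and block mirror the rule names above but are otherwise hypothetical.

```python
def statement(param, number_text):
    # Mirrors the 'statement' rule: takes a parameter, returns a value.
    ret = param + int(number_text)
    print("param=%d, Number=%s, ret=%d" % (param, number_text, ret))
    return ret

def block(source):
    # Mirrors the 'block' rule: 'temp' carries the value from child to child.
    temp = 0
    for number_text in source.split(";"):
        if number_text:
            temp = statement(temp, number_text)
    return temp

total = block("1;2;3;4;")
```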