flex&bison shift/reduce conflict - grammar

Here is part of my grammar:
expr_address
: expr_address_category expr_opt { $$ = new ExprAddress($1,*$2);}
| axis axis_data { $$ = new ExprAddress($1,*$2);}
;
axis_data
: expr_opt { $$ = $1;}
| sign { if($1 == MINUS)
$$ = new IntergerExpr(-1000000000);
else if($1 == PLUS)
$$ = new IntergerExpr(+1000000000);}
;
expr_opt
: { $$ = new IntergerExpr(0);}
| expr { $$ = $1;}
;
expr_address_category
: I { $$ = NCAddress_I;}
| J { $$ = NCAddress_J;}
| K { $$ = NCAddress_K;}
;
axis
: X { $$ = NCAddress_X;}
| Y { $$ = NCAddress_Y;}
| Z { $$ = NCAddress_Z;}
| U { $$ = NCAddress_U;}
| V { $$ = NCAddress_V;}
| W { $$ = NCAddress_W;}
;
expr
: '[' expr ']' {$$ = $2;}
| COS parenthesized_expr {$$ = new BuiltinMethodCallExpr(COS,*$2);}
| SIN parenthesized_expr {$$ = new BuiltinMethodCallExpr(SIN,*$2);}
| ATAN parenthesized_expr {$$ = new BuiltinMethodCallExpr(ATAN,*$2);}
| SQRT parenthesized_expr {$$ = new BuiltinMethodCallExpr(SQRT,*$2);}
| ROUND parenthesized_expr {$$ = new BuiltinMethodCallExpr(ROUND,*$2);}
| variable {$$ = $1;}
| literal
| expr '+' expr {$$ = new BinaryOperatorExpr(*$1,PLUS,*$3);}
| expr '-' expr {$$ = new BinaryOperatorExpr(*$1,MINUS,*$3);}
| expr '*' expr {$$ = new BinaryOperatorExpr(*$1,MUL,*$3);}
| expr '/' expr {$$ = new BinaryOperatorExpr(*$1,DIV,*$3);}
| sign expr %prec UMINUS {$$ = new UnaryOperatorExpr($1,*$2);}
| expr EQ expr {$$ = new BinaryOperatorExpr(*$1,EQ,*$3);}
| expr NE expr {$$ = new BinaryOperatorExpr(*$1,NE,*$3);}
| expr GT expr {$$ = new BinaryOperatorExpr(*$1,GT,*$3);}
| expr GE expr {$$ = new BinaryOperatorExpr(*$1,GE,*$3);}
| expr LT expr {$$ = new BinaryOperatorExpr(*$1,LT,*$3);}
| expr LE expr {$$ = new BinaryOperatorExpr(*$1,LE,*$3);}
;
variable
: d_h_address {$$ = new AddressExpr(*$1);}
;
d_h_address
: D INTEGER_LITERAL { $$ = new IntAddress(NCAddress_D,$2);}
| H INTEGER_LITERAL { $$ = new IntAddress(NCAddress_H,$2);}
;
I would like my grammar to support input like:
H100=20;
X;
X+0;
X+;
X+H100; // H100 is a variable reference
The first two are the same as X0. By the way, sign is either + or -.
But bison reports conflicts. The key part of bison.output:
State 108
11 expr: sign . expr
64 axis_data: sign .
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 64 (axis_data)]
H [reduce using rule 64 (axis_data)]
$default reduce using rule 64 (axis_data)
State 69
62 expr_address: axis . axis_data
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
State 68
61 expr_address: expr_address_category . expr_opt
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
I don't know how to deal with this. Thanks in advance.
EDIT:
I made a minimal grammar:
%{
#include <stdio.h>
extern "C" int yylex();
void yyerror(const char *s) { printf("ERROR: %s\n", s); }
%}
%token PLUS '+' MINUS '-'
%token D H I J K X Y Z INT
/*%type sign expr var expr_address_category expr_opt
%type axis */
%start word_list
%%
/* The grammar above omitted this rule; it is what makes the grammar ambiguous */
word_list
: word
| word_list word
;
sign
: PLUS
| MINUS
;
expr
: var
| sign expr
| '[' expr ']'
;
var
: D INT
| H INT
;
word
: expr_address
| var '=' expr
;
expr_address
: expr_address_category expr_opt
/*| '(' axis sign ')'*/
| axis sign
;
expr_opt
: /* empty */
| expr
;
expr_address_category
: I
| J
| K
| axis
;
axis
: X
| Y
| Z
;
%%
and I would like it to support:
X;
X0;
X+0; // the first three are the same as X0
X+;
X+H100; // X's data is the expression +H100
X+H100=10; // two words in one block: X+ and H100=10
XH100=10; // two words in one block: X and H100=10
EDIT2:
The EDIT above omitted this rule:
block
: word_list ';'
| ';'
;
because I have to allow input such as:
H000 = 100 H001 = 200 H002 = 300;

This is essentially the classic LR(2) grammar, except that in your case it is LR(3) because your variables consist of two tokens [Note 1]:
var : D INT | H INT
The basic problem is the concatenation of words without separators:
word_list : word | word_list word
combined with the fact that one of the options for word ends with an optional var:
word: expr_address
expr_address: expr_address_category expr_opt
while the other one starts with a var:
word: var '=' expr
The = makes this unambiguous, since nothing in an expr can contain that symbol. But at the point where a decision needs to be made, the = is not visible, because the lookahead is the first token of a var -- either an H or a D -- and the equals sign is still two tokens away.
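To make the lookahead distance concrete, here is a small Python sketch. It is not part of the grammar: the token names and the helper first_difference are made up for illustration. It compares the tokens that can follow X in the two competing readings and reports the position of the first token that tells them apart.

```python
def first_difference(a, b):
    """Return the 1-based position of the first token where a and b differ."""
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    # One sequence is a prefix of the other: they differ just past the shorter one.
    return min(len(a), len(b)) + 1


# Tokens following X in "X H100 ;" (H100 is X's expression) ...
value_of_x = ["H", "INT", ";"]
# ... and in "X H100 = 10 ;" (H100 starts the next word).
next_word = ["H", "INT", "=", "INT", ";"]

# With var spelled as two tokens, the readings diverge at the third token.
print(first_difference(value_of_x, next_word))  # 3

# If the lexer delivered the whole variable as one token (hypothetical VAR),
# the readings would diverge at the second token instead.
print(first_difference(["VAR", ";"], ["VAR", "=", "INT", ";"]))  # 2
```

A parser with a single token of lookahead only ever sees position 1, which is identical in both readings, so it cannot choose between shifting and reducing.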
This LR(2) grammar is very similar to the grammar used by yacc/bison itself, a fact which I always find to be ironic, because the grammar for yacc does not require ; between productions:
production: SYMBOL ':' | production SYMBOL /* Lots of detail omitted */
As with your grammar, this makes it impossible to know whether a SYMBOL should be shifted or trigger a reduce because the disambiguating : is still not visible.
Since the grammar is (I assume) unambiguous, and bison can now generate GLR parsers, that will be the simplest solution: just add
%glr-parser
to your bison prologue (but read the section of the bison manual on GLR parsers to understand the trade-off).
Note that the shift-reduce conflicts will still be reported as warnings; since it is impossible to reliably decide whether a grammar is ambiguous, bison doesn't attempt to do so and ambiguities will be reported at run-time if they exist.
You should also fix the issue mentioned in @ChrisDodd's answer regarding the refactoring of expr_address (although with a GLR parser it is not strictly necessary).
If, for whatever reason, you feel that a GLR parser will not meet your needs, you could use the solution in most implementations of yacc (including bison), which is a hack in the lexical scanner. The basic idea is to mark whether a symbol is followed by a colon or not in the lexer, so that the above production could be rewritten as:
production: SYMBOL_COLON | production SYMBOL
This solution would work for you if you were willing to combine the letter and the number into a single token:
word: expr_address expr_opt
| VARIABLE_EQUALS expr
// ...
expr: VARIABLE
My preference is to do this transformation in a wrapper around the lexer, which keeps a (one-element) queue of pending tokens:
/* The use of static variables makes this yylex wrapper unreliable
 * if it is reused after a syntax error.
 */
int yylex_wrapper() {
    static int saved_token = -1;
    static YYSTYPE saved_yylval = {0};
    int token = saved_token;
    saved_token = -1;
    yylval = saved_yylval;
    // Read a new token only if we don't have one in the queue.
    if (token < 0) token = yylex();
    // If the current token is IDENTIFIER, check the next token
    if (token == IDENTIFIER) {
        // Read the next token into the queue (saved_token / saved_yylval)
        YYSTYPE temp_val = yylval;
        saved_token = yylex();
        saved_yylval = yylval;
        yylval = temp_val;
        // If the second token is '=', then modify the current token
        // and delete the '=' from the queue
        if (saved_token == '=') {
            saved_token = -1;
            token = IDENTIFIER_EQUALS;
        }
    }
    return token;
}
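For readers who want to experiment with the idea outside of bison, the same one-element queue can be modelled in a few lines of Python. This is only a sketch of the token-merging logic: tokens are plain strings, semantic values (yylval) are ignored, and the function name merge_assignment_tokens is made up here.

```python
def merge_assignment_tokens(tokens):
    """Yield tokens, fusing each IDENTIFIER that is directly followed by '='
    into a single IDENTIFIER_EQUALS token."""
    source = iter(tokens)
    pending = None  # the one-element queue of look-ahead tokens
    while True:
        # Use the queued token if there is one, otherwise read a new one.
        token = pending if pending is not None else next(source, None)
        pending = None
        if token is None:
            return  # end of input
        if token == "IDENTIFIER":
            # Peek at the following token and keep it in the queue ...
            pending = next(source, None)
            if pending == "=":
                # ... unless it is '=', in which case it is absorbed.
                pending = None
                token = "IDENTIFIER_EQUALS"
        yield token


stream = ["IDENTIFIER", "IDENTIFIER", "=", "INT", ";"]
print(list(merge_assignment_tokens(stream)))
```

The first IDENTIFIER is passed through unchanged because the token after it is not '=', while the second one is merged with the '=' that follows it.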
Notes
Personally, I would start by making a var a single token. Do you really want to allow people to write:
H /* Some comment in the middle of the variable name */ 100
That is not going to solve the problem, though; it merely reduces the grammar's lookahead requirement from LR(3) to LR(2).

The main problem is that it can't figure out where one word in a word_list ends and the next one begins, because there is no separator token between words. This is in contrast to your examples, which all have ; terminators. So that suggests one obvious fix -- put in the ; separators:
word: expr_address ';'
| var '=' expr ';'
That fixes most of the problems, but leaves a lookahead conflict where it can't decide whether an axis is an expr_address_category or not when the lookahead is a sign, because it depends on whether there's an expr after the sign or not. You can fix that by refactoring to defer deciding:
expr_address
: expr_address_category expr_opt
| axis expr_opt
| axis sign
and remove axis from expr_address_category.

Related

SableCC expecting EOF

I seem to be having issues with SableCC generating lexing.grammar
This is what I run through SableCC:
Package lexing ; // A Java package is produced for the
// generated scanner
Helpers
num = ['0'..'9']+; // A num is 1 or more decimal digits
letter = ['a'..'z'] | ['A'..'Z'] ;
// A letter is a single upper or
// lowercase character.
Tokens
number = num; // A number token is a whole number
ident = letter (letter | num)* ;
// An ident token is a letter followed by
// 0 or more letters and numbers.
arith_op = [ ['+' + '-' ] + ['*' + '/' ] ] ;
// Arithmetic operators
rel_op = ['<' + '>'] | '==' | '<=' | '>=' | '!=' ;
// Relational operators
paren = ['(' + ')']; // Parentheses
blank = (' ' | '\t' | 10 | '\n')+ ; // White space
unknown = [0..0xffff] ;
// Any single character which is not part
// of one of the above tokens.
This is the result
org.sablecc.sablecc.parser.ParserException: [21,1] expecting: EOF
at org.sablecc.sablecc.parser.Parser.parse(Parser.java:1792)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:203)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:171)
at org.sablecc.sablecc.SableCC.main(SableCC.java:137)
You can only have a short_comment if you put one empty line after it. If you use long_comments instead (/* ... */), there's no need for it.
The reason is that, according to the grammar that defines the SableCC 2.x input language, a short comment is defined as a consuming pattern of eol:
cr = 13;
lf = 10;
eol = cr lf | cr | lf; // This takes care of different platforms
short_comment = '//' not_cr_lf* eol;
Since the last line of your grammar is a short comment:
// of one of the above tokens.
it consumes the last (invisible) EOF token expected at the end of any .sable file, which explains the error.
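The requirement that a short comment ends with a line terminator can be checked with an ordinary regular expression. The sketch below is a Python transcription of the short_comment definition quoted above; it only approximates what SableCC's own lexer does, and the names SHORT_COMMENT and is_short_comment are chosen here for illustration.

```python
import re

# '//' not_cr_lf* eol, with eol = cr lf | cr | lf
SHORT_COMMENT = re.compile(r"//[^\r\n]*(?:\r\n|\r|\n)")


def is_short_comment(text):
    """True if the whole of `text` is one short comment, terminator included."""
    return SHORT_COMMENT.fullmatch(text) is not None


# A comment followed by a line terminator matches the definition.
print(is_short_comment("// of one of the above tokens.\n"))
# The same comment as the very last characters of the file does not.
print(is_short_comment("// of one of the above tokens."))
```

Ending the file with a newline after the last comment, or using a long comment, avoids the problem.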

Warning: 2 shift/reduce conflicts [-Wconflicts-sr] err

%{
#include<stdio.h>
#include<stdlib.h>
int regs[30];
%}
%token NUMBER LETTER
%left PLUS MINUS
%left MULT DIV
%%
prog: prog st | ; // when I remove this line the warning goes away
st : E {printf("ans %d", $1);}| LETTER '=' E {regs[$1] = $3; printf("variable contains %d",regs[$1]);};
E : E PLUS E{$$ = $1 + $3;} //addition
| E MINUS E{$$ = $1 - $3 ;} //subtraction
| MINUS E{$$ = -$2;}
| E MULT E{$$ = $1 * $3 ;}
| E DIV E { if($3)$$= $1 / $3; else yyerror("Divide by 0");}
/*|LBRACE E RBRACE{$$= $2;}
| RBRACE E LBRACE{yyerror("Wrong expression");} */
| NUMBER {$$ = $1;}
| LETTER {$$ = regs[$1];}
;
%%
int main(void)
{
printf("Enter Expression: ");
yyparse();
return 0;
}
int yyerror(char *msg)
{
printf("%s", msg);// printing error
exit(0);
}
I am not able to resolve the conflicts. I also get a segmentation fault when I run it with some edits. I am using yacc and lex.
The two shift-reduce conflicts are the result of the fact that you don't require any explicit separator between statements. Because of that, a = b - 3 could be interpreted as one statement or as two (a = b; - 3). The second interpretation may not seem very natural to you but it is easily derived by the grammar.
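The two readings can be enumerated mechanically. The following Python sketch is a simplified model, not bison output. It assumes each token is written as one character (l for LETTER, n for NUMBER, the operators as themselves) and relies on the observation that, with the rules above, an E is any run of operands separated by binary operators, each operand optionally preceded by unary minuses. The names EXPR, STATEMENT and segmentations are made up for this example.

```python
import re

# E:  -* operand ( op -* operand )*   with operand = n | l and op = + - * /
EXPR = r"-*[nl](?:[-+*/]-*[nl])*"
# st: E | LETTER '=' E
STATEMENT = re.compile(rf"(?:l=)?{EXPR}")


def segmentations(tokens):
    """Return every way to cut `tokens` into a sequence of valid statements."""
    if not tokens:
        return [[]]
    results = []
    for end in range(1, len(tokens) + 1):
        head = tokens[:end]
        if STATEMENT.fullmatch(head):
            for rest in segmentations(tokens[end:]):
                results.append([head] + rest)
    return results


# a = b - 3  is the token sequence  LETTER '=' LETTER MINUS NUMBER
for reading in segmentations("l=l-n"):
    print(reading)
```

Two segmentations are found: the single statement l=l-n, and the pair l=l followed by -n. That second reading is exactly what the parser cannot rule out when it sees the MINUS.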
In addition, your use of unary minus leads to an incorrect parse of -2/3 as -(2/3) instead of (-2)/3. (You may or may not find this serious, since it has few semantic consequences with these particular operators.) This particular issue and a correct resolution is discussed in the bison manual, and in many many other internet resources.
Both of these explanations are made a bit more visible if you use the -v command line option to bison to produce a description of the parser. See Understanding your parser (again, in the bison manual).

lex and yacc warnings and not working as expected

Lexer.l
%{
#include "y.tab.h"
%}
%%
"define" return(TK_KEY_DEFINE);
"as" return(TK_KEY_AS);
"is" return(TK_KEY_IS);
"if" return(TK_KEY_IF);
"then" return(TK_KEY_THEN);
"else" return(TK_KEY_ELSE);
"endif" return(TK_KEY_ENDIF);
"with" return(TK_KEY_WITH);
"DEFINE" return(TK_KEY_DEFINE_UC);
"AS" return(TK_KEY_AS_UC);
"IS" return(TK_KEY_IS_UC);
"IF" return(TK_KEY_IF_UC);
"THEN" return(TK_KEY_THEN_UC);
"ELSE" return(TK_KEY_ELSE_UC);
"ENDIF" return(TK_KEY_ENDIF_UC);
"WITH" return(TK_KEY_WITH_UC);
"+" return(TK_PLUS);
"-" return(TK_MINUS);
"*" return(TK_MUL);
"/" return(TK_DIV);
"~" return(TK_NOT);
"&" return(TK_AND);
"|" return(TK_OR);
"<=" return(TK_LEQ);
"<" return(TK_LESS);
">=" return(TK_GEQ);
">" return(TK_GT);
"==" return(TK_EQ);
"=" return(TK_ASSIGN);
"(" return(TK_OPEN);
")" return(TK_CLOSE);
";" return(TK_SEMI);
"," return(TK_COMMA);
[[:alpha:]_][[:alnum:]_]* return(IDENTIFIER);
[+-]?[0-9]+ return(INTEGER);
[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+) return(REAL);
[[:space:]]+ ;
%%
int yywrap(void)
{
return 1;
}
Parser.y
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef struct node
{
struct node *left;
struct node *right;
char *token;
} node;
node *mknode(node *left, node *right, char *token);
void printtree(node *tree);
#define YYSTYPE struct node *
%}
%start Program
%token TK_KEY_DEFINE TK_KEY_DEFINE_UC
%token TK_KEY_AS TK_KEY_AS_UC
%token TK_KEY_IS TK_KEY_IS_UC
%token TK_KEY_IF TK_KEY_IF_UC
%token TK_KEY_THEN TK_KEY_THEN_UC
%token TK_KEY_ELSE TK_KEY_ELSE_UC
%token TK_KEY_ENDIF TK_KEY_ENDIF_UC
%token TK_KEY_WITH TK_KEY_WITH_UC
%token TK_PLUS TK_MINUS
%token TK_MUL TK_DIV
%token TK_NOT
%token TK_AND
%token TK_OR
%token TK_LEQ TK_LESS TK_GEQ TK_GT
%token TK_EQ
%token TK_ASSIGN
%token TK_OPEN TK_CLOSE
%token TK_SEMI
%token TK_COMMA
%token IDENTIFIER
%token INTEGER
%token REAL
%left TK_PLUS TK_MINUS
%left TK_MUL TK_DIV
%left TK_LEQ TK_LESS TK_GEQ TK_GT
%left TK_AND TK_OR
%left TK_EQ
%right TK_NOT TK_ASSIGN
%%
Program : Macros Statements;
Macros : /* empty */
| Macro Macros
;
Macro : TK_KEY_DEFINE MacroTemplate TK_KEY_AS Expression;
MacroTemplate : IDENTIFIER MT;
MT : /*empty*/
| TK_OPEN IdentifierList TK_CLOSE
;
IdentifierList : IDENTIFIER I;
I : /*empty*/
| TK_COMMA IdentifierList
;
Statements : /*empty*/
| Statement Statements
;
IfStmt : TK_KEY_IF Condition TK_KEY_THEN Statements TK_KEY_ELSE Statements TK_KEY_ENDIF;
Statement : AssignStmt
| IfStmt
;
AssignStmt : IDENTIFIER TK_KEY_IS Expression;
Condition : C1 C11;
C11 : /*empty*/
| TK_OR C1 C11
;
C1 : C2 C22;
C22 : /*empty*/
| TK_AND C2 C22
;
C2 : C3 C33;
C33 : TK_EQ C3 C33;
C3 : C4 C44;
C44 : /*empty*/
| TK_LESS C4 C44
| TK_LEQ C4 C44
| TK_GT C4 C44
| TK_GEQ C4 C44
;
C4 : TK_NOT C5 | C5;
C5 : INTEGER | REAL | TK_OPEN Condition TK_CLOSE;
Expression : Term EE;
EE : /*empty*/
| TK_PLUS Term EE
| TK_MINUS Term EE
;
Term : Factor TT;
TT : /*empty*/
| TK_MUL Factor TT
| TK_DIV Factor TT
;
Factor : IDENTIFIER | REAL | INTEGER | TK_OPEN Expression TK_CLOSE;
%%
int main (void) {return yyparse ( );}
node *mknode(node *left, node *right, char *token)
{
/* malloc the node */
node *newnode = (node *)malloc(sizeof(node));
char *newstr = (char *)malloc(strlen(token)+1);
strcpy(newstr, token);
newnode->left = left;
newnode->right = right;
newnode->token = newstr;
return(newnode);
}
void printtree(node *tree)
{
int i;
if (tree->left || tree->right)
printf("(");
printf(" %s ", tree->token);
if (tree->left)
printtree(tree->left);
if (tree->right)
printtree(tree->right);
if (tree->left || tree->right)
printf(")");
}
int yyerror (char *s)
{
fprintf (stderr, "%s\n", s);
}
I want the output to be a parse tree if there are no errors, and an error report otherwise.
But I get a lot of warnings such as
warning: rule useless in grammar
warning: nonterminal useless in grammar
I understood the reason for this by reading other similar questions, but could not correct it myself. Please help me solve this. Thanks!
Hi rici,
Thank you so much. So I need not worry about left recursion, left-factored grammars, etc., and can directly use something like the following in yacc?
%%
Program : Macros Statements;
Macros : /*empty*/
|Macro Macros
;
Macro : TK_KEY_DEFINE MacroTemplate TK_KEY_AS Expression;
MacroTemplate : VarTemplate
| FunTemplate
;
VarTemplate : IDENTIFIER;
FunTemplate : IDENTIFIER TK_OPEN IdentifierList TK_CLOSE;
IdentifierList : IDENTIFIER TK_COMMA IdentifierList
| IDENTIFIER
;
Statements : /*empty*/
| Statement Statements
;
IfStmt : TK_KEY_IF Condition TK_KEY_THEN Statements TK_KEY_ELSE Statements TK_KEY_ENDIF;
Statement : AssignStmt
| IfStmt
;
AssignStmt : IDENTIFIER TK_KEY_IS Expression;
Condition : Condition TK_AND Condition
| Condition TK_OR Condition
| Condition TK_LESS Condition
| Condition TK_LEQ Condition
| Condition TK_GT Condition
| Condition TK_GEQ Condition
| Condition TK_EQ Condition
| TK_NOT Condition
| TK_OPEN Condition TK_CLOSE
| INTEGER
| REAL
;
Expression : Expression TK_PLUS Expression
| Expression TK_MINUS Expression
| Expression TK_MUL Expression
| Expression TK_DIV Expression
| TK_OPEN Expression TK_CLOSE
| IDENTIFIER
| INTEGER
| REAL
;
%%
Also yes, I noted your last point :)
Unlike C11, C22, C44 and other "tail" rules, which can produce %empty, C33 has only one production:
C33 : TK_EQ C3 C33;
Since it has no non-recursive production, it cannot possibly produce a sentence (consisting only of terminals). And since it is part of the only production for C2, which is part of the only production for C1, which is part of the only production for Condition, which is part of the only production for IfStmt, none of those can produce any sentence either. A rule which cannot produce any sentence is technically described as "useless", and a non-terminal all of whose rules are useless (or whose only rule is useless) is a "useless non-terminal".
There is another category of useless non-terminals: those which cannot be produced by any useful rule. That will be the case with C4 (which can only be produced by C3, which has been discovered to be useless) and thus with C44 and C5.
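Both categories can be computed with two small fixed-point passes. The Python sketch below illustrates the idea and is not bison's actual implementation. It runs on a trimmed copy of the grammar: Statements and Expression are simplified away, and only one of the C44 alternatives is kept, which does not change the outcome. The function names and the dictionary encoding are assumptions made for this example.

```python
def productive_nonterminals(grammar):
    """Non-terminals that can derive at least one string of terminals.
    Any symbol that is not a key of `grammar` counts as a terminal."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for head, alternatives in grammar.items():
            if head in productive:
                continue
            if any(all(s in productive or s not in grammar for s in rhs)
                   for rhs in alternatives):
                productive.add(head)
                changed = True
    return productive


def reachable_nonterminals(grammar, start, productive):
    """Non-terminals reachable from `start` through useful rules only."""
    reachable, pending = set(), [start]
    while pending:
        head = pending.pop()
        if head in reachable or head not in productive:
            continue
        reachable.add(head)
        for rhs in grammar[head]:
            # Follow a rule only if every symbol in it is productive.
            if all(s in productive or s not in grammar for s in rhs):
                pending.extend(s for s in rhs if s in grammar)
    return reachable


# Trimmed copy of the question's grammar (simplified, see the text above).
GRAMMAR = {
    "Statement": [["AssignStmt"], ["IfStmt"]],
    "AssignStmt": [["IDENTIFIER", "TK_KEY_IS", "INTEGER"]],
    "IfStmt": [["TK_KEY_IF", "Condition", "TK_KEY_THEN", "Statement", "TK_KEY_ENDIF"]],
    "Condition": [["C1", "C11"]],
    "C11": [[], ["TK_OR", "C1", "C11"]],
    "C1": [["C2", "C22"]],
    "C22": [[], ["TK_AND", "C2", "C22"]],
    "C2": [["C3", "C33"]],
    "C33": [["TK_EQ", "C3", "C33"]],  # no non-recursive alternative
    "C3": [["C4", "C44"]],
    "C44": [[], ["TK_LESS", "C4", "C44"]],
    "C4": [["TK_NOT", "C5"], ["C5"]],
    "C5": [["INTEGER"], ["REAL"], ["TK_OPEN", "Condition", "TK_CLOSE"]],
}

productive = productive_nonterminals(GRAMMAR)
reachable = reachable_nonterminals(GRAMMAR, "Statement", productive)
print("cannot produce a sentence:", sorted(set(GRAMMAR) - productive))
print("productive but unreachable:", sorted(productive - reachable))
```

In this model C33, C2, C1, Condition and IfStmt fall into the first category, while C3, C4, C44 and C5 (along with C11 and C22) are productive but can only be reached through useless rules.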
It should be evident how to fix that, but I'd like to note that you are tying yourself into knots by trying to avoid left-recursion, which is both unnecessary and counter-productive when using a bottom-up parser generator such as bison/yacc. (See the last paragraph of this answer for a longer grumble about this.) The artificial productions (C33 and friends) serve only to complicate the parse tree.
Also, since your grammar is not ambiguous -- in effect, the production rules clearly define operator binding strengths -- the various precedence declarations are pointless. (Unlike "useless", that is not a technical term :-) ). Precedence declarations are only applied to resolve grammatical ambiguity, which is not present here.
Finally, I think you should re-examine your grammar for Conditions. What, for example, is the meaning of ~3 < ~4? And why is x * 2 < y not valid?

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
@parser::members {
private boolean inbounds(Token t, int min, int max) {
int n = Integer.parseInt(t.getText());
return n >= min && n <= max;
}
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like: 'set value abc', 'set underli on', 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
The third syntax: 'set connect oneChar' should accept any character, but just one character. It can be a numeric digit, a letter or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax: 'set value' should accept all possible strings, even on and off. But when I give something like 'set value offer', the grammar fails. I think this is happening because I already have a token 'OFF'.
None of the three requirements listed above work in my grammar, and I don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semi colon at the end. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
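The nested optional groups in those lexer rules follow a regular pattern: a mandatory prefix, then each further letter wrapped in its own optional group. As a way to check which abbreviations such a rule accepts, here is a Python sketch that builds the equivalent regular expression. The helper name abbreviation_pattern is made up, and this is ordinary regex matching rather than anything ANTLR provides.

```python
import re


def abbreviation_pattern(word, minimum):
    """Build a regex that matches `word` or any prefix of it that is at
    least `minimum` characters long."""
    required, optional = word[:minimum], word[minimum:]
    nested = ""
    # Wrap the optional letters from the inside out: (?:e(?:r(?:l ... )?)?)?
    for ch in reversed(optional):
        nested = f"(?:{re.escape(ch)}{nested})?"
    return re.escape(required) + nested


UND = re.compile(abbreviation_pattern("underline", 3))

for candidate in ["und", "underl", "underline", "un", "underx"]:
    print(candidate, UND.fullmatch(candidate) is not None)
```

Only genuine prefixes of at least three letters are accepted, which mirrors what the UND lexer rule does.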
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens, since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefore never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) can then operate on these tokens (not characters!!!). Make sure you understand this! (If not, ask for clarification or grab a book about ANTLR.)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token that the lexer produces, which can only be an A or a B token. So p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
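The effect of rule order can be imitated with a toy tokenizer. The Python sketch below is a deliberately simplified model for single-character rules only, since a real ANTLR lexer decides between rules in a more involved way. The names tokenize and RULES are made up for this illustration.

```python
def tokenize(text, rules):
    """Turn each character into a (rule name, character) token.

    `rules` is an ordered list of (name, predicate) pairs. The first rule
    whose predicate accepts the character wins, which models "FOO is
    defined before BAR".
    """
    tokens = []
    for ch in text:
        for name, accepts in rules:
            if accepts(ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no lexer rule matches {ch!r}")
    return tokens


RULES = [
    ("FOO", lambda ch: ch == "a"),  # FOO : 'a';
    ("BAR", lambda ch: True),       # BAR : . ;
]

print(tokenize("ab?", RULES))
```

The character 'a' always becomes a FOO token and everything else becomes BAR. A parser rule such as prs then sees only these tokens, never the raw characters.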
#5
Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String
String : ...
but do:
pRule : U ':' String
U : 'u';
String : ...
You could make ':' a lexer rule, but that is of less importance. The 'u' however can also be a String so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"set value abc \n" +
"set underli on \n" +
"set conn x \n" +
"set conn xy ";
ANTLRStringStream in = new ANTLRStringStream(source);
firstLexer lexer = new firstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
firstParser parser = new firstParser(tokens);
System.out.println("parsing:\n======\n" + source + "\n======");
parser.parse();
}
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.

eliminate extra spaces in a given ANTLR grammar

In any grammar I create in ANTLR, is it possible to parse the input so that the result of the parsing eliminates any extra spaces? For example, given
int x=5;
if I write
int x = 5 ;
I would like the text to be changed to int x=5; without the extra spaces. Can the parser return the original text without extra spaces?
Can the parser return the original text without extra spaces?
Yes, you need to define a lexer rule that captures these spaces and then skip() them:
Space
: (' ' | '\t') {skip();}
;
which will cause spaces and tabs to be ignored.
PS. I'm assuming you're using Java as the target language. The skip() can be different in other targets (Skip() for C#, for example). You may also want to include \r and \n chars in this rule.
EDIT
Let's say your language only consists of a couple of variable declarations. Assuming you know the basics of ANTLR, the following grammar should be easy to understand:
grammar T;
parse
: stat* EOF
;
stat
: Type Identifier '=' Int ';'
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
And you're parsing the source:
int x = 5 ; double y =5;boolean z = 0 ;
which you'd like to change into:
int x=5;
double y=5;
boolean z=0;
Here's a way to embed code in your grammar and let the parser rules return custom objects (Strings, in this case):
grammar T;
parse returns [String str]
@init{StringBuilder buffer = new StringBuilder();}
@after{$str = buffer.toString();}
: (stat {buffer.append($stat.str).append('\n');})* EOF
;
stat returns [String str]
: Type Identifier '=' Int ';'
{$str = $Type.text + " " + $Identifier.text + "=" + $Int.text + ";";}
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
Test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "int x = 5 ; double y =5;boolean z = 0 ;";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("Result:\n"+parser.parse());
}
}
which produces:
Result:
int x=5;
double y=5;
boolean z=0;
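The same skip-the-whitespace idea can be tried without ANTLR. The following Python sketch is a rough stand-in for grammar T built on the standard re module. The names TOKEN and normalize are made up, and the statement check is hard-coded for the single stat rule rather than generated from a grammar.

```python
import re

# One alternative per lexer rule; Space is matched but never kept.
TOKEN = re.compile(r"""
    (?P<Type>\b(?:int|double|boolean)\b)
  | (?P<Identifier>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<Int>[0-9]+)
  | (?P<Punct>[=;])
  | (?P<Space>[ \t\r\n]+)
""", re.VERBOSE)


def normalize(source):
    """Re-emit each `Type Identifier = Int ;` statement without extra spaces."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "Space":  # the equivalent of skip()
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()

    lines = []
    for i in range(0, len(tokens), 5):
        stat = tokens[i:i + 5]
        kinds = [kind for kind, _ in stat]
        texts = [text for _, text in stat]
        if (kinds != ["Type", "Identifier", "Punct", "Int", "Punct"]
                or texts[2] != "=" or texts[4] != ";"):
            raise ValueError(f"malformed statement: {' '.join(texts)}")
        lines.append(f"{texts[0]} {texts[1]}={texts[3]};")
    return "\n".join(lines)


print(normalize("int x   = 5 ; double y     =5;boolean z      = 0  ;"))
```

As in the grammar, whitespace never reaches the statement level, so the output spacing is decided entirely by the code that reassembles the tokens.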