I've written a simple grammar for a language meant to be used in graph-based dialogue systems (primarily for video games).
Here are the grammars:
parser grammar DialogueScriptParser;
options {
tokenVocab = DialogueScriptLexer;
}
// Entry Point
script: scheduled_block* EOF;
// Scheduled Blocks
scheduled_block:
scheduled_block_open block scheduled_block_close;
scheduled_block_open: LT flag_list? LT;
scheduled_block_close: GT flag_list? GT;
block: statement*;
// Statements
statement:
if_statement
| switch_statement
| compound_statement
| expression_statement
| declaration_statement;
// Compound Statement
compound_statement: LBRACE statement_list? RBRACE;
statement_list: statement+;
// Expression Statement
expression_statement: expression SEMI;
// If Statement
if_statement:
IF LPAREN expression RPAREN statement (ELSE statement)?;
// Switch Statement
switch_statement: SWITCH LPAREN expression LPAREN switch_block;
switch_block: LBRACE switch_label* RBRACE;
switch_label: CASE expression COLON | DEFAULT COLON;
// Declaration Statement
declaration_statement: type declarator_init SEMI;
declarator_init: declarator (ASSIGN expression)?;
declarator: IDENTIFIER;
// Expression
expression_list: expression (COMMA expression);
expression:
name
| literal
| LPAREN expression RPAREN
| expression (INC | DEC)
| expression LBRACK expression RBRACK
| expression LPAREN expression_list? RPAREN
| expression LBRACE expression_list? RBRACE
| (SUB | ADD | INC | DEC | NOT | BIT_NOT) expression
| expression TURNARY expression COLON expression
| expression mul_div_mod_operator expression
| expression add_sub_operator expression
| <assoc = right> expression concat_operator expression
| expression relational_operator expression
| expression and_operator expression
| expression or_operator expression
| expression bitwise_operator expression;
// Operators
concat_operator: CONCAT;
and_operator: AND;
or_operator: OR;
add_sub_operator: ADD | SUB;
mul_div_mod_operator: MUL | DIV | MOD;
relational_operator: GT | LT | LE | GE;
equality_operator: EQUAL | NOTEQUAL;
bitwise_operator:
| '<' '<'
| '>' '>'
| BIT_AND
| BIT_OR
| BIT_XOR;
assignment_operator:
ASSIGN
| ADD_ASSIGN
| SUB_ASSIGN
| MUL_ASSIGN
| DIV_ASSIGN
| AND_ASSIGN
| OR_ASSIGN
| XOR_ASSIGN
| MOD_ASSIGN
| LSHIFT_ASSIGN
| RSHIFT_ASSIGN;
// Types
type:
primitive_type
| type LBRACK RBRACK
| name ('<' type (COMMA type)* '>')?;
primitive_type:
TYPE_BOOLEAN
| TYPE_CHAR
| TYPE_FLOAT_DEFAULT
| TYPE_FLOAT32
| TYPE_FLOAT64
| TYPE_INT_DEFAULT
| TYPE_INT8
| TYPE_INT16
| TYPE_INT32
| TYPE_INT64
| TYPE_UINT_DEFAULT
| TYPE_UINT8
| TYPE_UINT16
| TYPE_UINT32
| TYPE_UINT64
| TYPE_STRING;
// Name
name: namespace? IDENTIFIER (DOT IDENTIFIER)*;
// Namespace
namespace: IDENTIFIER COLONCOLON (namespace)*;
// Flags
flag_list: IDENTIFIER (COMMA IDENTIFIER)*;
// Literals
literal:
INTEGER_LITERAL
| FLOATING_POINT_LITERAL
| BOOLEAN_LITERAL
| CHARACTER_LITERAL
| STRING_LITERAL
| NULL_LITERAL;
and:
lexer grammar DialogueScriptLexer;
// Keywords
TYPE_BOOLEAN: 'bool';
TYPE_CHAR: 'char'; // 16 bits
TYPE_FLOAT_DEFAULT: 'float'; // 32 bits
TYPE_FLOAT32: 'float32';
TYPE_FLOAT64: 'float64';
TYPE_INT_DEFAULT: 'int'; // 32 bits
TYPE_INT8: 'int8';
TYPE_INT16: 'int16';
TYPE_INT32: 'int32';
TYPE_INT64: 'int64';
TYPE_UINT_DEFAULT: 'uint'; // 32 bits
TYPE_UINT8: 'uint8';
TYPE_UINT16: 'uint16';
TYPE_UINT32: 'uint32';
TYPE_UINT64: 'uint64';
TYPE_STRING: 'string'; // 16 bits per character
BREAK: 'break'; // used for switch
CASE: 'case'; // used for switch
DEFAULT: 'default'; // used for switch
IF: 'if';
ELSE: 'else';
SWITCH: 'switch';
// Integer Literals
INTEGER_LITERAL:
DecIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral;
fragment DecIntegerLiteral: '0' | NonZeroDigit Digit*;
fragment HexIntegerLiteral: '0' [xX] HexDigit+;
fragment OctalIntegerLiteral: '0' Digit+;
fragment BinaryIntegerLiteral: '0' [bB] BinaryDigit+;
// Floating-Point Literals
FLOATING_POINT_LITERAL: DecFloatingPointLiteral;
fragment DecFloatingPointLiteral:
DecIntegerLiteral? ('.' Digits) FloatTypeSuffix?
| DecIntegerLiteral FloatTypeSuffix?;
// Boolean Literals
BOOLEAN_LITERAL: 'true' | 'false';
// Character Literals
CHARACTER_LITERAL:
'\'' SingleCharacter '\''
| '\'' EscapeSequence '\'';
fragment SingleCharacter: ~['\\\r\n];
fragment EscapeSequence:
'\\\''
| '\\"'
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v';
// String Literals
STRING_LITERAL: '"' StringCharacters? '"';
fragment StringCharacters: StringCharacter+;
fragment StringCharacter: ~["\\\r\n] | EscapeSequence;
// Null Literal
NULL_LITERAL: 'null';
// Separators
LPAREN: '(';
RPAREN: ')';
LBRACE: '{';
RBRACE: '}';
LBRACK: '[';
RBRACK: ']';
SEMI: ';';
COMMA: ',';
DOT: '.';
COLON: ':';
COLONCOLON: '::';
// Operators
ASSIGN: '=';
ADD_ASSIGN: '+=';
SUB_ASSIGN: '-=';
MUL_ASSIGN: '*=';
DIV_ASSIGN: '/=';
AND_ASSIGN: '&=';
OR_ASSIGN: '|=';
XOR_ASSIGN: '^=';
MOD_ASSIGN: '%=';
LSHIFT_ASSIGN: '<<=';
RSHIFT_ASSIGN: '>>=';
GT: '>';
LT: '<';
EQUAL: '==';
LE: '<=';
GE: '>=';
NOTEQUAL: '!=';
NOT: '!';
BIT_NOT: '~';
BIT_AND: '&';
BIT_OR: '|';
BIT_XOR: '^';
/* Defining these here make recognizing scheduled blocks difficult BIT_SHIFT_L: '<<'; BIT_SHIFT_R:
'>>';
*/
AND: '&&';
OR: '||';
INC: '++';
DEC: '--';
ADD: '+';
SUB: '-';
MUL: '*';
DIV: '/';
MOD: '%';
CONCAT: '..';
TURNARY: '?';
// Identifiers
/* Order affects precedence IDENTFIER must come last. */
IDENTIFIER: Letter LetterOrDigit*;
fragment LetterOrDigit: Letter | Digit;
fragment Digits: Digit+;
fragment Digit: '0' | NonZeroDigit;
fragment NonZeroDigit: [1-9];
fragment HexDigit: [0-9a-fA-F];
fragment BinaryDigit: [01];
fragment Letter: [a-zA-Z_];
fragment FloatTypeSuffix: [fFdD];
// Whitespace and Comments
WHITESPACE: [ \t\r\n\u000C]+ -> skip;
COMMENT_BLOCK: '/*' .*? '*/' -> channel(HIDDEN);
COMMENT_LINE: '//' ~[\r\n]* -> channel(HIDDEN);
Project: https://github.com/Sahasrara/DialogueScript
The grammar is working for the most part, but it's struggling to parse certain types of expressions.
Example:
<<
if (intVar == 10 && globalFunc() || "string lit" .. "concat string" == stringVar)
{
anotherFunc();
}
>>
Here's the output tree:
I know there's a precedence issue here, but I'm not entirely sure how to resolve it. Would someone mind pointing me in the right direction?
You are not using the equality_operator rule that contains the == operator. Place it somewhere in your expression rule:
expression
: ...
| expression add_sub_operator expression
| expression equality_operator expression
| <assoc = right> expression concat_operator expression
| ...
;
When placed there, it will have a lower precedence than + and -, and a higher precedence than ..:
Also note that the assignment_operator is not used currently.
I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase.
As you've used them, you're turning many of your intended Lexer rules, effectively into fragments.
I't important to understand that tokens are the "atoms" you have in your grammar, when you combine several of them into another Lexer rule, you just make that the token type.
(If you used grun to dump the tokens you would have seen them identified as REL_OP tokens.
With the changes below, your sample input works just fine.
grammar MLTL1
;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);
So, I'm trying to assign a method value to a var in a test program, I'm using a Decaf grammar.
The grammar:
// Define decaf grammar
grammar Decaf;
// Reglas LEXER
// Definiciones base para letras y digitos
fragment LETTER: ('a'..'z'|'A'..'Z'|'_');
fragment DIGIT: '0'..'9';
// Las otras reglas de lexer de Decaf
ID: LETTER (LETTER|DIGIT)*;
NUM: DIGIT(DIGIT)*;
CHAR: '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
WS : [ \t\r\n\f]+ -> channel(HIDDEN);
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
// -----------------------------------------------------------------------------------------------------------------------------------------
// Reglas PARSER
program:'class' 'Program' '{' (declaration)* '}';
declaration
: structDeclaration
| varDeclaration
| methodDeclaration
;
varDeclaration
: varType ID ';'
| varType ID '[' NUM ']' ';'
;
structDeclaration:'struct' ID '{' (varDeclaration)* '}' (';')?;
varType
: 'int'
| 'char'
| 'boolean'
| 'struct' ID
| structDeclaration
| 'void'
;
methodDeclaration: methodType ID '(' (parameter (',' parameter)*)* ')' block;
methodType
: 'int'
| 'char'
| 'boolean'
| 'void'
;
parameter
: parameterType ID
| parameterType ID '[' ']'
| 'void'
;
parameterType
: 'int'
| 'char'
| 'boolean'
;
block: '{' (varDeclaration)* (statement)* '}';
statement
: 'if' '(' expression ')' block ( 'else' block )? #stat_if
| 'while' '('expression')' block #stat_else
| 'return' expressionOom ';' #stat_return
| methodCall ';' #stat_mcall
| block #stat_block
| location '=' expression #stat_assignment
| (expression)? ';' #stat_line
;
expressionOom: expression |;
location: (ID|ID '[' expression ']') ('.' location)?;
expression
: location #expr_loc
| methodCall #expr_mcall
| literal #expr_literal
| '-' expression #expr_minus // Unary Minus Operation
| '!' expression #expr_not // Unary NOT Operation
| '('expression')' #expr_parenthesis
| expression arith_op_fifth expression #expr_arith5 // * / % << >>
| expression arith_op_fourth expression #expr_arith4 // + -
| expression arith_op_third expression #expr_arith3 // == != < <= > >=
| expression arith_op_second expression #expr_arith2 // &&
| expression arith_op_first expression #expr_arith1 // ||
;
methodCall: ID '(' arg1 ')';
// Puede ir algo que coincida con arg2 o nada, en caso de una llamada a metodo sin parametro
arg1: arg2 |;
// Expression y luego se utiliza * para permitir 0 o más parametros adicionales
arg2: (arg)(',' arg)*;
arg: expression;
// Operaciones
// Divididas por nivel de precedencia
// Especificación de precedencia: https://anoopsarkar.github.io/compilers-class/decafspec.html
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
arith_op_fifth: '*' | '/' | '%' | '<<' | '>>';
arith_op_fourth: '+' | '-';
arith_op_third: rel_op | eq_op;
arith_op_second: '&&';
arith_op_first: '||';
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
And the test program is as follows:
class Program
{
int factorial(int b)
{
int n;
n = 1;
return n+2;
}
void main(void)
{
int a;
int b;
b=0;
a=factorial(b);
factorial(b);
return;
}
}
The parse tree for this program looks as following, at least for the part I'm interested which is a=factorial(b):
This tree is wrong, since it should look like location = expression -> methodCall
The following tree is how it looks on a friend's implementation, and it should sort of look like this if the grammar was correctly implemented:
This is correctly implemented, or the result I need, since I want the tree to look like location = expression -> methodCall and not location = expression -> location. If I remove the parameter from a=factorial(b) and leave it as a=factorial(), it will be read correctly as a methodCall, so I'm not sure what I'm missing.
So my question is, I'm not sure where I'm messing up in the grammar, I guess it's either on location or expression, but I'm not sure how to adjust it to behave the way I want it to. I sort of just got the rules literally from the specification we were provided.
In an ANTLR rule, alternatives are matches from top to bottom. So in your expression rule:
expression
: location #expr_loc
| methodCall #expr_mcall
...
;
the generated parser will try to match a location before it tries to match a methodCall. Try swapping those two around:
expression
: methodCall #expr_mcall
| location #expr_loc
...
;
Im trying to skip/ignore the text outside a custom tag:
This text is a unique token to skip < ?compo \5+5\ ?> also this < ?compo \1+1\ ?>
I tried with the follow lexer:
TAG_OPEN : '<?compo ' -> pushMode(COMPOSER);
mode COMPOSER;
TAG_CLOSE : ' ?>' -> popMode;
NUMBER_DIGIT : '1'..'9';
ZERO : '0';
LOGICOP
: OR
| AND
;
COMPAREOP
: EQ
| NE
| GT
| GE
| LT
| LE
;
WS : ' ';
NEWLINE : ('\r\n'|'\n'|'\r');
TAB : ('\t');
...
and parser:
instructions
: (TAG_OPEN statement TAG_CLOSE)+?;
statement
: if_statement
| else
| else_if
| if_end
| operation_statement
| mnemonic
| comment
| transparent;
But it doesn't work (I test it by using the intelliJ tester on the rule "instructions")...
I have also add some skip rules outside the "COMPOSER" mode:
TEXT_SKIP : TAG_CLOSE .*? (TAG_OPEN | EOF) -> skip;
But i don't have any results...
Someone can help me?
EDIT:
I change "instructions" and now the parser tree is correctly builded for every instruction of every tag:
instructions : (.*? TAG_OPEN statement TAG_CLOSE .*?)+;
But i have a not recognized character error outside the the tags...
Below is a quick demo that worked for me.
Lexer grammar:
lexer grammar CompModeLexer;
TAG_OPEN
: '<?compo' -> pushMode(COMPOSER)
;
OTHER
: . -> skip
;
mode COMPOSER;
TAG_CLOSE
: '?>' -> popMode
;
OPAR
: '('
;
CPAR
: ')'
;
INT
: '0'
| [1-9] [0-9]*
;
LOGICOP
: 'AND'
| 'OR'
;
COMPAREOP
: [<>!] '='
| [<>=]
;
MULTOP
: [*/%]
;
ADDOP
: [+-]
;
SPACE
: [ \t\r\n\f] -> skip
;
Parser grammar:
parser grammar CompModeParser;
options {
tokenVocab=CompModeLexer;
}
parse
: tag* EOF
;
tag
: TAG_OPEN statement TAG_CLOSE
;
statement
: expr
;
expr
: '(' expr ')'
| expr MULTOP expr
| expr ADDOP expr
| expr COMPAREOP expr
| expr LOGICOP expr
| INT
;
A test with the input This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?> resulted in the following tree:
I found another solution (not elegant as the previous):
Create a generic TEXT token in the general context (so outside the tag's mode)
TEXT : ( ~[<] | '<' ~[?])+ -> skip;
Create a parser rule for handle a generic text
code
: TEXT
| (TEXT? instruction TEXT?)+;
Create a parser rule for handle an instruction
instruction
: TAG_OPEN statement TAG_CLOSE;
Here are part of my grammar:
expr_address
: expr_address_category expr_opt { $$ = new ExprAddress($1,*$2);}
| axis axis_data { $$ = new ExprAddress($1,*$2);}
;
axis_data
: expr_opt { $$ = $1;}
| sign { if($1 == MINUS)
$$ = new IntergerExpr(-1000000000);
else if($1 == PLUS)
$$ = new IntergerExpr(+1000000000);}
;
expr_opt
: { $$ = new IntergerExpr(0);}
| expr { $$ = $1;}
;
expr_address_category
: I { $$ = NCAddress_I;}
| J { $$ = NCAddress_J;}
| K { $$ = NCAddress_K;}
;
axis
: X { $$ = NCAddress_X;}
| Y { $$ = NCAddress_Y;}
| Z { $$ = NCAddress_Z;}
| U { $$ = NCAddress_U;}
| V { $$ = NCAddress_V;}
| W { $$ = NCAddress_W;}
;
expr
: '[' expr ']' {$$ = $2;}
| COS parenthesized_expr {$$ = new BuiltinMethodCallExpr(COS,*$2);}
| SIN parenthesized_expr {$$ = new BuiltinMethodCallExpr(SIN,*$2);}
| ATAN parenthesized_expr {$$ = new BuiltinMethodCallExpr(ATAN,*$2);}
| SQRT parenthesized_expr {$$ = new BuiltinMethodCallExpr(SQRT,*$2);}
| ROUND parenthesized_expr {$$ = new BuiltinMethodCallExpr(ROUND,*$2);}
| variable {$$ = $1;}
| literal
| expr '+' expr {$$ = new BinaryOperatorExpr(*$1,PLUS,*$3);}
| expr '-' expr {$$ = new BinaryOperatorExpr(*$1,MINUS,*$3);}
| expr '*' expr {$$ = new BinaryOperatorExpr(*$1,MUL,*$3);}
| expr '/' expr {$$ = new BinaryOperatorExpr(*$1,DIV,*$3);}
| sign expr %prec UMINUS {$$ = new UnaryOperatorExpr($1,*$2);}
| expr EQ expr {$$ = new BinaryOperatorExpr(*$1,EQ,*$3);}
| expr NE expr {$$ = new BinaryOperatorExpr(*$1,NE,*$3);}
| expr GT expr {$$ = new BinaryOperatorExpr(*$1,GT,*$3);}
| expr GE expr {$$ = new BinaryOperatorExpr(*$1,GE,*$3);}
| expr LT expr {$$ = new BinaryOperatorExpr(*$1,LT,*$3);}
| expr LE expr {$$ = new BinaryOperatorExpr(*$1,LE,*$3);}
;
variable
: d_h_address {$$ = new AddressExpr(*$1);}
;
d_h_address
: D INTEGER_LITERAL { $$ = new IntAddress(NCAddress_D,$2);}
| H INTEGER_LITERAL { $$ = new IntAddress(NCAddress_H,$2);}
;
I hope my grammar support that like:
H100=20;
X;
X+0;
X+;
X+H100; //means H100 variable ref
The top two are same with X0; By the way,sign -> +/-;
But bison report conflicts,the key part of bison.output:
State 108
11 expr: sign . expr
64 axis_data: sign .
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 64 (axis_data)]
H [reduce using rule 64 (axis_data)]
$default reduce using rule 64 (axis_data)
State 69
62 expr_address: axis . axis_data
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
State 68
61 expr_address: expr_address_category . expr_opt
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
I don't know how to deal with this,thanks advance.
EDIT:
I make a minimal grammar:
%{
#include <stdio.h>
extern "C" int yylex();
void yyerror(const char *s) { printf("ERROR: %s/n", s); }
%}
%token PLUS '+' MINUS '-'
%token D H I J K X Y Z INT
/*%type sign expr var expr_address_category expr_opt
%type axis */
%start word_list
%%
/*Above grammar lost this rule,it makes ambiguous*/
word_list
: word
| word_list word
;
sign
: PLUS
| MINUS
;
expr
: var
| sign expr
| '[' expr ']'
;
var
: D INT
| H INT
;
word
: expr_address
| var '=' expr
;
expr_address
: expr_address_category expr_opt
/*| '(' axis sign ')'*/
| axis sign
;
expr_opt
: /* empty */
| expr
;
expr_address_category
: I
| J
| K
| axis
;
axis
: X
| Y
| Z
;
%%
and I hope it can support:
X;
X0;
X+0; //the top three are same with X0
X+;
X+H100; //this means X's data is ref +H100;
X+H100=10; //two word on a block,X+ and H100=10;
XH100=10; //two word on a block,X and H100=10;
EDIT2:
The above EDIT lost this rule.
block
: word_list ';'
| ';'
;
Because I have to allow such grammar:
H000 = 100 H001 = 200 H002 = 300;
This is essentially the classic LR(2) grammar, except that in your case it is LR(3) because your variables consist of two tokens [Note 1]:
var : D INT | H INT
The basic problem is the concatenation of words without separators:
word_list : word | word_list word
combined with the fact that one of the options for word ends with an optional var:
word: expr_address
expr_address: expr_address_category expr_opt
while the other one starts with a var:
word: var '=' expr
The = makes this unambiguous, since nothing in an expr can contain that symbol. But at the point where a decision needs to be made, the = is not visible, because the lookahead is the first token of a var -- either an H or a D -- and the equals sign is still two tokens away.
This LR(2) grammar is very similar to the grammar used by yacc/bison itself, a fact which I always find to be ironic, because the grammar for yacc does not require ; between productions:
production: SYMBOL ':' | production SYMBOL /* Lots of detail omitted */
As with your grammar, this makes it impossible to know whether a SYMBOL should be shifted or trigger a reduce because the disambiguating : is still not visible.
Since the grammar is (I assume) unambiguous, and bison can now generate GLR parsers, that will be the simplest solution: just add
%glr-parser
to your bison prologue (but read the section of the bison manual on GLR parsers to understand the trade-off).
Note that the shift-reduce conflicts will still be reported as warnings; since it is impossible to reliably decide whether a grammar is ambiguous, bison doesn't attempt to do so and ambiguities will be reported at run-time if they exist.
You should also fix the issue mentioned in #ChrisDodd's answer regarding the refactoring of expr_address (although with a GLR parser it is not strictly necessary).
If, for whatever reason, you feel that a GLR parser will not meet your needs, you could use the solution in most implementations of yacc (including bison), which is a hack in the lexical scanner. The basic idea is to mark whether a symbol is followed by a colon or not in the lexer, so that the above production could be rewritten as:
production: SYMBOL_COLON | production SYMBOL
This solution would work for you if you were willing to combine the letter and the number into a single token:
word: expr_address expr_opt
| VARIABLE_EQUALS expr
// ...
expr: VARIABLE
My preference is to do this transformation in a wrapper around the lexer, which keeps a (one-element) queue of pending tokens:
/* The use of static variables makes this yylex wrapper unreliable
* if it is reused after a syntax error.
*/
int yylex_wrapper() {
static int saved_token = -1;
static YYSTYPE saved_yylval = {0};
int token = saved_token;
saved_token = -1;
yylval = saved_yylval;
// Read a new token only if we don't have one in the queue.
if (token < 0) token = yylex();
// If the current token is IDENTIFIER, check the next token
if (token == IDENTIFIER) {
// Read the next token into the queue (saved_token / saved_yylval)
YYSTYPE temp_val = yylval;
saved_token = yylex();
saved_yylval = yylval;
yylval = temp_val;
// If the second token is '=', then modify the current token
// and delete the '=' from the queue
if (saved_token == '=') {
saved_token = -1;
token = IDENTIFIER_EQUALS;
}
}
return token;
}
Notes
Personally, I would start by making a var a single token (do you really want to allow people to write:
H /* Some comment in the middle of the variable name */ 100
but that's not going to solve any problems; it merely reduces the grammar's lookahead requirement from LR(3) to LR(2).
The main problem is that it can't figure out where one word in a word_list ends and the next one begins, because there is no separator token between words. This is in contrast to your examples, which all have ; terminators. So that suggests one obvious fix -- put in the ; separators:
word: expr_address ';'
| var '=' expr ';'
That fixes most of the problems, but leaves a lookahead conflict where it can't decide whether an axis is an expr_address_category or not when the lookahead is a sign, because it depends on whether there's an expr after the sign or not. You can fix that by refactoring to defer deciding:
expr_address
: expr_address_category expr_opt
| axis expr_opt
| axis sign
..and remove axis from expr_address_category