ANTLR 4 parser grammar has trouble with if expression - antlr

I've written a simple grammar for a language meant to be used in graph-based dialogue systems (primarily for video games).
Here are the grammars:
parser grammar DialogueScriptParser;
options {
tokenVocab = DialogueScriptLexer;
}
// Entry Point
script: scheduled_block* EOF;
// Scheduled Blocks
scheduled_block:
scheduled_block_open block scheduled_block_close;
scheduled_block_open: LT flag_list? LT;
scheduled_block_close: GT flag_list? GT;
block: statement*;
// Statements
statement:
if_statement
| switch_statement
| compound_statement
| expression_statement
| declaration_statement;
// Compound Statement
compound_statement: LBRACE statement_list? RBRACE;
statement_list: statement+;
// Expression Statement
expression_statement: expression SEMI;
// If Statement
if_statement:
IF LPAREN expression RPAREN statement (ELSE statement)?;
// Switch Statement
switch_statement: SWITCH LPAREN expression LPAREN switch_block;
switch_block: LBRACE switch_label* RBRACE;
switch_label: CASE expression COLON | DEFAULT COLON;
// Declaration Statement
declaration_statement: type declarator_init SEMI;
declarator_init: declarator (ASSIGN expression)?;
declarator: IDENTIFIER;
// Expression
expression_list: expression (COMMA expression);
expression:
name
| literal
| LPAREN expression RPAREN
| expression (INC | DEC)
| expression LBRACK expression RBRACK
| expression LPAREN expression_list? RPAREN
| expression LBRACE expression_list? RBRACE
| (SUB | ADD | INC | DEC | NOT | BIT_NOT) expression
| expression TURNARY expression COLON expression
| expression mul_div_mod_operator expression
| expression add_sub_operator expression
| <assoc = right> expression concat_operator expression
| expression relational_operator expression
| expression and_operator expression
| expression or_operator expression
| expression bitwise_operator expression;
// Operators
concat_operator: CONCAT;
and_operator: AND;
or_operator: OR;
add_sub_operator: ADD | SUB;
mul_div_mod_operator: MUL | DIV | MOD;
relational_operator: GT | LT | LE | GE;
equality_operator: EQUAL | NOTEQUAL;
bitwise_operator:
| '<' '<'
| '>' '>'
| BIT_AND
| BIT_OR
| BIT_XOR;
assignment_operator:
ASSIGN
| ADD_ASSIGN
| SUB_ASSIGN
| MUL_ASSIGN
| DIV_ASSIGN
| AND_ASSIGN
| OR_ASSIGN
| XOR_ASSIGN
| MOD_ASSIGN
| LSHIFT_ASSIGN
| RSHIFT_ASSIGN;
// Types
type:
primitive_type
| type LBRACK RBRACK
| name ('<' type (COMMA type)* '>')?;
primitive_type:
TYPE_BOOLEAN
| TYPE_CHAR
| TYPE_FLOAT_DEFAULT
| TYPE_FLOAT32
| TYPE_FLOAT64
| TYPE_INT_DEFAULT
| TYPE_INT8
| TYPE_INT16
| TYPE_INT32
| TYPE_INT64
| TYPE_UINT_DEFAULT
| TYPE_UINT8
| TYPE_UINT16
| TYPE_UINT32
| TYPE_UINT64
| TYPE_STRING;
// Name
name: namespace? IDENTIFIER (DOT IDENTIFIER)*;
// Namespace
namespace: IDENTIFIER COLONCOLON (namespace)*;
// Flags
flag_list: IDENTIFIER (COMMA IDENTIFIER)*;
// Literals
literal:
INTEGER_LITERAL
| FLOATING_POINT_LITERAL
| BOOLEAN_LITERAL
| CHARACTER_LITERAL
| STRING_LITERAL
| NULL_LITERAL;
and:
lexer grammar DialogueScriptLexer;
// Keywords
TYPE_BOOLEAN: 'bool';
TYPE_CHAR: 'char'; // 16 bits
TYPE_FLOAT_DEFAULT: 'float'; // 32 bits
TYPE_FLOAT32: 'float32';
TYPE_FLOAT64: 'float64';
TYPE_INT_DEFAULT: 'int'; // 32 bits
TYPE_INT8: 'int8';
TYPE_INT16: 'int16';
TYPE_INT32: 'int32';
TYPE_INT64: 'int64';
TYPE_UINT_DEFAULT: 'uint'; // 32 bits
TYPE_UINT8: 'uint8';
TYPE_UINT16: 'uint16';
TYPE_UINT32: 'uint32';
TYPE_UINT64: 'uint64';
TYPE_STRING: 'string'; // 16 bits per character
BREAK: 'break'; // used for switch
CASE: 'case'; // used for switch
DEFAULT: 'default'; // used for switch
IF: 'if';
ELSE: 'else';
SWITCH: 'switch';
// Integer Literals
INTEGER_LITERAL:
DecIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral;
fragment DecIntegerLiteral: '0' | NonZeroDigit Digit*;
fragment HexIntegerLiteral: '0' [xX] HexDigit+;
fragment OctalIntegerLiteral: '0' Digit+;
fragment BinaryIntegerLiteral: '0' [bB] BinaryDigit+;
// Floating-Point Literals
FLOATING_POINT_LITERAL: DecFloatingPointLiteral;
fragment DecFloatingPointLiteral:
DecIntegerLiteral? ('.' Digits) FloatTypeSuffix?
| DecIntegerLiteral FloatTypeSuffix?;
// Boolean Literals
BOOLEAN_LITERAL: 'true' | 'false';
// Character Literals
CHARACTER_LITERAL:
'\'' SingleCharacter '\''
| '\'' EscapeSequence '\'';
fragment SingleCharacter: ~['\\\r\n];
fragment EscapeSequence:
'\\\''
| '\\"'
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v';
// String Literals
STRING_LITERAL: '"' StringCharacters? '"';
fragment StringCharacters: StringCharacter+;
fragment StringCharacter: ~["\\\r\n] | EscapeSequence;
// Null Literal
NULL_LITERAL: 'null';
// Separators
LPAREN: '(';
RPAREN: ')';
LBRACE: '{';
RBRACE: '}';
LBRACK: '[';
RBRACK: ']';
SEMI: ';';
COMMA: ',';
DOT: '.';
COLON: ':';
COLONCOLON: '::';
// Operators
ASSIGN: '=';
ADD_ASSIGN: '+=';
SUB_ASSIGN: '-=';
MUL_ASSIGN: '*=';
DIV_ASSIGN: '/=';
AND_ASSIGN: '&=';
OR_ASSIGN: '|=';
XOR_ASSIGN: '^=';
MOD_ASSIGN: '%=';
LSHIFT_ASSIGN: '<<=';
RSHIFT_ASSIGN: '>>=';
GT: '>';
LT: '<';
EQUAL: '==';
LE: '<=';
GE: '>=';
NOTEQUAL: '!=';
NOT: '!';
BIT_NOT: '~';
BIT_AND: '&';
BIT_OR: '|';
BIT_XOR: '^';
/* Defining these here make recognizing scheduled blocks difficult BIT_SHIFT_L: '<<'; BIT_SHIFT_R:
'>>';
*/
AND: '&&';
OR: '||';
INC: '++';
DEC: '--';
ADD: '+';
SUB: '-';
MUL: '*';
DIV: '/';
MOD: '%';
CONCAT: '..';
TURNARY: '?';
// Identifiers
/* Order affects precedence IDENTFIER must come last. */
IDENTIFIER: Letter LetterOrDigit*;
fragment LetterOrDigit: Letter | Digit;
fragment Digits: Digit+;
fragment Digit: '0' | NonZeroDigit;
fragment NonZeroDigit: [1-9];
fragment HexDigit: [0-9a-fA-F];
fragment BinaryDigit: [01];
fragment Letter: [a-zA-Z_];
fragment FloatTypeSuffix: [fFdD];
// Whitespace and Comments
WHITESPACE: [ \t\r\n\u000C]+ -> skip;
COMMENT_BLOCK: '/*' .*? '*/' -> channel(HIDDEN);
COMMENT_LINE: '//' ~[\r\n]* -> channel(HIDDEN);
Project: https://github.com/Sahasrara/DialogueScript
The grammar is working for the most part, but it's struggling to parse certain types of expressions.
Example:
<<
if (intVar == 10 && globalFunc() || "string lit" .. "concat string" == stringVar)
{
anotherFunc();
}
>>
Here's the output tree:
I know there's a precedence issue here, but I'm not entirely sure how to resolve it. Would someone mind pointing me in the right direction?

You are not using the equality_operator rule that contains the == operator. Place it somewhere in your expression rule:
expression
: ...
| expression add_sub_operator expression
| expression equality_operator expression
| <assoc = right> expression concat_operator expression
| ...
;
When placed there, it will have a lower precedence than + and -, and a higher precedence than ..:
Also note that the assignment_operator is not used currently.

Related

ANTLR grammar not picking the right option

So, I'm trying to assign a method value to a var in a test program, I'm using a Decaf grammar.
The grammar:
// Define decaf grammar
grammar Decaf;
// Reglas LEXER
// Definiciones base para letras y digitos
fragment LETTER: ('a'..'z'|'A'..'Z'|'_');
fragment DIGIT: '0'..'9';
// Las otras reglas de lexer de Decaf
ID: LETTER (LETTER|DIGIT)*;
NUM: DIGIT(DIGIT)*;
CHAR: '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
WS : [ \t\r\n\f]+ -> channel(HIDDEN);
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
// -----------------------------------------------------------------------------------------------------------------------------------------
// Reglas PARSER
program:'class' 'Program' '{' (declaration)* '}';
declaration
: structDeclaration
| varDeclaration
| methodDeclaration
;
varDeclaration
: varType ID ';'
| varType ID '[' NUM ']' ';'
;
structDeclaration:'struct' ID '{' (varDeclaration)* '}' (';')?;
varType
: 'int'
| 'char'
| 'boolean'
| 'struct' ID
| structDeclaration
| 'void'
;
methodDeclaration: methodType ID '(' (parameter (',' parameter)*)* ')' block;
methodType
: 'int'
| 'char'
| 'boolean'
| 'void'
;
parameter
: parameterType ID
| parameterType ID '[' ']'
| 'void'
;
parameterType
: 'int'
| 'char'
| 'boolean'
;
block: '{' (varDeclaration)* (statement)* '}';
statement
: 'if' '(' expression ')' block ( 'else' block )? #stat_if
| 'while' '('expression')' block #stat_else
| 'return' expressionOom ';' #stat_return
| methodCall ';' #stat_mcall
| block #stat_block
| location '=' expression #stat_assignment
| (expression)? ';' #stat_line
;
expressionOom: expression |;
location: (ID|ID '[' expression ']') ('.' location)?;
expression
: location #expr_loc
| methodCall #expr_mcall
| literal #expr_literal
| '-' expression #expr_minus // Unary Minus Operation
| '!' expression #expr_not // Unary NOT Operation
| '('expression')' #expr_parenthesis
| expression arith_op_fifth expression #expr_arith5 // * / % << >>
| expression arith_op_fourth expression #expr_arith4 // + -
| expression arith_op_third expression #expr_arith3 // == != < <= > >=
| expression arith_op_second expression #expr_arith2 // &&
| expression arith_op_first expression #expr_arith1 // ||
;
methodCall: ID '(' arg1 ')';
// Puede ir algo que coincida con arg2 o nada, en caso de una llamada a metodo sin parametro
arg1: arg2 |;
// Expression y luego se utiliza * para permitir 0 o más parametros adicionales
arg2: (arg)(',' arg)*;
arg: expression;
// Operaciones
// Divididas por nivel de precedencia
// Especificación de precedencia: https://anoopsarkar.github.io/compilers-class/decafspec.html
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
arith_op_fifth: '*' | '/' | '%' | '<<' | '>>';
arith_op_fourth: '+' | '-';
arith_op_third: rel_op | eq_op;
arith_op_second: '&&';
arith_op_first: '||';
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
And the test program is as follows:
class Program
{
int factorial(int b)
{
int n;
n = 1;
return n+2;
}
void main(void)
{
int a;
int b;
b=0;
a=factorial(b);
factorial(b);
return;
}
}
The parse tree for this program looks as following, at least for the part I'm interested which is a=factorial(b):
This tree is wrong, since it should look like location = expression -> methodCall
The following tree is how it looks on a friend's implementation, and it should sort of look like this if the grammar was correctly implemented:
This is correctly implemented, or the result I need, since I want the tree to look like location = expression -> methodCall and not location = expression -> location. If I remove the parameter from a=factorial(b) and leave it as a=factorial(), it will be read correctly as a methodCall, so I'm not sure what I'm missing.
So my question is, I'm not sure where I'm messing up in the grammar, I guess it's either on location or expression, but I'm not sure how to adjust it to behave the way I want it to. I sort of just got the rules literally from the specification we were provided.
In an ANTLR rule, alternatives are matches from top to bottom. So in your expression rule:
expression
: location #expr_loc
| methodCall #expr_mcall
...
;
the generated parser will try to match a location before it tries to match a methodCall. Try swapping those two around:
expression
: methodCall #expr_mcall
| location #expr_loc
...
;

Any method to remove unwanted space in grammar rule?

I am trying to add a special type of function to my antlr grammar called Window function. My Grammar looks something like this :
stat: expression;
equation: expression relop expression;
expression:
multiplyingExpression ((PLUS | MINUS) multiplyingExpression)*;
multiplyingExpression:
powExpression ((TIMES | DIV) powExpression)*;
powExpression: signedAtom (POW signedAtom)?;
signedAtom:
PLUS signedAtom
| MINUS signedAtom
| winfunc
| func
| iffunc
| atom;
atom:
scientific
| string_literal
| id
| constant
| LPAREN expression RPAREN;
string_literal: STRING;
scientific: SCIENTIFIC_NUMBER;
constant: PI | EULER | I;
variable: VARIABLE;
func: funcname LPAREN expression (COMMA expression)* RPAREN;
iffunc:
'if' LPAREN equation COMMA expression COMMA expression RPAREN;
funcname: variable;
relop: EQ | GT | LT;
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
GT: '>';
LT: '<';
EQ: '==';
COMMA: ',';
POINT: '.';
POW: '^';
id: '[' idx ']' {console.log($idx.text);};
idx:
{(this.antlrHelper.isMetric(this.getCurrentToken().text))}? metricid
| {(this.antlrHelper.isDimension(this.getCurrentToken().text))}? entityid
| unknownid;
// metricid | entityid | unknownid;
metricid: VARIABLE;
entityid: VARIABLE;
unknownid: VARIABLE;
VARIABLE: VALID_ID_START VALID_ID_CHAR*;
fragment VALID_ID_START: ('a' .. 'z') | ('A' .. 'Z') | '_';
fragment VALID_ID_CHAR: VALID_ID_START | ('0' .. '9') | '.';
SCIENTIFIC_NUMBER: NUMBER ((E1 | E2) SIGN? NUMBER)?;
fragment NUMBER: ('0' .. '9')+ ('.' ('0' .. '9')+)?;
fragment E1: 'E';
fragment E2: 'e';
fragment SIGN: ('+' | '-');
STRING: '"' StringCharacters? '"';
fragment StringCharacters: StringCharacter+;
fragment StringCharacter: ~["\\] | EscapeSequence;
// §3.10.6 Escape Sequences for Character and String Literals
fragment EscapeSequence: '\\' [btnfr"'\\] | OctalEscape;
fragment OctalEscape:
'\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit;
fragment ZeroToThree: [0-3];
fragment OctalDigit: [0-7];
WS: [ \r\n\t]+ -> skip;
It is taking normal functions. For windows function, I've added following rule :
winfunc:
WINDOW winfuncname LPAREN winMetricId COMMA scientific COMMA scientific RPAREN;
WINDOW: 'Window_';
winfuncname:
variable;
winMetricId: '[' winMetricIdx ']';
winMetricIdx:
{(this.antlrHelper.isMetric(this.getCurrentToken().text))}? metricid
| otherid;
otherid: VARIABLE;
On parsing
Window_ADD
it is parsing it into func rule but I want my grammar to parse it to winfunc rule.
Window_ ADD
it is parsing it into winfunc but I don't want that extra space to be there. How can I make Window_ADD parse to winfunc rule instead of Window_ ADD?
You have two options:
1. If you know exactly which function names (terminals) will be used, you may simply change your rule:
WINDOW: 'Window_';
to
WINDOW: 'Window_ADD';
If you want to add more functions, say, Window_DEL, just add one more terminal to this rule:
WINDOW: 'Window_' ('ADD' | 'DEL');
or
WINDOW: 'Window_';
WINDOW_ADD: WINDOW 'ADD';
WINDOW_DEL: WINDOW 'DEL';
2. In case function names are unknown, you may want to use wildcards to determine a terminal:
WINDOW: 'Window_' VALID_ID_CHAR+;
In this case the type of a function is determined during the phase of semantic analysis.

ANTLR4 - Access token group in a sequence using context

I have a grammar that includes this rule:
expr:
unaryExpr '(' (stat | expr | constant) ')' #labelUnaryExpr
| binaryExpr '(' (stat | expr | constant) ',' (stat | expr | constant) ')' #labelBinaryExpr
| multipleExpr '(' (stat | expr | constant) (',' (stat | expr | constant))+ ')' #labelMultipleExpr
;
For expr, I can access the value of unaryExpr by calling ctx.unaryStat(). How can I access (stat | expr | constant) similarly? Is there a solution that doesn't require modifying my grammar by adding another rule for the group?
Since you've labelled you alternatives, you can access the (stat | expr | constant) in its respective listener/visitor method:
#Override
public void enterLabelUnaryExpr(#NotNull ExprParser.LabelUnaryExprContext ctx) {
// one of these will return something other than null
System.out.println(ctx.stat());
System.out.println(ctx.expr());
System.out.println(ctx.constant());
}

Precedence in Antlr using parentheses

We are developing a DSL, and we're facing some problems:
Problem 1:
In our DSL, it's allowed to do this:
A + B + C
but not this:
A + B - C
If the user needs to use two or more different operators, he'll need to insert parentheses:
A + (B - C) or (A + B) - C.
Problem 2:
In our DSL, the most precedent operator must be surrounded by parentheses.
For example, instead of using this way:
A + B * C
The user needs to use this:
A + (B * C)
To solve the Problem 1 I've got a snippet of ANTLR that worked, but I'm not sure if it's the best way to solve it:
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr;
To solve the Problem 2, I have no idea where to start.
I appreciate your help to find out a better solution to the first problem and a direction to solve the seconde one. (Sorry for my bad english)
Below is the grammar that we have developed:
grammar TclGrammar;
options {
output=AST;
ASTLabelType=CommonTree;
}
#members {
public boolean isSum(String type) {
System.out.println("Tipo: " + type);
return "+".equals(type);
}
public boolean isSub(String type) {
System.out.println("Tipo: " + type);
return "-".equals(type);
}
}
prog
: exprMain ';' {System.out.println(
$exprMain.tree == null ? "null" : $exprMain.tree.toStringTree());}
;
exprMain
: exprQuando? (exprDeveSatis | exprDeveFalharCaso)
;
exprDeveSatis
: 'DEVE SATISFAZER' '{'! expr '}'!
;
exprDeveFalharCaso
: 'DEVE FALHAR CASO' '{'! expr '}'!
;
exprQuando
: 'QUANDO' '{'! expr '}'!
;
expr
: logicExpr
;
logicExpr
: boolExpr (('E'|'OU')^ boolExpr)*
;
boolExpr
: comparatorExpr
| emExpr
| 'VERDADE'
| 'FALSO'
;
emExpr
: FIELD 'EM' '[' (variable_lista | field_lista) comCruzamentoExpr? ']'
-> ^('EM' FIELD (variable_lista+)? (field_lista+)? comCruzamentoExpr?)
;
comCruzamentoExpr
: 'COM CRUZAMENTO' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COM CRUZAMENTO' FIELD+)
;
comparatorExpr
: sumExpr (('<'^|'<='^|'>'^|'>='^|'='^|'<>'^) sumExpr)?
| naoPreenchidoExpr
| preenchidoExpr
;
naoPreenchidoExpr
: FIELD 'NAO PREENCHIDO' -> ^('NAO PREENCHIDO' FIELD)
;
preenchidoExpr
: FIELD 'PREENCHIDO' -> ^('PREENCHIDO' FIELD)
;
sumExpr
#init {boolean isSum=false;boolean isSub=false;}
: {isSum(input.LT(2).getText()) && !isSub}? multExpr('+'^{isSum=true;} sumExpr)+
| {isSub(input.LT(2).getText()) && !isSum}? multExpr('-'^{isSub=true;} sumExpr)+
| multExpr
;
multExpr
: funcExpr(('*'^|'/'^) funcExpr)?
;
funcExpr
: 'QUANTIDADE'^ '('! FIELD ')'!
| 'EXTRAI_TEXTO'^ '('! FIELD ';' INTEGER ';' INTEGER ')'!
| cruzaExpr
| 'COMBINACAO_UNICA' '(' FIELD ';' FIELD (';' FIELD)* ')' -> ^('COMBINACAO_UNICA' FIELD+)
| 'EXISTE'^ '('! FIELD ')'!
| 'UNICO'^ '('! FIELD ')'!
| atom
;
cruzaExpr
: operadorCruzaExpr ('CRUZA COM'^|'CRUZA AMBOS'^) operadorCruzaExpr ondeExpr?
;
operadorCruzaExpr
: FIELD('('!field_lista')'!)?
;
ondeExpr
: ('ONDE'^ '('!expr')'!)
;
atom
: FIELD
| VARIABLE
| '('! expr ')'!
;
field_lista
: FIELD(';' field_lista)?
;
variable_lista
: VARIABLE(';' variable_lista)?
;
FIELD
: NONCONTROL_CHAR+
;
NUMBER
: INTEGER | FLOAT
;
VARIABLE
: '\'' NONCONTROL_CHAR+ '\''
;
fragment SIGN: '+' | '-';
fragment NONCONTROL_CHAR: LETTER | DIGIT | SYMBOL;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SYMBOL: '_' | '.' | ',';
fragment FLOAT: INTEGER '.' '0'..'9'*;
fragment INTEGER: '0' | SIGN? '1'..'9' '0'..'9'*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {skip();}
;
This is similar to not having operator precedence at all.
expr
: funcExpr
( ('+' funcExpr)*
| ('-' funcExpr)*
| ('*' funcExpr)*
| ('/' funcExpr)*
)
;
I think the following should work. I'm assuming some lexer tokens with obvious names.
expr: sumExpr;
sumExpr: onlySum | subExpr;
onlySum: atom ( PLUS onlySum )?;
subExpr: onlySub | multExpr;
onlySub: atom ( MINUS onlySub )? ;
multExpr: atom ( STAR atomic )? ;
parenExpr: OPEN_PAREN expr CLOSE_PAREN;
atom: FIELD | VARIABLE | parenExpr
The only* rules match an expression if it only has one type of operator outside of parentheses. The *Expr rules match either a line with the appropriate type of operators or go to the next operator.
If you have multiple types of operators, then they are forced to be inside parentheses because the match will go through atom.

How do I distinguish a very keyword-like token from a keyword using ANTLR?

I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.
Here's the grammar:
grammar Query;
options {
output = AST;
backtrack = true;
}
tokens {
DefaultBooleanNode;
}
// Parser
startExpression : expression EOF ;
expression : withinExpression ;
withinExpression
: defaultBooleanExpression
(WSLASH^ NUMBER defaultBooleanExpression)*
defaultBooleanExpression
: (queryFragment -> queryFragment)
(e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
;
queryFragment : unquotedQuery ;
unquotedQuery : UNQUOTED | NUMBER ;
// Lexer
WSLASH : ('W'|'w') '/';
NUMBER : Digit+ ('.' Digit+)? ;
UNQUOTED : UnquotedStartChar UnquotedChar* ;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };
I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.
A slash is permitted in the middle of an unquoted query fragment.
Boolean queries in particular have no required keyword.
A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.
I realise that I could spell it out in full, e.g.:
Any character from UnquotedStartChar except 'w', followed by the rest of the rule
'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
'w/' followed by any character from UnquotedChar except digits
...
However, that would look awful. :)
When a lexer generated by ANTLR "sees" that certain input can be matched by more than 1 rule, it chooses the longest match. If you want a shorter match to take precedence, you'll need to merge all the similar rules into one and then check with a gated sematic predicate if the shorter match is ahead or not. If the shorter match is ahead, you change the type of the token.
A demo:
Query.g
grammar Query;
tokens {
WSlash;
}
#lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
parse
: (t=. {System.out.printf("\%-10s \%s\n", tokenNames[$t.type], $t.text);} )* EOF
;
NUMBER
: Digit+ ('.' Digit+)?
;
UNQUOTED
: {ahead("W/")}?=> 'W/' { $type=WSlash; /* change the type of the token */ }
| {ahead("w/")}?=> 'w/' { $type=WSlash; /* change the type of the token */ }
| UnquotedStartChar UnquotedChar*
;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~'u'
)
;
fragment Digit : '0'..'9';
fragment HexDigit : '0'..'9' | 'a'..'f' | 'A'..'F';
WHITESPACE : (' ' | '\r' | '\t' | '\u000C' | '\n') { skip(); };
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
QueryLexer lexer = new QueryLexer(new ANTLRStringStream("P/3 W/3"));
QueryParser parser = new QueryParser(new CommonTokenStream(lexer));
parser.parse();
}
}
To run the demo on *nix/MacOS:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
or on Windows:
java -cp antlr-3.3.jar org.antlr.Tool Query.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
which will print the following:
UNQUOTED P/3
WSlash W/
NUMBER 3
EDIT
To eliminate the warning when using the WSlash token in a parser rule, simply add an empty fragment rule to your grammar:
fragment WSlash : /* empty */ ;
It's a bit of a hack, but that's how it's done. No more warnings.