Antlr4 instruction keywords and longest-statement matching - antlr

I am attempting to write a grammar, but I've found a problem occurring that I'm not quite sure how to solve 'elegantly'.
The issue is that I have 'bro' as a reserved instruction keyword, and it can be followed(or not) by a predication statement. IE: 'bro_t' or 'bro'.
Now, the issue is that currently 'bro_t' matches the definition for ID, while 'bro' is a token by itself, and clearly 'bro_t' is longer than 'bro', so the parser matches that statement to an ID and the parse fails. The solutions that I have come up with are to make 'bro_t' and 'bro_f' reserved as well, but that would be relatively time consuming for the entire instruction set. The other solution that I was looking at was wildcard operators, but I don't really understand if they are applicable here and if so how to apply them.
Grammar:
predicate
: '_t' '<' register '>' | '_f' '<' register '>' | ;
operation
: 'bro' predicate ;
ID: ('a' .. 'z' | 'A' .. 'Z' | '_') ( 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '$' | '.')* ;

Why not do:
operation
: BRO '<' register '>'
;
BRO : 'bro' ( '_' [a-z]+ )?
ID : [a-zA-Z_] [a-zA-Z0-9_$.]*;
?

Related

Lexer negation in ANTLR - why doesn't double-negation work?

I have this token I would like to match any string of non-special characters except for the literal string "TODAY".
ANTLR makes this a pain in the butt:
UNQUOTED :
( ~('T'|'t'|~UnquotedStartChar) UnquotedChar*
( ~('O'|'o'|~UnquotedChar) UnquotedChar*
| ('O'|'o')
( ~('D'|'d'|~UnquotedChar) UnquotedChar*
| ('D'|'d')
( ~('A'|'a'|~UnquotedChar) UnquotedChar*
| ('A'|'a')
( ~('Y'|'y'|~UnquotedChar) UnquotedChar*
)?
)?
)?
)?
)
;
I figured I was being clever by double-negating in the ~('T'|'t'|~UnquotedStartChar) - it should match everything that isn't 'T', 't', or non-permitted in the start char normally... which intuitively seems like it should work, and when I think about it a bit, seems like it should work, but when I actually try to compile it, I get this error message:
error(100): antlr/QueryParser.g:0:1: syntax error: buildnfa: MismatchedTreeNodeException(16!=32)
UnquotedStartChar is, in itself, quite a mess... but I'm not entirely sure it's relevant yet. Here's the first couple of levels anyway.
fragment
UnquotedStartChar
: EscapeSequence
| ~( ProhibitedAtStartOfUnquoted )
;
fragment
ProhibitedAtStartOfUnquoted
: ProhibitedInUnquoted
| Slash | PLUS | MINUS | DOLLAR;
fragment
EscapeSequence
: Backslash
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
In lexer grammar, rule (token) that is defined earlier takes precedence in case of a conflict, so you could just define a token for 'TODAY' above UNQUOTED and it should match the string instead of the general UNQUOTED token.

Antlr - mismatched input '1' expecting number

I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
It is very strange for me, because 1 is a NUMBER...
If I change the filterAttribute from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it for me?
Thanks.
By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1': // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever certain input can match two or more lexer rules, ANTLR chooses the one defined first. The lexer does not "listen" to the parser to see what it needs at a particular time. The lexing and parsing are 2 distinct phases. This is simply how ANTLR works, and many other other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing", or "packrat parsers".

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.

ANTLR Grammar is not backtracking while parsing similar rules

Suppose I have a grammar which takes care of the global variables and some method declarations of some variation of C
program: (declaration)* (procedure)*;
declaration: typespec identifier ';';
procedure: typespec identifier '(' ')' ';';
typespec: 'char' | 'int';
identifier: ('a' .. 'z' | 'A' .. 'Z') ('A' - 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
If I feed it something like:
int MAX;
char proc();
the grammar reads int MAX; correctly but then it wants to apply the declaration rule also to the 2nd row, and it fails when it reaches (, and at this point I expect it to backtrack and apply the next rule which is the one for procedure. Could somebody please tell me why this isn't happening?
Did you post all of your grammar? I couldn't get it to compile as you posted...but I played around with what you posted to make it match your example:
program: (declaration)* (procedure)*;
statement: TYPE_SPEC IDENT ;
declaration: statement ';';
procedure: statement '(' ')' ';';
TYPE_SPEC
: 'char' | 'int';
IDENT
: ('a' .. 'z' | 'A' .. 'Z') ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
WHITESPACE
: ('\r' | '\n' | '\r\n' | ' ' | '\t' ) {$channel=HIDDEN;}
;
I'd recommend that your make lexer rules (The ones in capitals) for your token matching rather than making them part of your parser rules - I've done some of them already for you as you can see.

ANTLR 3.x - How to format rewrite rules

I'm finding myself challenged on how to properly format rewrite rules when certain conditions occur in the original rule.
What is the appropriate way to rewrite this:
unaryExpression: op=('!' | '-') t=term
-> ^(UNARY_EXPR $op $t)
Antlr doesn't seem to like me branding anything in parenthesis with a label and "op=" fails. Also, I've tried:
unaryExpression: ('!' | '-') t=term
-> ^(UNARY_EXPR ('!' | '-') $t)
Antlr doesn't like the or '|' and throws a grammar error.
Replacing the character class with a token name does solve this problem, however it creates a quagmire of other issues with my grammar.
--- edit ----
A second problem has been added. Please help me format this rule with tree grammar:
multExpression
: unaryExpression (MULT_OP unaryExpression)*
;
Pretty simple: My expectation is to enclose every matched token in a parent (imaginary) token MULT so that I end up with something like:
MULT
o
|
o---o---o---o---o
| | | | |
'3' '*' '6' '%' 2
unaryExpression
: (op='!' | op='-') term
-> ^(UNARY_EXPR[$op] $op term)
;
I used the UNARY_EXPR[$op] so the root node gets some useful line/column information instead of defaulting to -1.