ANTLR Grammar is not backtracking while parsing similar rules

Suppose I have a grammar that handles the global variables and some method declarations of a C-like language:
program: (declaration)* (procedure)*;
declaration: typespec identifier ';';
procedure: typespec identifier '(' ')' ';';
typespec: 'char' | 'int';
identifier: ('a' .. 'z' | 'A' .. 'Z') ('A' - 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
If I feed it something like:
int MAX;
char proc();
The grammar reads int MAX; correctly, but then it tries to apply the declaration rule to the second line as well, and it fails when it reaches (. At that point I expect it to backtrack and apply the next rule, which is the one for procedure. Could somebody please tell me why this isn't happening?

Did you post all of your grammar? I couldn't get it to compile as you posted...but I played around with what you posted to make it match your example:
program: (declaration)* (procedure)*;
statement: TYPE_SPEC IDENT ;
declaration: statement ';';
procedure: statement '(' ')' ';';
TYPE_SPEC
: 'char' | 'int';
IDENT
: ('a' .. 'z' | 'A' .. 'Z') ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')*;
WHITESPACE
: ('\r' | '\n' | '\r\n' | ' ' | '\t' ) {$channel=HIDDEN;}
;
I'd recommend that you make lexer rules (the ones in capitals) for your token matching rather than making them part of your parser rules. I've done some of them already for you, as you can see.
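To illustrate why factoring out the shared TYPE_SPEC IDENT prefix helps, here is a small hand-written Python sketch. It is not what ANTLR generates; the `classify` helper and the regex tokenizer are hypothetical names made up for this example. The idea is only that, once the common prefix has been consumed, the very next token (';' or '(') decides between declaration and procedure, so no backtracking is needed.

```python
import re

# Tokens: identifiers/keywords and the three punctuation characters used by the grammar.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*|[();]")
TYPE_SPECS = {"char", "int"}

def classify(text):
    """Label each statement as 'declaration' or 'procedure'."""
    tokens = TOKEN_RE.findall(text)  # whitespace is simply not matched
    kinds, i = [], 0
    while i < len(tokens):
        # Shared prefix: TYPE_SPEC IDENT
        if tokens[i] not in TYPE_SPECS or i + 1 >= len(tokens):
            raise SyntaxError(f"expected a type and a name at token {i}")
        i += 2
        # The token after the prefix picks the alternative.
        if tokens[i:i + 1] == [";"]:
            kinds.append("declaration")
            i += 1
        elif tokens[i:i + 3] == ["(", ")", ";"]:
            kinds.append("procedure")
            i += 3
        else:
            raise SyntaxError(f"expected ';' or '()' ';' at token {i}")
    return kinds

print(classify("int MAX;\nchar proc();"))
```

Unlike the grammar, this sketch does not enforce that all declarations come before all procedures; it only shows the lookahead decision.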

Related

I'm just starting with ANTLR and I can't decipher where I'm messing up with a mismatched input error

I've just started using ANTLR, so I'd really appreciate the help! I'm just trying to make a variable declaration rule, but it's not working! I've put the files I'm working with below; please let me know if you need anything else!
INPUT CODE:
var test;
GRAMMAR G4 FILE:
grammar treetwo;
program : (declaration | statement)+ EOF;
declaration :
variable_declaration
| variable_assignment
;
statement:
expression
| ifstmnt
;
variable_declaration:
VAR NAME SEMICOLON
;
variable_assignment:
NAME '=' NUM SEMICOLON
| NAME '=' STRING SEMICOLON
| NAME '=' BOOLEAN SEMICOLON
;
expression:
operand operation operand SEMICOLON
| expression operation expression SEMICOLON
| operand operation expression SEMICOLON
| expression operation operand SEMICOLON
;
ifstmnt:
IF LPAREN term RPAREN LCURLY
(declaration | statement)+
RCURLY
;
term:
| NUM EQUALITY NUM
| NAME EQUALITY NUM
| NUM EQUALITY NAME
| NAME EQUALITY NAME
;
/*Tokens*/
NUM : '0' | '-'?[1-9][0-9]*;
STRING: [a-zA-Z]+;
BOOLEAN: 'true' | 'false';
VAR : 'var';
NAME : [a-zA-Z]+;
SEMICOLON : ';';
LPAREN: '(';
RPAREN: ')';
LCURLY: '{';
RCURLY: '}';
EQUALITY: '==' | '<' | '>' | '<=' | '>=' | '!=' ;
operation: '+' | '-' | '*' | '/';
operand: NUM;
IF: 'if';
WS : [ \t\r\n]+ -> skip;
Error I'm getting:
(line 1,char 0): mismatched input 'var' expecting {NUM, 'var', NAME, 'if'}

Your STRING rule is the same as your NAME rule.
With the ANTLR lexer, if two lexer rules match the same input, the first one declared will be used. As a result, you’ll never see a NAME token.
Most tutorials will show you how to dump out the token stream. It’s usually a good idea to view the token stream and verify your lexer rules before getting too far into your parser rules.
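To see the effect without running ANTLR, here is a rough Python simulation of the lexer's two tie-breaking rules (longest match first, then the rule declared first). It is a sketch, not ANTLR's implementation: the `tokenize` helper and the regex translations of the token rules are assumptions made for illustration, and only a subset of the question's rules is included.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule never displaces an equally long earlier match.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no lexer rule matches at offset {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

# Rule order as posted in the question (subset).
posted = [
    ("NUM", r"0|-?[1-9][0-9]*"),
    ("STRING", r"[a-zA-Z]+"),
    ("BOOLEAN", r"true|false"),
    ("VAR", r"var"),
    ("NAME", r"[a-zA-Z]+"),
    ("SEMICOLON", r";"),
    ("WS", r"[ \t\r\n]+"),
]

# One possible fix (an assumption, not from the answer): drop the duplicate
# STRING rule so that the keywords stay ahead of NAME.
reordered = [rule for rule in posted if rule[0] != "STRING"]

print(tokenize(posted, "var test;"))
print(tokenize(reordered, "var test;"))
```

With the posted order, both var and test come out as STRING in this simulation, which would explain why the parser complains about 'var' even though 'var' appears in the expected set.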

Antlr4 instruction keywords and longest-statement matching

I am attempting to write a grammar, but I've found a problem occurring that I'm not quite sure how to solve 'elegantly'.
The issue is that I have 'bro' as a reserved instruction keyword, and it may or may not be followed by a predication suffix, i.e. 'bro_t' or 'bro'.
Currently 'bro_t' matches the definition for ID, while 'bro' is a token by itself. Since 'bro_t' is longer than 'bro', the lexer matches the whole thing as an ID and the parse fails. One solution I have come up with is to make 'bro_t' and 'bro_f' reserved as well, but that would be relatively time consuming for the entire instruction set. The other solution I was looking at was wildcard operators, but I don't really understand whether they are applicable here and, if so, how to apply them.
Grammar:
predicate
: '_t' '<' register '>' | '_f' '<' register '>' | ;
operation
: 'bro' predicate ;
ID: ('a' .. 'z' | 'A' .. 'Z' | '_') ( 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '$' | '.')* ;

Why not do:
operation
: BRO '<' register '>'
;
BRO : 'bro' ( '_' [a-z]+ )? ;
ID : [a-zA-Z_] [a-zA-Z0-9_$.]*;
?
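A quick way to check the longest-match behaviour is to simulate it. The Python sketch below is only an approximation of what the ANTLR lexer does: `first_token` is a hypothetical helper, the regexes are hand translations of the rules, and the quoted names stand in for the implicit tokens ANTLR creates for literals in parser rules.

```python
import re

def first_token(rules, text):
    """Return (rule name, lexeme) for the first token: longest match, then first rule."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text)
        if m and m.group() and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

ID = r"[a-zA-Z_][a-zA-Z0-9_$.]*"

# Grammar from the question: 'bro', '_t' and '_f' are literals, ID comes last.
original = [("'bro'", r"bro"), ("'_t'", r"_t"), ("'_f'", r"_f"), ("ID", ID)]

# Grammar from the answer: BRO swallows the optional suffix and is listed before ID.
suggested = [("BRO", r"bro(?:_[a-z]+)?"), ("ID", ID)]

print(first_token(original, "bro_t"))    # ID is longer than 'bro', so ID wins
print(first_token(suggested, "bro_t"))   # same length, BRO is listed first
print(first_token(suggested, "brother")) # ID is longer, so ordinary names still work
```

The point of the BRO rule is that it turns a length contest that ID used to win into a tie, and ties go to the rule defined first.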

Can't create a variable with just one letter

I wish the variables could be declared with only one letter in the name.
When I write Integer aa; all work, but
when I type Integer a; then grun says: mismatched input 'a' expecting ID.
I've seen the inverse problem but it didn't help. I think my code is right but I can't see where I'm wrong. This is my lexer:
lexer grammar Symbols;
...
LineComment: '//' ~[\u000A\u000D]* -> channel(HIDDEN) ;
DelimetedComment: '/*' .*? '*/' -> channel(HIDDEN) ;
String: '"' .*? '"' ;
Character: '\'' (EscapeSeq | .) '\'' ;
IntegerLiteral: '0' | (ADD?| SUB) DecDigitNoZero DecDigit+ ;
FloatLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [F] ;
DoubleLiteral: ((ADD? | SUB) (DecDigitNoZero DecDigit*)? DOT DecDigit+ | IntegerLiteral) [D] ;
LongLiteral: IntegerLiteral [L] ;
HexLiteral: '0' [xX] HexDigit (HexDigit | UNDERSCORE)* ;
BinLiteral: '0' [bB] BinDigit (BinDigit | UNDERSCORE)* ;
OctLiteral: '0' [cC] OctDigit (OctDigit | UNDERSCORE)* ;
Booleans: TRUE | FALSE ;
Number: IntegerLiteral | FloatLiteral | DoubleLiteral | BinLiteral | HexLiteral | OctLiteral | LongLiteral ;
EscapeSeq: UniCharacterLiteral | EscapedIdentifier;
UniCharacterLiteral: '\\' 'u' HexDigit HexDigit HexDigit HexDigit ;
EscapedIdentifier: '\\' ('t' | 'b' | 'r' | 'n' | '\'' | '"' | '\\' | '$') ;
HexDigit: [0-9a-fA-F] ;
BinDigit: [01] ;
OctDigit: [0-7];
DecDigit: [0-9];
DecDigitNoZero: [1-9];
ID: [a-z] ([a-zA-Z_] | [0-9])*;
TYPE: [A-Z] ([a-zA-Z] | UNDERSCORE | [0-9])* ;
DATATYPE: Number | String | Character | Booleans ;

When you get an error like "Unexpected input 'foo', expected BAR" and you think "But 'foo' is a BAR", the first thing you should do is to print the token stream for your input (you can do this by running grun Symbols tokens -tokens inputfile). If you do this, you'll see that the a in your input is recognized as a HexDigit, not as an ID.
Why does this happen? Because both HexDigit and ID match the input a and ANTLR (like most lexer generators) resolves ambiguities according to the maximal munch rule: When multiple rules can match the current input, it chooses the one that produces the longest match (which is why variables with more than one letter work) and then resolves ties by picking the one that is defined first, which is HexDigit in this case.
Note that the lexer does not care which lexer rules are used by the parser and when. The lexer decides which tokens to produce solely based on the contents of the lexer grammar, so the lexer does not know or care that the parser wants an ID right now. It looks at all rules that match and then picks one according to the maximal munch rule and that's it.
In your case you never actually use HexDigit in your parser grammar, so there is no reason why you'd ever want a HexDigit token to be created. Therefore HexDigit should not be a lexer rule - it should be a fragment:
fragment HexDigit : [0-9a-fA-F];
This also applies to your other rules that aren't used in the parser, including all the ...Digit rules.
PS: Your Number rule will never match because of these same rules. It should probably be a parser rule instead (or the other number rules should be fragments if you don't care which kind of number literal you have).
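The difference between a token rule and a fragment can be mimicked in a few lines of Python. This is a sketch under assumptions, not ANTLR itself: `tokenize` and `as_fragments` are made-up helpers, only a handful of the posted rules are translated to regexes, and the WS rule is assumed because the posted lexer elides its first rules.

```python
import re

# (name, pattern, is_fragment) in declaration order.
RULES = [
    ("HexDigit", r"[0-9a-fA-F]", False),
    ("DecDigit", r"[0-9]", False),
    ("ID", r"[a-z][a-zA-Z_0-9]*", False),
    ("TYPE", r"[A-Z][a-zA-Z_0-9]*", False),
    ("WS", r"[ \t\r\n]+", False),  # assumed rule
]

def as_fragments(rules, names):
    """Mark the named rules as fragments, leaving the order unchanged."""
    return [(n, p, frag or n in names) for n, p, frag in rules]

def tokenize(rules, text):
    """Longest match wins, ties go to the first rule; fragments never produce tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern, is_fragment in rules:
            if is_fragment:
                continue
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no lexer rule matches at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens

fixed = as_fragments(RULES, {"HexDigit", "DecDigit"})

print(tokenize(RULES, "Integer aa"))
print(tokenize(RULES, "Integer a"))
print(tokenize(fixed, "Integer a"))
```

In this simulation the single-letter name only becomes an ID once HexDigit stops competing for tokens.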

Antlr - mismatched input '1' expecting number

I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
It is very strange for me, because 1 is a NUMBER...
If I change the filterAttribute from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it for me?
Thanks.

By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1'; // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever certain input can match two or more lexer rules, ANTLR chooses the one defined first. The lexer does not "listen" to the parser to see what it needs at a particular time. Lexing and parsing are two distinct phases. This is simply how ANTLR works, along with many other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing" or "packrat parsers".
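The following Python sketch imitates that behaviour for the inputs from the question. It is an approximation under assumptions: `tokenize` is a hypothetical helper, the quoted names stand in for whatever names ANTLR gives the implicit literal tokens, and only the rules needed for these inputs are included.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie the rule listed first wins. WS is dropped."""
    names, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no lexer rule matches at offset {pos}")
        if best_name != "WS":
            names.append(best_name)
        pos += best_len
    return names

EXPLICIT = [
    ("EQUALS", r"="),
    ("NUMBER", r"-?[0-9]+(?:\.[0-9]+)?"),
    ("WS", r"[ \t\r\n]+"),
]

# Literals used in parser rules behave like lexer rules placed before the explicit ones.
with_a_and_1 = [("'x'", r"x"), ("'y'", r"y"), ("'a'", r"a"), ("'1'", r"1"), ("'*'", r"\*")] + EXPLICIT
with_a1 = [("'x'", r"x"), ("'y'", r"y"), ("'a'", r"a"), ("'a1'", r"a1"), ("'*'", r"\*")] + EXPLICIT

print(tokenize(with_a_and_1, "x = y * 2"))   # ends in NUMBER
print(tokenize(with_a_and_1, "x = y * 1"))   # ends in '1', not NUMBER
print(tokenize(with_a_and_1, "x = y * 11"))  # NUMBER again: the longer match wins
print(tokenize(with_a1, "x = y * 1"))        # no '1' literal any more, so NUMBER
```

This also suggests why changing 'a' '1' to 'a1' made x=y*1 work: the literal '1' no longer exists as a token of its own.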

Grammar for ANTLR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The language itself is somewhat similar to SQL.
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? Should I use lexer modes?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands.
What puzzles me is that the individual parts work properly on their own, e.g. parsing only the 'expr' rule, as shown by the tree below.

That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. The same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+; and WORD : CHAR (CHAR | NUMBER)*; as lexer rules.
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.
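As a rough illustration of "cutting at meaningful points", the Python sketch below builds WORD and NUMBER out of CHAR and DIGIT building blocks, the way fragments are inlined into lexer rules. The names `TOKEN_RE` and `tokenize` are made up for this example, the WORD pattern uses DIGIT directly (which accepts the same strings as CHAR (CHAR | NUMBER)*), and only a few of the grammar's tokens are covered.

```python
import re

# Building blocks, comparable to fragment rules: they never become tokens themselves.
DIGIT = r"[0-9]"
CHAR = r"[a-zA-Z]"

# Token rules composed from the building blocks.
TOKEN_RE = re.compile(
    rf"(?P<WORD>{CHAR}(?:{CHAR}|{DIGIT})*)"
    rf"|(?P<NUMBER>{DIGIT}+)"
    rf"|(?P<DOT>\.)"
    rf"|(?P<WS>\s+)"
)

def tokenize(text):
    """Split text into (token type, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("deco.marginType"))
print(tokenize("raw.attr3"))
```

The parser then sees three tokens for deco.marginType instead of one token per character, which is what a rule like element : qualifier DOT attribute wants to work with. In the real grammar the qualifier keywords would still need their own tokens; this sketch ignores that.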
