What does this ANTLR4 notation mean? - grammar

I have a question regarding the notation of a UCB Logo grammar that I found was generated for ANTLR4. There are some notations I can't make out, so I thought I would ask. If anyone is willing to clarify, I will be grateful.
Here are the notations I don't quite understand:
WORD
: {listDepth > 0}? ~[ \t\r\n\[\];] ( ~[ \t\r\n\];~] | LINE_CONTINUATION | '\\' ( [ \t\[\]();~] | LINE_BREAK ) )*
| {arrayDepth > 0}? ~[ \t\r\n{};] ( ~[ \t\r\n};~] | LINE_CONTINUATION | '\\' ( [ \t{}();~] | LINE_BREAK ) )*;
array
: '{' ( ~( '{' | '}' ) | array )* '}';
NAME
: ~[-+*/=<> \t\r\n\[\]()":{}] ( ~[-+*/=<> \t\r\n\[\](){}] | LINE_CONTINUATION | '\\' [-+*/=<> \t\r\n\[\]();~{}] )*;
I guess the array means that it can start with { and have an arbitrary number of levels, but has to end with }.
I take it that the others are some form of regular expressions?
To my knowledge, regex is different for different programming languages.
Did I get that right?

ANTLR does not do regular expressions. It does implement some of the same operators, but that is where the similarity largely ends.
The first sub-terms ({listDepth > 0}?) in the WORD rule are semantic predicates, which have no counterpart in the regular expression world. They are defined in the ANTLR documentation and explained in detail in TDAR (The Definitive ANTLR 4 Reference).
Your understanding of the array rule is essentially correct.
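To make the recursion in the array rule concrete, here is a minimal Python sketch of the same shape: an opening brace, then any mix of non-brace symbols and nested arrays, then a closing brace. It is only an illustration. The function name match_array is made up for this example, and it treats every character as one symbol, whereas the real rule is a parser rule that works on tokens.

```python
def match_array(text, pos=0):
    """Mimic  array : '{' ( ~('{' | '}') | array )* '}'  on a plain string.

    Returns the index just past the matching '}' or raises ValueError.
    Assumption: each character stands in for one token.
    """
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at position {pos}")
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == "}":
            return pos + 1                 # end of this array
        if ch == "{":
            pos = match_array(text, pos)   # nested array: recurse
        else:
            pos += 1                       # any symbol that is not a brace
    raise ValueError("missing closing '}'")


print(match_array("{a {b c} d}"))  # 11, the whole string is one array
```

Each nested { starts a new recursive call, so the nesting depth is unbounded, but every level has to be closed by its own }.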

Related

Exclude tokens from Identifier lexical rule

I have Identifier lexical rule:
Identifier
: ( 'a'..'z' | 'A'..'Z' | '_' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )*
;
LogicalOr and LogicalAnd rules:
LogicalOr : '| ' | '||' | OR;
LogicalAnd : '&' | '&&' | AND;
fragment Or : '[Oo][Rr]';
fragment And : '[Aa][Nn][Dd]';
The strings "and" and "or" are recognized as identifiers instead of LogicalAnd and LogicalOr. Could someone help me solve this problem, please?
There are two potential issues at play. First and foremost, ANTLR 3 does not support the character class syntax introduced by ANTLR 4. Your Or fragment literally matches the input [Oo][Rr]; it does not match OR, or, or oR. The same applies to your And fragment. You need to write the rule like this instead:
fragment
Or
: ('O' | 'o') ('R' | 'r')
;
If this does not resolve your issue, then you need to make sure your LogicalOr and LogicalAnd rules are positioned before the Identifier rule in the grammar. The rule which appears first will determine what token type is assigned for this input sequence.
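The effect of rule order can be tried out with a small Python simulation. This is a sketch, not ANTLR itself: the tokenize helper and the rule lists are hypothetical, and Python regular expressions stand in for the lexer rules. It applies the two conventions described above: the longest match wins, and when two rules match the same length, the rule listed first wins.

```python
import re


def tokenize(text, rules):
    """rules is an ordered list of (name, regex). Whitespace is skipped."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' means an earlier rule keeps the token on a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


IDENTIFIER = ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*")
LOGICAL_OR = ("LOGICAL_OR", r"\|\||\||[Oo][Rr]")
LOGICAL_AND = ("LOGICAL_AND", r"&&|&|[Aa][Nn][Dd]")

IDENTIFIER_FIRST = [IDENTIFIER, LOGICAL_OR, LOGICAL_AND]
OPERATORS_FIRST = [LOGICAL_OR, LOGICAL_AND, IDENTIFIER]

print(tokenize("a or b", IDENTIFIER_FIRST))  # "or" comes out as IDENTIFIER
print(tokenize("a or b", OPERATORS_FIRST))   # "or" comes out as LOGICAL_OR
print(tokenize("order", OPERATORS_FIRST))    # longer match: still IDENTIFIER
```

Note that moving the operator rules first does not break identifiers such as order, because the longer match is preferred before rule order is considered.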

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However, it is breaking on the nested parentheses, matching the first ( with the first ) instead of the outermost. In ANTLR3 I solved this with semantic predicates, but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining separate rules for conditions and expressions? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).

How to allow an identifier which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it causes an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone have any idea how to solve this?
Thanks
Charles
I think your grammar is context sensitive, even at the lexical analyser (tokenizer) level. The string "1m" is recognized as a single IDENT, not as an INTEGER followed by 'm'. You can either redefine your syntax, use predicated parsing, or embed Java code in your grammar to detect the context (e.g. if the number appears after "span" followed by "=", then parse it as INTEGER).
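The reason "1m" ends up as one IDENT can be seen in a short Python sketch of the longest-match rule that lexers of this kind follow. The names RULES and longest_match are made up for this illustration, the regular expressions only approximate the INTEGER and IDENT rules from the question, and the literal tokens such as 'm' are left out.

```python
import re

# Approximations of the two competing lexer rules, in grammar order.
RULES = [
    ("INTEGER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z0-9_]+")),
]


def longest_match(text, pos=0):
    """Return (rule name, matched text) for the longest match at pos."""
    candidates = []
    for name, regex in RULES:
        m = regex.match(text, pos)
        if m:
            candidates.append((name, m.group()))
    if not candidates:
        raise ValueError(f"no rule matches at position {pos}")
    # max() returns the first maximal item, so the earlier rule wins a tie.
    return max(candidates, key=lambda c: len(c[1]))


print(longest_match("1m 0_dur=12"))  # ('IDENT', '1m')
print(longest_match("12"))           # ('INTEGER', '12')
```

INTEGER can only consume "1", while IDENT can consume "1m", so the lexer never gets the chance to hand the parser an INTEGER at that position. The lexer makes this choice without knowing what the parser expects.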

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
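parse 1: a single expression, 1 - (-2), where the first minus is the binary operator of the add rule
parse 2: two expressions, 1 followed by - - 2, where both minus signs are unary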
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.
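The two readings can be enumerated with a small Python sketch. It uses a deliberately reduced version of the grammar (an assumption made for this example): the only operators are unary minus and binary minus, a number is written as N, and the input 1 - - 2 becomes the token string "N--N". The function segmentations is hypothetical and lists every way to split the tokens into one or more complete expressions, which is what expression+ has to decide.

```python
import re

# Reduced grammar, written over token characters:
#   unary      : '-'* N
#   expression : unary ('-' unary)*
EXPRESSION = re.compile(r"-*N(?:--*N)*")


def segmentations(tokens):
    """Return every split of tokens into a list of complete expressions."""
    if not tokens:
        return [[]]
    results = []
    for end in range(1, len(tokens) + 1):
        head = tokens[:end]
        if EXPRESSION.fullmatch(head):
            for rest in segmentations(tokens[end:]):
                results.append([head] + rest)
    return results


for split in segmentations("N--N"):
    print(split)
# ['N', '--N']   two expressions: 1 and - - 2
# ['N--N']       one expression:  1 - (-2)
```

Without the list (a single expression instead of expression+), only the second split would be possible, which is why the conflict only appears once multiFunc is added.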

How do I make a TreeParser in ANTLR3?

I'm attempting to learn language parsing for fun...
I've created an ANTLR grammar which I believe will match a simple language I am hoping to implement. It will have the following syntax:
<FunctionName> ( <OptionalArguments>+) {
<OptionalChildFunctions>+
}
Actual Example:
ForEach(in:[1,2,3,4,5] as:"nextNumber") {
Print(message:{nextNumber})
}
I believe I have the grammar working correctly to match this construct, and now I am attempting to build an Abstract Syntax Tree for the language.
Firstly, I must admit I'm not exactly sure HOW this tree should look. Secondly, I'm at a complete loss how to do this in my Antlr grammar...I've been trying without much success for hours.
This is the current idea I'm going with for the tree:
FunctionName
/ \
Attributes \
/ \ / \
ID /\ ChildFunctions
/ \ ID etc
/ \
Attribute AttributeValue
Type
This is my current Antlr grammar file:
grammar Test;
options {output=AST;ASTLabelType=CommonTree;}
program : function ;
function : ID (OPEN_BRACKET (attribute (COMMA? attribute)*)? CLOSE_BRACKET)? (OPEN_BRACE function* CLOSE_BRACE)?;
attribute : ID COLON datatype;
datatype : NUMBER | STRING | BOOLEAN | array | lookup ;
array : OPEN_BOX (datatype (COMMA datatype)* )? CLOSE_BOX ;
lookup : OPEN_BRACE (ID (PERIOD ID)*) CLOSE_BRACE;
NUMBER
: ('+' | '-')? (INTEGER | FLOAT)
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
BOOLEAN
: 'true' | 'TRUE' | 'false' | 'FALSE'
;
ID : (LETTER|'_') (LETTER | INTEGER |'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WHITESPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
COLON : ':' ;
COMMA : ',' ;
PERIOD : '.' ;
OPEN_BRACKET : '(' ;
CLOSE_BRACKET : ')' ;
OPEN_BRACE : '{' ;
CLOSE_BRACE : '}' ;
OPEN_BOX : '[' ;
CLOSE_BOX : ']' ;
fragment
LETTER
: 'a'..'z' | 'A'..'Z'
;
fragment
INTEGER
: '0'..'9'+
;
fragment
FLOAT
: INTEGER+ '.' INTEGER*
;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
;
ANY help / advice would be great. I've tried reading dozens of tutorials and nothing about the AST generation seems to stick :(
Step 1 is to make the tree look like the little graph that you posted. Right now, you don't have any tree construction operators, so you're going to end up with a flat list.
See tree construction on the antlr.org website.
You can use ANTLRWorks to see what you're getting for a parse tree and AST. Start adding tree construction operators and watch how things change.
EDIT / Additional Info:
Here's a process you can follow to give you a rough idea of how to do it:
Download ANTLRWorks and use its graphing facilities. You will definitely want to see the parse tree and the AST before and after you make changes. Once you understand how everything works, then you can use any IDE or editor you want.
There are two basic operators for tree construction: the exclamation mark !, which tells ANTLR not to place the node in the AST, and the caret ^, which tells ANTLR to make something the root node. Start by going through each non-terminal rule and deciding which elements don't need to be in the AST. For example, you don't need commas or parentheses. Once you have all the information, you can populate a structure (or create your own AST structure) that provides all the information. Commas don't help any more, so add a ! to them. For example:
function: ID (OPEN_BRACKET! (attribute (COMMA!? attribute)*)? CLOSE_BRACKET!)? (OPEN_BRACE! function* CLOSE_BRACE!)?;
Take a look at the AST in ANTLRWorks before and after. Compare.
Now decide which element should be the root node. It looks like you want ID to be the root node, so add a ^ after ID and compare in ANTLRWorks.
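If it helps to picture what the two operators do before trying them in ANTLRWorks, here is a rough Python model. It is an assumption-laden sketch, not ANTLR's implementation: build_ast is a made-up helper, each rule element is a plain string tagged with "", "!" or "^", and subtrees produced by other rules are simplified to single strings.

```python
def build_ast(elements):
    """Model of ANTLR 3 tree operators on one rule's matched elements.

    elements: list of (text, op) pairs, where op is
      ""   keep the node as a child
      "!"  drop the node from the tree
      "^"  make the node the root of everything collected so far
    Returns (root, children); root is None for a flat list.
    """
    root, children = None, []
    for text, op in elements:
        if op == "!":
            continue
        if op == "^":
            if root is not None:
                # A second ^ wraps the tree built so far as a child.
                children = [(root, children)]
            root = text
        else:
            children.append(text)
    return (root, children)


# function : ID ( attribute , attribute )        no operators: flat list
flat = [("ForEach", ""), ("(", ""), ("in", ""), (",", ""), ("as", ""), (")", "")]
# function : ID^ (! attribute ,! attribute )!    with operators
shaped = [("ForEach", "^"), ("(", "!"), ("in", ""), (",", "!"), ("as", ""), (")", "!")]

print(build_ast(flat))    # (None, ['ForEach', '(', 'in', ',', 'as', ')'])
print(build_ast(shaped))  # ('ForEach', ['in', 'as'])
```

The second result has the function name as the root and only the attributes as children, which is the general shape of the tree sketched in the question.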
Here are a few changes that bring it closer to what I think you want:
program : function ;
function : ID^ (OPEN_BRACKET! attributeList? CLOSE_BRACKET!)? (OPEN_BRACE! function* CLOSE_BRACE!)?;
attributeList: (attribute (COMMA!? attribute)*);
attribute : ID COLON! datatype;
datatype : NUMBER | STRING | BOOLEAN | array | lookup ;
array : OPEN_BOX! (datatype^ (COMMA! datatype)* )? CLOSE_BOX!;
lookup : OPEN_BRACE! (ID (PERIOD! ID)*) CLOSE_BRACE!;
With that under your belt, now go look at some of the tutorials.