Issue defining an ANTLR XYZ file grammar that can consume a '\n'-terminated string non-greedily

I just started using ANTLR 4. As a first project, I set myself the task of writing a grammar for XYZ files, since they are relatively simple.
At the moment it works fine as long as the file contains no comments.
This is what I have so far:
grammar XYZFile;
options {
accessLevel = '';
}
molecule : nAtomsLine commentLine atom ;
nAtomsLine : nAtom NEWLINE ;
nAtom : N_ATOMS ;
atom : ( atom3d | atom2d ) NEWLINE? (atom | EOF )? ;
atom3d : symbol xCoord yCoord zCoord ;
atom2d : symbol xCoord yCoord ;
xCoord : FLOAT ;
yCoord : FLOAT ;
zCoord : FLOAT ;
symbol : SYMBOLSTR ;
commentLine : comment NEWLINE ;
comment : COMMENT? ;
NEWLINE : '\r'? '\n' ;
SYMBOLSTR : 'A' ( 'c' | 'g' | 'l' | 'm' | 'r' | 's' | 't' | 'u' )
| 'B' ( 'a' | 'e' | 'h' | 'i' | 'k' | 'r' )?
| 'C' ( 'a' | 'd' | 'e' | 'f' | 'l' | 'm' | 'n' | 'o' | 'r' | 's' | 'u' )?
| 'D' ( 'b' | 's' | 'y' )
| 'E' ( 'r' | 's' | 'u' )
| 'F' ( 'e' | 'l' | 'm' | 'r' )?
| 'G' ( 'a' | 'd' | 'e' )
| 'H' ( 'e' | 'f' | 'g' | 'o' | 's' )?
| 'I' ( 'n' | 'r' )?
| 'K' 'r'?
| 'L' ( 'a' | 'i' | 'r' | 'u' | 'v' )
| 'M' ( 'c' | 'g' | 'n' | 'o' | 't' )
| 'N' ( 'a' | 'b' | 'd' | 'e' | 'h' | 'i' | 'o' | 'p' )?
| 'O' ( 'g' | 's' )?
| 'P' ( 'a' | 'b' | 'd' | 'm' | 'o' | 'r' | 't' | 'u' )?
| 'R' ( 'a' | 'b' | 'e' | 'f' | 'g' | 'h' | 'n' | 'u' )
| 'S' ( 'b' | 'c' | 'e' | 'g' | 'i' | 'm' | 'n' | 'r' )?
| 'T' ( 'a' | 'b' | 'c' | 'e' | 'h' | 'i' | 'l' | 'm' | 's' )
| 'U' | 'V' | 'W' | 'Xe' | 'Y' 'b'?
| 'Z' ( 'n' | 'r' )
;
N_ATOMS : INT ;
INT : DIGIT+ ;
FLOAT : '-'? DIGIT+ '.' DIGIT*
| '-'? '.' DIGIT+
;
WS : [ \t] -> skip ;
COMMENT : ~[\n\r].*? ;
fragment
DIGIT : [0-9] ;
I think my issue is in the lexer's COMMENT rule. It is supposed to consume everything until the end of the line. Currently it consumes only one character, on purpose, because every modification I made only made things worse:
.*? NEWLINE
~[\n\r]*?
~[\n\r]*? NEWLINE
~[\n\r] .*? NEWLINE
I'm pretty sure I tried many other things in frustration, but these should be enough to illustrate where I'm stuck.
I understand that this pattern can match other cases, but I don't see how to avoid it.
Thank you for your time.

The second line can contain pretty much any kind of characters (also digits), making it hard in the lexer to make a distinction between a digit/number being part of a comment or part of a coordinate (as already explained by Mike).
It'd be a bit overkill to create a grammar for this file format: processing it line by line would be a better choice. But given this is more of an exercise to get familiar with ANTLR, I'll suggest a way how you could do it.
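For comparison, the line-by-line approach mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names parse_xyz and Atom are made up here, the file is assumed to hold a single molecule, and the comment line is taken verbatim, whatever it contains.

```python
from typing import List, NamedTuple, Tuple


class Atom(NamedTuple):
    symbol: str
    coords: Tuple[float, ...]  # two or three coordinates


def parse_xyz(text: str) -> Tuple[int, str, List[Atom]]:
    """Parse one molecule: atom-count line, comment line, then atom lines."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected an atom-count line and a comment line")
    n_atoms = int(lines[0].strip())
    comment = lines[1]  # kept as-is; it may be empty or look like data
    atoms = []
    for line in lines[2:]:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        symbol, *numbers = fields
        if len(numbers) not in (2, 3):
            raise ValueError(f"expected 2 or 3 coordinates: {line!r}")
        atoms.append(Atom(symbol, tuple(float(n) for n in numbers)))
    if len(atoms) != n_atoms:
        raise ValueError(f"header says {n_atoms} atoms, found {len(atoms)}")
    return n_atoms, comment, atoms
```

Because the position of a line decides how it is read, the comment never competes with numbers or atom symbols, which is exactly the ambiguity a single-mode lexer runs into.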
A solution would be to make the lexer a bit context sensitive so that it "knows" when it is in one of 3 modes:
first line mode: an integer number can be created
second line mode: any characters making a comment
last mode: remaining lines containing an atom + coordinates
ANTLR's lexer supports lexical modes, which let you switch the lexer between the modes I described above. To use lexical modes, however, you must put the lexer and parser grammars in separate files.
Here's what that might look like:
file: XYZLexer.g4
lexer grammar XYZLexer;
INTEGER
: [0-9]+
;
END_LINE_1
: [\r\n]+ -> skip, mode(COMMENT_MODE)
;
mode COMMENT_MODE;
COMMENT
: ~[\r\n]+
;
END_LINE_2
: [\r\n]+ -> skip, mode(ATOM_MODE)
;
mode ATOM_MODE;
ATOM
: [a-zA-Z]
;
NUMBER
: '-'? [0-9]+ '.' [0-9]+
;
SPACES
: [ \t]+ -> skip
;
LINE_BREAK
: [\r\n]+
;
file: XYZParser.g4
parser grammar XYZParser;
options {
tokenVocab=XYZLexer;
}
xyz_file
: INTEGER COMMENT atom_lines EOF
;
atom_lines
: atom ( LINE_BREAK+ atom )* LINE_BREAK*
;
atom
: ATOM coordinate
;
coordinate
: NUMBER+
;
With a parser generated from the above grammar(s), input like:
2
comment example
C 0.00000 1.40272 0.00000
H 0.00000 2.49029 0.00000
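would be parsed into an xyz_file node holding the INTEGER 2, the COMMENT comment example, and an atom_lines node with two atom children, each an ATOM followed by a coordinate made of three NUMBER tokens.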
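To see what the modes do to the token stream, here is a rough Python simulation of the lexer grammar above. It is a sketch, not ANTLR itself: the MODES table and the tokenize function are hypothetical names, the regular expressions are hand translations of the lexer rules, and only the mode switching and skip behaviour are imitated.

```python
import re

# Per mode: (token name, regex, mode to switch to or None, skip the token?)
MODES = {
    "DEFAULT": [
        ("INTEGER", r"[0-9]+", None, False),
        ("END_LINE_1", r"[\r\n]+", "COMMENT_MODE", True),
    ],
    "COMMENT_MODE": [
        ("COMMENT", r"[^\r\n]+", None, False),
        ("END_LINE_2", r"[\r\n]+", "ATOM_MODE", True),
    ],
    "ATOM_MODE": [
        ("ATOM", r"[a-zA-Z]", None, False),
        ("NUMBER", r"-?[0-9]+\.[0-9]+", None, False),
        ("SPACES", r"[ \t]+", None, True),
        ("LINE_BREAK", r"[\r\n]+", None, False),
    ],
}


def tokenize(text):
    """Return (name, text) pairs, using only the rules of the current mode."""
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        best = None
        for name, pattern, next_mode, skip in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            # Keep the longest match; on a tie the earlier rule stays.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, name, next_mode, skip)
        if best is None:
            raise ValueError(f"no rule in {mode} matches at offset {pos}")
        m, name, next_mode, skip = best
        if not skip:
            tokens.append((name, m.group()))
        if next_mode:
            mode = next_mode
        pos = m.end()
    return tokens
```

A second line such as 12 34 comes out as one COMMENT token here, because the NUMBER and ATOM rules simply do not exist while the lexer is in COMMENT_MODE.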

The first step in ANTLR parsing your input is to convert your input stream of characters into a stream of tokens. This process uses your Lexer rules (the rules that begin with a capital letter). At this point the parser rules are irrelevant; they act on the stream of tokens that the Lexer produces.
When the Lexer (aka tokenizer) tokenizes your input characters, it evaluates your input against all of your Lexer rules. When more than one rule can match your input, there are two "tie-breaker" strategies:
The Lexer rule that matches the longest stream of input characters will take top priority.
If there is more than one rule that matches the same (longest) sequence of characters, then the rule that appears first "wins".
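The two tie-breakers can be illustrated with a small Python sketch. The rule names, the RULES list and the next_token function are made up for this example, and Python's re module stands in for ANTLR's matching, which is only a fair approximation because these simple patterns always match greedily.

```python
import re

# Rules in grammar order; LINE plays the part of a COMMENT rule that
# swallows everything up to the end of the line.
RULES = [
    ("INT", r"[0-9]+"),
    ("FLOAT", r"-?[0-9]+\.[0-9]*"),
    ("LINE", r"[^\r\n]+"),
]


def next_token(rules, text, pos=0):
    """Pick the longest match; among equally long matches the first rule wins."""
    best_name, best_end = None, pos
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strictly longer only, so an earlier rule keeps a tie.
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    if best_name is None:
        raise ValueError(f"no rule matches at offset {pos}")
    return best_name, text[pos:best_end]
```

With these rules, next_token(RULES, "12") yields INT because INT and LINE tie and INT is listed first, while next_token(RULES, "12.5 abc") yields LINE because it matches more characters than FLOAT does.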
In your grammar, the COMMENT rule as posted (~[\n\r].*?) matches only a single character, because the trailing non-greedy .*? consumes nothing; but any variant that reads to the end of the line (for example ~[\n\r]+) is going to match the complete contents of any line. With such a rule, none of your other Lexer rules really stand a chance (excepting the NEWLINE rule of course). Having your other Lexer rules before the COMMENT rule won't matter, because they match a shorter stream of input characters than the COMMENT rule.
Looking at what little "specs" there are at the link you provided, this is going to be rather difficult. (Note: this is why most languages have some sort of "start a comment" token, often //.)
If you've followed the ANTLR setup in the intro and have defined the grun alias, a good starting point is always to run your input through grun with the -tokens flag to see how the Lexer interprets your input stream as a stream of tokens.
You might have some success with a semantic predicate on your COMMENT rule that checks for a line beginning with an atomic symbol or a number, and returns false to prevent the COMMENT rule from matching, but the file format seems to be pretty "relaxed", so this might not be very manageable.
The short answer is that your COMMENT rule will have to reject input that's not a comment in the XYZ format, and that seems rather ambiguous.

Related

What's wrong with my grammar in Antlr 3?

grammar even_numbers;
NUMBER : '0'..'9';
EVEN_NUMBER : '2' | '4' | '6' | '8';
signedEvenNumber : ('+' | '-' | ) NUMBER? EVEN_NUMBER;
The error is:
error(208): :4:1: The following token definitions can never be matched because prior tokens match the same input: EVEN_NUMBER
The error is quite clear, if you read it carefully: the EVEN_NUMBER cannot be matched since NUMBER will match what EVEN_NUMBER also matches. And NUMBER is getting precedence because it is defined before EVEN_NUMBER.
What you can do is this:
signedEvenNumber : ('+' | '-' | ) number? EVEN_NUMBER;
number : ZERO | ODD_NUMBER | EVEN_NUMBER;
ZERO : '0';
ODD_NUMBER : '1' | '3' | '5' | '7' | '9';
EVEN_NUMBER : '2' | '4' | '6' | '8';
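As a quick sanity check of the rewritten rules, here is a small Python sketch that classifies each character into one of the disjoint token types and then applies the signedEvenNumber rule by hand. The function names and the SIGN pseudo-token are assumptions made for this illustration; ANTLR would generate its own lexer and parser.

```python
def classify(ch):
    """Map one character to a token type; the three digit sets do not overlap."""
    if ch == "0":
        return "ZERO"
    if ch in "13579":
        return "ODD_NUMBER"
    if ch in "2468":
        return "EVEN_NUMBER"
    if ch in "+-":
        return "SIGN"  # stands in for the literal '+' and '-' tokens
    raise ValueError(f"unexpected character {ch!r}")


def is_signed_even_number(text):
    """signedEvenNumber : ('+' | '-' | ) number? EVEN_NUMBER"""
    tokens = [classify(ch) for ch in text]
    if tokens and tokens[0] == "SIGN":
        tokens = tokens[1:]
    if len(tokens) == 2:
        # number : ZERO | ODD_NUMBER | EVEN_NUMBER
        return tokens[0] in ("ZERO", "ODD_NUMBER", "EVEN_NUMBER") and tokens[1] == "EVEN_NUMBER"
    return tokens == ["EVEN_NUMBER"]
```

Since no character belongs to two token types any more, the lexer never has to choose between NUMBER and EVEN_NUMBER; the choice moves into the parser rule number.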

ANTLR chaining 1 to 1 grammar rules together to solve conditionals

If you look at the Objective-C ANTLR v3 grammar (http://www.antlr3.org/grammar/1212699960054/ObjectiveC2ansi.g) and many of the other popular grammars out there, they use a structure similar to this for solving conditionals:
conditional_expression : logical_or_expression
('?' logical_or_expression ':' logical_or_expression)? ;
constant_expression : conditional_expression ;
logical_or_expression : logical_and_expression
('||' logical_and_expression)* ;
logical_and_expression : inclusive_or_expression
('&&' inclusive_or_expression)* ;
inclusive_or_expression : exclusive_or_expression
('|' exclusive_or_expression)* ;
exclusive_or_expression : and_expression ('^' and_expression)* ;
and_expression : equality_expression ('&' equality_expression)* ;
equality_expression : relational_expression
(('!=' | '==') relational_expression)* ;
relational_expression : shift_expression
(('<' | '>' | '<=' | '>=') shift_expression)* ;
shift_expression : additive_expression (('<<' | '>>') additive_expression)* ;
additive_expression : multiplicative_expression
(('+' | '-') multiplicative_expression)* ;
multiplicative_expression : cast_expression
(('*' | '/' | '%') cast_expression)* ;
cast_expression : '(' type_name ')' cast_expression | unary_expression ;
unary_expression
: postfix_expression
| '++' unary_expression
| '--' unary_expression
| unary_operator cast_expression
| 'sizeof' ('(' type_name ')' | unary_expression) ;
unary_operator : '&' | '*' | '-' | '~' | '!' ;
If you read it you'll notice they do this very long 1 to 1 chain of conditionals from conditional_expression to logical_or_expression to logical_and_expression to inclusive_or_expression to exclusive_or_expression.
Now, I am quite naive when it comes to ANTLR, but this strikes me as an odd way to parse conditionals. It seems very complicated for the definition of a logical_or_expression to twist through every single other conditional expression type. After all, what does the definition of a logical OR have to do with a left bitwise shift?
Is there perhaps a better way, or is there a specific reason this method is required?
As already mentioned, the "chain" is needed to properly handle operator precedence. Without it, input like 1+2*3 would be parsed as:
    *
   / \
  +   3
 / \
1   2
instead of:
  +
 / \
1   *
   / \
  2   3
Since ANTLR 4 supports directly left-recursive rules:
foo
: foo '?' foo
| TOKEN
;
but not indirectly left-recursive rules:
foo
: bar
| TOKEN
;
bar
: foo '?' foo
;
you can rewrite these rules as follows:
expression
: '-' expression
| '(' type_name ')' expression
| expression ('*' | '/' | '%') expression
| expression ('+' | '-') expression
| expression ('<<' | '>>') expression
| expression ('<' | '>' | '<=' | '>=') expression
| expression ('!=' | '==') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| IDENTIFIER
| NUMBER
;
If the parser now stumbles upon an expression, it will first look for ('*' | '/' | '%'), and if that's not there, it will look for ('+' | '-'), etc. In other words, the alternatives placed first in the rule will get precedence over alternatives placed lower in the rule.
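The effect of that ordering can be imitated with precedence climbing, a related hand-written technique. The sketch below is not what ANTLR generates; the PRECEDENCE table, the function names and the restriction to three operator groups are assumptions for illustration. A higher number corresponds to an alternative listed earlier in the rule.

```python
import re

# Higher number = binds tighter = listed earlier in the expression rule.
PRECEDENCE = {"*": 3, "/": 3, "%": 3, "+": 2, "-": 2, "<<": 1, ">>": 1}


def tokenize(text):
    """Split into integers, operators and parentheses; whitespace is dropped."""
    return re.findall(r"\d+|<<|>>|[-+*/%()]", text)


def parse(tokens, min_prec=1):
    """Return a nested (operator, left, right) tuple for a binary expression."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Asking for a strictly higher level on the right keeps operators
        # of the same level left-associative.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # consume the closing ')'
        return inner
    return int(tok)
```

Calling parse(tokenize("1+2*3")) gives ('+', 1, ('*', 2, 3)), the second of the two trees drawn earlier, because the multiplication is grouped before the addition can claim its right operand.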
Now I know from your earlier question, Once grammar is complete, what's the best way to walk an ANTLR v4 tree?, that you're using a listener to "walk" the tree. If you create an expression rule as I just showed, you'd need to do a lot of manual inspections in your enterExpression(...) and exitExpression(...) methods to find out which of the alternatives matched an expression. This is where "labels" come in handy. You simply label each alternative in your expression rule:
expression
: '-' expression #unaryExpr
| '(' type_name ')' expression #castExpr
| expression ('*' | '/' | '%') expression #multExpr
| expression ('+' | '-') expression #addExpr
| expression ('<<' | '>>') expression #...
| expression ('<' | '>' | '<=' | '>=') expression
| expression ('!=' | '==') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| IDENTIFIER
| NUMBER
;
(note that when you label one, you must label them all!)
And then the base listener class will have enter and exit methods for each labeled alternative:
public void enterUnaryExpr(...)
public void exitUnaryExpr(...)
public void enterCastExpr(...)
public void exitCastExpr(...)
public void enterMultExpr(...)
public void exitMultExpr(...)
...
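To show why the labels save manual inspection, here is a self-contained Python imitation of a tree walker that dispatches on the label of each node. Node, walk and OperatorLog are hypothetical stand-ins, not the classes ANTLR generates; only the enterXxx/exitXxx naming pattern is taken from the text above.

```python
class Node:
    """A parse-tree node tagged with the label of the alternative that matched."""

    def __init__(self, label, text, children=()):
        self.label = label
        self.text = text
        self.children = list(children)


def walk(listener, node):
    """Depth-first walk calling enter<Label> and exit<Label> when they exist."""
    enter = getattr(listener, "enter" + node.label, None)
    if enter:
        enter(node)
    for child in node.children:
        walk(listener, child)
    leave = getattr(listener, "exit" + node.label, None)
    if leave:
        leave(node)


class OperatorLog:
    """Records operators in the order their subtrees are completed."""

    def __init__(self):
        self.seen = []

    def exitMultExpr(self, node):
        self.seen.append(node.text)

    def exitAddExpr(self, node):
        self.seen.append(node.text)


# Hand-built tree for 1+2*3, shaped the way the labeled rule would parse it.
tree = Node("AddExpr", "+", [
    Node("AtomExpr", "1"),
    Node("MultExpr", "*", [Node("AtomExpr", "2"), Node("AtomExpr", "3")]),
])
log = OperatorLog()
walk(log, tree)
```

Each callback already knows which alternative matched, so none of them has to look at the child tokens to tell a multiplication from an addition.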
There is a very good reason for doing it this way: operator precedence. Taking your example of the logical OR and left bitwise shift, think about something like
if (a << b || c)
Objective-C precedence rules say that the '<<' has precedence, so the correct way to evaluate this is
(a << b) || c
The parser rules manage this with the chain you mention: because the rule for '||' is higher up in the chain, the parser correctly yields a << b as a subexpression of the || operator.
There is no better way in ANTLR 3; in ANTLR 4, however, there is, because ANTLR 4 allows directly left-recursive rules. I highly recommend "The Definitive ANTLR 4 Reference", as it has a very good explanation of this issue.

ANTLR: resolving non-LL(*) problems

Assume the following rules in EBNF:
<datum> --> <simple datum> | <compound datum>
<simple datum> --> <boolean> | <number>
| <character> | <string> | <symbol>
<symbol> --> <identifier>
<compound datum> --> <list> | <vector>
<list> --> (<datum>*) | (<datum>+ . <datum>)
| <abbreviation>
<abbreviation> --> <abbrev prefix> <datum>
<abbrev prefix> --> ' | ` | , | ,#
<vector> --> #(<datum>*)
For the list rule, the ANTLR grammar would look like:
list : '(' datum+ '.' datum ')'
| '(' datum* ')'
| ABBREV_PREFIX datum
;
which produces a non-LL(*) decision error for alts 1,2.
I tried to refactor this rule but can't come up with something that works.
For example:
list : '(' datum* (datum'.' datum)? ')'
| ABBREV_PREFIX datum
;
produces the same error. The main problem for me is that one alternative uses a + while the other uses a *, so the left-factoring isn't as simple as it usually is.
Your list rule:
//          A       B
//          |       |
list //     |       |
    : '(' datum* (datum '.' datum)? ')'
    | ABBREV_PREFIX datum
    ;
does not know when a datum should be matched by "sub"-production rule A or B. You'll need to do it like this:
list
: '(' (datum+ ('.' datum)?)? ')' // also matches: '(' datum* ')'
| ABBREV_PREFIX datum
;
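A hand-written recognizer makes it visible why the left-factored form is easy to decide: after reading datum+, a single token of lookahead ('.' or ')') settles what comes next. The Python sketch below is an illustration under assumptions: it treats every token other than parentheses and the dot as a simple datum, leaves out the ABBREV_PREFIX alternative, and uses made-up function names.

```python
def parse_datum(tokens, pos):
    """Return the position just after one datum that starts at pos."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        return parse_list(tokens, pos)
    if tok in (")", "."):
        raise SyntaxError(f"unexpected {tok!r}")
    return pos + 1  # any other token counts as a simple datum


def parse_list(tokens, pos):
    """list : '(' (datum+ ('.' datum)?)? ')'"""
    pos += 1  # consume '('
    count = 0
    while pos < len(tokens) and tokens[pos] not in (")", "."):
        pos = parse_datum(tokens, pos)
        count += 1
    if pos < len(tokens) and tokens[pos] == ".":
        if count == 0:
            raise SyntaxError("'.' needs at least one datum before it")
        pos = parse_datum(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != ")":
        raise SyntaxError("expected ')'")
    return pos + 1


def is_list(text):
    """True if the whole text is one list; the dot must be space-separated."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    try:
        return bool(tokens) and tokens[0] == "(" and parse_list(tokens, 0) == len(tokens)
    except SyntaxError:
        return False
```

The loop never has to guess whether a datum belongs to the part before or after the dot, which is the guess that the original datum* (datum '.' datum)? formulation forced on the parser.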
How about:
list : '(' ')'
| '(' datum+ ('.' datum )? ')'
| ABBREV_PREFIX datum
;

ANTLR AST building problem

I am unable to get the AST for
" risk & factors | concise" | item 503
using the following grammar:
grammar BoolExpRep;
options {
output=AST;
}
tokens {
MultiWordNode;
}
start
: exp EOF!
;
exp
: atom (expRest^)?
| '('! exp ')'! (expRest^)?
;
expRest
: OPER exp -> ^(OPER exp)
| '~' WORD exp -> ^('~' WORD exp)
;
OPER
: '&' | '|' | '!' | '&!' | '|!'
;
atom
: WORD WORD+ -> ^(MultiWordNode WORD+)
| WORD
| '"' WORD (OPER? WORD)+ '"' -> ^(MultiWordNode '"' (OPER? WORD)+ '"')
| '"' WORD '"'
;
WORD
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '.' | '*' | '{' | '}' | '_' | '?' | ':'
| ',' | '/' | '\\')+ ('0'..'9')*
;
I want to make an AST which should show words and operators in order for
" risk & factors | concise "
i.e. my requirement is something like:
" risk & factors | concise "
Instead, what I get is:
" & risk & factors & concise "
Actually, I am referring to the AST under MultiWordNode (the one generated at the 'atom' level).
It should look something like this:
M u l t i W o r d N o d e
/ / / | \ \ \
/ / / | \ \ \
" risk & factors | concise "
(sorry for my bad drawing :) )
The problem is that when an operator occurs outside quotes, it should become the head of its siblings (as it does in the AST), but when it occurs inside quoted text, it should be captured just like the other words.
Using this as atom should do the trick:
atom
: WORD WORD+ -> ^(MultiWordNode WORD+)
| WORD
| '"' WORD (OPER WORD)+ '"' -> ^(MultiWordNode '"' WORD (OPER WORD)+ '"')
| '"' WORD '"'
;

mismatchedtoken with antlr syntactic predicates

I have the following lexer rules in my grammar file:
LINE : 'F' | 'G';
RULE : (('->' ('F' | 'G')) => 'F' | 'G' )
| LINE LINE + | LINE * (ROTATE + LINE+)+ ;
fragment ROTATE : ('/' | '\\');
I'm basically trying to match productions that look like F -> F/F\F\F/F. It successfully matches stuff like the above, but I'm guessing there's a problem with my syntactic predicate, since G -> G produces a MismatchedTokenException. The predicate serves to disambiguate between single letters on the lhs of '->', which I want to be recognized as the LINE token, and those on the rhs, which should be RULEs.
Any idea what I'm doing wrong?
Note that the rule:
RULE
: (('->' ('F' | 'G')) => 'F' | 'G')
| LINE LINE +
| LINE * (ROTATE + LINE+)+
;
matches a single G without the predicate. The rule above could be rewritten as:
RULE
: ( ('->' ('F' | 'G')) => 'F'
| 'G'
)
| LINE LINE +
| LINE * (ROTATE + LINE+)+
;
which in its turn equals:
RULE
: ('->' ('F' | 'G')) => 'F'
| 'G'
| LINE LINE +
| LINE * (ROTATE + LINE+)+
;
Perhaps you meant to do something like this:
RULE
: ('->' ('F' | 'G')) => ('F' | 'G')
| LINE LINE +
| LINE * (ROTATE + LINE+)+
;