How can I construct a clean, Python like grammar in ANTLR? - antlr

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?

I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;

How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;

I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.

Related

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is to have only ID or TEXT tokens and distinguish them with a label, like this :
grammar Question;
question
#init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

ANTLR - how to skip missing tokens in a 'for' loop

I'm developing a 'toy' language to learn antlr.
My construct for a for loop look like this.
for(4,10){
//program expressions
};
I have a grammar that I think works, but it's a little ugly. Specifically I'm not sure that I've handled the semantically unimportant tokens very well.
For example, the comma in the middle there appears as a token, but it's unimportant to the parser, it just needs the 2 and the 3 for the loop bounds. This means when I see the child() elements for the parts of the loop token, I have to skip the unimportant ones.
You can probably see this best if you examine the ANTLR viewer and look at the parse tree for this. The red arrows point to the tokens I think are redundant.
Feel like I should be making more use of the skip() feature than I am, but I can't see how to insert into the grammar for the tokens at this level.
loop: 'for(' foridxitem ',' foridxitem '){' (programexpression)+ '}';
foridxitem: NUM #ForIndexNumÌ
|
var #ForIndexVar;
The short answer is Antlr produces a parse-tree, so there will always be cruft to step over or otherwise ignore when walking the tree.
The longer answer is that there is a tension between skipping cruft in the lexer and producing tokens of limited syntactic value that are nonetheless necessary for writing unambiguous rules.
For example, you identify for( as a candidate for skipping, yet is probably syntactically required. Conversely, the parameters comma could be truly without syntactic meaning. So, you might clean it up in the lexer (and parser) this way:
FOR: 'for(' -> pushMode(params) ;
ENDLOOP: '}' ;
WS: .... -> skip() ;
mode params;
NUM: .... ;
VAR: .... ;
COMMA: ',' -> skip() ;
ENDPARAMS: '){' -> skip(), popMode() ;
P_WS: .... -> skip() ;
Your parer rule then becomes
loop: FOR foridxitem* programexpression+ ENDLOOP ;
foridxitem: NUM | VAR ;
programexpression: .... ;
That should clean up the tree a fair bit.

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule so StringCharacter matches everything except quote and backslash chars (+ the escape sequenc). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend to define lexer rules for string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto generated ones) but you also can better control the match order.
Yet a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
Gets a bit more complex with Whitespace rules. Another solution would be with semantic predicates looking ahead and preventing the read of keywords.

Antlr Lexer exclude a certain pattern

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space letter but not '.{' inside it. Best I can come up with is using a semantics predicate.
WORD: WL+ {!getText().contains(".{")};
WL: ~[ \n\r\t];
I'm a bit worried to use semantics predicate though cause WORD here will be lexed millions of times I would think to put a semantics predicate will hit the performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { in its letter.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]*

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.