How to read alternates in EBNF grammars - grammar

I have an EBNF grammar that has a few rules with this pattern:
sequence ::=
item
| item extra* sequence
Is the above equivalent to the following?
sequence ::=
item (extra* sequence)*
Edit
Due to some of you observing bugs or ambiguities in both sequences, I'll give a specific example. The SVG specification provides a grammar for path data. This grammar has several producers with this pattern:
lineto-argument-sequence:
coordinate-pair
| coordinate-pair comma-wsp? lineto-argument-sequence
Could the above be rewritten as the following?
lineto-argument-sequence:
coordinate-pair (comma-wsp? lineto-argument-sequence)*

Not really, they seem to have different bugs. The first sequence is ambiguous around "item" seeing that "extra" is optional. You could rewrite it as the following to remove ambiguity:
sequence3 ::=
item extra* sequence3
The second one is ambigous around "extra", seeing as it is basically two nested loops both starting with "extra". You could rewrite it as the following to remove ambiguity:
sequence4 ::=
item ((extra|item))*
Your first version will likely choke on an input sequence consisting of a single "item" (it depends on the parser implementation) because it won't disambiguate.
My rewrites assume you want to match a sequence starting with "item" and optionally followed by a series of (0 or more) "item" or "extra" in any order.
e.g.
item
item extra
item extra item
item extra extra item
item item item item
item item item item extra
etc.
Without additional information I would be personally inclined towards the option I labled "sequence4" as all the other options are merely using recursion as an expensive loop construct. If you are willing to give me more information I may be able to give a better answer.
EDIT: based on Jorn's excellent observation (with a small mod).
If you rewrite "sequence3" to remove recursion you get the following:
sequence5 ::=
(item extra*)+
It think this will be my prefered version, not "sequence4".
I have to point out that all three versions above are functionally equivalent (as recognizers or generators). The parse trees for 3 would be different to 4 and 5, but I cannot think that that would affect anything other than perhaps performance.
EDIT:
Concerning the following:
lineto-argument-sequence:
coordinate-pair
| coordinate-pair comma-wsp? lineto-argument-sequence
What this production says is that a lineto-argument-sequence is composed of at least one coordinate-pair followed by zero or more coordinate-pairs seperated by optional white/comma. Any of the following would constitute a lineto-argument-sequence (read -> as 'becomes'):
1,2 -> (1, 2)
1.5.6 -> (1.5, 0.6)
1.5.06 -> (1.5, 0.06)
2 3 3 4 -> (2,3) (3,4)
2,3-3-4 -> (2,3) (-3,-4)
2 3 3 -> ERROR
So a coordinate-pair is really any 2 consecutive numbers.
I have mocked up a grammar in ANTLR that seems to work. Note the pattern used for lineto_argument_sequence is similar to the one Jorn and I recommended previously.
grammar SVG;
lineto_argument_sequence
: coordinate_pair (COMMA_WSP? coordinate_pair)*
;
coordinate_pair
: coordinate COMMA_WSP? coordinate
;
coordinate
: NUMBER
;
COMMA_WSP
: ( WS+|WS*','WS*) //{ $channel=HIDDEN; }
;
NUMBER
: '-'? (INT | FLOAT) ;
fragment
INT
: '0'..'9'+ ;
fragment
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
fragment
WS : ' ' | '\t' | '\r' | '\n' ;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
Given the following input:
2, 3 -3 -4 5.5.65.5.6
it produces this parse tree.
alt text http://www.freeimagehosting.net/uploads/85fc77bc3c.png

This rule would also be equivalent to sequence ::= (item extra*)*, thus removing the recursion on sequence.

Yes, those two grammars describe the same language.
But is that really EBNF? Wikipedia article on EBNF does not include the Kleene star operator.

Related

Grammars: How to add a level of precedence

So lets say I have the following Context Free Grammar for a simple calculator language:
S->TS'
S'->OP1 TE'|e
T->FT'
T'->OP2 FT'|e
F->id|(S)
OP1->+|-
OP2->*|/
As one can see the * and / have higher precedence over + and -.
However, how can I add another level of precedence? Example would be for exponents, ^, (ex:3^2=9) or something else? Please explain your procedure and reasoning on how you got there so I can do it for other operators.
Here's a more readable grammar:
expr: sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
factor: ID
| '(' expr ')'
add_op: '+' | '-'
mul_op: '*' | '/'
This can be easily extended using the same pattern:
expr: bool
bool: bool or_op conj
| conj
conj: conj and_op comp
| comp
/* This one doesn't allow associativity. No a < b < c in this language */
comp: sum comp_op sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
/* Here we'll add an even higher precedence operators */
/* Unlike the other operators, though, this one is right associative */
factor: atom exp_op factor
| atom
atom: ID
| '(' expr ')'
/* I left out the operator definitions. I hope they are obvious. If not,
* let me know and I'll put them back in
*/
I hope the pattern is more or less obvious there.
Those grammars won't work in a recursive descent parser, because recursive descent parsers choke on left recursion. The grammar you have has been run through a left-recursion elimination algorithm, and you could do that to the grammar above as well. But note that eliminating left recursion more or less erases the difference between left- and right-recursion, so after you identify the parse with a recursive descent grammar, you need to fix it according to your knowledge about the associativity of the operator, because associativity is no longer inherent in the grammar.
For these simple productions, eliminating left-recursion is really simple, in two steps. We start with some non-terminal:
foo: foo foo_op bar
| bar
and we flip it around so that it is right associative:
foo: bar foo_op foo
| bar
(If the operator was originally right associative, as with exponentiation above, then this step isn't needed.)
Then we need to left-factor, because LL parsing requires that every alternative for a non-terminal has a unique prefix:
foo : bar foo'
foo': foo_op foo
| ε
Doing that to every recursive production above (that is, all of them except for expr, comp and atom) will yield a grammar which looks like the one you started with, only with more operators.
In passing, I emphasize that there is no mysterious magical force at work here. When the grammar says, for example:
term: term mul_op factor
| factor
what it's saying is that a term (or product, if you prefer) cannot be the right-hand argument of a multiplication, but it can be the left-hand argument. It's also saying that if you're at a point in which a product would be valid, you don't actually need something with a multiplication operator; you can use a factor instead. But obviously you cannot use a sum, since factor doesn't parse expressions with a sum operator. (It does parse anything inside parentheses. But those are things inside parentheses.)
That's the sense in which both associativity and precedence are implicit in the grammar.

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is to have only ID or TEXT tokens and distinguish them with a label, like this :
grammar Question;
question
#init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule so StringCharacter matches everything except quote and backslash chars (+ the escape sequenc). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend to define lexer rules for string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto generated ones) but you also can better control the match order.
Yet a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
Gets a bit more complex with Whitespace rules. Another solution would be with semantic predicates looking ahead and preventing the read of keywords.

Capturing formatted variable declarations in ANTLR

I have a simple lexer/grammar I've been working on and I'm having trouble understanding the standard operating procedure for matching formatted variables. I am trying to match the following:
Variable name can be 1 character minimum. If it is one char, it must be an uppercase or lowercase letter.
If it is greater than 1 character, it must begin with a letter of any case, and then be followed by any number of characters, including numbers, underscore and the dollar sign.
I've rewritten this several times, in many flavors, and I always get the following error:
Decision can match input such as "SINGLELETTER" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input"
Would really appreciate some insight. I understand there is some ambiguity in my grammar, but I am a bit confused why multiple alternatives can be matched, once we enter the original matching loop. Thank you!
variablename
: (SINGLELETTER)
| (SINGLELETTER|UNDERSCORE)( SINGLELETTER|UNDERSCORE | DOLLAR | NUMBER)*;
SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Why not make VariableName, a lexer rule which produces a single token for the entire name?
Variablename
: SINGLELETTER
| (SINGLELETTER|UNDERSCORE) (SINGLELETTER | UNDERSCORE | DOLLAR | NUMBER)*;
fragment SINGLELETTER : ( 'a'..'z' | 'A'..'Z');
fragment LOWERCASE : 'a'..'z';
fragment UNDERSCORE : '_';
fragment DOLLAR : '$';
fragment NUMBER : '0'..'9';
Also, the way you wrote variableName does not follow point #2 you wrote (the grammar allows the variable to start with _, but you didn't allow that in your explanation).

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.