Grammars: How to add a level of precedence

So let's say I have the following context-free grammar for a simple calculator language:
S->TS'
S'->OP1 TS'|e
T->FT'
T'->OP2 FT'|e
F->id|(S)
OP1->+|-
OP2->*|/
As one can see, * and / have higher precedence than + and -.
However, how can I add another level of precedence? An example would be exponentiation, ^ (e.g. 3^2=9), or something else. Please explain your procedure and the reasoning behind it so I can do the same for other operators.

Here's a more readable grammar:
expr: sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
factor: ID
| '(' expr ')'
add_op: '+' | '-'
mul_op: '*' | '/'
This can be easily extended using the same pattern:
expr: bool
bool: bool or_op conj
| conj
conj: conj and_op comp
| comp
/* This one doesn't allow associativity. No a < b < c in this language */
comp: sum comp_op sum
| sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
/* Here we'll add an even higher precedence operator */
/* Unlike the other operators, though, this one is right associative */
factor: atom exp_op factor
| atom
atom: ID
| '(' expr ')'
/* I left out the operator definitions. I hope they are obvious. If not,
* let me know and I'll put them back in
*/
I hope the pattern is more or less obvious there.
Those grammars won't work in a recursive descent parser, because recursive descent parsers choke on left recursion. The grammar you have has been run through a left-recursion elimination algorithm, and you could do that to the grammar above as well. But note that eliminating left recursion more or less erases the difference between left- and right-recursion, so after you identify the parse with a recursive descent grammar, you need to fix it according to your knowledge about the associativity of the operator, because associativity is no longer inherent in the grammar.
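To make that last point concrete, here is a minimal Python sketch (not from the original answer). It assumes the LL parse of one precedence level has been flattened into a first operand plus a list of (operator, operand) pairs; the names fold_left and fold_right are made up for illustration.

```python
from functools import reduce

def fold_left(first, rest):
    """Group a flat operand list as a left-associative tree: ((a op b) op c)."""
    return reduce(lambda left, pair: (pair[0], left, pair[1]), rest, first)

def fold_right(first, rest):
    """Group the same flat list as a right-associative tree: (a op (b op c))."""
    if not rest:
        return first
    (op, operand), tail = rest[0], rest[1:]
    return (op, first, fold_right(operand, tail))

# "a - b - c" as an LL parser sees it: one operand, then (operator, operand) pairs.
flat = ("a", [("-", "b"), ("-", "c")])
left_tree = fold_left(*flat)    # ('-', ('-', 'a', 'b'), 'c')
right_tree = fold_right(*flat)  # ('-', 'a', ('-', 'b', 'c'))
```

The parser sees the same token sequence either way; which fold you apply is the "knowledge about the associativity of the operator" that the grammar no longer carries.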
For these simple productions, eliminating left-recursion is really simple, in two steps. We start with some non-terminal:
foo: foo foo_op bar
| bar
and we flip it around so that it is right associative:
foo: bar foo_op foo
| bar
(If the operator was originally right associative, as with exponentiation above, then this step isn't needed.)
Then we need to left-factor, because LL parsing requires that every alternative for a non-terminal has a unique prefix:
foo : bar foo'
foo': foo_op foo
| ε
Doing that to every recursive production above (that is, all of them except for expr, comp and atom) will yield a grammar which looks like the one you started with, only with more operators.
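As a rough illustration of where this ends up, here is a small recursive descent evaluator in Python for the sum/term/factor/atom levels. It is only a sketch under a few assumptions: atoms are non-negative integers rather than ID, the class and method names are invented, and each `foo'` tail is written as a loop (for the left-associative operators) or a recursive call (for the right-associative `^`).

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/^()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_sum(self):
        # sum: term (add_op term)*   -- the loop applies operators left to right
        value = self.parse_term()
        while self.peek() in ("+", "-"):
            op, rhs = self.take(), self.parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term(self):
        # term: factor (mul_op factor)*   -- one level tighter than sum
        value = self.parse_factor()
        while self.peek() in ("*", "/"):
            op, rhs = self.take(), self.parse_factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def parse_factor(self):
        # factor: atom (exp_op factor)?   -- recursion on the right: right associative
        base = self.parse_atom()
        if self.peek() == "^":
            self.take()
            return base ** self.parse_factor()
        return base

    def parse_atom(self):
        token = self.take()
        if token == "(":
            value = self.parse_sum()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

def evaluate(text):
    parser = Parser(text)
    value = parser.parse_sum()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

With this sketch, evaluate("1-2-3") gives -4 (left associative) while evaluate("2^3^2") gives 512 (right associative). Adding another precedence level means adding one more method between two existing ones and calling the next-tighter level from it.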
In passing, I emphasize that there is no mysterious magical force at work here. When the grammar says, for example:
term: term mul_op factor
| factor
what it's saying is that a term (or product, if you prefer) cannot be the right-hand argument of a multiplication, but it can be the left-hand argument. It's also saying that if you're at a point in which a product would be valid, you don't actually need something with a multiplication operator; you can use a factor instead. But obviously you cannot use a sum, since factor doesn't parse expressions with a sum operator. (It does parse anything inside parentheses. But those are things inside parentheses.)
That's the sense in which both associativity and precedence are implicit in the grammar.

Related

ANTLR4 Best practice on token ambiguities: Lexer predicate, or Parser tree walker

I have a question about a certain ambiguity I am encountering in a grammar I am currently working on. Here is the problem, in brief. Consider these two inputs:
1010
0101
In isolation, in my grammar the first input is interpreted as a decimal number, the second as an octal due to the leading zero.
However, if the preceding character to each of these sequences is a % then both would be interpreted as a binary number. This wouldn't be a problem if we stopped there.
Now, let's say before the % we encountered a 5, what would happen? Does my grammar consider each of these as valid input:
5%1010
5%0101
The answer is "Yes!" The rightmost sequences of 1s and 0s simply revert back to decimal and octal, respectively, and the % is a modulo operator.
This wouldn't be a problem if expressions in my grammar only consisted of digits, but that unfortunately is not the case, as any number of non-digit tokens could substitute for the 5 in the example above, like variables, braces, and even other math operators like parentheses and minus signs.
The solution I have come to in ANTLR is simply to have an expression rule where one of the alternatives concatenates an expression and a binary number, so you have:
expr
: expr Binary
| expr '%' expr
| Integer
| Octal
| Binary
;
Integer
: '0'
| [1-9] [0-9]*
;
Octal
: '0' [0-7]+
;
Binary
: '%' [01]+
;
I then leave it up to my visitor to actually "pull apart" the right hand side of the expression type above (the expr Binary one), and properly calculate the modulo, which means I have to "re-tokenize" essentially the % and following digits.
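For illustration only, the "pull apart" step could look roughly like this in Python. This is a sketch, not ANTLR visitor code: the function names are hypothetical, and it assumes the visitor already has the value of the left-hand expr and the raw text of the Binary token.

```python
import re

BINARY = re.compile(r"%([01]+)")

def reinterpret_as_modulo_operand(binary_token_text):
    """Re-read the digits of a Binary token as the Integer or Octal they would otherwise be."""
    match = BINARY.fullmatch(binary_token_text)
    if match is None:
        raise ValueError(f"not a Binary token: {binary_token_text!r}")
    digits = match.group(1)
    # Octal is '0' [0-7]+ ; Integer is '0' | [1-9] [0-9]*
    if len(digits) > 1 and digits[0] == "0":
        return int(digits, 8)
    return int(digits, 10)

def eval_expr_binary(left_value, binary_token_text):
    """Evaluate the `expr Binary` alternative as `left % right`."""
    return left_value % reinterpret_as_modulo_operand(binary_token_text)
```

So for the two inputs above, the right-hand side of 5%1010 becomes decimal 1010 and that of 5%0101 becomes octal 0101, i.e. 65.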
I guess my question is: Is this the best solution given my case? I fully accept it if so, but I am curious if others have had to resort to things like these.
I cooked up a lexer predicate to do some crazy lookaheads (and lookbehinds) in the input, but my instinct was that this felt wrong, as I was essentially hand-parsing rather than leveraging the tool itself to give me what I needed to work with.

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First, it complains that expr is recursive. How do I rewrite it so that it is not recursive? Second, when I try to compile and run it, it freezes the IDEA test instance when trying to parse this syntax; it looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, Grammar-Kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.
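To show that the repetition-based rules need no left recursion, here is a hand-written Python recognizer for the same two rules. It is only a sketch to illustrate the shape of the grammar (it is not what the parser generator emits), and the token names and function names are my own.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([-+*/()]))")
BINARY_OPS = {"+", "-", "*", "/"}

def tokenize(text):
    """Split input into ('NUMBER', text) tokens and single-character operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"bad character at offset {pos}")
        number, op = match.groups()
        tokens.append(("NUMBER", number) if number else (op, op))
        pos = match.end()
    return tokens

def parse_expr(tokens, pos=0):
    """expr ::= unary (('+' | '-' | '*' | '/') unary)*   Returns the position after the match."""
    pos = parse_unary(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] in BINARY_OPS:
        pos = parse_unary(tokens, pos + 1)
    return pos

def parse_unary(tokens, pos):
    """unary ::= '-'* ( '(' expr ')' | literal )"""
    while pos < len(tokens) and tokens[pos][0] == "-":
        pos += 1
    if pos < len(tokens) and tokens[pos][0] == "(":
        pos = parse_expr(tokens, pos + 1)   # recursion only inside parentheses
        if pos >= len(tokens) or tokens[pos][0] != ")":
            raise SyntaxError("expected ')'")
        return pos + 1
    if pos < len(tokens) and tokens[pos][0] == "NUMBER":
        return pos + 1
    raise SyntaxError("expected a number or '('")

def accepts(text):
    try:
        tokens = tokenize(text)
        return parse_expr(tokens) == len(tokens)
    except SyntaxError:
        return False
```

Every call consumes at least one token before it recurses, which is why this shape cannot loop forever the way a left-recursive rule does.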

shift/reduce error in yacc

I know this part of my grammar causes the error, but I don't know how to fix it. I even used %left and %right, but it didn't help. Can anybody please help me find out what the problem with this grammar is?
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
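The comparison described above can be sketched in a few lines of Python. This is a simplified model of the rule, not bison's actual code; the table mirrors the declarations in the question, and the names PRECEDENCE, NONTERMINALS, rule_precedence and resolve are invented for the example.

```python
# Precedence levels in declaration order (later lines bind tighter), as in the question.
PRECEDENCE = {
    "=": (1, "right"),
    "+": (2, "left"), "-": (2, "left"),
    "*": (3, "left"), "/": (3, "left"),
    "^": (4, "left"),
}
NONTERMINALS = {"expr"}

def rule_precedence(rhs, prec_token=None):
    """Precedence of a production: the %prec token if given, else its last terminal."""
    if prec_token is not None:
        return PRECEDENCE.get(prec_token)
    terminals = [symbol for symbol in rhs if symbol not in NONTERMINALS]
    return PRECEDENCE.get(terminals[-1]) if terminals else None

def resolve(rhs, lookahead, prec_token=None):
    """Decide a shift/reduce choice: 'shift', 'reduce', 'error' or 'conflict'."""
    rule = rule_precedence(rhs, prec_token)
    token = PRECEDENCE.get(lookahead)
    if rule is None or token is None:
        return "conflict"   # nothing to compare, so the conflict is reported
    if rule[0] != token[0]:
        return "reduce" if rule[0] > token[0] else "shift"
    # Same level: associativity decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token[1]]
```

For example, resolve(["expr", "+", "expr"], "*") is "shift", resolve(["expr", "-", "expr"], "-") is "reduce", and resolve(["expr", "expr"], "NUM") is "conflict", because the juxtaposition rule has no terminal to take a precedence from.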
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol, so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators: a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators. But if you are using = as an equality operator, it might not actually be what you intended.

How to read alternates in EBNF grammars

I have an EBNF grammar that has a few rules with this pattern:
sequence ::=
item
| item extra* sequence
Is the above equivalent to the following?
sequence ::=
item (extra* sequence)*
Edit
Since some of you observed bugs or ambiguities in both sequences, I'll give a specific example. The SVG specification provides a grammar for path data. This grammar has several productions with this pattern:
lineto-argument-sequence:
coordinate-pair
| coordinate-pair comma-wsp? lineto-argument-sequence
Could the above be rewritten as the following?
lineto-argument-sequence:
coordinate-pair (comma-wsp? lineto-argument-sequence)*
Not really, they seem to have different bugs. The first sequence is ambiguous around "item" seeing that "extra" is optional. You could rewrite it as the following to remove ambiguity:
sequence3 ::=
item extra* sequence3?
The second one is ambiguous around "extra", seeing as it is basically two nested loops both starting with "extra". You could rewrite it as the following to remove ambiguity:
sequence4 ::=
item ((extra|item))*
Your first version will likely choke on an input sequence consisting of a single "item" (it depends on the parser implementation) because it won't disambiguate.
My rewrites assume you want to match a sequence starting with "item" and optionally followed by a series of (0 or more) "item" or "extra" in any order.
e.g.
item
item extra
item extra item
item extra extra item
item item item item
item item item item extra
etc.
Without additional information I would personally be inclined towards the option I labeled "sequence4", as all the other options are merely using recursion as an expensive loop construct. If you are willing to give me more information I may be able to give a better answer.
EDIT: based on Jorn's excellent observation (with a small mod).
If you rewrite "sequence3" to remove recursion you get the following:
sequence5 ::=
(item extra*)+
I think this will be my preferred version, not "sequence4".
I have to point out that all three versions above are functionally equivalent (as recognizers or generators). The parse trees for 3 would be different to 4 and 5, but I cannot think that that would affect anything other than perhaps performance.
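A quick way to sanity-check a claim like this is to compare the recognizers by brute force on short inputs. The Python sketch below does that for "sequence4" and "sequence5" only, encoding each token as one letter (i for item, e for extra, an encoding I chose for the example). Agreement up to a length bound is evidence, not a proof.

```python
import re
from itertools import product

# One letter per token: i = item, e = extra.
SEQUENCE4 = re.compile(r"i(?:e|i)*")   # item (extra | item)*
SEQUENCE5 = re.compile(r"(?:ie*)+")    # (item extra*)+

def language(pattern, max_len):
    """All token strings up to max_len tokens that the pattern matches in full."""
    return {
        "".join(word)
        for n in range(max_len + 1)
        for word in product("ie", repeat=n)
        if pattern.fullmatch("".join(word))
    }

same = language(SEQUENCE4, 8) == language(SEQUENCE5, 8)
```

Both patterns accept exactly the strings that start with an item, so `same` is True here.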
EDIT:
Concerning the following:
lineto-argument-sequence:
coordinate-pair
| coordinate-pair comma-wsp? lineto-argument-sequence
What this production says is that a lineto-argument-sequence is composed of at least one coordinate-pair followed by zero or more coordinate-pairs separated by optional whitespace/comma. Any of the following would constitute a lineto-argument-sequence (read -> as 'becomes'):
1,2 -> (1, 2)
1.5.6 -> (1.5, 0.6)
1.5.06 -> (1.5, 0.06)
2 3 3 4 -> (2,3) (3,4)
2,3-3-4 -> (2,3) (-3,-4)
2 3 3 -> ERROR
So a coordinate-pair is really any 2 consecutive numbers.
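The examples above can be reproduced with a small Python scanner. This is a sketch under my own assumptions: the regular expressions approximate the number and separator rules (a leading '-' only, as in the mock grammar further down), and coordinate_pairs is a hypothetical helper, not part of any SVG library.

```python
import re

NUMBER = re.compile(r"-?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")
SEPARATOR = re.compile(r"\s*,?\s*")   # optional whitespace with at most one comma

def coordinate_pairs(text):
    """Scan numbers greedily, then group them two by two."""
    numbers, pos = [], 0
    text = text.strip()
    while pos < len(text):
        if numbers:
            pos = SEPARATOR.match(text, pos).end()
        match = NUMBER.match(text, pos)
        if match is None:
            raise ValueError(f"expected a number at offset {pos}")
        numbers.append(float(match.group()))
        pos = match.end()
    if not numbers or len(numbers) % 2:
        raise ValueError("coordinates must come in pairs")
    return list(zip(numbers[0::2], numbers[1::2]))
```

Because a number ends as soon as a second '.' or a '-' appears, "1.5.6" scans as 1.5 followed by .6, and "2,3-3-4" as 2, 3, -3, -4, with no separator needed.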
I have mocked up a grammar in ANTLR that seems to work. Note the pattern used for lineto_argument_sequence is similar to the one Jorn and I recommended previously.
grammar SVG;
lineto_argument_sequence
: coordinate_pair (COMMA_WSP? coordinate_pair)*
;
coordinate_pair
: coordinate COMMA_WSP? coordinate
;
coordinate
: NUMBER
;
COMMA_WSP
: ( WS+|WS*','WS*) //{ $channel=HIDDEN; }
;
NUMBER
: '-'? (INT | FLOAT) ;
fragment
INT
: '0'..'9'+ ;
fragment
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
fragment
WS : ' ' | '\t' | '\r' | '\n' ;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
Given the following input:
2, 3 -3 -4 5.5.65.5.6
it produces this parse tree.
[Image: parse tree, http://www.freeimagehosting.net/uploads/85fc77bc3c.png]
This rule would also be equivalent to sequence ::= (item extra*)*, thus removing the recursion on sequence.
Yes, those two grammars describe the same language.
But is that really EBNF? Wikipedia article on EBNF does not include the Kleene star operator.

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
[Image: Decision Can Match NL Using Multiple Alternatives, http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png]
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
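The tokenizer behaviour suggested above could be sketched like this in Python. It is a stand-alone illustration rather than ANTLR lexer code, and the token kinds (ID, NUM, OP, NL) and the exact character sets are assumptions of mine.

```python
import re

TOKEN = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)          # sh-style comment, up to end of line
  | (?P<NL>\n)
  | (?P<WS>[ \t\r]+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<NUM>\d+)
  | (?P<OP>[-+*/=,()\[\]{}])
""", re.VERBOSE)

def tokenize(source):
    tokens, depth, pos = [], 0, 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "NL":
            if depth == 0:          # inside brackets a newline is just whitespace
                tokens.append(("NL", "\n"))
            continue
        if kind == "OP" and text in "([{":
            depth += 1
        elif kind == "OP" and text in ")]}":
            depth = max(depth - 1, 0)
        tokens.append((kind, text))
    # Make sure the stream ends with NL even if the file does not.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens
```

With a token stream like this, the grammar never sees a newline inside parentheses, and the last statement always has its terminating NL, so the optional `NL!?` is no longer needed.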
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.