How to exclude the comma operator from an expression in Grammar-Kit? - intellij-plugin

I'm writing an IntelliJ language plugin for a C-derived language which includes the comma operator. I'm using Grammar-Kit to generate the parser. Where the formal grammar has a lot of nested expression productions, I've rewritten them using Grammar-Kit's priority-based expression parsing, so my expression production looks like this:
expression ::= comma_expression
| assignment_expression
| conditional_expression
| eor_expression
| xor_expression
| and_expression
| equality_expression
| relation_expression
| add_expression
| mul_expression
| prefix_expression
| postfix_group
| primary_expression
comma_expression ::= expression ',' expression {pin=2}
// etc.
This works fine in itself, but there are places in the grammar where I need to parse an expression that can't be a comma expression. Function calls are one example of this:
function_call_expression ::= identifier '(' ('void'|<<comma_list expression>>)? ')'
private meta comma_list ::= <<p>> (',' <<p>>)*
A function argument can't be a comma expression, because that would be ambiguous with the comma separating the next argument. (In the grammar as I have it now, it always parses as a single comma expression.) The formal grammar deals with this by specifying that each function argument must be an assignment expression, because their assignment expression includes all the expressions with tighter precedence. That doesn't work for the Grammar-Kit priority-based grammar, because an assignment expression really does have to include an assignment.
The same applies to initializers, where allowing a comma expression would lead to an ambiguous parse in cases like int x=1, y;.
How should I deal with this situation? I'd like to keep using the priority-based parse to keep a shallow PSI tree, but also avoid manually rewriting the PSI tree for function calls to turn aCommaExpression into an argument list.

Related

How to deal with nested expressions in grammar kit?

I'm trying to write BNF file for my custom language intellij plugin. I'm getting confused with the rules for nested expressions. My custom language contains both binary operator expressions and array reference expressions. So I wrote the BNF file like this:
{
extends(".*_expr")=expr
tokens=[
id="regexp:[a-zA-Z_][a-zA-Z0-9_]*"
number="regexp:[0-9]+"
]
}
expr ::= binary_expr| array_ref_expr | const_expr
const_expr ::= number
binary_expr ::= expr '+' expr
array_ref_expr ::= id '[' expr ']'
But when I tried to evaluate expressions like 'a[1+1]' , I got an error:
']' expected, got '+'
Debugging the generated parser code, I found that when analyzing an expression like
a[expr]
, the expression in the brackets must have lower priority than array_ref_expr, thus binary_expr will not be included. If I swapped the priorities of the two expressions, the parser will not analyze expressions like
a[1]+1
. I also tried to make them same priority, or to make one expression right associative, each doesn't work for some specific expressions.
What would I need to do?
Many thanks
Getting rid of the left-recursion in the plus should fix this.
file ::= expr*
expr ::= plus_expr | expr_
expr_ ::= array_ref_expr | const_expr
plus_expr ::= expr_ '+' expr
const_expr ::= number
array_ref_expr ::= id '[' expr ']'
The expr ::= plus_expr | expr_ could also be expr_ ('+' expr)? but - to me - it makes more sense to use plus_expr so it gets its own node in the Psi tree.

Grammars: How to add a level of precedence

So lets say I have the following Context Free Grammar for a simple calculator language:
S->TS'
S'->OP1 TE'|e
T->FT'
T'->OP2 FT'|e
F->id|(S)
OP1->+|-
OP2->*|/
As one can see the * and / have higher precedence over + and -.
However, how can I add another level of precedence? Example would be for exponents, ^, (ex:3^2=9) or something else? Please explain your procedure and reasoning on how you got there so I can do it for other operators.
Here's a more readable grammar:
expr: sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
factor: ID
| '(' expr ')'
add_op: '+' | '-'
mul_op: '*' | '/'
This can be easily extended using the same pattern:
expr: bool
bool: bool or_op conj
| conj
conj: conj and_op comp
| comp
/* This one doesn't allow associativity. No a < b < c in this language */
comp: sum comp_op sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
/* Here we'll add an even higher precedence operators */
/* Unlike the other operators, though, this one is right associative */
factor: atom exp_op factor
| atom
atom: ID
| '(' expr ')'
/* I left out the operator definitions. I hope they are obvious. If not,
* let me know and I'll put them back in
*/
I hope the pattern is more or less obvious there.
Those grammars won't work in a recursive descent parser, because recursive descent parsers choke on left recursion. The grammar you have has been run through a left-recursion elimination algorithm, and you could do that to the grammar above as well. But note that eliminating left recursion more or less erases the difference between left- and right-recursion, so after you identify the parse with a recursive descent grammar, you need to fix it according to your knowledge about the associativity of the operator, because associativity is no longer inherent in the grammar.
For these simple productions, eliminating left-recursion is really simple, in two steps. We start with some non-terminal:
foo: foo foo_op bar
| bar
and we flip it around so that it is right associative:
foo: bar foo_op foo
| bar
(If the operator was originally right associative, as with exponentiation above, then this step isn't needed.)
Then we need to left-factor, because LL parsing requires that every alternative for a non-terminal has a unique prefix:
foo : bar foo'
foo': foo_op foo
| ε
Doing that to every recursive production above (that is, all of them except for expr, comp and atom) will yield a grammar which looks like the one you started with, only with more operators.
In passing, I emphasize that there is no mysterious magical force at work here. When the grammar says, for example:
term: term mul_op factor
| factor
what it's saying is that a term (or product, if you prefer) cannot be the right-hand argument of a multiplication, but it can be the left-hand argument. It's also saying that if you're at a point in which a product would be valid, you don't actually need something with a multiplication operator; you can use a factor instead. But obviously you cannot use a sum, since factor doesn't parse expressions with a sum operator. (It does parse anything inside parentheses. But those are things inside parentheses.)
That's the sense in which both associativity and precedence are implicit in the grammar.

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it to not be recursive? Second, when I try to compile and run it, it freezes idea test instance when trying to parse this syntax, looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.

shift/reduce error in yacc

I know this part of my grammar cause error but I don't know how to fix it I even use %left and right but it didn't help. Can anybody please help me to find out what is the problem with this grammar.
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators, (a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators.) But if you are using = as an equality operator, it might not actually be what you intended.

How do I add parenthesis to this rule?

I have a left-recursive rule like the following:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION | UNARYOP EXPRESSION | NUMBER;
I need to add parenthesis to it but I'm not sure how to make a left parenthesis depend on a matching right parenthesis yet still optional. Can someone show me how? (Or am I trying to do entirely too much in lexing, and should I leave some or all of this to the parsing?)
You could add a recursive rule:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION
| UNARYOP EXPRESSION
| NUMBER
| OPENPARENS EXPRESSION CLOSEPARENS
;
Yes, you're trying to do too much in the lexer. Here's how to get around the left-recursive rules:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator (see how the parser rule expr trickles down to the rule atom and then get called recursively from atom again)
HTH