How to deal with nested expressions in grammar kit? - intellij-idea

I'm trying to write BNF file for my custom language intellij plugin. I'm getting confused with the rules for nested expressions. My custom language contains both binary operator expressions and array reference expressions. So I wrote the BNF file like this:
{
extends(".*_expr")=expr
tokens=[
id="regexp:[a-zA-Z_][a-zA-Z0-9_]*"
number="regexp:[0-9]+"
]
}
expr ::= binary_expr| array_ref_expr | const_expr
const_expr ::= number
binary_expr ::= expr '+' expr
array_ref_expr ::= id '[' expr ']'
But when I tried to evaluate expressions like 'a[1+1]' , I got an error:
']' expected, got '+'
Debugging the generated parser code, I found that when analyzing an expression like
a[expr]
, the expression in the brackets must have lower priority than array_ref_expr, thus binary_expr will not be included. If I swapped the priorities of the two expressions, the parser will not analyze expressions like
a[1]+1
. I also tried to make them same priority, or to make one expression right associative, each doesn't work for some specific expressions.
What would I need to do?
Many thanks

Getting rid of the left-recursion in the plus should fix this.
file ::= expr*
expr ::= plus_expr | expr_
expr_ ::= array_ref_expr | const_expr
plus_expr ::= expr_ '+' expr
const_expr ::= number
array_ref_expr ::= id '[' expr ']'
The expr ::= plus_expr | expr_ could also be expr_ ('+' expr)? but - to me - it makes more sense to use plus_expr so it gets its own node in the Psi tree.

Related

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it to not be recursive? Second, when I try to compile and run it, it freezes idea test instance when trying to parse this syntax, looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.

How to exclude the comma operator from an expression in Grammar-Kit?

I'm writing an IntelliJ language plugin for a C-derived language which includes the comma operator. I'm using Grammar-Kit to generate the parser. Where the formal grammar has a lot of nested expression productions, I've rewritten them using Grammar-Kit's priority-based expression parsing, so my expression production looks like this:
expression ::= comma_expression
| assignment_expression
| conditional_expression
| eor_expression
| xor_expression
| and_expression
| equality_expression
| relation_expression
| add_expression
| mul_expression
| prefix_expression
| postfix_group
| primary_expression
comma_expression ::= expression ',' expression {pin=2}
// etc.
This works fine in itself, but there are places in the grammar where I need to parse an expression that can't be a comma expression. Function calls are one example of this:
function_call_expression ::= identifier '(' ('void'|<<comma_list expression>>)? ')'
private meta comma_list ::= <<p>> (',' <<p>>)*
A function argument can't be a comma expression, because that would be ambiguous with the comma separating the next argument. (In the grammar as I have it now, it always parses as a single comma expression.) The formal grammar deals with this by specifying that each function argument must be an assignment expression, because their assignment expression includes all the expressions with tighter precedence. That doesn't work for the Grammar-Kit priority-based grammar, because an assignment expression really does have to include an assignment.
The same applies to initializers, where allowing a comma expression would lead to an ambiguous parse in cases like int x=1, y;.
How should I deal with this situation? I'd like to keep using the priority-based parse to keep a shallow PSI tree, but also avoid manually rewriting the PSI tree for function calls to turn aCommaExpression into an argument list.

no viable alternative error

I'm using ANTLR to generate recognizer for a java-like language and the following rules are used to recognize generic types:
referenceType
: singleType ('.' singleType)*
;
singleType
: Identifier typeArguments?
;
typeArguments
: '<' typeArgument (',' typeArgument)* '>'
;
typeArgument
: referenceType
;
Now, for the following input statement, ANTLR produces the 'no viable alternative' error.
Iterator<Entry<K,V>> i = entrySet().iterator();
However, if I put a space between the two consecutive '>' characters, no error is produced. It seams that ANTLR cannot distinguish between the above rule and the rule used to recognize shift expressions, but I don't know how to modify the grammar to resolve this ambiguity. Any help would be appreciated.
You probably have a rule like the following in the lexer:
RightShift : '>>';
For ANTLR to recognize >> as either two > characters or one >> operator, depending on context, you'll need to instead place your shift operator in the parser:
rightShift : '>' '>';
If your language includes the >>> or >>= operators, those would need to be moved to the parser as well.
To validate that x > > y isn't allowed, you'll want to make a pass over the resulting parse tree (ANTLR 4) or AST (ANTLR 3) to verify that the two > characters parsed by the rightShift parser rule appear in sequence.
280Z28 is probably right in his diagnosis that you have a rule like
RightShift : '>>';
An alternative solution is to explicitly include the possibility of a trailing >> in your parser. (I have seen this in other grammars, but only in LALR.)
typeArguments
: ('<' typeArgument (',' typeArgument)* '>') |
('<' typeArgument ',' referenceType '<' typeArgument RightShift );
;
In Antlr3, that will need to be left factored.
Whether this is clearer or having a second pass that validates your right shift operator depends on how often you need to use this.

Reduce/Reduce conflict when introducing pointers in my grammar

I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar but I got a reduce/reduce conflict by doing it.
Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.
entrypoints Stmt ;
-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr; -- ex: "printInt(123); "
-- Types
-------------
TInt. Type ::= "int" ;
TPointer. Type ::= Type "*" ;
TAlias. Type ::= Ident ; -- This is how I implement typedefs
-- Expressions
--------------
EMult. Expr1 ::= Expr1 "*" Expr2 ;
ELitInt. Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;
-- and the standard corecions
_. Expr ::= Expr1 ;
_. Expr1 ::= Expr2 ;
I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs
main(){
int a;
int b;
a * b;
}
and
typedef int my_type;
main(){
my_type * my_type_pointer_variable;
}
(The typedef and main(){} part isn't relevant and in my grammar. But they give some context)
In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.
At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.
But the grammar stage have no idea if an identifier originally is a type alias or a variable.
(1) How can I get rid of the reduce/reduce conflict and (2) am I the only one having this issue? Is there an obvious solution and how does the c grammar resolve this issue?
So far I have successfully just been able to change the syntax of my language by using "&" or some other symbol instead of "*", but that's very undesirable. Also I cannot make sense from various public c grammars and tried to see why they don't have this issue but I have had no luck in this.
And last, how do I resolve issues like these on my own? All I understood from happys more verbose output is how the conflict happens, is cleverness the only way to work around these conflicts? I'm afraid I'll stumble on even more issues for example when introducing EIndir. Expr = '*' Expr;
The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". Its a 'hack' in the sense that it doesn't deal with it in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that are after the current point. This is why C requires that you declare typedefs before their first use in each compilation unit.

Using precedence in Bison for unary minus doesn't solve shift/reduce conflict

I'm devising a very simple grammar, where I use the unary minus operand. However, I get a shift/reduce conflict. In the Bison manual, and everywhere else I look, it says that I should define a new token and give it higher precedence than the binary minus operand, and then use "%prec TOKEN" in the rule.
I've done that, but I still get the warning. Why?
I'm using bison (GNU Bison) 2.4.1. The grammar is shown below:
%{
#include <string>
extern "C" int yylex(void);
%}
%union {
std::string token;
}
%token <token> T_IDENTIFIER T_NUMBER
%token T_EQUAL T_LPAREN T_RPAREN
%right T_EQUAL
%left T_PLUS T_MINUS
%left T_MUL T_DIV
%left UNARY
%start program
%%
program : statements expr
;
statements : '\n'
| statements line
;
line : assignment
| expr
;
assignment : T_IDENTIFIER T_EQUAL expr
;
expr : T_NUMBER
| T_IDENTIFIER
| expr T_PLUS expr
| expr T_MINUS expr
| expr T_MUL expr
| expr T_DIV expr
| T_MINUS expr %prec UNARY
| T_LPAREN expr T_RPAREN
;
%prec doesn't do as much as you might hope here. It tells Bison that in a situation where you have - a * b you want to parse this as (- a) * b instead of - (a * b). In other words, here it will prefer the UNARY rule over the T_MUL rule. In either case, you can be certain that the UNARY rule will get applied eventually, and it is only a question of the order in which the input gets reduced to the unary argument.
In your grammar, things are very much different. Any sequence of line non-terminals will make up a sequence, and there is nothing to say that a line non-terminal must end at an end-of-line. In fact, any expression can be a line. So here are basically two ways to parse a - b: either as a single line with a binary minus, or as two “lines”, the second starting with a unary minus. There is nothing to decide which of these rules will apply, so the rule-based precedence won't work here yet.
Your solution is correcting your line splitting, by requiring every line to actually end with or be followed by an end-of-line symbol.
If you really want the behaviour your grammar indicates with respect to line endings, you'd need two separate non-terminals for expressions which can and which cannot start with a T_MINUS. You'd have to propagate this up the tree: the first line may start with a unary minus, but subsequent ones must not. Inside a parenthesis, starting with a minus would be all right again.
The expr rule is ok (without the %prec UNARY). Your shift/reduce conflict comes from the rule:
statements : '\n'
| statements line
;
The rule does not what you think. For example you can write:
a + b c + d
I think that is not supposed to be valid input.
But also the program rule is not very sane:
program : statements expr
;
The rules should be something like:
program: lines;
lines: line | lines line;
line: statement "\n" | "\n";
statement: assignment | expr;