BNF grammar for sequence of statements - language-design

If I am making a grammar for a C-like language that has a sequence of statements, what is the most standard way of defining the grammar?
My thought is to do something like this:
<program> ::= <statement>
<statement> ::= <statement-head><statement-tail>
<statement-head> ::= <if-statement> | <var-declaration> | <assignment> | <whatever>
<statement-tail> ::= ; | ;<statement>
but that feels a little clunky to me. I've also considered making
<program> ::= <statement>*
or
<statement> ::= <statement-head> ; | <sequence>
<sequence> ::= <statement> <statement>
type productions.
Is there a standard or accepted way to do this? I want my AST to be as clean as possible.

A very common way would be:
<block-statement> ::= '{' <statement-list> '}' ;
<statement-list> ::= /* empty */ | <statement-list> <statement> ;
<statement> ::= <whatever> ';' ;
And then you define actual statements instead of typing <whatever>. It seems much cleaner to include trailing semicolons as part of individual statements rather than putting them in the definition for the list non-terminal.
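For instance, <whatever> might expand along these lines (purely illustrative):
<statement> ::= <if-statement> | <expression-statement> | <block-statement>
<expression-statement> ::= <expression> ';'
Note that an <if-statement> would supply no trailing semicolon of its own, which is exactly why the semicolons belong to the statement kinds that need them rather than to the list.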

You can find the BNF for C here (I think it was taken from K&R, which you could also check out), and the SQL BNF here, which may provide more information on formulating good sequences. These will give you a sense of the usual conventions.
In terms of AST generation, it really doesn't matter how 'clunky' your definition is, provided it parses the source correctly for all inputs. Then just add the actions that build your AST.
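For example, in a Yacc/Bison-style tool, the statement-list rule above might carry actions like this (just a sketch; new_stmt_list and append_stmt are hypothetical AST helpers):
statement_list
  : /* empty */              { $$ = new_stmt_list(); }
  | statement_list statement { $$ = append_stmt($1, $2); }
  ;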
Just make sure you are constructing your grammar for the right kind of parser generator, such as an LL or LR one, as you may run into problems with reduction, which will mean some rules need rewriting. See this on eliminating left recursion.
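For instance, the left-recursive <statement-list> above would have to become right-recursive before an LL parser could handle it:
<statement-list> ::= /* empty */ | <statement> <statement-list>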
You may also want to check out some Bison/Yacc examples. Also check out the Dragon Book and a book called "Modern Compiler Implementation in C".

How do I make an item optional or repeatable in K syntax rule?

How can I convert the EBNF rules below to the K Framework?
An element can be repeated zero or more times:
items ::= {"," item}*
For now, I am using a List from the Domain module. But inline List declarations are not allowed, like this one:
syntax Foo ::= Stmt List{Id, ""}
For now, I have to create a new syntax rule for the listed item to work around the problem:
syntax Ids ::= List{Id, ""}
syntax Foo ::= Stmt Ids
Is there a way to avoid creating this extra rule?
An element can appear zero or one time; in other words, it can be optional:
array-decl ::= <variable> "[" {Int}? "]"
Where we want to accept both a[4] and a[]. For now, to handle this I create two rules, where one branch has the item and the other does not. But this solution duplicates rules in an unnecessary way, in my opinion.
An element can appear one or more times:
e ::= {a-z}+
Where we accept any non-empty sequence of lower-case letters. Right now, I haven't found a way to simulate this.
Thank you in advance!
Inline zero-or-more productions have been restricted in the K-framework because the backend doesn't support terms with a variable number of arguments.
Therefore we ask that each list be declared as a separate production, which will produce a cons list. Typical functional-style matching can then be used to process the AST.
Your typical EBNF extensions would look something like this:
{Id ","}* - syntax Ids ::= List{Id, ","}
{Id ","}+ - syntax Ids ::= NeList{Id, ","}
Id? - syntax OptionalId ::= "" [klabel(none)] | Id [klabel(some)]
The optional (?) production has the same problem. So we ask the user to specify labels that can be referenced by rules. Note that the empty production is not allowed in the semantics module because it may interfere with parsing the concrete syntax in rules. So you will need to create a COMMON module with most of the syntax, and a *-SYNTAX module with the productions that can interfere with rule parsing (empty productions and tokens that can conflict with K variables).
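A hedged sketch of that module split (the module names and the Exp production are illustrative only):
module EXP-COMMON
  imports ID
  // productions that are safe to use while parsing rules
  syntax Exp ::= Id
endmodule

module EXP-SYNTAX
  imports EXP-COMMON
  // the empty production lives only here, so it cannot
  // interfere with parsing the rules of the semantics
  syntax OptionalId ::= "" [klabel(none)] | Id [klabel(some)]
endmodule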
No, there is currently no mechanism to do this without the extra production.
I typically do this as follows:
syntax MaybeFoo ::= ".MaybeFoo" | Foo
syntax ArrayDecl ::= Variable "[" MaybeFoo "]"
Non-empty lists may be declared similarly to lists:
syntax Bars ::= NeList{Bar, ","}
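Either way, once the list has its own production, the cons list it produces can be consumed with ordinary functional-style matching, as noted above. A minimal sketch (the length function is illustrative, not part of the question):
syntax Int ::= length(Ids) [function]
rule length(.Ids) => 0
rule length((_:Id, Is:Ids)) => 1 +Int length(Is)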

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it so that it is not recursive? Second, when I try to compile and run it, it freezes the IDEA test instance while trying to parse this syntax; it looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
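A hedged sketch of that split, reusing the rules above (the extra rule names are illustrative):
expr ::= addExpr
addExpr ::= mulExpr (('+' | '-') mulExpr)*
mulExpr ::= unary (('*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )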
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.

Is there an automatic way to simplify EBNF grammars?

I want to handle the Scala grammar in order to create a data structure, and I want to create a graph to model it. The problem is that I want to get rid of some of the EBNF grammar's syntactic sugar so that I can parse it efficiently.
For example if I have the following rule:
Expr ::= (Bindings | id | '_') '=>' Expr | Expr1
I want the corresponding rules for it:
Expr ::= LeftExpr '=>' Expr | Expr1
LeftExpr ::= Bindings | id | '_'
What I would really appreciate is a tool which can automatically convert those rules, so that the result is left-associative rules containing only OR operators, without my doing it manually. I think such a thing is possible, even if the newly added rules carry no semantic meaning; e.g. it makes no difference whether LeftExpr is called Foo or any other name.
I looked at web sources, but the topic is confusing enough that I cannot find the right answer. Any help is appreciated, even just a link that is really related to the problem. Thank you.

Solve ambiguity in my grammar with LALR parser

I'm using whittle to parse a grammar, but I'm running into the classical LALR ambiguity problem. My grammar looks like this (simplified):
<comment> ::= '{' <string> '}' # string enclosed in braces
<tag> ::= '[' <name> <quoted-string> ']' # [tagname "tag value"]
<name> ::= /[A-Za-z_]+/ # subset of all printable chars
<quoted-string> ::= '"' <string> '"' # string enclosed in quotes
<string> ::= /[:print:]/ # regex for all printable chars
The problem, of course, is <string>. It contains all printable characters and is therefore very greedy. Since it's an LALR parser, it tries to parse a <name> as a <string> and everything breaks. The grammar complicates things because it uses different string delimiters for different things, which is why I tried to make the <string> rule in the first place.
Is there a canonical way to normalize this grammar to make it LALR compliant, if it's even possible?
This is not "the classical LALR ambiguity problem", whatever that might be. It is simply an error in the lexical specification of the language.
I took a quick glance at the Whittle readme, but it didn't bear any resemblance to the grammar in the OP. So I'm assuming that the text in the OP is conceptual rather than literal, and the fact that it includes the obviously incorrect
<string> ::= /[:print:]/ # regex for all printable chars
is just a typo.
Better would have been /[:print:]*/, assuming that Ruby lets you get away with [:print:] rather than the Posix-standard [[:print:]].
But that wouldn't be correct either because lexing (usually) matches the longest possible string, and consequently that will gobble up the closing quote and any following text.
So the correct solution for quoted-string is to write it out correctly:
<quoted-string> ::= /"[^"]*"/
or even
<quoted-string> ::= /"([^\\"]|\\.)*"/
# any number of characters other than quote or escape, or escaped pairs
You might have other ideas about how to escape internal double quotes; those are just examples. In both cases, you need to postprocess the token in order to (at least) strip the double quotes and possibly interpret escape sequences. That's just the way it goes.
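As a rough illustration of that postprocessing (plain C, using the backslash convention of the second pattern; the helper is hypothetical):
#include <stdlib.h>

/* Strip the surrounding quotes from a token that matched
   /"([^\\"]|\\.)*"/ and reduce each \x escape to plain x. */
char *unquote(const char *tok, size_t len)
{
    char *out = malloc(len);  /* result is never longer than the token */
    size_t j = 0;
    for (size_t i = 1; i + 1 < len; i++) {  /* skip both quote characters */
        if (tok[i] == '\\')
            i++;              /* take the escaped character literally */
        out[j++] = tok[i];
    }
    out[j] = '\0';
    return out;
}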
Your comment sequences present a more difficult issue, assuming that your intention was that a comment might include nested braces (e.g. {This comment {with this} ends here}), because the nested-brace syntax is not regular and thus cannot be matched with a regular expression. Of course, very few "regular expression" libraries are really regular these days, and I don't know if Ruby contains some sort of brace-counting extension, like for example Lua's pattern syntax. The nested-brace syntax is certainly context-free, but to actually parse it you need to lexically analyze the contents of the outer {...} in a different way than the rest of the program.
It is this latter observation, and not any weakness in the LALR algorithm, that is causing you pain, and I'd say that this is a weakness with the (mostly undocumented afaics) lexical analysis section of whittle. In a flex-generated lexer, for example, it would be normal to use start conditions to separate the lexical environments (program / quoted string / braced comment), and the parser would then have no ambiguity.
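For example, a flex fragment along these lines (illustrative; this is not whittle syntax) would track nested braces in a start condition:
%x COMMENT
%{
    static int depth = 0;   /* current brace-nesting depth */
%}
%%
"{"              { depth = 1; BEGIN(COMMENT); }
<COMMENT>"{"     { ++depth; }
<COMMENT>"}"     { if (--depth == 0) BEGIN(INITIAL); /* emit a comment token here */ }
<COMMENT>[^{}]+  { /* accumulate comment text */ }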
Hope that helps.

Reduce/Reduce conflict when introducing pointers in my grammar

I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar, but doing so produced a reduce/reduce conflict.
Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.
entrypoints Stmt ;
-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr; -- ex: "printInt(123); "
-- Types
-------------
TInt. Type ::= "int" ;
TPointer. Type ::= Type "*" ;
TAlias. Type ::= Ident ; -- This is how I implement typedefs
-- Expressions
--------------
EMult. Expr1 ::= Expr1 "*" Expr2 ;
ELitInt. Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;
-- and the standard coercions
_. Expr ::= Expr1 ;
_. Expr1 ::= Expr2 ;
I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs
main(){
int a;
int b;
a * b;
}
and
typedef int my_type;
main(){
my_type * my_type_pointer_variable;
}
(The typedef and main(){} parts aren't actually in my grammar, but they give some context.)
In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.
At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.
But the grammar stage has no idea whether an identifier was originally a type alias or a variable.
(1) How can I get rid of the reduce/reduce conflict, and (2) am I the only one having this issue? Is there an obvious solution, and how does the C grammar resolve it?
So far I have only managed to change the syntax of my language, using "&" or some other symbol instead of "*", but that's very undesirable. Also, I cannot make sense of various public C grammars; I tried to see why they don't have this issue, but I have had no luck.
And last, how do I resolve issues like these on my own? All I understood from happy's more verbose output is how the conflict happens; is cleverness the only way to work around such conflicts? I'm afraid I'll stumble on even more issues, for example when introducing EIndir. Expr ::= "*" Expr ;
The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". It's a 'hack' in the sense that it doesn't deal with the problem in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that come after it. This is why C requires that you declare typedefs before their first use in each compilation unit.
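A minimal C sketch of that classification step (the token names and the symbol-table query are hypothetical; bnfc/happy do not provide this hook out of the box):
enum { IDENT, TYPE_IDENT };

/* Assumed symbol-table query: has `name` been declared with
   typedef earlier in this compilation unit? */
extern int is_typedef_name(const char *name);

/* Called by the lexer whenever it scans an identifier. */
int classify_identifier(const char *name)
{
    return is_typedef_name(name) ? TYPE_IDENT : IDENT;
}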