I have a left-recursive rule like the following:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION | UNARYOP EXPRESSION | NUMBER;
I need to add parentheses to it, but I'm not sure how to make a left parenthesis depend on a matching right parenthesis while keeping the pair optional. Can someone show me how? (Or am I trying to do too much in the lexer, and should I leave some or all of this to the parser?)
You could add a recursive rule:
EXPRESSION : EXPRESSION BINARYOP EXPRESSION
| UNARYOP EXPRESSION
| NUMBER
| OPENPARENS EXPRESSION CLOSEPARENS
;
Yes, you're trying to do too much in the lexer. Here's how to get around the left-recursive rules:
http://www.antlr.org/wiki/display/ANTLR3/Expression+evaluator (see how the parser rule expr trickles down to the rule atom and then get called recursively from atom again)
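To make the "trickle down to atom, then recurse from atom" idea concrete, here is a minimal hand-written sketch in Python rather than ANTLR. The rule names (expr, term, unary, atom) and the operator set are assumptions chosen for illustration; they are not taken from the linked page. Each binary level is a loop instead of a left-recursive rule, and parentheses are handled only in atom, which calls back into expr.

```python
import re

# One token per match: an integer or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    # expr : term (('+' | '-') term)*      (loop replaces left recursion)
    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    # term : unary (('*' | '/') unary)*
    def term(self):
        value = self.unary()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.unary()
            else:
                value /= self.unary()
        return value

    # unary : '-' unary | atom
    def unary(self):
        if self.peek() == "-":
            self.take()
            return -self.unary()
        return self.atom()

    # atom : NUMBER | '(' expr ')'         (the only place parentheses appear)
    def atom(self):
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

With this shape, an unmatched parenthesis is a parse error by construction, and parentheses stay optional because atom also accepts a bare number.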
HTH
I want to handle the Scala grammar in order to create a data structure and I want to create a graph to model it. The problem that I have is that I want to get rid of some syntax sugar of the EBNF grammar in order to parse it in an efficient way.
For example if I have the following rule:
Expr ::= (Bindings | id | '_') '=>' Expr | Expr1
I want the corresponding rules for it:
Expr ::= LeftExpr '=>' Expr | Expr1
LeftExpr ::= Bindings | id | '_'
What I would really appreciate is a tool that converts such rules automatically, so that I end up with left-associative rules that use only OR operators and don't have to do it by hand. I think this should be possible, even if the newly added rules get names with no semantic meaning, e.g. it makes no difference whether LeftExpr is called Foo or anything else.
I looked at web sources, but the topic is so confusing that I cannot find the right answer. Any help is appreciated, even just a link that is (really) related to the problem. Thank you.
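As a rough illustration of the transformation asked for above (not an existing tool), the following Python sketch lifts every parenthesized group in a rule body into a freshly named rule. The function name lift_groups and the generated names such as Expr_1 are hypothetical, and the sketch assumes terminals are written in single quotes.

```python
import re
from itertools import count

# Quoted terminals are kept whole, so a quoted '(' is not mistaken for a group.
TOKEN = re.compile(r"'[^']*'|[()|]|[^\s()|']+")

def lift_groups(name, body):
    """Return a list of (rule_name, rule_body) pairs with no parenthesized groups."""
    tokens = TOKEN.findall(body)
    rules = []
    counter = count(1)

    def walk(pos, depth):
        out = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "(":
                # Parse the group, then replace it with a fresh rule name.
                inner, pos = walk(pos + 1, depth + 1)
                fresh = f"{name}_{next(counter)}"
                rules.append((fresh, inner))
                out.append(fresh)
            elif tok == ")":
                if depth == 0:
                    raise ValueError("unbalanced ')'")
                return " ".join(out), pos + 1
            else:
                out.append(tok)
                pos += 1
        if depth:
            raise ValueError("unbalanced '('")
        return " ".join(out), pos

    top, _ = walk(0, 0)
    return [(name, top)] + rules

for rule_name, rule_body in lift_groups(
        "Expr", "(Bindings | id | '_') '=>' Expr | Expr1"):
    print(f"{rule_name} ::= {rule_body}")
```

This only covers grouping parentheses. Optional parts in [ ... ] and repetitions in { ... } would need their own rewrites, which are not shown here.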
I want to parse this
VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i
and, of course, other variations of regular expressions.
Does someone know how to do this properly?
Thanks in advance.
Edit: I tried throwing in all regex signs and chars in one lexer rule like this
REGEX: ( DIV | ('i') | ('#') | ('[') | (']') | ('+') | ('.') | ('*') | ('-') | ('\\') | ('(') | (')') |('A') |('w') |('a') |('z') |('Z')
//|('w')|('a'));
and then make a parser rule like this:
regex_assignment: (REGEX)+
but there are recognition errors (extraneous input). This is definitely because these characters are, of course, already used in other rules defined earlier.
The thing is, I don't actually need to process these regex assignments; I just want them to be recognized without errors. Does anyone have an approach for this in ANTLR? A solution that just recognizes this as a regex and skips it, for example, would suffice.
Unfortunately, there is no regex grammar yet in the ANTLR grammar repository, but similar questions have come up before, e.g. Regex Grammar. Once you have the (E)BNF you can convert that to ANTLR. Or alternatively, you can use the BNF grammar to check your own grammar rules to see if they are correctly defined. Simply throwing together all possible input chars in a single rule won't work.
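If the goal is only to recognize and skip the literal, one alternative to a full regex grammar is to treat the whole /.../flags literal as a single delimited token. The sketch below shows that scanning logic in plain Python, not ANTLR. The function name scan_regex_literal is hypothetical, and the rules are simplified assumptions: a backslash escapes the next character, and a '/' inside a [...] character class does not end the literal.

```python
def scan_regex_literal(text, start):
    """Return the index just past a /.../flags literal that begins at text[start]."""
    if text[start] != "/":
        raise ValueError("regex literal must start with '/'")
    i = start + 1
    in_class = False
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2              # skip the backslash and the escaped character
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch == "/" and not in_class:
            i += 1
            while i < len(text) and text[i].isalpha():
                i += 1          # consume trailing flags such as 'i'
            return i
        i += 1
    raise ValueError("unterminated regex literal")

line = r"VALID_EMAIL_REGEX = /\A[\w+\-.]+#[a-z\d\-]+(\.[a-z]+)*\.[a-z]+\z/i"
start = line.index("/")
end = scan_regex_literal(line, start)
print(line[start:end])
```

A real lexer still has to decide when a '/' starts a regex rather than a division, which usually depends on the preceding token; that decision is outside this sketch.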
I know this part of my grammar causes an error, but I don't know how to fix it. I even used %left and %right, but it didn't help. Can anybody please help me find out what the problem with this grammar is?
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators: a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators. But if you are using = as an equality operator, it might not actually be what you intended.
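To see what the choice between the two expr_sequence alternatives means in practice, here is a small Python sketch that builds the tree each rule would produce for the same sequence of items. The function names parse_left and parse_right are made up for this illustration, and nested tuples stand in for parse-tree nodes.

```python
def parse_left(items):
    """expr_sequence: expr | expr_sequence expr   (groups to the left)"""
    tree = items[0]
    for item in items[1:]:
        tree = (tree, item)
    return tree

def parse_right(items):
    """expr_sequence: expr | expr expr_sequence   (groups to the right)"""
    if len(items) == 1:
        return items[0]
    return (items[0], parse_right(items[1:]))

print(parse_left(["a", "b", "c"]))   # (('a', 'b'), 'c')
print(parse_right(["a", "b", "c"]))  # ('a', ('b', 'c'))
```

Both rules accept exactly the same inputs; they differ only in how the juxtaposed items are grouped, which matters as soon as the juxtaposition has a meaning (application, multiplication, and so on).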
I'm writing an IntelliJ language plugin for a C-derived language which includes the comma operator. I'm using Grammar-Kit to generate the parser. Where the formal grammar has a lot of nested expression productions, I've rewritten them using Grammar-Kit's priority-based expression parsing, so my expression production looks like this:
expression ::= comma_expression
| assignment_expression
| conditional_expression
| eor_expression
| xor_expression
| and_expression
| equality_expression
| relation_expression
| add_expression
| mul_expression
| prefix_expression
| postfix_group
| primary_expression
comma_expression ::= expression ',' expression {pin=2}
// etc.
This works fine in itself, but there are places in the grammar where I need to parse an expression that can't be a comma expression. Function calls are one example of this:
function_call_expression ::= identifier '(' ('void'|<<comma_list expression>>)? ')'
private meta comma_list ::= <<p>> (',' <<p>>)*
A function argument can't be a comma expression, because that would be ambiguous with the comma separating the next argument. (In the grammar as I have it now, it always parses as a single comma expression.) The formal grammar deals with this by specifying that each function argument must be an assignment expression, because their assignment expression includes all the expressions with tighter precedence. That doesn't work for the Grammar-Kit priority-based grammar, because an assignment expression really does have to include an assignment.
The same applies to initializers, where allowing a comma expression would lead to an ambiguous parse in cases like int x=1, y;.
How should I deal with this situation? I'd like to keep using the priority-based parse to keep a shallow PSI tree, but also avoid manually rewriting the PSI tree for function calls to turn a comma expression into an argument list.
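For comparison, this is how the same requirement looks in a hand-written priority-based (precedence-climbing) parser: "an expression that is not a comma expression" is simply a request with a minimum priority one step above the comma. The sketch is Python, not Grammar-Kit BNF, and makes no claim about how Grammar-Kit exposes this; the priority table, the operator set, and all names are assumptions for illustration.

```python
import re

# Hypothetical priority table: higher binds tighter, comma is lowest.
PRIORITY = {",": 1, "=": 2, "+": 3, "*": 4}
RIGHT_ASSOC = {"="}
COMMA_PRIORITY = PRIORITY[","]

def tokenize(text):
    # Characters outside this small token set are silently ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|[,=+*()]", text)

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expression(self, min_priority=COMMA_PRIORITY):
        """Parse operators whose priority is at least min_priority."""
        left = self.primary()
        while True:
            op = self.peek()
            if op not in PRIORITY or PRIORITY[op] < min_priority:
                return left
            self.take()
            step = 0 if op in RIGHT_ASSOC else 1
            right = self.expression(PRIORITY[op] + step)
            left = (op, left, right)

    def primary(self):
        tok = self.take()
        if tok == "(":
            inner = self.expression()   # parentheses re-admit the comma operator
            self.take(")")
            return inner
        if not (tok[0].isalnum() or tok[0] == "_"):
            raise SyntaxError(f"unexpected token {tok!r}")
        if self.peek() == "(" and not tok[0].isdigit():
            return self.call(tok)
        return tok

    def call(self, name):
        self.take("(")
        args = []
        if self.peek() != ")":
            # Each argument is parsed one level above the comma operator.
            args.append(self.expression(COMMA_PRIORITY + 1))
            while self.peek() == ",":
                self.take()
                args.append(self.expression(COMMA_PRIORITY + 1))
        self.take(")")
        return ("call", name, args)

print(Parser("f(a, b = c, d)").expression())
```

The tree stays shallow because no intermediate "assignment or tighter" node is created; only the entry priority changes. An initializer could be parsed the same way, with the declarator list consuming the commas itself.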
G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but I always end up violating at least one of the expected behaviors:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
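The two tokenizer rules described above (no NL inside open brackets, and an NL guaranteed at the end of the token stream) can be sketched outside ANTLR like this. It is a standalone Python illustration under assumed token kinds (ID, NUM, OP, NL), not a drop-in ANTLR lexer.

```python
import re

TOKEN = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)           # sh-style comment, up to end of line
  | (?P<NL>\n)
  | (?P<WS>[ \t\r]+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<NUM>\d+)
  | (?P<OP>[-+*/=(),\[\]{}])
""", re.VERBOSE)

def tokenize(source):
    """Return (kind, text) pairs; newlines inside brackets are dropped."""
    tokens = []
    depth = 0                        # number of currently unclosed brackets
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "NL":
            if depth == 0:           # only emit NL outside any bracket group
                tokens.append(("NL", "\n"))
            continue
        if kind == "OP" and text in "([{":
            depth += 1
        elif kind == "OP" and text in ")]}":
            depth = max(depth - 1, 0)
        tokens.append((kind, text))
    # Make sure the stream ends with NL, even without a trailing newline.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens

source = "A = (B\n  + C)  # comment\n\nD = 1"
print([kind for kind, _ in tokenize(source)])
```

With a token stream like this, the grammar can require NL after every assignment, and the bracketed continuation line never reaches the parser as a separate statement.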
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.