I have just started self-studying the Dragon book of Compiler Design. I am working on a problem that says to design grammar for an expression containing binary +,-,*,/ and unary +,-
I came up with following
E -> E+T | E-T | T
T -> T*P | T/P | P
P -> +S | -S | S
S -> id | constant | (E)
However, there is an obvious flaw in it. According to this grammar, expressions like
1--3
are valid, which is an error in all programming languages I know. Though, expressions like
1+-+3
and
1- -3
must be valid. How can such a grammar be designed?
I believe your problem is with tokenization. You are identifying 1--3 as an error because you think it should be resolved as 1 --3 rather than 1 - -3, the latter being perfectly valid. So I think you problem comes because when you tokenize the string you are getting:
['1', '-', '-' , '3']
rather than:
['1', '--', '3']
I think you have one extra production rule
P -> +S | -S | S
S -> id | constant | (E)
can be shrinked to
P -> +P | -P | id | constant | (E)
With such grammar you will succesfully match exp "1+-+3" as valid.
You have tokenizer(scanner) problem! before passing token to parser You must differentiate between "-" and "--". you must define a token struct which contains a token type and value, then parse the token list.
Also the rule P->--S must be added to the production rules!!
Related
I am looking for a way to express the following types of conditions in BNF:
if(carFixed) { }
if(carFixed = true) {}
if(cars >= 4) { }
if(cars != 15) { }
if(cars < 3 && cars > 1) {}
Note:
* denotes 0 or more instances of something.
I have replaced normal BNF ::= with :.
I presently am using the following code, and am not sure if it's correct:
conditionOperator: "=" | "!=" | "<=" | ">=" | "<" | ">" | "is";
logicalAndOperator: "&&";
condition: (booleanIdentifier ((conditionOperator booleanIdentifier)* (logicalAndOperator | logicalOrOperator) booleanIdentifer (conditionOperator booleanIdentifier)*)*);
There are several approaches and they usually rely on the capabilities of the parser to indicate precedence and associativty. One that is typically used with recursive-descent parsers is to recreate the precedence of the operators by using the hierarchy provided by the bnf (or, in this case, pseudo-bnf) structure.
(In the examples bellow, CONDITIONAL_OP are the likes of <, != etc and LOGICAL_OP are &&, || etc)
Something in the lines of:
condition: logicalExpr
logicalExpr: conditionalExpr (LOGICAL_OP conditionalExpr)*
conditionalExpr: primary (CONDITIONAL_OP primary)*
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
The problem with the above solution is that the left-associativity of the operators is lost and requires special measures to restore it while parsing.
For parsers able to deal with left recursion, a more 'correct' notation could be:
condition: logicalExpr
logicalExpr: logicalExpr LOGICAL_OP conditionalExpr
| conditionalExpr
conditionalExpr: conditionalExpr CONDITIONAL_OP primary
| primary
primary: NUMBER | IDENTIFIER | BOOLEAN_LITERAL | '(' condition ')'
Finally, some parsers allow a special notation to indicate precedence and associativity. Something like (note that this is a completely invented syntax):
%LEFT LOGICAL_OP
%LEFT CONDITIONAL_OP
condition: condition CONDITIONAL_OP condition
| condition LOGICAL_OP condition
| '(' condition ')'
| NUMBER
| IDENTIFIER
| BOOLEAN_LITERAL
Hope this points you the right direction.
I work with antlr v4 to write a t-sql parser.
Is this warning a problem?
"rule 'sqlCommit' contains an optional block with at least one alternative that can match an empty string"
My Code:
sqlCommit: COMMIT (TRAN | TRANSACTION | WORK)? id?;
id:
ID | CREATE | PROC | AS | EXEC | OUTPUT| INTTYPE |VARCHARTYPE |NUMERICTYPE |CHARTYPE |DECIMALTYPE | DOUBLETYPE | REALTYPE
|FLOATTYPE|TINYINTTYPE|SMALLINTTYPE|DATETYPE|DATETIMETYPE|TIMETYPE|TIMESTAMPTYPE|BIGINTTYPE|UNSIGNEDBIGINTTYPE..........
;
ID: (LETTER | UNDERSCORE | RAUTE) (LETTER | [0-9]| DOT | UNDERSCORE)*
In a version before I used directly the lexer rule ID instead of the parser rule id in sqlCommit. But after change ID to id the warning appears.
(Hint if you are confused of ID and id: I want to use the parser rule id instead of ID because an identifier can be a literal which maybe already matched by an other lexer rule)
Regards
EDIT
With the help of "280Z28" I solved the problem. In the parser rule "id" was one slash more than needed:
BITTYPE|CREATE|PROC|
|AS|EXEC|OUTPUT|
So the | | includes that the parser rule can match an empty string.
From a Google search:
ErrorType.EPSILON_OPTIONAL
Compiler Warning 154.
rule rule contains an optional block with at least one alternative that can match an empty string
A rule contains an optional block ((...)?) around an empty alternative.
The following rule produces this warning.
x : ;
y : x?; // warning 154
z1 : ('foo' | 'bar'? 'bar2'?)?; // warning 154
z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?; // ok
Since:
4.1
The problem described by this warning is primarily a performance problem. By wrapping a zero-length string in an optional block, you added a completely unnecessary decision to the grammar (whether to enter the optional block or not) which has a high likelihood of forcing the prediction algorithm through its slowest path. It's similar to wrapping Java code in the following:
if (slowMethodThatAlwaysReturnsTrue()) {
...
}
I'm struggling to see how this rule also suffers from this warning (with antlr 4.7.1)
join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
left_right_full__join_type: LEFT | RIGHT | FULL;
JOIN: J O I N;
INNER: I N N E R;
OUTER: O U T E R;
AFAICT it always returns JOIN and optionally preceded by the type.
I'm trying to parse values with ANTLR. Here's the relevant part of my grammar:
root : IDENTIFIER | SELF | literal | constructor | call | indexer;
hierarchy : root (SUB^ (IDENTIFIER | call | indexer))*;
factor : hierarchy ((MULT^ | DIV^ | MODULO^) hierarchy)*;
sum : factor ((PLUS^ | MINUS^) factor)*;
comparison : sum (comparison_operator^ sum)*;
value : comparison | '(' value ')';
I won't describe each token or rule since their name is quite explanatory of their role. This grammar works well and compiles, allowing me to parse, using value, things such as:
a.b[c(5).d[3] * e()] < e("f")
The only thing left for value recognition is to be able to have parenthesized hierarchy roots. For instance:
(a.b).c
(3 < d()).e
...
Naively, and without much expectations, I tried adding the following alternative to my root rule:
root : ... | '(' value ')';
This however breaks the value rule due to non-LL(*)ism:
rule value has non-LL(*) decision due to recursive rule invocations reachable
from alts 1,2. Resolve by left-factoring or using syntactic predicates or using
backtrack=true option.
Even after reading most of The Definitive ANTLR Reference, I still don't understand these errors. However, what I do understand is that, upon seeing a parenthesis opening, ANTLR cannot know if it's looking at the beginning of a parenthesized value, or at the beginning of a parenthesized root.
How can I clearly define the behavior of parenthesized hierarchy root?
Edit: As requested, the additional rules:
parameter : type IDENTIFIER -> ^(PARAMETER ^(type IDENTIFIER));
constructor : NEW type PAREN_OPEN (arguments+=value (SEPARATOR arguments+=value)*)? PAREN_CLOSE -> ^(CONSTRUCTOR type ^(ARGUMENTS $arguments*)?);
call : IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE -> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?);
indexer : IDENTIFIER INDEX_START (values+=value (SEPARATOR values+=value)*)? INDEX_END -> ^(INDEXER IDENTIFIER ^(ARGUMENTS $values*));
Remove '(' value ')' from value and place it in root:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '(' value ')';
...
value : comparison;
Now (a.b).c will result in the following parse:
And (3 < d()).e in:
Of course, you'll probably want to omit the parenthesis from the AST:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '('! value ')'!;
Also, you don't need to append tokens in a List using += in your parser rules. The following:
call
: IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?)
;
can be rewritten into:
call
: IDENTIFIER PAREN_OPEN (value (SEPARATOR value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS value*)?)
;
EDIT
Your main problem is the fact that certain input can be parsed in two (or more!) ways. For example, the input (a) could be parsed by alternative 1 and 2 of your value rule:
value
: comparison // alternative 1
| '(' value ')' // alternative 2
;
Run through your parser rules: a comparison (alternative 1) can match (a) because it matches the root rule, which in its turn matches '(' value ')'. But that is also what alternative 2 matches! And there you have it: the parser "sees" for one input, two different
parses and reports about this ambiguity.
I am implementing a parser with ANTLR for D. This language is based on C so there are some ambiguity around the declarations and the expressions. Consider this:
a* b = c; // This is a declaration of the variable d with a pointer-to-a type.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
the inputs were
a = b
b-=c
d
Maybe I'm misunderstanding how the predicates are working therefore it would be great if someone could correct my explanation to the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note, that the assignmentExpression and the conditionalExpression works well). If the next token isn't of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now, there's an other problem with this solution that I could realize: the assignmentExpression has to choose in it's right hand expression is an assignment again (that is, postfix and assignment operator follows), if it is chained up.
What's wrong with that?
I'm finding myself challenged on how to properly format rewrite rules when certain conditions occur in the original rule.
What is the appropriate way to rewrite this:
unaryExpression: op=('!' | '-') t=term
-> ^(UNARY_EXPR $op $t)
Antlr doesn't seem to like me branding anything in parenthesis with a label and "op=" fails. Also, I've tried:
unaryExpression: ('!' | '-') t=term
-> ^(UNARY_EXPR ('!' | '-') $t)
Antlr doesn't like the or '|' and throws a grammar error.
Replacing the character class with a token name does solve this problem, however it creates a quagmire of other issues with my grammar.
--- edit ----
A second problem has been added. Please help me format this rule with tree grammar:
multExpression
: unaryExpression (MULT_OP unaryExpression)*
;
Pretty simple: My expectation is to enclose every matched token in a parent (imaginary) token MULT so that I end up with something like:
MULT
o
|
o---o---o---o---o
| | | | |
'3' '*' '6' '%' 2
unaryExpression
: (op='!' | op='-') term
-> ^(UNARY_EXPR[$op] $op term)
;
I used the UNARY_EXPR[$op] so the root node gets some useful line/column information instead of defaulting to -1.