I have a simple Rascal file that specifies a toy grammar:
module temp
import IO;
import ParseTree;
layout LAYOUT = [\t-\n\r\ ]*;
start syntax Simple
= A B ;
syntax A = "Hello"+ ("joe" "pok")* ;
syntax A= "Hi";
syntax B = "world"*|"wembly";
syntax B = C | C C* ;
public void main () {
println("hello");
iprint(parse(#start[Simple], "Hello Hello world world world"));
}
This works fine. However, I didn't want to write
syntax B = C | C C* ;
I wanted to write
syntax B = ( C | C C* )?
but Rascal rejected it with a parse error, even though all of
syntax B = ( C C C* )? ;
syntax B = ( C | C* )? ;
syntax B = C | C C* ;
are accepted fine. Can anyone explain to me what I'm doing wrong?
The sequence symbol (nested sequence) always requires brackets in Rascal. The meta notation is defined as
syntax Sym = sequence: "(" Sym+ ")" | opt: Sym "?" | alternative: "(" Sym "|" {Sym "|"}+ ")" | ... ;
So, in your example you should have written:
syntax B = (C | (C C*))?;
What is perhaps confusing is that Rascal uses the | sign in two ways: once for separating top-level alternatives, and once for nested alternatives:
syntax X = "a" | "b"; // top-level
syntax Y = ("c" | "d"); // nested, will internally generate a new rule:
syntax ("c" | "d") = "c" | "d";
Finally, normal alternatives have sequences without brackets, as in:
syntax B
= C
| C C*
;
// or less abstractly:
syntax Exp = left Exp "*" Exp
> left Exp "+" Exp
;
BTW, we generally avoid the use of too many nested regular expressions because they are so anonymous and therefore make interpreting parse trees harder. The best usage of regular expressions is for expressing lexical syntax where we are not so much interested in the internal structure anyhow.
I have a very simple grammar that looks like this:
grammar Testing;
a : d | b;
b : {_input.LT(1).equals("b")}? C;
d : {!_input.LT(1).equals("b")}? C;
C : .;
It parses one character from the input and checks whether it is equal to the character b. If so, rule b is used, and if not, rule d is used.
However, the parse tree does not match this expectation: everything is parsed using the first rule (rule d).
$ antlr Testing.g4
$ javac *.java
$ grun Testing a -trace (base)
c
enter a, LT(1)=c
enter d, LT(1)=c
consume [#0,0:0='c',<1>,1:0] rule d
exit d, LT(1)=
exit a, LT(1)=
$ grun Testing a -trace (base)
b
enter a, LT(1)=b
enter d, LT(1)=b
consume [#0,0:0='b',<1>,1:0] rule d
exit d, LT(1)=
exit a, LT(1)=
In both cases, rule d is used. However, since there is a guard on rule d, I expect rule d to fail when the first character is exactly 'b'.
Am I doing something wrong when using the semantic predicates?
(I need to use semantic predicates because I need to parse a language where keywords could be used as identifiers).
Reference: https://github.com/antlr/antlr4/blob/master/doc/predicates.md
_input.LT(int) returns a Token, and Token.equals(String) will always return false. What you want to do is call getText() on the Token:
b : {_input.LT(1).getText().equals("b")}? C;
d : {!_input.LT(1).getText().equals("b")}? C;
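The same pitfall can be reproduced outside ANTLR. Below is a minimal Python sketch, assuming a hypothetical Token class (not the real runtime class) that, like the Java one, defines no equality against strings:

```python
class Token:
    """Hypothetical stand-in for an ANTLR token; not the real runtime class."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


def is_b_wrong(token):
    # Compares the token object itself with a string. Token defines no
    # equality against str, so this is False for every token.
    return token == "b"


def is_b_right(token):
    # Compares the token's text with the string.
    return token.get_text() == "b"
```

With the object comparison, the guard on rule b can never succeed and the negated guard on rule d always succeeds, which matches the trace in the question where rule d is chosen for both inputs.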
However, often it is easier to handle keywords-as-identifiers in such a way:
rule
: KEYWORD_1 identifier
;
identifier
: IDENTIFIER
| KEYWORD_1
| KEYWORD_2
| KEYWORD_3
;
KEYWORD_1 : 'k1';
KEYWORD_2 : 'k2';
KEYWORD_3 : 'k3';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
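To make the idea concrete, here is a rough Python sketch of the same approach: the lexer always gives keywords their own token type, and the identifier rule simply accepts those token types as well. The whitespace-based tokenizer and the function names are assumptions made for illustration only.

```python
import re

KEYWORDS = {"k1": "KEYWORD_1", "k2": "KEYWORD_2", "k3": "KEYWORD_3"}
IDENT_RE = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def tokenize(source):
    """Split on whitespace; keywords take priority over IDENTIFIER."""
    tokens = []
    for word in source.split():
        if word in KEYWORDS:
            tokens.append((KEYWORDS[word], word))
        elif IDENT_RE.fullmatch(word):
            tokens.append(("IDENTIFIER", word))
        else:
            raise ValueError(f"unexpected input: {word!r}")
    return tokens


def parse_identifier(tokens, pos):
    """identifier : IDENTIFIER | KEYWORD_1 | KEYWORD_2 | KEYWORD_3 ;"""
    if pos >= len(tokens):
        raise SyntaxError("expected an identifier")
    kind, text = tokens[pos]
    if kind == "IDENTIFIER" or kind in KEYWORDS.values():
        return text, pos + 1
    raise SyntaxError(f"expected an identifier, got {kind}")


def parse_rule(source):
    """rule : KEYWORD_1 identifier ;  Returns the identifier's text."""
    tokens = tokenize(source)
    if not tokens or tokens[0][0] != "KEYWORD_1":
        raise SyntaxError("expected 'k1'")
    name, pos = parse_identifier(tokens, 1)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return name
```

Here parse_rule("k1 k2") accepts k2 as an identifier even though the lexer tokenized it as a keyword, so no predicate is needed.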
In my grammar, declarations are usually only allowed in forms like these:
int x, y, z = 23;
int i = 1;
int j;
In a for statement, I'd like to allow a comma-separated list of declarations of different types, e.g.
for (int i = 0, double d = 2.0; i < 0; i++) { ... }
Using yacc, the limited lookahead creates problems. Here is the naive grammar:
variables_decl
: type_expression IDENT
| variables_decl ',' IDENT
;
declaration
: variables_decl
| variables_decl '=' initializer
;
declaration_list
: declaration
| declaration_list ',' declaration
;
This causes a shift/reduce error on ',':
state 149
100 variables_decl: variables_decl . ',' IDENT
101 declaration: variables_decl .
102 | variables_decl . '=' initializer
',' shift, and go to state 261
'=' shift, and go to state 262
',' [reduce using rule 101 (declaration)]
$default reduce using rule 101 (declaration)
I'd like to fix this issue so that this actually works:
for (double x, y, int i, j = 0, long l = 1; i < 0; i++) { ... }
But it's not obvious to me how to do that.
In general terms, you avoid this type of shift/reduce conflict by not forcing the parser to make a decision until absolutely necessary.
It's understandable why you have structured the grammar as you have; intuitively, a declaration list is a list of declarations, where each declaration is a type and a list of variables of that type. The problem is that this definition makes it impossible to know whether a comma belongs to an inner or outer list.
Moreover, one extra lookahead token might not be enough, since the following IDENT could be a typename or the name of a variable to be declared, assuming type-expression is the usual C syntax which can start with an identifier corresponding to a typename.
But that's not the only way to look at the declaration list syntax. Instead, you can think of it as a list of individual declarations, each of which starts with an optional type (except the first in the list, which must have an explicit type), using the semantic convention that an omitted type is the same as the type of the previous variable. That leads to the following conflict-free grammar:
declaration_list: explicit_decl
| declaration_list ',' declaration
declaration : explicit_decl
| implicit_decl
explicit_decl : type_expression implicit_decl
implicit_decl : IDENT opt_init
opt_init : %empty | '=' expr
That does not capture the syntax of C declarations, since not all C declarations have the form type_expression IDENT. The IDENT being defined can be buried inside the declaration, as with, for example, int a[4] or int f(int i); fortunately, these forms are of limited use in a for loop.
On the other hand, unlike your grammar it does allow all declared variables to be initialised, so
int a = 1, b = 0, double x = -1.0, y = 0.0
should work.
Another note: the first item in a C for clause can be empty, a declaration (possibly in the form of a list) or an expression. In the last case, a top-level , is an operator, not a list indicator.
In short, the above fragment might or might not be a solution in the context of your actual grammar. But it is conflict-free in a simple test framework where typed declarations are always of the form typename identifier.
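The semantic convention behind that grammar (an omitted type means "same type as the previous variable") can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: typenames come from a fixed, hypothetical set, every initializer is a single token, and tokens are separated by spaces or commas.

```python
TYPE_NAMES = {"int", "double", "long"}  # assumption: typenames are known up front


def tokenize(text):
    # Assumption: tokens are space-separated; commas may be attached to a token.
    return text.replace(",", " , ").split()


def parse_declaration_list(text):
    """Return a list of (type, name, initializer_or_None) tuples."""
    tokens = tokenize(text)
    pos = 0
    current_type = None
    result = []
    while True:
        # explicit_decl starts with a type; implicit_decl reuses the previous one.
        if pos < len(tokens) and tokens[pos] in TYPE_NAMES:
            current_type = tokens[pos]
            pos += 1
        elif current_type is None:
            raise SyntaxError("the first declaration needs an explicit type")
        # implicit_decl : IDENT opt_init
        if pos >= len(tokens) or tokens[pos] in TYPE_NAMES or tokens[pos] in {",", "="}:
            raise SyntaxError("expected an identifier")
        name = tokens[pos]
        pos += 1
        init = None
        if pos < len(tokens) and tokens[pos] == "=":
            if pos + 1 >= len(tokens) or tokens[pos + 1] == ",":
                raise SyntaxError("expected an initializer")
            init = tokens[pos + 1]  # assumption: the initializer is one token
            pos += 2
        result.append((current_type, name, init))
        if pos == len(tokens):
            return result
        if tokens[pos] != ",":
            raise SyntaxError(f"expected ',', got {tokens[pos]!r}")
        pos += 1
```

The parser never has to decide whether a comma closes an inner or an outer list; it only looks at whether the next token is a typename, which is the same decision the grammar above defers to the declaration rule.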
I am using ANTLR v4 to write a T-SQL parser.
Is this warning a problem?
"rule 'sqlCommit' contains an optional block with at least one alternative that can match an empty string"
My Code:
sqlCommit: COMMIT (TRAN | TRANSACTION | WORK)? id?;
id:
ID | CREATE | PROC | AS | EXEC | OUTPUT| INTTYPE |VARCHARTYPE |NUMERICTYPE |CHARTYPE |DECIMALTYPE | DOUBLETYPE | REALTYPE
|FLOATTYPE|TINYINTTYPE|SMALLINTTYPE|DATETYPE|DATETIMETYPE|TIMETYPE|TIMESTAMPTYPE|BIGINTTYPE|UNSIGNEDBIGINTTYPE..........
;
ID: (LETTER | UNDERSCORE | RAUTE) (LETTER | [0-9]| DOT | UNDERSCORE)*
In an earlier version I used the lexer rule ID directly instead of the parser rule id in sqlCommit. The warning appeared after I changed ID to id.
(Hint, in case ID and id are confusing: I want to use the parser rule id instead of ID because an identifier can be a literal that may already be matched by another lexer rule.)
Regards
EDIT
With the help of "280Z28" I solved the problem. The parser rule "id" contained one | more than needed:
BITTYPE|CREATE|PROC|
|AS|EXEC|OUTPUT|
So the | | means that the parser rule has an empty alternative and can therefore match an empty string.
From a Google search:
ErrorType.EPSILON_OPTIONAL
Compiler Warning 154.
rule rule contains an optional block with at least one alternative that can match an empty string
A rule contains an optional block ((...)?) around an empty alternative.
The following rule produces this warning.
x : ;
y : x?; // warning 154
z1 : ('foo' | 'bar'? 'bar2'?)?; // warning 154
z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?; // ok
Since:
4.1
The problem described by this warning is primarily a performance problem. By wrapping a zero-length string in an optional block, you added a completely unnecessary decision to the grammar (whether to enter the optional block or not) which has a high likelihood of forcing the prediction algorithm through its slowest path. It's similar to wrapping Java code in the following:
if (slowMethodThatAlwaysReturnsTrue()) {
...
}
I'm struggling to see how this rule also suffers from this warning (with antlr 4.7.1)
join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
left_right_full__join_type: LEFT | RIGHT | FULL;
JOIN: J O I N;
INNER: I N N E R;
OUTER: O U T E R;
AFAICT it always returns JOIN and optionally preceded by the type.
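One way to see why the warning fires, both for z1 in the documentation example and for join_type: an alternative inside the optional block can itself match nothing. The Python sketch below approximates that check on hand-built trees. The tuple encoding and function names are hypothetical, and this is not how ANTLR itself implements warning 154.

```python
def nullable(node, rules):
    """True if the grammar fragment can match the empty string."""
    kind = node[0]
    if kind == "tok":
        return False
    if kind == "ref":  # assumption: referenced rules are not recursive
        return nullable(rules[node[1]], rules)
    if kind == "seq":
        return all(nullable(item, rules) for item in node[1])
    if kind == "alt":
        return any(nullable(alt, rules) for alt in node[1])
    if kind == "opt":
        return True
    raise ValueError(f"unknown node kind: {kind}")


def has_epsilon_optional(node, rules):
    """True if some (...)? block wraps an alternative that can match nothing."""
    kind = node[0]
    if kind in ("tok", "ref"):
        return False
    if kind == "opt":
        inner = node[1]
        alts = inner[1] if inner[0] == "alt" else [inner]
        if any(nullable(alt, rules) for alt in alts):
            return True
        return has_epsilon_optional(inner, rules)
    return any(has_epsilon_optional(child, rules) for child in node[1])


def tok(name):
    return ("tok", name)


def opt(node):
    return ("opt", node)


RULES = {"left_right_full__join_type": ("alt", [tok("LEFT"), tok("RIGHT"), tok("FULL")])}

# z1 : ('foo' | 'bar'? 'bar2'?)?;
Z1 = opt(("alt", [tok("foo"), ("seq", [opt(tok("bar")), opt(tok("bar2"))])]))
# z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?;
Z2 = opt(("alt", [tok("foo"), ("seq", [tok("bar"), opt(tok("bar2"))]), tok("bar2")]))
# join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
JOIN_TYPE = ("seq", [
    opt(("alt", [
        tok("INNER"),
        ("seq", [opt(("ref", "left_right_full__join_type")), opt(tok("OUTER"))]),
    ])),
    tok("JOIN"),
])
```

In join_type, the second alternative consists of two optional parts, so it can match the empty string inside the outer (...)? block, just like z1. One possible rewrite that follows the z2 pattern, where every alternative consumes at least one token, would be (INNER | left_right_full__join_type OUTER? | OUTER)? JOIN.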
In an old post I found a recommendation for a SQL parser while I was searching for Lex and Yacc. Here is the link:
SQL lex yacc grammar
I later found that it is the code explained in the O'Reilly book "lex & yacc".
I am trying to get it working and have successfully integrated it into my application, but whenever I send an UPDATE command I get a syntax error, even with the simplest ones:
UPDATE user SET name = 'johnfoo'
I get the error on the = symbol. I have tried to trace everything but I cannot find why it gives this message. I have tried to analyze the lex and yacc code and the error makes no sense to me, as the code looks correct.
[UPDATE] The error I get is just:
1: syntax error at =
Embedded SQL parse failed
INSERT works perfectly.
After trying several approaches, I found that the suggested solution (since deleted by its author) worked, at least to some extent.
What he suggested was updating the lex and yacc definitions of comparison.
In the lex file, replace
<SQL>"=" |
<SQL>"<>" |
<SQL>"<" |
<SQL>">" |
<SQL>"<=" |
<SQL>">=" TOK(COMPARISON)
by
<SQL>"=" TOK(EQ)
<SQL>"<>" TOK(NE)
<SQL>"<" TOK(LT)
<SQL>">" TOK(GT)
<SQL>"<=" TOK(LE)
<SQL>">=" TOK(GE)
In the yacc file add:
comparison:
EQ
| NE
| LT
| GT
| LE
| GE
;
Then replace every reference to = with EQ (and likewise for the other symbols), and every COMPARISON with comparison. That is, replace
%left COMPARISON /* = <> < > <= >= */
by
%left EQ NE LT GT LE GE /* = <> < > <= >= */
assignment:
column '=' scalar_exp
| column '=' NULLX
;
by
assignment:
column EQ scalar_exp
| column EQ NULLX
;
And
comparison_predicate:
scalar_exp COMPARISON scalar_exp
| scalar_exp COMPARISON subquery
;
by
comparison_predicate:
scalar_exp comparison scalar_exp
| scalar_exp comparison subquery
;
And it works!
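A plausible reading of why this helps: the lexer returned COMPARISON for "=", while the assignment rule referred to the = character directly, so the two never matched. The Python sketch below mimics that mismatch with a toy lexer and a hard-coded UPDATE rule. Everything in it (function names, the whitespace tokenizer, the token tables) is a hypothetical simplification, not the book's code.

```python
OPERATORS = ("=", "<>", "<", ">", "<=", ">=")

# Old lexer: every comparison operator becomes the single token COMPARISON.
OLD_LEXER = {op: "COMPARISON" for op in OPERATORS}
# New lexer: each operator gets its own token.
NEW_LEXER = {"=": "EQ", "<>": "NE", "<": "LT", ">": "GT", "<=": "LE", ">=": "GE"}


def lex(sql, operator_tokens):
    """Toy lexer: assumes tokens are separated by whitespace."""
    tokens = []
    for word in sql.split():
        if word in operator_tokens:
            tokens.append(operator_tokens[word])
        elif word.upper() in {"UPDATE", "SET"}:
            tokens.append(word.upper())
        elif word.startswith("'"):
            tokens.append("STRING")
        else:
            tokens.append("NAME")
    return tokens


def parse_update(tokens, assign_token):
    """UPDATE NAME SET NAME <assign_token> STRING"""
    expected = ["UPDATE", "NAME", "SET", "NAME", assign_token, "STRING"]
    for index, want in enumerate(expected):
        got = tokens[index] if index < len(tokens) else "EOF"
        if got != want:
            raise SyntaxError(f"syntax error: expected {want}, got {got}")
    if len(tokens) > len(expected):
        raise SyntaxError("syntax error: unexpected trailing input")
    return True
```

With the old token table, the grammar waits for "=" but receives COMPARISON and reports a syntax error at that position; with the new table and EQ in the rule, the same statement parses.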
I am implementing a parser with ANTLR for D. This language is based on C, so there is some ambiguity between declarations and expressions. Consider this:
a* b = c; // This is a declaration of the variable b with a pointer-to-a type.
c = a * b; // as an expression is a multiplication.
As the second example could only appear on the right of an assignment expression I tried to resolve this problem with the following snippet:
expression
: left = assignOrConditional
(',' right = assignOrConditional)*
;
assignOrConditional
: ( postfixExpression ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=') )=> assignExpression
| conditionalExpression
;
assignExpression
: left = postfixExpression
( op = ('=' | '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '~=' | '<<=' | '>>=' | '>>>=' | '^^=')
right = assignOrExpression
)?
;
conditionalExpression
: left = logicalOrExpression
('?' e1 = conditionalExpression ':' e2 = conditionalExpression)?
;
As far as my understanding goes, this should do the trick to avoid the ambiguity but the tests are failing. If I feed the interpreter with any input, starting with the rule assignOrConditional, it will fail with NoViableAltException.
The inputs were:
a = b
b-=c
d
Maybe I'm misunderstanding how predicates work, so it would be great if someone could correct my explanation of the code: If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression. (Note that the assignmentExpression and the conditionalExpression work well.) If the next token isn't one of them, it tries to parse it as a conditionalExpression.
EDIT
[solved] Now there's another problem with this solution that I noticed: if assignments are chained, the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, whether a postfix expression and an assignment operator follow).
Any idea what's wrong with my understanding?
If I feed the interpreter with any input, ...
Don't use ANTLRWorks' interpreter: it is buggy, and disregards any type of predicate. Use its debugger: it works flawlessly.
If the input can be read as a postfixExpression it will check if the next token after the postfixExpression is one of the assignment operators and if it is, it will parse the rule as an assignmentExpression.
You are correct.
EDIT [solved] Now there's another problem with this solution that I noticed: if assignments are chained, the assignmentExpression has to decide whether its right-hand expression is an assignment again (that is, whether a postfix expression and an assignment operator follow).
What's wrong with that?
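For readers unfamiliar with syntactic predicates, the behaviour described above can be approximated in Python as a speculative parse that consumes nothing. This is only a sketch: postfixExpression and conditionalExpression are both reduced to a single identifier, which is an assumption made for brevity, and the function names are hypothetical.

```python
ASSIGN_OPS = {"=", "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "~=",
              "<<=", ">>=", ">>>=", "^^="}


def parse_postfix(tokens, pos):
    # Assumption: a postfixExpression is just one identifier here.
    if pos < len(tokens) and tokens[pos].isidentifier():
        return tokens[pos], pos + 1
    raise SyntaxError("expected a postfix expression")


def predicate_matches(tokens, pos):
    """Emulates ( postfixExpression assignOp )=> without consuming input."""
    try:
        _, after = parse_postfix(tokens, pos)
    except SyntaxError:
        return False
    return after < len(tokens) and tokens[after] in ASSIGN_OPS


def parse_assign_or_conditional(tokens, pos=0):
    if predicate_matches(tokens, pos):
        # assignExpression: the right-hand side goes through the same
        # decision again, so chained assignments nest to the right.
        left, pos = parse_postfix(tokens, pos)
        op = tokens[pos]
        right, pos = parse_assign_or_conditional(tokens, pos + 1)
        return ("assign", op, left, right), pos
    # Stand-in for conditionalExpression.
    return parse_postfix(tokens, pos)
```

Because the right-hand side re-enters the predicated rule, a chained input such as a = b = c is handled by the same lookahead, which is why the chaining mentioned in the edit is not a problem in itself.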