Why are Semantic Predicates not working in ANTLR4 - antlr

I have a very simple grammar that looks like this:
grammar Testing;
a : d | b;
b : {_input.LT(1).equals("b")}? C;
d : {!_input.LT(1).equals("b")}? C;
C : .;
It parses one character from the input and checks whether it is equal to the character b. If so, rule b is used; if not, rule d is used.
However, the parser does not behave as expected: it parses everything using the first rule (rule d).
$ antlr Testing.g4
$ javac *.java
$ grun Testing a -trace
c
enter a, LT(1)=c
enter d, LT(1)=c
consume [#0,0:0='c',<1>,1:0] rule d
exit d, LT(1)=
exit a, LT(1)=
$ grun Testing a -trace
b
enter a, LT(1)=b
enter d, LT(1)=b
consume [#0,0:0='b',<1>,1:0] rule d
exit d, LT(1)=
exit a, LT(1)=
In both cases, rule d is used. However, since there is a guard on rule d, I expect rule d to fail when the first character is exactly 'b'.
Am I doing something wrong when using the semantic predicates?
(I need to use semantic predicates because I need to parse a language where keywords could be used as identifiers).
Reference: https://github.com/antlr/antlr4/blob/master/doc/predicates.md

_input.LT(int) returns a Token, and Token.equals(String) will always return false. What you want to do is call getText() on the Token:
b : {_input.LT(1).getText().equals("b")}? C;
d : {!_input.LT(1).getText().equals("b")}? C;
However, it is often easier to handle keywords-as-identifiers like this:
rule
: KEYWORD_1 identifier
;
identifier
: IDENTIFIER
| KEYWORD_1
| KEYWORD_2
| KEYWORD_3
;
KEYWORD_1 : 'k1';
KEYWORD_2 : 'k2';
KEYWORD_3 : 'k3';
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
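The equals pitfall described above can be reproduced without ANTLR. The following is only a minimal sketch: FakeToken is a hypothetical stand-in for ANTLR's Token (it is not the real class), and it assumes, as with the real token classes, that equals is not overridden to compare against a String.

```java
class TokenEqualsDemo {
    // Compares the token object itself with a String: never equal,
    // because a FakeToken is not a String.
    static boolean compareObject(FakeToken token, String expected) {
        return token.equals(expected);
    }

    // Compares the token's text with a String: the intended check.
    static boolean compareText(FakeToken token, String expected) {
        return token.getText().equals(expected);
    }

    public static void main(String[] args) {
        FakeToken b = new FakeToken("b");
        System.out.println(compareObject(b, "b")); // false
        System.out.println(compareText(b, "b"));   // true
    }
}

// Hypothetical stand-in for an ANTLR Token; not the real runtime class.
class FakeToken {
    private final String text;

    FakeToken(String text) {
        this.text = text;
    }

    // Plays the role of Token.getText().
    String getText() {
        return text;
    }
}
```

Since the first comparison is always false, the predicate on rule b never passes and the negated predicate on rule d always passes, which matches the trace in the question.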

Related

Why doesn't this ANTLR grammar derive the string `baba`?

Using ANTLR v4.9.3, I created the following grammar …
grammar G ;
start : s EOF ;
s : 'ba' a b ;
a : 'b' ;
b : 'a' ;
Given the above grammar, I thought that the following derivation is possible …
start → s → 'ba' a b → 'ba' 'b' b → 'ba' 'b' 'a' = 'baba'
However, my Java test program indicates a syntax error occurs when trying to parse the string baba.
Shouldn't the string baba be in the language generated by grammar G ?
Although the conclusion/answer is already in the comments, here is an answer that explains it in a bit more detail.
When defining literal tokens inside parser rule (the 'ba', 'a' and 'b'), ANTLR implicitly creates the following grammar:
grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;
T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';
Now, when the lexer gets the input "baba", it will create 2 T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently of the parser. The lexer creates tokens following these 2 rules:
1. try to match as many characters as possible for a rule
2. when 2 (or more) lexer rules match the same characters, let the one defined first "win"
Because of rule 1, it is apparent that 2 T__0 tokens are created.
As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:
grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;
which would really be the grammar:
grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;
T__0 : 'b';
T__1 : 'a';
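The two lexer rules (longest match first, then definition order) can be imitated in a few lines of plain Java. This is only a sketch for literal tokens, not how ANTLR's lexer is actually implemented; the class MaximalMunchDemo and its helper methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MaximalMunchDemo {
    // Builds an ordered rule table from alternating name/literal pairs.
    static Map<String, String> rules(String... nameLiteralPairs) {
        Map<String, String> table = new LinkedHashMap<>();
        for (int i = 0; i < nameLiteralPairs.length; i += 2) {
            table.put(nameLiteralPairs[i], nameLiteralPairs[i + 1]);
        }
        return table;
    }

    // Longest match wins; on equal length the rule defined first wins.
    static List<String> tokenize(Map<String, String> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                String literal = rule.getValue();
                // Strictly greater keeps the earlier rule on a tie.
                if (input.startsWith(literal, pos) && literal.length() > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = literal.length();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestRule);
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, String> original = rules("T__0", "ba", "T__1", "b", "T__2", "a");
        Map<String, String> fixed = rules("T__0", "b", "T__1", "a");
        System.out.println(tokenize(original, "baba")); // [T__0, T__0]
        System.out.println(tokenize(fixed, "baba"));    // [T__0, T__1, T__0, T__1]
    }
}
```

With the original rules the parser receives two T__0 tokens and cannot match a and b; with the fixed rules the token stream lines up with s : T__0 T__1 a b.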

ANTLR: "for" keyword used for loops conflicts with "for" used in messages

I have the following grammar:
myg : line+ EOF ;
line : ( for_loop | command params ) NEWLINE;
for_loop : FOR WORD INT DO NEWLINE stmt_body;
stmt_body: line+ END;
params : ( param | WHITESPACE)*;
param : WORD | INT;
command : WORD;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
WORD : (LOWERCASE | UPPERCASE | DIGIT | [_."'/\\-])+ (DIGIT)* ;
INT : DIGIT+ ;
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
FOR: 'for';
DO: 'do';
END: 'end';
My problem is that the 2 following are valid in this language:
message please wait for 90 seconds
This would be a valid command printing a message with the word "for".
for n 2 do
This would be the beginning of a for loop.
The problem is that with the current lexer it doesn't match the for loop since 'for' is matched by the WORD rule as it appears first.
I could solve that by putting the FOR rule before the WORD rule, but then 'for' in the message would be matched by the FOR rule.
This is the typical keywords versus identifiers problem, and I thought there were quite a number of questions about it here on Stack Overflow. But to my surprise I can only find an old answer of mine for ANTLR3.
Even though the principle mentioned there remains the same, with ANTLR4 you can no longer change the returned token type in a parser rule.
There are 2 steps required to make your scenario work.
1. Define the keywords before the WORD rule. This way they get their own token types, which you need for the grammar parts that require specific keywords.
2. Add keywords selectively to the rules that parse names, wherever you want to allow those keywords too.
For the second step modify your rules:
param: WORD | INT | commandKeyword;
command: WORD | commandKeyword;
commandKeyword: FOR | DO | END; // Keywords allowed as names in commands.
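To see how the two steps work together, here is a rough plain-Java imitation. It is only a sketch under simplifying assumptions: tokens are split on whitespace, INT is left out, and the class and method names (KeywordAsNameDemo, tokenType, isParam) are invented for this example rather than generated by ANTLR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class KeywordAsNameDemo {
    // Token types accepted by the commandKeyword rule.
    static final Set<String> COMMAND_KEYWORDS = Set.of("FOR", "DO", "END");

    // Step 1: keywords are checked before the general WORD rule,
    // so "for" always gets its own token type.
    static String tokenType(String word) {
        switch (word) {
            case "for": return "FOR";
            case "do":  return "DO";
            case "end": return "END";
            default:    return "WORD";
        }
    }

    // Simplified lexer: assumes tokens are separated by whitespace.
    static List<String> tokenTypes(String line) {
        List<String> types = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            types.add(tokenType(word));
        }
        return types;
    }

    // Step 2: param accepts WORD or any commandKeyword (INT omitted here).
    static boolean isParam(String type) {
        return type.equals("WORD") || COMMAND_KEYWORDS.contains(type);
    }

    public static void main(String[] args) {
        List<String> types = tokenTypes("message please wait for 90 seconds");
        System.out.println(types); // [WORD, WORD, WORD, FOR, WORD, WORD]
        // Every token after the command name is still a valid param.
        System.out.println(types.stream().skip(1).allMatch(KeywordAsNameDemo::isParam)); // true
    }
}
```

The lexer always emits FOR for 'for'; it is the parser rule that decides whether the keyword starts a loop or is just another parameter.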

Matching similar terms in ANTLR without capturing difference

As part of the nand2tetris challenge I'm trying to write a parser using ANTLR to generate machine code, having already implemented it using regex.
However, I'm struggling to work out how to use ANTLR effectively. A subset of the problem is below.
(Some) Valid instructions
M=D
D=M
M=D+1
D;JMP
0;JMP
A (partial) Regex
(?<assignment>(?<destination>[ADM])=)?(?<computation>[ADM+10])(?<condition>;(?<jump>JMP))?
A (partial) grammar
command
: assignment '=' computation
| computation ';' condition
| assignment '=' computation ';' condition
;
assignment
: ASSIGNMENT
;
computation
: OPERATION
;
condition
: CONDITION
;
ASSIGNMENT
: DESTINATION
;
CONDITION
: JUMP
;
DESTINATION
: 'A'
| 'D'
| ...etc
;
OPERATION
: 'A'
| 'D'
| 'A+D'
| ... etc
;
JUMP
: JMP
| JLE
| etc...
;
Now, as you can see, the lexer will get mixed up between what is a computation and what is an assignment, as both could be 'A'...
However, if I change the ASSIGNMENT to
ASSIGNMENT
: DESTINATION '='
;
and command to
command
: assignment computation
| etc...
then assignment picks up the equals sign.
So, I am trying to match under two tokens (FOO and FOO=) in different contexts, but I'm not interested in the =, only the FOO.
Am I barking up the wrong tree entirely with the current approach?

antlr 4 - warning: rule contains an optional block with at least one alternative that can match an empty string

I work with antlr v4 to write a t-sql parser.
Is this warning a problem?
"rule 'sqlCommit' contains an optional block with at least one alternative that can match an empty string"
My Code:
sqlCommit: COMMIT (TRAN | TRANSACTION | WORK)? id?;
id:
ID | CREATE | PROC | AS | EXEC | OUTPUT| INTTYPE |VARCHARTYPE |NUMERICTYPE |CHARTYPE |DECIMALTYPE | DOUBLETYPE | REALTYPE
|FLOATTYPE|TINYINTTYPE|SMALLINTTYPE|DATETYPE|DATETIMETYPE|TIMETYPE|TIMESTAMPTYPE|BIGINTTYPE|UNSIGNEDBIGINTTYPE..........
;
ID: (LETTER | UNDERSCORE | RAUTE) (LETTER | [0-9]| DOT | UNDERSCORE)*
In an earlier version I used the lexer rule ID directly instead of the parser rule id in sqlCommit. But after changing ID to id, the warning appeared.
(Hint, in case ID and id are confusing: I want to use the parser rule id instead of ID because an identifier can be a literal that may already be matched by another lexer rule.)
Regards
EDIT
With the help of "280Z28" I solved the problem. The parser rule "id" contained one | more than needed:
BITTYPE|CREATE|PROC|
|AS|EXEC|OUTPUT|
So the | | means that the parser rule can match an empty string.
From a Google search:
ErrorType.EPSILON_OPTIONAL
Compiler Warning 154.
rule rule contains an optional block with at least one alternative that can match an empty string
A rule contains an optional block ((...)?) around an empty alternative.
The following rule produces this warning.
x : ;
y : x?; // warning 154
z1 : ('foo' | 'bar'? 'bar2'?)?; // warning 154
z2 : ('foo' | 'bar' 'bar2'? | 'bar2')?; // ok
Since:
4.1
The problem described by this warning is primarily a performance problem. By wrapping a zero-length string in an optional block, you added a completely unnecessary decision to the grammar (whether to enter the optional block or not) which has a high likelihood of forcing the prediction algorithm through its slowest path. It's similar to wrapping Java code in the following:
if (slowMethodThatAlwaysReturnsTrue()) {
...
}
I'm struggling to see how this rule also suffers from this warning (with ANTLR 4.7.1):
join_type: (INNER | (left_right_full__join_type)? (OUTER)?)? JOIN;
left_right_full__join_type: LEFT | RIGHT | FULL;
JOIN: J O I N;
INNER: I N N E R;
OUTER: O U T E R;
AFAICT it always matches JOIN, optionally preceded by the type.

Rascal error when specifying grammar

I have a simple file in rascal for specifying a toy grammar
module temp
import IO;
import ParseTree;
layout LAYOUT = [\t-\n\r\ ]*;
start syntax Simple
= A B ;
syntax A = "Hello"+ ("joe" "pok")* ;
syntax A= "Hi";
syntax B = "world"*|"wembly";
syntax B = C | C C* ;
public void main () {
println("hello");
iprint(parse(#start[Simple], "Hello Hello world world world"));
}
This works fine. However, the problem is that I didn't want to write
syntax B = C | C C* ;
I wanted to write
syntax B = ( C | C C* )?
but it was rejected as a parse error by Rascal, even though all of
syntax B = ( C C C* )? ;
syntax B = ( C | C* )? ;
syntax B = C | C C* ;
are accepted fine. Can anyone explain to me what I'm doing wrong?
The sequence symbol (nested sequence) always requires brackets in Rascal. The meta notation is defined as
syntax Sym = sequence: "(" Sym+ ")" | opt: Sym "?" | alternative: "(" Sym "|" {Sym "|"}+ ")" | ... ;
So, in your example you should have written:
syntax B = (C | (C C*))?;
What is perhaps confusing is that Rascal uses the | sign twice. Once for separating top-level alternatives, once for nested alternative:
syntax X = "a" | "b"; // top-level
syntax Y = ("c" | "d"); // nested, will internally generate a new rule:
syntax ("c" | "d") = "c" | "d";
Finally, normal alternatives have sequences without brackets, as in:
syntax B
= C
| C C*
;
// or less abstractly:
syntax Exp = left Exp "*" Exp
> left Exp "+" Exp
;
BTW, we generally avoid the use of too many nested regular expressions because they are so anonymous and therefore make interpreting parse trees harder. The best usage of regular expressions is for expressing lexical syntax where we are not so much interested in the internal structure anyhow.