How to check for the end of the line in an ANTLR parser rule even if the newline character is sent to the hidden channel

I have 3 statements as follows
1) IF a==b THEN print(a);
2) IF a==b THEN /* Action block follows */
3) IF a==b THEN
How can I differentiate between these statements using an ANTLR parser rule?
I'm using a rule like this:
if_stmt : IF_T LITERAL_T '==' LITERAL_T THEN_T
{
/* My java code goes here*/
}
I would like to keep the rule as it is and differentiate in the action block of the rule.
Note: the newline character and comments go to the hidden channel.

Maybe you should not put all the "intelligence" into the parser itself; it can get overcomplicated very quickly. You can traverse the AST (maybe using a tree walker) and check whether getLineNumber() for the if statement returns the same value as for the first statement in the THEN block.
Alternatively, you can put a similar condition into the action of the if_stmt rule.
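A minimal sketch of that line comparison, written in Python for brevity rather than as ANTLR action code. The `Tok` record and the `classify_then` helper are hypothetical, and the sketch assumes you can walk the full token stream, hidden-channel tokens included, starting right after THEN.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    kind: str             # e.g. "THEN", "ID", "COMMENT", "NEWLINE" (assumed names)
    line: int             # 1-based source line of the token
    hidden: bool = False  # True if the lexer sent it to the hidden channel


def classify_then(tokens, then_index):
    """Report what follows THEN on the same source line."""
    then_line = tokens[then_index].line
    saw_comment = False
    for tok in tokens[then_index + 1:]:
        if tok.line != then_line or tok.kind == "NEWLINE":
            break
        if not tok.hidden:
            return "statement"   # case 1: IF a==b THEN print(a);
        if tok.kind == "COMMENT":
            saw_comment = True
    # case 2 (comment only) or case 3 (nothing) after THEN
    return "comment" if saw_comment else "empty"
```

In a real grammar the same comparison would live in the rule action or in a tree walker, using whatever line accessor your target's token class provides.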

Related

Semantic predicates fail but don't go to the next one

I tried to use ANTLR4 to identify a range notation like <1..100>, and here is my attempt:
@parser::members {
def evalRange(self, minnum, maxnum, num):
    if minnum <= num <= maxnum:
        return True
    return False
}
range_1_100 : INT { self.evalRange(1, 100, $INT.int) }? ;
But it does not work for more than one range like:
some_rule : range_1_100 | range_200_300 ;
When I input a number (200), it just stops at the first rule:
200
line 3:0 rule range_1_100 failed predicate: { self.evalRange(1, 100, $INT.int) }?
(top (range_1_100 200))
It is not as I expected. How can I make the token match the next rule (range_200_300)?
Here's an excerpt from the docs (emphasis mine):
Predicates can appear anywhere within a parser rule just like actions can, but only those appearing on the left edge of alternatives can affect prediction (choosing between alternatives).
[...]
ANTLR's general decision-making strategy is to find all viable alternatives and then ignore the alternatives guarded with predicates that currently evaluate to false. (A viable alternative is one that matches the current input.) If more than one viable alternative remains, the parser chooses the alternative specified first in the decision.
Which basically means your predicate must be the first item in the alternation to be taken into account during the prediction phase.
Of course, you won't be able to use $INT as it wasn't matched yet at this point, but you can replace it with something like _input.LA(1) instead (lookahead of one token) - the exact syntax depends on your language target.
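To illustrate the quoted decision strategy outside ANTLR, here is a toy simulation in plain Python. It is not ANTLR's implementation; `predict`, `in_range` and the `ALTERNATIVES` table are hypothetical, and the `lookahead` argument stands in for the value of the next token.

```python
def in_range(lo, hi):
    """Build a predicate that checks the lookahead value against [lo, hi]."""
    return lambda lookahead: lo <= lookahead <= hi


# Each alternative is guarded by a predicate on its left edge.
ALTERNATIVES = [
    ("range_1_100", in_range(1, 100)),
    ("range_200_300", in_range(200, 300)),
]


def predict(lookahead):
    """Drop alternatives whose predicate is false, then pick the first survivor."""
    viable = [name for name, pred in ALTERNATIVES if pred(lookahead)]
    return viable[0] if viable else None
```

With the predicates on the left edge, an input of 200 skips the first alternative instead of failing inside it.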
As a side note, I'd advise you to not validate the input through the grammar, it's easier and better to perform a separate validation pass after the parse. Let the grammar handle the syntax, not the semantics.
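A rough sketch of such a separate validation pass, assuming the grammar accepts any INT and hands the parsed values to ordinary code afterwards. The `ALLOWED_RANGES` table and the `validate` function are hypothetical names.

```python
ALLOWED_RANGES = [(1, 100), (200, 300)]  # inclusive bounds, as in <1..100>


def validate(values):
    """Return (value, message) pairs for values outside every allowed range."""
    errors = []
    for value in values:
        if not any(lo <= value <= hi for lo, hi in ALLOWED_RANGES):
            errors.append((value, f"{value} is outside the allowed ranges"))
    return errors
```

This keeps the grammar purely syntactic and lets the error messages be as specific as you like.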

Rule with identical string token twice

Using yacc, I want to parse text like
begin foo ... end foo
The string foo is not known at compile time and there can be different
such strings in the same input.
So far, the only option I see is to check for syntactical correctness after parsing:
block : BEGIN IDENT something END IDENT
{ if (strcmp($2, $5) != 0) yyerror("Mismatch"); }
This feels wrong. The parser should already detect the errors. Is there something built-in to yacc?
yacc only knows about tokens which the lexer can identify. Since those are identical, the lexer could only improve this case by using states.
That is, you could tell lex to remember that it saw a BEGIN and to count the tokens itself, and return a different type of IDENT (and do the checking there).
However, yacc is better suited to this sort of thing, so the answer to the original question is "no", there is no better solution.
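For comparison, here is the same name check expressed over a flat token list, as a Python sketch rather than yacc code. The token shapes and the `check_blocks` name are assumptions made for illustration; nested blocks are handled with a stack.

```python
def check_blocks(tokens):
    """tokens: list of ("begin", name), ("end", name) or ("other", text).

    Returns a list of error messages; an empty list means every block matches.
    """
    stack, errors = [], []
    for kind, value in tokens:
        if kind == "begin":
            stack.append(value)
        elif kind == "end":
            if not stack:
                errors.append(f"end {value} without begin")
            else:
                opened = stack.pop()
                if opened != value:
                    errors.append(f"begin {opened} closed by end {value}")
    errors.extend(f"begin {name} never closed" for name in stack)
    return errors
```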

Bison parser with operator tokens in variable name

I am new to bison, and have the misfortune of needing to write a parser for a language that may have what would otherwise be an operator within a variable name. For example, depending on context, the expression
FOO = BAR-BAZ
could be interpreted as either:
the variable "FOO" being assigned the value of the variable "BAR" minus the value of the variable "BAZ", OR
the variable "FOO" being assigned the value of the variable "BAR-BAZ"
Fortunately the language requires variable declarations ahead of time, so I can determine whether a given string is a valid variable via a function I've implemented:
bool isVariable(char* name);
that will return true if the given string is a valid variable name, and false otherwise.
How do I tell bison to attempt the second scenario above first, and only if (through use of isVariable()) that path fails, go back and try it as the first scenario above? I've read that you can have bison try multiple parsing paths and cull invalid ones when it encounters a YYERROR, so I've tried a set of rules similar to:
variable:
STRING { if(!isVariable($1)) YYERROR; }
;
expression:
expression '-' expression
| variable
;
but when given "BAR-BAZ" the parser tries it as a single variable and just stops completely when it hits the YYERROR instead of exploring the "BAR" - "BAZ" path as I expect. What am I doing wrong?
Edit:
I'm beginning to think that my flex rule for STRING might be the culprit:
(([A-Z0-9][-A-Z0-9_///.]+)|([A-Z])) {yylval.sval = strdup(yytext); return STRING;}
In this case, if '-' appears in the middle of alphanumeric characters, the whole lot is treated as 1 STRING, without the possibility for subdivision by the parser (and therefore only one path explored). I suppose I could manually parse the STRING in the parser action, but it seems like there should be a better way. Perhaps flex could give back alternate token streams (one for the "BAR-BAZ" case and another for the "BAR"-"BAZ" case) that are diverted to different parser stacks for exploration? Is something like that possible?
It's not impossible to solve this problem within a bison-generated parser, but it's not easy, and the amount of hackery required might detract from the readability and verifiability of the grammar.
To be clear, GLR parsers are not fallback parsers. The GLR algorithm explores all possible parses in parallel, and rejects invalid ones as it goes. (The bison implementation requires that the parse converge to a single possible parse; the original GLR algorithm produces a forest of parse trees.) Also, the GLR algorithm does not contemplate multiple lexical analyses.
If you want to solve this problem in the context of the parser, you'll probably need to introduce special handling for whitespace, or at least for - which are not surrounded by whitespace. Otherwise, you will not be able to distinguish between a - b (presumably always subtraction) and a-b (which might be the variable a-b if that variable were defined). Leaving aside that issue, you would be looking for something like this (but this won't work, as explained below):
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: var
| '(' expr ')'
var : ident { if (!isVariable($1)) { /* reject this production */ } }
ident : WORD
| ident '-' WORD { $$ = concatenate($1, "-", $3); }
This won't work because the action associated with var : ident is not executed until after the parse has been disambiguated. So if the production is rejected, the parse fails, because the parser has already determined that the production is necessary. (Until the parser makes that determination, actions are deferred.)
Bison allows GLR grammars to use semantic predicates, which are executed immediately instead of being deferred. But that doesn't help, because semantic predicates cannot make use of computed semantic values (since the semantic value computations are still deferred when the semantic predicate is evaluated). You might think you could get around this by making the computation of the concatenated identifier (in the second ident production) a semantic predicate, but then you run into another limitation: semantic predicates do not themselves have semantic values.
Probably there is a hack which will get around this problem, but that might leave you with a different problem. Suppose that a, c, a-b and b-c are defined variables. Then, what is the meaning of a-b-c? Is it (a-b) - c or a - (b-c) or an error?
If you expect it to be an error, then there is no problem since the GLR parser will find both possible parses and bison-generated GLR parsers signal a syntax error if the parse is ambiguous. But then the question becomes: is a-b-c only an error if it is ambiguous? Or is it an error because you cannot use a subtraction operator without surrounding whitespace if its arguments are hyphenated variables? (So that a-b-c can only be resolved to (a - b) - c or to (a-b-c), regardless of whether a-b and b-c exist?) To enforce the latter requirement, you'll need yet more complication.
If, on the other hand, your language is expected to model a "fallback" approach, then the result should be (a-b) - c. But making that selection is not a simple merge procedure between two expr reductions, because of the possibility of a higher precedence * operator: d * a-b-c either resolves to (d * a-b) - c or (d * a) - b-c; in those two cases, the parse trees are radically different.
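To make the a-b-c ambiguity concrete, the following Python sketch enumerates every way a hyphenated sequence can be read as defined variables joined by subtraction. It is only an illustration of the problem, not part of any bison solution; `segmentations` is a hypothetical helper and the variable set is the one from the example above.

```python
def segmentations(text, variables):
    """Return every reading of a hyphenated sequence as variables joined by '-'."""
    words = text.split("-")
    results = []

    def extend(start, parts):
        if start == len(words):
            results.append(parts)
            return
        # Try every run of words starting at `start` as one variable name.
        for end in range(start + 1, len(words) + 1):
            candidate = "-".join(words[start:end])
            if candidate in variables:
                extend(end, parts + [candidate])

    extend(0, [])
    return results
```

With a, c, a-b and b-c defined, `segmentations("a-b-c", ...)` yields two readings, `["a", "b-c"]` and `["a-b", "c"]`, which is exactly the choice the language definition has to settle.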
An alternative solution is to put the disambiguation of hyphenated variables into the scanner, instead of the parser. This leads to a much simpler and somewhat clearer definition, but it leads to a different problem: how do you tell the scanner when you don't want the semantic disambiguation to happen? For example, you don't want the scanner to insist on breaking up a variable name into segments when the name occurs in a declaration.
Even though the semantic tie-in with the scanner is a bit ugly, I'd go with that approach in this case. A rough outline of a solution is as follows:
First, the grammar. Here I've added a simple declaration syntax, which may or may not have any resemblance to the one in your grammar. See notes below.
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: VARIABLE
| '(' expr ')'
decl : { splitVariables(false); } "set" VARIABLE
{ splitVariables(true); } '=' expr ';'
{ addVariable($2); /* ... */ }
(See below for the semantics of splitVariables.)
Now, the lexer. Again, it's important to know what the intended result for a-b-c is; I'll outline two possible strategies. First, the fallback strategy, which can be implemented in flex:
    int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]]  { yymore();
                                          candidate_len = yyleng;
                                          BEGIN(HYPHENATED);
                                        }
[[:alpha:]][[:alnum:]]*                 { yylval.id = strdup(yytext);
                                          return WORD;
                                        }
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] {
                                          yymore();
                                          if (isVariable(yytext))
                                            candidate_len = yyleng;
                                        }
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*  { if (!isVariable(yytext))
                                            yyless(candidate_len);
                                          yylval.id = strdup(yytext);
                                          BEGIN(INITIAL);
                                          return WORD;
                                        }
That uses yymore and yyless to find the longest prefix sequence of hyphenated words which is a valid variable. (If there is no such prefix, it chooses the first word. An alternative would be to select the entire sequence if there is no such prefix.)
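The same fallback selection can be sketched in plain Python, independent of flex, to show which token the scanner ends up returning. The function name is hypothetical and `is_variable` is assumed to be any callable that answers whether a name is defined.

```python
def longest_variable_prefix(text, is_variable):
    """Split a hyphenated sequence into (token, rest).

    The token is the longest prefix of whole words that is a defined
    variable; if no multi-word prefix qualifies, it is the first word.
    """
    words = text.split("-")
    best = 1  # fall back to the first word
    for count in range(2, len(words) + 1):
        if is_variable("-".join(words[:count])):
            best = count
    token = "-".join(words[:best])
    rest = text[len(token):]  # left in the input for the next scan
    return token, rest
```

For example, with a, c, a-b and b-c defined, the input a-b-c is split into the token a-b and the remaining text -c.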
A similar alternative only allows the complete hyphenated sequence (in the case where that is a valid variable) or individual words. Again, we use yyless and yymore, but this time we don't bother checking intermediate prefixes and we use a second start condition for the case where we know we're not going to combine words:
    int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]]  { yymore();
                                          candidate_len = yyleng;
                                          BEGIN(HYPHENATED);
                                        }
[[:alpha:]][[:alnum:]]*                 { yylval.id = strdup(yytext);
                                          return WORD;
                                        }
<HYPHENATED>("-"[[:alpha:]][[:alnum:]]*)+ {
                                          if (isVariable(yytext)) {
                                            yylval.id = strdup(yytext);
                                            BEGIN(INITIAL);
                                            return WORD;
                                          } else {
                                            yyless(candidate_len);
                                            yylval.id = strdup(yytext);
                                            BEGIN(NO_COMBINE);
                                            return WORD;
                                          }
                                        }
<NO_COMBINE>[[:alpha:]][[:alnum:]]*     { yylval.id = strdup(yytext);
                                          return WORD;
                                        }
<NO_COMBINE>"-"                         { return '-'; }
<NO_COMBINE>.|\n                        { yyless(0); /* rescan */
                                          BEGIN(INITIAL);
                                        }
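As a plain Python sketch of this all-or-nothing strategy (again independent of flex, with hypothetical names), the token sequence produced for one hyphenated run looks like this:

```python
def all_or_nothing(text, is_variable):
    """Tokenize a hyphenated sequence as one WORD, or as words and '-' operators."""
    if is_variable(text):
        return [("WORD", text)]
    tokens = []
    for index, word in enumerate(text.split("-")):
        if index > 0:
            tokens.append(("-", "-"))   # hyphen treated as the subtraction operator
        tokens.append(("WORD", word))
    return tokens
```

Here no intermediate prefix is ever considered: either the whole run is a defined variable, or every hyphen becomes an operator.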
Both of the above solutions use isVariable to decide whether or not a hyphenated sequence is a valid variable. As mentioned earlier, there must be a way to turn off the check, for example in the case of a declaration. To accomplish this, we need to implement splitVariables(bool). The implementation is straightforward; it simply needs to set a flag visible to isVariable. If the flag is set to true, then isVariable always returns true without actually checking for the existence of the variable in the symbol table.
All of that assumes that the symbol table and the splitVariables flag are shared between the parser and the scanner. A naïve solution would make both of these variables globals; a cleaner solution would be to use a pure parser and lexer, and pass the symbol table structure (including the flag) from the main program into the parser, and from there (using %lex-param) into the lexer.
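One possible shape for that shared structure, sketched in Python rather than C. The class and method names are hypothetical, and the flag polarity is one reading of the description: when splitting is turned off (as inside a declaration), `is_variable` accepts any name.

```python
class SymbolTable:
    """Symbol table plus the split flag, handed to both parser and scanner."""

    def __init__(self):
        self._names = set()
        self._split = True  # True: hyphenated names are checked against the table

    def split_variables(self, enabled):
        self._split = enabled

    def add_variable(self, name):
        self._names.add(name)

    def is_variable(self, name):
        # With splitting turned off, accept any name so the scanner
        # returns the whole hyphenated sequence as one token.
        if not self._split:
            return True
        return name in self._names
```

Passing one such object to both components avoids globals, which is the same idea as using a pure parser with %lex-param.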

antlr length of token and error handling

I'm using ANTLR version 3.4.
First question, please see grammar:
request: 'C' DELIM source DELIM target
{ System.out.println("Hi"); }
;
source: ID ;
target: ID ;
DELIM: '|' ;
fragment ALPHA: 'a'..'z' | 'A'..'Z' ;
fragment NUM: '0'..'9' ;
ID: ALPHA (ALPHA | NUM)* ;
"source" and "target" cannot be empty. But my test shows the following:
for input "C|n1|n2" : normal case, no problem.
for input "C||n2" : syntax error, and "Hi" not printed. Expected. Ok
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things if "request" token is reached. But from above even for syntax error the code still reaches "request" token. Why?
Second question: how do I specify a rule for fixed length token, for example, a token of exact 10 digits?
Third question is about error handling. I override emitErrorMessage() in parser to set an error flag, but I found another emitErrorMessage() in lexer. I don't want to share the error flag between the parser and lexer objects. Can I override emitErrorMessage() in lexer to do nothing, and totally rely on the parser to report error? Or put another way, if there is an error, will the parser capture it for sure?
And if the error flag is set for one error, can the parser actually recover and match another rule, so that the previous error was a false alarm?
Thanks for any help!
...
for input "C|n1|" : syntax error, but "Hi" is printed. Not good.
I do need to set other things if "request" token is reached. But from above even for syntax error the code still reaches "request" token. Why?
Because the parser tries to recover from this. If you don't want the parser to (try to) recover from mis-matched tokens, simply throw an exception like this:
grammar T;
// options...
@members {
  @Override
  public void emitErrorMessage(String message) {
    throw new RuntimeException(message);
  }
}
request
 : 'C' DELIM source DELIM target { System.out.println("Hi"); }
 ;
// more rules...
Note that @members is short for @parser::members, so it will only cause emitErrorMessage(...) to be overridden in the parser, not the lexer. For lexer members, you need @lexer::members.
Second question: how do I specify a rule for fixed length token, for example, a token of exact 10 digits?
See: ANTR3 set the number of accepted characters for a token
Third question is about error handling. ...
See the first part of my answer: simply override emitErrorMessage() and do nothing in it (the default action is to print to stderr).
Can I override emitErrorMessage() in lexer to do nothing, and totally rely on the parser to report error?
Well, the parser and lexer handle different types of errors, so ignoring certain problems in the lexer might not cause the parser to produce a warning/error.
Bart, your help is great. I also thought it through and understood the behavior for Question#1 is legitimate. Like a compiler the parser will recover and continue to find as many errors as possible.
For question#2, I also figured out some way to do fixed length. Don't know if it's the popular way:
example : exact3 '|' exact4 ;
// method 1:
exact3 : (d+=DIGIT)+ {$d!=null && $d.size()==3}? ;
// method 2
exact4 : atmost4 {$atmost4.text.length()==4}? ;
atmost4
@init {int n=1;}
 : ({n<=4}?=>DIGIT {n++;})+
 ;
DIGIT:'0'..'9' ;
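The same length check can also be done in ordinary code after the token text has been matched. A minimal Python sketch, with `is_exact_digits` as a hypothetical helper name:

```python
import re


def is_exact_digits(text, length):
    """True if text consists of exactly `length` ASCII digits."""
    return re.fullmatch(r"[0-9]{%d}" % length, text) is not None
```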
For question#3, I'll do fail on first error, i.e. override emitErrorMessage() in both lexer and parser to throw an exception. The choice of emitErrorMessage(msg) is because it has the error message properly prepared.
Thanks all who are sharing!

What does it mean when yacc {code} are in the middle?

extdefs:
{$<ttype>$ = NULL_TREE; } extdef
| extdefs {$<ttype>$ = NULL_TREE; } extdef
;
Why is it in the middle?
An action can appear anywhere in a rule. Sometimes it's useful to have something done in between the symbols, especially in rules with several alternatives like this one.
In the standard description of the yacc utility it's said that:
Actions can occur anywhere in a rule (not just at the end); an action can access values returned by actions to its left, and in turn the value it returns can be accessed by actions to its right. An action appearing in the middle of a rule shall be equivalent to replacing the action with a new non-terminal symbol and adding an empty rule with that non-terminal symbol on the left-hand side. The semantic action associated with the new rule shall be equivalent to the original action. The use of actions within rules might introduce conflicts that would not otherwise exist.
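The equivalence described in that passage can be sketched as a small rewrite in Python. This is an illustration only, not how any yacc implementation is written; `Action`, `desugar` and the generated non-terminal names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    code: str


def desugar(lhs, rhs):
    """Rewrite mid-rule actions as fresh non-terminals with empty rules.

    rhs is a list of symbol names (str) and Action objects. Returns a list
    of (lhs, symbols, action_code) rules.
    """
    rules, symbols, final_action = [], [], None
    for index, item in enumerate(rhs):
        if isinstance(item, Action):
            if index == len(rhs) - 1:
                final_action = item.code             # end-of-rule action stays in place
            else:
                name = f"{lhs}_act{len(rules) + 1}"  # made-up name for the new symbol
                rules.append((name, [], item.code))  # empty rule carrying the action
                symbols.append(name)
        else:
            symbols.append(item)
    rules.append((lhs, symbols, final_action))
    return rules
```

Applied to the second alternative above, `extdefs {...} extdef` becomes an empty rule holding the action plus `extdefs : extdefs extdefs_act1 extdef`. The extra empty rule is also why mid-rule actions can introduce conflicts: the parser must decide to reduce it before seeing what follows.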