I am having some issues to related to statement termination for an SQL grammar. Currently it supports statement termination through semicolon ';' token. Making this as optional makes the grammar go out of memory. I want to know whether I can match the first token of next statement as a terminator for previous one without consuming it. Here is a snippet of my parser grammar.
unit_statement
: statement SEMICOLON!
;
statement
options{
backtrack=true;
}
:
declare_statement
| assignment_statement
| sql_statement
;
Here the sql_statement rule must end with a semicolon but declare_statement should not. I am using ANTLR 3
Related
i have following grammar
root
: sql_statements EOF | EOF
;
sql_statements
:( sql_statement )+
;
sql_statement
: ddl_statement semicolon
;
ddl_statement
: alter_statement
;
This is a initial snapshot of my grammar
i have following test case
aleter table user;alter table group_user;
first statement has error hence i am getting error for both statements under sql_statements rule
Failed to parse at line 1:1 due to mismatched input 'aleter' expecting ALTER
Now i want to use Defaulterrorhandler functionality to skip first statement till semicolon.
Then it should start from second statement starting with rule sql_statements.
Question :
How to tell which rule to start after consuming error tokens ?
Thanks in advance.
I'm trying to recover from an error in an If-Else statement.
In my grammar an If is always followed by an Else.
statement: OBRACES statements CBRACES
| IF OPAR exp CPAR statement ELSE statement
| IF OPAR exp CPAR statement error '\n' { yyerrok; yyclearin;}
;
The error found is in the commented else in the last lines:
public boolean Equal(Element other){
if (!this.Compare(aux01,Age))
ret_val = false ;
//else
//nt = 0 ;
}
error: syntax error, unexpected CBRACES, expecting ELSE -> } # line 29
It is not recovering from that error ignoring the errors that come after.
Maybe i'm not understanding well how this error works but i can only find 2 examples on every site about error recovery: "error '\n'" and "'(' error ')'"
Anyone have an idea how to recover from this error (when an if is not followed by an else).
Thanks
You've provided not enough context to know exactly, but my guess is that the lexer/tokenizer, which feeds tokens to your parser, skips white space - including '\n'. So the parser never sees the newline and b/c of that never reduces the
IF OPAR exp CPAR statement error '\n'
production and so its action
{ yyerrok; yyclearin;}
never gets executed and so the error is not recovered.
Probably (although it is hard to say for sure without seeing more of the grammar) you don't need to skip any tokens in the case that else is not found.
The most likely case is that the program is simply lacking an else clause (perhaps because its author is used to other programming languages in which else is optional), and parsing can simply continue as though there were an empty else clause. So you should be able to just use:
IF OPAR exp CPAR statement error { yyerrok; }
(Note: I removed yyclearin because you almost certainly don't want to do that. In the case of the error in the OP, the result would be to ignore the '}' token, leading to extraneous errors later in the parse.)
You probably should take advantage of the action in this error production to produce a clear error message ("if statements must have else clauses"), although the default message is reasonably clear as well.
It is certainly the case that whatever token(s) are used as error context must be produceable by the scanner. That generally precludes error recovery techniques such as "skip to the end of the line", except in the case of languages in which newlines are syntactically significant.
I found out the problem. Thanks for the help anyway.
I put the error inside the braces and now is working.
statement: OBRACES statements CBRACES
| OBRACES error CBRACES { yyerrok; yyclearin;}
| IF OPAR exp CPAR statement ELSE statement
;
i want identifiers that can contain whitespace.
grammar WhitespaceInSymbols;
premise : ( options {greedy=false;} : 'IF' ) id=ID{
System.out.println($id.text);
};
ID : ('a'..'z'|'A'..'Z')+ (' '('a'..'z'|'A'..'Z')+)*
;
WS : ' '+ {skip();}
;
When i test this with "IF statement analyzed" i get a MissingTokenException and the output "IF statement analyzed".
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token. But instead the IF is part of the ID.
Is there a way to achieve my goal? I already tried some variations of the greed=false-option, but without success.
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token.
No, the parser has nothing to say about the creation of tokens: the input is first tokenized and then the parser rules are applied on these tokens. So setting greedy=false has no effect.
You can do this (creating ID tokens with white spaces), but it will be a horrible solution with many predicates, and a few custom methods in the lexer doing manual look-aheads: you really, really don't want this! A much cleaner solution would be to introduce a id rule in your parser and let it match one or more ID tokens.
A demo:
grammar WhitespaceInSymbols;
premise
: IF id THEN EOF
;
id
: ID+
;
IF
: 'IF'
;
THEN
: 'THEN'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WS
: ' '+ {skip();}
;
would parse the input IF statement analyzed THEN into the following tree:
My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.
G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.