move to a new state after cosuming tokens in antlr4 - antlr

i have following grammar
root
: sql_statements EOF | EOF
;
sql_statements
:( sql_statement )+
;
sql_statement
: ddl_statement semicolon
;
ddl_statement
: alter_statement
;
This is a initial snapshot of my grammar
i have following test case
aleter table user;alter table group_user;
first statement has error hence i am getting error for both statements under sql_statements rule
Failed to parse at line 1:1 due to mismatched input 'aleter' expecting ALTER
Now i want to use Defaulterrorhandler functionality to skip first statement till semicolon.
Then it should start from second statement starting with rule sql_statements.
Question :
How to tell which rule to start after consuming error tokens ?
Thanks in advance.

Related

Alternation with EOF producing odd output

The following works fine:
testRoot
: sqlStatement EOF ?
;
However, if I add in the following:
testRoot
: sqlStatement (SEMI | EOF) ?
;
The same input now gives me the following error message:
line 1:8 extraneous input '<EOF>' expecting {<EOF>, SEMI}
Is there something that I'm missing here, or why is that error popping up?
EOF really isn't something that's optional. A token stream will always have an EOF token at the end of the stream.
testRoot: sqlStatement SEMI? EOF;
You always want an EOF at the end of a start rule. Without it, ANTLR may be able to recognize a portion of your input, and will discard the remainder without an error. With it, the rule says... this rule has to end with the EOF token, so it will consume all of the input to get to the EOF and will report any syntax errors it encounters.

ANTLR: Use First Token of next statement to terminate the preceeing one

I am having some issues to related to statement termination for an SQL grammar. Currently it supports statement termination through semicolon ';' token. Making this as optional makes the grammar go out of memory. I want to know whether I can match the first token of next statement as a terminator for previous one without consuming it. Here is a snippet of my parser grammar.
unit_statement
: statement SEMICOLON!
;
statement
options{
backtrack=true;
}
:
declare_statement
| assignment_statement
| sql_statement
;
Here the sql_statement rule must end with a semicolon but declare_statement should not. I am using ANTLR 3

Jison: Reduce Conflict where actually no conflict is

I'm trying to generate a small JavaScript parser which also includes typed variables for a small project.
Luckily, jison already provides a jscore.js which I just adjusted to fit my needs. After adding types I ran into a reduce conflict. I minimized to problem to this minimum JISON:
Jison:
%start SourceElements
%%
// This is up to become more complex soon
Type
: VAR
| IDENT
;
// Can be a list of statements
SourceElements
: Statement
| SourceElements Statement
;
// Either be a declaration or an expression
Statement
: VariableStatement
| ExprStatement
;
// Parses something like: MyType hello;
VariableStatement
: Type IDENT ";"
;
// Parases something like hello;
ExprStatement
: PrimaryExprNoBrace ";"
;
// Parses something like hello;
PrimaryExprNoBrace
: IDENT
;
Actually this script does nothing than parsing two statements:
VariableStatement
IDENT IDENT ";"
ExpStatement
IDENT ";"
As this is a extremly minimized JISON Script, I can't simply replace "Type" be "IDENT" (which btw. worked).
Generating the parser throws the following conflicts:
Conflict in grammar: multiple actions possible when lookahead token is IDENT in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
Conflict in grammar: multiple actions possible when lookahead token is ; in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
States with conflicts:
State 8
Type -> IDENT . #lookaheads= IDENT ;
PrimaryExprNoBrace -> IDENT . #lookaheads= IDENT ;
Is there any trick to fix this conflict?
Thank you in advanced!
~Benjamin
This looks like a Jison bug to me. It is complaining about ambiguity in the cases of these two sequences of tokens:
IDENT IDENT
IDENT ";"
The state in question is that reached after shifting the first IDENT token. Jison observes that it needs to reduce that token, and that (it claims) it doesn't know whether to reduce to a Type or to a PrimaryExpressionNoBrace.
But Jison should be able to distinguish based on the next token: if it is a second IDENT then only reducing to a Type can lead to a valid parse, whereas if it is ";" then only reducing to PrimaryExpressionNoBrace can lead to a valid parse.
Are you sure the given output goes with the given grammar? It would be possible either to add rules or to modify the given ones to produce an ambiguity such as the one described. This just seems like such a simple case that I'm surprised Jison gets it wrong. If it in fact does, however, then you should consider filing a bug report.

ANTLR - identifier with whitespace

i want identifiers that can contain whitespace.
grammar WhitespaceInSymbols;
premise : ( options {greedy=false;} : 'IF' ) id=ID{
System.out.println($id.text);
};
ID : ('a'..'z'|'A'..'Z')+ (' '('a'..'z'|'A'..'Z')+)*
;
WS : ' '+ {skip();}
;
When i test this with "IF statement analyzed" i get a MissingTokenException and the output "IF statement analyzed".
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token. But instead the IF is part of the ID.
Is there a way to achieve my goal? I already tried some variations of the greed=false-option, but without success.
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token.
No, the parser has nothing to say about the creation of tokens: the input is first tokenized and then the parser rules are applied on these tokens. So setting greedy=false has no effect.
You can do this (creating ID tokens with white spaces), but it will be a horrible solution with many predicates, and a few custom methods in the lexer doing manual look-aheads: you really, really don't want this! A much cleaner solution would be to introduce a id rule in your parser and let it match one or more ID tokens.
A demo:
grammar WhitespaceInSymbols;
premise
: IF id THEN EOF
;
id
: ID+
;
IF
: 'IF'
;
THEN
: 'THEN'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WS
: ' '+ {skip();}
;
would parse the input IF statement analyzed THEN into the following tree:

Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.