I have a grammar to parse some source code:
document
: header body_block* EOF
-> body_block*
;
header
: header_statement*
;
body_block
: '{' block_contents '}'
;
block_contents
: declaration_list
| ... other things ....
It's legal for a document to have a header without a body or a body without a header.
If I try to parse a document that looks like
int i;
then ANTLR complains that it found int when it was expecting EOF. This is true, but I'd like it to say that it was expecting {. That is, if the input contains something between the header and the EOF that's not a body_block, then I'd like to suggest to the user that they meant to enclose that text inside a body_block.
I've made a couple of almost-working attempts at this that I can post if that's illuminating, but I'm hoping that I've just missed something easy.
Not pretty, but something like this would do it:
body_block
  : ('{')=> '{' block_contents '}'
  | t=.
    {
      // declare message outside the branches so the throw can see it
      String message;
      if (!$t.text.equals("{")) {
        message = "expected a '{' on line " + $t.getLine() + " near '" + $t.text + "'";
      } else {
        message = "encountered a '{' without a '}' on line " + $t.getLine();
      }
      throw new RuntimeException(message);
    }
  ;
(not tested, may contain syntax errors!)
So, whenever '{' ... '}' is not matched, the input falls through to the '.' alternative and produces a more understandable error message. Note that a '.' in a parser rule matches any token, not any character!
I have started working with ANTLR4 to create a syntax parser for a self-defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Each part's body is surrounded by braces.
I have created a sample file and also copy-paste-modified an ANTLR g4 file to parse it. It works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser, and extend the #region header to #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I did not expect the parser to lump the whole line together when comparing. What am I doing wrong?
Thanks.
There are a couple of problems here:
whatever NOSPACE matches, is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently from the parser, and the lexer will match as many characters as possible.
When 2 (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
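The two tie-breakers can be illustrated with a small toy model in plain C (this is only an illustration, not ANTLR's actual code; `text_len` and `nospace_len` mirror the two character classes from the grammar above):

```c
#include <stddef.h>

/* Toy model of the lexer's tie-breaking: the longest match wins,
   and on a tie the rule defined first wins. */

static size_t text_len(const char *s)      /* TEXT : ~[\u0000-\u001F]+ */
{
    size_t n = 0;
    while ((unsigned char)s[n] > 0x1F) n++;   /* includes spaces */
    return n;
}

static size_t nospace_len(const char *s)   /* NOSPACE : ~[\u0000-\u0020]+ */
{
    size_t n = 0;
    while ((unsigned char)s[n] > 0x20) n++;   /* stops at a space */
    return n;
}

/* Rules in grammar order: TEXT first, then NOSPACE. */
const char *winning_rule(const char *input)
{
    size_t t = text_len(input), ns = nospace_len(input);
    if (ns > t)                    /* a strictly longer match would win... */
        return "NOSPACE";
    return t > 0 ? "TEXT" : NULL;  /* ...but on a tie the earlier rule wins */
}
```

Since everything NOSPACE can match is also matched (at least as far) by TEXT, `winning_rule` can never return "NOSPACE" — which is exactly the "out of luck" situation described above.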
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token at the point where the parser is trying to match '#region'.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
Bart,
thank you very much for clarifying this for me. The key phrase was "the lexer will match as many characters as possible". This is a behavior I still need to get used to. I redesigned my lexer and parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-Za-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;
As described in the title, I am using Bison and Flex to build a parser, but I need to handle errors and continue parsing after I find one. Thus I use:
Stmt: Reference '=' Expr ';' { printf(" Reference = Expr ;\n");}
| '{' Stmts '}' { printf("{ Stmts }");}
| WHILE '(' Bool ')' '{' Stmts '}' { printf(" WHILE ( Bool ) { Stmts } ");}
| FOR NAME '=' Expr TO Expr BY Expr '{' Stmts '}' { printf(" FOR NAME = Expr TO Expr BY Expr { Stmts } ");}
| IF '(' Bool ')' THEN Stmt { printf(" IF ( Bool ) THEN Stmt ");}
| IF '(' Bool ')' THEN Stmt ELSE Stmt { printf(" IF ( Bool ) THEN Stmt ELSE Stmt ");}
| READ Reference ';' { printf(" READ Reference ;");}
| WRITE Expr ';' { printf(" WRITE Expr ;");}
| error ';' { yyerror("Statement is not valid"); yyclearin; yyerrok;}
;
However, I always get the message "syntax error", and I do not know where it comes from or how to prevent it so that my own error code is executed instead.
I am trying to do error recovery here so that my parser will continue parsing the input until EOF.
People often confuse the purpose of error rules in yacc/bison -- they are for error RECOVERY, not for error HANDLING. So an error rule is not called in response to an error -- the error happens and then the error rule is used to recover.
If you want to handle the error yourself (and avoid printing a "syntax error" message), you need to define your own yyerror function (that is the error handler) and have it do something with the "syntax error" string other than printing it. One option is to do nothing there, and instead print a message in your error recovery rule (e.g., where you call yyerror, call printf instead). The catch is that if error recovery fails, you won't get any message at all (you will get a failure return from yyparse, so you could print a message there).
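As a sketch of the first option (a plain C bison setup is assumed; the `last_error` buffer is a made-up name, not part of bison's API), a yyerror that records the message instead of printing it:

```c
#include <string.h>

char last_error[128];   /* hypothetical buffer; inspect it after yyparse() */

/* Overrides bison's default behaviour: instead of printing
   "syntax error" to stderr, remember the message, so the
   printf in the error-recovery rule is all the user sees. */
void yyerror(const char *msg)
{
    strncpy(last_error, msg, sizeof last_error - 1);
    last_error[sizeof last_error - 1] = '\0';
}
```

Since bison calls yyerror before starting recovery, the recorded message can also be folded into the report made by the `error ';'` rule.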
Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')*, as ANTLR throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])*, as it prohibits any string containing the individual characters * or /.
I had similar problem, my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
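That pattern accepts exactly the strings that never contain '*' immediately followed by '/'. A tiny C predicate expressing the same acceptance test (illustrative only, not ANTLR-generated code):

```c
#include <stdbool.h>
#include <string.h>

/* Equivalent acceptance test for ( ~'*' | ( '*'+ ~[/*] ) )* '*'* :
   a string matches iff it contains no '*' directly followed by '/'. */
bool accepted(const char *s)
{
    return strstr(s, "*/") == NULL;
}
```

The grammar form looks more complicated only because a lexer rule has to spell out, character by character, what "no '*' followed by '/'" means.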
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
@header
{
import java.util.*;
}
@parser::members
{
    // true when the given text is free of the forbidden "*/" sequence
    boolean hasNoEvilCharacters(String input)
    {
        return !input.contains("*/");
    }
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
    : { hasNoEvilCharacters(_input.LT(1).getText()) }? String
      (Space { hasNoEvilCharacters(_input.LT(1).getText()) }? String)*
      Period EOF
    ;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few, there is a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token from:
characters that are neither '*' nor '/'
runs of '*' that are not followed by '/', or a single '/'
There are some fragment rules that deal with special situations (multiple '*' followed by something other than '/', or trailing '*').
I’m trying to implement simple parsing over custom .c files with added syntax.
Ex: test.c
.
// I don’t need this in output
int func1(int a, int b);
//I need this.
#parseme int func2(int a, int b);
//and this …
#parseme
void func3()
{
Int a;
//put here where ever
…
{
//inside block
}
return;
}
.
I want to use a fuzzy parsing approach in the lexer phase and then, in the parser rules, rewrite tokens with TokenRewriteStream and templates.
Well, here's a piece of the lexer …
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{ }'
METHOD
: HEADER .* (';' | BODY )
;
fragment
HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
BODY: '{' .* '}' ;
.
…
The problem is probably simple to an expert eye:
1- The lexer stops matching when it finds a ';' before reaching the last '}' in "#parseme void func3() ...".
2- The lexer stops matching at the right curly brace of an inner block.
3- And surely more cases I haven't tested yet.
The problem is really obvious. Is the solution too? I hope so!
Thanks.
Answering myself.
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{}'
METHOD
: METHOD_HEADER (~'{')* METHOD_END ;
fragment
METHOD_HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
METHOD_END
: (';' | BLOCK ) ;
fragment
BLOCK
: '{' ( ~('{' | '}') | BLOCK )* '}' ;
WS : (' '|'\r'|'\t'|'\n')+ ;
The solution was very simple.
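The recursive BLOCK fragment is what makes nested braces work. The same recursion, sketched in plain C (illustrative, not generated code):

```c
/* Returns the index of the '}' matching the '{' at s[i],
   or -1 if the braces are unbalanced. Mirrors
   BLOCK : '{' ( ~('{' | '}') | BLOCK )* '}' ; */
int match_block(const char *s, int i)
{
    if (s[i] != '{') return -1;
    for (i++; s[i] != '\0'; i++) {
        if (s[i] == '{') {
            i = match_block(s, i);   /* recurse into a nested BLOCK */
            if (i < 0) return -1;
        } else if (s[i] == '}') {
            return i;                /* found the matching close */
        }
    }
    return -1;                       /* ran out of input */
}
```

This is why the earlier attempt failed: without the recursive fragment, the first '}' the lexer sees — even one closing an inner block — ends the match.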
I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figure out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", it enters the KEYWORD rule. If it then encounters something other than "eyword]" ("i" in your case), it tries to fall back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K", so it throws an exception. Note that the lexer doesn't remove the "K" and try to match again (the lexer is a dumb machine)!
Your 2nd grammar works because the lexer now has something to fall back on when "[Ki" is not matched by KEYWORD, since ID now includes the "[".
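The fallback in the second grammar can be sketched as a toy tokenizer in plain C (illustrative only; `token_type` is a made-up helper, not ANTLR output):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

static bool is_keyword(const char *s)      /* KEYWORD : '[Keyword]' */
{
    return strcmp(s, "[Keyword]") == 0;
}

static bool is_bracketed_id(const char *s) /* second grammar's ID rule */
{
    size_t n = strlen(s);
    if (n < 3 || s[0] != '[' || s[n - 1] != ']') return false;
    if (!isalpha((unsigned char)s[1]) && s[1] != '_') return false;
    for (size_t i = 2; i + 1 < n; i++)
        if (!isalnum((unsigned char)s[i]) && s[i] != '_') return false;
    return true;
}

/* KEYWORD is tried first (defined first in the grammar);
   ID is the rule the lexer can fall back on. */
const char *token_type(const char *s)
{
    if (is_keyword(s)) return "KEYWORD";
    if (is_bracketed_id(s)) return "ID";
    return "error";
}
```

In the first grammar there is no `is_bracketed_id` fallback for input starting with "[K", which is why "[Ki]" produces the NoViableAltException.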