GOLD Parser comment grammar - grammar

I'm having some trouble with comment blocks in my grammar. The syntax is fine, but Step 3 DFA scanner is complaining about the way I'm going about it.
The language I'm trying to parse looks like this:
{statement}{statement} etc.
Within each statement can be a couple of different types of comments:
{% This is a comment.
It can contain multiple lines
and continues until the statement end}
{statement REM This is a comment.
It can contain multiple lines
and continues until the statement end}
This is a simplified grammar that displays the problem I'm running into:
"Start Symbol" = <Program>
{String Chars} = {Printable} + {HT} - ["\]
StringLiteral = '"' ( {String Chars} | '\' {Printable} )* '"'
Comment Start = '{%'
Comment End = '}'
Comment Block #= { Ending = Closed } ! Eat the } and produce an empty statement
!Comment #= { Type = Noise } !Implied by GOLD
Remark Start = 'REM'
Remark End = '}'
Remark Block #= { Ending = Open } ! Don't eat the }, the statements expects it
Remark #= { Type = Noise }
<Program> ::= <Statements>
<Statements> ::= '{' <Statement> '}' <Statements> | <>
<Statement> ::= StringLiteral
Step 3 is complaining about the } in <Statements> and the } for the End of the lexical group.
Anyone know how to accomplish what I need?
[Edit]
I got the REM portion working with the following:
{Remark Chars} = {Printable} + {WhiteSpace} - [}]
Remark = 'REM' {Remark Chars}* '}'
<Statements> ::= <Statements> '{' <Statement> '}'
| <Statements> '{' <Statement> <Remark Stmt>
| <>
<Remark Stmt> ::= Remark
This is actually ideal, since Remarks are not necessarily noise to me.
Still having issues with the comment lexical group. I'll look at solving in the same way.

I don't think capturing the REM comment with a lexical group is possible.
I think you need to define a new terminal like this:
Remark = 'REM' ({Printable} - '}')*
This however means, that you need to be able to handle this new terminal in your productions...
Eg.
From:
<CurlyStatement> ::= '{' <Statement> '}'
To:
<CurlyStatement> ::= '{' <Statement> '}'
| '{' <Statement> Remark '}'
I have'nt checked the syntax in the above examples, but I hope you get my idear

Related

R Language: Grammar for Raw Strings

I'm trying to create a new rule in the R grammar for Raw Strings.
Quote of the R news:
There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.
Examples:
## A Windows path written as a raw string constant:
r"(c:\Program files\R)"
## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"
But I'm unsure if a grammar file alone is enough to implement the rule.
Until now I tried something like this as a basis from older suggestions of similar grammars:
Parser:
| RAW_STRING_LITERAL #e42
Lexer:
RAW_STRING_LITERAL
: ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ;
Any hints or suggestions are appreciated.
R ANTLR Grammar:
https://github.com/antlr/grammars-v4/blob/master/r/R.g4
Original R Grammar in Bison:
https://svn.r-project.org/R/trunk/src/main/gram.y
To match start- and end-delimiters, you will have to use target specific code. In Java that could look like this:
#lexer::members {
boolean closeDelimiterAhead() {
// Get the part between `r"` and `(`
String delimiter = getText().substring(2, getText().indexOf('('));
// Construct the end of the raw string
String stopFor = ")" + delimiter + "\"";
for (int n = 1; n <= stopFor.length(); n++) {
if (this._input.LA(n) != stopFor.charAt(n - 1)) {
// No end ahead yet
return false;
}
}
return true;
}
}
RAW_STRING
: [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
;
which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
EDIT
And since the delimiters can only consist of hyphens (and parenthesis/braces) and not just any arbitrary character, this should do it as well:
RAW_STRING
: [rR] '"' INNER_RAW_STRING '"'
;
fragment INNER_RAW_STRING
: '-' INNER_RAW_STRING '-'
| '(' .*? ')'
| '{' .*? '}'
| '[' .*? ']'
;

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.

Outputting a context-appropriate error message on a missing token

I have a grammar to parse some source code:
document
: header body_block* EOF
-> body_block*
;
header
: header_statement*
;
body_block
: '{' block_contents '}'
;
block_contents
: declaration_list
| ... other things ....
It's legal for a document to have a header without a body or a body without a header.
If I try to parse a document that looks like
int i;
then ANTLR complains that it found int when it was expecting EOF. This is true, but I'd like it to say that it was expecting {. That is, if the input contains something between the header and the EOF that's not a body_block, then I'd like to suggest to the user that they meant to enclose that text inside a body_block.
I've made a couple almost working attempts at this that I can post if that's illuminating, but I'm hoping that I've just missed something easy.
Not pretty, but something like this would do it:
body_block
: ('{')=> '{' block_contents '}'
| t=.
{
if(!$t.text.equals("{")) {
String message = "expected a '{' on line " + $t.getLine() + " near '" + $t.text + "'";
}
else {
String message = "encountered a '{' without a '}' on line " + $t.getLine();
}
throw new RuntimeException(message);
}
;
(not tested, may contain syntax errors!)
So, whenever '{' ... '}' is not matched, it falls through to .1 and produces a more understandable error message.
1 note that a . in a parser rule matches any token, not any character!

Simple grammar not working

I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figuring out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other then "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K" and will therefor throw an exception. Note that the lexer doesn't remove "K" and then tries to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".

ANTLR: how to parse a region within matching brackets with a lexer

i want to parse something like this in my lexer:
( begin expression )
where expressions are also surrounded by brackets. it isn't important what is in the expression, i just want to have all what's between the (begin and the matching ) as a token. an example would be:
(begin
(define x (+ 1 2)))
so the text of the token should be (define x (+ 1 2)))
something like
PROGRAM : LPAREN BEGIN .* RPAREN;
does (obviously) not work because as soon as he sees a ")", he thinks the rule is over, but i need the matching bracket for this.
how can i do that?
Inside lexer rules, you can invoke rules recursively. So, that's one way to solve this. Another approach would be to keep track of the number of open- and close parenthesis and let a gated semantic predicate loop as long as your counter is more than zero.
A demo:
T.g
grammar T;
parse
: BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
;
BeginToken
#init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // keep reapeating `( ... )*` as long as open > 0
( ~('(' | ')') // match anything other than parenthesis
| '(' {open++;} // match a '(' in increase the var `open`
| ')' {open--;} // match a ')' in decrease the var `open`
)
)*
;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String input = "(begin (define x (+ (- 1 3) 2)))";
TLexer lexer = new TLexer(new ANTLRStringStream(input));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
javac -cp antlr-3.3-complete.jar *.java
java -cp .:antlr-3.3-complete.jar Main
parsed :: (begin (define x (+ (- 1 3) 2)))
Note that you'll need to beware of string literals inside your source that might include parenthesis:
BeginToken
#init{int open = 1;}
: '(' 'begin' ( {open > 0}?=> // ...
( ~('(' | ')' | '"') // ...
| '(' {open++;} // ...
| ')' {open--;} // ...
| '"' ... // TODO: define a string literal here
)
)*
;
or comments that may contain parenthesis.
The suggestion with the predicate uses some language specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't have custom code in your lexer:
BeginToken
: '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
;
fragment NestedParens
: '(' ( ~('(' | ')') | NestedParens )* ')'
;
fragment Spaces
: (' ' | '\t')+
;