SableCC expecting EOF - sablecc

I seem to be having issues getting SableCC to process my grammar file (lexing.grammar).
This is what I run through SableCC:
Package lexing ; // A Java package is produced for the
// generated scanner
Helpers
num = ['0'..'9']+; // A num is 1 or more decimal digits
letter = ['a'..'z'] | ['A'..'Z'] ;
// A letter is a single upper or
// lowercase character.
Tokens
number = num; // A number token is a whole number
ident = letter (letter | num)* ;
// An ident token is a letter followed by
// 0 or more letters and numbers.
arith_op = [ ['+' + '-' ] + ['*' + '/' ] ] ;
// Arithmetic operators
rel_op = ['<' + '>'] | '==' | '<=' | '>=' | '!=' ;
// Relational operators
paren = ['(' + ')']; // Parentheses
blank = (' ' | '\t' | 10 | '\n')+ ; // White space
unknown = [0..0xffff] ;
// Any single character which is not part
// of one of the above tokens.
This is the result:
org.sablecc.sablecc.parser.ParserException: [21,1] expecting: EOF
at org.sablecc.sablecc.parser.Parser.parse(Parser.java:1792)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:203)
at org.sablecc.sablecc.SableCC.processGrammar(SableCC.java:171)
at org.sablecc.sablecc.SableCC.main(SableCC.java:137)

You can only have a short_comment if you put a line break (eol) after it. If you use long_comments instead (/* ... */), there's no need for that.
The reason is that, according to the grammar that defines the SableCC 2.x input language, a short comment is defined as a pattern that consumes an eol:
cr = 13;
lf = 10;
eol = cr lf | cr | lf; // This takes care of different platforms
short_comment = '//' not_cr_lf* eol;
Since the last line of your file is:
// of one of the above tokens.
with no eol after it, the comment ends up consuming the last (invisible) EOF token expected at the end of any .sable file, which explains the error.
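A minimal fix, keeping the rest of the grammar as posted, is to either end the file with a line break after the final comment or turn that trailing comment into a long comment, for example:

unknown = [0..0xffff] ;
/* Any single character which is not part
   of one of the above tokens. */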

Related

R Language: Grammar for Raw Strings

I'm trying to create a new rule in the R grammar for Raw Strings.
Quote of the R news:
There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.
Examples:
## A Windows path written as a raw string constant:
r"(c:\Program files\R)"
## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"
But I'm unsure if a grammar file alone is enough to implement the rule.
So far I have tried something like this as a basis, taken from older suggestions for similar grammars:
Parser:
| RAW_STRING_LITERAL #e42
Lexer:
RAW_STRING_LITERAL
: ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ;
Any hints or suggestions are appreciated.
R ANTLR Grammar:
https://github.com/antlr/grammars-v4/blob/master/r/R.g4
Original R Grammar in Bison:
https://svn.r-project.org/R/trunk/src/main/gram.y
To match start- and end-delimiters, you will have to use target-specific code. In Java that could look like this:
@lexer::members {
boolean closeDelimiterAhead() {
// Get the part between `r"` and `(`
String delimiter = getText().substring(2, getText().indexOf('('));
// Construct the end of the raw string
String stopFor = ")" + delimiter + "\"";
for (int n = 1; n <= stopFor.length(); n++) {
if (this._input.LA(n) != stopFor.charAt(n - 1)) {
// No end ahead yet
return false;
}
}
return true;
}
}
RAW_STRING
: [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
;
which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
EDIT
And since the delimiters can only consist of hyphens (plus a parenthesis, bracket, or brace) and not just any arbitrary characters, this should do it as well:
RAW_STRING
: [rR] '"' INNER_RAW_STRING '"'
;
fragment INNER_RAW_STRING
: '-' INNER_RAW_STRING '-'
| '(' .*? ')'
| '{' .*? '}'
| '[' .*? ']'
;

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using Python (a grammar that accepts either a list of numbers or a list of IDs), and yet when I input a string such as December 12 1965, ANTLR runs on the file and shows me no errors with the following code (all of the Python code I'm using is embedded via the @main block):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar and input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce that. When I generate a lexer and parser from your grammar after fixing the error in the @main block (rule = parser.entry_rule() should be rule = parser.program()), and parse the input "December 12 1965" (either from a file or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange, since that could be the start of an idList. The fact is, your grammar contains one more error and one small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel and are therefore not available in parser rules (at least, not without changing the channel from which the parser reads its tokens);
a COMMENT at the end of the input, i.e. one without a trailing \n, will not be tokenized properly. It is better to define a single-line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule anyway.
Because the whitespace rule prevents the parser from matching an idList (or an integerList, for that matter), an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(antlr3.CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.

ANTLR parsing Java Properties

I'm trying to pick up ANTLR and am writing a grammar for Java Properties. I'm hitting an issue and would appreciate some help.
Java Properties has somewhat strange escape handling. For example,
key1=1=Key1
key\=2==
results in the following key-value pairs at runtime:
KEY       VALUE
===       =====
key1      1=Key1
key=2     =
So far, this is the best I can do to mimic that behaviour: folding the '=' and the value into one single token.
grammar Prop;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : (~('='|'\r'|'\n') | '\\=')* ;
VALUE : '=' (~('\r'|'\n'))*;
CARRIAGE_RETURN
: ('\r'|'\n') + {$channel=HIDDEN;}
;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
Does anyone have a suggestion for how I could implement a better one?
Thanks a lot.
It's not as easy as that. You can't handle much at the lexer level, because so much depends on context: at that level you can really only match single characters and then construct keys and values in parser rules. Also, the = and : as possible key-value separators, together with the fact that these characters can also be the start of a value, make them a pain in the butt to translate into a grammar. The easiest approach is to include these (possible) separator characters in your value rule and, after matching the separator and value together, strip the separator characters from it.
A small demo:
JavaProperties.g
grammar JavaProperties;
parse
: line* EOF
;
line
: Space* keyValue
| Space* Comment eol
| Space* LineBreak
;
keyValue
: key separatorAndValue eol
{
// Replace all escaped `=` and `:`
String k = $key.text.replace("\\:", ":").replace("\\=", "=");
// Remove the separator, if it exists
String v = $separatorAndValue.text.replaceAll("^\\s*[:=]\\s*", "");
// Remove all escaped line breaks with trailing spaces
v = v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
System.out.println("\nkey : `" + k + "`");
System.out.println("value : `" + v + "`");
}
;
key
: keyChar+
;
keyChar
: AlphaNum
| Backslash (Colon | Equals)
;
separatorAndValue
: (Space | Colon | Equals) valueChar+
;
valueChar
: AlphaNum
| Space
| Backslash LineBreak
| Equals
| Colon
;
eol
: LineBreak
| EOF
;
Backslash : '\\';
Colon : ':';
Equals : '=';
Comment
: ('!' | '#') ~('\r' | '\n')*
;
LineBreak
: '\r'? '\n'
| '\r'
;
Space
: ' '
| '\t'
| '\f'
;
AlphaNum
: 'a'..'z'
| 'A'..'Z'
| '0'..'9'
;
The grammar above can be tested with the class:
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRFileStream("test.properties");
JavaPropertiesLexer lexer = new JavaPropertiesLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaPropertiesParser parser = new JavaPropertiesParser(tokens);
parser.parse();
}
}
and the input file:
test.properties
key1 = value 1
key2:value 2
key3 :value3
ke\:\=y4=v\
a\
l\
u\
e 4
key\=5==
key6 value6
to produce the following output:
key : `key1`
value : `value 1`
key : `key2`
value : `value 2`
key : `key3`
value : `value3`
key : `ke:=y4`
value : `value 4`
key : `key=5`
value : `=`
key : `key6`
value : `value6`
Realize that my grammar is just an example: it does not account for all valid properties files (sometimes backslashes should be ignored, there is no handling of Unicode escapes, and many characters are missing from the key and value rules). For the complete specification of properties files, see:
http://download.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29
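As a sanity check, you could compare the grammar's output with what java.util.Properties itself produces for the same test.properties file (a small sketch, not part of the original answer):

import java.io.FileReader;
import java.util.Properties;

public class CompareWithJdk {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Properties.load(...) implements the full specification, including
        // ':' separators, escape sequences and line continuations
        props.load(new FileReader("test.properties"));
        props.forEach((k, v) ->
                System.out.println("key : `" + k + "`  value : `" + v + "`"));
    }
}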

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
@parser::members {
private boolean inbounds(Token t, int min, int max) {
int n = Integer.parseInt(t.getText());
return n >= min && n <= max;
}
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like 'set value abc', 'set underli on', 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
The third syntax, 'set connect oneChar', should accept any single character: a digit, a letter, or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax, 'set value', should accept all possible strings, even on and off. But when I give something like 'set value offer', the grammar fails. I think this happens because I already have the token 'OFF'.
None of the three requirements listed above work correctly in my grammar, and I don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semicolon. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens, since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never become tokens of their own (and can therefore never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) operates on these tokens (not characters!). Make sure you understand this! (If not, ask for clarification or grab a book about ANTLR.)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token the lexer produces, which is either an A or a B token. So p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
#5
Putting single-character literals like 'u' inside your parser rules causes the lexer to tokenize a 'u' as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token takes precedence over other tokens. You should keep all such literals out of your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String
String : ...
but do:
pRule : U ':' String
U : 'u';
String : ...
You could make ':' a lexer rule as well, but that is less important. The 'u', however, could also match as a String, so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"set value abc \n" +
"set underli on \n" +
"set conn x \n" +
"set conn xy ";
ANTLRStringStream in = new ANTLRStringStream(source);
firstLexer lexer = new firstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
firstParser parser = new firstParser(tokens);
System.out.println("parsing:\n======\n" + source + "\n======");
parser.parse();
}
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.
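If you want to detect the rejected input programmatically instead of only seeing the console message, one option (a sketch, not from the original answer) is to check the parser's syntax-error count after parsing; the failed oneChar predicate is reported as a syntax error, so the count becomes non-zero:

import org.antlr.runtime.*;

public class StrictMain {
    public static void main(String[] args) throws Exception {
        firstLexer lexer = new firstLexer(new ANTLRStringStream("set conn xy "));
        firstParser parser = new firstParser(new CommonTokenStream(lexer));
        parser.parse();
        // getNumberOfSyntaxErrors() counts the errors reported during the parse
        if (parser.getNumberOfSyntaxErrors() > 0) {
            System.err.println("rejected: the CON argument must be a single character");
        }
    }
}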

eliminate extra spaces in a given ANTLR grammar

For any grammar I create in ANTLR, is it possible to parse input text and have the result of the parsing eliminate any extra spaces in that text? For example, take this simple statement:
int x=5;
If I write
int x = 5 ;
I would like the text to be changed to int x=5; without the extra spaces. Can the parser return the original text without extra spaces?
Can the parser return the original text without extra spaces?
Yes, you need to define a lexer rule that captures these spaces and then skip() them:
Space
: (' ' | '\t') {skip();}
;
which will cause spaces and tabs to be ignored.
PS. I'm assuming you're using Java as the target language. The skip() can be different in other targets (Skip() for C#, for example). You may also want to include \r and \n chars in this rule.
EDIT
Let's say your language only consists of a couple of variable declarations. Assuming you know the basics of ANTLR, the following grammar should be easy to understand:
grammar T;
parse
: stat* EOF
;
stat
: Type Identifier '=' Int ';'
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
And you're parsing the source:
int x = 5 ; double y =5;boolean z = 0 ;
which you'd like to change into:
int x=5;
double y=5;
boolean z=0;
Here's a way to embed code in your grammar and let the parser rules return custom objects (Strings, in this case):
grammar T;
parse returns [String str]
@init{StringBuilder buffer = new StringBuilder();}
@after{$str = buffer.toString();}
: (stat {buffer.append($stat.str).append('\n');})* EOF
;
stat returns [String str]
: Type Identifier '=' Int ';'
{$str = $Type.text + " " + $Identifier.text + "=" + $Int.text + ";";}
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
Test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "int x = 5 ; double y =5;boolean z = 0 ;";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("Result:\n"+parser.parse());
}
}
which produces:
Result:
int x=5;
double y=5;
boolean z=0;