ANTLR simple grammar and Identifier - antlr

I wrote this simple grammar for ANTLR:
grammar ALang;
@members {
public static void main(String[] args) throws Exception {
ALangLexer lex = new ALangLexer(new ANTLRFileStream("antlr/ALang.al"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ALangParser parser = new ALangParser(tokens);
parser.prog();
}
}
prog :
ID | PRINT
;
PRINT : 'print';
ID : ( 'a'..'z' | 'A'..'Z' )+;
WS : (' ' | '\t' | '\n' | '\r')+ { skip(); };
Using this as input:
print
the only token found is of type ID. Isn't it enough to put the PRINT token definition right before the ID definition?

Yes, that is enough. If you define PRINT after ID, ANTLR will produce an error:
ALang.g:21:1: The following token definitions can never be matched because prior tokens match the same input: PRINT
I'm sorry, I didn't mean to use the production PRINT : 'print '; (with a trailing space) but the one without it: PRINT : 'print'; The problem is that 'print' is matched as ID and not as PRINT.
No, that can't be the case.
The following:
grammar ALang;
@members {
public static void main(String[] args) throws Exception {
ALangLexer lex = new ALangLexer(new ANTLRStringStream("sprint print prints foo"));
CommonTokenStream tokens = new CommonTokenStream(lex);
ALangParser parser = new ALangParser(tokens);
parser.prog();
}
}
prog
: ( ID {System.out.printf("ID :: '\%s'\n", $ID.text);}
| PRINT {System.out.printf("PRINT :: '\%s'\n", $PRINT.text);}
)*
EOF
;
PRINT : 'print';
ID : ('a'..'z' | 'A'..'Z')+;
WS : (' ' | '\t' | '\n' | '\r')+ {skip();};
will print:
ID :: 'sprint'
PRINT :: 'print'
ID :: 'prints'
ID :: 'foo'
As you see, the PRINT rule does match "print".
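The behaviour shown above can be modelled in plain Java without ANTLR. The sketch below is only a simplified model (the class name LongestMatchDemo and the regex-based rules are assumptions made for illustration, and ANTLR's generated lexer works differently internally): at each position the longest match wins, and on a tie the rule defined first is kept, which is why "print" becomes PRINT while "sprint" and "prints" become ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; this is not the lexer ANTLR generates.
class LongestMatchDemo {
    // Rules in definition order: PRINT before ID, as in the grammar above.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("PRINT", Pattern.compile("print"));
        RULES.put("ID", Pattern.compile("[a-zA-Z]+"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // whitespace is skipped, as in the grammar
                out.add(bestRule + " :: '" + input.substring(pos, pos + bestLen) + "'");
            }
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("sprint print prints foo")) {
            System.out.println(token);
        }
    }
}
```

Running main prints the same four lines as the ANTLR test above. Swapping the order of the PRINT and ID entries makes every word come out as ID, which mirrors the "can never be matched" error.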

Related

ANTLR: How to skip multiline comments

Given the following lexer:
lexer grammar CodeTableLexer;
@header {
package ch.bsource.ice.parsers;
}
CodeTabHeader : OBracket Code ' ' Table ' ' Version CBracket;
CodeTable : Code ' '* Table;
EndCodeTable : 'end' ' '* Code ' '* Table;
Code : 'code';
Table : 'table';
Version : '1.0';
Row : 'row';
Tabdef : 'tabdef';
Override : 'override' | 'no_override';
Obsolete : 'obsolete';
Substitute : 'substitute';
Status : 'activ' | 'inactive';
Pkg : 'include_pkg' | 'exclude_pkg';
Ddic : 'include_ddic' | 'exclude_ddic';
Tab : 'tab';
Naming : 'naming';
Dfltlang : 'dfltlang';
Language : 'english' | 'german' | 'french' | 'italian' | 'spanish';
Null : 'null';
Comma : ',';
OBracket : '[';
CBracket : ']';
Boolean
: 'true'
| 'false'
;
Number
: Int* ('.' Digit*)?
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '$' | '#' | '.' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length() - 1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"'))* '"'
;
Comment
: '--' ~('\r' | '\n')* { skip(); }
| '/*' .* '*/' { skip(); }
;
Space
: (' ' | '\t') { skip(); }
;
NewLine
: ('\r' | '\n' | '\u000C') { skip(); }
;
fragment Int
: '1'..'9'
| '0'
;
fragment Digit
: '0'..'9'
;
... and the following parser:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
@header {
package ch.bsource.ice.parsers;
}
parse
: block EOF
;
block
: CodeTabHeader^ codeTable endCodeTable
;
codeTable
: CodeTable^ codeTableData
;
codeTableData
: (Identifier^ obsolete?) (tabdef | row)*
;
endCodeTable
: EndCodeTable
;
tabdef
: Tabdef^ Identifier+
;
row
: Row^ rowData
;
rowData
: (Number^ | (Identifier^ (Comma Number)?))
Override?
obsolete?
status?
Pkg?
Ddic?
(tab | field)*
;
tab
: Tab^ value+
;
field
: (Identifier^ value) | naming
;
value
: OBracket? (Identifier | String | Number | Boolean | Null) CBracket?
;
naming
: Naming^ defaultNaming (l10nNaming)*
;
defaultNaming
: Dfltlang^ String
;
l10nNaming
: Language^ String?
;
obsolete
: Obsolete^ Substitute String
;
status
: Status^ Override?
;
... finally my class for making the parser case-insensitive:
package ch.bsource.ice.parsers;
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + 1 - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + 1 - 1]);
}
}
... single-line comments are skipped as expected, while multi-line comments aren't... here is the error message I get:
codetable_1.txt line 38:0 mismatched character '<EOF>' expecting '*'
codetable_1.txt line 38:0 mismatched input '<EOF>' expecting EndCodeTable
java.lang.NullPointerException
...
Am I missing something? Is there anything I should be aware of? I'm using antlr 3.4.
Here is also the example source code I'm trying to parse:
[code table 1.0]
/*
This is a multi-line comment
*/
code table my_table
-- this is a single-line comment
row 1
id "my_id_1"
name "my_name_1"
descn "my_description_1"
naming
dfltlang "My description 1"
english "My description 1"
german "Meine Beschreibung 1"
-- this is another single-line comment
row 2
id "my_id_2"
name "my_name_2"
descn "my_description_2"
naming
dfltlang "My description 2"
english "My description 2"
german "Meine Beschreibung 2"
end code table
Any help would be really appreciated :-)
Thanks,
j3d
To do this in ANTLR 4, use a non-greedy match and the skip lexer command:
BlockComment
: '/*' .*? '*/' -> skip
;
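Bart gave me amazing support, and I think we all really appreciate him :-)
Anyway, the problem was a bug in the file stream class I use to convert the character stream to lowercase: LA(int i) ignored its argument and used the constant 1 in place of i, so every lookahead returned the same character. Here is the corrected Java source code: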
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, encoding); // pass the requested encoding through instead of discarding it
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + i - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + i - 1]);
}
}
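The difference between the two LA implementations can be seen in isolation. The sketch below is a hypothetical stand-in (LookaheadDemo, buggyLA and fixedLA are names invented here; the fields data, p and n only mimic those of the stream class) and does not depend on the ANTLR runtime. A plausible reading of the original failure is that the lexer needs LA(2) to see the '/' that follows '*' in order to end a block comment, and the buggy version kept returning the character at LA(1) instead.

```java
// Hypothetical stand-in for the stream's lookahead logic; not the ANTLR class itself.
class LookaheadDemo {
    static final int EOF = -1;
    final char[] data;
    final int n;
    int p = 0; // index of the next character to be consumed

    LookaheadDemo(String s) {
        data = s.toCharArray();
        n = data.length;
    }

    // Variant from the question: ignores i and always inspects data[p].
    int buggyLA(int i) {
        if (i == 0) return 0;
        if (i < 0) i++;
        if ((p + 1 - 1) >= n) return EOF;
        return Character.toLowerCase(data[p + 1 - 1]);
    }

    // Corrected variant: looks i characters ahead of the current position.
    int fixedLA(int i) {
        if (i == 0) return 0;
        if (i < 0) i++;
        if ((p + i - 1) >= n) return EOF;
        return Character.toLowerCase(data[p + i - 1]);
    }

    public static void main(String[] args) {
        LookaheadDemo s = new LookaheadDemo("*/X");
        System.out.println((char) s.buggyLA(2)); // same character as LA(1)
        System.out.println((char) s.fixedLA(2)); // the character after it
    }
}
```

With the input "*/X", buggyLA(2) yields '*' while fixedLA(2) yields '/', so only the fixed version lets a lexer recognise the closing "*/".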
I use two rules to skip line and block comments (I print them during parsing for debugging purposes). They are split in two for readability, and the block comment rule supports nested comments.
Also, I do not skip EOL characters (\r and/or \n) in my grammar because I need them explicitly for some rules.
LineComment
: '//' ~('\n'|'\r')* //NEWLINE
{System.out.println("lc > " + getText());
skip();}
;
BlockComment
@init { int depthOfComments = 0;}
: '/*' {depthOfComments++;}
( options {greedy=false;}
: ('/' '*')=> BlockComment {depthOfComments++;}
| '/' ~('*')
| ~('/')
)*
'*/' {depthOfComments--;}
{
if (depthOfComments == 0) {
System.out.println("bc >" + getText());
skip();
}
}
;
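The depth-counting idea behind the nested BlockComment rule can also be sketched in plain Java. This is an illustration only, under the assumption that comments are simply removed from the text; the class NestedCommentDemo and its strip method are hypothetical names, not part of ANTLR or of the grammar above.

```java
// Hypothetical helper illustrating nested block-comment handling with a depth counter.
class NestedCommentDemo {
    // Removes /* ... */ comments, honouring nesting, and returns the remaining text.
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many comments are currently open
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("/*", i)) {
                depth++;      // entering a (possibly nested) comment
                i += 2;
            } else if (depth > 0 && src.startsWith("*/", i)) {
                depth--;      // leaving the innermost open comment
                i += 2;
            } else {
                if (depth == 0) {
                    out.append(src.charAt(i)); // ordinary text outside any comment
                }
                i++;
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unterminated block comment");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a /* x /* y */ z */ b"));
    }
}
```

Text is kept only while the depth is zero, so the inner "*/" does not end the outer comment.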

Selectively Skip Newline Depending on Context

I must parse files made of two parts. In the first one, new lines must be skipped. In the second one, they are important and used as a delimiter.
I want to avoid solutions like http://www.antlr.org/wiki/pages/viewpage.action?pageId=1734 and use predicate instead.
For the moment, I have something like:
WS: ( ' ' | '\t' | NEWLINE) {SKIP();};
fragment NEWLINE : '\r'|'\n'|'\r\n';
I tried to add a dynamically scoped variable keepNewline that is set to true when "entering" second part of the file.
However, I am not able to create the correct predicate to switch off the "skipping" of newlines.
Any help would be greatly appreciated.
Best regards.
It's easier than you might think: you don't even need a predicate.
Let's say you want to preserve line breaks only inside <pre>...</pre> tags. The following dummy grammar does just that:
grammar Pre;
@lexer::members {
private boolean keepNewLine = false;
}
parse
: (t=.
{
System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));
}
)*
EOF
;
Word
: ('a'..'z' | 'A'..'Z')+
;
OPr
: '<pre>' {keepNewLine = true;}
;
CPr
: '</pre>' {keepNewLine = false;}
;
NewLine
: ('\r'? '\n' | '\r') {if(!keepNewLine) skip();}
;
Space
: (' ' | '\t') {skip();}
;
which you can test with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
PreLexer lexer = new PreLexer(new ANTLRFileStream("in.txt"));
PreParser parser = new PreParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If in.txt contains:
foo bar
<pre>
a
b
</pre>
baz
the output of running the Main class would be:
Word 'foo'
Word 'bar'
OPr '<pre>'
NewLine '\n'
Word 'a'
NewLine '\n'
NewLine '\n'
Word 'b'
NewLine '\n'
CPr '</pre>'
Word 'baz'
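The same flag-based technique can be tried without generating a lexer. The following is a minimal hand-written sketch, assuming ASCII words only; PreLexerDemo and its token labels are invented for illustration and are not what ANTLR produces for the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer showing a lexer-level flag that toggles newline skipping.
class PreLexerDemo {
    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static List<String> tokenize(String in) {
        List<String> tokens = new ArrayList<>();
        boolean keepNewLine = false; // same role as the lexer member above
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (in.startsWith("<pre>", i)) {
                keepNewLine = true;
                tokens.add("OPr");
                i += 5;
            } else if (in.startsWith("</pre>", i)) {
                keepNewLine = false;
                tokens.add("CPr");
                i += 6;
            } else if (c == '\r' || c == '\n') {
                // Treat "\r\n" as a single line break.
                if (c == '\r' && i + 1 < in.length() && in.charAt(i + 1) == '\n') {
                    i++;
                }
                i++;
                if (keepNewLine) {
                    tokens.add("NewLine"); // emitted only between <pre> and </pre>
                }
            } else if (c == ' ' || c == '\t') {
                i++; // spaces and tabs are always skipped
            } else if (isAsciiLetter(c)) {
                int start = i;
                while (i < in.length() && isAsciiLetter(in.charAt(i))) {
                    i++;
                }
                tokens.add("Word(" + in.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x\n<pre>\ny\n</pre>\nz"));
    }
}
```

Because the flag lives in the tokenizer itself, the decision is made at the moment each line break is scanned, which is why no parser-side predicate is needed.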

ANTLR antlrWorks error messages are not displayed to the output console

When I enter the following input, which has an error on the third line:
SELECT entity_one, entity_two FROM myTable;
first_table, extra_table as estable, tineda as cam;
asteroid tenga, tenta as myName, new_eNoal as coble
I debugged it with antlrWorks and found that the error message corresponding to the third line gets shown on the debugger output window:
output/__Test___input.txt line 3:8 required (...)+ loop did not match anything at input ' '
output/__Test___input.txt line 3:9 missing END_COMMAND at 'tenga'
but when I run the application by itself these error messages are not being displayed at the console.
The error messages get displayed on the console whenever the error is on the first line like:
asteroid tenga, tenta as myName, new_eNoal as coble
SELECT entity_one, entity_two FROM myTable;
first_table, extra_table as estable, tineda as cam;
console output:
inputSql.rst line 1:8 required (...)+ loop did not match anything at input ' '
inputSql.rst line 1:9 missing END_COMMAND at 'tenga'
How could I have them displayed on the console too when the errors are not located at the 1st line?
UserRequest.g
grammar UserRequest;
tokens{
COMMA = ',' ;
WS = ' ' ;
END_COMMAND = ';' ;
}
@header {
package com.linktechnology.input;
}
@lexer::header {
package com.linktechnology.input;
}
@members{
public static void main(String[] args) throws Exception {
UserRequestLexer lex = new UserRequestLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
UserRequestParser parser = new UserRequestParser(tokens);
try {
parser.request();
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
process : request* EOF ;
request : (sqlsentence | create) END_COMMAND ;
sqlsentence : SELECT fields tableName ;
fields : tableName (COMMA tableName)* FROM ;
create : tableName (COMMA tableName)+ ;
tableName : WS* NAME (ALIAS NAME)? ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NAME : LETTER ( LETTER |DIGIT | '-' | '_' )* ;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
SELECT : ('SELECT ' |'select ' ) ;
FROM : (' FROM '|' from ') ;
ALIAS : ( ' AS ' |' as ' ) ;
WHITESPACE : ( '\r' | '\n' | '\t' | WS | '\u000C' )+ { $channel = HIDDEN; } ;
That is because your main method invokes parser.request(), whereas in the debugger you chose the process rule as the starting point. Since request consumes only a single (sqlsentence | create) END_COMMAND from your input and then returns, the parser never reaches the erroneous input further down and reports no error.
Change the main method into:
@members{
public static void main(String[] args) throws Exception {
UserRequestLexer lex = new UserRequestLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
UserRequestParser parser = new UserRequestParser(tokens);
try {
parser.process();
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
and you'll see the same errors on the console since process forces the parser to consume the entire input, all the way to EOF.

Antlr Array Help

Hey, I've started to use ANTLR with Java and I wanted to know how I can store some values directly into a 2D array and return this array. I can't find any tutorials on this at all; all help is appreciated.
Let's say you want to parse a flat text file containing numbers separated by spaces. You'd like to parse this into a 2d array of int's where each line is a "row" in your array.
The ANTLR grammar for such a "language" could look like:
grammar Number;
parse
: line* EOF
;
line
: Number+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
Now, you'd like to have the parse rule return a List of List<Integer> objects. Do that by adding returns [List<List<Integer>> numbers] after your parse rule, which can be initialized in an @init{ ... } block:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: line* EOF
;
Your line rule looks similar, only it returns a one-dimensional list of numbers:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: Number+ (LineBreak | EOF)
;
The next step is to fill the Lists with the actual values that are being parsed. This can be done by embedding the code {$row.add(Integer.parseInt($Number.text));} inside the Number+ loop in your line rule:
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
Lastly, add each List returned by your line rule to the 2D numbers list in your parse rule:
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
Below is the final grammar:
grammar Number;
parse returns [List<List<Integer>> numbers]
@init {
$numbers = new ArrayList<List<Integer>>();
}
: (line {$numbers.add($line.row);})* EOF
;
line returns [List<Integer> row]
@init {
$row = new ArrayList<Integer>();
}
: (Number {$row.add(Integer.parseInt($Number.text));})+ (LineBreak | EOF)
;
Number
: ('0'..'9')+
;
Space
: (' ' | '\t') {skip();}
;
LineBreak
: '\r'? '\n'
| '\r'
;
which can be tested with the following class:
import org.antlr.runtime.*;
import java.util.List;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"1 2 \n" +
"3 4 5 6 7 \n" +
" 8 \n" +
"9 10 11 ";
ANTLRStringStream in = new ANTLRStringStream(source);
NumberLexer lexer = new NumberLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
NumberParser parser = new NumberParser(tokens);
List<List<Integer>> numbers = parser.parse();
System.out.println(numbers);
}
}
Now generate a lexer and parser from the grammar:
java -cp antlr-3.2.jar org.antlr.Tool Number.g
compile all .java source files:
javac -cp antlr-3.2.jar *.java
and run the main class:
// On *nix
java -cp .:antlr-3.2.jar Main
// or Windows
java -cp .;antlr-3.2.jar Main
which produces the following output:
[[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
HTH
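Since the question asks for a 2D array rather than nested lists, the List<List<Integer>> returned by parse() can be converted afterwards. The sketch below shows one way to do that in plain Java; JaggedArrayDemo and toArray are hypothetical names, and the sample data in main simply repeats the output shown above instead of calling the generated parser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts nested lists into a jagged int[][] (rows may differ in length).
class JaggedArrayDemo {
    static int[][] toArray(List<List<Integer>> rows) {
        int[][] result = new int[rows.size()][];
        for (int r = 0; r < rows.size(); r++) {
            List<Integer> row = rows.get(r);
            result[r] = new int[row.size()];
            for (int c = 0; c < row.size(); c++) {
                result[r][c] = row.get(c); // auto-unboxing Integer -> int
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by parser.parse() in the test class above.
        List<List<Integer>> numbers = Arrays.asList(
                Arrays.asList(1, 2),
                Arrays.asList(3, 4, 5, 6, 7),
                Arrays.asList(8),
                Arrays.asList(9, 10, 11));
        System.out.println(Arrays.deepToString(toArray(numbers)));
    }
}
```

Collecting into lists first and converting at the end avoids having to know the number of rows and columns before parsing starts.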
Here are some excerpts from a grammar I made that parses people's names and returns a Name object. They should be enough to show you how it works. Other objects, such as arrays, are handled the same way.
In the grammar:
grammar PersonNames;
fullname returns [Name name]
@init {
name = new Name();
}
: (directory_style[name] | standard[name] | title_without_fname[name] | family_style[name] | proper_initials[name]) EOF;
standard[Name name]
: (title[name] ' ')* fname[name] ' ' (mname[name] ' ')* (nickname[name] ' ')? lname[name] (sep honorifics[name])*;
fname[Name name] : (f=NAME | f=INITIAL) { name.set(Name.Part.FIRST, toNameCase($f.text)); };
in your regular Java code
public static Name parseName(String str) throws RecognitionException {
System.err.println("parsing `" + str + "`");
CharStream stream = new ANTLRStringStream(str);
PersonNamesLexer lexer = new PersonNamesLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PersonNamesParser parser = new PersonNamesParser(tokens);
return parser.fullname();
}

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
@members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
@init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /\|\s*([a-z]+)\s*/g
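The pseudo-code above could look roughly like this in Java. It is a sketch under a few assumptions that the question does not confirm: lines starting with '|' hold data, every other line is a comment, fields are trimmed, and empty fields are dropped. PipeFileDemo and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical line-by-line reader for a '|'-delimited file; no ANTLR involved.
class PipeFileDemo {
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n|\\r")) {
            if (!line.startsWith("|")) {
                continue; // anything not starting with '|' is treated as a comment
            }
            List<String> fields = new ArrayList<>();
            // Drop the leading pipe, then split on the remaining pipes.
            for (String field : line.substring(1).split("\\|")) {
                String trimmed = field.trim();
                if (!trimmed.isEmpty()) { // assumption: empty fields are not wanted
                    fields.add(trimmed);
                }
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data =
                "a comment\n" +
                "| xxxxx | y | zzz |\n" +
                "another comment\n" +
                "| a | abc | b | def |";
        for (List<String> row : parse(data)) {
            System.out.println(row);
        }
    }
}
```

For the sample data this prints [xxxxx, y, zzz] and [a, abc, b, def], one row per line. Unlike the regular expression in the pseudo-code, this version does not restrict fields to lowercase letters.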
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields: (BAR NONBAR)+;
This seems to work, though I swear I had tried it before. Changing comment to lower case turned it from a lexer rule into a parser rule. I still don't get it.
grammar Route;
@header {
package org.benheath.codegeneration;
}
@lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'@'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace