ANTLR: How to skip multiline comments - antlr

Given the following lexer:
lexer grammar CodeTableLexer;
#header {
package ch.bsource.ice.parsers;
}
CodeTabHeader : OBracket Code ' ' Table ' ' Version CBracket;
CodeTable : Code ' '* Table;
EndCodeTable : 'end' ' '* Code ' '* Table;
Code : 'code';
Table : 'table';
Version : '1.0';
Row : 'row';
Tabdef : 'tabdef';
Override : 'override' | 'no_override';
Obsolete : 'obsolete';
Substitute : 'substitute';
Status : 'activ' | 'inactive';
Pkg : 'include_pkg' | 'exclude_pkg';
Ddic : 'include_ddic' | 'exclude_ddic';
Tab : 'tab';
Naming : 'naming';
Dfltlang : 'dfltlang';
Language : 'english' | 'german' | 'french' | 'italian' | 'spanish';
Null : 'null';
Comma : ',';
OBracket : '[';
CBracket : ']';
Boolean
: 'true'
| 'false'
;
Number
: Int* ('.' Digit*)?
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '$' | '#' | '.' | Digit)*
;
String
#after {
setText(getText().substring(1, getText().length() - 1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"'))* '"'
;
Comment
: '--' ~('\r' | '\n')* { skip(); }
| '/*' .* '*/' { skip(); }
;
Space
: (' ' | '\t') { skip(); }
;
NewLine
: ('\r' | '\n' | '\u000C') { skip(); }
;
fragment Int
: '1'..'9'
| '0'
;
fragment Digit
: '0'..'9'
;
... and the following parser:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
#header {
package ch.bsource.ice.parsers;
}
parse
: block EOF
;
block
: CodeTabHeader^ codeTable endCodeTable
;
codeTable
: CodeTable^ codeTableData
;
codeTableData
: (Identifier^ obsolete?) (tabdef | row)*
;
endCodeTable
: EndCodeTable
;
tabdef
: Tabdef^ Identifier+
;
row
: Row^ rowData
;
rowData
: (Number^ | (Identifier^ (Comma Number)?))
Override?
obsolete?
status?
Pkg?
Ddic?
(tab | field)*
;
tab
: Tab^ value+
;
field
: (Identifier^ value) | naming
;
value
: OBracket? (Identifier | String | Number | Boolean | Null) CBracket?
;
naming
: Naming^ defaultNaming (l10nNaming)*
;
defaultNaming
: Dfltlang^ String
;
l10nNaming
: Language^ String?
;
obsolete
: Obsolete^ Substitute String
;
status
: Status^ Override?
;
... finally my class for making the parser case-insensitive:
package ch.bsource.ice.parsers;
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + 1 - 1) >= n) return CharStream.EOF
return Character.toLowerCase(data[p + 1 - 1]);
}
}
... single-line comments are skipped as expected, while multi-line comments aren't... here is the error message I get:
codetable_1.txt line 38:0 mismatched character '<EOF>' expecting '*'
codetable_1.txt line 38:0 mismatched input '<EOF>' expecting EndCodeTable
java.lang.NullPointerException
...
Am I missing something? Is there anything I should be aware of? I'm using antlr 3.4.
Here is also the example source code I'm trying to parse:
[code table 1.0]
/*
This is a multi-line comment
*/
code table my_table
-- this is a single-line comment
row 1
id "my_id_1"
name "my_name_1"
descn "my_description_1"
naming
dfltlang "My description 1"
english "My description 1"
german "Meine Beschreibung 1"
-- this is another single-line comment
row 2
id "my_id_2"
name "my_name_2"
descn "my_description_2"
naming
dfltlang "My description 2"
english "My description 2"
german "Meine Beschreibung 2"
end code table
Any help would be really appreciated :-)
Thanks,
j3d

To do this in antlr4
BlockComment
: '/*' .*? '*/' -> skip
;

Bart gave me an amazing support and I think we all really appreciate him :-)
Anyway, the problem was a bug in the FileStream class I use to convert parsed char stream to lowercase. Here below is the correct Java source code:
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + i - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + i - 1]);
}
}

I use 2 rules that I use to skip line and block comments (I print them during parsing for debug purposes). They are split in 2 for better readability, and the block comment does support nested comments.
Also, I do not skip EOL chars (\r and / or \n) in my grammar because I need them explicitly for some rules.
LineComment
: '//' ~('\n'|'\r')* //NEWLINE
{System.out.println("lc > " + getText());
skip();}
;
BlockComment
#init { int depthOfComments = 0;}
: '/*' {depthOfComments++;}
( options {greedy=false;}
: ('/' '*')=> BlockComment {depthOfComments++;}
| '/' ~('*')
| ~('/')
)*
'*/' {depthOfComments--;}
{
if (depthOfComments == 0) {
System.out.println("bc >" + getText());
skip();
}
}
;

Related

Using Antlr to parse formulas with multiple locales

I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is add a locale (or custom separator- and grouping-characters) to your lexer, and add a semantic predicate before the lexer rule that inspects your custom separator- and grouping-characters and match these tokens dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
#lexer::header {
import java.text.DecimalFormatSymbols;
import java.util.Locale;
}
#lexer::members {
private char decimalSeparator = '.';
private char groupingSeparator = ',';
public LocaleDemoLexer(CharStream input, Locale locale) {
this(input);
DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
this.decimalSeparator = dfs.getDecimalSeparator();
this.groupingSeparator = dfs.getGroupingSeparator();
}
}
parse
: .*? EOF
;
NUMBER
: D D? ( DG D D D )* ( DS D+ )?
;
OTHER
: .
;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
private static void tokenize(String input, Locale locale) {
LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
for (Token t : lexer.getAllTokens()) {
System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
public static void main(String[] args) throws Exception {
tokenize("1.23", Locale.ENGLISH);
tokenize("1.23", Locale.GERMAN);
tokenize("12.345.678,90", Locale.ENGLISH);
tokenize("12.345.678,90", Locale.GERMAN);
}
}
which would print:
input='1.23', locale=en, tokens:
NUMBER '1.23'
input='1.23', locale=de, tokens:
NUMBER '1'
OTHER '.'
NUMBER '23'
input='12.345.678,90', locale=en, tokens:
NUMBER '12.345'
OTHER '.'
NUMBER '67'
NUMBER '8'
OTHER ','
NUMBER '90'
input='12.345.678,90', locale=de, tokens:
NUMBER '12.345.678,90'
Related Q&A's:
What is a 'semantic predicate' in ANTLR?
What does "fragment" mean in ANTLR?
As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
#lexer::header
{
using System;
using System.Globalization;
}
#lexer::members
{
private Int32 listseparator = 44; // UTF16 value for comma
private Int32 decimalseparator = 46; // UTF16 value for period
/// <summary>
/// Creates a new lexer object
/// </summary>
/// <param name="input">The input stream</param>
/// <param name="locale">The locale to use in parsing numbers</param>
/// <returns>A new lexer object</returns>
public ExcelScriptLexer (ICharStream input, CultureInfo locale)
: this(input)
{
this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
// special case for 8 locales where the list separator is a , and the number separator is a , too
// Excel uses semicolon for list separator, so we will too
if (this.listseparator == 44 && this.decimalseparator == 44)
this.listseparator = 59; // UTF16 value for semicolon
}
}
/*
* Parser Rules
*/
formula
: numberLiteral
| Identifier
| '=' expression
;
expression
: primary # PrimaryExpression
| Identifier arguments # FunctionCallExpression
| ('+' | '-') expression # UnarySignExpression
| expression ('*' | '/' | '%') expression # MulDivModExpression
| expression ('+' | '-') expression # AddSubExpression
| expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
| expression ('=' | '<>') expression # EqualCompareExpression
;
primary
: '(' expression ')' # ParenExpression
| literal # LiteralExpression
| Identifier # IdentifierExpression
;
literal
: numberLiteral # NumberLiteralRule
| booleanLiteral # BooleanLiteralRule
;
numberLiteral
: IntegerLiteral
| FloatingPointLiteral
;
booleanLiteral
: TrueKeyword
| FalseKeyword
;
arguments
: '(' expressionList? ')'
;
expressionList
: expression (ListSeparator expression)*
;
/*
* Lexer Rules
*/
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier
: Letter (Letter | Digit)*
;
fragment Letter
: [A-Z_a-z]
;
fragment Digit
: [0-9]
;
IntegerLiteral
: '0'
| [1-9] [0-9]*
;
FloatingPointLiteral
: [0-9]+ DecimalSeparator [0-9]* Exponent?
| DecimalSeparator [0-9]+ Exponent?
| [0-9]+ Exponent
;
fragment Exponent
: ('e' | 'E') ('+' | '-')? ('0'..'9')+
;
WhiteSpace
: [ \t]+ -> channel(HIDDEN)
;

How to get rid of " around my string in ANTLR?

When I get the token with these rules
STRINGA : '"' (options {greedy=false;}: ESC | .)* '"';
STRINGB : '\'' (options {greedy=false;}: ESC | .)* '\'';
it ends up grabbing 'text' instead of just text. I can easily remove the ' and ' myself but was wondering how I can get ANTLR to remove it?
You will need some custom code for that. Also, you shouldn't be using a . (dot) inside the rule: you should explicitly define you want to match everything except a backslash (assuming that is what your ESQ starts with), a quote and line break chars probably.
Something like this would do it:
grammar T;
parse
: STRING EOF {System.out.println($STRING.text);}
;
STRING
: '"' (ESQ | ~('"' | '\\' | '\r' | '\n'))* '"'
{
String matched = getText();
StringBuilder builder = new StringBuilder();
for(int i = 1; i < matched.length() - 1; i++) {
char ch = matched.charAt(i);
if(ch == '\\') {
i++;
ch = matched.charAt(i);
switch(ch) {
case 'n': builder.append('\n'); break;
case 't': builder.append('\t'); break;
default: builder.append(ch); break;
}
}
else {
builder.append(ch);
}
}
setText(builder.toString());
}
;
fragment ESQ
: '\\' ('n' | 't' | '"' | '\\')
;
If you now parse the input "tabs:'\t\t\t'\nquote:\"\nbackslash:\\", the following will be printed to the console:
tabs:' '
quote:"
backslash:\
To keep the grammar clean, you could of course move the code in a custom method:
grammar T;
#lexer::members {
private String fix(String str) {
...
}
}
parse
: STRING EOF {System.out.println($STRING.text);}
;
STRING
: '"' (ESQ | ~('"' | '\\' | '\r' | '\n'))* '"' {setText(fix(getText()));}
;
fragment ESQ
: '\\' ('n' | 't' | '"' | '\\')
;
One approach is to define the string contents as a separate category, for example
STRINGA : '"' STRINGCONTENTS '"';
STRINGB : '\'' STRINGCONTENTS '\'';
then capture the STRINGCONTENTS value.

ASN.1/SMI comment syntax on ANTLR 3

On ANTLR 2, the comment syntax is like this,
// Single-line comments
SL_COMMENT
: (options {warnWhenFollowAmbig=false;}
: '--'( { LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') { newline(); }| '--') )
{$setType(Token.SKIP); }
;
However, when porting this to ANTLR 3,
SL_COMMENT
: (
: '--'( { input.LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | '--') )
{$channel = HIDDEN;}
;
because there is no more options {warnWhenFollowAmbig=false;}, the following comment cannot be parsed correctly,
-- some comment -- some not comment
Then, what is the possible way to define this SL_COMMENT rule for ANTLR 3?
Personally, I like to keep grammar rules as "empty" as possible. In this case, I would create a lexer method that returns true if the next two characters in the input are "--". As long as this is not the case, match any character other than \r and \n, and repeat that zero or more times until an optional "--" is encountered. Note that I didn't put a new line at the end because there is not necessarily a new line at the end (it could also be a EOF). Besides, \r and \n will likely be matched by a SPACE rule which is put on the HIDDEN channel: so there's no harm in doing it as I suggest.
A demo:
...
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
...
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
...
And if you don't like the lexer members-block, you simply do:
SL_COMMENT
: '--' ({!(input.LA(1) == '-' && input.LA(2) == '-')}?=> ~('\r' | '\n'))* '--'?
;
EDIT
A small, complete demo:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
String source = "12 - 34 -- foo - bar -- 42 \n - - 5678 -- more comments 666\n--\n--";
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
parse
: (t=. {System.out.printf("\%-15s\%s\n", tokenNames[$t.type], $t.text);})* EOF
;
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
MINUS
: '-'
;
INT
: '0'..'9'+
;
SPACE
: (' ' | '\t' | '\r' | '\n') {skip();}
;
which, after parsing the input:
12 - 34 -- foo - bar -- 42
- - 5678 -- more comments 666
will print:
INT 12
MINUS -
INT 34
SL_COMMENT -- foo - bar --
INT 42
MINUS -
MINUS -
INT 5678
SL_COMMENT -- more comments 666
SL_COMMENT --
SL_COMMENT --
I came across a solution finally,
SL_COMMENT
: COMMENT ( ({input.LA(2) != '-'}? '-') => '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | COMMENT)
{ $channel = HIDDEN; }
;

define a grammar in Antlr

I have defined the following grammar.
grammar Sample_1;
#header {
package a;
}
#lexer::header {
package a;
}
program
:
define*
implement*
;
define
: IDENT '=(' INTEGER',' INTEGER ')'
;
implement
:IDENT '=(' (IDENT ','?)* ')'
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
IDENT : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};
How to check in this grammar so that when I have the example
A=(1,1)
B=(1,2)
G=(A,B)
the result is successful but if I write
A=(1,1)
B=(1,2)
G=(A,E)
it gives an error that E is not defined
thanks
the result:
i got it working thanks a lot:
grammar Sample_1;
#members{
int level=0;
}
#header {
package a;
}
#lexer::header {
package a;
}
program
:
block
;
block
scope {
List symbols;
}
#init {
$block::symbols=new ArrayList();
level++;
}
#after {
System.err.println("Hello");
level--;
}
: (define* implement+)
;
define
: IDENT {$block::symbols.add($IDENT.text);} '=(' INTEGER',' INTEGER ')'
;
implement
:IDENT '=(' (a=IDENT
{if (!$block::symbols.contains($a.text)){
System.err.println("undefined");
}}','?)* ')'
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
IDENT : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};
Antlr supports actions, little snippets of code embedded in the grammar file.
An action for an assignment could store into a map. An action for a right-hand-side IDENT could try to pull a value from the map, and throw an exception if it fails.
Chapter 6 in Terrence Parr's "The Definitive ANTLR Reference" covers actions.

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
#members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
#init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace