When I get the token with these rules
STRINGA : '"' (options {greedy=false;}: ESC | .)* '"';
STRINGB : '\'' (options {greedy=false;}: ESC | .)* '\'';
it ends up grabbing 'text' instead of just text. I can easily remove the ' and ' myself but was wondering how I can get ANTLR to remove it?
You will need some custom code for that. Also, you shouldn't use a . (dot) inside the rule: instead, explicitly match everything except a backslash (assuming that is what your ESC rule starts with), a quote, and probably line-break characters.
Something like this would do it:
grammar T;
parse
: STRING EOF {System.out.println($STRING.text);}
;
STRING
: '"' (ESQ | ~('"' | '\\' | '\r' | '\n'))* '"'
{
String matched = getText();
StringBuilder builder = new StringBuilder();
for(int i = 1; i < matched.length() - 1; i++) {
char ch = matched.charAt(i);
if(ch == '\\') {
i++;
ch = matched.charAt(i);
switch(ch) {
case 'n': builder.append('\n'); break;
case 't': builder.append('\t'); break;
default: builder.append(ch); break;
}
}
else {
builder.append(ch);
}
}
setText(builder.toString());
}
;
fragment ESQ
: '\\' ('n' | 't' | '"' | '\\')
;
If you now parse the input "tabs:'\t\t\t'\nquote:\"\nbackslash:\\", the following will be printed to the console:
tabs:' '
quote:"
backslash:\
To keep the grammar clean, you could of course move the code into a custom method:
grammar T;
@lexer::members {
private String fix(String str) {
...
}
}
parse
: STRING EOF {System.out.println($STRING.text);}
;
STRING
: '"' (ESQ | ~('"' | '\\' | '\r' | '\n'))* '"' {setText(fix(getText()));}
;
fragment ESQ
: '\\' ('n' | 't' | '"' | '\\')
;
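For illustration, here is a minimal standalone sketch of what such a helper could look like outside the grammar. The class name StringUnescaper is hypothetical, and the body simply mirrors the inline action from the first grammar; it is not necessarily what the elided method contained.

```java
public final class StringUnescaper {
    private StringUnescaper() {}

    // Strips the surrounding quotes and resolves the escapes \n, \t, \" and \\.
    // Assumes 'matched' is the raw token text, including both quotes.
    public static String fix(String matched) {
        StringBuilder builder = new StringBuilder();
        // Start after the opening quote and stop before the closing quote.
        for (int i = 1; i < matched.length() - 1; i++) {
            char ch = matched.charAt(i);
            // The bounds check is defensive: the lexer rule already guarantees
            // that a backslash is followed by an escape character.
            if (ch == '\\' && i + 1 < matched.length() - 1) {
                ch = matched.charAt(++i);
                switch (ch) {
                    case 'n': builder.append('\n'); break;
                    case 't': builder.append('\t'); break;
                    default:  builder.append(ch);   break; // \" and \\ map to themselves
                }
            } else {
                builder.append(ch);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Raw token text, quotes included: "tabs:'\t'\nquote:\""
        System.out.println(fix("\"tabs:'\\t'\\nquote:\\\"\""));
    }
}
```

Keeping the logic in a plain static method like this also makes it easy to unit test without generating a lexer.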
One approach is to define the string contents as a separate category, for example
STRINGA : '"' STRINGCONTENTS '"';
STRINGB : '\'' STRINGCONTENTS '\'';
then capture the STRINGCONTENTS value.
I am writing a parser for a scripting language, using ANTLR 4.5.3.
grammar VSE;
chunk
: block* EOF
;
block
: var '=' exp
| functioncall
;
var
: NAME
| var '[' exp ']'
| var '.' var
;
exp
: number
| string
| var
| functioncall
| <assoc=right> exp exp //concat
;
functioncall
: NAME '(' (exp)? (',' exp)* ')'
| var '.' functioncall
;
string
: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"'
;
NAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
number
: INT | HEX | FLOAT
;
INT
: Digit+
;
HEX
: '0' [xX] [0-9a-fA-F]+
;
FLOAT
: Digit* '.' Digit+
;
Digit
: [0-9]
;
WS
: [ \t\u000C\r\n]+ -> skip
;
However, while testing it, I found that a variable assignment like var = something followed by a function call on the next line is parsed as a concat statement. (My concat statement is one expression followed by another, like var = var1 var2.) I understand that ANTLR skips all newlines, which allows line continuation, but I'd like to add the condition that if there is a newline between two exps, they are treated as two separate blocks instead of a concat statement, i.e.
var = var2
functioncall(var)
These should be two separate blocks instead of a concat statement.
Is there any way to do this?
Is the following rule suitable for you?
block
: var '=' exp NEW_LINE
| functioncall NEW_LINE
;
NEW_LINE: '\r'? '\n';
WS
: [ \t]+ -> skip
;
Otherwise you would have to use semantic predicates or a much less readable grammar.
Given the following lexer:
lexer grammar CodeTableLexer;
@header {
package ch.bsource.ice.parsers;
}
CodeTabHeader : OBracket Code ' ' Table ' ' Version CBracket;
CodeTable : Code ' '* Table;
EndCodeTable : 'end' ' '* Code ' '* Table;
Code : 'code';
Table : 'table';
Version : '1.0';
Row : 'row';
Tabdef : 'tabdef';
Override : 'override' | 'no_override';
Obsolete : 'obsolete';
Substitute : 'substitute';
Status : 'activ' | 'inactive';
Pkg : 'include_pkg' | 'exclude_pkg';
Ddic : 'include_ddic' | 'exclude_ddic';
Tab : 'tab';
Naming : 'naming';
Dfltlang : 'dfltlang';
Language : 'english' | 'german' | 'french' | 'italian' | 'spanish';
Null : 'null';
Comma : ',';
OBracket : '[';
CBracket : ']';
Boolean
: 'true'
| 'false'
;
Number
: Int* ('.' Digit*)?
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '$' | '#' | '.' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length() - 1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"'))* '"'
;
Comment
: '--' ~('\r' | '\n')* { skip(); }
| '/*' .* '*/' { skip(); }
;
Space
: (' ' | '\t') { skip(); }
;
NewLine
: ('\r' | '\n' | '\u000C') { skip(); }
;
fragment Int
: '1'..'9'
| '0'
;
fragment Digit
: '0'..'9'
;
... and the following parser:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
@header {
package ch.bsource.ice.parsers;
}
parse
: block EOF
;
block
: CodeTabHeader^ codeTable endCodeTable
;
codeTable
: CodeTable^ codeTableData
;
codeTableData
: (Identifier^ obsolete?) (tabdef | row)*
;
endCodeTable
: EndCodeTable
;
tabdef
: Tabdef^ Identifier+
;
row
: Row^ rowData
;
rowData
: (Number^ | (Identifier^ (Comma Number)?))
Override?
obsolete?
status?
Pkg?
Ddic?
(tab | field)*
;
tab
: Tab^ value+
;
field
: (Identifier^ value) | naming
;
value
: OBracket? (Identifier | String | Number | Boolean | Null) CBracket?
;
naming
: Naming^ defaultNaming (l10nNaming)*
;
defaultNaming
: Dfltlang^ String
;
l10nNaming
: Language^ String?
;
obsolete
: Obsolete^ Substitute String
;
status
: Status^ Override?
;
... finally my class for making the parser case-insensitive:
package ch.bsource.ice.parsers;
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + 1 - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + 1 - 1]);
}
}
... single-line comments are skipped as expected, while multi-line comments aren't... here is the error message I get:
codetable_1.txt line 38:0 mismatched character '<EOF>' expecting '*'
codetable_1.txt line 38:0 mismatched input '<EOF>' expecting EndCodeTable
java.lang.NullPointerException
...
Am I missing something? Is there anything I should be aware of? I'm using antlr 3.4.
Here is also the example source code I'm trying to parse:
[code table 1.0]
/*
This is a multi-line comment
*/
code table my_table
-- this is a single-line comment
row 1
id "my_id_1"
name "my_name_1"
descn "my_description_1"
naming
dfltlang "My description 1"
english "My description 1"
german "Meine Beschreibung 1"
-- this is another single-line comment
row 2
id "my_id_2"
name "my_name_2"
descn "my_description_2"
naming
dfltlang "My description 2"
english "My description 2"
german "Meine Beschreibung 2"
end code table
Any help would be really appreciated :-)
Thanks,
j3d
To do this in ANTLR 4, use a non-greedy match and the skip command:
BlockComment
: '/*' .*? '*/' -> skip
;
Bart gave me amazing support and I think we all really appreciate him :-)
Anyway, the problem was a bug in the FileStream class I use to convert the parsed char stream to lowercase: LA(int i) ignored its argument i when computing the index. Here is the corrected Java source code:
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, encoding);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + i - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + i - 1]);
}
}
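To see why the index matters, here is a small self-contained sketch of the same lookahead logic without the ANTLR runtime. The class LowerCaseLookahead and its fields are hypothetical stand-ins for the data, p and n fields used above, and the extra bounds check for negative i is an addition of mine. With the buggy index p + 1 - 1, every call would return the character at p regardless of i, so the lexer could never see the '/' that follows a '*', which would explain the "expecting '*'" error on block comments.

```java
public final class LowerCaseLookahead {
    public static final int EOF = -1;

    private final char[] data; // the whole input
    private final int n;       // number of characters in the input
    private int p = 0;         // index of the next character to consume

    public LowerCaseLookahead(String text) {
        this.data = text.toCharArray();
        this.n = data.length;
    }

    public void consume() {
        if (p < n) p++;
    }

    // la(1) is the next character, la(2) the one after it, la(-1) the previous one.
    public int la(int i) {
        if (i == 0) return 0;              // undefined, same convention as above
        if (i < 0) {
            i++;                           // so that la(-1) maps to data[p - 1]
            if (p + i - 1 < 0) return EOF; // nothing before the start of the input
        }
        if (p + i - 1 >= n) return EOF;
        return Character.toLowerCase(data[p + i - 1]);
    }

    public static void main(String[] args) {
        LowerCaseLookahead in = new LowerCaseLookahead("/*X");
        // The three lookahead positions differ because i is part of the index.
        System.out.println((char) in.la(1) + " " + (char) in.la(2) + " " + (char) in.la(3));
    }
}
```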
I use two rules to skip line and block comments (I print them during parsing for debugging purposes). They are split into two rules for better readability, and the block comment rule supports nested comments.
Also, I do not skip EOL chars (\r and/or \n) in my grammar because I need them explicitly for some rules.
LineComment
: '//' ~('\n'|'\r')* //NEWLINE
{System.out.println("lc > " + getText());
skip();}
;
BlockComment
@init { int depthOfComments = 0;}
: '/*' {depthOfComments++;}
( options {greedy=false;}
: ('/' '*')=> BlockComment {depthOfComments++;}
| '/' ~('*')
| ~('/')
)*
'*/' {depthOfComments--;}
{
if (depthOfComments == 0) {
System.out.println("bc >" + getText());
skip();
}
}
;
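The depth-counting idea behind nested comments can also be sketched in plain Java, independent of ANTLR. This is only an illustration under my own assumptions: the class and method names are hypothetical, and it uses an explicit counter where the rule above relies on recursion.

```java
public final class NestedComments {
    private NestedComments() {}

    // Returns the index just past the terminator that closes the comment
    // opening at 'start', or -1 if the comment is never closed.
    public static int skipBlockComment(String s, int start) {
        if (!s.startsWith("/*", start)) {
            throw new IllegalArgumentException("no block comment at index " + start);
        }
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            if (s.startsWith("/*", i)) {        // one level deeper
                depth++;
                i += 2;
            } else if (s.startsWith("*/", i)) { // one level back up
                depth--;
                i += 2;
                if (depth == 0) return i;       // the outermost comment is closed
            } else {
                i++;                            // ordinary comment character
            }
        }
        return -1;                              // reached the end while still nested
    }

    public static void main(String[] args) {
        String source = "/* outer /* inner */ still outer */rest";
        int end = skipBlockComment(source, 0);
        System.out.println(source.substring(end));
    }
}
```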
On ANTLR 2, the comment syntax is like this,
// Single-line comments
SL_COMMENT
: (options {warnWhenFollowAmbig=false;}
: '--'( { LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') { newline(); }| '--') )
{$setType(Token.SKIP); }
;
However, when porting this to ANTLR 3,
SL_COMMENT
: (
: '--'( { input.LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | '--') )
{$channel = HIDDEN;}
;
because there is no more options {warnWhenFollowAmbig=false;}, the following comment cannot be parsed correctly,
-- some comment -- some not comment
Then, what is the possible way to define this SL_COMMENT rule for ANTLR 3?
Personally, I like to keep grammar rules as "empty" as possible. In this case, I would create a lexer method that returns true if the next two characters in the input are "--". As long as this is not the case, match any character other than \r and \n, and repeat that zero or more times until an optional "--" is encountered. Note that I didn't put a newline at the end of the rule because a comment does not necessarily end with a newline (it could also end at EOF). Besides, \r and \n will likely be matched by a SPACE rule that is put on the HIDDEN channel, so there's no harm in doing it as I suggest.
A demo:
...
@lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
...
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
...
And if you don't like the lexer members block, you can simply inline the check:
SL_COMMENT
: '--' ({!(input.LA(1) == '-' && input.LA(2) == '-')}?=> ~('\r' | '\n'))* '--'?
;
EDIT
A small, complete demo:
grammar T;
@parser::members {
public static void main(String[] args) throws Exception {
String source = "12 - 34 -- foo - bar -- 42 \n - - 5678 -- more comments 666\n--\n--";
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
@lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
parse
: (t=. {System.out.printf("\%-15s\%s\n", tokenNames[$t.type], $t.text);})* EOF
;
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
MINUS
: '-'
;
INT
: '0'..'9'+
;
SPACE
: (' ' | '\t' | '\r' | '\n') {skip();}
;
which, after parsing the input:
12 - 34 -- foo - bar -- 42
- - 5678 -- more comments 666
will print:
INT            12
MINUS          -
INT            34
SL_COMMENT     -- foo - bar --
INT            42
MINUS          -
MINUS          -
INT            5678
SL_COMMENT     -- more comments 666
SL_COMMENT     --
SL_COMMENT     --
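The matching logic of the SL_COMMENT rule can be sketched as a hand-written scanner, which may make the role of the predicate easier to follow. This is a rough plain-Java equivalent under my own assumptions (the class DashComments and the method scanComment are hypothetical), not code generated by ANTLR.

```java
public final class DashComments {
    private DashComments() {}

    // Same check as the lexer member: are the next two characters "--"?
    static boolean endCommentAhead(String s, int pos) {
        return s.startsWith("--", pos);
    }

    // Returns the end index (exclusive) of the comment starting at 'start'.
    public static int scanComment(String s, int start) {
        if (!s.startsWith("--", start)) {
            throw new IllegalArgumentException("no comment at index " + start);
        }
        int i = start + 2; // skip the opening "--"
        // Consume characters until a closing "--", a line break or the end of input.
        while (i < s.length()
                && !endCommentAhead(s, i)
                && s.charAt(i) != '\r'
                && s.charAt(i) != '\n') {
            i++;
        }
        if (endCommentAhead(s, i)) {
            i += 2;        // the optional closing "--" belongs to the comment
        }
        return i;
    }

    public static void main(String[] args) {
        String line = "34 -- foo - bar -- 42";
        int start = line.indexOf("--");
        System.out.println(line.substring(start, scanComment(line, start)));
    }
}
```

A single '-' inside the comment is consumed like any other character because the check only fires when two dashes appear in a row.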
I came across a solution finally,
SL_COMMENT
: COMMENT ( ({input.LA(2) != '-'}? '-') => '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | COMMENT)
{ $channel = HIDDEN; }
;
I have defined the following grammar.
grammar Sample_1;
@header {
package a;
}
@lexer::header {
package a;
}
program
:
define*
implement*
;
define
: IDENT '=(' INTEGER',' INTEGER ')'
;
implement
:IDENT '=(' (IDENT ','?)* ')'
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
IDENT : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};
How can I add a check to this grammar so that when I have the example
A=(1,1)
B=(1,2)
G=(A,B)
the result is successful but if I write
A=(1,1)
B=(1,2)
G=(A,E)
it reports an error that E is not defined?
thanks
The result:
I got it working, thanks a lot:
grammar Sample_1;
@members{
int level=0;
}
@header {
package a;
}
@lexer::header {
package a;
}
program
:
block
;
block
scope {
List symbols;
}
@init {
$block::symbols=new ArrayList();
level++;
}
@after {
System.err.println("Hello");
level--;
}
: (define* implement+)
;
define
: IDENT {$block::symbols.add($IDENT.text);} '=(' INTEGER',' INTEGER ')'
;
implement
:IDENT '=(' (a=IDENT
{if (!$block::symbols.contains($a.text)){
System.err.println("undefined");
}}','?)* ')'
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+ ;
IDENT : LETTER (LETTER | DIGIT)*;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
COMMENT : '//' .* ('\n'|'\r') {$channel = HIDDEN;};
ANTLR supports actions, little snippets of code embedded in the grammar file.
An action for an assignment could store into a map. An action for a right-hand-side IDENT could try to pull a value from the map, and throw an exception if it fails.
Chapter 6 in Terence Parr's "The Definitive ANTLR Reference" covers actions.
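As a rough sketch of the map idea, the helper below could be called from such actions. The class SymbolTable, its method names and the choice of IllegalStateException are all assumptions made for illustration; none of them come from ANTLR itself.

```java
import java.util.HashMap;
import java.util.Map;

public final class SymbolTable {
    // Maps each defined name to its pair of integers, e.g. A -> (1, 1).
    private final Map<String, int[]> symbols = new HashMap<>();

    // Would be called from the action of a 'define' rule.
    public void define(String name, int first, int second) {
        symbols.put(name, new int[] {first, second});
    }

    // Would be called for every IDENT on the right-hand side of 'implement'.
    public int[] resolve(String name) {
        int[] value = symbols.get(name);
        if (value == null) {
            throw new IllegalStateException(name + " is not defined");
        }
        return value;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.define("A", 1, 1);
        table.define("B", 1, 2);
        table.resolve("A");     // fine, A was defined
        try {
            table.resolve("E"); // E was never defined
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

In the grammar, a define action would then call define with the matched IDENT and INTEGER texts, and an implement action would call resolve for each IDENT on the right-hand side.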
For any grammar I create in ANTLR, is it possible for the result of parsing to eliminate any extra spaces in the input? For example, take this simple statement:
int x=5;
if I write
int x = 5 ;
I would like the text to be changed to int x=5; without the extra spaces. Can the parser return the original text without extra spaces?
Can the parser return the original text without extra spaces?
Yes, you need to define a lexer rule that captures these spaces and then skip() them:
Space
: (' ' | '\t') {skip();}
;
which will cause spaces and tabs to be ignored.
PS. I'm assuming you're using Java as the target language. The skip() can be different in other targets (Skip() for C#, for example). You may also want to include \r and \n chars in this rule.
EDIT
Let's say your language only consists of a couple of variable declarations. Assuming you know the basics of ANTLR, the following grammar should be easy to understand:
grammar T;
parse
: stat* EOF
;
stat
: Type Identifier '=' Int ';'
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
And you're parsing the source:
int x = 5 ; double y =5;boolean z = 0 ;
which you'd like to change into:
int x=5;
double y=5;
boolean z=0;
Here's a way to embed code in your grammar and let the parser rules return custom objects (Strings, in this case):
grammar T;
parse returns [String str]
@init{StringBuilder buffer = new StringBuilder();}
@after{$str = buffer.toString();}
: (stat {buffer.append($stat.str).append('\n');})* EOF
;
stat returns [String str]
: Type Identifier '=' Int ';'
{$str = $Type.text + " " + $Identifier.text + "=" + $Int.text + ";";}
;
Type
: 'int'
| 'double'
| 'boolean'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
Int
: '0'..'9'+
;
Space
: (' ' | '\t' | '\n' | '\r')+ {skip();}
;
Test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "int x = 5 ; double y =5;boolean z = 0 ;";
ANTLRStringStream in = new ANTLRStringStream(source);
TLexer lexer = new TLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TParser parser = new TParser(tokens);
System.out.println("Result:\n"+parser.parse());
}
}
which produces:
Result:
int x=5;
double y=5;
boolean z=0;