How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)? - antlr

I need to tokenize everything that is "outside" any comment, until end of line. For instance:
take me */ and me /* but not me! */ I'm in! // I'm not...
Tokenized as (STR is the "outside" string, BC is block-comment and LC is single-line-comment):
{
STR: "take me */ and me ", // note the "*/" in the string!
BC : " but not me! ",
STR: " I'm in! ",
LC : " I'm not..."
}
And:
/* starting with don't take me */ ...take me...
Tokenized as:
{
BC : " starting with don't take me ",
STR: " ...take me..."
}
The problem is that STR can be anything except the comments, and since the comments openers are not single char tokens I can't use a negation rule for STR.
I thought maybe to do something like:
STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;
But I don't know how to look-ahead for characters in lexer actions.
Is it even possible to accomplish with the ANTLR4 lexer, if yes then how?

Yes, it is possible to perform the tokenization you are attempting.
Based on what has been described above, you want nested comments. These can be achieved in the lexer only without Action, Predicate nor any code. In order to have nested comments, its easier if you do not use the greedy/non-greedy ANTLR options. You will need to specify/code this into the lexer grammar. Below are the three lexer rules you will need... with STR definition.
I added a parser rule for testing. I've not tested this, but it should do everything you mentioned. Also, its not limited to 'end of line' you can make that modification if you need to.
/*
All 3 COMMENTS are Mutually Exclusive
*/
DOC_COMMENT
: '/**'
( [*]* ~[*/] // Cannot START/END Comment
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*'+ '/' -> channel( DOC_COMMENT )
;
BLK_COMMENT
: '/*'
(
( /* Must never match an '*' in position 3 here, otherwise
there is a conflict with the definition of DOC_COMMENT
*/
[/]? ~[*/] // No START/END Comment
| DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
)
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*/' -> channel( BLK_COMMENT )
;
INL_COMMENT
: '//'
( ~[\n\r*/] // No NEW_LINE
| INL_COMMENT // Nested Inline Comment
)* -> channel( INL_COMMENT )
;
STR // Consume everthing up to the start of a COMMENT
: ( ~'/' // Any Char not used to START a Comment
| '/' ~[*/] // Cannot START a Comment
)+
;
start
: DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| STR
;

Try something like this:
grammar T;
#lexer::members {
// Returns true iff either "//" or "/*" is ahead in the char stream.
boolean startCommentAhead() {
return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
}
}
// other rules
STR
: ( {!startCommentAhead()}? . )+
;

Related

antlr4: How to keep comments in parse tree? [duplicate]

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.
Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code ask for the tokens through the getHiddenTokensToLeft/getHiddenTokensToRight and look at the 12.1 section in the book and you will see how to do this.
first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}
Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;
The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File
I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*'.*'*/' /*single comment*/
| '//'~('\r' | '\n')* /* multiple comment*/
;
Like that it will remove them automatically !
For ANTLR v3:
The whitespace tokens are usually not processed by parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTRL throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had similar problem, my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
#header
{
import java.util.*;
}
#parser::members
{
boolean hasEvilCharacters(String input)
{
if (input.contains("*/"))
{
return false;
}
else
{
return true;
}
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token of
all tokens that are neither '*' nor '/'
all tokens that begin with '*' but are not followed with '/' or single '/'
There are some rules that deal with special situations (multiple '' followed by '/', or trailing '')

When rewriting tree on ANTLR, java.lang.NullPointerException is thrown

I'm trying to create a parser for QuickBasic and this is my attempt to get the comments:
grammar QuickBasic;
options
{
language = 'CSharp2';
output = AST;
}
tokens
{
COMMENT;
}
parse
: .* EOF
;
// DOESN'T WORK
Comment
: R E M t=~('\n')* { Text = $t; } -> ^(COMMENT $t)
| Quote t=~('\n')* { Text = $t; } -> ^(COMMENT $t)
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') { Skip(); }
;
fragment Quote : '\'';
fragment E : 'E' | 'e';
fragment M : 'M' | 'm';
fragment R : 'R' | 'r';
Even if I rewrite using only the token COMMENT and nothing more, I still get the same error.
// It DOESN'T WORK EITHER
Comment
: (R E M | Quote) ~('\n')* -> ^(COMMENT)
;
If I give up rewriting, it works:
// THIS WORKS
Comment
: (R E M | Quote) ~('\n')*
;
Rewrite rules only work with parser rules not with lexer rules. And t=~('\n')* will cause only the last non-line-break to be stored in the t-label, so that wouldn't have worked anyway.
But why not skip these Comment tokens all together. If you leave them in the token stream, you'd need to account for Comment tokens in all your parser rules (where Comment tokens are valid to occur): not something you'd want, right?
To skip the, simply call Skip() at the end of the rule:
Comment
: R E M ~('\r' | '\n')* { Skip(); }
| Quote ~('\r' | '\n')* { Skip(); }
;
or more concise:
Comment
: (Quote | R E M) ~('\r' | '\n')* { Skip(); }
;
However, if you're really keen on leaving Comment tokens in the stream and strip either "rem" or the quote from the comment, do it like this:
Comment
: (Quote | R E M) t=NonLineBreaks { Text = $t.text; }
;
fragment NonLineBreaks : ~('\r' | '\n')+;
You could then also create a parser rule that creates an AST with COMMENT as the root (although I don't see the benefit over simply using Comment):
comment
: Comment -> ^(COMMENT Comment)
;

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using python (a grammar that can accept either a list of numbers of a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (and all of the python code that I'm using is imbedded via the #main):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar an input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange since that could be the start of a idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefor not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or a integerList for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.

ASN.1/SMI comment syntax on ANTLR 3

On ANTLR 2, the comment syntax is like this,
// Single-line comments
SL_COMMENT
: (options {warnWhenFollowAmbig=false;}
: '--'( { LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') { newline(); }| '--') )
{$setType(Token.SKIP); }
;
However, when porting this to ANTLR 3,
SL_COMMENT
: (
: '--'( { input.LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | '--') )
{$channel = HIDDEN;}
;
because there is no more options {warnWhenFollowAmbig=false;}, the following comment cannot be parsed correctly,
-- some comment -- some not comment
Then, what is the possible way to define this SL_COMMENT rule for ANTLR 3?
Personally, I like to keep grammar rules as "empty" as possible. In this case, I would create a lexer method that returns true if the next two characters in the input are "--". As long as this is not the case, match any character other than \r and \n, and repeat that zero or more times until an optional "--" is encountered. Note that I didn't put a new line at the end because there is not necessarily a new line at the end (it could also be a EOF). Besides, \r and \n will likely be matched by a SPACE rule which is put on the HIDDEN channel: so there's no harm in doing it as I suggest.
A demo:
...
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
...
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
...
And if you don't like the lexer members-block, you simply do:
SL_COMMENT
: '--' ({!(input.LA(1) == '-' && input.LA(2) == '-')}?=> ~('\r' | '\n'))* '--'?
;
EDIT
A small, complete demo:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
String source = "12 - 34 -- foo - bar -- 42 \n - - 5678 -- more comments 666\n--\n--";
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
parse
: (t=. {System.out.printf("\%-15s\%s\n", tokenNames[$t.type], $t.text);})* EOF
;
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
MINUS
: '-'
;
INT
: '0'..'9'+
;
SPACE
: (' ' | '\t' | '\r' | '\n') {skip();}
;
which, after parsing the input:
12 - 34 -- foo - bar -- 42
- - 5678 -- more comments 666
will print:
INT 12
MINUS -
INT 34
SL_COMMENT -- foo - bar --
INT 42
MINUS -
MINUS -
INT 5678
SL_COMMENT -- more comments 666
SL_COMMENT --
SL_COMMENT --
I came across a solution finally,
SL_COMMENT
: COMMENT ( ({input.LA(2) != '-'}? '-') => '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | COMMENT)
{ $channel = HIDDEN; }
;