antlr4: How to keep comments in parse tree? [duplicate] - antlr

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult because comments can appear literally anywhere in the code. If a comment sits somewhere in the source code that the grammar doesn't allow for, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.

Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then, in your code, ask the token stream for the hidden tokens with getHiddenTokensToLeft/getHiddenTokensToRight. Section 12.1 of the book shows how to do this.

first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// IDLParser is this answer's generated parser class; substitute your own
if (token.getType() != IDLParser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}

Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .*? '*/'
;

The "island grammars" feature can also be used. See the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File

I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*' .*? '*/' /* multi-line comment */
| '//' ~('\r' | '\n')* /* single-line comment */
;
This way the lexer discards them automatically!

For ANTLR v3:
The parser usually does not process whitespace tokens, but the lexer still captures them on the HIDDEN channel.
If you use BufferedTokenStream, you can get the list of all tokens from it and post-process the result, adding the hidden tokens back where needed.
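The post-processing idea can be sketched without the ANTLR runtime. The following is only a toy model under stated assumptions: the class CommentAttacher, the type Tok, and the channel numbers are hypothetical stand-ins, not ANTLR classes or constants. It walks the full token list once and hangs each comment on the next visible token.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a token list; Tok and the channel numbers are illustrative only.
final class CommentAttacher {
    static final int DEFAULT_CHANNEL = 0;   // tokens the parser would see
    static final int COMMENTS_CHANNEL = 2;  // assumed channel for comments

    static final class Tok {
        final String text;
        final int channel;
        final List<String> leadingComments = new ArrayList<>();

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Returns the default-channel tokens, each carrying the comments that preceded it.
    // Comments after the last visible token are dropped in this sketch.
    static List<Tok> attach(List<Tok> all) {
        List<Tok> visible = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == COMMENTS_CHANNEL) {
                pending.add(t.text);
            } else if (t.channel == DEFAULT_CHANNEL) {
                t.leadingComments.addAll(pending);
                pending.clear();
                visible.add(t);
            }
            // Tokens on any other channel (e.g. whitespace) are ignored.
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Tok> all = List.of(
            new Tok("// counter", COMMENTS_CHANNEL),
            new Tok("int", DEFAULT_CHANNEL),
            new Tok("x", DEFAULT_CHANNEL),
            new Tok("/* done */", COMMENTS_CHANNEL),
            new Tok(";", DEFAULT_CHANNEL));
        for (Tok t : attach(all)) {
            System.out.println(t.text + " <- " + t.leadingComments);
        }
    }
}
```

With a real token stream, the same loop would read the channel from each token instead of from the toy Tok class.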

Related

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_1_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rule is built from lexer tokens, any text entered is tokenized and I get a list of tokens instead of a single string token. The ANTLR profiler in IntelliJ also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Because SPACES is placed after WS and matches exactly the same input, the lexer always prefers WS, and the SPACES rule will never be matched.
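A small sketch of that tie-breaking behaviour, written with plain java.util.regex rather than ANTLR: the longest match wins, and when two rules match the same length, the rule listed first wins. The class FirstRuleWins and its regexes are hypothetical stand-ins for the WS and SPACES rules above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of lexer rule selection; regexes stand in for the grammar's rules.
final class FirstRuleWins {
    // Rule names and patterns in grammar order.
    static final String[] NAMES = { "WS", "SPACES" };
    static final Pattern[] RULES = {
        Pattern.compile("[ \\n\\r\\t]+"),  // WS
        Pattern.compile("[ \\r\\n\\t]+")   // SPACES : SPACE+
    };

    // Returns the name of the rule that would produce the next token, or null if none matches.
    static String winner(String input) {
        String best = null;
        int bestLen = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // Strict '>' keeps the earlier rule when both match the same length.
            if (m.lookingAt() && m.end() > bestLen) {
                best = NAMES[i];
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winner("  \n\tHi")); // prints "WS"
    }
}
```

Since both patterns always match the same prefix, SPACES can never be selected in this model, which mirrors the unreachable rule in the grammar.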
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
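For readers more used to regular expressions, the same rule can be approximated with java.util.regex. This is only a sketch: the class TextRuleDemo and the method textPrefix are made up for illustration, and a regex is not how ANTLR implements the rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextRuleDemo {
    // Regex counterpart of the lexer rule TEXT : ( ~[<] | '<' ~[<%] )+ ;
    static final Pattern TEXT = Pattern.compile("(?:[^<]|<[^<%])+");

    // Returns the TEXT prefix of input, or "" if input does not start with TEXT.
    static String textPrefix(String input) {
        Matcher m = TEXT.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(textPrefix("FOO BAR<<123>>"));    // prints "FOO BAR"
        System.out.println(textPrefix("a < b<%FOO%>"));      // prints "a < b"
        System.out.println(textPrefix("<%FOO%>").isEmpty()); // prints "true"
    }
}
```

A lone '<' followed by an ordinary character stays part of the text, while '<<' and '<%' end it, which is what lets the tag rules take over.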
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicates are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target-specific code into your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Antlr 4 Lexer with multiple modes failing to tokenise correctly

I'm trying to create a lexer with multiple modes using Antlr 4.7. My lexer currently is:
ACTIONONLY : 'AO';
BELIEFS : ':Initial Beliefs:' -> mode(INITIAL_BELIEFS);
NAME : ':name:';
WORD: ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
COMMENT : '/*' .*? '*/' -> skip ;
LINE_COMMENT : '//' ~[\n]* -> skip ;
NEWLINE:'\r'? '\n' -> skip ;
WS : (' '|'\t') -> skip ;
mode INITIAL_BELIEFS;
GOAL_IB : ':Initial Goal:' -> mode(GOALS);
IB_COMMENT : '/*' .*? '*/' -> skip ;
IB_LINE_COMMENT : '//' ~[\n]* -> skip ;
IB_NEWLINE:'\r'? '\n' -> skip ;
IB_WS : (' '|'\t') -> skip ;
BELIEF_BLOCK: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+;
mode REASONING_RULES;
R1: 'a';
R2: 'b';
mode GOALS;
GL_COMMENT : '/*' .*? '*/' -> skip ;
GL_LINE_COMMENT : '//' ~[\n]* -> skip ;
GL_NEWLINE:'\r'? '\n' -> skip ;
GL_WS : (' '|'\t') -> skip ;
GOAL_BLOCK: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+;
Note that there is no way, at present, to get into the REASONING_RULES mode (so this should not, as I understand it, have any effect on the operation of the lexer). Obviously I do want to use this mode, but this is the minimal version of the lexer that seems to display the problem I'm having.
My parser is:
grammar ActionOnly;
options { tokenVocab = ActionOnlyLexer; }
// Mas involving ActionOnly Agents
mas : aoagents;
aoagents: ACTIONONLY (aoagent)+;
// Agent stuff
aoagent :
(ACTIONONLY?)
NAME w=WORD
BELIEFS (bs=BELIEF_BLOCK )?
GOAL_IB gs=GOAL_BLOCK;
and I'm trying to parse:
AO
:name: robot
:Initial Beliefs:
abelief
:Initial Goal:
at(4, 2)
This fails with the error
line 35:0 mismatched input 'at(4,' expecting GOAL_BLOCK
which I'm assuming is because it isn't tokenising correctly.
If I omit rule R2 in the REASONING_RULES mode then it parses correctly (in general I seem to be able to have one rule in REASONING_RULES and it will work, but more than one rule and it fails to match GOAL_BLOCK)
I'm really struggling to see what I'm doing wrong here, but this is the first time I've tried to use lexer modes with Antlr.
I don't get that error when I try your grammars. I also tested with ANTLR 4.7.
Here's my test rig:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
String source = "AO\n" +
"\n" +
":name: robot\n" +
"\n" +
":Initial Beliefs:\n" +
"\n" +
"abelief\n" +
"\n" +
":Initial Goal:\n" +
"\n" +
"at(4, 2)";
ActionOnlyLexer lexer = new ActionOnlyLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
System.out.println("[TOKENS]");
for (Token t : tokens.getTokens()) {
System.out.printf(" %-20s %s\n", ActionOnlyLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
System.out.println("\n[PARSE-TREE]");
ActionOnlyParser parser = new ActionOnlyParser(tokens);
ParserRuleContext context = parser.mas();
System.out.println(" "+context.toStringTree(parser));
}
}
And this is printed to my console:
[TOKENS]
ACTIONONLY AO
NAME :name:
WORD robot
BELIEFS :Initial Beliefs:
BELIEF_BLOCK abelief
GOAL_IB :Initial Goal:
GOAL_BLOCK at(4,
GOAL_BLOCK 2)
EOF <EOF>
[PARSE-TREE]
(mas (aoagents AO (aoagent :name: robot :Initial Beliefs: abelief :Initial Goal: at(4,)))
Perhaps you need to generate new lexer/parser classes?
PS. note that ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+ can be written as [a-zA-Z0-9_(),.]+
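The token dump above shows "at(4, 2)" arriving as two GOAL_BLOCK tokens. The character set contains no space, so the skipped whitespace splits the input. A rough illustration with java.util.regex follows; the class GoalBlockDemo is hypothetical, and unlike the lexer, find() silently skips any character outside the set, not just whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GoalBlockDemo {
    // Same character set as GOAL_BLOCK, in the shorter [..] notation.
    static final Pattern GOAL_BLOCK = Pattern.compile("[a-zA-Z0-9_(),.]+");

    // Collects every maximal run of GOAL_BLOCK characters in the input.
    static List<String> blocks(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = GOAL_BLOCK.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(blocks("at(4, 2)")); // prints "[at(4,, 2)]"
        System.out.println(blocks("at(4,2)"));  // prints "[at(4,2)]"
    }
}
```

If a goal should stay a single token even when it contains spaces, the space would have to be part of the rule, or the parser rule would have to accept more than one GOAL_BLOCK.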

Antlr4: How can I match end of lines inside multiline comments?

I have to create a program that counts lines of code ignoring those inside a comment. I'm a newbie working with Antlr, and after trying a lot, the nearest I came to a solution is this erroneous grammar:
grammar Comments;
comment : startc content endc;
startc : '/*';
endc : '*/';
content : newline | contenttext;
contenttext : CONTENTCHARS+;
newline : '\r\n';
CONTENTCHARS
: ~'*' '/'
| ~'/' .
;
WS : [ \r\t]+ -> skip;
If I try it with /*hello\r\nworld*/, the parser recognizes it, but the result is wrong.
In order to count lines, the parser needs to detect newline characters, inside and outside multiline comments. I think my problem is that I don't know how to say "match everything inside /* and */ except \r\n".
Please, can you point me in the right direction? Any help will be appreciated.
Solution
Let's simplify your grammar! We will ignore whitespace characters and comments at the lexer stage (and, with them, the unwanted newlines). The COMMENT rule matches a /* ... */ comment, whether it sits on one line or spans several, and simply skips it.
Next, we introduce a counter variable for counting NEWLINE tokens. Only newlines outside comments produce a NEWLINE token, because a skipped COMMENT swallows the newlines inside it.
Whenever the lexer encounters a NEWLINE token, it increments the counter.
grammar Comments;
@lexer::members {
int counter = 0;
}
WS : [ \r\t]+ -> skip;
COMMENT : '/*' .*? '*/' NEWLINE? -> skip;
TEXT : [a-zA-Z0-9]+;
NEWLINE : '\r'? '\n' { System.out.println("Newlines so far: " + (++counter)); } ;
content: (TEXT | COMMENT | NEWLINE )* EOF;
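The counting logic of this grammar can be approximated in plain Java to check expectations on small inputs. This is a sketch only: the class NewlineCounter is hypothetical, and it uses two regexes that mirror the COMMENT and NEWLINE rules instead of running the generated lexer, so edge cases (for example comment markers inside string literals) are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NewlineCounter {
    // Mirrors COMMENT : '/*' .*? '*/' NEWLINE? -> skip ;
    static final Pattern COMMENT =
        Pattern.compile("/\\*.*?\\*/(?:\\r?\\n)?", Pattern.DOTALL);
    // Mirrors NEWLINE : '\r'? '\n' ;
    static final Pattern NEWLINE = Pattern.compile("\\r?\\n");

    // Counts the newlines that remain once comments (and one trailing newline each) are removed.
    static int countNewlinesOutsideComments(String source) {
        String withoutComments = COMMENT.matcher(source).replaceAll("");
        Matcher m = NEWLINE.matcher(withoutComments);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "a\n/*x\ny*/\nb\n";
        System.out.println(countNewlinesOutsideComments(source)); // prints "2"
    }
}
```

In the example, the comment and the newline directly after it disappear, leaving only the newlines that end the lines "a" and "b".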

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTLR throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had a similar problem; my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
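The pattern translates almost literally into a java.util.regex expression, which makes it easy to compare against a plain contains("*/") check. A minimal sketch, with a hypothetical class name NoStarSlash:

```java
import java.util.regex.Pattern;

final class NoStarSlash {
    // Regex counterpart of ( ~'*' | ( '*'+ ~[/*] ) )* '*'*
    static final Pattern P = Pattern.compile("(?:[^*]|\\*+[^/*])*\\**");

    // True when the whole string matches, i.e. it never contains "*/".
    static boolean accepts(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "a*b/c", "a*/b", "a**", "**/", "/*", "" };
        for (String s : samples) {
            // The two columns should always agree.
            System.out.println("\"" + s + "\" " + accepts(s) + " " + !s.contains("*/"));
        }
    }
}
```

Each run of asterisks must either be followed by a character that is neither '/' nor '*', or sit at the very end of the string, so "*/" can never appear.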
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
@header
{
import java.util.*;
}
@parser::members
{
// Despite its name, this returns true when the input is clean,
// i.e. when it does NOT contain "*/", so the predicate passes.
boolean hasEvilCharacters(String input)
{
return !input.contains("*/");
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token from
all characters that are neither '*' nor '/'
all runs of '*' that are not followed by '/', or a single '/'
There are some rules that deal with special situations (multiple '*' followed by '/', or trailing '*')

How I do... ? with lexer only rules on ANTLR

I’m trying to implement a simple parser for custom .c files with added syntax.
Ex: test.c
.
// I don’t need this in output
int func1(int a, int b);
//I need this.
#parseme int func2(int a, int b);
//and this …
#parseme
void func3()
{
int a;
//put here where ever
…
{
//inside block
}
return;
}
.
I want to use a fuzzy parsing approach in the lexer phase and then, in the parser rules, rewrite tokens with TokenRewriteStream and templates.
Here is a piece of the lexer …
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{ }'
METHOD
: HEADER .* (';' | BODY )
;
fragment
HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
BODY: '{' .* '}' ;
.
…
The problems are probably obvious to an expert:
1- The lexer stops at the first ‘;’ instead of reaching the last ‘}’ of “ #parseme void func3() …. “
2- The lexer stops at the right curly brace of the inner block.
3- And surely there are more cases I haven't tested yet.
The problem is really obvious. Is the solution too? I hope so!
Thanks.
Answering my own question:
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{}'
METHOD
: METHOD_HEADER (~'{')* METHOD_END ;
fragment
METHOD_HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
METHOD_END
: (';' | BLOCK ) ;
fragment
BLOCK
: '{' ( ~('{' | '}') | BLOCK )* '}' ;
WS : (' '|'\r'|'\t'|'\n')+ ;
The solution was very simple.
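The recursive BLOCK fragment is what fixes the nested-brace case: an inner '{' starts a nested BLOCK, so only the matching '}' ends the outer one. Outside a grammar, the same idea is a depth counter. The class BlockMatcher below is a hypothetical illustration in plain Java; like the grammar, it does not treat braces inside string literals or comments specially.

```java
final class BlockMatcher {
    // Returns the index just past the '}' that closes the '{' at position open,
    // or -1 if there is no '{' at that position or the braces are unbalanced.
    static int endOfBlock(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            return -1;
        }
        int depth = 0;
        for (int i = open; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '{') {
                depth++;                 // entering a (possibly nested) block
            } else if (c == '}' && --depth == 0) {
                return i + 1;            // closed the outermost block
            }
        }
        return -1;                       // ran out of input before the block closed
    }

    public static void main(String[] args) {
        String src = "{ int a; { /* inside block */ } return; } rest";
        int end = endOfBlock(src, 0);
        System.out.println(src.substring(0, end));
    }
}
```

The inner closing brace only lowers the depth to 1, so scanning continues until the brace that belongs to the method body.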