I’m trying to implement a simple parsing over custom .c files with added syntax.
Ex: test.c
.
// I don’t need this in output
int func1(int a, int b);
//I need this.
#parseme int func2(int a, int b);
//and this …
#parseme
void func3()
{
Int a;
//put here where ever
…
{
//inside block
}
return;
}
.
I want to use a fuzzy parsing approach on the lexer phase then, on the parser rules, rewrite token with TokenRewriteStream and templates.
Well it’s a lexer piece …
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{ }'
METHOD
: HEADER .* (';' | BODY )
;
fragment
HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
BODY: '{' .* '}' ;
.
…
The problem is simple for a expert look:
1- Lexer stop parse when found ‘;’ before to reach the last ‘}’ on “ #parseme void func3() …. “
2- Lexer stop parse when found inside block right curly.
3- And surely more cases don’t tested yet.
The problem is really obvious. Is the solution too?? I hope soo !!
Thanks.
Answer my self.
lexer grammar Lexi;
options {filter = true;}
// Pick everything between #parseme and ';' or '{}'
METHOD
: METHOD_HEADER (~'{')* METHOD_END ;
fragment
METHOD_HEADER
: '#' ('parseme' | 'PARSEME') ;
fragment
METHOD_END
: (';' | BLOCK ) ;
fragment
BLOCK
: '{' ( ~('{' | '}') | BLOCK )* '}' ;
WS : (' '|'\r'|'\t'|'\n')+ ;
The solution was very simple.
Related
I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.
I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.
Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code ask for the tokens through the getHiddenTokensToLeft/getHiddenTokensToRight and look at the 12.1 section in the book and you will see how to do this.
first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}
Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;
The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File
I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*'.*'*/' /*single comment*/
| '//'~('\r' | '\n')* /* multiple comment*/
;
Like that it will remove them automatically !
For ANTLR v3:
The whitespace tokens are usually not processed by parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.
Given this g4 grammar:
grammar smaller;
root
: ( componentDefinition )* EOF;
componentDefinition
: Addr
Id?
Lbrace
Rbrace
Semi
;
ExprElem
: Num
| Id
;
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
//------------------------------------------------
// Whitespace and Comments
//------------------------------------------------
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
};
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
addr
id
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your rule ExprElem is a lexer rule, not a parser rule (it begins with an upercase) and is masking the Addr rule, so, no Addr :(
Also, as ExprElem is a lexer rule and it relies on Id or Num rule. Consequently, when an Id is found, ANTLR lexer gives it the ExprElem token type and not the Id token type.
So, two things, you can either rewrite your ExprElem rule to exprElem (assuming you want a parser rule):
exprElem : Num | Id;
or you can use Id token in your ExprElem as part of the rule but you need something that can differentiate ExprElem from Id (example below, but I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
ExprElem
: Sharp Num // This token use others but defines its own 'pattern'
| Sharp Id
;
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
From what I suppose, this is definitely not what you want, but I just put it here to illustrate how lexer rule can reuse others.
When you have doubt about what your token do, do not hesitate to display the recognize tokens. Here is the Java code fragment I often use (I named your grammar test in this case):
public class Main {
public static void main(String[] args) throws InterruptedException {
String txt =
"addr Basic {\n"
+ "\n"
+ "};";
TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.root();
for (Token t : tokens.getTokens()) {
System.out.println(t);
}
}
}
NOTE: by the way, Num will never be recognized as Id rule can match the same thing. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-z_];
Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTRL throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had similar problem, my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
#header
{
import java.util.*;
}
#parser::members
{
boolean hasEvilCharacters(String input)
{
if (input.contains("*/"))
{
return false;
}
else
{
return true;
}
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token of
all tokens that are neither '*' nor '/'
all tokens that begin with '*' but are not followed with '/' or single '/'
There are some rules that deal with special situations (multiple '' followed by '/', or trailing '')
I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figuring out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other then "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K" and will therefor throw an exception. Note that the lexer doesn't remove "K" and then tries to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".