How to implement this rule in ANTLR4:
multiline-comment-text-item -> Any Unicode scalar value except /* or */
?
In ANTLR, you cannot say: "match this-or-that character, except these multiple (!) characters". You can only say "match this-or-that character, except these single (!) characters":
ANY_EXCEPT_STAR : ~[*];
ANY_EXCEPT_FSLASH : ~[/];
Note that FOO : ~[/*]; matches any single character except / or *; it does not exclude the two-character sequence /*.
I wouldn't match multiline-comment-text-item in a lexer rule of its own, but rather inside the multiline-comment-text where it's (most likely) used:
MultilineCommentText
: '/*' .*? '*/'
;
Be sure to include the ? in there: it makes .* non-greedy, so the match stops at the first */ rather than the last.
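To see what the ? changes, here is a small sketch using Python's re module as a stand-in for the lexer (an analogy only: the sample string is made up, and ANTLR's matching is not regex-based, but the lazy/greedy distinction behaves the same way here):

```python
import re

source = "a /* first */ b /* second */ c"

# Non-greedy (.*?): each match stops at the nearest closing */
lazy = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)

# Greedy (.*): a single match runs on to the last */ in the input
greedy = re.findall(r"/\*.*\*/", source, flags=re.DOTALL)

print(lazy)
print(greedy)
```

The non-greedy pattern yields two separate comments, while the greedy one swallows the b between them.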
Note that quite often, such tokens are hidden or discarded so that they won't end up in parser rules. In that case do either:
MultilineCommentText
: '/*' .*? '*/' -> skip
;
or
MultilineCommentText
: '/*' .*? '*/' -> channel(HIDDEN)
;
See: https://github.com/antlr/antlr4/blob/master/doc/lexer-rules.md
I just ran into this rule while trying to parse Swift with ANTLR4. Here is my implementation:
MULTILINE_COMMENT
: '/*' ('/'*? MULTILINE_COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/'
;
It's unnecessary to split multiline-comment into that many subrules as in the document.
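For comparison, the nesting that the recursive lexer rule expresses can be sketched with a plain depth counter. This is an illustrative Python stand-in, not ANTLR output; the function name match_nested_comment is made up for the example:

```python
def match_nested_comment(text: str, start: int = 0) -> int:
    """Return the index just past the comment opening at `start`, or -1."""
    if not text.startswith("/*", start):
        return -1                      # no comment begins here
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):   # an opener, possibly nested
            depth += 1
            i += 2
        elif text.startswith("*/", i): # a closer for the innermost opener
            depth -= 1
            i += 2
            if depth == 0:
                return i               # outermost comment is complete
        else:
            i += 1
    return -1                          # unterminated comment


text = "/* a /* b */ c */ tail"
end = match_nested_comment(text)
print(text[:end])
```

The outer comment only ends when every nested /* has been closed, which is what the recursive reference to MULTILINE_COMMENT achieves in the grammar.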
I have a rule like this,
BLOCK_COMMENT
: ('/*' ~[!] .*? '*/' | '/**/') -> channel(HIDDEN);
But when I try to match this line,
/**/and /**/1=1
The "and" keyword ends up HIDDEN as well. Since ANTLR is greedy, it matched up to the last occurrence of */, so I end up with only one BLOCK_COMMENT (I was expecting two).
So I need something that matches "not '*/'", and the BLOCK_COMMENT rule should become:
'/*' then not '*/' then '*/'
Does anyone know which rule can match "not '*/'"?
First, here is a quote from the book "The Definitive ANTLR 4 Reference" on the ~ operator in lexer rules:
~x Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz].
So basically we can't use something like ~'*/'.
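Even with single-character negation only, "anything but */" can be spelled out one character at a time. The sketch below uses Python regexes as a rough analogue (an assumption for illustration; Python alternation is ordered while ANTLR picks the longest match, but on this input both agree). It also suggests why the rule in the question over-matches: ~[!] consumes the second * of /**/, so the first */ the rule can still see is the one closing the second comment. The ! exclusion is left out of the second pattern for brevity; a grammar counterpart might look like '/*' (~'*' | '*'+ ~[*/])* '*'+ '/'.

```python
import re

line = "/**/and /**/1=1"

# Rough regex analogue of the rule in the question: '/*' ~[!] .*? '*/' | '/**/'
asker = re.compile(r"/\*[^!].*?\*/|/\*\*/", re.DOTALL)

# Single-character negation only: non-stars, or stars not followed by '/'
strict = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")

asker_matches = [m.group() for m in asker.finditer(line)]
strict_matches = [m.group() for m in strict.finditer(line)]

print(asker_matches)
print(strict_matches)
```

The first pattern returns one match covering both comments and the text between them; the second returns the two comments separately.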
Since you need to interpret the comments themselves as well, the best way to do it, IMHO, is with lexer modes.
...
COMMENT_START : '/*' -> mode (COMMENT_MODE);
mode COMMENT_MODE;
COMMENT_END : '*/' -> mode (DEFAULT_MODE);
//match anything else that you need in this mode
...
I have assumed that you only have one mode in addition to the default one. Of course if you have more of them, you can also use popMode and pushMode.
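To make the mode switch concrete, here is a hand-written Python sketch of what such a two-mode lexer does. It only mimics the behaviour; the token names TEXT and COMMENT_TEXT are hypothetical placeholders for whatever rules you put in each mode:

```python
def tokenize(text):
    """Split text into (type, value) pairs using two lexer-like modes."""
    tokens = []
    mode = "DEFAULT"
    buffer = ""                      # characters seen since the last delimiter
    i = 0
    while i < len(text):
        if mode == "DEFAULT" and text.startswith("/*", i):
            if buffer:
                tokens.append(("TEXT", buffer))
            tokens.append(("COMMENT_START", "/*"))
            buffer, mode, i = "", "COMMENT", i + 2    # like -> mode(COMMENT_MODE)
        elif mode == "COMMENT" and text.startswith("*/", i):
            if buffer:
                tokens.append(("COMMENT_TEXT", buffer))
            tokens.append(("COMMENT_END", "*/"))
            buffer, mode, i = "", "DEFAULT", i + 2    # like -> mode(DEFAULT_MODE)
        else:
            buffer += text[i]
            i += 1
    if buffer:                       # flush whatever is left at end of input
        tokens.append(("TEXT" if mode == "DEFAULT" else "COMMENT_TEXT", buffer))
    return tokens


for token in tokenize("/**/and /**/1=1"):
    print(token)
```

Because */ is only recognised while in the comment mode, each comment closes at its own */ and the text between the two comments comes out as an ordinary token.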
I am trying to write an ANTLR4 parser for a templating language similar to mustache. This uses {{...}} tags interspersed in a normal text file. If the template needs to contain and emit { next to an OPEN_TAG {{, there can be a problem with the lexer/parser. I believe there should be a way to write the parser such that:
This is a left brace {{{tag logic}} and here are two left braces {{{tag logic}}{
translates to
This is a left brace { and here are two left braces {{
Either:
How can I tell the lexer to only match OPEN_TAG to {{ followed by anything but {, absorbing the leading {s into the previous TEXT pattern?
Is there a "better" way to provide an escape sequence for {{?
Thanks!
Handle the tags in the lexer using a mode.
LBrace : '{' ;
RBrace : '}' ;
TOpenTag : '{{' -> pushMode(tagLogic) ;
mode tagLogic ;
TLBrace : '{' -> type(LBrace) ;
TRBrace : '}' -> type(RBrace) ;
TCloseTag : '}}' -> popMode ;
TLogic : [a-zA-Z0-9]+ ;
TWs : [ \t\r\n]+ -> skip ;
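As a quick way to check the first option from the question (treat {{ as an open tag only when it is not followed by another {), here is a Python regex sketch. It is not an ANTLR solution, and it assumes a tag evaluates to nothing, as in the question's expected output; the names TAG and strip_tags are made up for the example:

```python
import re

# '{{' counts as an open tag only when the next character is not '{'
TAG = re.compile(r"\{\{(?!\{)(.*?)\}\}")


def strip_tags(template: str) -> str:
    """Remove tags, leaving any extra braces around them as literal text."""
    return TAG.sub("", template)


template = ("This is a left brace {{{tag logic}} "
            "and here are two left braces {{{tag logic}}{")
result = strip_tags(template)
print(result)
```

With the lookahead in place, the leading { stays in the text and the output matches the translation given in the question.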
I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my StringSequence rule is greedy and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work correctly. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes around it works fine because the rule becomes non-greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash characters (plus the escape sequence). Include a whitespace rule so that individual AND/OR tokens can be matched. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the parser rule(s). This way you not only get descriptive names for the tokens (instead of auto-generated ones) but can also better control the match order.
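The effect of separating on whitespace and giving the keywords priority can be sketched in a few lines of Python. This is only an illustration of the idea, not generated lexer code; here the grouping of plain words into one sequence happens in the tokenizer, whereas in ANTLR it would more likely move into a parser rule:

```python
KEYWORDS = {"AND", "OR"}


def tokenize(query):
    """Return (type, text) pairs; keywords win over plain words."""
    tokens = []
    words = []                       # pending words of the current sequence
    for word in query.split():       # whitespace separates tokens and is dropped
        if word in KEYWORDS:
            if words:
                tokens.append(("StringSequence", " ".join(words)))
                words = []
            tokens.append((word, word))
        else:
            words.append(word)
    if words:
        tokens.append(("StringSequence", " ".join(words)))
    return tokens


print(tokenize("a b c d f"))
print(tokenize("a AND b OR c d"))
```

Once whitespace delimits the words, AND and OR are seen on their own and are no longer swallowed by the surrounding text.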
Here is another, naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
This gets a bit more complex with whitespace rules. Another solution would be to use semantic predicates that look ahead and prevent keywords from being consumed.
I'm using ANTLR4 and I'm trying to write a parser for Matlab. One of the main issues is that strings and the transpose operator both use single quotes. The solution I had in mind was to define the STRING lexer rule roughly as follows:
(if the previous token is not ')', '}', ']' or [a-zA-Z0-9]) then match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note that I do not want to consume the previous token if the condition is true).
Does anyone know a workaround for this problem, given that ANTLR does not support negative lookaheads?
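You can emulate negative lookahead (or, as needed here, a look back at the previous character) in ANTLR4 with a semantic predicate that calls _input.LA(-1) (in Java; see "how to resolve simple ambiguity" or "ANTLR4 negative lookahead in lexer").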
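The decision such a predicate has to make can be sketched outside ANTLR. The Python function below is a hypothetical helper (the name quote_starts_string is made up) that applies the condition from the question to the character just before the quote, which is the role _input.LA(-1) would play inside the lexer:

```python
def quote_starts_string(source: str, pos: int) -> bool:
    """Decide whether the single quote at source[pos] opens a string literal."""
    if source[pos] != "'":
        raise ValueError("pos must point at a single quote")
    if pos == 0:
        return True                  # nothing before it, so it cannot be a transpose
    prev = source[pos - 1]           # the character LA(-1) would return
    is_ident_char = prev.isascii() and prev.isalnum()
    # After an identifier character or a closing bracket, the quote is a transpose
    return not (is_ident_char or prev in ")}]")


print(quote_starts_string("x = 'abc'", 4))
print(quote_starts_string("a'", 1))
print(quote_starts_string("b(1)'", 4))
```

Only the first call reports a string start; in the other two the quote follows an identifier or a closing parenthesis and is treated as a transpose. Cases such as whitespace between the operand and the quote are not covered by this sketch.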
You can also use lexer modes to deal with this kind of thing, but your lexer has to be defined in its own file (modes are only allowed in lexer grammars). The idea is to switch from a mode that can match some tokens to another mode that can match different ones.
Here is an example from the ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ---------------------
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;
I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to break the whole thing. For example, if I put .*? -> channel(HIDDEN), it doesn't work at all (and ANTLR reports an error on the grammar to the tune of ".*? can match the empty string"). What should I put there instead, so that it ignores the C code but still lets me access it later (i.e., without skipping it)?
Your C_BLOCK rule can be defined just like the usual multi-line comment rule found in many languages. Make the curly braces part of the rule too:
C_BLOCK: '{' .*? '}' -> channel(HIDDEN);
If you need to nest blocks, you could write something like:
C_BLOCK: '{' .*? C_BLOCK? .*? '}' -> channel(HIDDEN);
or maybe:
C_BLOCK:
'{' (
C_BLOCK
| .
)*?
'}'
;
(untested).
Update: changed the code to use the non-greedy Kleene operator, as suggested in a comment.
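If you want to check which text such a nested rule should capture, a depth counter gives the same result outside ANTLR. This is a Python sketch with a made-up function name and sample input; it assumes braces never occur inside C string literals or comments, which real C code can violate:

```python
def extract_blocks(text):
    """Return the contents of each top-level {...} block, braces excluded."""
    blocks = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i + 1        # first character inside the outer block
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                blocks.append(text[start:i])
    return blocks


config = (
    '#ACTION: turnRight "turning right is fun"\n'
    "{\n  if (ok) { run(); }\n}\n"
    '#ACTION: secondAction "this is another action"\n'
    "{\n  stop();\n}\n"
)
blocks = extract_blocks(config)
print(blocks)
```

Each action yields one block, and the inner braces of the if statement stay inside the first block instead of ending it early.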