String Interpolation in Antlr4 - antlr

I have a grammar that uses modes to do string interpolation:
Something along the lines of:
lexer grammar Example;
//default mode tokens
LBRACE: '{' -> pushMode(DEFAULT_MODE);
RBRACE: '}' -> popMode;
OPEN_STRING: '"' -> pushMode(STRING);
mode STRING;
ID_INTERPOLATION: '$' IDEN;
OPEN_EXPR_INTERPOLATION: '${' -> pushMode(DEFAULT_MODE);
TEXT: '$' | (~[$\r\n])+;
CLOSE_STRING: '"' -> popMode;
parser grammar ExampleParser;
options {tokenVocab = Example;}
test: string* EOF;
string: OPEN_STRING string_part* CLOSE_STRING;
string_part: TEXT | ID_INTERPOLATION | OPEN_EXPR_INTERPOLATION expr RBRACE;
//more rules that use LBRACE and RBRACE
Now this works and tokenizes everything mostly how I want it, but it does have 2 flaws.
if the number of RBRACES goes too far, it can pop the first default mode which can glitch out the IDE, and does not just show an error.
The token for closing a block and closing interpolation is the same, so I cannot highlight them however I want. (this is the main one)
My IDE highlights based on tokens only, so this is a problem, I'd like to be able to highlight them differently. So basically I'd like a solution for this that makes the RBRACE a different token when it's in a string.
I'd prefer to do it without semantic predicates because I don't want to tie it down to a language, but if needed, I'm ok with it, I just might need a little more explanation because I haven't used them that much.

Thank you #sepp2k for helping me solve my issue.
It's a bit of a hack but it does exactly what I need it to
I solved it by changing my popMode on RBRACE to be the following:
RBRACE: '}' {
if(_modeStack.size() > 0) {
popMode();
if(_mode != DEFAULT_MODE) {
setType(EXPR_INTERPOLATION);
}
}
};
I also changed my parser to be
string_part: TEXT | ID_INTERPOLATION | EXPR_INTERPOLATION expr EXPR_INTERPOLATION;
I know it's pretty hacky to change the token type under a specific circumstance, but it got the job done for me, so I'm gonna keep it unless I find a less hacky way to do this.

So I set out to implement an interpolated string parser with using only ANTLR code (no host language code blocks). I found that this works well, including nesting interpolated strings...
lexer grammar Lexer;
LeftBrace: '{';
RightBrace: '}' -> popMode;
Backtick: '`' -> pushMode(InterpolatedString);
Integer: [0-9]+;
Plus: '+';
mode InterpolatedString;
EscapedLeftBrace: '\\{' -> type(Grapheme);
EscapedBacktick: '\\`' -> type(Grapheme);
ExprStart: '{' -> type(LeftBrace), pushMode(DEFAULT_MODE);
End: '`' -> type(Backtick), popMode;
Grapheme: ~('{' | '`');
parser grammar Parser;
options {
tokenVocab = Lexer;
}
startRule: expression EOF;
interpolatedString: Backtick (Grapheme | interpolatedStringExpression)* Backtick;
interpolatedStringExpression: LeftBrace expression RightBrace;
expression
: expression Plus expression
| atom
;
atom: Integer | interpolatedString;
You can test it with input
`{`{`{`{`{`{`{`hello world`}`}`}`}`}`}`}`

Related

Parse string antlr

I have strings as a parser rule rather than lexer because strings may contain escapes with expressions in them, such as "The variable is \(variable)".
string
: '"' character* '"'
;
character
: escapeSequence
| .
;
escapeSequence
: '\(' expression ')'
;
IDENTIFIER
: [a-zA-Z][a-zA-Z0-9]*
;
WHITESPACE
: [ \r\t,] -> skip
;
This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.
How can I parse strings that can have expressions inside of them?
Looking into the parser for Swift and Javascript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.
This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside would switch to the inside mode and seeing a \( or " would switch back outside. The only complicated part would be seeing a ) on the outside: Sometimes it should switch back to the inside (because it corresponds to a \() and some times it shouldn't (when it corresponds to a plain ().
The most basic way to achieve this would be like this:
Lexer:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
Parser:
parser grammar StringParser;
options {
tokenVocab = 'StringLexer';
}
start: exp EOF ;
exp : '(' exp ')'
| IDENTIFIER
| DQUOTE stringContents* DQUOTE
;
stringContents : TEXT
| ESCAPE_SEQUENCE
| '\\(' exp ')'
;
Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.
To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):
#members {
int nesting = 0;
}
LPAR: '(' {
nesting++;
pushMode(DEFAULT_MODE);
};
RPAR: ')' {
if (nesting > 0) {
nesting--;
popMode();
}
};
mode IN_STRING;
BACKSLASH_PAREN: '\\(' {
nesting++;
pushMode(DEFAULT_MODE);
};
(The parts I left out are the same as in the previous version).
This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type(). This will look like this:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
mode EMBEDDED;
E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
Since this version involves a lot of duplication (especially as the amount of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.
The full code for all three alternatives can also be found on GitHub.

ANTLR proper ordering of grammar rules

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| WORD #wordAtom
;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.
ANTLR's lexer tries to match as much characters as possible, so both <<WORD>> and <WORD> are matched by the lexer rul WORD. Therefor, there in these cases the tokens << and >> (or < and > for that matter) will not be created.
You can see what tokens are being created by running these lines of code:
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
which will print:
WORD <word>
WORD <<word>>
EOF <EOF>
What you could do is something like this:
func
: '<<' WORD '>>'
;
atom
: PUNCT+ #punctAtom
| NEWLINE+ #newlineAtom
| word #wordAtom
;
word
: WORD
| '<' WORD '>'
;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is because attribute references are not (currently) supported in the lexer. Still, it could be done on the lexer rule - just would require a slight bit more effort to dig to the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

How do I ignore arbitrary stuff inside braces in ANTLR?

I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
Something like:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to screw up the whole thing - like if I put .*? -> channel(HIDDEN), it doesn't seem to work at all (of course, there is an error when using ANTLR on the grammar to the tune of ".*? can match the empty string" - but what should I put there if not that, so that it ignores the C code, but in such a way that I can access it later (i.e., not skipping it)?
Your C_BLOCK rule can be defined just like the usual multi line comment rule is done in so many languages. Make the curly braces part of the rule too:
C_BLOCK: CURLY .*? CURLY -> channel(HIDDEN);
If you need to nest blocks you write something like:
C_BLOCK: CURLY .*? C_BLOCK? .*? CURLY -> channel(HIDDEN);
or maybe:
C_BLOCK:
CURLY (
C_BLOCK
| .
)*?
CURLY
;
(untested).
Update: changed code to use the non-greedy kleene operator as suggested by a comment.