Grammar for string interpolation where malformed interpolations are treated as normal strings - antlr

Here is a subset of the language I want to parse:
A program consists of statements
A statement is an assignment: A = "b"
Assignment's left side is an identifier (all caps)
Assignment's right side is a string enclosed by quotation marks
A string supports string interpolation by inserting a bracket-enclosed identifier (A = "b[C]d")
So far this is straightforward enough. Here is what works:
Lexer:
lexer grammar string_testLexer;
STRING_START: '"' -> pushMode(STRING);
WS: [ \t\r\n]+ -> skip ;
ID: [A-Z]+;
EQ: '=';
mode STRING;
VAR_START: '[' -> pushMode(INTERPOLATION);
DOUBLE_QUOTE_INSIDE: '"' -> popMode;
REGULAR_STRING_INSIDE: ~('"'|'[')+;
mode INTERPOLATION;
ID_INSIDE: [A-Z]+;
CLOSE_BRACKET_INSIDE: ']' -> popMode;
Parser:
parser grammar string_testParser;
options { tokenVocab=string_testLexer; }
mainz: stat *;
stat: ID EQ string;
string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: interpolated_var | REGULAR_STRING_INSIDE;
interpolated_var: VAR_START ID_INSIDE CLOSE_BRACKET_INSIDE;
So far so good. However there is one more language feature:
if there is no valid identifier (that is all caps) in the brackets, treat as normal string.
Eg:
A = "hello" => "hello"
B = "h[A]a" => "h", A, "a"
C="h [A] a" => "h ", A, " a"
D="h [A][V] a" => "h ", A, V, " a"
E = "h [A] [V] a" => "h ", A, " ", V, " a"
F = "h [aVd] a" => "h [aVd] a"
G = "h [Va][VC] a" => "h [Va]", VC, " a"
H = "h [V][][ff[Z]" => "h ", V, "[][ff", Z
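The intended splits listed above can be written down as an executable reference. The sketch below is Python rather than ANTLR and uses only the standard `re` module; `split_interpolations` is a hypothetical helper name. It only describes the desired result for the part between the quotes, not how an ANTLR lexer should arrive at it.

```python
import re

# A well-formed interpolation: '[' + one or more capitals + ']'.
INTERPOLATION = re.compile(r"\[([A-Z]+)\]")

def split_interpolations(body):
    """Split the text between the quotes into ("TEXT", s) and ("VAR", name) parts."""
    parts = []
    pos = 0
    for m in INTERPOLATION.finditer(body):
        if m.start() > pos:
            # Everything between two valid interpolations is plain text,
            # including malformed brackets such as "[]" or "[aVd]".
            parts.append(("TEXT", body[pos:m.start()]))
        parts.append(("VAR", m.group(1)))
        pos = m.end()
    if pos < len(body):
        parts.append(("TEXT", body[pos:]))
    return parts
```

For example, `split_interpolations("h [Va][VC] a")` gives the text `h [Va]`, the variable `VC`, and the text ` a`, which is the split wanted for line G.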
I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR. It results in matching all the lines above as strings.
Since ANTLR4 has no backtracking option to enable, I'm not sure how to tell ANTLR that, when the interpolated_var rule does not match, it should fall back to REGULAR_STRING_INSIDE. It seems to always choose the latter.
I read that the lexer always matches the longest token, so I tried lifting REGULAR_STRING_INSIDE and VAR_START into parser rules, hoping that the order of alternatives in the parser would be honoured:
r: REGULAR_STRING_INSIDE;
v: VAR_START;
string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: v ID_INSIDE CLOSE_BRACKET_INSIDE | r;
That did not seem to make any difference at all.
I also read that antlr4 semantic predicates could help. But I have trouble coming up with the ones that need to be applied in this case.
How do I modify this grammar above so that it can match both interpolated bits, or treat them as strings if they are malformed?
Test input:
A = "hello"
B = "h[A]a"
C="h [A] a"
D="h [A][V] a"
E = "h [A] [V] a"
F = "h [aVd] a"
G = "h [Va][VC] a"
H = "h [V][][ff[Z]"
How I compile / test:
antlr4 string_testLexer.g4
antlr4 string_testParser.g4
javac *.java
grun string_test mainz st.txt -tree

I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR. It results in matching all the lines above as strings.
Correct, ANTLR tries to match as much as possible. So ~('"')+ will be far too greedy.
I also read that antlr4 semantic predicates could help.
Only use predicates as a last resort. They introduce target-specific code into your grammar. If they're not needed (which in this case they aren't), then don't use them.
Try something like this:
REGULAR_STRING_INSIDE
: ( ~( '"' | '[' )+
| '[' [A-Z]* ~( ']' | [A-Z] )
| '[]'
)+
;
The rule above would read as:
match any char other than " or [ once or more
OR match a [ followed by zero or more capitals, followed by any char other than ] or a capital (your [Va and [aVd cases)
OR match an empty block, []
And match one of these 3 alternatives above once or more to create a single REGULAR_STRING_INSIDE.
And if a string can end with one or more [, you may also want to do this:
DOUBLE_QUOTE_INSIDE
: '['* '"' -> popMode
;
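To see why these rules produce the wanted tokens, it can help to emulate the lexer's rule selection outside ANTLR. The following is a rough Python sketch, not ANTLR itself: the regular expressions are hand translations of the rules in mode STRING and mode INTERPOLATION, `tokenize_string_body` is a hypothetical helper, and it assumes the lexer takes the longest match and breaks ties in favour of the rule defined first.

```python
import re

# Rules per mode, in grammar order: (token name, pattern, mode action).
MODES = {
    "STRING": [
        ("VAR_START", re.compile(r"\["), "push"),
        ("DOUBLE_QUOTE_INSIDE", re.compile(r'\[*"'), "pop"),
        ("REGULAR_STRING_INSIDE",
         re.compile(r'(?:[^"\[]+|\[[A-Z]*[^\]A-Z]|\[\])+'), None),
    ],
    "INTERPOLATION": [
        ("ID_INSIDE", re.compile(r"[A-Z]+"), None),
        ("CLOSE_BRACKET_INSIDE", re.compile(r"\]"), "pop"),
    ],
}

def tokenize_string_body(text):
    """Tokenize everything after the opening quote, up to the closing quote."""
    stack = ["STRING"]
    pos = 0
    tokens = []
    while pos < len(text) and stack:
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier rule stays.
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m, action = best
        tokens.append((name, m.group()))
        pos = m.end()
        if action == "push":
            stack.append("INTERPOLATION")
        elif action == "pop":
            stack.pop()
    return tokens
```

For the body of test line G, `tokenize_string_body('h [Va][VC] a"')` yields REGULAR_STRING_INSIDE for `h [Va]`, then VAR_START, ID_INSIDE, CLOSE_BRACKET_INSIDE, another REGULAR_STRING_INSIDE for ` a`, and finally DOUBLE_QUOTE_INSIDE. At a `[` that opens a valid interpolation, REGULAR_STRING_INSIDE cannot match at all, so VAR_START is the only candidate.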

Related

Antlr4 DM string lexer rules

I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:
The string start and end with double quotes. i.e. "hello world" evaluates to hello world
A backslash acts as an escape character, which can escape the end quote. i.e. "hello\"world" evaluates to hello"world
Newlines in the string can be ignored by ending the line with a backslash. i.e. "hello\
world" evaluates to helloworld
If the string opens/closes with the sequence {"/"} respectively, newlines are allowed and entered into the final string. The sequence \\\n is still ignored
The string can contain embedded expressions inside braces which are formatted into the result. Backslashes can escape the opening brace. i.e. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the braces (calls, math, etc...)
If the starting quote/curly brace is prefixed with '#' escape sequences and embedded expressions are disabled for the string. i.e. #{"hello [worl\d"} and #"hello [worl\d" both evaluate to hello [worl\d
I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there's 4 (or more) token types I'd need:
Normal string. i.e "hello world", #"hello world", #{"hello world"} or {"hello world"}
String start before embedded expression. i.e. "hello [ or {"hello [
String end after embedded expression. i.e. ] world" or ] world"}
String in between two embedded expressions. i.e. ] hello world [
Here are my (incomplete and unsuccessful) attempts:
LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"';
CSTRING: ']' ('\\[' | ~[[\r\n])* '[';
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
If this can't be solved in the lexer, I can write the parser rules on my own with the tokens #, {", "}, [, ], \\, and ". But, I figure I'd give this a shot since it'd be more performant.
I solved it with the following lexer tidbits.
...
@lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '#"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '#{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
| '\\"'
| '\\['
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
EMBED_START: '[' -> pushMode(DEFAULT_MODE);
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;
Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat it all anyway.
A lot of it came from reading the C# interpolated strings in the antlr4 examples.

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_1_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rule falls to the lexer side, any text entered is tokenized and I get a list of tokens instead of a single string token. The ANTLR profiler in IntelliJ also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given that this SPACES rule is placed after WS and matches exactly the same input, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
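The same breakdown can be tried out with an ordinary regular expression. This is a Python sketch with the standard `re` module, not ANTLR; the pattern is a hand translation of the TEXT rule and `leading_text` is a hypothetical helper that returns whatever a TEXT token would cover at the start of the input.

```python
import re

# ( ~[<] | '<' ~[<%] )+  translated to a regular expression:
#   [^<]      any char other than '<'
#   <[^<%]    a '<' followed by any char other than '<' and '%'
TEXT = re.compile(r"(?:[^<]|<[^<%])+")

def leading_text(source):
    """Return the longest TEXT-like prefix of source, or "" if there is none."""
    m = TEXT.match(source)
    return m.group() if m else ""
```

A lone `<` is allowed inside the text, but the match stops right before `<<` or `<%`, so `leading_text("a < b <% x")` returns `a < b `.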
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicates are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target-specific code into your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

How to tokenize blocks (comments, strings, ...) as well as inter-blocks (any char outside blocks)?

I need to tokenize everything that is "outside" any comment, until end of line. For instance:
take me */ and me /* but not me! */ I'm in! // I'm not...
Tokenized as (STR is the "outside" string, BC is block-comment and LC is single-line-comment):
{
STR: "take me */ and me ", // note the "*/" in the string!
BC : " but not me! ",
STR: " I'm in! ",
LC : " I'm not..."
}
And:
/* starting with don't take me */ ...take me...
Tokenized as:
{
BC : " starting with don't take me ",
STR: " ...take me..."
}
The problem is that STR can be anything except the comments, and since the comments openers are not single char tokens I can't use a negation rule for STR.
I thought maybe to do something like:
STR : { IsNextSequenceTerminatesThe_STR_rule(); }?;
But I don't know how to look-ahead for characters in lexer actions.
Is it even possible to accomplish with the ANTLR4 lexer, if yes then how?
Yes, it is possible to perform the tokenization you are attempting.
Based on what has been described above, you want nested comments. These can be achieved in the lexer alone, without actions, predicates, or any other code. In order to have nested comments, it's easier if you do not rely on the greedy/non-greedy ANTLR options. You will need to spell this out in the lexer grammar. Below are the three lexer rules you will need, along with a STR definition.
I added a parser rule for testing. I've not tested this, but it should do everything you mentioned. Also, it's not limited to 'end of line'; you can make that modification if you need to.
/*
All 3 COMMENTS are Mutually Exclusive
*/
DOC_COMMENT
: '/**'
( [*]* ~[*/] // Cannot START/END Comment
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*'+ '/' -> channel( DOC_COMMENT )
;
BLK_COMMENT
: '/*'
(
( /* Must never match an '*' in position 3 here, otherwise
there is a conflict with the definition of DOC_COMMENT
*/
[/]? ~[*/] // No START/END Comment
| DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
)
( DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| .
)*?
)?
'*/' -> channel( BLK_COMMENT )
;
INL_COMMENT
: '//'
( ~[\n\r*/] // No NEW_LINE
| INL_COMMENT // Nested Inline Comment
)* -> channel( INL_COMMENT )
;
STR // Consume everything up to the start of a COMMENT
: ( ~'/' // Any Char not used to START a Comment
| '/' ~[*/] // Cannot START a Comment
)+
;
start
: DOC_COMMENT
| BLK_COMMENT
| INL_COMMENT
| STR
;
Try something like this:
grammar T;
@lexer::members {
// Returns true iff either "//" or "/*" is ahead in the char stream.
boolean startCommentAhead() {
return _input.LA(1) == '/' && (_input.LA(2) == '/' || _input.LA(2) == '*');
}
}
// other rules
STR
: ( {!startCommentAhead()}? . )+
;
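The predicate in STR works like a negative lookahead: keep consuming single characters as long as no comment opener is ahead. The same idea can be checked quickly with a regular expression. This is a Python sketch using the standard `re` module, not ANTLR; it handles non-nested comments only, the token names STR, BC and LC are taken from the question, and `tokenize` is a hypothetical helper.

```python
import re

TOKEN = re.compile(
    r"/\*(?P<BC>.*?)\*/"            # block comment, up to the first "*/"
    r"|//(?P<LC>[^\r\n]*)"          # line comment, up to the end of the line
    r"|(?P<STR>(?:(?!//|/\*).)+)",  # any run with no "//" or "/*" ahead
    re.DOTALL,
)

def tokenize(source):
    """Return (token name, text) pairs; comment delimiters are left out of the text."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in TOKEN.finditer(source)]
```

Run on the first example line from the question, this gives STR `take me */ and me `, BC ` but not me! `, STR ` I'm in! ` and LC ` I'm not...`. The stray `*/` stays inside the first STR because only comment openers end a STR run.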

Special character handling in ANTLR lexer

I wrote the following grammar for string variable declarations. A string is anything between single quotes, but there must be a way to put a single quote into the string value by escaping it with the $ character.
grammar test;
options
{
language = Java;
}
tokens
{
VAR = 'VAR';
END_VAR = 'END_VAR';
}
var_declaration: VAR string_type_declaration END_VAR EOF;
string_type_declaration: identifier ':=' string;
identifier: ID;
string: STRING_VALUE;
STRING_VALUE: '\'' ('$\''|.)* '\'';
ID: LETTER+;
WSFULL:(' ') {$channel=HIDDEN;};
fragment LETTER: (('a'..'z') | ('A'..'Z'));
This grammar doesn't work. If I run the following input against the var_declaration rule:
VAR A :='$12.2' END_VAR
I get MismatchedTokenException.
But this code works fine for string_type_declaration rule:
A :='$12.2'
Your STRING_VALUE isn't properly tokenized. Inside the loop ( ... )*, the $ expects a single quote after it, but the string in your input, '$12.2', doesn't have a quote after $. You should make the single quote optional ('$' '\''? | .)*. But now your alternative in the loop, the ., will also match a single quote: better let it match anything other than a single quote and $:
STRING_VALUE
: '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
;
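The corrected rule can be sanity-checked with an equivalent regular expression. This is a Python sketch with the standard `re` module rather than ANTLR; the pattern is a hand translation of STRING_VALUE and `match_string_value` is a hypothetical helper.

```python
import re

# '\'' ( '$' '\''? | ~('$' | '\'') )* '\''  translated to a regular expression:
#   \$'?     a '$', optionally followed by an escaped single quote
#   [^$']    any char other than '$' and a single quote
STRING_VALUE = re.compile(r"'(?:\$'?|[^$'])*'")

def match_string_value(text):
    """Return the string literal at the start of text, or None if there is none."""
    m = STRING_VALUE.match(text)
    return m.group() if m else None
```

With this pattern, `'$12.2'` is accepted as a whole because the `$` no longer requires a quote after it, while `$'` inside a literal still counts as an escaped quote.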

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
@parser::members {
private boolean inbounds(Token t, int min, int max) {
int n = Integer.parseInt(t.getText());
return n >= min && n <= max;
}
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like: 'set value abc' , 'set underli on' , 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
The third syntax: 'set connect oneChar' should accept any character, but just one character. It can be a numeric digit or alphabet or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax: 'set value' should accept all the possible strings, even on and off. But when I give something like: 'set value offer', the grammar is failing. I think this is happening because I already have a token 'OFF'.
In my grammar all the three requirements I have listed above are not working fine. Don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semicolon at the end. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefore never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character but in parser rules, it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) can then operate on these tokens (not characters!!!). Make sure you understand this! (if not, ask for clarification or grab a book about ANTLR)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token that the lexer produces, which is only an A or a B token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
#5
Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String;
String : ...
but do:
pRule : U ':' String;
U : 'u';
String : ...
You could make ':' a lexer rule, but that is of less importance. The 'u' however can also be a String so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"set value abc \n" +
"set underli on \n" +
"set conn x \n" +
"set conn xy ";
ANTLRStringStream in = new ANTLRStringStream(source);
firstLexer lexer = new firstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
firstParser parser = new firstParser(tokens);
System.out.println("parsing:\n======\n" + source + "\n======");
parser.parse();
}
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.