Antlr4 DM string lexer rules - antlr

I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:
The string start and end with double quotes. i.e. "hello world" evaluates to hello world
A backslash acts as an escape character, which can escape the end quote. i.e. "hello\"world" evaluates to hello"world
Newlines in the string can be ignored by ending the line with a backslash. i.e. "hello\
world" evaluates to helloworld
If the string opens/closes with the sequence {"/"} respectively, newlines are allowed and entered into the final string. The sequence \\\n is still ignored
The string can contain embedded expressions inside braces which are formatted into the result. Backslashes can escape the opening brace. i.e. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the braces (calls, math, etc...)
If the starting quote/curly brace is prefixed with '#' escape sequences and embedded expressions are disabled for the string. i.e. #{"hello [worl\d"} and #"hello [worl\d" both evaluate to hello [worl\d
I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there's 4 (or more) token types I'd need:
Normal string. i.e "hello world", #"hello world", #{"hello world"} or {"hello world"}
String start before embedded expression. i.e. "hello [ or {"hello [
String end after embedded expression. i.e. ] world" or ] world"}
String in between two embedded expressions. i.e. ] hello world [
Here are my (incomplete and unsuccessful) attempts:
LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"';
CSTRING: ']' ('\\[' | ~[[\r\n])* '[';
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
If this can't be solved in the lexer, I can write the parser rules on my own with the tokens #, {", "}, [, ], \\, and ". But, I figure I'd give this a shot since it'd be more performant.

I solved it with the following lexer tidbits. Permalink
...
#lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '#"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '#{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
| '\\"'
| '\\['
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
EMBED_START: '[' -> pushMode(DEFAULT_MODE);
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;
Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat it all anyway.
A lot of it came from reading the C# interpolated strings in the antlr4 examples permalink

Related

R Language: Grammar for Raw Strings

I'm trying to create a new rule in the R grammar for Raw Strings.
Quote of the R news:
There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.
Examples:
## A Windows path written as a raw string constant:
r"(c:\Program files\R)"
## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"
But I'm unsure if a grammar file alone is enough to implement the rule.
Until now I tried something like this as a basis from older suggestions of similar grammars:
Parser:
| RAW_STRING_LITERAL #e42
Lexer:
RAW_STRING_LITERAL
: ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ;
Any hints or suggestions are appreciated.
R ANTLR Grammar:
https://github.com/antlr/grammars-v4/blob/master/r/R.g4
Original R Grammar in Bison:
https://svn.r-project.org/R/trunk/src/main/gram.y
To match start- and end-delimiters, you will have to use target specific code. In Java that could look like this:
#lexer::members {
boolean closeDelimiterAhead() {
// Get the part between `r"` and `(`
String delimiter = getText().substring(2, getText().indexOf('('));
// Construct the end of the raw string
String stopFor = ")" + delimiter + "\"";
for (int n = 1; n <= stopFor.length(); n++) {
if (this._input.LA(n) != stopFor.charAt(n - 1)) {
// No end ahead yet
return false;
}
}
return true;
}
}
RAW_STRING
: [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
;
which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
EDIT
And since the delimiters can only consist of hyphens (and parenthesis/braces) and not just any arbitrary character, this should do it as well:
RAW_STRING
: [rR] '"' INNER_RAW_STRING '"'
;
fragment INNER_RAW_STRING
: '-' INNER_RAW_STRING '-'
| '(' .*? ')'
| '{' .*? '}'
| '[' .*? ']'
;

exit antlr 3 parser early without raising exception

I am using antlr 3.1.3 and generating a python target. My lexer and parser accept very large files. Based on command-line or dynamic run-time controlled parameters, I would like to capture a portion of the recognized input and stop parsing early. For example, if my language consists of a header and a body, and the body might have gigabytes of tokens, and I am only interested in the header, I would like to have a rule that stops the lexer and parser without raising an exception. For performance reasons, I don't want to read the entire body.
grammar Example;
options {
language=Python;
k=2;
}
language:
header
body
EOF
;
header:
HEAD
(STRING)*
;
body:
BODY { if stopearly: help() }
(STRING)*
;
// string literals
STRING: '"'
(
'"' '"'
| NEWLINE
| ~('"'|'\n'|'\r')
)*
'"'
;
// Whitespace -- ignored
WS:
( ' '
| '\t'
| '\f'
| NEWLINE
)+ { $channel=HIDDEN }
;
HEAD: 'head';
BODY: 'body';
fragment NEWLINE: '\r' '\n' | '\r' | '\n';
What about:
body:
BODY {!stopearly}? => (STRING)*
;
?
That's using a syntantic predicate to enable certain language parts. I use that often to toggle language parts depending on a version number. I'm not 100% certain. It might be you have to move the predicate and the code following it into an own rule.
This is a python-specific answer. I Added this to my parser:
#parser::header
{
class QuitEarlyException(Exception):
def __init__(self, value):
self.value = value
def __str__(self):
return repr(self.value)
}
and changed this:
body:
BODY { if stopearly: raise QuitEarlyException('ok') }
(STRING)*
;
Now I have a "try" block around my parser:
try:
parser.language()
except QuitEarlyException as e:
print "stopped early"

Special character handling in ANTLR lexer

I wrote the following grammar for string variable declaration. Strings are defined like anything between single quotes, but there must be a way to add a single quote to the string value by escaping using $ letter.
grammar test;
options
{
language = Java;
}
tokens
{
VAR = 'VAR';
END_VAR = 'END_VAR';
}
var_declaration: VAR string_type_declaration END_VAR EOF;
string_type_declaration: identifier ':=' string;
identifier: ID;
string: STRING_VALUE;
STRING_VALUE: '\'' ('$\''|.)* '\'';
ID: LETTER+;
WSFULL:(' ') {$channel=HIDDEN;};
fragment LETTER: (('a'..'z') | ('A'..'Z'));
This grammar doesn't work, if you try to run this code for var_declaration rule:
VAR A :='$12.2' END_VAR
I get MismatchedTokenException.
But this code works fine for string_type_declaration rule:
A :='$12.2'
Your STRING_VALUE isn't properly tokenized. Inside the loop ( ... )*, the $ expects a single quote after it, but the string in your input, '$12.2', doesn't have a quote after $. You should make the single quote optional ('$' '\''? | .)*. But now your alternative in the loop, the ., will also match a single quote: better let it match anything other than a single quote and $:
STRING_VALUE
: '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
;
resulting in the following parse tree:

Simple grammar not working

I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figuring out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other then "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K" and will therefor throw an exception. Note that the lexer doesn't remove "K" and then tries to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".

ANTLR parsing Java Properties

I'm trying to pick up ANTLR and writing a grammar for Java Properties. I'm hitting an issue here and will appreciate some help.
In Java Properties, it has a little strange escape handling. For example,
key1=1=Key1
key\=2==
results in key-value pairs in Java runtime as
KEY VALUE
=== =====
key1 1=Key1
key=2 =
So far, this is the best I can mimic.. by folding the '=' and value into one single token.
grammar Prop;
file : (pair | LINE_COMMENT)* ;
pair : ID VALUE ;
ID : (~('='|'\r'|'\n') | '\\=')* ;
VALUE : '=' (~('\r'|'\n'))*;
CARRIAGE_RETURN
: ('\r'|'\n') + {$channel=HIDDEN;}
;
LINE_COMMENT
: '#' ~('\r'|'\n')* ('\r'|'\n'|EOF)
;
Is there any good suggestion if I can implement a better one?
Thanks a lot
It's not as easy as that. You can't handle much at the lexing level because many things depend on a certain context. So at the lexing level, you can only match single characters and construct key and values in parser rules. Also, the = and : as possible key-value separators and the fact that these characters can be the start of a value, makes them a pain in the butt to translate into a grammar. The easiest would be to include these (possible) separator chars in your value-rule and after matching the separator and value together, strip the separator chars from it.
A small demo:
JavaProperties.g
grammar JavaProperties;
parse
: line* EOF
;
line
: Space* keyValue
| Space* Comment eol
| Space* LineBreak
;
keyValue
: key separatorAndValue eol
{
// Replace all escaped `=` and `:`
String k = $key.text.replace("\\:", ":").replace("\\=", "=");
// Remove the separator, if it exists
String v = $separatorAndValue.text.replaceAll("^\\s*[:=]\\s*", "");
// Remove all escaped line breaks with trailing spaces
v = v.replaceAll("\\\\(\r?\n|\r)[ \t\f]*", "").trim();
System.out.println("\nkey : `" + k + "`");
System.out.println("value : `" + v + "`");
}
;
key
: keyChar+
;
keyChar
: AlphaNum
| Backslash (Colon | Equals)
;
separatorAndValue
: (Space | Colon | Equals) valueChar+
;
valueChar
: AlphaNum
| Space
| Backslash LineBreak
| Equals
| Colon
;
eol
: LineBreak
| EOF
;
Backslash : '\\';
Colon : ':';
Equals : '=';
Comment
: ('!' | '#') ~('\r' | '\n')*
;
LineBreak
: '\r'? '\n'
| '\r'
;
Space
: ' '
| '\t'
| '\f'
;
AlphaNum
: 'a'..'z'
| 'A'..'Z'
| '0'..'9'
;
The grammar above can be tested with the class:
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRFileStream("test.properties");
JavaPropertiesLexer lexer = new JavaPropertiesLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
JavaPropertiesParser parser = new JavaPropertiesParser(tokens);
parser.parse();
}
}
and the input file:
test.properties
key1 = value 1
key2:value 2
key3 :value3
ke\:\=y4=v\
a\
l\
u\
e 4
key\=5==
key6 value6
to produce the following output:
key : `key1`
value : `value 1`
key : `key2`
value : `value 2`
key : `key3`
value : `value3`
key : `ke:=y4`
value : `value 4`
key : `key=5`
value : `=`
key : `key6`
value : `value6`
Realize that my grammar is just an example: it does not account for all valid properties files (sometimes backslashes should be ignored, there's no Unicode escapes, many characters are missing in the key and value). For a complete specification of properties files, see:
http://download.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29