Special character handling in ANTLR lexer - antlr

I wrote the following grammar for string variable declaration. Strings are defined like anything between single quotes, but there must be a way to add a single quote to the string value by escaping using $ letter.
grammar test;
options
{
language = Java;
}
tokens
{
VAR = 'VAR';
END_VAR = 'END_VAR';
}
var_declaration: VAR string_type_declaration END_VAR EOF;
string_type_declaration: identifier ':=' string;
identifier: ID;
string: STRING_VALUE;
STRING_VALUE: '\'' ('$\''|.)* '\'';
ID: LETTER+;
WSFULL:(' ') {$channel=HIDDEN;};
fragment LETTER: (('a'..'z') | ('A'..'Z'));
This grammar doesn't work, if you try to run this code for var_declaration rule:
VAR A :='$12.2' END_VAR
I get MismatchedTokenException.
But this code works fine for string_type_declaration rule:
A :='$12.2'

Your STRING_VALUE isn't properly tokenized. Inside the loop ( ... )*, the $ expects a single quote after it, but the string in your input, '$12.2', doesn't have a quote after $. You should make the single quote optional ('$' '\''? | .)*. But now your alternative in the loop, the ., will also match a single quote: better let it match anything other than a single quote and $:
STRING_VALUE
: '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
;
resulting in the following parse tree:

Related

R Language: Grammar for Raw Strings

I'm trying to create a new rule in the R grammar for Raw Strings.
Quote of the R news:
There is a new syntax for specifying raw character constants similar
to the one used in C++: r"(...)" with ... any character sequence not
containing the sequence )". This makes it easier to write strings that
contain backslashes or both single and double quotes. For more details
see ?Quotes.
Examples:
## A Windows path written as a raw string constant:
r"(c:\Program files\R)"
## More raw strings:
r"{(\1\2)}"
r"(use both "double" and 'single' quotes)"
r"---(\1--)-)---"
But I'm unsure if a grammar file alone is enough to implement the rule.
Until now I tried something like this as a basis from older suggestions of similar grammars:
Parser:
| RAW_STRING_LITERAL #e42
Lexer:
RAW_STRING_LITERAL
: ('R' | 'r') '"' ( '\\' [btnfr"'\\] | ~[\r\n"]|LETTER )* '"' ;
Any hints or suggestions are appreciated.
R ANTLR Grammar:
https://github.com/antlr/grammars-v4/blob/master/r/R.g4
Original R Grammar in Bison:
https://svn.r-project.org/R/trunk/src/main/gram.y
To match start- and end-delimiters, you will have to use target specific code. In Java that could look like this:
#lexer::members {
boolean closeDelimiterAhead() {
// Get the part between `r"` and `(`
String delimiter = getText().substring(2, getText().indexOf('('));
// Construct the end of the raw string
String stopFor = ")" + delimiter + "\"";
for (int n = 1; n <= stopFor.length(); n++) {
if (this._input.LA(n) != stopFor.charAt(n - 1)) {
// No end ahead yet
return false;
}
}
return true;
}
}
RAW_STRING
: [rR] '"' ~[(]* '(' ( {!closeDelimiterAhead()}? . )* ')' ~["]* '"'
;
which tokenizes r"---( )--" )----" )---" as a single RAW_STRING.
EDIT
And since the delimiters can only consist of hyphens (and parenthesis/braces) and not just any arbitrary character, this should do it as well:
RAW_STRING
: [rR] '"' INNER_RAW_STRING '"'
;
fragment INNER_RAW_STRING
: '-' INNER_RAW_STRING '-'
| '(' .*? ')'
| '{' .*? '}'
| '[' .*? ']'
;

Antlr4 DM string lexer rules

I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:
The string start and end with double quotes. i.e. "hello world" evaluates to hello world
A backslash acts as an escape character, which can escape the end quote. i.e. "hello\"world" evaluates to hello"world
Newlines in the string can be ignored by ending the line with a backslash. i.e. "hello\
world" evaluates to helloworld
If the string opens/closes with the sequence {"/"} respectively, newlines are allowed and entered into the final string. The sequence \\\n is still ignored
The string can contain embedded expressions inside braces which are formatted into the result. Backslashes can escape the opening brace. i.e. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the braces (calls, math, etc...)
If the starting quote/curly brace is prefixed with '#' escape sequences and embedded expressions are disabled for the string. i.e. #{"hello [worl\d"} and #"hello [worl\d" both evaluate to hello [worl\d
I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there's 4 (or more) token types I'd need:
Normal string. i.e "hello world", #"hello world", #{"hello world"} or {"hello world"}
String start before embedded expression. i.e. "hello [ or {"hello [
String end after embedded expression. i.e. ] world" or ] world"}
String in between two embedded expressions. i.e. ] hello world [
Here are my (incomplete and unsuccessful) attempts:
LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"';
CSTRING: ']' ('\\[' | ~[[\r\n])* '[';
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
If this can't be solved in the lexer, I can write the parser rules on my own with the tokens #, {", "}, [, ], \\, and ". But, I figure I'd give this a shot since it'd be more performant.
I solved it with the following lexer tidbits. Permalink
...
#lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '#"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '#{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
| '\\"'
| '\\['
| '\\\\'
| '\\0'
| '\\a'
| '\\b'
| '\\f'
| '\\n'
| '\\r'
| '\\t'
| '\\v'
;
EMBED_START: '[' -> pushMode(DEFAULT_MODE);
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;
Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat it all anyway.
A lot of it came from reading the C# interpolated strings in the antlr4 examples permalink

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Grammar to negate two like characters in a lexer rule inside a single quoted string

ANLTR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and have even seen a similar approach work in Bart's answer in this post...
Negating inside lexer- and parser rules
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
if I alter the string literal token rule to the following it works...
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'* ;
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTRL throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had similar problem, my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
#header
{
import java.util.*;
}
#parser::members
{
boolean hasEvilCharacters(String input)
{
if (input.contains("*/"))
{
return false;
}
else
{
return true;
}
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token of
all tokens that are neither '*' nor '/'
all tokens that begin with '*' but are not followed with '/' or single '/'
There are some rules that deal with special situations (multiple '' followed by '/', or trailing '')