ANTLR String interpolation - antlr

I'm trying to write an ANTLR grammar that parses string interpolation expressions such as:
my.greeting = "hello ${your.name}"
The error I get is:
line 1:31 token recognition error at: 'e'
line 1:34 no viable alternative at input '<EOF>'
MyParser.g4:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
program: variable EQ expression EOF;
expression: (string | variable);
variable: (VAR DOT)? VAR;
string: (STRING_SEGMENT_END expression)* STRING_END;
MyLexer.g4:
lexer grammar MyLexer;
START_STR: '"' -> more, pushMode(STRING_MODE) ;
VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
EQ: '=';
DOT: '.';
WHITE_SPACE: (SPACE | NEW_LINE | TAB)+ -> skip;
fragment DIGIT: '0'..'9';
fragment LOWERCASE: 'a'..'z';
fragment UPPERCASE: 'A'..'Z';
fragment ANY_CHAR: LOWERCASE | UPPERCASE | DIGIT;
fragment NEW_LINE: '\n' | '\r' | '\r\n';
fragment SPACE: ' ';
fragment TAB: '\t';
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
mode STRING_MODE;
STRING_END: '"' -> popMode;
STRING_SEGMENT_END: '${' -> pushMode(INTERPOLATION_MODE);
TEXT : ~["$]+ -> more ;
Expressions like the following work fine:
my.greeting = "hello"
my.greeting = "hello ${} world"
Any ideas what I might be doing wrong?

Instead of:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
you could try:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR* -> type(VAR);
I_DOT: '.' -> type(DOT);
...
variable: (VAR DOT)? VAR;

Ok, I've worked out (inspired by this) that I need to define the default lexer rules again in the INTERPOLATION_MODE:
MyLexer.g4:
...
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
MyParser.g4:
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
...
This seems overkill though, so still holding out for someone with a better answer.

String interpolation also implemented in existing C# and PHP grammars in official ANTLR grammars repository.

Related

Antlr4 mismatched input '<' expecting '<' with (seemingly) no lexer ambiguity

I cannot seem to figure out what antlr is doing here in this grammar. I have a grammar that should match an input like:
i,j : bool;
setvar : set<bool>;
i > 5;
j < 10;
But I keep getting an error telling me that "line 3:13 mismatched input '<' expecting '<'". This tells me there is some ambiguity in the lexer, but I only use '<' in a single token.
Here is the grammar:
//// Parser Rules
grammar MLTL1;
start: block*;
block: var_list ';'
| expr ';'
;
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type ;
type: BASE_TYPE
| KW_SET REL_LT BASE_TYPE REL_GT
;
expr: expr REL_OP expr
| '(' expr ')'
| IDENTIFIER
| INT
;
//// Lexical Spec
// Types
BASE_TYPE: 'bool'
| 'int'
| 'float'
;
// Keywords
KW_SET: 'set' ;
// Op groups for precedence
REL_OP: REL_EQ | REL_NEQ | REL_GT | REL_LT
| REL_GTE | REL_LTE ;
// Relational ops
REL_EQ: '==' ;
REL_NEQ: '!=' ;
REL_GT: '>' ;
REL_LT: '<' ;
REL_GTE: '>=' ;
REL_LTE: '<=' ;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
INT
: SIGN? NONZERODIGIT DIGIT*
| '0'
;
fragment
SIGN
: [+-]
;
fragment
DIGIT
: [0-9]
;
fragment
NONZERODIGIT
: [1-9]
;
fragment
LETTER
: [a-zA-Z_]
;
COMMENT : '#' ~[\r\n]* -> skip;
WS : [ \t\r\n]+ -> channel(HIDDEN);
I tested the grammar to see what tokens it is generating for the test input above using this python:
from antlr4 import InputStream, CommonTokenStream
import MLTL1Lexer
import MLTL1Parser
input="""
i,j : bool;
setvar: set<bool>;
i > 5;
j < 10;
"""
lexer = MLTL1Lexer.MLTL1Lexer(InputStream(input))
stream = CommonTokenStream(lexer)
stream.fill()
tokens = stream.getTokens(0,100)
for t in tokens:
print(str(t.type) + " " + t.text)
parser = MLTL1Parser.MLTL1Parser(stream)
parse_tree = parser.start()
print(parse_tree.toStringTree(recog=parser))
And noticed that both '>' and '<' were assigned the same token value despite being two different tokens. Am I missing something here?
(There may be more than just these two instances, but...)
Change REL_OP and BASE_TYPE to parser rules (i.e. make them lowercase.
As you've used them, you're turning many of your intended Lexer rules, effectively into fragments.
I't important to understand that tokens are the "atoms" you have in your grammar, when you combine several of them into another Lexer rule, you just make that the token type.
(If you used grun to dump the tokens you would have seen them identified as REL_OP tokens.
With the changes below, your sample input works just fine.
grammar MLTL1
;
start: block*;
block: var_list ';' | expr ';';
var_list: IDENTIFIER (',' IDENTIFIER)* ':' type;
type: baseType | KW_SET REL_LT baseType REL_GT;
expr: expr rel_op expr | '(' expr ')' | IDENTIFIER | INT;
//// Lexical Spec
// Types
baseType: 'bool' | 'int' | 'float';
// Keywords
KW_SET: 'set';
// Op groups for precedence
rel_op: REL_EQ | REL_NEQ | REL_GT | REL_LT | REL_GTE | REL_LTE;
// Relational ops
REL_EQ: '==';
REL_NEQ: '!=';
REL_GT: '>';
REL_LT: '<';
REL_GTE: '>=';
REL_LTE: '<=';
IDENTIFIER: LETTER (LETTER | DIGIT)*;
INT: SIGN? NONZERODIGIT DIGIT* | '0';
fragment SIGN: [+-];
fragment DIGIT: [0-9];
fragment NONZERODIGIT: [1-9];
fragment LETTER: [a-zA-Z_];
COMMENT: '#' ~[\r\n]* -> skip;
WS: [ \t\r\n]+ -> channel(HIDDEN);

String interpolation: is it possible without adding members?

I would like to create an Antlr parser for custom language and decided to pick a simple calculator as an example. In my new grammar it should be possible to define a string, like this:
s = "Hello, I am a string"
and handle string interpolation.
Text in double quotes enclosed in persent should be treated as interpolated, e.g.
s = "Hello, did you know that %2 + 2% is 4?"
Double percent sign should not be processed, e.g.
s = "He wants 50%% of this deal."
But at the same time my calculator should support modulus operation:
x = 5 % 2
So far, I was able to craft a Lexer/Grammar, which could switch mode and parse simple strings, here they are:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
LPAREN : '(' ;
RPAREN : ')' ;
SINGLE_PERCENT_POP: '%' -> popMode;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(DEFAULT_MODE);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
But only thing doesn't work and I'm not sure if it ever possible to implement without custom code (members) is modulus operation:
x = 5 % 2
There is no way I can ask Anltr to check for previous mode and safely pop mode.
But I hope my understanding is wrong and there is some way to treat % sign as operator in default mode?
I have found several sources for inspiration, probably they would help you as well:
Parsing string interpolation in ANTLR
ANTLR String interpolation
Parsing String Interpolations with ANTLR4
String interpolation and lexer modes
Murphy's law for StackOverflow: you will find an answer to your own question after several minutes you post detailed question to SO.
Instead of switching to DEFAULT_MODE, I should create separate one - STRING_INTERPOLATION. This way I have to define separate tokens for this mode, which will let use % sign in normal mode (and prohibit in interpolated).
Here is Lexer and Grammar which works for me:
lexer grammar CalcLexer;
EQ: '=';
PLUS: '+';
MINUS: '-';
MULT: '*';
DIV: '/';
MOD: '%';
LPAREN : '(' ;
RPAREN : ')' ;
ID : F_ID;
INT : F_INT;
fragment F_ID: [a-zA-Z]+ ;
fragment F_INT: [0-9]+ ;
OPEN_DOUBLE_QUOTE: '"' -> pushMode(STRING_MODE);
NEWLINE:'\r'? '\n' ;
WS : [ \t]+ -> skip;
mode STRING_MODE;
DOUBLE_PERCENT: '%%';
SINGLE_PERCENT: '%' -> pushMode(STRING_INTERPOLATION);
TEXT: ~('%'|'\n'|'"')+;
CLOSE_DOUBLE_QUOTE: '"' -> popMode;
mode STRING_INTERPOLATION;
SINGLE_PERCENT_POP: '%' -> popMode;
I_PLUS: PLUS -> type(PLUS);
I_MINUS: MINUS -> type(MINUS);
I_MULT: MULT -> type(MULT);
I_DIV: DIV -> type(DIV);
I_MOD: MOD -> type(MOD);
I_LPAREN: LPAREN -> type(LPAREN);
I_RPAREN: RPAREN -> type(RPAREN);
I_ID : F_ID -> type(ID);
I_INT : F_INT -> type(INT);
WS1 : [ \t]+ -> skip;
and
parser grammar CalcGrammar;
options { tokenVocab=CalcLexer; } // use tokens from CalcLexer.g4
prog: stat+ ;
stat: expr NEWLINE
| ID EQ (expr|text) NEWLINE
| NEWLINE
;
text: OPEN_DOUBLE_QUOTE content* CLOSE_DOUBLE_QUOTE;
content: DOUBLE_PERCENT | TEXT | SINGLE_PERCENT expr SINGLE_PERCENT_POP;
expr: expr (MULT|DIV|MOD) expr
| expr (PLUS|MINUS) expr
| INT
| ID
| LPAREN expr RPAREN
;
I hope this would help someone. Probably, future me.

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.

ANTLR4 stop parsing section of file and put it to a accessible buffer (string)

I would like to have grammar that would have it's structure strictly defined, but part of the structure should not be parsed by my grammar but put into some sort of a buffer (string) for later use.
My grammar looks like this:
grammar RSL;
rsl: sectionStructs? sectionProgram;
sectionProgram: 'section' 'program' '{' '}';
sectionStructs: 'section' 'structs' '{' structDef+ '}';
sectionName: ID;
structDef: 'struct' ID '{' varDef+ '}' ';';
varDef: ID ID ';';
ID: [a-zA-Z_][a-zA-Z_\-0-9]*;
WS : [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
And my wish is to have this sort of parsing going on:
section structs {
struct TestStruct {
int var1;
float var2;
...
};
struct Struct2 {
int var1;
...
};
}
section program {
// Do not parse anything that would be in this section
// just store it in a buffer for later use.
}
So all contents of section program should be stored in a string for a later use and no grammar rules should apply to program.
What is the best way of approaching this problem?
Thanks!
One way would be to create a lexer rule that matches this section program { ... }:
grammar RSL;
rsl
: sectionStructs? SECTION_PROGRAM EOF
;
sectionStructs
: 'section' 'structs' '{' structDef+ '}'
;
structDef
: 'struct' ID '{' varDef+ '}' ';'
;
varDef
: ID ID ';'
;
SECTION
: 'section'
;
ID
: [a-zA-Z_][a-zA-Z_\-0-9]*
;
SECTION_PROGRAM
: 'section' S+ 'program' S* BLOCK
;
WS
: S+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
fragment BLOCK
: '{' ( ~[{}] | BLOCK )* '}'
;
fragment S
: [ \t\r\n]
;
which would parse your input as follows:
Of course, if your language allows for things like string literals, you would also need to account for that in the fragment BLOCK rule.

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;