ANTLR how to create a "rest of it" token - antlr

I have realy simple DSL defined in ANTLR like this.
grammar Transformer;
fragment Digit : [0-9];
Amp:'\'';
Left:'(';
Right: ')';
Comma: ',';
Id: [A-Za-z][a-zA-Z0-9]+;
Int: '-'? Digit+;
WS: [\n\r\t]+ ->skip;
FuncStart: '>';
DataStart: '#';
parse: (datainput | function)+;
qoutedtext: Amp .*? Amp;
datainput: DataStart Id;
function: FuncStart Id Left param (Comma param)* Right;
param: (datainput|function|qoutedtext|Int);
When parsing this text
#Id;>ToUpper(#Name);ThisShouldEndUpAsAToken>FillLeft(#EmpNo,20,'abc')
This is the "tree" i get:
The tree looks as expecte, except that I am not able to catch the ThisShouldEndUpAsAToken tekst as a token.
I know that I do not have any parse in the grammer that should do that now, but I'm not able to figure out how to do it.
HEEELP :)

How about changing your parse rule like this:
parse: (datainput | function | Id)+;
(Your test input is sprinkled with ; that shouldn't parse. Are you sure that's the input you're parsing?)

Related

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Grammar to negate two like characters in a lexer rule inside a single quoted string

ANLTR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and have even seen a similar approach work in Bart's answer in this post...
Negating inside lexer- and parser rules
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
if I alter the string literal token rule to the following it works...
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'* ;
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.

ANTLR String interpolation

I'm trying to write an ANTLR grammar that parses string interpolation expressions such as:
my.greeting = "hello ${your.name}"
The error I get is:
line 1:31 token recognition error at: 'e'
line 1:34 no viable alternative at input '<EOF>'
MyParser.g4:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
program: variable EQ expression EOF;
expression: (string | variable);
variable: (VAR DOT)? VAR;
string: (STRING_SEGMENT_END expression)* STRING_END;
MyLexer.g4:
lexer grammar MyLexer;
START_STR: '"' -> more, pushMode(STRING_MODE) ;
VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
EQ: '=';
DOT: '.';
WHITE_SPACE: (SPACE | NEW_LINE | TAB)+ -> skip;
fragment DIGIT: '0'..'9';
fragment LOWERCASE: 'a'..'z';
fragment UPPERCASE: 'A'..'Z';
fragment ANY_CHAR: LOWERCASE | UPPERCASE | DIGIT;
fragment NEW_LINE: '\n' | '\r' | '\r\n';
fragment SPACE: ' ';
fragment TAB: '\t';
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
mode STRING_MODE;
STRING_END: '"' -> popMode;
STRING_SEGMENT_END: '${' -> pushMode(INTERPOLATION_MODE);
TEXT : ~["$]+ -> more ;
Expressions like the following work fine:
my.greeting = "hello"
my.greeting = "hello ${} world"
Any ideas what I might be doing wrong?
Instead of:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
you could try:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR* -> type(VAR);
I_DOT: '.' -> type(DOT);
...
variable: (VAR DOT)? VAR;
Ok, I've worked out (inspired by this) that I need to define the default lexer rules again in the INTERPOLATION_MODE:
MyLexer.g4:
...
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
MyParser.g4:
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
...
This seems overkill though, so still holding out for someone with a better answer.
String interpolation also implemented in existing C# and PHP grammars in official ANTLR grammars repository.

antlr4 grammar errors when parsing

I have the following grammar:
grammar Token;
prog: (expr NL?)+ EOF;
expr: '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value: Text | INT;
INT : '0' | [1-9] [0-9]*;
//WS : [ \t]+;
WS : [ \t\n\r]+ -> skip ;
NL: '\r'? '\n';
Text : ~[\]\[\n\r"]+ ;
and the text I need to parse is something like this below
[TXT:look at me!]
[USR:19700]
[TXT:, can I go there?]
[ENC:124124]
[TXT:this is needed for you to go...]
I need to split this text but I getting some errors when I run grun.bat Token prog -gui -trace -diagnostics
enter prog, LT(1)=[
enter expr, LT(1)=[
consume [#0,0:0='[',<3>,1:0] rule expr
enter type, LT(1)=TXT:look at me!
enter typeid, LT(1)=TXT:look at me!
line 1:1 mismatched input 'TXT:look at me!' expecting {'TXT', 'ENC', 'USR'}
... much more ...
what is wrong with my grammar? please, help me!
You must understand that the tokens are not created based on what the parser is trying to match. The lexer tries to match as much characters as possible (independently from that parser!): your Text token should be defined differently.
You could let the Text rule become a parser rule instead, and match single char tokens like this:
grammar Token;
prog : expr+ EOF;
expr : '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value : text | INT;
text : CHAR+;
INT : '0' | [1-9] [0-9]*;
WS : [ \t\n\r]+ -> skip ;
CHAR : ~[\[\]\r\n];

Special character handling in ANTLR lexer

I wrote the following grammar for string variable declaration. Strings are defined like anything between single quotes, but there must be a way to add a single quote to the string value by escaping using $ letter.
grammar test;
options
{
language = Java;
}
tokens
{
VAR = 'VAR';
END_VAR = 'END_VAR';
}
var_declaration: VAR string_type_declaration END_VAR EOF;
string_type_declaration: identifier ':=' string;
identifier: ID;
string: STRING_VALUE;
STRING_VALUE: '\'' ('$\''|.)* '\'';
ID: LETTER+;
WSFULL:(' ') {$channel=HIDDEN;};
fragment LETTER: (('a'..'z') | ('A'..'Z'));
This grammar doesn't work, if you try to run this code for var_declaration rule:
VAR A :='$12.2' END_VAR
I get MismatchedTokenException.
But this code works fine for string_type_declaration rule:
A :='$12.2'
Your STRING_VALUE isn't properly tokenized. Inside the loop ( ... )*, the $ expects a single quote after it, but the string in your input, '$12.2', doesn't have a quote after $. You should make the single quote optional ('$' '\''? | .)*. But now your alternative in the loop, the ., will also match a single quote: better let it match anything other than a single quote and $:
STRING_VALUE
: '\'' ( '$' '\''? | ~('$' | '\'') )* '\''
;
resulting in the following parse tree: