Yield a modified token in ANTLR4
I have a lexer rule like the following:
Identifier
: [a-zA-Z0-9_.]+
| '`' Identifier '`'
;
When I match an identifier such as `someone`, I'd like to strip the backticks and yield a different token, i.e. someone.
Of course, I could walk through the final token stream afterwards, but is it possible to do it during lexing?
If I understand correctly, given the input (file t.text):
one `someone`
two `fred`
tree `henry`
you would like tokens to be produced automatically as if the grammar contained the lexer rules:
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
ID : [a-zA-Z0-9_.]+ ;
But tokens are identified by a type, i.e. an integer, not by the name of the lexer rule. You can change this type with setType():
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::members { int next_number = 1001; }
question
@init {System.out.println("Question last update 1117");}
: expr+ EOF
;
expr
: ID BACKTICK_ID
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`' { setType(next_number); next_number+=1; } ;
WS : [ \r\n\t] -> skip ;
Execution:
$ grun Question question -tokens -diagnostics t.text
[#0,0:2='one',<ID>,1:0]
[#1,4:12='`someone`',<1001>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='`fred`',<1002>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='`henry`',<1003>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1117
line 1:4 mismatched input '`someone`' expecting BACKTICK_ID
line 2:4 mismatched input '`fred`' expecting BACKTICK_ID
line 3:5 mismatched input '`henry`' expecting BACKTICK_ID
The basic types come from the lexer rules:
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
the others from setType(). Instead of incrementing a number for each token, you could store the tokens found in a table and, before creating a new one, check the table to see whether it already exists, so that duplicate tokens don't receive different numbers.
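A minimal sketch of that table idea, assuming the Java target (the seen map and the typeFor() helper are illustrative names of mine, not part of the answer above):
@lexer::members {
    int next_number = 1001;
    java.util.Map<String, Integer> seen = new java.util.HashMap<String, Integer>();
    int typeFor(String text) {
        Integer t = seen.get(text);
        if (t == null) {           // first time this backticked name is seen:
            t = next_number++;     // allocate a fresh type number
            seen.put(text, t);
        }
        return t;                  // same name, same type from then on
    }
}
BACKTICK_ID : '`' ID '`' { setType(typeFor(getText())); } ;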
In any case, you can't do anything useful with these tokens in the parser, because parser rules need to know the type numbers in advance.
If you have a set of names known in advance, you can list them in a tokens statement:
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::header {
import java.util.*;
}
tokens { SOMEONE, FRED, HENRY }
@lexer::members {
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("someone", QuestionParser.SOMEONE);
put("fred", QuestionParser.FRED);
put("henry", QuestionParser.HENRY);
}};
}
question
@init {System.out.println("Question last update 1746");}
: expr+ EOF
;
expr
: ID SOMEONE
| ID FRED
| ID HENRY
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`'
{ String textb = getText();
String texta = textb.substring(1, textb.length() - 1);
System.out.println("text before=" + textb + ", text after="+ texta);
if ( keywords.containsKey(texta)) {
setType(keywords.get(texta)); // reset token type
setText(texta); // remove backticks
}
}
;
WS : [ \r\n\t] -> skip ;
Execution:
$ grun Question question -tokens -diagnostics t.text
text before=`someone`, text after=someone
text before=`fred`, text after=fred
text before=`henry`, text after=henry
[#0,0:2='one',<ID>,1:0]
[#1,4:12='someone',<4>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='fred',<5>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='henry',<6>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1746
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
SOMEONE=4
FRED=5
HENRY=6
As you can see, there are no more errors, because the expr rule is happy with properly identified tokens. Even though there are no lexer rules
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
and only ID and BACKTICK_ID exist, the types have been defined behind the scenes by the tokens statement:
public static final int
ID=1, BACKTICK_ID=2, WS=3, SOMEONE=4, FRED=5, HENRY=6;
I'm afraid that if you want an arbitrary, open-ended list of names, it's not possible, because the parser works with types, not with the names of lexer rules:
public static class ExprContext extends ParserRuleContext {
public TerminalNode ID() { return getToken(QuestionParser.ID, 0); }
public TerminalNode SOMEONE() { return getToken(QuestionParser.SOMEONE, 0); }
public TerminalNode FRED() { return getToken(QuestionParser.FRED, 0); }
public TerminalNode HENRY() { return getToken(QuestionParser.HENRY, 0); }
...
public final ExprContext expr() throws RecognitionException {
try { ...
setState(17);
case 1:
enterOuterAlt(_localctx, 1);
{
setState(11);
match(ID);
setState(12);
match(SOMEONE);
}
break;
In
match(SOMEONE);
SOMEONE is a constant representing the number 4.
If you don't have a list of known names, emit() will not solve your problem either, because it creates a Token whose most important field is _type:
public Token emit() {
Token t = _factory.create(_tokenFactorySourcePair, _type, _text, _channel, _tokenStartCharIndex, getCharIndex()-1,
_tokenStartLine, _tokenStartCharPositionInLine);
emit(t);
return t;
}
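That said, if the goal is only to strip the backticks while keeping a single BACKTICK_ID type for every such name, a setText() action alone is enough. A minimal sketch, using the same technique as the second grammar above but without the keyword map:
BACKTICK_ID
    : '`' ID '`'
      { // keep the BACKTICK_ID type, just remove the surrounding backticks
        String t = getText();
        setText(t.substring(1, t.length() - 1));
      }
    ;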
Related
Why parser splits command name into different nodes
I have the statement: =MYFUNCTION_NAME(1,2,3)
My grammar is:
grammar Expression;
options { language=CSharp3; output=AST; backtrack=true; }
tokens { FUNC; PARAMS; }
@parser::namespace { Expression }
@lexer::namespace { Expression }
public parse : ('=' func)* ;
func : funcId '(' formalPar* ')' -> ^(FUNC funcId formalPar);
formalPar : (par ',')* par -> ^(PARAMS par+);
par : INT;
funcId : complexId+ ('_'? complexId+)*;
complexId : ID+ | ID+ DIGIT+ ;
ID : ('a'..'z'|'A'..'Z'|'а'..'я'|'А'..'Я')+;
DIGIT : ('0'..'9')+;
INT : '-'? ('0'..'9')+;
In the resulting tree I get:
[FUNC] | [MYFUNCTION] [_] [NAME] [PARAMS]
Why does the parser split the function's name into 3 nodes, "MYFUNCTION", "_", "NAME"? How can I fix it?
The division is always performed based on tokens. Since the ID token cannot contain an _ character, the result is 3 separate tokens that are handled later by the funcId grammar rule. To create a single node for your function name, you'll need to create a lexer rule that can match the input MYFUNCTION_NAME as a single token.
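For instance, a single lexer rule along these lines could keep the whole name as one token (a sketch only; the FUNC_ID name and the exact character set are assumptions, not part of the original answer):
FUNC_ID
 : ('a'..'z'|'A'..'Z'|'а'..'я'|'А'..'Я')
   ('a'..'z'|'A'..'Z'|'а'..'я'|'А'..'Я'|'0'..'9'|'_')*   // underscores allowed inside the name
 ;
The funcId parser rule could then be reduced to funcId : FUNC_ID ; so the whole name reaches the parser as a single node.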
ANTLR Variable Troubles
In short: how do I implement dynamic variables in ANTLR? I come to you again with a basic ANTLR question. I have this grammar:
grammar Amethyst;
options { language = Java; }
@header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
@lexer::header { package org.omer.amethyst.generated; }
@members { HashMap memory = new HashMap(); }
begin: expr;
expr: (defun | println)* ;
println: 'println' atom {System.out.println($atom.value);} ;
defun: 'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
     | 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
     ;
atom returns [Object value]:
      INT {$value = Integer.parseInt($INT.text);}
    | ID {
        Object v = memory.get($ID.text);
        if (v != null) $value = v;
        else System.err.println("undefined variable " + $ID.text);
      }
    | STRING_LITERAL {
        String v = (String) memory.get($STRING_LITERAL.text);
        if (v != null) $value = String.valueOf(v);
        else System.err.println("undefined variable " + $STRING_LITERAL.text);
      }
    ;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a..z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables. When "defun" is called on either a string or an integer, the value is put into the "memory" HashMap, with the first parameter being the variable's name and the second being its value. When println is called on an atom, it should display the atom's value. The atom can be either a string or an integer; it gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"
null
Does anyone know why this is so? Sorry if I'm not being clear; I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error: line 3:8 no viable alternative at input 'greeting'
Because the input "greeting" is being tokenized as a VAR, and a VAR is not an atom. So the input
defun greeting "Hello world!"
is properly matched by the 2nd alternative of the defun rule:
defun
 : 'defun' VAR INT            // 1st alternative
 | 'defun' VAR STRING_LITERAL // 2nd alternative
 ;
but the input
println greeting
cannot be matched by the println rule:
println
 : 'println' atom
 ;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. The input "greeting" will always be tokenized as a VAR, never as an ID. What you need to do is remove the ID rule from the lexer and replace ID with VAR inside your parser rules.
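A sketch of what that change looks like (only the ID alternative of atom is replaced by VAR; everything else stays as in the question):
atom returns [Object value]
 : INT {$value = Integer.parseInt($INT.text);}
 | VAR   // was: ID
   {
     Object v = memory.get($VAR.text);
     if (v != null) $value = v;
     else System.err.println("undefined variable " + $VAR.text);
   }
 | STRING_LITERAL
   {
     String v = (String) memory.get($STRING_LITERAL.text);
     if (v != null) $value = String.valueOf(v);
     else System.err.println("undefined variable " + $STRING_LITERAL.text);
   }
 ;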
How can I build an ANTLR Works style parse tree?
I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is: how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet, say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output: (tree screenshot not reproduced here)
What I don't understand is how you can get the 'idList' node on top of this tree (as well as the grammar one, as a matter of fact). How can I reproduce this tree using rewrites and those operators?
What I don't understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens. A quick demo:
grammar test;
options {
  output=AST;
  ASTLabelType=CommonTree;
}
tokens { IdList; Id; }
@parser::members {
  private static void walk(CommonTree tree, int indent) {
    if(tree == null) return;
    for(int i = 0; i < indent; i++, System.out.print(" "));
    System.out.println(tree.getText());
    for(int i = 0; i < tree.getChildCount(); i++) {
      walk((CommonTree)tree.getChild(i), indent + 1);
    }
  }
  public static void main(String[] args) throws Exception {
    testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
    testParser parser = new testParser(new CommonTokenStream(lexer));
    walk((CommonTree)parser.program().getTree(), 0);
  }
}
program : idList EOF -> idList;
idList  : id* -> ^(IdList id*);
id      : ID -> ^(Id ID);
ID      : LETTER (LETTER | DIGIT)*;
SPACE   : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT  : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
 Id
  abc
 Id
  abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id     : ID -> ^(Id["id"] ID);
which will print:
idList
 id
  abc
 id
  abc123
variable not passed to predicate method in ANTLR
The Java code generated by ANTLR is usually one method per rule. But for the following rule:
switchBlockLabels[ITdcsEntity _entity, TdcsMethod _method, List<IStmt> _preStmts]
 : ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel[_entity, _method, _preStmts]* switchDefaultLabel? switchCaseLabel*)
 ;
it generates a sub-method named synpred125_TreeParserStage3_fragment(), in which the method switchCaseLabel(_entity, _method, _preStmts) is called:
synpred125_TreeParserStage3_fragment(){
  ......
  switchCaseLabel(_entity, _method, _preStmts); // variable not found error
  ......
}
switchBlockLabels(ITdcsEntity _entity, TdcsMethod _method, List<IStmt> _preStmts){
  ......
  synpred125_TreeParserStage3_fragment();
  ......
}
The problem is that switchCaseLabel has parameters, and the parameters come from the parameters of the switchBlockLabels() method, so a "variable not found" error occurs. How can I solve this problem?
My guess is that you've enabled global backtracking in your grammar like this:
options { backtrack=true; }
in which case you can't pass parameters to ambiguous rules. In order to communicate between ambiguous rules when you have enabled global backtracking, you must use rule scopes. The "predicate methods" do have access to rule scope variables.
A demo
Let's say we have this ambiguous grammar:
grammar Scope;
options { backtrack=true; }
parse : atom+ EOF ;
atom : numberOrName+ ;
numberOrName : Number | Name ;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
(for the record, the atom+ and numberOrName+ make it ambiguous)
If you now want to pass information between the parse and numberOrName rules, say an integer n, something like this will fail (which is the way you tried it):
grammar Scope;
options { backtrack=true; }
parse
@init {int n = 0;}
 : (atom[++n])+ EOF
 ;
atom[int n] : (numberOrName[n])+ ;
numberOrName[int n]
 : Number {System.out.println(n + " = " + $Number.text);}
 | Name   {System.out.println(n + " = " + $Name.text);}
 ;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
In order to do this using rule scopes, you could do it like this:
grammar Scope;
options { backtrack=true; }
parse
scope {int n; /* define the scoped variable */ }
@init {$parse::n = 0; /* important: initialize the variable! */ }
 : atom+ EOF
 ;
atom : numberOrName+ ;
numberOrName
/* increment and print the scoped variable from the parse rule */
 : Number {System.out.println(++$parse::n + " = " + $Number.text);}
 | Name   {System.out.println(++$parse::n + " = " + $Name.text);}
 ;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Test
If you now run the following class:
import org.antlr.runtime.*;
public class Main {
  public static void main(String[] args) throws Exception {
    String src = "foo 42 Bar 666";
    ScopeLexer lexer = new ScopeLexer(new ANTLRStringStream(src));
    ScopeParser parser = new ScopeParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}
you will see the following being printed to the console:
1 = foo
2 = 42
3 = Bar
4 = 666
P.S. I don't know what language you're parsing, but enabling global backtracking is usually overkill and can have quite an impact on the performance of your parser. Computer languages are often ambiguous in just a few cases. Instead of enabling global backtracking, you really should look into adding syntactic predicates, or enabling backtracking on just those rules that are ambiguous. See The Definitive ANTLR Reference for more info.
Antlr backtrack option not working
I am not sure, but I think the ANTLR backtrack option is not working properly or something... Here is my grammar:
grammar Test;
options { backtrack=true; memoize=true; }
prog: (code)+;
code : ABC   {System.out.println("ABC");}
     | OTHER {System.out.println("OTHER");}
     ;
ABC : 'ABC';
OTHER : .;
If the input stream is "ABC" then I'll see ABC printed. If the input stream is "ACD" then I'll see OTHER printed three times. But if the input stream is "ABD" then I'll see
line 1:2 mismatched character 'D' expecting 'C'
line 1:3 required (...)+ loop did not match anything at input ''
but I expect to see OTHER three times, since the input should match the second rule if the first rule fails. That doesn't make any sense. Why didn't the parser backtrack when it saw that the last character was not 'C'? However, it was OK with "ACD". Could someone please help me solve this issue? Thanks for your time!
The option backtrack=true applies to parser rules only, not lexer rules.
EDIT
The only work-around I am aware of is letting "AB" followed by some char other than "C" be matched in the same ABC rule and then manually emitting the other tokens. A demo:
grammar Test;
@lexer::members {
  List<Token> tokens = new ArrayList<Token>();
  public void emit(int type, String text) {
    state.token = new CommonToken(type, text);
    tokens.add(state.token);
  }
  public Token nextToken() {
    super.nextToken();
    if(tokens.size() == 0) {
      return Token.EOF_TOKEN;
    }
    return tokens.remove(0);
  }
}
prog : code+ ;
code : ABC   {System.out.println("ABC");}
     | OTHER {System.out.println("OTHER");}
     ;
ABC
 : 'ABC'
 | 'AB' t=~'C'
   {
     emit(OTHER, "A");
     emit(OTHER, "B");
     emit(OTHER, String.valueOf((char)$t));
   }
 ;
OTHER : . ;
Another, possibly simpler, solution: I made use of "syntactic predicates".
grammar ABC;
@lexer::header {package org.inanme.antlr;}
@parser::header {package org.inanme.antlr;}
prog: (code)+ EOF;
code: ABC   {System.out.println($ABC.text);}
    | OTHER {System.out.println($OTHER.text);};
ABC : ('ABC') => 'ABC'
    | 'A';
OTHER : .;