Why parser splits command name into different nodes - antlr

I have the statement:
=MYFUNCTION_NAME(1,2,3)
My grammar is:
grammar Expression;
options
{
language=CSharp3;
output=AST;
backtrack=true;
}
tokens
{
FUNC;
PARAMS;
}
@parser::namespace { Expression }
@lexer::namespace { Expression }
public
parse : ('=' func )*;
func : funcId '(' formalPar* ')' -> ^(FUNC funcId formalPar);
formalPar : (par ',')* par -> ^(PARAMS par+);
par : INT;
funcId : complexId+ ('_'? complexId+)*;
complexId
: ID+
| ID+DIGIT+ ;
ID : ('a'..'z'|'A'..'Z'|'а'..'я'|'А'..'Я')+;
DIGIT : ('0'..'9')+;
INT : '-'? ('0'..'9')+;
In the tree I get:
[**FUNC**]
|
[MYFUNCTION] [_] [NAME] [**PARAMS**]
Why does the parser split the function's name into 3 nodes: "MYFUNCTION", "_", "NAME"? How can I fix it?

The parser only ever sees the tokens that the lexer produces. Since the ID token cannot contain an _ character, the lexer emits 3 separate tokens, which the funcId grammar rule then matches one by one. To create a single node for your function name, you'll need to create a lexer rule that can match the input MYFUNCTION_NAME as a single token.
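To make the effect concrete, here is a minimal plain-Java sketch of what a lexer does with the two kinds of identifier rule. It is not ANTLR code: the class TokenSplitDemo, the method tokenize and both regular expressions are made up for illustration, and the Cyrillic ranges from the question's ID rule are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: imitates a lexer with a single identifier rule.
final class TokenSplitDemo {

    // Repeatedly takes the longest match of tokenPattern at the current position;
    // any character the pattern cannot match becomes a one-character token.
    static List<String> tokenize(String input, Pattern tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (m.lookingAt() && m.end() > pos) {
                tokens.add(m.group());
                pos = m.end();
            } else {
                tokens.add(String.valueOf(input.charAt(pos)));
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Like the question's ID rule: letters only, so '_' ends the token.
        Pattern lettersOnly = Pattern.compile("[A-Za-z]+");
        // Assumed replacement rule: a letter followed by letters, digits or '_'.
        Pattern withUnderscore = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

        System.out.println(tokenize("MYFUNCTION_NAME", lettersOnly));    // [MYFUNCTION, _, NAME]
        System.out.println(tokenize("MYFUNCTION_NAME", withUnderscore)); // [MYFUNCTION_NAME]
    }
}
```

In the grammar itself, the equivalent change would be a lexer rule for function names that allows '_' inside the token, so that funcId no longer has to stitch several tokens together.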

Related

Yield a modified token in ANTLR4

I have a syntax like the following
Identifier
: [a-zA-Z0-9_.]+
| '`' Identifier '`'
;
When I match an identifier such as `someone`, I'd like to strip the backticks and yield a different token, i.e. someone.
Of course, I could walk through the final token array, but is it possible to do it during token parsing?
If I understand correctly, given the input (file t.text):
one `someone`
two `fred`
tree `henry`
you would like the tokens to be produced automatically, as if the grammar had the lexer rules:
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
ID : [a-zA-Z0-9_.]+ ;
But tokens are identified by a type, i.e. an integer, not by the name of the lexer rule. You can change this type with setType() :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::members { int next_number = 1001; }
question
@init {System.out.println("Question last update 1117");}
: expr+ EOF
;
expr
: ID BACKTICK_ID
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`' { setType(next_number); next_number+=1; } ;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
[#0,0:2='one',<ID>,1:0]
[#1,4:12='`someone`',<1001>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='`fred`',<1002>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='`henry`',<1003>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1117
line 1:4 mismatched input '`someone`' expecting BACKTICK_ID
line 2:4 mismatched input '`fred`' expecting BACKTICK_ID
line 3:5 mismatched input '`henry`' expecting BACKTICK_ID
The basic types come from the lexer rules :
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
The others come from setType. Instead of incrementing a number for each token, you could record the names already seen in a table and look each name up before creating a new type, so that a repeated name does not receive a different number.
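The table idea could look roughly like the following sketch. It is plain Java with no ANTLR dependency; the class TokenTypeTable and its method typeFor are hypothetical names, and the starting value 1001 is simply the one used in the grammar above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: hands out one token type number per distinct name.
final class TokenTypeTable {
    private final Map<String, Integer> types = new LinkedHashMap<>();
    private int nextType;

    TokenTypeTable(int firstType) {
        this.nextType = firstType;
    }

    // Returns the type already assigned to this name, or assigns the next free number.
    int typeFor(String name) {
        Integer existing = types.get(name);
        if (existing != null) {
            return existing;
        }
        int assigned = nextType++;
        types.put(name, assigned);
        return assigned;
    }

    public static void main(String[] args) {
        TokenTypeTable table = new TokenTypeTable(1001);
        System.out.println(table.typeFor("someone")); // 1001
        System.out.println(table.typeFor("fred"));    // 1002
        System.out.println(table.typeFor("someone")); // 1001 again, no new number
    }
}
```

Inside the lexer action, the call would presumably replace the counter, along the lines of setType(table.typeFor(name)) with name being the token text without its backticks.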
Either way, you can do nothing useful with these types in the parser, because parser rules need to know the type number in advance.
If you have a set of names known in advance, you can list them in a tokens statement :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::header {
import java.util.*;
}
tokens { SOMEONE, FRED, HENRY }
@lexer::members {
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("someone", QuestionParser.SOMEONE);
put("fred", QuestionParser.FRED);
put("henry", QuestionParser.HENRY);
}};
}
question
@init {System.out.println("Question last update 1746");}
: expr+ EOF
;
expr
: ID SOMEONE
| ID FRED
| ID HENRY
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`'
{ String textb = getText();
String texta = textb.substring(1, textb.length() - 1);
System.out.println("text before=" + textb + ", text after="+ texta);
if ( keywords.containsKey(texta)) {
setType(keywords.get(texta)); // reset token type
setText(texta); // remove backticks
}
}
;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
text before=`someone`, text after=someone
text before=`fred`, text after=fred
text before=`henry`, text after=henry
[#0,0:2='one',<ID>,1:0]
[#1,4:12='someone',<4>,1:4]
[#2,14:16='two',<ID>,2:0]
[#3,18:23='fred',<5>,2:4]
[#4,25:28='tree',<ID>,3:0]
[#5,30:36='henry',<6>,3:5]
[#6,38:37='<EOF>',<EOF>,4:0]
Question last update 1746
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
SOMEONE=4
FRED=5
HENRY=6
As you can see, there are no more errors, because the expr rule now receives correctly typed tokens. Even though the grammar has no rules
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
but only ID and BACKTICK_ID, the types have been defined behind the scenes by the tokens statement:
public static final int
ID=1, BACKTICK_ID=2, WS=3, SOMEONE=4, FRED=5, HENRY=6;
I'm afraid that an open-ended list of names is not possible, because the parser works with token types, not with the names of lexer rules:
public static class ExprContext extends ParserRuleContext {
public TerminalNode ID() { return getToken(QuestionParser.ID, 0); }
public TerminalNode SOMEONE() { return getToken(QuestionParser.SOMEONE, 0); }
public TerminalNode FRED() { return getToken(QuestionParser.FRED, 0); }
public TerminalNode HENRY() { return getToken(QuestionParser.HENRY, 0); }
...
public final ExprContext expr() throws RecognitionException {
try { ...
setState(17);
case 1:
enterOuterAlt(_localctx, 1);
{
setState(11);
match(ID);
setState(12);
match(SOMEONE);
}
break;
In
match(SOMEONE);
SOMEONE is a constant representing the number 4.
If you don't have a list of known names, emit will not solve your problem because it creates a Token whose most important field is _type :
public Token emit() {
Token t = _factory.create(_tokenFactorySourcePair, _type, _text, _channel, _tokenStartCharIndex, getCharIndex()-1,
_tokenStartLine, _tokenStartCharPositionInLine);
emit(t);
return t;
}

How can I keep concatenated tokens separate during lexing when a more general token is available

The language I'm working on allows certain tokens to be stuck together (eg "intfloat") and I'm looking for a way to have the lexer not turn them into an ID so they're available separately at parse time. The simplest grammar I can come up with that demonstrates it is (WS omitted):
B: 'B';
C: 'C';
ID: ('a'..'z')+;
doc : (B | C | ID)* EOF;
Run against:
bc
abc
bcd
What I'd like out of the lexer:
B C
ID (starts with not-a-keyword so it's an ID)
<error> (cannot concat non-keywords)
But what I get is 3 IDs, as expected.
I have been looking at making the ID not greedy but that degenerates into individual tokens for each character. I suppose I could glue them back together later, but it feels like there should be a better way.
Any thoughts?
Thanks
Here's a start towards a solution, using the lexer to break up the text into tokens. The trick here is that rule ID can emit more than one token per invocation. This is non-standard lexer behavior, so there are some caveats:
I'm confident that this won't work in ANTLR4.
This code assumes all tokens are queued into tokenQueue.
Rule ID doesn't prevent a keyword from repeating, so intintint produces tokens INT INT INT. If that's bad, you'll want to handle that either on the lexer or parser side, depending on which makes more sense in your grammar.
The shorter the keyword, the more fragile this solution becomes. Input internal is an invalid ID because it starts with keyword int but is followed by a non-keyword string.
The grammar produces warnings that I haven't expunged. If you use this code, I recommend attempting to remove them.
Here is the grammar:
MultiToken.g
grammar MultiToken;
@lexer::members{
    private java.util.LinkedList<Token> tokenQueue = new java.util.LinkedList<Token>();
    @Override
    public Token nextToken() {
        Token t = super.nextToken();
        if (tokenQueue.isEmpty()){
            if (t.getType() == Token.EOF){
                return t;
            } else {
                throw new IllegalStateException("All tokens must be queued!");
            }
        } else {
            return tokenQueue.removeFirst();
        }
    }
    public void emit(int ttype, int tokenIndex) {
        //This is lifted from ANTLR's Lexer class,
        //but modified to handle queueing and multiple tokens per rule.
        Token t;
        if (tokenIndex > 0){
            CommonToken last = (CommonToken) tokenQueue.getLast();
            t = new CommonToken(input, ttype, state.channel, last.getStopIndex() + 1, getCharIndex() - 1);
        } else {
            t = new CommonToken(input, ttype, state.channel, state.tokenStartCharIndex, getCharIndex() - 1);
        }
        t.setLine(state.tokenStartLine);
        t.setText(state.text);
        t.setCharPositionInLine(state.tokenStartCharPositionInLine);
        emit(t);
    }
    @Override
    public void emit(Token t){
        super.emit(t);
        tokenQueue.addLast(t);
    }
}
doc : (INT | FLOAT | ID | NUMBER)* EOF;
fragment
INT : 'int';
fragment
FLOAT : 'float';
NUMBER : ('0'..'9')+;
ID
@init {
int index = 0;
boolean rawId = false;
boolean keyword = false;
}
: ({!rawId}? INT {emit(INT, index++); keyword = true;}
| {!rawId}? FLOAT {emit(FLOAT, index++); keyword = true;}
| {!keyword}? ('a'..'z')+ {emit(ID, index++); rawId = true;}
)+
;
WS : (' '|'\t'|'\f'|'\r'|'\n')+ {skip();};
Test Case 1: Mixed Keywords
Input
intfloat a
int b
float c
intfloatintfloat d
Output (Tokens)
[INT : int] [FLOAT : float] [ID : a]
[INT : int] [ID : b]
[FLOAT : float] [ID : c]
[INT : int] [FLOAT : float] [INT : int] [FLOAT : float] [ID : d]
Test Case 2: Ids containing Keywords
Input
aintfloat
bint
cfloat
dintfloatintfloat
Output (Tokens)
[ID : aintfloat]
[ID : bint]
[ID : cfloat]
[ID : dintfloatintfloat]
Test Case 3: Bad Id #1
Input
internal
Output (Tokens & Lexer Error)
[INT : int] [ID : rnal]
line 1:3 rule ID failed predicate: {!keyword}?
Test Case 4: Bad Id #2
Input
floatation
Output (Tokens & Lexer Error)
[FLOAT : float] [ID : tion]
line 1:5 rule ID failed predicate: {!keyword}?
Test Case 5: Non-ID Rules
Input
int x
float 3 float 4 float 5
5 a 6 b 7 int 8 d
Output (Tokens)
[INT : int] [ID : x]
[FLOAT : float] [NUMBER : 3] [FLOAT : float] [NUMBER : 4] [FLOAT : float] [NUMBER : 5]
[NUMBER : 5] [ID : a] [NUMBER : 6] [ID : b] [NUMBER : 7] [INT : int] [NUMBER : 8] [ID : d]
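The splitting behaviour shown in the test cases above can be restated as a small plain-Java sketch, independent of ANTLR. The class PackedKeywordSplitter and its methods are hypothetical, the keyword list is assumed to be just int and float, and a word that starts with a keyword but continues with other text is rejected with an exception, whereas the grammar emits tokens and then reports a lexer error.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits one lower-case word the way the ID rule above does.
final class PackedKeywordSplitter {
    private static final String[] KEYWORDS = {"int", "float"};

    // Returns the keyword that starts at pos, or null if none does.
    private static String keywordAt(String word, int pos) {
        for (String kw : KEYWORDS) {
            if (word.startsWith(kw, pos)) {
                return kw;
            }
        }
        return null;
    }

    // A word is either a run of keywords, or a plain ID that does not start with a keyword.
    static List<String> split(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("empty word");
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            String kw = keywordAt(word, pos);
            if (kw == null) {
                break;
            }
            tokens.add(kw.toUpperCase(Locale.ROOT) + ":" + kw);
            pos += kw.length();
        }
        if (pos == 0) {
            return Collections.singletonList("ID:" + word); // no leading keyword
        }
        if (pos == word.length()) {
            return tokens; // keywords only
        }
        throw new IllegalArgumentException(
                "keyword followed by non-keyword text at index " + pos + ": " + word);
    }

    public static void main(String[] args) {
        System.out.println(split("intfloat"));  // [INT:int, FLOAT:float]
        System.out.println(split("aintfloat")); // [ID:aintfloat]
        System.out.println(split("intintint")); // [INT:int, INT:int, INT:int]
    }
}
```

As in the grammar, repeated keywords are accepted and a short keyword makes many ordinary-looking words (internal, floatation) invalid.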
Here's an almost-all-grammar solution for ANTLR 4 (only requires one small predicate in the target language):
lexer grammar PackedKeywords;
INT : 'int' -> pushMode(Keywords);
FLOAT : 'float' -> pushMode(Keywords);
fragment ID_CHAR : [a-z];
ID_START : ID_CHAR {Character.isLetter(_input.LA(1))}? -> more, pushMode(Identifier);
ID : ID_CHAR;
// these are the other tokens in the grammar
WS : [ \t]+ -> channel(HIDDEN);
Newline : '\r' '\n'? | '\n' -> channel(HIDDEN);
// The Keywords mode duplicates the default mode, except it replaces ID
// with InvalidKeyword. You can handle InvalidKeyword tokens in whatever way
// suits you best.
mode Keywords;
Keywords_INT : INT -> type(INT);
Keywords_FLOAT : FLOAT -> type(FLOAT);
InvalidKeyword : ID_CHAR;
// must include every token which can follow the Keywords mode
Keywords_WS : WS -> type(WS), channel(HIDDEN), popMode;
Keywords_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;
// The Identifier mode is only entered if we know the current token is an
// identifier with >1 characters and which doesn't start with a keyword. This is
// essentially the default mode without keywords.
mode Identifier;
Identifier_ID : ID_CHAR+ -> type(ID);
// must include every token which can follow the Identifiers mode
Identifier_WS : WS -> type(WS), channel(HIDDEN), popMode;
Identifier_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;
This grammar also works in the ANTLRWorks 2 lexer interpreter (coming soon!) for everything except single-character identifiers. Since the lexer interpreter can't evaluate the predicate in ID_START, an input like a<space> will (in the interpreter) produce a single token with text a<space> of type WS on the HIDDEN channel.

ANTLR Parser, need to know which parser rule is matched

In ANTLR, for a given token, is there a way to tell which parser rule is matched?
For example, from the ANTLR grammar:
tokens
{
ADD='Add';
SUB='Sub';
}
fragment
ANYDIGIT : '0'..'9';
fragment
UCASECHAR : 'A'..'Z';
fragment
LCASECHAR : 'a'..'z';
fragment
DATEPART : ('0'..'1') (ANYDIGIT) '/' ('0'..'3') (ANYDIGIT) '/' (ANYDIGIT) (ANYDIGIT) (ANYDIGIT) (ANYDIGIT);
fragment
TIMEPART : ('0'..'2') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT);
SPACE : ' ';
NEWLINE : '\r'? '\n';
TAB : '\t';
FORMFEED : '\f';
WS : (SPACE|NEWLINE|TAB|FORMFEED)+ {$channel=HIDDEN;};
IDENTIFIER : (LCASECHAR|UCASECHAR|'_') (LCASECHAR|UCASECHAR|ANYDIGIT|'_')*;
TIME : '\'' (TIMEPART) '\'';
DATE : '\'' (DATEPART) (' ' (TIMEPART))? '\'';
STRING : '\''! (.)* '\''!;
DOUBLE : (ANYDIGIT)+ '.' (ANYDIGIT)+;
INT : (ANYDIGIT)+;
literal : INT|DOUBLE|STRING|DATE|TIME;
var : IDENTIFIER;
param : literal|fcn_call|var;
fcn_name : ADD |
SUB |
DIVIDE |
MOD |
DTSECONDSBETWEEN |
DTGETCURRENTDATETIME |
APPEND |
STRINGTOFLOAT;
fcn_call : fcn_name WS? '('! WS? ( param WS? ( ','! WS? param)*)* ')'!;
expr : fcn_call WS? EOF;
And in Java:
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
nodes.reset();
Object obj;
while((obj = nodes.nextElement()) != null)
{
if(nodes.isEOF(obj))
{
break;
}
System.out.println(obj);
}
So what I want to know is: at System.out.println(obj), did the node match the fcn_name rule or the var rule?
The reason being, I am trying to handle vars differently than fcn_names.
Add this to your listener/visitor:
String[] ruleNames;
public void loadParser(gramParser parser) { //get parser
ruleNames = parser.getRuleNames(); //load parser rules from parser
}
Call loadParser() from wherever you create your listener/visitor, e.g.:
MyParser parser = new MyParser(tokens);
MyListener listener = new MyListener();
listener.loadParser(parser); //so we can access rule names
Then inside each rule you can get the name of the rule like this:
ruleName = ruleNames[ctx.getRuleIndex()];
No, you cannot get the name of a parser rule (at least, not without an ugly hack ➊).
But if tree is an instance of CommonTree, it means you've already invoked the expr rule of your parser, which means you already know expr matches first (which in its turn matches fcn_name).
➊ On a related note, see: Get active Antlr rule

variable not passed to predicate method in ANTLR

The Java code generated by ANTLR usually has one method per rule. But for the following rule:
switchBlockLabels[ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts]
: ^(SWITCH_BLOCK_LABEL_LIST switchCaseLabel[_entity, _method, _preStmts]* switchDefaultLabel? switchCaseLabel*)
;
it generates a submethod named synpred125_TreeParserStage3_fragment(), in which the method switchCaseLabel(_entity, _method, _preStmts) is called:
synpred125_TreeParserStage3_fragment(){
......
switchCaseLabel(_entity, _method, _preStmts);//variable not found error
......
}
switchBlockLabels(ITdcsEntity _entity,TdcsMethod _method,List<IStmt> _preStmts){
......
synpred125_TreeParserStage3_fragment();
......
}
The problem is that switchCaseLabel takes parameters that come from the parameters of the switchBlockLabels() method, and these are not visible inside the fragment method, so the "variable not found" error occurs.
How can I solve this problem?
My guess is that you've enabled global backtracking in your grammar like this:
options {
backtrack=true;
}
in which case you can't pass parameters to ambiguous rules. In order to communicate between ambiguous rules when you have enabled global backtracking, you must use rule scopes. The "predicate-methods" do have access to rule scopes variables.
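As a rough plain-Java analogy (not ANTLR's generated code, and all names here are invented), the difference is the one between a method parameter and a field: a helper method that takes no arguments cannot see its caller's parameters, but it can read state that the object keeps on a stack. The sketch below assumes a scope holding a single int n.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical analogy of a rule scope: state lives on the object, not in parameters.
final class ScopeAnalogy {
    // Plays the role of "scope { int n; }" on a rule.
    static final class ParseScope {
        int n;
    }

    private final Deque<ParseScope> parseStack = new ArrayDeque<>();
    private final List<String> log = new ArrayList<>();

    // Entering the rule pushes a fresh scope (n starts at 0); leaving it pops the scope.
    List<String> parse(String src) {
        parseStack.push(new ParseScope());
        try {
            for (String token : src.trim().split("\\s+")) {
                numberOrName(token);
            }
        } finally {
            parseStack.pop();
        }
        return log;
    }

    // Receives no counter as a parameter, yet can reach n through the scope stack.
    private void numberOrName(String text) {
        ParseScope scope = parseStack.peek();
        log.add(++scope.n + " = " + text);
    }

    public static void main(String[] args) {
        for (String line : new ScopeAnalogy().parse("foo 42 Bar 666")) {
            System.out.println(line);
        }
    }
}
```

Any method of the object, including one generated without parameters, can call parseStack.peek(), which is the property the predicate methods rely on.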
A demo
Let's say we have this ambiguous grammar:
grammar Scope;
options {
backtrack=true;
}
parse
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName
: Number
| Name
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
(for the record, the atom+ and numberOrName+ make it ambiguous)
If you now want to pass information between the parse and numberOrName rule, say an integer n, something like this will fail (which is the way you tried it):
grammar Scope;
options {
backtrack=true;
}
parse
@init{int n = 0;}
: (atom[++n])+ EOF
;
atom[int n]
: (numberOrName[n])+
;
numberOrName[int n]
: Number {System.out.println(n + " = " + $Number.text);}
| Name {System.out.println(n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
In order to do this using rule scopes, you could do it like this:
grammar Scope;
options {
backtrack=true;
}
parse
scope{int n; /* define the scoped variable */ }
@init{$parse::n = 0; /* important: initialize the variable! */ }
: atom+ EOF
;
atom
: numberOrName+
;
numberOrName /* increment and print the scoped variable from the parse rule */
: Number {System.out.println(++$parse::n + " = " + $Number.text);}
| Name {System.out.println(++$parse::n + " = " + $Name.text);}
;
Number : '0'..'9'+;
Name : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Test
If you now run the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "foo 42 Bar 666";
ScopeLexer lexer = new ScopeLexer(new ANTLRStringStream(src));
ScopeParser parser = new ScopeParser(new CommonTokenStream(lexer));
parser.parse();
}
}
you will see the following being printed to the console:
1 = foo
2 = 42
3 = Bar
4 = 666
P.S.
I don't know what language you're parsing, but enabling global backtracking is usually overkill and can have quite an impact on the performance of your parser. Computer languages often are ambiguous in just a few cases. Instead of enabling global backtracking, you really should look into adding syntactic predicates, or enabling backtracking on those rules that are ambiguous. See The Definitive ANTLR Reference for more info.

How to pass CommonTree parameter to an Antlr rule

I am trying to do what I think is a simple parameter passing to a rule in Antlr 3.3:
grammar rule_params;
options
{
output = AST;
}
rule_params
: outer;
outer: outer_id '[' inner[$outer_id.tree] ']';
inner[CommonTree parent] : inner_id '[' ']';
outer_id : '#'! ID;
inner_id : '$'! ID ;
ID : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' )* ;
So the inner[CommonTree parent] generates the following:
inner4=inner((outer_id2!=null?((Object)outer_id2.tree):null));
resulting in this error:
The method inner(CommonTree) in the type rule_paramsParser is not applicable for the arguments (Object)
As best I can tell, this is exactly the same as the example in the ANTLR book:
classDefinition[CommonTree mod]
(Kindle Location 3993) - sorry I don't know the page number but it is in the middle of the book in chapter 9, section labeled "Creating Nodes with Arbitrary Actions".
Thanks for any help.
M
If you don't explicitly specify the tree to be used in your grammar, .tree (which is short for getTree()) will return a java.lang.Object and a CommonTree will be used as default Tree implementation. To avoid casting, set the type of tree in your options { ... } section:
options
{
output=AST;
ASTLabelType=CommonTree;
}