ANTLR: how to get a smaller tree in this case

I am printing out the type field and text field from an AST tree based on my grammar and I get this
type=5 text=and
type=14 text==
type=4 text=ALIAS
type=20 text=a
type=7 text=ATTR_NAME
type=20 text=column_b
type=36 text=STR_VAL
type=35 text="asdfds"
type=14 text==
type=4 text=ALIAS
type=20 text=a
type=7 text=ATTR_NAME
type=20 text=yyyy
type=12 text=DEC_VAL
type=11 text=564.555
Valid type ints in my generated lexer are
public static final int EOF=-1;
public static final int ALIAS=4;
public static final int AND=5;
public static final int ATTR_NAME=7;
public static final int DECIMAL=11;
public static final int DEC_VAL=12;
public static final int EQ=14;
public static final int ID=20;
public static final int STR_VAL=36;
I would very much like to never have type=20 in the tree. Instead, I want to move the nodes with type 20 up one level, so that the text is the actual information (not the token name) and the type is 4 or 7, that is, ALIAS or ATTR_NAME. Is there a way to do this?
This part of my current grammar uses the imaginary tokens ATTR_NAME and ALIAS, like so (comment if I need to post more of my grammar, but I think this is enough to solve it):
primaryExpr
: compExpr
| inExpr
| parameterExpr
| attribute
;
parameterExpr
: attribute (EQ | NE | GT | LT | GE | LE)^ parameter
| aliasdAttribute (EQ | NE | GT | LT | GE | LE)^parameter
;
compExpr
: attribute (EQ | NE | GT | LT | GE | LE)^ value
| aliasdAttribute(EQ | NE | GT | LT | GE | LE)^value
;
alias
: ID
;
inExpr : attribute IN^ valueList
;
attribute: ID -> ^(ATTR_NAME ID);
aliasdAttribute
: alias(DOT)(ID) -> ^(ALIAS alias ) ^(ATTR_NAME ID)
;

Is there a way to do this?
Sure.
The alias rule in grammar T:
grammar T;
options {
output=AST;
}
tokens {
ALIAS;
}
alias
: ID -> ALIAS[$ID.text]
;
ID : ('a'..'z' | 'A'..'Z')+;
will always produce (rewrite) a token with type ALIAS, but with inner text the same as the ID token.
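To make the resulting tree shape concrete, here is a minimal plain-Java sketch of what the question asks for: an ALIAS or ATTR_NAME node that keeps its own type but carries the text of its former ID child. SketchNode and CollapseSketch are hypothetical stand-ins written for illustration, not ANTLR classes; the type numbers are the ones listed in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy AST node: a plain-Java stand-in, not ANTLR's CommonTree. */
final class SketchNode {
    final int type;
    final String text;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(int type, String text) {
        this.type = type;
        this.text = text;
    }

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

final class CollapseSketch {
    // Token type values copied from the question's generated lexer.
    static final int ALIAS = 4, ATTR_NAME = 7, ID = 20;

    /**
     * Returns a copy of the tree in which every ALIAS/ATTR_NAME node with a
     * single ID child becomes one leaf: parent's type, child's text.
     */
    static SketchNode collapse(SketchNode n) {
        boolean wrapper = (n.type == ALIAS || n.type == ATTR_NAME)
                && n.children.size() == 1
                && n.children.get(0).type == ID;
        if (wrapper) {
            return new SketchNode(n.type, n.children.get(0).text);
        }
        SketchNode copy = new SketchNode(n.type, n.text);
        for (SketchNode child : n.children) {
            copy.add(collapse(child));
        }
        return copy;
    }
}
```

With the rewrite rule above, the parser builds this shape directly, so no post-processing pass like collapse is needed; the sketch only shows the before and after.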

Related

Dealing with too many terminal nodes in grammar

I'm trying to write a parser for protobuf3 using the grammar from https://github.com/antlr/grammars-v4/blob/master/protobuf3/Protobuf3.g4,
and I'm trying to deal with the type_ declaration in my grammar:
field
: ( REPEATED )? type_ fieldName EQ fieldNumber ( LB fieldOptions RB )? SEMI
;
type_
: DOUBLE
| FLOAT
| INT32
| INT64
| UINT32
| UINT64
| SINT32
| SINT64
| FIXED32
| FIXED64
| SFIXED32
| SFIXED64
| BOOL
| STRING
| BYTES
| messageDefinition
| enumType
;
Inside enterField I have this snippet:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
MessageDefinition messageDefinition = this.messageStack.peek();
Field field = new Field();
field.setName(ctx.fieldName().ident().getText());
field.setPosition(ctx.fieldNumber().getAltNumber());
messageDefinition.addField(field);
super.enterField(ctx);
}
However I'm not sure on how I can deal with the type_ context here. It has too many terminal nodes (for basic types) and it could have a messageType or an enumType.
For my use case all I care about is if it is a basic type (and in that case get the type name) or if it is a complex type (such as another message or enum) get the definition name.
Is there a way to do this without having to check each possible outcome of ctx.type_()?
Thank you
If both messageDefinition and enumType return a single lexer token, you can make the entire access very easy by using a label:
type_
: value = DOUBLE
| value = FLOAT
| value = INT32
| value = INT64
| value = UINT32
| value = UINT64
| value = SINT32
| value = SINT64
| value = FIXED32
| value = FIXED64
| value = SFIXED32
| value = SFIXED64
| value = BOOL
| value = STRING
| value = BYTES
| value = messageDefinition
| value = enumType
;
With that you only need to use the field value:
@Override
public void enterField(Protobuf3Parser.FieldContext ctx) {
...
String type = ctx.type_().value.getText();
...
super.enterField(ctx);
}
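Once the type text is available as a string, the basic-versus-complex decision from the question can be a single set lookup. The sketch below is only an illustration: FieldTypeSketch and isBasicType are hypothetical names, and it assumes the string comes from something like the labeled value above (or the text of the type_ context) and that the keywords are spelled as in proto3 source files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class FieldTypeSketch {
    // The proto3 scalar type keywords, matching the alternatives of type_.
    private static final Set<String> BASIC_TYPES = new HashSet<>(Arrays.asList(
            "double", "float", "int32", "int64", "uint32", "uint64",
            "sint32", "sint64", "fixed32", "fixed64", "sfixed32", "sfixed64",
            "bool", "string", "bytes"));

    /** True for a scalar keyword; anything else is treated as a message or enum name. */
    static boolean isBasicType(String typeText) {
        return BASIC_TYPES.contains(typeText);
    }
}
```

Inside enterField this would be called with the type text, and the same string doubles as the definition name when the result is false.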

Implement if-else, for and while loops, and logical statements in ANTLR

I am new to ANTLR and I am trying to implement if-else, for and while loops, and logical operators, but I am not able to do so. Can anyone help me with this? Below is what I have done.
grammar BasForCCAL;
@header {
package basforccal;
import java.util.HashMap;
import java.util.Scanner;
}
@lexer::header{
package basforccal;
}
@members{
String programName;
HashMap memory = new HashMap();
public void checkName(String endName){
if(!endName.equals(programName)){
System.out.println("Wrong Program name in end of the program");
}
}
}
program : start programbody end;
start :'PROGRAM' ID {programName = $ID.text ; System.out.println("Checking program :"+$ID.text);};
programbody
: (devcar|ID'='(expr|CHAR)| ctrlStmt)*;
devcar : initInt var1|
intFloat var1|
intChar var1 ;
initInt : 'INT'
;
intFloat
: 'FLOAT'
;
intChar: 'CHAR';
var1 : idname (',' var1)* ;
idname : ID {Integer v = (Integer)memory.get($ID.text);
if(v!=null)
{System.err.println("Error: "+$ID.text+" already defined line:"+$ID.getLine());}
else
{memory.put($ID.text,new Integer('1'));}
}
;
expr
: (multExpr |'('expr')')
( '+' multExpr
| '-' multExpr
| '/' multExpr
| '*' multExpr
)*
;
logiExpr
: expr relOpr expr;
relOpr
: '<'
| '>'
| '<>'
| '<='
| '>='
;
ctrlStmt
: 'IF''('logiExpr')' 'THEN' (stat)+ 'ENDIF'
| 'WHILE''('logiExpr')' 'DO' (stat)+ 'ENDDO'
| 'FOR' ID '=' expr 'TO' expr 'LOOP' stat+ 'ENDLOOP';
stat
: ctrlStmt|multExpr
| ID '=' (expr|CHAR);
multExpr
: ID {
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ){}
else System.err.println("undefined variable "+$ID.text);
}
| INT
| FLOAT
;
end
: 'END' ID '.' {checkName($ID.text);};
My Java code to check it.
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import java.io.IOException;
public class AntlrParser {
public static void main(String args[]) throws IOException, RecognitionException {
basforccal.BasForCCALLexer lexer = new basforccal.BasForCCALLexer(new ANTLRFileStream(args[0]));
CommonTokenStream token = new CommonTokenStream(lexer);
basforccal.BasForCCALParser parser = new basforccal.BasForCCALParser(token);
parser.program();
}
}
Below is the program in a file(prog1.bfcc) which I am trying to check using my Java code.
PROGRAM TESTIF
FLOAT A,B,C
A=1.0
C=1.0
IF(A>1.0)THEN
B=2.0
ENDIF
IF(B*C<=10)THEN
IF(A>0.0)THEN
C=5.0
ENDIF
ENDIF=
IF(3=4)THEN
A=1.0
B=2.0
C=3.0
ENDIF
END TESTIF.
Below is the error which I am getting while checking it from JAVA.
Checking program :TESTIF
C:\Users\vivek\IdeaProjects\BasForCCal\prog1.bfcc line 16:4 mismatched input '=' expecting set null
Process finished with exit code 0
You have ENDIF= in your input, which looks suspicious. It should probably be ENDIF without the =. This is what the error message is trying to tell you.
Also, there is IF(3=4)THEN in your input, but your relOpr does not include the = operator. You should probably add it:
relOpr
: '='
| '<'
| '>'
| '<>'
| '<='
| '>='
;

ANTLR3 - Decision can match input using multiple alternatives

When running ANTLR3 on the following code, I get the message - warning(200): MYGRAMMAR.g:40:36: Decision can match input such as "QMARK" using multiple alternatives: 3, 4
As a result, alternative(s) 4 were disabled for that input.
The warning message is pointing me to postfixExpr. Is there a way to fix this?
grammar MYGRAMMAR;
options {language = C;}
tokens {
BANG = '!';
COLON = ':';
FALSE_LITERAL = 'false';
GREATER = '>';
LSHIFT = '<<';
MINUS = '-';
MINUS_MINUS = '--';
PLUS = '+';
PLUS_PLUS = '++';
QMARK = '?';
QMARK_COLON = '?:';
TILDE = '~';
TRUE_LITERAL = 'true';
}
condExpr
: shiftExpr (QMARK condExpr COLON condExpr)? ;
shiftExpr
: addExpr ( shiftOp addExpr)* ;
addExpr
: qmarkColonExpr ( addOp qmarkColonExpr)* ;
qmarkColonExpr
: prefixExpr ( QMARK_COLON prefixExpr )? ;
prefixExpr
: ( prefixOrUnaryMinus | postfixExpr) ;
prefixOrUnaryMinus
: prefixOp prefixExpr ;
postfixExpr
: primaryExpr ( postfixOp | BANG | QMARK )*;
primaryExpr
: literal ;
shiftOp
: ( LSHIFT | rShift);
addOp
: (PLUS | MINUS);
prefixOp
: ( BANG | MINUS | TILDE | PLUS_PLUS | MINUS_MINUS );
postfixOp
: (PLUS_PLUS | MINUS_MINUS);
rShift
: (GREATER GREATER)=> a=GREATER b=GREATER {assertNoSpace($a,$b)}? ;
literal
: ( TRUE_LITERAL | FALSE_LITERAL );
assertNoSpace [pANTLR3_COMMON_TOKEN t1, pANTLR3_COMMON_TOKEN t2]
: { $t1->line == $t2->line && $t1->getCharPositionInLine($t1) + 1 == $t2->getCharPositionInLine($t2) }? ;
I think one problem is that PLUS_PLUS and MINUS_MINUS will never be matched, because they are defined after the respective PLUS and MINUS tokens. Therefore the lexer will always output two PLUS tokens instead of one PLUS_PLUS token.
To avoid this, define the PLUS_PLUS and MINUS_MINUS tokens before the PLUS and MINUS tokens: the lexer processes them in the order they are defined and won't look any further once it has found a way to match the current input.
The same problem applies to QMARK_COLON, since it is defined after QMARK (this is only a problem because there is another token type, COLON, that can match the following colon).
See if fixing these ambiguities resolves the warning.
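The ordering effect described here can be reproduced with a toy scanner that always takes the first literal that matches. This is a plain-Java model of that first-match behaviour, not ANTLR's generated lexer, so it is worth checking against the real lexer output for the grammar; FirstMatchSketch and scan are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class FirstMatchSketch {
    /**
     * Scans the input left to right. At each position the first literal in
     * the list that matches is taken; later literals are never tried.
     */
    static List<String> scan(String input, List<String> literalsInOrder) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String matched = null;
            for (String literal : literalsInOrder) {
                if (input.startsWith(literal, pos)) {
                    matched = literal;
                    break; // first match wins
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("no literal matches at position " + pos);
            }
            tokens.add(matched);
            pos += matched.length();
        }
        return tokens;
    }
}
```

Under this model, scanning "++" with the literals ordered as "+", "++" yields two "+" tokens, while the order "++", "+" yields a single "++" token.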

Why does the parser split the command name into different nodes

I have the statement:
=MYFUNCTION_NAME(1,2,3)
My grammar is:
grammar Expression;
options
{
language=CSharp3;
output=AST;
backtrack=true;
}
tokens
{
FUNC;
PARAMS;
}
@parser::namespace { Expression }
@lexer::namespace { Expression }
public
parse : ('=' func )*;
func : funcId '(' formalPar* ')' -> ^(FUNC funcId formalPar);
formalPar : (par ',')* par -> ^(PARAMS par+);
par : INT;
funcId : complexId+ ('_'? complexId+)*;
complexId
: ID+
| ID+DIGIT+ ;
ID : ('a'..'z'|'A'..'Z'|'а'..'я'|'А'..'Я')+;
DIGIT : ('0'..'9')+;
INT : '-'? ('0'..'9')+;
In the tree I get:
[**FUNC**]
|
[MYFUNCTION] [_] [NAME] [**PARAMS**]
Why does the parser split the function's name into 3 nodes: "MYFUNCTION", "_", "NAME"? How can I fix it?
The division is always performed based on tokens. Since the ID token cannot contain an _ character, the result is 3 separate tokens that are handled later by the funcId grammar rule. To create a single node for your function name, you'll need to create a lexer rule that can match the input MYFUNCTION_NAME as a single token.
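The effect of the token definition can be seen outside ANTLR with two regular expressions. The sketch below is a rough analogy using java.util.regex, not a generated lexer: the first pattern mirrors an identifier rule without underscores, the second is one possible identifier shape that allows them. TokenSplitSketch and both pattern names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenSplitSketch {
    // Letters only, with '_' as its own token: mirrors an ID rule that excludes '_'.
    static final Pattern LETTERS_ONLY = Pattern.compile("[a-zA-Z]+|_");
    // Letters with embedded underscores: the whole name is one token.
    static final Pattern WITH_UNDERSCORE = Pattern.compile("[a-zA-Z]+(?:_[a-zA-Z]+)*");

    /** Collects every successive match of the pattern in the input. */
    static List<String> tokenize(String input, Pattern tokenPattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For the input MYFUNCTION_NAME, the first pattern gives three tokens and the second gives one, which is the difference between three tree nodes and a single function-name node.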

What's the correct way to add new tokens (rewrite) to create AST nodes that are not on the input stream

I've a pretty basic math expression grammar for ANTLR here and what's of interest is handling the implied * operator between parentheses e.g. (2-3)(4+5)(6*7) should actually be (2-3)*(4+5)*(6*7).
Given the input (2-3)(4+5)(6*7), I'm trying to add the missing * operator to the AST while parsing. I think I've managed to achieve that in the following grammar, but I'm wondering if this is the correct, most elegant way.
grammar G;
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ADD = '+' ;
SUB = '-' ;
MUL = '*' ;
DIV = '/' ;
OPARN = '(' ;
CPARN = ')' ;
}
start
: expression EOF!
;
expression
: mult (( ADD^ | SUB^ ) mult)*
;
mult
: atom (( MUL^ | DIV^) atom)*
;
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN expression CPARN -> ^(MUL expression)+
)*
;
INTEGER : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;};
This grammar appears to output the correct AST in ANTLRWorks.
I'm only just starting to get to grips with parsing and ANTLR and don't have much experience, so feedback is really appreciated!
Thanks in advance! Carl
First of all, you did a great job given the fact that you've never used ANTLR before.
You can omit the language=Java and ASTLabelType=CommonTree, which are the default values. So you can just do:
options {
output=AST;
}
Also, you don't have to specify the root node for each operator separately. So you don't have to do:
(ADD^ | SUB^)
but the following:
(ADD | SUB)^
will suffice. With only two operators, there's not much difference, but when implementing relational operators (>=, <=, > and <), the latter is a bit easier.
Now, for your AST: you'll probably want to create a binary tree. That way, all internal nodes are operators and the leaves are operands, which makes actually evaluating your expressions much easier. To get a binary tree, you'll have to change your atom rule slightly:
atom
: INTEGER
| (
OPARN expression CPARN -> expression
)
(
OPARN e=expression CPARN -> ^(MUL $atom $e)
)*
;
which produces the following AST given the input "(2-3)(4+5)(6*7)":
(image produced by: graphviz-dev.appspot.com)
The DOT file was generated with the following test-class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
GLexer lexer = new GLexer(new ANTLRStringStream("(2-3)(4+5)(6*7)"));
GParser parser = new GParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
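The nesting produced by the loop with -> ^(MUL $atom $e) can be pictured as a left fold: each new parenthesized expression becomes the right child of a fresh MUL root whose left child is the tree built so far. The following plain-Java sketch models only that folding step on strings in LISP-like notation; ImpliedMulSketch and fold are hypothetical names and no ANTLR classes are involved.

```java
import java.util.List;

final class ImpliedMulSketch {
    /**
     * Folds the operands left to right, wrapping the tree built so far and
     * the next operand under a new "*" root on every step.
     */
    static String fold(List<String> operands) {
        String tree = operands.get(0);
        for (int i = 1; i < operands.size(); i++) {
            tree = "(* " + tree + " " + operands.get(i) + ")";
        }
        return tree;
    }
}
```

Folding the three subtrees (- 2 3), (+ 4 5) and (* 6 7) gives (* (* (- 2 3) (+ 4 5)) (* 6 7)): every internal node is an operator with exactly two children.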