Inline comments and empty line in antlr4 grammar - antlr

please can anyone explain me, what i need to change i this grammar to support inline comments (such as // some text) and empty line (which contains any number of space characters). I write following grammar, but this doesn't work.
program: line* EOF ;
line: (expression | assignment) (NEWLINE | EOF);
assignment : VARIABLE '=' expression ;
expression : '(' expression ')' #parenthesisExpression
| '-' expression #unaryExpression
| left=expression OP1 right=expression #firstPriorityExpression
| left=expression OP2 right=expression #secondPriorityExpression
| number=NUMBER #numericExpression
| variable=VARIABLE #variableExpression
;
NUMBER : [0-9]+ ;
VARIABLE : [a-zA-Z][a-zA-Z0-9]* ;
OP1 : '*' | '/' ;
OP2 : '+' | '-' ;
NEWLINE : '\r'? '\n' ;
WHITESPACE : [ \t\r]+ -> skip ;
COMMENT : '//' ~[\n\r]* -> skip ;

The fact you added - in a parser rule as a literal token, and also made OP2 match this character causes OP2 to never match a -. You need to have a lexer rule that matches only the single minus sign (as I showed earlier):
op1
: MUL
| DIV
;
op2
: ADD
| MIN
;
...
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
MIN : '-' ;
and then use MIN in your unary alternative:
...
| MIN expression #unaryExpression
...
When you have a separate MIN : '-' ; rule, you could do this:
...
| '-' expression #unaryExpression
...
because now ANTLR "knows" you mean the rule that matches a single -, but ANTLR does not "know" this when you have a lexer rule that matches a either a - or + like your OP2 rule:
OP2 : '+' | '-' ;

Related

ANTLR4 - How to close "longest-match-wins" and use first match rule?

Orignial question:
My code to parse:
N100G1M4
What I expcted: N100 G1 M4
But ANTLR can not idetify this because ANTLR always match longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-Code txt and get keywords from a file stream, which is usually used to control a machine and drive motors to move.
The G-Code rule is :
// Define a grammar called Hello
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression ('+' | '-') multiplyingExpression
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces, tabs, newlines
Sample program(it works well except the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 to
-> N100 G1 X100 Z[VAR1+100]
-> or it will be better to split the node X100 to two subnode X 100:
I am trying to use ANTLR, but ANTLR always take the rule "longest match wins". N100G1X100 is identified to a word.
Append question:
What's the best tool to finish the task?
ANTLR has a strict separation between pasrer and lexer, and therefor the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
Scannerless Parser Generators

ANTLR4 Grammar - Issue with "dot" in fields and extended expressions

I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this :-
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above :-
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the "dots" in Fields.V1 and Fields.V2 are missing from the text and also there is a <missing ')' Error node. I believe I should somehow make ANTLR understand that an expression can also have fields with dot operators.
A question on top of this :-
(Var1)(Var2)
ANTLR is not throwing me error for this above scenario , the expressions should not be (Var1)(Var2) -- It should always have the operator (var1)*(var2) or (var1)+(var2) etc. The parser error tree is not generating this error. How should the grammar be modified to make sure even this scenario is taken into consideration.
To recognize IDs like Fields.V1, change you Lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow numbers in IDs
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
You also didn't add ID as a valid expr alternative
To handle detection of the error condition in (Var1)(Var2), you need Mike's advice to add the EOF Lexer rule to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)

How to detect an expression result is unused in an interpreted programming language?

I'm working on a simple procedural interpreted scripting language, written in Java using ANTLR4. Just a hobby project. I have written a few DSLs using ANTLR4 and the lexer and parser presented no real problems. I got quite a bit of the language working by interpreting directly from the parse tree but that strategy, apart from being slow, started to break down when I started to add functions.
So I've created a stack-based virtual machine, based on Chapter 10 of "Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages". I have an assembler for the VM that works well and I'm now trying to make the scripting language generate assembly via an AST.
Something I can't quite see is how to detect when an expression or function result is unused, so that I can generate a POP instruction to discard the value from the top of the operand stack.
I want things like assignment statements to be expressions, so that I can do things like:
x = y = 1;
In the AST, the assignment node is annotated with the symbol (the lvalue) and the rvalue comes from visiting the children of the assignment node. At the end of the visit of the assignment node, the rvalue is stored into the lvalue, and this is reloaded back into the operand stack so that it can be used as an expression result.
This generates ( for x = y = 1):
CLOAD 1 ; Push constant value
GSTOR y ; Store into global y and pop
GLOAD y ; Push value of y
GSTOR x ; Store into global x and pop
GLOAD x ; Push value of x
But it needs a POP instruction at the end to discard the result, otherwise the operand stack starts to grow with these unused results. I can't see the best way of doing this.
I guess my grammar could be flawed, which is preventing me seeing a solution here.
grammar g;
// ----------------------------------------------------------------------------
// Parser
// ----------------------------------------------------------------------------
parse
: (functionDefinition | compoundStatement)*
;
functionDefinition
: FUNCTION ID parameterSpecification compoundStatement
;
parameterSpecification
: '(' (ID (',' ID)*)? ')'
;
compoundStatement
: '{' compoundStatement* '}'
| conditionalStatement
| iterationStatement
| statement ';'
;
statement
: declaration
| expression
| exitStatement
| printStatement
| returnStatement
;
declaration
: LET ID ASSIGN expression # ConstantDeclaration
| VAR ID ASSIGN expression # VariableDeclaration
;
conditionalStatement
: ifStatement
;
ifStatement
: IF expression compoundStatement (ELSE compoundStatement)?
;
exitStatement
: EXIT
;
iterationStatement
: WHILE expression compoundStatement # WhileStatement
| DO compoundStatement WHILE expression # DoStatement
| FOR ID IN expression TO expression (STEP expression)? compoundStatement # ForStatement
;
printStatement
: PRINT '(' (expression (',' expression)*)? ')' # SimplePrintStatement
| PRINTF '(' STRING (',' expression)* ')' # PrintFormatStatement
;
returnStatement
: RETURN expression?
;
expression
: expression '[' expression ']' # Indexed
| ID DEFAULT expression # DefaultValue
| ID op=(INC | DEC) # Postfix
| op=(ADD | SUB | NOT) expression # Unary
| op=(INC | DEC) ID # Prefix
| expression op=(MUL | DIV | MOD) expression # Multiplicative
| expression op=(ADD | SUB) expression # Additive
| expression op=(GT | GE | LT | LE) expression # Relational
| expression op=(EQ | NE) expression # Equality
| expression AND expression # LogicalAnd
| expression OR expression # LogicalOr
| expression IF expression ELSE expression # Ternary
| ID '(' (expression (',' expression)*)? ')' # FunctionCall
| '(' expression ')' # Parenthesized
| '[' (expression (',' expression)* )? ']' # LiteralArray
| ID # Identifier
| NUMBER # LiteralNumber
| STRING # LiteralString
| BOOLEAN # LiteralBoolean
| ID ASSIGN expression # SimpleAssignment
| ID op=(CADD | CSUB | CMUL | CDIV) expression # CompoundAssignment
| ID '[' expression ']' ASSIGN expression # IndexedAssignment
;
// ----------------------------------------------------------------------------
// Lexer
// ----------------------------------------------------------------------------
fragment
IDCHR : [A-Za-z_$];
fragment
DIGIT : [0-9];
fragment
ESC : '\\' ["\\];
COMMENT : '#' .*? '\n' -> skip;
// ----------------------------------------------------------------------------
// Keywords
// ----------------------------------------------------------------------------
DO : 'do';
ELSE : 'else';
EXIT : 'exit';
FOR : 'for';
FUNCTION : 'function';
IF : 'if';
IN : 'in';
LET : 'let';
PRINT : 'print';
PRINTF : 'printf';
RETURN : 'return';
STEP : 'step';
TO : 'to';
VAR : 'var';
WHILE : 'while';
// ----------------------------------------------------------------------------
// Operators
// ----------------------------------------------------------------------------
ADD : '+';
DIV : '/';
MOD : '%';
MUL : '*';
SUB : '-';
DEC : '--';
INC : '++';
ASSIGN : '=';
CADD : '+=';
CDIV : '/=';
CMUL : '*=';
CSUB : '-=';
GE : '>=';
GT : '>';
LE : '<=';
LT : '<';
AND : '&&';
EQ : '==';
NE : '!=';
NOT : '!';
OR : '||';
DEFAULT : '??';
// ----------------------------------------------------------------------------
// Literals and identifiers
// ----------------------------------------------------------------------------
BOOLEAN : ('true'|'false');
NUMBER : DIGIT+ ('.' DIGIT+)?;
STRING : '"' (ESC | .)*? '"';
ID : IDCHR (IDCHR | DIGIT)*;
WHITESPACE : [ \t\r\n] -> skip;
ANYCHAR : . ;
So my question is where is the usual place to detect unused expression results, i.e. when expressions are used as plain statements? Is it something I should detect during the parse, then annotate the AST node? Or is this better done when visiting the AST for code generation (assembly generation in my case)? I just can't see where best to do it.
IMO it's not a question of the right grammar, but how you process the AST/parse tree. The fact if a result is used or not could be determined by checking the siblings (and parent siblings etc.). An assignment for instance is made of the lvalue, the operator and the rvalue, hence when you determined the rvalue, check the previous tree node sibling if that is an operator. Similarly you can check if the parent is a parentheses expression (for nested function calls, grouping etc.).
statement
: ...
| expression
If you label this case with # ExpressionStatement, you can generate a pop after every expression statement by overriding exitExpressionStatement() in the listener or visitExpressionStatement in the visitor.

How to create a rule to match 2^3 to create a power operator?

Given that I have the following grammar how would I add a rule to match something like 2^3 to create a power operator?
negation : '!'* term ;
unary : ('+'!|'-'^)* negation ;
mult : unary (('*' | '/' | ('%'|'mod') ) unary)* ;
add : mult (('+' | '-') mult)* ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (('&&' | '||') relation)* ;
// LEXER ================================================================
HEX_NUMBER : '0x' HEX_DIGIT+;
fragment
FLOAT: ;
INTEGER : DIGIT+ ({input.LA(1)=='.' && input.LA(2)>='0' && input.LA(2)<='9'}?=> '.' DIGIT+ {$type=FLOAT;})? ;
fragment
HEX_DIGIT : (DIGIT|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9') ;
What I have tried:
I tried something like power : ('+' | '-') unary'^' unary but that doesn't seem to work.
I also tried mult : unary (('*' | '/' | ('%'|'mod') | '^' ) unary)* ; but that doesn't work either.
To give ^ higher precedence than negation, do this:
pow : term ('^' term)* ;
negation : '!' negation | pow ;
unary : ('+'! | '-'^)* negation ;
If you want to consider the right-associativity already in the grammar, you can also use recursion:
pow : term ('^'^ pow)?
;
negation : '!'* pow;
...

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
parse 1
parse 2
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.