What is the meaning of the ANTLR syntax in this grammar file? - antlr

I am trying to parse a file using ANTLR4 via Python. I am following a tutorial (https://faun.pub/introduction-to-antlr-python-af8a3c603d23); I am able to execute the code and get responses like the ones shown in the tutorial, but I'm failing to understand the logic of the grammar file.
grammar MyGrammer;
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
From my understanding, The constants (what I call them since they are all capitals) HELLO, BYE, INT, and WS define rules for what that set of text can contain. I think they are relating to functions somehow, but I am not sure. So the HELLO function will be executed if the parser encounters something that says either 'hello' or 'hi'. The expr is what is confusing me.
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
When I run the command
antlr4 -Dlanguage=Python3 MyGrammer.g4 -visitor -o dist
it produces many files but the main one contains InfixExpr, NumberExpr, ParenExpr, HelloExpr, and ByeExpr. I can see that somehow the author knows that he is doing something with the constants HELLO, BYE, etc. Is there any documentation on the expr piece above and what do the keywords atom, left, right mean?

Any rules that begin with a capital letter (often we captilize the entire rule name to make it obvious) is a Lexer rule.
Rules that begin with lower case letters are parser rules.
It’s VERY important to understand the difference and the flow of your input all the way through to a parse tree.
Your input stream of characters is first processed by the Lexer (using the Lexer rules) to produce a stream of tokens for the parser to act upon. It’s important to understand that the parser has NO impact on how the Lexer interprets the input.
When multiple Lexer rules could match you input, two “tie breakers” come into play.
1 - if a rules matches more characters in your input stream than other rules, then that will be the rules used to produce a token.
2 - if there is a tie of multiple Lexer rules matching the same sequence of input characters, then the Lexer rules that appears first in your grammar will be used to generate a token.
Your parser rules are evaluated using a recursive descent approach beginning with whatever startRule you specify. ANTLR uses several techniques to do it’s best to recognize your input, that includes trying alternatives until one is found that matches, ignoring a token (and producing an error) if that allows the parser to continue on, and inserting a missing token (and producing an error) if that allows the parser to continue.
re: the expr portion:
The rule says that there are 6 possible ways to recognize an expr
left=expr op=('*'|'/') right=expr (which will create an InfixExprContext node in the parse tree)
left=expr op=('+'|'-') right=expr (InfixExprContext (also))
atom=INT (NumberExprContext)
'(' expr ')' (ParenExprContext)
atom=HELLO (HelloExprContext)
atom=BYE (ByeExprContext)
The benefit of the labels (ex: # InfixExpr) is that, by creating a Context more specific than an ExprContext) you will have visitInfixExpr, visitNumberExpr, (etc.) methods that you can override in you Visitor instead of just a visitExpr method that contains all the alternatives. A similar thing will result for the enterXX and exitXX methods for your Listener classes.
In the left=expr op=('*'|'/') right=expr rule, the left, op and right names will generate accessors that make it easier to access those parts of you parse tree in you *Context class (without them you'd just have an array of expr, for example and expr[0] would be the first expr and expr[1] would be the second. (It's probably a good idea to look at the generated code with and without the names and labels to see the difference. Both make it MUCH easier to write the logic in your visitor/listeners.

Related

How to make a simple calculator syntax highlighting for IntelliJ?

I'm making a custom language support plugin according to this tutorial and I'm stuck with a few .bnf concepts. Let's say I want to parse a simple calculator language that supports +,-,*,/,unary -, and parentheses. Here's what I currently have:
Flex:
package com.intellij.circom;
import com.intellij.lexer.FlexLexer;
import com.intellij.psi.tree.IElementType;
import com.intellij.circom.psi.CircomTypes;
import com.intellij.psi.TokenType;
%%
%class CircomLexer
%implements FlexLexer
%unicode
%function advance
%type IElementType
%eof{ return;
%eof}
WHITESPACE = [ \n\r\t]+
NUMBER = [0-9]+
%%
{WHITESPACE} { return TokenType.WHITE_SPACE; }
{NUMBER} { return CircomTypes.NUMBER; }
Bnf:
{
parserClass="com.intellij.circom.parser.CircomParser"
extends="com.intellij.extapi.psi.ASTWrapperPsiElement"
psiClassPrefix="Circom"
psiImplClassSuffix="Impl"
psiPackage="com.intellij.circom.psi"
psiImplPackage="com.intellij.circom.psi.impl"
elementTypeHolderClass="com.intellij.circom.psi.CircomTypes"
elementTypeClass="com.intellij.circom.psi.CircomElementType"
tokenTypeClass="com.intellij.circom.psi.CircomTokenType"
}
expr ::=
expr ('+' | '-') expr
| expr ('*' | '/') expr
| '-' expr
| '(' expr ')'
| literal;
literal ::= NUMBER;
First it complains that expr is recursive. How do I rewrite it to not be recursive? Second, when I try to compile and run it, it freezes idea test instance when trying to parse this syntax, looks like an endless loop.
Calling the grammar files "BNF" is a bit misleading, since they are actually modified PEG (parsing expression grammar) format, which allows certain extended operators, including grouping, repetition and optionality, and ordered choice (which is semantically different from the regular definition of |).
Since the underlying technology is PEG, you cannot use left-recursive rules. Left-recursion will cause an infinite loop in the parser, unless the code generator refuses to generate left-recursive code. Fortunately, repetition operators are available so you only need recursion for syntax involving parentheses, and that's not left-recursion so it presents no problem.
As far as I can see from the documentation I found, grammar kit does not provide for operator precedence declarations. If you really need to produce a correct parse taking operator-precedence into account, you'll need to use multiple precedence levels. However, if your only use case is syntax highlighting, you probably do not require a precisely accurate parse, and it would be sufficient to do something like the following:
expr ::= unary (('+' | '-' | '*' | '/') unary)*
unary ::= '-'* ( '(' expr ')' | literal )
(For precise parsing, you'd need to split expr above into two precedence levels, one for additive operators and another for multiplicative. But I suggest not doing that unless you intend to use the parse for evaluation or code-generation.)
Also, you almost certainly require some lexical rule to recognise the various operator characters and return appropriate single character tokens.

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is to have only ID or TEXT tokens and distinguish them with a label, like this :
grammar Question;
question
#init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.