Matching same token type in ANTLR4

Currently the grammar for my vector looks like this; it's a collection of numbers, strings, vectors and identifiers:
vector:
    '[' elements+=vector_members? (vector_delimiters elements+=vector_members)* ']'
;
vector_delimiters:
    ','
;
vector_members:
    NUMBER
    | STRING
    | vector
    | ID
;
Now, is there a way to enforce through the grammar that a vector can contain only elements of a particular type, like numbers or strings?

Sure, there is a way, but that doesn't mean it's a good idea:
vector
    : '[' ']'
    | '[' elements+=NUMBER (vector_delimiters elements+=NUMBER)* ']'
    | '[' elements+=STRING (vector_delimiters elements+=STRING)* ']'
    | '[' elements+=ID (vector_delimiters elements+=ID)* ']'
    | '[' elements+=vector (vector_delimiters elements+=vector)* ']'
    ;
See, that's pretty ugly.
This kind of validation should not be part of the grammar. Build a visitor to check your consistency rules. The code will be simpler, more maintainable, and will respect the separation of concerns principle. Let the parser do the parsing, and do the validation in a later stage. As a bonus, you'll be able to provide better error messages than just unexpected token.
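For example, a checking visitor could look roughly like this (a minimal sketch, assuming the grammar lives in Vectors.g4 and is generated with -visitor; VectorsBaseVisitor, VectorsParser and the elements list are the names ANTLR generates for the rules above):
public class VectorTypeChecker extends VectorsBaseVisitor<Void> {
    @Override
    public Void visitVector(VectorsParser.VectorContext ctx) {
        Integer firstType = null;
        for (VectorsParser.Vector_membersContext member : ctx.elements) {
            int type = member.getStart().getType();   // NUMBER, STRING, ID, or '[' for a nested vector
            if (firstType == null) {
                firstType = type;
            } else if (firstType != type) {
                // report a meaningful error instead of "unexpected token"
                System.err.printf("line %d:%d vector mixes element types%n",
                        member.getStart().getLine(),
                        member.getStart().getCharPositionInLine());
            }
        }
        return visitChildren(ctx);   // keep descending so nested vectors are checked too
    }
}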
As a side note, your initial grammar will accept constructs like this: [ , 42 ]. Your vector rule should rather be:
vector
    : '[' ']'
    | '[' elements+=vector_members (vector_delimiters elements+=vector_members)* ']'
    ;

Related

ANTLR4: parse both integer and float

I'm trying to use ANTLR4 to parse two types of expressions:
pair expressions are a pair of integer or float numbers, like (1,2) or (1.0 , 2.0).
single expressions are a single integer (1).
I designed my grammar like below, but:
if I write INT before NUM, pair expressions with integers like (1, 2) can't be parsed, because a NUM token is expected;
if I write NUM before INT, single expressions like (1) can't be parsed, because an INT token is expected.
grammar Expr;
prog : single | pair ;
single : '(' INT ')' ;
pair : '(' NUM ',' NUM ')' ;
INT : [0-9]+ ;
NUM : INT | FLOAT ;
FLOAT : '-'? INT '.' INT ;
WS : [ \t\r\n] -> skip ;
To make both kinds of expression parse, I can remove the NUM lexer rule and write pair out manually:
pair : '(' INT ',' INT ')'
| '(' INT ',' FLOAT ')'
| '(' FLOAT ',' INT ')'
| '(' FLOAT ',' FLOAT ')'
;
then both expressions can be parsed, and the pair expression supports both integers and floats.
But this is silly: if it were not a pair but a 10-tuple, I would have to write out 2^10 = 1024 cases.
Is there any better solution?
As kaby76 already mentioned in a comment: promote NUM to a parser rule. It doesn't make a lot of sense to define INT and FLOAT in the lexer, and then define a NUM lexer rule that keeps the INT and FLOAT tokens from ever being emitted on their own.
prog : single | pair ;
single : '(' INT ')' ;
pair : '(' num ',' num ')' ;
num : INT | FLOAT ;
INT : [0-9]+ ;
FLOAT : '-'? INT '.' INT ;
WS : [ \t\r\n] -> skip ;
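A quick way to check the promoted rule is ANTLR's test rig (assuming the usual antlr4 and grun shell aliases for org.antlr.v4.Tool and org.antlr.v4.gui.TestRig):
$ antlr4 Expr.g4 && javac Expr*.java
$ echo "(1, 2.0)" | grun Expr prog -tree
$ echo "(1)" | grun Expr prog -tree
Both inputs now parse, and the mixed pair comes out as a pair with two num children, one wrapping an INT token and the other a FLOAT token.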

Accessing Elements in 2-Dimensional Array Elm

How do I get at certain elements in this array using Array.get?
First I have a 2D List:
node = [['X','X',' '],[' ',' ',' '],[' ',' ',' ']]
-- so node is a [['X','X',' '],[' ',' ',' '],[' ',' ',' ']] : List (List Char)
I convert it to a 2D array so I get:
Array.fromList [Array.fromList ['X','X',' '],Array.fromList [' ',' ',' '],Array.fromList [' ',' ',' ']] : Array.Array (Array.Array Char)
Side note: why did the REPL give me that instead of just reporting it as [['X','X',' '],[' ',' ',' '],[' ',' ',' ']] : Array.Array (Array.Array Char)? Just wondered, thought that was odd.
So now node is a 2D array instead of a list.
Now, what if I want to access the value at, say, position [0][1] in the 2D array, i.e. the second 'X' in the first row (index 1)? How would I do that with get?
Once I figure that out, I will need to figure out how to update that position, for example to change an 'X' to an 'O', or an empty position to an 'X' or 'O'.
Is it me, or is working with 2D lists or 2D arrays in Elm just a huge PITA?
Is it me, or is working with 2D lists or 2D arrays in Elm just a huge PITA?
Linked lists are the more common structure in functional languages, and most reasoning about algorithms is based on them; that's why working with classic arrays can sometimes be tedious. (The alternative is to review the structure of your application and make lists fit it.)
Regarding the question, imagine you have a definition like this:
arrayNode : Array.Array (Array.Array Char)
arrayNode = Array.fromList [Array.fromList ['X','X',' '],Array.fromList [' ',' ',' '],Array.fromList [' ',' ',' ']]
To get row 0 of this array, the Array.get function can be used:
Array.get 0 arrayNode
As far as I understand, the difficulty is that a Maybe (Array.Array Char) is returned, so we can't apply Array.get again straight away.
We could use a case expression and check whether the result is Just (Array.Array Char) or Nothing, but there's the Maybe.andThen function, which simplifies the code:
Array.get 0 arrayNode |> Maybe.andThen (Array.get 1)
The result is Just 'X', as expected.
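For the follow-up about updating a position, the same get-the-row approach combines with Array.set; a sketch that is not part of the original answer (Array.set leaves an array unchanged for out-of-range indices, so the Nothing branch just returns the board):
setCell : Int -> Int -> Char -> Array.Array (Array.Array Char) -> Array.Array (Array.Array Char)
setCell row col value board =
    case Array.get row board of
        Just oldRow ->
            Array.set row (Array.set col value oldRow) board
        Nothing ->
            board
For example, setCell 0 2 'O' arrayNode puts an 'O' in the first row, third column.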

Antlr4 unexpectedly stops parsing expression

I'm developing a simple calculator with the formula grammar:
grammar Formula ;
expr : <assoc=right> expr POW expr # pow
| MINUS expr # unaryMinus
| PLUS expr # unaryPlus
| expr PERCENT # percent
| expr op=(MULTIPLICATION|DIVISION) expr # multiplyDivide
| expr op=(PLUS|MINUS) expr # addSubtract
| ABS '(' expr ')' # abs
| '|' expr '|' # absParenthesis
| MAX '(' expr ( ',' expr )* ')' # max
| MIN '(' expr ( ',' expr )* ')' # min
| '(' expr ')' # parenthesis
| NUMBER # number
| '"' COLUMN '"' # column
;
MULTIPLICATION: '*' ;
DIVISION: '/' ;
PLUS: '+' ;
MINUS: '-' ;
PERCENT: '%' ;
POW: '^' ;
ABS: [aA][bB][sS] ;
MAX: [mM][aA][xX] ;
MIN: [mM][iI][nN] ;
NUMBER: [0-9]+('.'[0-9]+)? ;
COLUMN: (~[\r\n"])+ ;
WS : [ \t\r\n]+ -> skip ;
"column a"*"column b" input gives me following tree as expected:
But "column a" * "column b" input unexpectedly stops parsing:
What am I missing?
Your WS rule is broken by the COLUMN rule, which has a higher precedence. More precisely, the issue is that ~[\r\n"] matches space characters too.
"column a"*"column b" lexes as follows: '"' COLUMN '"' MULTIPLICATION '"' COLUMN '"'
"column a" * "column b" lexes as follows: '"' COLUMN '"' COLUMN '"' COLUMN '"'
Yes, "space star space" got lexed as a COLUMN token because that's how ANTLR lexer rules work: longer token matches get priority.
As you can see, this token stream does not match the expr rule as a whole, so expr matches as much as it could, which is '"' COLUMN '"'.
Declaring a lexer rule that consists only of a negated character set, like you did, is always a bad idea. And having separate '"' tokens doesn't feel right to me either.
What you should have done is to include the quotes in the COLUMN rule as they're logically part of the token:
COLUMN: '"' (~["\r\n])* '"';
Then remove the standalone quotes from your parser rule. You can either unquote the text later, when you process the parse tree, or change the token emission logic in the lexer to alter the underlying value of the token.
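The first option is just a substring once you have the COLUMN token in hand. A sketch, assuming the grammar is generated with -visitor and the #column label is kept, so FormulaBaseVisitor and FormulaParser.ColumnContext are the generated names:
public class ColumnNameExtractor extends FormulaBaseVisitor<String> {
    @Override
    public String visitColumn(FormulaParser.ColumnContext ctx) {
        String quoted = ctx.COLUMN().getText();            // e.g. "column a", quotes included
        return quoted.substring(1, quoted.length() - 1);   // strip the surrounding quotes
    }
}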
And in order to not ignore trailing input, add another rule which will make sure you've consumed the whole input:
formula: expr EOF;
Then use this rule as your entry rule instead of expr when calling your parser.
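In code, the new entry rule is used like this (a sketch for a recent ANTLR 4 runtime; FormulaLexer and FormulaParser are the classes generated from the grammar above, and the input string is only an example):
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class FormulaDemo {
    public static void main(String[] args) {
        FormulaLexer lexer = new FormulaLexer(CharStreams.fromString("\"column a\" * \"column b\""));
        FormulaParser parser = new FormulaParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.formula();   // entry rule ending in EOF, instead of parser.expr()
        System.out.println(tree.toStringTree(parser));
    }
}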
But "column a" * "column b" input unexpectedly stops parsing
If I run your grammar with ANTLR 4.6, it does not stop parsing: it parses the whole file and highlights in pink what the parser can't match (in that display, the dots represent spaces).
And there is an important error message:
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
As I explained here, as soon as you get a "mismatched" error, add -tokens to grun.
With "column a"*"column b":
$ grun Formula expr -tokens -diagnostics t1.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:10='*',<'*'>,1:10]
[#4,11:11='"',<'"'>,1:11]
[#5,12:19='column b',<COLUMN>,1:12]
[#6,20:20='"',<'"'>,1:20]
[#7,22:21='<EOF>',<EOF>,2:0]
With "column a" * "column b":
$ grun Formula expr -tokens -diagnostics t2.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:12=' * ',<COLUMN>,1:10]
[#4,13:13='"',<'"'>,1:13]
[#5,14:21='column b',<COLUMN>,1:14]
[#6,22:22='"',<'"'>,1:22]
[#7,24:23='<EOF>',<EOF>,2:0]
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
You immediately see that ' * ' is interpreted as a COLUMN token.
Many questions about matching input with lexer rules have been asked in the last few days:
extraneous input
ordering
greedy
ambiguity
expression
So many, in fact, that Lucas posted a question of his own just so he could write an answer summarizing the whole problem: disambiguate.

ANTLR is taking the wrong branch

I have this very simple grammar:
grammar LispExp;
expression : LITERAL #LiteralExp
| '(' '-' expression ')' #UnaryMinusExp
| '(' OP expression expression ')' #OpExp
| '(' 'if' expression expression expression ')' #IfExp;
OP : '+' | '-' | '*' | '/' | '==' | '<';
LITERAL : '0'|('1'..'9')('0'..'9')*;
WS : ('\t' | '\n' | '\r' | ' ') -> skip;
It should be able to parse a "lisp-like" expression, but when I try to parse this:
(+ (+ 5 (* 7 (/ 5 (- 2 (- 9) ) ) ) ) 8)
ANTLR fails to recognize the last unary minus, and generates the following (with ANTLR v4):
(expression ( + (expression ( + (expression 5) (expression ( * (expression 7) (expression ( / (expression 5) (expression ( - (expression 2))) ( -) 9 )) expression ))
So, how can I make ANTLR understand the priority of unary minus over binary expression?
You are using a combined grammar LispExp, as opposed to separate lexer grammar LispExpLexer and parser grammar LispExpParser. When working with combined grammars, if you use a string literal in a parser rule the code generator will create anonymous tokens according to those string literals, and silently override the lexer.
In this case, your expression rule includes the string literal '-'. All instances of - in your input will be assigned this token type, which means they will not ever have the token type OP. Your input contains a subexpression (- 2 (- 9) ) which can only be parsed if the first - is an OP token, so according to the parser you have a syntax error in your input.
If you update your code to use separate lexer and parser grammars, any attempt to use a string literal in the parser grammar which is not defined in the lexer grammar will produce an error when you attempt to generate your lexer and parser.
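If you want to keep a combined grammar, one way to get the intended behaviour back (a sketch, not part of the answer above) is to stop relying on the implicit '-' token: give minus an explicit lexer rule and let the binary alternative accept it too, keeping LITERAL and WS as in the question:
expression : LITERAL                                        #LiteralExp
           | '(' MINUS expression ')'                       #UnaryMinusExp
           | '(' (OP | MINUS) expression expression ')'     #OpExp
           | '(' 'if' expression expression expression ')'  #IfExp
           ;
MINUS : '-' ;
OP    : '+' | '*' | '/' | '==' | '<' ;
Now every - in the input is a MINUS token, and the parser chooses the unary or binary alternative depending on whether one or two operands follow it.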

Xtext list of items or only one item

I'm trying to create a grammar that would parse the following:
reference: java.util.String
but as well
reference: {java.util.String, java.lang.Integer}
In other words, I want it to parse both a list of qualified names and a single item (not wrapped in '{' in this case).
What I tried is this:
Reference:
'reference' ':' ('{' values+=QualifiedName (',' values+=QualifiedName)* '}') | (values+=QualifiedName);
However, when using the first form of the reference (without '{'), I am getting the error missing '{' at 'java'. Any suggestions on what I should try?
EDIT: Also tried
Reference:
'reference' ':' ('{' values+=QualifiedName (',' values+=QualifiedName)* '}') | ((!'{')values+=QualifiedName);
but getting a no viable alternative at input '!' error in the grammar definition.
EDIT 2: I am not having problems with the comma-separated list; I tried that separately and it works well. My only problem is distinguishing between the two parts of the rule based on the '{' character.
This will do the trick:
Reference:
'reference' ':' (
'{' values+=QualifiedName (',' values+=QualifiedName)* '}'
| values+=QualifiedName
);
Mind the precedence of groups and alternatives: in your original rule the | applied at the top level, so only the first alternative contained the 'reference' ':' prefix, and after reading 'reference' ':' the parser was committed to the branch that requires '{'. Wrapping both alternatives in a single group after ':' fixes that.
I am quite new to Xtext, so just giving it a try:
Reference:
'reference' ':' ('{' values+=QualifiedName (',' values+=QualifiedName)* '}') | (values=QualifiedName);
or
Reference:
'reference' ':' ('{' values+=QualifiedName (',' values+=QualifiedName)+ '}') | (values=QualifiedName);