I am translating a grammar from LALR to ANTLR and I am having trouble translating one rule, the piecewise expression.
Attached is the sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression ';'
;
expression : binaryExpression
| piecesExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
piecesExpression : 'piecewise' '{' piece expression '}' ('(' expression ',' expression ')')? expression?
;
piece : expression '->' expression ';' (expression '->' expression ';')*
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
ANTLR v3.5 complains about the piecesExpression rule: it reports 2 fatal errors, and I would rather not use the backtrack option.
Expected results:
piecewise {t -> s; t -> x; 100}
piecewise {t -> s; t -> x; 100} (0, x+1)
piecewise {t -> s; t -> x; 100} (0, x+1) y+5
How can piecesExpression be written so that it matches the inputs above?
Thanks in advance!
ANTLR has problems determining which alternative to take in (at least) 2 cases:
piece starts with an expression, but the content of piecewise {...} must also end with an expression
piecesExpression can end with '(' expression ... but also has an optional trailing expression (and a primitiveElement in turn also matches '(' expression ...)
There's no need to use global backtracking, but without rewriting many rules, you do need to add some predicates (the (...)=> in the example below) to fix the two issues outlined above.
Try this:
piecesExpression
: 'piecewise' '{' ((expression '->')=> piece)+ expression '}'
( ('(' expression ',')=> '(' expression ',' expression ')' expression?
| expression
)?
;
piece
: expression '->' expression ';'
;
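To illustrate what a syntactic predicate such as (expression '->')=> does, here is a minimal Python sketch of the same idea in a hand-written recursive-descent parser: it speculatively parses an expression, checks whether '->' follows, and then rewinds. The tokenizer, the Parser class, and the reduced expression grammar (identifiers, integers, '+' and parentheses only) are assumptions made for illustration; this is not the code ANTLR generates.

```python
import re

TOKEN_RE = re.compile(r"\s*(->|[A-Za-z_]\w*|\d+|[{}();,+])")

def tokenize(text):
    """Split text into tokens; a simplified stand-in for the lexer rules."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def expression(self):
        # expression : primary ('+' primary)*   (deliberately simplified)
        node = self.primary()
        while self.peek() == "+":
            self.pos += 1
            node = ("+", node, self.primary())
        return node

    def primary(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expression()
            self.expect(")")
            return node
        if tok is not None and re.fullmatch(r"\w+", tok):
            self.pos += 1
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    def piece_ahead(self):
        # Plays the role of (expression '->')=> : try it, then rewind.
        start = self.pos
        try:
            self.expression()
            return self.peek() == "->"
        except SyntaxError:
            return False
        finally:
            self.pos = start

    def pieces(self):
        # 'piecewise' '{' (piece)+ expression '}'
        self.expect("piecewise")
        self.expect("{")
        found = []
        while self.piece_ahead():
            cond = self.expression()
            self.expect("->")
            value = self.expression()
            self.expect(";")
            found.append((cond, value))
        if not found:
            raise SyntaxError("expected at least one piece")
        default = self.expression()
        self.expect("}")
        return found, default

print(Parser(tokenize("piecewise {t -> s; t -> x; 100}")).pieces())
```

The lookahead only decides which alternative to enter; the actual parse then runs again from the saved position, which is also how a predicate differs from global backtracking.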
Related
I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this:
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above:
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the dots in Fields.V1 and Fields.V2 are missing from the text, and there is also a <missing ')'> error node. I believe I should somehow make ANTLR understand that an expression can also contain fields with dot operators.
A further question:
(Var1)(Var2)
ANTLR does not report an error for this input. An expression should never look like (Var1)(Var2); there should always be an operator, as in (var1)*(var2) or (var1)+(var2). The parse tree shows no error node for it. How should the grammar be modified so that this case is caught as well?
To recognize IDs like Fields.V1, change your lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice that, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow digits in IDs.
The ID rule then uses the fragment to build a lexer rule that allows dots in the ID.
You also didn't add ID as a valid expr alternative.
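The dotted ID rule can be sanity-checked outside ANTLR with an equivalent regular expression. This is a minimal Python sketch; the name first_id is hypothetical, and the pattern is only assumed to mirror the ID_NODE and ID rules, not taken from ANTLR itself.

```python
import re

# Mirrors: fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID_NODE = r"[a-zA-Z_][a-zA-Z0-9]*"
# Mirrors: ID: ID_NODE ('.' ID_NODE)*;
ID = re.compile(rf"{ID_NODE}(?:\.{ID_NODE})*")

def first_id(text):
    """Return the longest ID match at the start of text, or None."""
    m = ID.match(text)
    return m.group(0) if m else None

print(first_id("Fields.V1)*(Fields.V2)"))
print(first_id("Constants.Value1"))
```

A dot is only consumed when another ID_NODE follows it, so a trailing dot is left for the next token.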
To detect the error condition in (Var1)(Var2), follow Mike's advice from the comments and add the EOF token to the end of the parse rule. Without EOF, ANTLR stops parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR continues parsing into (Var2) and reports the error.
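The effect of the trailing EOF can be sketched with a tiny hand-written parser in Python. The functions tokenize, parse_expr and parse are hypothetical, and the grammar is reduced to expr : '(' expr ')' | NAME, so this is only a sketch of the behaviour, not of ANTLR's implementation.

```python
import re

def tokenize(text):
    return re.findall(r"[A-Za-z_]\w*|[()]", text)

def parse_expr(tokens, pos):
    """expr : '(' expr ')' | NAME ; returns the position after the expr."""
    if pos < len(tokens) and tokens[pos] == "(":
        pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("missing ')'")
        return pos + 1
    if pos < len(tokens) and tokens[pos] not in ("(", ")"):
        return pos + 1
    raise SyntaxError("expected an expression")

def parse(text, require_eof):
    """Return the number of tokens consumed by the start rule."""
    tokens = tokenize(text)
    end = parse_expr(tokens, 0)
    if require_eof and end != len(tokens):
        raise SyntaxError(f"extraneous input {tokens[end]!r}, expected EOF")
    return end

# Without the EOF check the parser stops quietly after "(Var1)".
print(parse("(Var1)(Var2)", require_eof=False))

# With the EOF check the leftover "(Var2)" becomes an error.
try:
    parse("(Var1)(Var2)", require_eof=True)
except SyntaxError as err:
    print("error:", err)
```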
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)
I wrote some ANTLR4 rules to parse SQL; I just want to distinguish between fields and tables. But they do something unexpected. My rules:
grammar Col;
stat : SELECT select_list FROM table_ref_list;
select_list : select_ele (',' select_ele)* ;
select_ele : Subquery_in_field_nor //subquery select column
| ID '(' .*? ')' //function calls
| NoDotId '(' .*? ')' NoDotId '(' .*? ')' //window function call
| ID //like column of tab
| DIGIT //number
| STRING //like this 'dad'
;
table_ref_list
: table_ref (',' table_ref)*
;
table_ref:table_block (NoDotId)?;
table_block : ID //So much like select_ele
| Subquery_in_field_nor
;
Subquery_in_field_nor : '(' (Subquery_in_field | ~[()])* ')'; //Resolving function nesting
Subquery_in_field : '(' .*? ')' ;
SELECT : [Ss][Ee][Ll][Ee][Cc][Tt];
FROM :[Ff][Rr][Oo][Mm];
NL : [ \r\n]+ ->skip;
ID : [A-Za-z] [A-Za-z0-9.]*;
NoDotId : [A-Za-z] [A-Za-z0-9]*;
DIGIT : ('-')? [0-9]+('.' [0-9]+)?;
STRING : '\'' .*? '\'';
And my SQL file looks like this:
SELECT substr(A.EMPNO,1,2),
A.ENAME,
'1',
'wwet',
18,
A.DEPTNO FROM EMP A
The messages displayed:
line 1:13 missing FROM at '(A.EMPNO,1,2)'
line 3:7 mismatched input ''1'' expecting {Subquery_in_field_nor, ID}
line 4:7 mismatched input ''wwet'' expecting {Subquery_in_field_nor, ID}
line 5:7 mismatched input '18' expecting {Subquery_in_field_nor, ID}
(stat SELECT (select_list (select_ele substr)) <missing FROM> (table_ref_list (table_ref (table_block (A.EMPNO,1,2))) , (table_ref (table_block A.ENAME)) , (table_ref (table_block '1')) , (table_ref (table_block 'wwet')) , (table_ref (table_block 18)) , (table_ref (table_block A.DEPTNO))))
I don't know why. Why does the input still match table_ref even though I use select_list : select_ele (',' select_ele)* ?
substr(A.EMPNO,1,2) is not matched as a select_ele because it produces the tokens:
ID: substr
Subquery_in_field_nor: (A.EMPNO,1,2)
and that token sequence is matched by neither of these two alternatives:
| ID '(' .*? ')' //function calls
| NoDotId '(' .*? ')' NoDotId '(' .*? ')' //window function call
The part '(' .*? ')' in a parser rule is interpreted as: match a ( token, followed non-greedily by zero or more other tokens, ending with a ) token. But because ANTLR's lexer tries to grab as many characters as possible when tokenising the input, (A.EMPNO,1,2) will always be matched as a single Subquery_in_field_nor token, so the parser never sees separate ( and ) tokens here.
Besides the greedy tokenisation in the lexer, you must also be aware that when 2 or more lexer rules can match the same characters, the one defined first "wins". So the input select can be matched by SELECT, ID and NoDotId. But since SELECT is defined first, that input will always become a SELECT token. However, the input foo can be matched by ID and NoDotId and since ID is defined first, it will win. You'll notice that there will never be a NoDotId token at all since whatever it matches is also matched by ID.
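The two lexer rules described above (longest match first, then the rule defined first on a tie) can be sketched in a few lines of Python. RULES and next_token are hypothetical names, and only three of the question's lexer rules are modelled; this is an approximation of the behaviour, not ANTLR's lexer.

```python
import re

# Rules in definition order, mirroring part of the question's lexer.
RULES = [
    ("SELECT", r"[Ss][Ee][Ll][Ee][Cc][Tt]"),
    ("ID", r"[A-Za-z][A-Za-z0-9.]*"),
    ("NoDotId", r"[A-Za-z][A-Za-z0-9]*"),
]

def next_token(text, pos=0):
    """Pick the longest match; on a tie, the rule defined first wins."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # Strictly longer only: an equally long later rule never replaces
        # an earlier one.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

for word in ("select", "selected", "foo", "A.EMPNO"):
    print(word, "->", next_token(word))
```

In this sketch no input can ever produce a NoDotId token, because every string it matches is matched at the same length by ID, which is defined earlier.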
Here is a slightly modified grammar that works for your input:
grammar Col;
stat
: SELECT select_list FROM table_ref_list
;
select_list
: select_ele (',' select_ele)*
;
select_ele
: id Subquery_in_field_nor (id Subquery_in_field_nor)?
| id
| DIGIT
| STRING
;
table_ref_list
: table_ref (',' table_ref)*
;
table_ref
: table_block id?
;
table_block
: id
| Subquery_in_field_nor
;
id
: ID ('.' ID)*
;
Subquery_in_field_nor : '(' (Subquery_in_field | ~[()])* ')'; //Resolving function nesting
Subquery_in_field : '(' .*? ')' ;
SELECT : [Ss][Ee][Ll][Ee][Cc][Tt];
FROM : [Ff][Rr][Oo][Mm];
NL : [ \r\n]+ ->skip;
ID : [A-Za-z] [A-Za-z0-9]*;
DIGIT : ('-')? [0-9]+('.' [0-9]+)?;
STRING : '\'' .*? '\'';
Or better, grab an existing SQL grammar from the ANTLR4 GitHub repo: https://github.com/antlr/grammars-v4. Be aware that all of these grammars are open-source contributions from the community: test them properly, because many of them will have limitations (both in accuracy and in performance).
I am not sure how to solve this problem without using backtrack=true;.
My sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression
;
expression : binaryExpression
| tupleExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/'|'div'|'inter') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| sumExpression
| '(' expression ')'
;
sumExpression : 'sum'|'div'|'inter' expression
;
tupleExpression : ('<' expression '>' (',' '<' expression '>')*)
;
literalExpression : INT
;
id : IDENTIFIER
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
Is there a way to fix the grammar so that no warnings are produced? Let's assume I want to choose both alternatives depending on the case.
Thank you in advance!
Note that:
sumExpression : 'sum'|'div'|'inter' expression
;
gets interpreted as:
sumExpression : 'sum' /* nothing */
| 'div' /* nothing */
| 'inter' expression
;
since the | has a low precedence. You probably want:
sumExpression : ('sum'|'div'|'inter') expression
;
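Regular expressions give | the same low precedence, so the effect of the missing parentheses can be reproduced with Python's re module. This is only an analogy under that assumption: the patterns below stand in for the sumExpression rule, with \w+ as a placeholder for expression, and accepts is a hypothetical helper.

```python
import re

# Without parentheses, "|" splits the whole pattern into three alternatives:
# "sum", "div", or "inter <operand>".
loose = re.compile(r"sum|div|inter \w+")
# With parentheses, an operand is required after any of the three keywords.
grouped = re.compile(r"(sum|div|inter) \w+")

def accepts(pattern, text):
    """True if the pattern matches the whole text."""
    return bool(pattern.fullmatch(text))

for text in ("sum x", "inter x", "sum"):
    print(repr(text), accepts(loose, text), accepts(grouped, text))
```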
"Let's assume I want to choose both alternatives depending on the case."
That is not possible: you cannot let the parser choose both (or more) alternatives, it can only choose one.
I assume you know why the grammar is ambiguous? If not, here's why: the input A div B can be parsed in two ways:
alternative 1

unaryExpression 'div' unaryExpression
       |                     |
       A                     B

alternative 2

id    sumExpression
|      |       \
A    'div'      B
It looks like you want 'sum', 'div' and 'inter' to be some sort of unary operator, in which case you could just merge them into your unaryExpression rule:
unaryExpression : '!' unaryExpression
| '-' unaryExpression
| 'sum' unaryExpression
| 'div' unaryExpression
| 'inter' unaryExpression
| primitiveElement
;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
That way you don't have any ambiguity. Note that A div B will now be parsed as a multiplyingExpression and A div sum B as:
  multiplyingExpression
     /            \
  'div'      unaryExpression
   /            /     \
  A          'sum'     B
Given that I have the following grammar how would I add a rule to match something like 2^3 to create a power operator?
negation : '!'* term ;
unary : ('+'!|'-'^)* negation ;
mult : unary (('*' | '/' | ('%'|'mod') ) unary)* ;
add : mult (('+' | '-') mult)* ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (('&&' | '||') relation)* ;
// LEXER ================================================================
HEX_NUMBER : '0x' HEX_DIGIT+;
fragment
FLOAT: ;
INTEGER : DIGIT+ ({input.LA(1)=='.' && input.LA(2)>='0' && input.LA(2)<='9'}?=> '.' DIGIT+ {$type=FLOAT;})? ;
fragment
HEX_DIGIT : (DIGIT|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9') ;
What I have tried:
I tried something like power : ('+' | '-') unary'^' unary but that doesn't seem to work.
I also tried mult : unary (('*' | '/' | ('%'|'mod') | '^' ) unary)* ; but that doesn't work either.
To give ^ higher precedence than negation, do this:
pow : term ('^' term)* ;
negation : '!' negation | pow ;
unary : ('+'! | '-'^)* negation ;
If you want the grammar itself to express the right-associativity of ^, you can also use recursion:
pow : term ('^'^ pow)?
;
negation : '!'* pow;
...
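Why the recursive form is right-associative can be seen in a small hand-written Python sketch of the rule pow : term ('^' pow)?. The functions tokenize, parse_pow and evaluate are hypothetical, and term is simplified to an integer literal.

```python
import re

def tokenize(text):
    return re.findall(r"\d+|\^", text)

def parse_pow(tokens, pos=0):
    """pow : term ('^' pow)? ; returns (tree, next position)."""
    left = int(tokens[pos])  # term, simplified to an integer literal
    pos += 1
    if pos < len(tokens) and tokens[pos] == "^":
        # Recursing on the right operand nests later operators deeper.
        right, pos = parse_pow(tokens, pos + 1)
        return ("^", left, right), pos
    return left, pos

def evaluate(tree):
    if isinstance(tree, tuple):
        _, base, exponent = tree
        return evaluate(base) ** evaluate(exponent)
    return tree

tree, _ = parse_pow(tokenize("2^3^2"))
print(tree)            # ('^', 2, ('^', 3, 2))
print(evaluate(tree))  # 512, i.e. 2^(3^2); a left-associative parse gives 64
```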
I have this simple grammar for a C#-like syntax. I can't figure out any way to separate fields and methods. All the examples I've seen for parsing C# combine fields and methods in the same rule. I would like to split them up, as my syntax is pretty simple.
grammar test;
options
{
language =CSharp2;
k = 3;
output = AST;
}
SEMI : ';' ;
LCURLY : '{' ;
RCURLY : '}' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT :'.';
IDENTIFIER
: ( 'a'..'z' | 'A'..'Z' | '_' )
( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' )*
;
namespaceName
: IDENTIFIER (DOT IDENTIFIER)*
;
classDecl
: 'class' IDENTIFIER LCURLY (fieldDecl | methodDecl)* RCURLY
;
fieldDecl
: namespaceName IDENTIFIER SEMI;
methodDecl
: namespaceName IDENTIFIER LPAREN RPAREN SEMI;
I always end up with this warning:
Decision can match input such as "IDENTIFIER DOT IDENTIFIER" using multiple alternatives: 1, 2
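Since namespaceName can be IDENTIFIER DOT IDENTIFIER DOT IDENTIFIER ..., I think the problem is the k=3 in your options: three tokens of lookahead are not enough to tell a fieldDecl from a methodDecl once the name contains a dot.
Try removing the k option; ANTLR will then default to k=*.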
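How much lookahead the fieldDecl/methodDecl decision needs can be estimated with a short Python sketch that builds the token-type sequences for both declarations and finds the first position where they differ. The helpers decl and distinguishing_index are hypothetical names introduced for illustration.

```python
def decl(name_parts, is_method):
    """Token types for `a.b.c name;` or `a.b.c name();`."""
    tokens = ["IDENTIFIER"]
    for _ in range(name_parts - 1):
        tokens += ["DOT", "IDENTIFIER"]
    tokens.append("IDENTIFIER")  # the field or method name
    tokens += ["LPAREN", "RPAREN", "SEMI"] if is_method else ["SEMI"]
    return tokens

def distinguishing_index(a, b):
    """Index of the first token where two token sequences differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

for parts in (1, 2, 3):
    i = distinguishing_index(decl(parts, False), decl(parts, True))
    print(f"{parts} name part(s): first difference at index {i}, "
          f"so {i + 1} tokens of lookahead are needed")
```

With an undotted type name three tokens suffice, but every additional DOT IDENTIFIER pair pushes the deciding token two positions further out, so no fixed k covers all inputs.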