I'm trying to write Swift language highlight. Also I would like to highlight in addition to tokens of some language constructs. Having problems with the following rule:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| (Attributes? Function_type_argument_clause 'throws'? '->' Type | Attributes? Function_type_argument_clause 'rethrows' '->' Type)
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type)
| Tuple_type
| Type '?'
| Type '!'
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type) '&' Protocol_composition_continuation
| (Type '.' 'Type' | Type '.' 'Protocol')
| 'Any'
| 'Self'
| '(' Type ')'
;
Error: The following sets of rules are mutually left-recursive [Type]
Tried to leave in the rule, only the following cases:
Type
: Type '?'
| 'Any'
| 'Self'
;
But the problem remained: The following sets of rules are mutually left-recursive [Type]
You defined Type as a lexer rule. Lexer rules cannot be left recursive. Type should be a parser rule.
See: Practical difference between parser rules and lexer rules in ANTLR?
Note that there are existing Swift grammars:
https://github.com/antlr/grammars-v4/blob/master/swift2/Swift2.g4
https://github.com/antlr/grammars-v4/blob/master/swift3/Swift3.g4
Note that these grammars are user-comitted, test them properly!
EDIT
I'm still unable to understand it from the point of view of lexical analysis
Oh, you're only tokenising? Well, then you can't use Type as you're doing it now. You will have to rewrite it so that there is no left recursion any more.
For example, let's say the simplified Type rule looks like this:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| Type '?'
| Type '!'
| 'Any'
| 'Self'
| '(' Type ')'
;
then you should rewrite it like this:
Type
: TypeStart TypeTrailing?
;
fragment TypeStart
: '[' Type ']'
| '[' Type ':' Type ']'
| 'Any'
| 'Self'
| '(' Type ')'
;
fragment TypeTrailing: [?!];
Related
I've just started using antlr so Id really appreciate the help! Im just trying to make a variable declaration declaration rule but its not working! Ive put the files Im working with below, please lmk if you need anything else!
INPUT CODE:
var test;
GRAMMAR G4 FILE:
grammar treetwo;
program : (declaration | statement)+ EOF;
declaration :
variable_declaration
| variable_assignment
;
statement:
expression
| ifstmnt
;
variable_declaration:
VAR NAME SEMICOLON
;
variable_assignment:
NAME '=' NUM SEMICOLON
| NAME '=' STRING SEMICOLON
| NAME '=' BOOLEAN SEMICOLON
;
expression:
operand operation operand SEMICOLON
| expression operation expression SEMICOLON
| operand operation expression SEMICOLON
| expression operation operand SEMICOLON
;
ifstmnt:
IF LPAREN term RPAREN LCURLY
(declaration | statement)+
RCURLY
;
term:
| NUM EQUALITY NUM
| NAME EQUALITY NUM
| NUM EQUALITY NAME
| NAME EQUALITY NAME
;
/*Tokens*/
NUM : '0' | '-'?[1-9][0-9]*;
STRING: [a-zA-Z]+;
BOOLEAN: 'true' | 'false';
VAR : 'var';
NAME : [a-zA-Z]+;
SEMICOLON : ';';
LPAREN: '(';
RPAREN: ')';
LCURLY: '{';
RCURLY: '}';
EQUALITY: '==' | '<' | '>' | '<=' | '>=' | '!=' ;
operation: '+' | '-' | '*' | '/';
operand: NUM;
IF: 'if';
WS : [ \t\r\n]+ -> skip;
Error I'm getting:
(line 1,char 0): mismatched input 'var' expecting {NUM, 'var', NAME, 'if'}
Your STRING rule is the same as your NAME rule.
With the ANTLR lexer, if two lexer rules match the same input, the first one declared will be used. As a result, you’ll never see a NAME token.
Most tutorials will show you have to dump out the token stream. It’s usually a good idea to view the token stream and verify your Lexer rules before getting too far into your parser rules.
With the following (subset of a) grammer for a scripting language:
expr
...
| 'regex(' str=expr ',' re=expr ')' #regexExpr
...
an expression like regex('s', 're') parses to the following tree which makes sense:
regexExpr
'regex('
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
')'
I'm now trying to add an option third argument to my regex function, so I've used this modified rule:
'regex(' str=expr ',' re=expr (',' n=expr )? ')'
This causes regex('s', 're', 1) to be parsed in a way that's unexpected to me:
regexExpr
'regex('
expr:listExpression
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
','
expr: integerLiteral(1)
')'
where listExpression is another rule defined below regexExpr:
expr
...
| 'regex(' str=expr ',' re=expr (',' n=expr)? ')' #regexExpr
...
| left=expr ',' right=expr #listExpr
...
I think this listExpr could have been defined better (by defining surrounding tokens), but I've got compatibility concerns with changing it now.
I don't understand the parser rule matching precedence here. Is there a way I can add the optional third arg to regex() without causing the first two args to be parsed as a listExpr?
Try defining them in two separate alternatives and with the same label #regexExpr:
expr
: 'regex' '(' str=expr ',' re=expr ',' n=expr ')' #regexExpr
| 'regex' '(' str=expr ',' re=expr ')' #regexExpr
| left=expr ',' right=expr #listExpr
| ...
;
I want to write an expression engine use antlr4.
The following is the grammar.
expression
: primary
| expression '.' Identifier
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
| expression ('*'|'/'|'%') expression
| expression ('+'|'-') expression
| expression ('<' '<' | '>' '>' '>' | '>' '>') expression
| expression ('<=' | '>=' | '>' | '<') expression
| expression ('==' | '!=') expression
| expression '&' expression
| expression '^' expression
| expression '|' expression
| expression '&&' expression
| expression '||' expression
| expression '?' expression ':' expression
| <assoc=right> expression
( '='
| '+='
| '-='
| '*='
| '/='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '%='
)
expression
;
This grammar is right but cannot distinguish between attribute access expressions, method invocation expressions, and array access expressions. So I changed the grammar to
attributeAccessMethod:
expression '.' Identifier;
expression
: primary
| attributeAccessMethod
| expression '(' expressionList? ')'
| expression '[' expression ']'
| expression ('++' | '--')
| ('+'|'-'|'++'|'--') expression
| ('~'|'!') expression
but this grammar is a left-recursive [expression, attributeAccessMethod]. How can I write a better grammar - can I somehow remove the left-recursive property and distinguish these conditions?
Add tags to your different rule alternatives, for example:
expression
: primary # RulePrimary
| expression '.' Identifier # RuleAttribute
| expression '(' expressionList? ')' # RuleExpression
| expression '[' expression ']' # RuleArray
... etc.
When you do this for all your alternatives in this rule, your BaseVisitor or BaseListener will be generated with public overrides for these special cases, where you can treat each one as you see fit.
I don't suggest you define your grammar this way. In addition to #JLH's answer, your grammar has a potential to mess up associativity of these expressions.
What I'm suggesting is you should "top-down" your grammar with associativity order.
For example, you can treat all literals, method invokes etc as atoms (because they will always start with a literal or an identifier) in your grammar, and you will associate these atoms with your associate operators.
Then you could write your grammar like:
expression: binary_expr;
// Binary_Expr
// Logic_Expr
// Add_expr
// Mult_expr
// Pow_expr
// Unary_expr
associate_expr
: index_expr # ToIndexExpr
| lhs=index_expr '.' rhs=associate_expr # AssociateExpr
;
index_expr
: index_expr '[' (expression (COMMA expression)*) ']' # IndexExpr
| atom #ToAtom
;
atom
: literals_1 #wwLiteral
| ... #xxLiteral
| ... #yyLiteral
| literals_n #zzLiteral
| function_call # FunctionCall
;
function_call
: ID '(' (expression (',' expression)*)? ')';
// Define Literals
// Literals Block
And part of your arithmetic expression could look like:
add_expr
: mul_expr # ToMulExpr
| lhs=add_expr PLUS rhs=mul_expr #AddExpr
| lhs=add_expr MINUS rhs=mul_expr #SubtractExpr
;
mul_expr
: pow_expr # ToPowExpr
| lhs=mul_expr '+' rhs=pow_expr # MultiplyExpr
| lhs=mul_expr '/' rhs=pow_expr # DivideExpr
| lhs=mul_expr '%' rhs=pow_expr # ModExpr
;
You make your left hand side as current expr, and right hand side as your other level associated expr, so that you can maintain the order of associativity while having left recursion on them.
I'm currently building a grammar for unit tests regarding a proprietary language my company uses.
This language resembles Regex in some way, for example F=bing* indicates the possible repetition of bing. A single * however represents one any block, and ** means any number of arbitrary blocks.
My only solution to this is using semantic predicates, checking if the preceding token was a space. If anyone has suggestions circumventing this problem in a different way, please share!
Otherwise, my grammar looks like this right now, but the predicates don't seem to work as expected.
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? REPEAT?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
block :
element
| and_con
| or_con
| head_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ ')' REPEAT?;
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\n'|'\r\n'))* (dictentry)? EOF;
ID : ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME REPEAT? '$'? )
| SINGLESTAR REPEAT?;
fragment SINGLESTAR: {_input.LA(-1)==' '}? '*';
fragment REPEATSTAR: {_input.LA(-1)!=' '}? '*';
fragment NAME: CHAR+ | ',' | '.' | '*';
fragment CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
REPEAT: (REPEATSTAR|'+'|'?'|FROMTIL);
fragment FROMTIL: '{'NUM'-'NUM'}';
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
NUM: [0-9]+;
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
When targeting Python, the syntax of lookahead predicates needs to be like this:
SINGLESTAR: {self._input.LA(-1)==ord(' ')}? '*';
Note that it is necessary to add the "self." reference to the call and wrap the character with the ord() function which returns a unicode value for comparison. Antlr documentation for Python target is seriously lacking!
I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
It is very strange for me, because 1 is a NUMBER...
If I change the filterAttribute from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it for me?
Thanks.
By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1': // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever certain input can match two or more lexer rules, ANTLR chooses the one defined first. The lexer does not "listen" to the parser to see what it needs at a particular time. The lexing and parsing are 2 distinct phases. This is simply how ANTLR works, and many other other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing", or "packrat parsers".