I have this rule in ANTLR:
anREs : anRE
| ('(' anREs ')') => '(' anREs ')'
| (anREs '|' anREs) => anREs '|' anREs ;
where anRE is a regular expression. When I compile the grammar file, I get this error message, caused by the third alternative of the rule above:
error(210): The following sets of
rules are mutually left-recursive
[anREs]
How can I rewrite this rule?
Thanks.
Here is your left recursion:
... | (anREs '|' anREs) => anREs '|' anREs ;
Worse, it's ambiguous. If you have anREs_1 | anREs_2 | anREs_3 as input,
it isn't clear what the subterms of the | operator are.
I'd expect this to solve the problem, and resolve the ambiguity, too:
... | (anRE '|' anREs) => anRE '|' anREs ;
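To see why the right-recursive form is unproblematic for a top-down parser, here is a minimal hand-written sketch in Python. It is an illustration only, not what ANTLR generates: it assumes each anRE is a single token, and the function name parse_anres is made up for this example.

```python
def parse_anres(tokens, pos=0):
    """Sketch of: anREs : '(' anREs ')' | anRE ('|' anREs)? ;

    Returns (tree, next_position). Assumes every anRE is one token.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "(":
        # Parenthesised alternative: consume '(' anREs ')'
        tree, pos = parse_anres(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return tree, pos + 1
    if tokens[pos] in (")", "|"):
        raise SyntaxError("expected an anRE, got %r" % tokens[pos])
    left = tokens[pos]  # a token is always consumed before recursing
    pos += 1
    if pos < len(tokens) and tokens[pos] == "|":
        # The recursion sits on the right, so the parser always makes progress
        right, pos = parse_anres(tokens, pos + 1)
        return ("|", left, right), pos
    return left, pos
```

Parsing ["a", "|", "b", "|", "c"] with this sketch gives the tree ("|", "a", ("|", "b", "c")): the subterms of each | are now fixed, with the operator nesting to the right.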
With the following (subset of a) grammar for a scripting language:
expr
...
| 'regex(' str=expr ',' re=expr ')' #regexExpr
...
an expression like regex('s', 're') parses to the following tree which makes sense:
regexExpr
'regex('
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
')'
I'm now trying to add an optional third argument to my regex function, so I've used this modified rule:
'regex(' str=expr ',' re=expr (',' n=expr )? ')'
This causes regex('s', 're', 1) to be parsed in a way that's unexpected to me:
regexExpr
'regex('
expr:listExpression
expr: stringLiteral ('s')
','
expr: stringLiteral ('re')
','
expr: integerLiteral(1)
')'
where listExpression is another rule defined below regexExpr:
expr
...
| 'regex(' str=expr ',' re=expr (',' n=expr)? ')' #regexExpr
...
| left=expr ',' right=expr #listExpr
...
I think this listExpr could have been defined better (by defining surrounding tokens), but I've got compatibility concerns with changing it now.
I don't understand the parser rule matching precedence here. Is there a way I can add the optional third arg to regex() without causing the first two args to be parsed as a listExpr?
Try defining them in two separate alternatives and with the same label #regexExpr:
expr
: 'regex' '(' str=expr ',' re=expr ',' n=expr ')' #regexExpr
| 'regex' '(' str=expr ',' re=expr ')' #regexExpr
| left=expr ',' right=expr #listExpr
| ...
;
I'm trying to write a syntax highlighter for Swift. In addition to plain tokens, I would also like to highlight some language constructs. I'm having problems with the following rule:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| (Attributes? Function_type_argument_clause 'throws'? '->' Type | Attributes? Function_type_argument_clause 'rethrows' '->' Type)
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type)
| Tuple_type
| Type '?'
| Type '!'
| (Type_name Generic_argument_clause? | Type_name Generic_argument_clause? '.' Type) '&' Protocol_composition_continuation
| (Type '.' 'Type' | Type '.' 'Protocol')
| 'Any'
| 'Self'
| '(' Type ')'
;
Error: The following sets of rules are mutually left-recursive [Type]
I tried reducing the rule to only the following alternatives:
Type
: Type '?'
| 'Any'
| 'Self'
;
But the problem remained: The following sets of rules are mutually left-recursive [Type]
You defined Type as a lexer rule. Lexer rules cannot be left recursive. Type should be a parser rule.
See: Practical difference between parser rules and lexer rules in ANTLR?
Note that there are existing Swift grammars:
https://github.com/antlr/grammars-v4/blob/master/swift2/Swift2.g4
https://github.com/antlr/grammars-v4/blob/master/swift3/Swift3.g4
Note that these grammars are user-committed, so test them properly!
EDIT
I'm still unable to understand it from the point of view of lexical analysis
Oh, you're only tokenising? Well, then you can't use Type as you're doing it now. You will have to rewrite it so that there is no left recursion any more.
For example, let's say the simplified Type rule looks like this:
Type
: '[' Type ']'
| '[' Type ':' Type ']'
| Type '?'
| Type '!'
| 'Any'
| 'Self'
| '(' Type ')'
;
then you should rewrite it like this:
Type
: TypeStart TypeTrailing?
;
fragment TypeStart
: '[' Type ']'
| '[' Type ':' Type ']'
| 'Any'
| 'Self'
| '(' Type ')'
;
fragment TypeTrailing: [?!];
I'm currently building a grammar for unit tests regarding a proprietary language my company uses.
This language resembles regex in some ways; for example, F=bing* indicates the possible repetition of bing. A single * on its own, however, represents exactly one arbitrary block, and ** means any number of arbitrary blocks.
My only solution to this is using semantic predicates, checking if the preceding token was a space. If anyone has suggestions circumventing this problem in a different way, please share!
Otherwise, my grammar looks like this right now, but the predicates don't seem to work as expected.
grammar Pattern;
element:
ID
| macro;
macro:
MACRONAME macroarg? REPEAT?;
macroarg: '['( (element | MACROFREE ) ';')* (element | MACROFREE) ']';
and_con :
element '&' element
| and_con '&' element
|'(' and_con ')';
head_con :
'H[' block '=>' block ']';
block :
element
| and_con
| or_con
| head_con
| '(' block ')';
blocksequence :
(block ' '+)* block;
or_con :
((element | and_con) '|')+ (element | and_con)
| or_con '|' (element | and_con)
| '(' blocksequence (')|(' blocksequence)+ ')' REPEAT?;
patternlist :
(blocksequence ' '* ',' ' '*)* blocksequence;
sentenceord :
'S=(' patternlist ')';
sentenceunord :
'S={' patternlist '}';
pattern :
sentenceord
| sentenceunord
| blocksequence;
multisentence :
MS pattern;
clause :
'CLS' ' '+ pattern;
complexpattern :
pattern
| multisentence
| clause
| SECTIONS ' ' complexpattern;
dictentry:
NUM ';' complexpattern
| NUM ';' NAME ';' complexpattern
| COMMENT;
dictionary:
(dictentry ('\n'|'\r\n'))* (dictentry)? EOF;
ID : ( '^'? '!'? ('F'|'C'|'L'|'P'|'CA'|'N'|'PE'|'G'|'CD'|'T'|'M'|'D')'=' NAME REPEAT? '$'? )
| SINGLESTAR REPEAT?;
fragment SINGLESTAR: {_input.LA(-1)==' '}? '*';
fragment REPEATSTAR: {_input.LA(-1)!=' '}? '*';
fragment NAME: CHAR+ | ',' | '.' | '*';
fragment CHAR: [a-zA-Z0-9_äöüßÄÖÜ\-];
REPEAT: (REPEATSTAR|'+'|'?'|FROMTIL);
fragment FROMTIL: '{'NUM'-'NUM'}';
MS : 'MS' [0-9];
SECTIONS: 'SEC' '=' ([0-9]+','?)+;
NUM: [0-9]+;
MACRONAME: '#'[a-zA-Z_][a-zA-Z_0-9]*;
MACROFREE: [a-zA-Z!]+;
COMMENT: '//' ~('\r'|'\n')*;
When targeting Python, the syntax of lookahead predicates needs to be like this:
SINGLESTAR: {self._input.LA(-1)==ord(' ')}? '*';
Note that it is necessary to add the "self." reference to the call and to wrap the character in the ord() function, which returns the character's Unicode code point for comparison. The ANTLR documentation for the Python target is seriously lacking!
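The reason ord() matters can be shown in plain Python, without ANTLR. This is only a sketch: the helper name preceded_by_space is invented here, and la_value stands in for whatever self._input.LA(-1) returns, assumed to be an integer code point as described above.

```python
def preceded_by_space(la_value):
    """la_value stands in for self._input.LA(-1), assumed to be an int."""
    return la_value == ord(" ")

space_code = ord(" ")  # the integer code point of a space character

# Comparing the integer with the string ' ' is always False in Python,
# which is why a predicate written as LA(-1)==' ' never fires.
naive_comparison = (space_code == " ")

# Comparing against ord(' ') behaves as intended.
correct_comparison = preceded_by_space(space_code)
```

With the naive comparison, a predicate such as SINGLESTAR would never succeed, while the negated one (REPEATSTAR) would always succeed, which matches predicates that "don't seem to work as expected".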
I'm new to Antlr and I have the following simplified language:
grammar Hello;
sentence : targetAttributeName EQUALS expression+ (IF relationedExpression (logicalRelation relationedExpression)*)?;
expression :
'(' expression ')' |
expression ('*'|'/') expression |
expression ('+'|'-') expression |
function |
targetAttributeName |
NUMBER;
filterExpression :
'(' filterExpression ')' |
filterExpression ('*'|'/') filterExpression |
filterExpression ('+'|'-') filterExpression |
function |
filterAttributeName |
NUMBER |
DATE;
relationedExpression :
filterExpression ('<'|'<='|'>'|'>='|'=') filterExpression |
filterAttributeName '=' STRING |
STRING '=' filterAttributeName
;
logicalRelation :
'AND' |
'OR'
;
targetAttributeName :
'x'|
'y'
;
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
function:
simpleFunction |
complexFunction ;
simpleFunction :
'simpleFunction' '(' expression ')' |
'simpleFunction2' '(' expression ')'
;
complexFunction :
'complexFunction' '(' expression ')' |
'complexFunction2' '(' expression ')'
;
EQUALS : '=';
IF : 'IF';
STRING : '"' [a-zA-Z0-9]* '"';
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
DATE: NUMBER NUMBER NUMBER NUMBER '.' NUMBER NUMBER? '.' NUMBER NUMBER? '.';
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It works with x = y * 2, but it doesn't work with x =y * 1.
The error message is the following:
Hello::sentence:1:7: mismatched input '1' expecting {'simpleFunction', 'complexFunction', 'x', 'y', 'complexFunction2', '(', 'simpleFunction2', NUMBER}
This seems very strange to me, because 1 is a NUMBER...
If I change the filterAttributeName alternative from 'a' '1' to 'a1', then it works with x=y*1, but I don't understand the difference between the two cases. Could somebody explain it to me?
Thanks.
By doing this:
filterAttributeName :
'a' |
'a' '1' |
targetAttributeName;
ANTLR creates lexer rules from these inline tokens. So you really have a lexer grammar that looks like this:
T_1 : '1'; // the rule name will probably be different though
T_a : 'a';
...
NUMBER : [-]?[0-9]+('.'[0-9]+)?;
In other words, the input 1 will be tokenized as T_1, not as a NUMBER.
EDIT
Whenever certain input can match two or more lexer rules, ANTLR picks the rule that matches the most characters, and if several rules match the same number of characters, it chooses the one defined first. The literal tokens from the parser rules count as defined before NUMBER, so a lone 1 becomes T_1. The lexer does not "listen" to the parser to see what it needs at a particular time. Lexing and parsing are 2 distinct phases. This is simply how ANTLR works, as do many other parser generators. If this is not acceptable for you, you should google for "scanner-less parsing" or "packrat parsers".
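The tie-breaking behaviour can be imitated with a toy lexer in Python. This is a rough sketch of the "longest match, then first defined" policy and not ANTLR's actual implementation; the names RULES and next_token are hypothetical, and T_1 / T_a mirror the implicit token names used above.

```python
import re

# Rule order matters: implicit literal tokens come before NUMBER.
RULES = [
    ("T_1", r"1"),
    ("T_a", r"a"),
    ("NUMBER", r"-?[0-9]+(\.[0-9]+)?"),
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the token starting at pos."""
    best = None
    for name, pattern in RULES:
        match = re.compile(pattern).match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the earlier rule wins.
        if match and (best is None or match.end() > best[1]):
            best = (name, match.end())
    if best is None:
        raise ValueError("no rule matches at position %d" % pos)
    return best[0], text[pos:best[1]]
```

In this sketch next_token("1") yields T_1, because T_1 and NUMBER both match one character and T_1 comes first, whereas next_token("2") and next_token("12") yield NUMBER. That is the same pattern as in the question: y * 2 parses, y * 1 does not.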
I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
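parse 1: a single expression, 1 - (-2), where the first minus is the binary operator of the add rule
parse 2: two expressions, 1 followed by - - 2, where both minus signs belong to the unary rule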
Your multiFunc rule contains a list of expressions. An expression can begin with + or - by way of the unary rule, so because of the list, an expression can also be followed by those same tokens. This conflicts with the add rule: when the parser sees a + or -, it cannot decide whether the token continues the current expression or terminates it and starts the next one.
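The ambiguity can be made concrete by counting parses for a cut-down version of the grammar. The sketch below is plain Python, not ANTLR, and it simplifies heavily: it assumes only the rules expr : unary (('+'|'-') unary)* and unary : ('+'|'-')* NUMBER, and counts how many ways a token list can be read as expr+. The name count_parses is made up for this example.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count the ways tokens can be split into one or more expressions."""
    tokens = tuple(tokens)
    n = len(tokens)
    ops = ("+", "-")

    def unary_end(i):
        # unary : ('+' | '-')* NUMBER ; return the end index, or None
        while i < n and tokens[i] in ops:
            i += 1
        return i + 1 if i < n and tokens[i].isdigit() else None

    def expr_ends(i):
        # expr : unary (('+' | '-') unary)* ; every index where it may stop
        ends = []
        j = unary_end(i)
        while j is not None:
            ends.append(j)
            j = unary_end(j + 1) if j < n and tokens[j] in ops else None
        return ends

    @lru_cache(maxsize=None)
    def ways(i):
        # Either this expression ends the input, or more expressions follow
        return sum(1 if e == n else ways(e) for e in expr_ends(i))

    return ways(0)
```

Under these assumptions count_parses("1 - - 2".split()) is 2, matching the two parses described above, while count_parses("1 2".split()) is 1. Even "1 - 2" has two readings once expressions may follow each other without a separator.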