I'm using GNU Bison 2.4.2 to write a grammar for a new language I'm working on and I have a question.
When I specify a rule, let's say:
statement : T_CLASS T_IDENT '{' T_CLASS_MEMBERS '}' {
// create a node for the statement ...
}
If I have a variation on the rule, for instance
statement : T_CLASS T_IDENT T_EXTENDS T_IDENT_LIST '{' T_CLASS_MEMBERS '}' {
// create a node for the statement ...
}
Where (from flex scanner rules) :
"class" return T_CLASS;
"extends" return T_EXTENDS;
[a-zA-Z\_][a-zA-Z0-9\_]* return T_IDENT;
(and T_IDENT_LIST is a rule for comma separated identifiers).
Is there any way to specify all of this in a single rule, somehow marking the "T_EXTENDS T_IDENT_LIST" part as optional?
I've already tried with
T_CLASS T_IDENT (T_EXTENDS T_IDENT_LIST)? '{' T_CLASS_MEMBERS '}' {
// create a node for the statement ...
}
But Bison gave me an error.
Thanks
To make a long story short, no. Bison rules are written in plain BNF, which has no EBNF-style optional operator such as "( ... )?". The usual workaround is to move the optional part into its own rule with an empty alternative, something like this:
statement: T_CLASS T_IDENT extension_list '{' ...
extension_list: /* empty */
| T_EXTENDS T_IDENT_LIST
;
There are other parser generators that work with more general grammars though. If memory serves, some of them support optional elements relatively directly like you're asking for.
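To see why the empty alternative works, here is a minimal hand-written sketch in Python. It is not Bison and does not reflect how Bison's generated tables work internally. The helpers lex_class and parse_class are hypothetical names invented for this illustration, and the class body is simplified to an empty pair of braces. The point is only that the extends clause is an alternative that may match nothing:

```python
import re

KEYWORDS = {"class": "T_CLASS", "extends": "T_EXTENDS"}
TOKEN = re.compile(r"(?P<word>[A-Za-z_][A-Za-z0-9_]*)|(?P<punct>[{},])")


def lex_class(src):
    """Split src into (kind, text) pairs; unknown characters are ignored."""
    tokens = []
    for m in TOKEN.finditer(src):
        text = m.group()
        if m.lastgroup == "word":
            tokens.append((KEYWORDS.get(text, "T_IDENT"), text))
        else:
            tokens.append((text, text))
    return tokens


def parse_class(tokens):
    """Parse T_CLASS T_IDENT extension_list '{' '}' and return (name, bases)."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        text = tokens[pos][1]
        pos += 1
        return text

    expect("T_CLASS")
    name = expect("T_IDENT")
    bases = []
    # extension_list: /* empty */ | T_EXTENDS ident_list
    if peek() == "T_EXTENDS":
        expect("T_EXTENDS")
        bases.append(expect("T_IDENT"))
        while peek() == ",":
            expect(",")
            bases.append(expect("T_IDENT"))
    expect("{")
    expect("}")
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return name, bases
```

Both "class A { }" and "class A extends B, C { }" go through the same function; the first simply takes the empty branch.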
Why don't you just split them using the choice (|) operator?
statement:
T_CLASS T_IDENT T_EXTENDS T_IDENT_LIST '{' T_CLASS_MEMBERS '}'
| T_CLASS T_IDENT '{' T_CLASS_MEMBERS '}'
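If a grammar has many optional parts, the splitting can be done mechanically. The following Python sketch is a hypothetical helper (expand_optionals is an invented name, not part of Bison or any tool) that expands each non-nested "( ... )?" group in a rule body into plain alternatives. It assumes groups are not nested:

```python
import re
from itertools import product


def expand_optionals(rhs):
    """Expand each non-nested '( ... )?' group into plain BNF alternatives."""
    parts = re.split(r"(\([^()]*\)\?)", rhs)
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")?"):
            # An optional group is either absent or present.
            choices.append(["", part[1:-2].strip()])
        else:
            choices.append([part.strip()])
    return [" ".join(p for p in combo if p) for combo in product(*choices)]


if __name__ == "__main__":
    rule = "T_CLASS T_IDENT (T_EXTENDS T_IDENT_LIST)? '{' T_CLASS_MEMBERS '}'"
    for alternative in expand_optionals(rule):
        print("|", alternative)
```

Each optional group doubles the number of alternatives, which is why the separate rule with an empty alternative scales better once there are several of them.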
I don't think you can do it directly, because Bison rules are written in plain BNF. You would need a different tool, such as an LL(k) parser generator (ANTLR?), whose grammar syntax has an optional operator.
I think the most you can do is
statement : T_CLASS T_IDENT '{' T_CLASS_MEMBERS '}' { /* node for a class without parents */ }
| T_CLASS T_IDENT T_EXTENDS T_IDENT_LIST '{' T_CLASS_MEMBERS '}' { /* node for a class with parents */ }
;
I'm doing some experiments with ANTLR4 with this grammar:
srule
: '(' srule ')'
| srule srule
| '(' ')';
This grammar is for the language of balanced parentheses.
The problem is that when I run antlr with this string: (()))(
This string is obviously wrong, but ANTLR simply returns an AST without complaint.
It seems to stop when it finds the wrong parenthesis, but no error message is reported. I would like to know more about this behavior. Thank you
The parser recognises (()) and then stops, because the remaining input is never requested by the rule. If you want to force the parser to consume all tokens, "anchor" your test rule with the EOF token:
parse_all
: srule EOF
;
Btw, it's always a good idea to include the EOF token in the entry point (entry rule) of your grammar.
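The difference between matching a prefix and matching the whole input can be sketched in a few lines of Python. This is a hand-written recognizer for the same language, not what ANTLR generates; match_group, match_srule and parse_all are hypothetical names chosen for this illustration:

```python
def match_group(text, pos):
    """Match one '(' ... ')' group starting at pos; return the end index or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    # Zero or more nested groups may sit between the parentheses.
    while True:
        nxt = match_group(text, pos)
        if nxt == -1:
            break
        pos = nxt
    if pos < len(text) and text[pos] == ")":
        return pos + 1
    return -1


def match_srule(text):
    """Match as many groups as possible from the start; return characters consumed."""
    pos = 0
    while True:
        nxt = match_group(text, pos)
        if nxt == -1:
            return pos
        pos = nxt


def parse_all(text):
    """Like 'srule EOF': succeed only if srule consumes the entire input."""
    consumed = match_srule(text)
    return consumed > 0 and consumed == len(text)


if __name__ == "__main__":
    print(match_srule("(()))("))  # length of the balanced prefix
    print(parse_all("(()))("))    # the anchored check rejects the string
```

Without the anchor, a caller only learns how far the rule got; with it, leftover input becomes an error.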
I have a grammar that uses modes to do string interpolation:
Something along the lines of:
lexer grammar Example;
//default mode tokens
LBRACE: '{' -> pushMode(DEFAULT_MODE);
RBRACE: '}' -> popMode;
OPEN_STRING: '"' -> pushMode(STRING);
mode STRING;
ID_INTERPOLATION: '$' IDEN;
OPEN_EXPR_INTERPOLATION: '${' -> pushMode(DEFAULT_MODE);
TEXT: '$' | (~[$\r\n])+;
CLOSE_STRING: '"' -> popMode;
parser grammar ExampleParser;
options {tokenVocab = Example;}
test: string* EOF;
string: OPEN_STRING string_part* CLOSE_STRING;
string_part: TEXT | ID_INTERPOLATION | OPEN_EXPR_INTERPOLATION expr RBRACE;
//more rules that use LBRACE and RBRACE
Now this works and tokenizes everything mostly how I want it, but it does have 2 flaws:
1. If there are too many RBRACE tokens, the lexer can pop the first default mode, which can glitch out the IDE instead of just showing an error.
2. The token for closing a block and closing an interpolation is the same, so I cannot highlight them however I want. (This is the main one.)
My IDE highlights based on tokens only, so this is a problem: I'd like to be able to highlight them differently. So basically I'd like a solution that makes the RBRACE a different token when it closes an interpolation inside a string.
I'd prefer to do it without semantic predicates because I don't want to tie it down to a language, but if needed, I'm ok with it, I just might need a little more explanation because I haven't used them that much.
Thank you @sepp2k for helping me solve my issue.
It's a bit of a hack, but it does exactly what I need it to.
I solved it by changing the popMode on RBRACE to the following:
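RBRACE: '}' {
    // Only pop when a mode was pushed, so a stray '}' cannot empty the stack.
    if (_modeStack.size() > 0) {
        popMode();
        // Back inside a string: this '}' closes an interpolation, not a block.
        if (_mode != DEFAULT_MODE) {
            setType(EXPR_INTERPOLATION);
        }
    }
};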
I also changed my parser to be
string_part: TEXT | ID_INTERPOLATION | EXPR_INTERPOLATION expr EXPR_INTERPOLATION;
I know it's pretty hacky to change the token type under a specific circumstance, but it got the job done for me, so I'm gonna keep it unless I find a less hacky way to do this.
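For anyone who wants to see the effect without generating a lexer, here is a rough Python simulation of the mode stack. It is a sketch under simplifying assumptions, not ANTLR's implementation: lex is a hypothetical function, ID_INTERPOLATION is folded into TEXT, whitespace is skipped in the default mode, and every other default-mode character becomes a placeholder OTHER token.

```python
DEFAULT, STRING = "DEFAULT_MODE", "STRING"


def lex(src):
    """Tokenize src with a mode stack; return a list of (type, text) pairs."""
    tokens, stack, mode, i = [], [], DEFAULT, 0
    while i < len(src):
        ch = src[i]
        if mode == STRING:
            if src.startswith("${", i):
                stack.append(mode)  # pushMode(DEFAULT_MODE)
                mode = DEFAULT
                tokens.append(("EXPR_INTERPOLATION", "${"))
                i += 2
            elif ch == '"':
                mode = stack.pop()  # popMode; STRING is only entered via a push
                tokens.append(("CLOSE_STRING", ch))
                i += 1
            else:
                j = i
                while j < len(src) and src[j] != '"' and not src.startswith("${", j):
                    j += 1
                tokens.append(("TEXT", src[i:j]))
                i = j
        elif ch == "{":
            stack.append(mode)  # pushMode(DEFAULT_MODE)
            tokens.append(("LBRACE", ch))
            i += 1
        elif ch == "}":
            kind = "RBRACE"
            if stack:  # guard: a stray '}' must not pop an empty stack
                mode = stack.pop()
                if mode != DEFAULT:
                    kind = "EXPR_INTERPOLATION"  # this '}' closes an interpolation
            tokens.append((kind, ch))
            i += 1
        elif ch == '"':
            stack.append(mode)  # pushMode(STRING)
            mode = STRING
            tokens.append(("OPEN_STRING", ch))
            i += 1
        elif ch.isspace():
            i += 1
        else:
            tokens.append(("OTHER", ch))  # placeholder for the remaining tokens
            i += 1
    return tokens
```

Lexing "a${{x}}b" (with the surrounding double quotes) gives the inner '}' the type RBRACE and the outer one the type EXPR_INTERPOLATION, which is the distinction a token-based highlighter needs.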
So I set out to implement an interpolated string parser using only ANTLR code (no host language code blocks). I found that this works well, including nesting interpolated strings...
lexer grammar Lexer;
LeftBrace: '{';
RightBrace: '}' -> popMode;
Backtick: '`' -> pushMode(InterpolatedString);
Integer: [0-9]+;
Plus: '+';
mode InterpolatedString;
EscapedLeftBrace: '\\{' -> type(Grapheme);
EscapedBacktick: '\\`' -> type(Grapheme);
ExprStart: '{' -> type(LeftBrace), pushMode(DEFAULT_MODE);
End: '`' -> type(Backtick), popMode;
Grapheme: ~('{' | '`');
parser grammar Parser;
options {
tokenVocab = Lexer;
}
startRule: expression EOF;
interpolatedString: Backtick (Grapheme | interpolatedStringExpression)* Backtick;
interpolatedStringExpression: LeftBrace expression RightBrace;
expression
: expression Plus expression
| atom
;
atom: Integer | interpolatedString;
You can test it with input
`{`{`{`{`{`{`{`hello world`}`}`}`}`}`}`}`
I have never used ANTLR before, but now I have to migrate a grammar written for an older version to the latest one. I am trying to generate a lexer and parser for the C# target. I am stuck on migrating the start rule shown below.
grammar expr;
DQUOTE: '\"';
SQUOTE: '\'';
NEG : '-';
PLUS : '+';
OPEN : '(';
CLOSE : ')';
PERIOD: '.';
COMMA : ',';
start returns [Expression value]
:
expression EOF { $value = $expression.value; }
;
expression returns [Expression value]
:
literal { $value = $literal.value; }
| name { $value = $name.value; }
| functionCall { $value = $functionCall.value; }
;
I get the following error.
syntax error:
mismatched input '[Expression value]' expecting ARG_ACTION while
matching a rule.
I have already come across the post "Troubles with returns declaration on the first parser rule in an ANTLR4 grammar", but Sam's response has not helped me figure out what I should be changing in my case.
I would appreciate it if anyone could let me know the equivalent of the start rule in the latest grammar syntax.
The answer you linked appears to be applicable to your case. Move lexer rules (i.e. those starting with uppercase letters, DQUOTE and so on) after parser rules like start.
I am trying to write an ANTLR4 parser for a templating language similar to mustache. This uses {{...}} tags interspersed in a normal text file. If the template needs to contain and emit { next to an OPEN_TAG {{, there can be a problem with the lexer/parser. I believe there should be a way to write the parser such that:
This is a left brace {{{tag logic}} and here are two left braces {{{tag logic}}{
translates to
This is a left brace { and here are two left braces {{
Either:
1. How can I tell the lexer to only match OPEN_TAG to {{ followed by anything but {, absorbing the leading {s into the previous TEXT pattern?
2. Is there a "better" way to provide an escape sequence for {{?
Thanks!
Handle the tags in the lexer using a mode.
LBrace : '{' ;
RBrace : '}' ;
TOpenTag : '{{' -> pushMode(tagLogic) ;
mode tagLogic ;
TLBrace : '{' -> type(LBrace) ;
TRBrace : '}' -> type(RBrace) ;
TCloseTag : '}}' -> popMode ;
TLogic : [a-zA-Z0-9]+ ;
TWs : [ \t\r\n]+ -> skip ;
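For comparison, the rule asked about in the question ("{{" counts as OPEN_TAG only when it is not followed by another "{") can be prototyped outside ANTLR with a regular-expression negative lookahead. This is a Python sketch for checking the expected output, not an ANTLR solution; split_template and render_text_only are hypothetical names, and tags are simply dropped when rendering:

```python
import re

# '{{' opens a tag only when the next character is not another '{'.
TAG = re.compile(r"\{\{(?!\{)(.*?)\}\}", re.DOTALL)


def split_template(template):
    """Split a template into ('TEXT', ...) and ('TAG', ...) pieces."""
    tokens = []
    last = 0
    for m in TAG.finditer(template):
        if m.start() > last:
            tokens.append(("TEXT", template[last:m.start()]))
        tokens.append(("TAG", m.group(1)))
        last = m.end()
    if last < len(template):
        tokens.append(("TEXT", template[last:]))
    return tokens


def render_text_only(template):
    """Concatenate the TEXT pieces, treating every tag as producing no output."""
    return "".join(text for kind, text in split_template(template) if kind == "TEXT")
```

Because the regex engine tries each starting position from left to right, any extra leading braces fail the lookahead and end up in the preceding TEXT piece.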
I'm using ANTLR4 and I'm trying to write a parser for Matlab. One of the main issues is that strings and transpose both use single quotes. The solution I was thinking of is to define the STRING lexer rule roughly as follows:
(if the previous token is not ')', '}', ']' or [a-zA-Z0-9]) then match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note that I do not want to consume the previous token if the condition is true).
Does anyone know a workaround for this problem, given that ANTLR4 does not support negative lookahead?
You can emulate this in ANTLR4 with a semantic predicate that inspects the previous character via _input.LA(-1) (in Java; see "How to resolve simple ambiguity" or "ANTLR4 negative lookahead in lexer").
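The decision itself is easy to prototype before wiring it into a predicate. The Python sketch below looks at the character immediately before each single quote, using the set of characters listed in the question. scan_quotes is a hypothetical helper written for this illustration; it ignores escape sequences and reports only the quote-related tokens:

```python
def scan_quotes(src):
    """Classify each single quote in src as TRANSPOSE or the start of a STRING."""
    tokens = []
    i = 0
    while i < len(src):
        if src[i] != "'":
            i += 1
            continue
        prev = src[i - 1] if i > 0 else ""
        # After ')', '}', ']' or an ASCII letter/digit the quote is a transpose.
        if prev and (prev in ")}]" or (prev.isascii() and prev.isalnum())):
            tokens.append(("TRANSPOSE", "'"))
            i += 1
            continue
        # Otherwise try to read a string literal that ends on the same line.
        end = i + 1
        while end < len(src) and src[end] not in "'\r\n":
            end += 1
        if end < len(src) and src[end] == "'":
            tokens.append(("STRING", src[i:end + 1]))
            i = end + 1
        else:
            tokens.append(("ERROR", "'"))  # unterminated string
            i += 1
    return tokens
```

Note that this checks the previous character, not the previous token, so whitespace between an operand and the quote changes the result; a real lexer would have to decide how to treat that case.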
You can also use lexer modes to deal with this kind of thing, but then your lexer has to be defined in its own file, because modes are only allowed in lexer grammars. The idea is to switch from one state that matches some tokens to another that matches different ones.
Here is an example from the ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ---------------------
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;