Trouble migrating antlr grammar - antlr

I have never used antlr in past, but now have to migrate grammar for an older version to the latest. I am trying to generate lexer and parser for c# target. I am stuck on migrating the start rule seen below.
grammar expr;
DQUOTE: '\"';
SQUOTE: '\'';
NEG : '-';
PLUS : '+';
OPEN : '(';
CLOSE : ')';
PERIOD: '.';
COMMA : ',';
start returns [Expression value]
:
expression EOF { $value = $expression.value; }
;
expression returns [Expression value]
:
literal { $value = $literal.value; }
| name { $value = $name.value; }
| functionCall { $value = $functionCall.value; }
;
I get the following error.
syntax error:
mismatched input '[Expression value]' expecting ARG_ACTION while
matching a rule.
I have already come across a post Troubles with returns declaration on the first parser rule in an ANTLR4 grammar. But Sam's response has not helped me figure out what I should be changing in my case.
I would appreciate if anyone could let me know the equivalent of the start rule in latest grammar.

The answer you linked appears to be applicable to your case. Move lexer rules (i.e. those starting with uppercase letters, DQUOTE and so on) after parser rules like start.

Related

String Interpolation in Antlr4

I have a grammar that uses modes to do string interpolation:
Something along the lines of:
lexer grammar Example;
//default mode tokens
LBRACE: '{' -> pushMode(DEFAULT_MODE);
RBRACE: '}' -> popMode;
OPEN_STRING: '"' -> pushMode(STRING);
mode STRING;
ID_INTERPOLATION: '$' IDEN;
OPEN_EXPR_INTERPOLATION: '${' -> pushMode(DEFAULT_MODE);
TEXT: '$' | (~[$\r\n])+;
CLOSE_STRING: '"' -> popMode;
parser grammar ExampleParser;
options {tokenVocab = Example;}
test: string* EOF;
string: OPEN_STRING string_part* CLOSE_STRING;
string_part: TEXT | ID_INTERPOLATION | OPEN_EXPR_INTERPOLATION expr RBRACE;
//more rules that use LBRACE and RBRACE
Now this works and tokenizes everything mostly how I want it, but it does have 2 flaws.
if the number of RBRACES goes too far, it can pop the first default mode which can glitch out the IDE, and does not just show an error.
The token for closing a block and closing interpolation is the same, so I cannot highlight them however I want. (this is the main one)
My IDE highlights based on tokens only, so this is a problem, I'd like to be able to highlight them differently. So basically I'd like a solution for this that makes the RBRACE a different token when it's in a string.
I'd prefer to do it without semantic predicates because I don't want to tie it down to a language, but if needed, I'm ok with it, I just might need a little more explanation because I haven't used them that much.
Thank you #sepp2k for helping me solve my issue.
It's a bit of a hack but it does exactly what I need it to
I solved it by changing my popMode on RBRACE to be the following:
RBRACE: '}' {
if(_modeStack.size() > 0) {
popMode();
if(_mode != DEFAULT_MODE) {
setType(EXPR_INTERPOLATION);
}
}
};
I also changed my parser to be
string_part: TEXT | ID_INTERPOLATION | EXPR_INTERPOLATION expr EXPR_INTERPOLATION;
I know it's pretty hacky to change the token type under a specific circumstance, but it got the job done for me, so I'm gonna keep it unless I find a less hacky way to do this.
So I set out to implement an interpolated string parser with using only ANTLR code (no host language code blocks). I found that this works well, including nesting interpolated strings...
lexer grammar Lexer;
LeftBrace: '{';
RightBrace: '}' -> popMode;
Backtick: '`' -> pushMode(InterpolatedString);
Integer: [0-9]+;
Plus: '+';
mode InterpolatedString;
EscapedLeftBrace: '\\{' -> type(Grapheme);
EscapedBacktick: '\\`' -> type(Grapheme);
ExprStart: '{' -> type(LeftBrace), pushMode(DEFAULT_MODE);
End: '`' -> type(Backtick), popMode;
Grapheme: ~('{' | '`');
parser grammar Parser;
options {
tokenVocab = Lexer;
}
startRule: expression EOF;
interpolatedString: Backtick (Grapheme | interpolatedStringExpression)* Backtick;
interpolatedStringExpression: LeftBrace expression RightBrace;
expression
: expression Plus expression
| atom
;
atom: Integer | interpolatedString;
You can test it with input
`{`{`{`{`{`{`{`hello world`}`}`}`}`}`}`}`

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If use this grammar, and use the sample text: abc and d I get a weird tree with unexpected structure as shown below(using IntelliJ and ANTLR4 plug in):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is because attribute references are not (currently) supported in the lexer. Still, it could be done on the lexer rule - just would require a slight bit more effort to dig to the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

enterDecision(int) in the type DebugEventListener is not applicable for the arguments (int, boolean)?

I am using ANTLR 3.1.3 to generate the parser. After importing the generated testParser, I found there is several errors like
try { dbg.enterDecision(2, decisionCanBacktrack[2]);
Description Resource Path Location Type
The method enterDecision(int) in the type DebugEventListener is not applicable for the arguments (int, boolean) testParser.java /ANTLRTest/src line 280 Java Problem
If I changed to dbg.enterDecision(2), then everything is fine.
The grammar is as follows,
grammar Test;
options {output=AST;}
expr : mexpr (PLUS^ mexpr)* SEMI! ;
mexpr : atom (STAR^ atom)* ;
atom: INT ;
//class csharpTestLexer extends Lexer;
WS : (' ' | '\t' | '\n' | '\r') { $channel = HIDDEN; } ;
LPAREN: '(' ;
RPAREN: ')' ;
STAR: '*' ;
PLUS: '+' ;
SEMI: ';' ;
DIGIT : '0'..'9' ;
INT : (DIGIT)+ ;
I am using ANTLRWorks 1.4.3 to generate lexer and parser.
JDK 1.6
Any reason to this error?
It looks like you've generated a lexer and parser with an ANTLR version that is different than the one you added to Eclipse's classpath.
If you generate a lexer and/or parser with ANTLRWorks 1.4.3 (which contains ANTLR 3.4), you should also add ANTLR 3.4 to your project's build path in Eclipse and remove ANTLR 3.1.3 from it.
BTW, I get error for 2 + 2 * 3, do you know why? anything wrong with the above grammar.
That is because single digit numbers are being tokenized as DIGIT tokens. Either make DIGIT a fragment:
fragment DIGIT : '0'..'9' ;
INT : (DIGIT)+ ;
or remove it:
INT : '0'..'9'+ ;
See: What does "fragment" mean in ANTLR?