How do I combine ANTLR lexer rules from different lexer grammars? - antlr

I am playing around with ANTLR and building a DSL like jsx.
But in jsx you have javasciprt expressions inside {}, how can I define those ECMAScript rules in my lexer grammar? Should I rewrite the whole ECMAScript lexer from scratch or there are some ways to import those rules in my lexer grammar?
<div class={javasciprt expression in here}></div>

You can use lexical modes for that:
lexer grammar YourCurrentLexer;
EXISTING_TOKEN
: '...'
;
// other tokens
ECMA_START
: '{' -> pushMode(ECMA_MODE)
;
mode ECMA_MODE;
ECMA_TOKEN
: '...'
;
// other ECMA tokens
// Support nested { ... }
OPEN_BRACE
: '{' -> pushMode(ECMA_MODE)
;
CLOSE_BRACE
: '}' -> popMode
;
Should I rewrite the whole ECMAScript lexer from scratch or there are some ways to import those rules in my lexer grammar?
You cannot just import an existing grammar inside a lexical mode. Just copy-paste the rules you want: https://github.com/antlr/grammars-v4/blob/master/javascript/ecmascript/JavaScript/ECMAScript.g4

Related

String Interpolation in Antlr4

I have a grammar that uses modes to do string interpolation:
Something along the lines of:
lexer grammar Example;
//default mode tokens
LBRACE: '{' -> pushMode(DEFAULT_MODE);
RBRACE: '}' -> popMode;
OPEN_STRING: '"' -> pushMode(STRING);
mode STRING;
ID_INTERPOLATION: '$' IDEN;
OPEN_EXPR_INTERPOLATION: '${' -> pushMode(DEFAULT_MODE);
TEXT: '$' | (~[$\r\n])+;
CLOSE_STRING: '"' -> popMode;
parser grammar ExampleParser;
options {tokenVocab = Example;}
test: string* EOF;
string: OPEN_STRING string_part* CLOSE_STRING;
string_part: TEXT | ID_INTERPOLATION | OPEN_EXPR_INTERPOLATION expr RBRACE;
//more rules that use LBRACE and RBRACE
Now this works and tokenizes everything mostly how I want it, but it does have 2 flaws.
if the number of RBRACES goes too far, it can pop the first default mode which can glitch out the IDE, and does not just show an error.
The token for closing a block and closing interpolation is the same, so I cannot highlight them however I want. (this is the main one)
My IDE highlights based on tokens only, so this is a problem, I'd like to be able to highlight them differently. So basically I'd like a solution for this that makes the RBRACE a different token when it's in a string.
I'd prefer to do it without semantic predicates because I don't want to tie it down to a language, but if needed, I'm ok with it, I just might need a little more explanation because I haven't used them that much.
Thank you #sepp2k for helping me solve my issue.
It's a bit of a hack but it does exactly what I need it to
I solved it by changing my popMode on RBRACE to be the following:
RBRACE: '}' {
if(_modeStack.size() > 0) {
popMode();
if(_mode != DEFAULT_MODE) {
setType(EXPR_INTERPOLATION);
}
}
};
I also changed my parser to be
string_part: TEXT | ID_INTERPOLATION | EXPR_INTERPOLATION expr EXPR_INTERPOLATION;
I know it's pretty hacky to change the token type under a specific circumstance, but it got the job done for me, so I'm gonna keep it unless I find a less hacky way to do this.
So I set out to implement an interpolated string parser with using only ANTLR code (no host language code blocks). I found that this works well, including nesting interpolated strings...
lexer grammar Lexer;
LeftBrace: '{';
RightBrace: '}' -> popMode;
Backtick: '`' -> pushMode(InterpolatedString);
Integer: [0-9]+;
Plus: '+';
mode InterpolatedString;
EscapedLeftBrace: '\\{' -> type(Grapheme);
EscapedBacktick: '\\`' -> type(Grapheme);
ExprStart: '{' -> type(LeftBrace), pushMode(DEFAULT_MODE);
End: '`' -> type(Backtick), popMode;
Grapheme: ~('{' | '`');
parser grammar Parser;
options {
tokenVocab = Lexer;
}
startRule: expression EOF;
interpolatedString: Backtick (Grapheme | interpolatedStringExpression)* Backtick;
interpolatedStringExpression: LeftBrace expression RightBrace;
expression
: expression Plus expression
| atom
;
atom: Integer | interpolatedString;
You can test it with input
`{`{`{`{`{`{`{`hello world`}`}`}`}`}`}`}`

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If i try to parse the input "Hello", everything is OK, BUT if if modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?
You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

How do I ignore arbitrary stuff inside braces in ANTLR?

I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
Something like:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to screw up the whole thing - like if I put .*? -> channel(HIDDEN), it doesn't seem to work at all (of course, there is an error when using ANTLR on the grammar to the tune of ".*? can match the empty string" - but what should I put there if not that, so that it ignores the C code, but in such a way that I can access it later (i.e., not skipping it)?
Your C_BLOCK rule can be defined just like the usual multi line comment rule is done in so many languages. Make the curly braces part of the rule too:
C_BLOCK: CURLY .*? CURLY -> channel(HIDDEN);
If you need to nest blocks you write something like:
C_BLOCK: CURLY .*? C_BLOCK? .*? CURLY -> channel(HIDDEN);
or maybe:
C_BLOCK:
CURLY (
C_BLOCK
| .
)*?
CURLY
;
(untested).
Update: changed code to use the non-greedy kleene operator as suggested by a comment.

Match lowercase with ANTLR

I use ANTLRWorks for a simple grammar:
grammar boolean;
// [...]
lowercase_string
: ('a'..'z')+ ;
However, the lowercase_string doesn't match foobar according to the Interpreter (MismatchedSetException(10!={}). Ideas?
You can't use the .. operator inside parser rules like that. To match the range 'a' to 'z', create a lexer rule for it (lexer rules start with a capital).
Try it like this:
lowercase_string
: Lower+
;
Lower
: 'a'..'z'
;
or:
lowercase_string
: Lower
;
Lower
: 'a'..'z'+
;
Also see this previous Q&A: Practical difference between parser rules and lexer rules in ANTLR?