ANTLR lexer -- can one prefer the shorter match? - antlr

Here is a simple lexer grammar:
lexer grammar TextLexer;
#members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}
COMMENT: 'comment' .*? 'end' -> channel(HIDDEN);
WORD: [a-z]+ ;
WS
: ' ' -> channel(HIDDEN)
;
For the most part, it behaves as expected, grabbing the words out of the stream, and ignoring anything bounded by comment . . . end. But not always. For example, if the input is the following:
quick brown fox commentandending
it will see that the word "commentandending" is longer than the comment "commentandend". So it comes out with a token "commentandending" rather than a token "ing".
Is there a way to change that behavior?

This grammar will solve the problem in ANTLR4:
lexer grammar TextLexer;
COMMENT_BEGIN: 'comment' -> more,pushMode(MCOMMENT);
WORD_BEGIN: [a-z] -> more, pushMode(MWORD);
WS: ' ' -> channel(HIDDEN);
mode MCOMMENT;
COMMENT: .+? 'end'-> mode(DEFAULT_MODE);
mode MWORD;
WORD: [a-z]+ -> mode(DEFAULT_MODE);

Related

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Antlr4: How can I match end of lines inside multiline comments?

I have to create a program that counts lines of code ignoring those inside a comment. I'm a newbie working with Antlr, and after trying a lot, the nearest I came to a solution is this erroneous grammar:
grammar Comments;
comment : startc content endc;
startc : '/*';
endc : '*/';
content : newline | contenttext;
contenttext : CONTENTCHARS+;
newline : '\r\n';
CONTENTCHARS
: ~'*' '/'
| ~'/' .
;
WS : [ \r\t]+ -> skip;
If I try with /*hello\r\nworld*/ the parser recognizes this, which is erroneous:
In order to count lines, the parser needs to detect newline characters, inside and outside multiline comments. I think my problem is that I don't know how to say "match everything inside /* and */ except \r\n.
Please, can you point me in the right direction? Any help will be appreciated.
Solution
Let's simplify your grammar! In the grammar we will ignore whitespace characters and comments at the lexer stage (and the unwanted newlines at the same time!). For example the COMMENT section will match one line comments or multi-line comments and just skip them!
Next, we will introduce counter variable for counting NEWLINE tokens that are used only in content grammar rule (because COMMENT token is skipped so the NEWLINE token in it!).
Whenever we encounter a NEWLINE token we increment the counter variable.
grammar Comments;
#lexer::members {
int counter = 0;
}
WS : [ \r\t]+ -> skip;
COMMENT : '/*' .*? '*/' NEWLINE? -> skip;
TEXT : [a-zA-Z0-9]+;
NEWLINE : '\r'? '\n' { {System.out.println("Newlines so far: " + (++counter)); } };
content: (TEXT | COMMENT | NEWLINE )* EOF;

ANTLR String interpolation

I'm trying to write an ANTLR grammar that parses string interpolation expressions such as:
my.greeting = "hello ${your.name}"
The error I get is:
line 1:31 token recognition error at: 'e'
line 1:34 no viable alternative at input '<EOF>'
MyParser.g4:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
program: variable EQ expression EOF;
expression: (string | variable);
variable: (VAR DOT)? VAR;
string: (STRING_SEGMENT_END expression)* STRING_END;
MyLexer.g4:
lexer grammar MyLexer;
START_STR: '"' -> more, pushMode(STRING_MODE) ;
VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
EQ: '=';
DOT: '.';
WHITE_SPACE: (SPACE | NEW_LINE | TAB)+ -> skip;
fragment DIGIT: '0'..'9';
fragment LOWERCASE: 'a'..'z';
fragment UPPERCASE: 'A'..'Z';
fragment ANY_CHAR: LOWERCASE | UPPERCASE | DIGIT;
fragment NEW_LINE: '\n' | '\r' | '\r\n';
fragment SPACE: ' ';
fragment TAB: '\t';
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
mode STRING_MODE;
STRING_END: '"' -> popMode;
STRING_SEGMENT_END: '${' -> pushMode(INTERPOLATION_MODE);
TEXT : ~["$]+ -> more ;
Expressions like the following work fine:
my.greeting = "hello"
my.greeting = "hello ${} world"
Any ideas what I might be doing wrong?
Instead of:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
you could try:
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR* -> type(VAR);
I_DOT: '.' -> type(DOT);
...
variable: (VAR DOT)? VAR;
Ok, I've worked out (inspired by this) that I need to define the default lexer rules again in the INTERPOLATION_MODE:
MyLexer.g4:
...
mode INTERPOLATION_MODE;
STRING_SEGMENT_START: '}' -> more, popMode;
I_VAR: (UPPERCASE|LOWERCASE) ANY_CHAR*;
I_DOT: '.';
...
MyParser.g4:
...
variable: ((VAR|I_VAR) (DOT|I_DOT))? (VAR|I_VAR);
...
This seems overkill though, so still holding out for someone with a better answer.
String interpolation also implemented in existing C# and PHP grammars in official ANTLR grammars repository.

ANTLR -- use predicates to insert a token

I am trying to understand ANTLR predicates. To that end,
I have a simple lexer and parser, shown below.
What I would like to do is use a predicate to insert the word "fubar" every time it sees "foo" followed by some whitespace and then "bar". I want to do this while keeping the same basic structure. Bonus points for doing it in the lexer. Further bonus points if I can do it without referring to the underlying language at all. But if necessary, it is C#.
For example, if the input string is:
programmers use the words foo bar and bar foo class
the output would be
programmers use the words foo fubar bar and bar foo class
Lexer:
lexer grammar TextLexer;
#members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}
FOO: 'foo';
BAR: 'bar';
TEXT: [a-z]+ ;
WS
: ' ' -> channel(HIDDEN)
;
Parser:
parser grammar TextParser;
options { tokenVocab=TextLexer; }
#members
{
protected const int EOF = Eof;
}
file: words EOF;
word:FOO
|BAR
|TEXT;
words: word
| word words
;
compileUnit
: EOF
;
ANTLR3's lexer might have needed a predicate in this case, but ANTLR4's lexer is much "smarter". You can match "foo bar" in a single lexer rule and change its inner text with setText(...):
FOO_BAR
: 'foo' [ \t]+ 'bar' {setText("fubar");}
;
TEXT
: [a-z]+
;
WS
: ' ' -> channel(HIDDEN)
;

Simple grammar not working

I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figuring out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other then "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K" and will therefor throw an exception. Note that the lexer doesn't remove "K" and then tries to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".