ANTLR -- use predicates to insert a token

I am trying to understand ANTLR predicates. To that end,
I have a simple lexer and parser, shown below.
What I would like to do is use a predicate to insert the word "fubar" every time it sees "foo" followed by some whitespace and then "bar". I want to do this while keeping the same basic structure. Bonus points for doing it in the lexer. Further bonus points if I can do it without referring to the underlying language at all. But if necessary, it is C#.
For example, if the input string is:
programmers use the words foo bar and bar foo class
the output would be
programmers use the words foo fubar bar and bar foo class
Lexer:
lexer grammar TextLexer;
@members {
    protected const int EOF = Eof;
    protected const int HIDDEN = Hidden;
}
FOO: 'foo';
BAR: 'bar';
TEXT: [a-z]+ ;
WS
: ' ' -> channel(HIDDEN)
;
Parser:
parser grammar TextParser;
options { tokenVocab=TextLexer; }
@members {
    protected const int EOF = Eof;
}
file: words EOF;
word:FOO
|BAR
|TEXT;
words: word
| word words
;
compileUnit
: EOF
;

ANTLR3's lexer might have needed a predicate in this case, but ANTLR4's lexer is much "smarter". You can match "foo bar" in a single lexer rule and change its inner text with setText(...):
FOO_BAR
: 'foo' [ \t]+ 'bar' {setText("fubar");}
;
TEXT
: [a-z]+
;
WS
: ' ' -> channel(HIDDEN)
;
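For completeness, here is a minimal Java sketch to see the effect (assuming the rules above live in a lexer grammar called TextLexer and you generate it for the Java target; with the C# target it would presumably be the Text property instead of setText):
import org.antlr.v4.runtime.*;

public class FubarDemo {
    public static void main(String[] args) {
        String source = "programmers use the words foo bar and bar foo class";
        // TextLexer is the lexer generated from the rules above (name taken from the question)
        TextLexer lexer = new TextLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            // "foo bar" should come out as a single FOO_BAR token whose text is "fubar"
            System.out.printf("%-10s %s\n",
                    TextLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}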

Related

Getting plain text in antlr instead of tokens

I'm trying to create a parser using ANTLR. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rule falls on the lexer side, any text entered is tokenized and I get a list of tokens instead of a single string token. The ANTLR profiler in IntelliJ also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of the multiple tokens this grammar generates.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places whitespace on the hidden channel, yet later in the grammar you have a SPACES rule. Since this SPACES rule is placed after WS and matches exactly the same input, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of the multiple tokens this grammar generates.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
    public static void main(String[] args) {
        String source = "<%FOO%>FOO BAR<<123>>456 mu!";
        DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        tokenStream.fill();
        for (Token t : tokenStream.getTokens()) {
            System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
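Wiring the two grammars together is no different from a combined grammar; a rough sketch (assuming the parser grammar is completed with the codeBlock and tag rules from the question) could look like this:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class ParseDemo {
    public static void main(String[] args) {
        DemoLexer lexer = new DemoLexer(CharStreams.fromString("<%FOO%>FOO BAR<<123>>456 mu!"));
        DemoParser parser = new DemoParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.code(); // 'code' is the start rule shown above
        System.out.println(tree.toStringTree(parser));
    }
}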
EDIT
[...] but I am a bit confused about the TEXT : ( ~[<] | '<' ~[<%] )+; rule. Can you elaborate on what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicates are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target-specific code into your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.
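For example, a hypothetical predicate that gates the tag alternatives on a runtime flag might look roughly like the sketch below (the tagsEnabled field and the gating itself are made up purely for illustration); the action code is plain Java, which ties the grammar to the Java target:
@parser::members {
    // plain Java: this ties the grammar to the Java target
    boolean tagsEnabled = true;
}

codeBlock
    : {tagsEnabled}? tag1Ops
    | {tagsEnabled}? tag2Ops
    | text
    ;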

Why am I getting "mismatched input 'addr' expecting {<EOF>, 'addr'}"

Given this g4 grammar:
grammar smaller;
root
: ( componentDefinition )* EOF;
componentDefinition
: Addr
Id?
Lbrace
Rbrace
Semi
;
ExprElem
: Num
| Id
;
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
//------------------------------------------------
// Whitespace and Comments
//------------------------------------------------
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
};
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
addr
id
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your ExprElem rule is a lexer rule, not a parser rule (it begins with an uppercase letter), and it is masking the Addr rule, so: no Addr :(
Also, ExprElem is a lexer rule that relies on the Id and Num rules. Consequently, when an Id is found, the ANTLR lexer gives it the ExprElem token type, not the Id token type.
So, two options: you can either rewrite your ExprElem rule as exprElem (assuming you want a parser rule):
exprElem : Num | Id;
or you can keep ExprElem as a lexer rule that uses the Id token, but then you need something that differentiates ExprElem from Id (example below, though I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
ExprElem
: Sharp Num // This token uses others but defines its own 'pattern'
| Sharp Id
;
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
This is most likely not what you want, but I put it here to illustrate how a lexer rule can reuse others.
When in doubt about what your tokens do, do not hesitate to display the recognized tokens. Here is the Java code fragment I often use (I named your grammar Test in this case):
public class Main {
    public static void main(String[] args) throws InterruptedException {
        String txt =
            "addr Basic {\n"
            + "\n"
            + "};";
        TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        parser.root();
        for (Token t : tokens.getTokens()) {
            System.out.println(t);
        }
    }
}
NOTE: by the way, Num will never be recognized, as the Id rule can match the same thing. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-Z_];

ANTLR lexer -- can one prefer the shorter match?

Here is a simple lexer grammar:
lexer grammar TextLexer;
@members {
    protected const int EOF = Eof;
    protected const int HIDDEN = Hidden;
}
COMMENT: 'comment' .*? 'end' -> channel(HIDDEN);
WORD: [a-z]+ ;
WS
: ' ' -> channel(HIDDEN)
;
For the most part, it behaves as expected, grabbing the words out of the stream and ignoring anything bounded by comment ... end. But not always. For example, if the input is the following:
quick brown fox commentandending
it will see that the word "commentandending" is longer than the comment "commentandend", so it comes out with a single token "commentandending" rather than a comment followed by the token "ing".
Is there a way to change that behavior?
This grammar will solve the problem in ANTLR4:
lexer grammar TextLexer;
COMMENT_BEGIN: 'comment' -> more, pushMode(MCOMMENT);
WORD_BEGIN: [a-z] -> more, pushMode(MWORD);
WS: ' ' -> channel(HIDDEN);
mode MCOMMENT;
COMMENT: .+? 'end' -> mode(DEFAULT_MODE);
mode MWORD;
WORD: [a-z]+ -> mode(DEFAULT_MODE);
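A quick sketch to inspect the resulting token boundaries (assuming the grammar above is generated for the Java target; the ShortMatchDemo class name is just made up):
import org.antlr.v4.runtime.*;

public class ShortMatchDemo {
    public static void main(String[] args) {
        TextLexer lexer = new TextLexer(CharStreams.fromString("quick brown fox commentandending"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            // the modes should yield a COMMENT token "commentandend" followed by a WORD token "ing"
            System.out.printf("%-10s %s\n",
                    TextLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }
}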

Selectively Skip Newline Depending on Context

I must parse files made of two parts. In the first one, new lines must be skipped. In the second one, they are important and used as a delimiter.
I want to avoid solutions like http://www.antlr.org/wiki/pages/viewpage.action?pageId=1734 and use a predicate instead.
For the moment, I have something like:
WS: ( ' ' | '\t' | NEWLINE) {SKIP();};
fragment NEWLINE : '\r'|'\n'|'\r\n';
I tried to add a dynamically scoped variable keepNewline that is set to true when "entering" the second part of the file.
However, I am not able to create the correct predicate to switch off the "skipping" of newlines.
Any help would be greatly appreciated.
Best regards.
It's easier than you might think: you don't even need a predicate.
Let's say you want to preserve line breaks only inside <pre>...</pre> tags. The following dummy grammar does just that:
grammar Pre;
@lexer::members {
    private boolean keepNewLine = false;
}
parse
: (t=.
{
System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text.replace("\n", "\\n"));
}
)*
EOF
;
Word
: ('a'..'z' | 'A'..'Z')+
;
OPr
: '<pre>' {keepNewLine = true;}
;
CPr
: '</pre>' {keepNewLine = false;}
;
NewLine
: ('\r'? '\n' | '\r') {if(!keepNewLine) skip();}
;
Space
: (' ' | '\t') {skip();}
;
which you can test with the class:
import org.antlr.runtime.*;
public class Main {
    public static void main(String[] args) throws Exception {
        PreLexer lexer = new PreLexer(new ANTLRFileStream("in.txt"));
        PreParser parser = new PreParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}
And if in.txt contains:
foo bar
<pre>
a
b
</pre>
baz
the output of running the Main class would be:
Word 'foo'
Word 'bar'
OPr '<pre>'
NewLine '\n'
Word 'a'
NewLine '\n'
NewLine '\n'
Word 'b'
NewLine '\n'
CPr '</pre>'
Word 'baz'

How to find the length of a token in antlr?

I am trying to create a grammar which accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question more clear with an example.
I wrote the following code:
grammar first;
tokens {
SET = 'set';
VAL = 'val';
UND = 'und';
CON = 'con';
ON = 'on';
OFF = 'off';
}
@parser::members {
    private boolean inbounds(Token t, int min, int max) {
        int n = Integer.parseInt(t.getText());
        return n >= min && n <= max;
    }
}
parse : SET expr;
expr : VAL('u'('e')?)? String |
UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF) |
CON('n'('e'('c'('t')?)?)?)? oneChar
;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
dot : .;
oneChar : dot { $dot.text.length() == 1;} ;
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
Accept commands like 'set value abc', 'set underli on', 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
The third syntax, 'set connect oneChar', should accept any single character: a digit, a letter, or any special character. I am getting a compiler error in the generated parser file because of this.
The first syntax, 'set value', should accept all possible strings, even on and off. But when I give something like 'set value offer', the grammar fails. I think this is happening because I already have a token 'OFF'.
None of the three requirements listed above works correctly in my grammar, and I don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
{$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semicolon at the end. So it should be:
{$dot.text.length() == 1}?
instead.
#2
You should not be handling these alternative commands:
expr
: VAL('u'('e')?)? String
| UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
| CON('n'('e'('c'('t')?)?)?)? oneChar
;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
expr
: VAL String
| UND (ON | OFF)
| CON oneChar
;
// ...
VAL : 'val' ('u' ('e')?)?;
UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(also see #5!)
#3
Your lexer rules:
CHAR : 'a'..'z';
DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or String. Ideally, you should only create String tokens, since a String can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
String : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only String tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never become tokens of their own (and can therefore never appear in any parser rule!).
#4
The rule:
dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character, but in parser rules it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) operates on those tokens (not characters!). Make sure you understand this! (If not, ask for clarification or grab a book about ANTLR.)
- an example -
Take the following grammar:
p : . ;
A : 'a' | 'A';
B : 'b' | 'B';
The parser rule p will now match any token that the lexer produces, which is only an A or B token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
prs : . ;
FOO : 'a';
BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a' since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, which is either FOO or BAR.
#5
Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a 'u' as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
pRule : 'u' ':' String
String : ...
but do:
pRule : U ':' String
U : 'u';
String : ...
You could make ':' a lexer rule too, but that is of less importance. The 'u', however, can also be a String, so it must appear as a lexer rule before the String rule.
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
grammar first;
parse
: (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
;
expr
: VAL String {System.out.print("A :: ");}
| UL (ON | OFF) {System.out.print("B :: ");}
| CON oneChar {System.out.print("C :: ");}
;
oneChar
: String {$String.text.length() == 1}?
;
SET : 'set';
VAL : 'val' ('u' ('e')?)?;
UL : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
ON : 'on';
OFF : 'off';
String : (CHAR | DIGIT)+;
fragment CHAR : 'a'..'z' | 'A'..'Z';
fragment DIGIT : '0'..'9';
Space : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
import org.antlr.runtime.*;
public class Main {
    public static void main(String[] args) throws Exception {
        String source =
            "set value abc \n" +
            "set underli on \n" +
            "set conn x \n" +
            "set conn xy ";
        ANTLRStringStream in = new ANTLRStringStream(source);
        firstLexer lexer = new firstLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        firstParser parser = new firstParser(tokens);
        System.out.println("parsing:\n======\n" + source + "\n======");
        parser.parse();
    }
}
which, after generating the lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool first.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main
prints the following output:
parsing:
======
set value abc
set underli on
set conn x
set conn xy
======
A :: expr = value abc
B :: expr = underli on
C :: expr = conn x
line 0:-1 rule oneChar failed predicate: {$String.text.length() == 1}?
C :: expr = conn xy
As you can see, the last command, C :: expr = conn xy, produces an error, as expected.