Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rule is built from lexer tokens, any text entered is tokenized and I get a list of tokens instead of a single string token. The ANTLR profiler in IntelliJ also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.

First: you have a WS rule that places whitespace characters on the hidden channel, yet later in the grammar you have a SPACES rule. Because SPACES is defined after WS and matches exactly the same input, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
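To see what the mode stack is doing without generating a lexer, here is a minimal plain-Java imitation of the demo grammar. It is only a sketch under stated assumptions: the class ModeLexerSketch, the "push"/"pop"/"skip" action strings, and the "NAME:text" output format are made up for illustration, java.util.regex stands in for ANTLR's matching, and hidden-channel tokens and EOF are simply left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java imitation of a two-mode lexer; this is not ANTLR-generated code. */
final class ModeLexerSketch {
    // Each rule: { name, regex, action }. The action strings are hypothetical.
    private static final String[][] DEFAULT_MODE = {
        {"START_1_TAG", "<%", "push"},
        {"START_2_TAG", "<<", "push"},
        {"TEXT", "(?:[^<]|<[^<%])+", ""},
    };
    private static final String[][] IN_TAG = {
        {"ID", "[A-Za-z_][A-Za-z0-9_]*", ""},
        {"INT_NUMBER", "[0-9]+", ""},
        {"END_1_TAG", "%>", "pop"},
        {"END_2_TAG", ">>", "pop"},
        {"SPACE", "[ \\t\\r\\n]", "skip"}, // the grammar hides these; the sketch drops them
    };

    /** Returns tokens as "NAME:text" strings. */
    static List<String> tokenize(String input) {
        Deque<String[][]> modes = new ArrayDeque<>();
        modes.push(DEFAULT_MODE);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String[] best = null;
            int bestEnd = pos;
            // Only the rules of the mode on top of the stack are tried.
            for (String[] rule : modes.peek()) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // Longest match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!best[2].equals("skip")) {
                tokens.add(best[0] + ":" + input.substring(pos, bestEnd));
            }
            if (best[2].equals("push")) {
                modes.push(IN_TAG);   // like -> pushMode(IN_TAG)
            } else if (best[2].equals("pop")) {
                modes.pop();          // like -> popMode
            }
            pos = bestEnd;
        }
        return tokens;
    }
}
```

Calling ModeLexerSketch.tokenize on the same source string as the demo yields the same token sequence as the printed output, minus the EOF token. The point of the design is that TEXT only exists in the default mode, so it can never swallow an ID or a number inside a tag.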
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
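To experiment with the pattern outside ANTLR, the same alternation can be written as a java.util.regex expression. This is only a sketch: the class TextRuleSketch and the method leadingText are made-up names, and a regex engine is not the ANTLR lexer, although for this particular pattern it should consume the same prefix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextRuleSketch {
    // ( ~[<] | '<' ~[<%] )+ written as a Java regex
    private static final Pattern TEXT = Pattern.compile("(?:[^<]|<[^<%])+");

    /** Returns the text the pattern consumes at the start of input, or "" if none. */
    static String leadingText(String input) {
        Matcher m = TEXT.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }
}
```

A lone '<' followed by some other character stays part of the text, while '<<' and '<%' end it: leadingText("a < b<%x%>") gives "a < b".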
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicates are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target-specific code into your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Related

Problems defining an ANTLR parser for template file

I have started working with ANTLR4 to create a syntax parser for a self defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Each part's body is surrounded by braces.
I have created a sample file and also copy-pasted-modified an antlr g4 file to parse it. Works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser and extend the #region header by #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I expected the parser to not concat the whole line when comparing. What am I doing wrong?
Thanks.
There are a couple of problems here:
whatever NOSPACE matches, is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently from the parser, and the lexer will match as many characters as possible.
When 2 (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
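The two rules (longest match first, then definition order) can be imitated in a few lines of plain Java. This is a sketch, not ANTLR's implementation: the class MaximalMunchSketch and its methods are hypothetical, regexes stand in for the lexer rules, and it assumes that the literal '#region' from the parser rules is ordered before the named lexer rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MaximalMunchSketch {
    /** Returns the name of the rule that wins at the start of input, or null if none matches. */
    static String winningRule(Map<String, String> rulesInOrder, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, String> rule : rulesInOrder.entrySet()) {
            Matcher m = Pattern.compile(rule.getValue()).matcher(input);
            // Strictly longer matches replace the winner; ties keep the earlier rule.
            if (m.lookingAt() && m.end() > longest) {
                winner = rule.getKey();
                longest = m.end();
            }
        }
        return winner;
    }

    /** Regex stand-ins for the Template grammar's tokens, in assumed definition order. */
    static Map<String, String> templateRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("'#region'", "#region");
        rules.put("TEXT", "[^\\x00-\\x1F]+");    // anything except control characters
        rules.put("NOSPACE", "[^\\x00-\\x20]+"); // same, but also excluding the space
        return rules;
    }
}
```

With these rules, "Foo" comes out as TEXT (a tie with NOSPACE, decided by order), a bare "#region" comes out as the literal (a tie with TEXT, decided by order), and "#region name (attributes, moreAttributes)" comes out as TEXT because that match is longer than the literal.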
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token, at a point where the parser is trying to match a '#region' token.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
Bart,
thank you very much for clarifying this to me. The key phrase was the lexer will match as much characters as possible. This is a behavior I still need to get used to. I redesigned my Lexer and Parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;

Why am I getting "mismatched input 'addr' expecting {<EOF>, 'addr'}"

Given this g4 grammar:
grammar smaller;
root
: ( componentDefinition )* EOF;
componentDefinition
: Addr
Id?
Lbrace
Rbrace
Semi
;
ExprElem
: Num
| Id
;
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
//------------------------------------------------
// Whitespace and Comments
//------------------------------------------------
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
};
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
addr
id
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your rule ExprElem is a lexer rule, not a parser rule (it begins with an uppercase letter), and it is masking the Addr rule, so, no Addr :(
Also, ExprElem is a lexer rule that relies on the Id and Num rules. Consequently, when an Id is found, the ANTLR lexer gives it the ExprElem token type and not the Id token type.
So, two options: you can either rewrite your ExprElem rule as exprElem (assuming you want a parser rule):
exprElem : Num | Id;
or you can use the Id token in your ExprElem as part of the rule, but you need something that differentiates ExprElem from Id (example below, but I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
ExprElem
: Sharp Num // This token uses other tokens but defines its own 'pattern'
| Sharp Id
;
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
I suspect this is not what you want; I only put it here to illustrate how a lexer rule can reuse others.
When in doubt about what your tokens do, do not hesitate to print the recognized tokens. Here is the Java code fragment I often use (I named your grammar test in this case):
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) throws InterruptedException {
String txt =
"addr Basic {\n"
+ "\n"
+ "};";
TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.root();
for (Token t : tokens.getTokens()) {
System.out.println(t);
}
}
}
NOTE: by the way, Num will never be recognized, because the Id rule is defined first and can match the same input. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-Z_];
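The effect of requiring a leading letter can be checked quickly with Java regexes standing in for the lexer rules. This is a sketch with made-up names (IdNumSketch, matchesWhole); the question's Id is modelled with the presumably intended A-Z range. The last pattern shows why a range written as A-z, as in the question's Id rule, is worth double-checking: in ASCII it also covers the six punctuation characters that sit between Z and a.

```java
import java.util.regex.Pattern;

final class IdNumSketch {
    // Id as in the question (digits allowed anywhere), so it also matches every Num.
    static final Pattern QUESTION_ID = Pattern.compile("[a-zA-Z0-9_]+");
    // Id as suggested above: a letter or underscore first, then letters, digits, underscores.
    static final Pattern SUGGESTED_ID = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    // A range from upper-case A to lower-case z spans code points 65 to 122.
    static final Pattern UPPER_A_TO_LOWER_Z = Pattern.compile("[A-z]");

    /** True if the pattern matches the entire string. */
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

Here matchesWhole(QUESTION_ID, "123") is true while matchesWhole(SUGGESTED_ID, "123") is false, which is what leaves room for Num; and matchesWhole(UPPER_A_TO_LOWER_Z, "[") is true.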

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTLR throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had a similar problem; my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
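One way to convince yourself that this pattern accepts exactly the strings without an asterisk directly followed by a forward slash is to try the equivalent Java regex on a few inputs. This is a sketch with made-up names (StarSlashSketch, accepts), and it tests the pattern with java.util.regex rather than with an ANTLR lexer.

```java
import java.util.regex.Pattern;

final class StarSlashSketch {
    // ( ~'*' | ( '*'+ ~[/*] ) )* '*'*  written as a Java regex
    private static final Pattern NO_STAR_SLASH =
        Pattern.compile("(?:[^*]|\\*+[^/*])*\\**");

    /** True if the whole input is accepted, i.e. no asterisk is directly followed by a slash. */
    static boolean accepts(String input) {
        return NO_STAR_SLASH.matcher(input).matches();
    }
}
```

Individual asterisks and slashes are fine, so accepts("a*b/c") and accepts("trailing**") are true, while accepts("a*/b") is false.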
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
@header
{
import java.util.*;
}
@parser::members
{
boolean hasEvilCharacters(String input)
{
if (input.contains("*/"))
{
return false;
}
else
{
return true;
}
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token of
all characters that are neither '*' nor '/'
all sequences that begin with '*' but are not followed by '/', or a single '/'
There are some rules that deal with special situations (multiple '*' followed by '/', or trailing '*')

ANTLR -- use predicates to insert a token

I am trying to understand ANTLR predicates. To that end,
I have a simple lexer and parser, shown below.
What I would like to do is use a predicate to insert the word "fubar" every time it sees "foo" followed by some whitespace and then "bar". I want to do this while keeping the same basic structure. Bonus points for doing it in the lexer. Further bonus points if I can do it without referring to the underlying language at all. But if necessary, it is C#.
For example, if the input string is:
programmers use the words foo bar and bar foo class
the output would be
programmers use the words foo fubar bar and bar foo class
Lexer:
lexer grammar TextLexer;
@members
{
protected const int EOF = Eof;
protected const int HIDDEN = Hidden;
}
FOO: 'foo';
BAR: 'bar';
TEXT: [a-z]+ ;
WS
: ' ' -> channel(HIDDEN)
;
Parser:
parser grammar TextParser;
options { tokenVocab=TextLexer; }
@members
{
protected const int EOF = Eof;
}
file: words EOF;
word:FOO
|BAR
|TEXT;
words: word
| word words
;
compileUnit
: EOF
;
ANTLR3's lexer might have needed a predicate in this case, but ANTLR4's lexer is much "smarter". You can match "foo bar" in a single lexer rule and change its inner text with setText(...):
FOO_BAR
: 'foo' [ \t]+ 'bar' {setText("fubar");}
;
TEXT
: [a-z]+
;
WS
: ' ' -> channel(HIDDEN)
;

Simple grammar not working

I have a simple grammar to parse files containing identifiers and keywords between brackets (hopefully):
grammar Keyword;
// PARSER RULES
//
entry_point : ('['ID']')*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
It works for input:
[Hi]
[Hi]
It returns a NoViableAltException error for input:
[Hi]
[Ki]
If I comment KEYWORD, then it works fine. Also, if I change my grammar to:
grammar Keyword;
// PARSER RULES
//
entry_point : ID*;
// LEXER RULES
//
KEYWORD : '[Keyword]';
ID : '[' ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ']';
WS : ( ' ' | '\t' | '\r' | '\n' | '\r\n')
{
$channel = HIDDEN;
};
Then it works. Could you please help me figure out why?
Best regards.
The 1st grammar fails because whenever the lexer sees "[K", the lexer will enter the KEYWORD rule. If it then encounters something other than "eyword]", "i" in your case, it tries to go back to some other rule that can match "[K". But there is no other lexer rule that starts with "[K", and the lexer will therefore throw an exception. Note that the lexer doesn't remove "K" and then try to match again (the lexer is a dumb machine)!
Your 2nd grammar works, because the lexer now can find something to fall back on when "[Ki" does not get matched by the KEYWORD since ID now includes the "[".