Some keywords (string constants) in my grammar contain capital letters,
e.g.
PREV_VALUE : 'PreviousValue';
This causes strange parsing behavior: other tokens that contain the same capital letters ('P', 'V') are parsed incorrectly.
Here's a simplified version of the lexer grammar:
lexer grammar ExpressionLexer;
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
PREV_VALUE : 'PreviousValue';
fragment DIGIT : ('0'..'9');
fragment LETTER : ('a'..'z'|'A'..'Z'|'_');
fragment TAB : ('\t') ;
fragment NEWLINE : ('\r'|'\n') ;
fragment SPACE : (' ') ;
When I try parsing an expression like this:
var expression = "P"; // capital 'P', which is part of the keyword 'PreviousValue'
var stringReader = new StringReader(expression);
var input = new ANTLRReaderStream(stringReader);
var expressionLexer = new ExpressionLexer(input);
var tokens = new CommonTokenStream(expressionLexer);
tokens._tokens collection contains one value
[0] = {[#0,1:1='<EOF>',<-1>,1:1]}
It's incorrect.
If I change expression to 'p' (lowercase letter)
tokens._tokens collection contains two values
[0] = {[#0,0:0='p',<0>,1:0]}
[1] = {[#1,1:1='<EOF>',<-1>,1:1]}
It's correct.
When the rule PREV_VALUE : 'PreviousValue'; is removed from the grammar, both expressions are parsed correctly.
Is it possible to use mixed case in keywords?
Is there any example of using such keywords in an ANTLR grammar?
I find it hard to believe a p token is created based on the grammar you posted. Lexer rules that have fragment in front of them will not produce tokens: these rules are only used by other lexer rules.
A simple demo shows this:
lexer grammar ExpressionLexer;
@lexer::members {
  public static void main(String[] args) throws Exception {
    ExpressionLexer lexer = new ExpressionLexer(new ANTLRStringStream(args[0]));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    tokens.fill(); // remove this line when using ANTLR 3.2 or an older version
    System.out.println(tokens);
  }
}
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
PREV_VALUE : 'PreviousValue';
fragment DIGIT : ('0'..'9');
fragment LETTER : ('a'..'z'|'A'..'Z'|'_');
fragment TAB : ('\t') ;
fragment NEWLINE : ('\r'|'\n') ;
fragment SPACE : (' ') ;
Now generate the lexer and compile the .java source file:
java -cp antlr-3.3.jar org.antlr.Tool ExpressionLexer.g
javac -cp antlr-3.3.jar *.java
and run a few tests:
java -cp .:antlr-3.3.jar ExpressionLexer p
line 1:0 no viable alternative at character 'p'
which is correct since there is no (non-fragment) rule that starts with, or matches, a "p".
java -cp .:antlr-3.3.jar ExpressionLexer P
line 1:1 mismatched character '<EOF>' expecting 'r'
which is correct since the only (non-fragment) rule that starts with a "P" expects an "r" to be the next character (which isn't there).
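If a lone "P" (or "p") should itself be a valid token, the grammar needs a non-fragment rule that can match it, for example a general identifier rule. A minimal sketch of that idea (an assumption about the intended language, not something the posted grammar contains):
// Hypothetical addition to ExpressionLexer: a general identifier token.
// "PreviousValue" is still lexed as PREV_VALUE (same length, defined first),
// while "P" or "p" now becomes an ID token instead of causing a lexer error.
ID : LETTER (LETTER | DIGIT)*;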
I have started working with ANTLR4 to create a syntax parser for a self-defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Each part's body is surrounded by braces.
I have created a sample file and adapted an ANTLR g4 file to parse it. It works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser, and extend the #region header to #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I did not expect the parser to treat the whole line as a single unit when matching. What am I doing wrong?
Thanks.
There are a couple of problems here:
whatever NOSPACE matches is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently of the parser, and it will match as many characters as possible.
When two (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token at the position where the parser is trying to match '#region'.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
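One possible direction, as a sketch only (the exact character set is an assumption about what may legally appear inside the bodies, and the NOSPACE rule would have to be removed or reworked alongside it):
// Hypothetical, less greedy TEXT rule: it can no longer swallow the structural
// characters the parser needs to see ('#', braces, parentheses, comma).
TEXT
 : ~[\u0000-\u001F#{}(),]+
 ;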
Bart,
thank you very much for clarifying this for me. The key phrase was the lexer will match as many characters as possible. This is a behavior I still need to get used to. I redesigned my lexer and parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-Za-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;
ANTLR grammar:
grammar Java;
// Parser
compilationUnit: classDeclaration;
classDeclaration : 'class' CLASS_NAME classBlock
;
classBlock: OPEN_BLOCK method* CLOSE_BLOCK
;
method: methodReturnValue methodName methodArgs methodBlock
;
methodReturnValue: CLASS_NAME
;
methodName: METHOD_NAME
;
methodArgs: OPEN_PAREN CLOSE_PAREN
;
methodBlock: OPEN_BLOCK CLOSE_BLOCK
;
// Lexer
CLASS_NAME: ALPHA;
METHOD_NAME: ALPHA;
WS: [ \t\n] -> skip;
OPEN_BLOCK: '{';
CLOSE_BLOCK: '}';
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
fragment ALPHA: [a-zA-Z][a-zA-Z0-9]*;
Pseudo-Java file:
class Test {
void run() { }
}
Most things match up except for METHOD_NAME which it errantly associates with methodArgs.
line 3:6 mismatched input 'run' expecting METHOD_NAME
This is about token ambiguity. This question has been asked several times in recent weeks. Follow the links, especially disambiguate, in this answer.
As soon as you have a mismatched-input error, add -tokens to grun to display the tokens; it helps find the discrepancy between what you THINK the lexer will do and what it actually DOES. With your grammar:
CLASS_NAME: ALPHA;
METHOD_NAME: ALPHA;
every input matched by ALPHA is ambiguous, and in case of ambiguity ANTLR chooses the first rule.
$ grun Question compilationUnit -tokens -diagnostics t.text
[#0,0:4='class',<'class'>,1:0]
[#1,6:9='Test',<CLASS_NAME>,1:6]
[#2,11:11='{',<'{'>,1:11]
[#3,18:21='void',<CLASS_NAME>,3:4]
[#4,23:25='run',<CLASS_NAME>,3:9]
[#5,26:26='(',<'('>,3:12]
[#6,27:27=')',<')'>,3:13]
[#7,29:29='{',<'{'>,3:15]
[#8,31:31='}',<'}'>,3:17]
[#9,34:34='}',<'}'>,5:0]
[#10,36:35='<EOF>',<EOF>,6:0]
Question last update 0841
line 3:9 mismatched input 'run' expecting METHOD_NAME
because run has been interpreted as a CLASS_NAME.
I would write the grammar like so:
grammar Question;
// Parser
compilationUnit
@init {System.out.println("Question last update 0919");}
: classDeclaration;
classDeclaration : 'class' ID classBlock
;
classBlock: OPEN_BLOCK method* CLOSE_BLOCK
;
method: methodReturnValue=ID methodName=ID methodArgs methodBlock
{System.out.println("Method found : " + $methodName.text +
" which returns a " + $methodReturnValue.text);}
;
methodArgs: OPEN_PAREN CLOSE_PAREN
;
methodBlock: OPEN_BLOCK CLOSE_BLOCK
;
// Lexer
ID : ALPHA ( ALPHA | DIGIT | '_' )* ;
WS: [ \t\n] -> skip;
OPEN_BLOCK: '{';
CLOSE_BLOCK: '}';
OPEN_PAREN: '(';
CLOSE_PAREN: ')';
fragment ALPHA : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution:
$ grun Question compilationUnit -tokens -diagnostics t.text
[#0,0:4='class',<'class'>,1:0]
[#1,6:9='Test',<ID>,1:6]
[#2,11:11='{',<'{'>,1:11]
[#3,18:21='void',<ID>,3:4]
[#4,23:25='run',<ID>,3:9]
[#5,26:26='(',<'('>,3:12]
[#6,27:27=')',<')'>,3:13]
[#7,29:29='{',<'{'>,3:15]
[#8,31:31='}',<'}'>,3:17]
[#9,34:34='}',<'}'>,5:0]
[#10,36:35='<EOF>',<EOF>,6:0]
Question last update 0919
Method found : run which returns a void
and $ grun Question compilationUnit -gui t.text displays the corresponding parse tree.
methodReturnValue and methodName are available in the listener from ctx, the rule context.
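For example, a listener along these lines could read both labels from the rule context (a sketch, assuming the QuestionParser and QuestionBaseListener classes that ANTLR generates for the grammar above):
// Sketch: reading the labelled tokens methodName and methodReturnValue from ctx.
public class MethodListener extends QuestionBaseListener {
    @Override
    public void exitMethod(QuestionParser.MethodContext ctx) {
        // Labels such as methodName=ID become Token fields on the generated context.
        System.out.println("Method found : " + ctx.methodName.getText()
                + " which returns a " + ctx.methodReturnValue.getText());
    }
}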
Given this g4 grammar:
grammar smaller;
root
: ( componentDefinition )* EOF;
componentDefinition
: Addr
Id?
Lbrace
Rbrace
Semi
;
ExprElem
: Num
| Id
;
Addr : 'addr' {System.out.println("addr");};
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
//------------------------------------------------
// Whitespace and Comments
//------------------------------------------------
Wspace : [ \t]+ -> skip;
Newline : ('\r' '\n'?
| '\n'
) -> skip;
and this file to parse
addr basic {
};
this cmdline:
rm *.class *.java ; java -Xmx500M org.antlr.v4.Tool smaller.g4 ; javac *.java ; cat basic | java org.antlr.v4.runtime.misc.TestRig smaller root -tree
I get this error:
line 2:0 mismatched input 'addr' expecting {<EOF>, 'addr'}
(root addr basic { } ;)
If I remove the ExprElem (which is not used anywhere else in the grammar), the parser works:
addr
id
(root (componentDefinition addr basic { } ;) <EOF>)
Why? Note that this is a greatly reduced version of the grammar. Normally, the ExprElem does have a purpose.
Addr is a literal, so it shouldn't conflict with Id in the way that other questions like this usually do.
Your rule ExprElem is a lexer rule, not a parser rule (it begins with an uppercase letter), and it is masking the Addr rule, so, no Addr :(
Also, ExprElem is a lexer rule that relies on the Id and Num rules. Consequently, when an Id is found, the ANTLR lexer gives it the ExprElem token type and not the Id token type.
So, two options: you can either rewrite your ExprElem rule as a parser rule exprElem (assuming that is what you want):
exprElem : Num | Id;
or you can keep the Id token as part of your ExprElem rule, but then you need something that differentiates ExprElem from Id (example below, but I really think you want a parser rule):
Addr : 'addr' {System.out.println("addr");};
ExprElem
: Sharp Num // This token uses others but defines its own 'pattern'
| Sharp Id
;
Lbrace : '{' ;
Rbrace : '}' ;
Semi : ';' ;
Id : [a-zA-z0-9_]+ {System.out.println("id");};
Num : [0-9]+;
Sharp : '#';
I suppose this is not what you want, but I put it here to illustrate how a lexer rule can reuse others.
When in doubt about what your tokens do, do not hesitate to display the recognized tokens. Here is the Java code fragment I often use (I named your grammar Test in this case):
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) {
        String txt =
            "addr Basic {\n"
            + "\n"
            + "};";
        TestLexer lexer = new TestLexer(new ANTLRInputStream(txt));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        parser.root();
        for (Token t : tokens.getTokens()) {
            System.out.println(t);
        }
    }
}
NOTE: by the way, a Num will never be recognized, as the Id rule can match the same thing. Try this instead:
Id : Letter (Letter | [0-9])*;
Num : [0-9]+;
fragment Letter : [a-zA-Z_];
I'm new to ANTLR and I'm trying to play with it. This is the simplest grammar I could think of, and still it is not working (NoViableAltException) when I parse the variable "id123", although it works for "abc1", "ab", and "c1d2f3".
I'm using ANTLR 3.1.3 and ANTLRWorks 1.4.
options
{
language = 'CSharp2';
output = AST;
}
assign : variable '=' value;
value : (variable|constant);
variable: LETTER (LETTER|DIGIT)*;
constant: (STRING|INTEGER);
DIGIT : '0'..'9';
NATURAL : (DIGIT)+;
INTEGER : ('-')? NATURAL;
REAL : (INTEGER '.' NATURAL);
LETTER : ('a'..'z'|'A'..'Z');
CR : '\r' { $channel = HIDDEN; };
LF : '\n' { $channel = HIDDEN; };
CRLF : CR LF { $channel = HIDDEN; };
SPACE : (' '|'\t') { $channel = HIDDEN; };
STRING : '"' (~'"')* '"';
ANTLR's lexer tries to match as much input as possible. Whenever two (or more) rules match the same number of characters, the rule defined first will "win". So, whenever the lexer stumbles upon a single digit, a DIGIT token is created, because it is defined before NATURAL:
DIGIT : '0'..'9';
NATURAL : (DIGIT)+;
but for the input "id123" the lexer produced the following 3 tokens:
LETTER 'i'
LETTER 'd'
NATURAL '123'
because the lexer matches greedily, and therefore a NATURAL is created rather than three DIGIT tokens.
What you should do is make variable a lexer rule instead:
assign : VARIABLE '=' value;
value : (VARIABLE | constant);
constant : (STRING | INTEGER | REAL);
VARIABLE : LETTER (LETTER|DIGIT)*;
INTEGER : ('-')? NATURAL;
REAL : (INTEGER '.' NATURAL);
SPACE : (' ' | '\t' | '\r' | '\n') { $channel = HIDDEN; };
STRING : '"' (~'"')* '"';
fragment NATURAL : (DIGIT)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
Also note that I made a couple of lexer rules fragments. This means that the lexer will never produce NATURAL, DIGIT or LETTER tokens. These fragment rules can only be used by other lexer rules. In other words, your lexer will only ever produce VARIABLE, INTEGER, REAL, and STRING tokens* (so these are the only ones you can use in your parser rules!).
* and '=' token, of course...
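With the revised grammar, an input such as id123 = "abc" should then be tokenised roughly as follows (a sketch of the expected token stream, not actual tool output):
VARIABLE 'id123'
'=' '='
STRING '"abc"'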
I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary HTML that can appear between tags. So far, what I have is below; any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently of your parser. If your parser tries to match a BUFFER token, the lexer does not take this into account. In your case, for input like "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you encountered an RD token), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
after the lexer creates either an LD or an RD token, flip the boolean flag from (1). This can be done in the @after { ... } section of the lexer rules;
before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
@lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD @after{insideTag=true;} : '{';
RD @after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tags, so I added a SPACE rule and discarded these spaces)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
    public static void main(String[] args) throws Exception {
        String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
                "This Should Be Parsed as a Buffer.{/bar2}";
        gLexer lexer = new gLexer(new ANTLRStringStream(src));
        gParser parser = new gParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.start().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}
and after generating the parser, compiling, and running the Main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT source being printed to the console, which corresponds to the AST of the parse (image created using graphviz-dev.appspot.com).