Problems defining an ANTLR parser for template file - antlr

I have started working with ANTLR4 to create a syntax parser for a self defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Parts body is surrounded by braces.
I have created a sample file and also copy-pasted-modified an antlr g4 file to parse it. Works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser and extend the #region header by #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I expected the parser to not concat the whole line when comparing. What am I doing wrong?
Thanks.

There are a couple of problems here:
whatever NOSPACE matches, is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently from the parser and the lexer will match as much characters as possible.
When 2 (or more) lexer rules match the same amount of characters, the one defined first "wins".
So, if the input is Foo and the parser is. trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
Issue 2
As explained in issue 1, the lexer tries to match as much characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token where a #region is trying to be matched by the parser.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).

Bart,
thank you very much for clarifying this to me. The key phrase was the lexer will match as much characters as possible. This is a behavior I still need to get used to. I redesigned my Lexer and Parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;

Related

ANTLR Pattern "line 1:9 extraneous input ' ' expecting WORD"

I'm just getting started with using ANTLR. I'm trying to write a parser for field definitions that look like:
field_name = value
Example:
is_true_true = yes;
My grammar looks like this:
grammar Hello;
//Lexer Rules
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9';
fragment TRUE: 'TRUE'|'true';
fragment FALSE: 'FALSE'|'false';
INTEGER : DIGIT+ ;
STRING : ('\''.*?'\'') ;
BOOLEAN : (TRUE|FALSE);
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE : ('\r'? '\n' | '\r')+ ;
field_def : WORD '=' WORD ';' ;
But when I run the generated Parser on 'working = yes;' i get the error message:
line 1:7 extraneous input ' ' expecting '='
line 1:9 extraneous input ' ' expecting WORD
I do not understand this fully, is there an error in matching the WORD-pattern or is it something else entirely?
Since it's quite usual that the whitespace is not significant to your grammar (i.e. there's no semantic meaning to it, apart of separating words), ANTLR makes it possible to just skip it:
In ANTLR 4 this is done by
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
In ANTLR 3 the syntax is
WHITESPACE : (' ' | '\t')+ { $channel = HIDDEN; };
NEWLINE : ('\r'? '\n' | '\r')+ { $channel = HIDDEN; };
What this does is the lexer tokenizes the input as usual, but parser understands that these tokens are not significant to it and behaves as if they were not there, allowing you to keep your rules simple and without need to add optional whitespace everywhere.
Your example has whitespace but your field_def isn't accounting for it.

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Grammar to negate two like characters in a lexer rule inside a single quoted string

ANLTR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and have even seen a similar approach work in Bart's answer in this post...
Negating inside lexer- and parser rules
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
if I alter the string literal token rule to the following it works...
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'* ;
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.

Ignore some part of input when parsing with ANTLR

I'm trying to parse a language by ANTLR (ANTLRWorks-3.5.2). The goal is to enter complete input but Antlr gives a parse tree of defined parts in grammar and ignore the rest of inputs, for example this is my grammar :
grammar asap;
project : '/begin PROJECT' name module+ '/end PROJECT';
module : '/begin MODULE'name '/end MODULE';
name : IDENT ;
IDENT : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*;
Given input:
/begin PROJECT HybridSailboat_2
/begin MODULE engine
/begin A2ML
/include XCP_common_v1_0.aml
"XCP" struct {
taggedstruct Common_Parameters ;
};
/end A2ML
/end MODULE
/end PROJECT
regarding to this input I just want the parse tree contains project and module and not A2ML part.
Is it possible in antlr that it ignore some part of inputs?
Can I specify start and end points of unimportant parts in grammar?
Simply match the A2ML part as a single token in the lexer and skip() it:
grammar asap;
project
: BEGIN_PROJECT name module* END_PROJECT EOF
;
module
: BEGIN_MODULE name END_MODULE
;
name
: IDENT
;
IDENT
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*
;
BEGIN_PROJECT
: '/begin' S 'PROJECT'
;
END_PROJECT
: '/end' S 'PROJECT'
;
BEGIN_MODULE
: '/begin' S 'MODULE'
;
END_MODULE
: '/end' S 'MODULE'
;
A2ML
: '/begin' S 'A2ML' .* '/end' S 'A2ML' {skip();}
;
SPACES
: S {skip();}
;
fragment S
: (' ' | '\t' | '\r' | '\n')+
;

extraneous input error using ANTLR4

I am an ANTLR newbie and struggling with some of the errors that I am getting. Below I have included the grammar that I am using, The input file and the error that I am getting.
My Antlr Grammar file is as follows:
grammar Simple;
#header
{
package simple;
}
PARSER
program :
anylinebefore+
processline
anylineafter+
'MSEND' NEWLINE
'.' EOF
;
anylinebefore: CH* NEWLINE | commentline;
anylineafter: statement | commentline;
statement: movestatement ;
movestatement : 'MOVE' arg ('to' | 'TO') ID '.' NEWLINE ;
arg : ID|STRING;
processline: PROCESSLITERAL NEWLINE;
commentline: '!' CH* NEWLINE;
LEXER
WS : [ \t]+ -> skip ;
STRING : '\'' (~['])* '\'';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
TO : ('to' | 'TO');
CH : [\u0000-\uFFFE];
PROCESSLITERAL : 'PROCESS SOURCE FOLLOWS';
NEWLINE : '\r'? '\n' ;
My input file is as follows:
MODIFY
PROCESS SOURCE FOLLOWS
MOVE 'WSFRED' TO AGRPASSEDTWO.
MSEND
.
The error that I get is:
showtree:
[java] line 1:0 extraneous input 'MODIFY' expecting {'!', CH, NEWLINE}
I don't understand why this isn't matching anylinebefore in the grammar
Any help would be appreciated.
"MODIFY" is an ID, which doesn't match anylinebefore+