Ignore some part of input when parsing with ANTLR

I'm trying to parse a language with ANTLR (ANTLRWorks 3.5.2). The goal is to feed in the complete input, but have ANTLR build a parse tree only for the parts defined in the grammar and ignore the rest of the input. For example, this is my grammar:
grammar asap;
project : '/begin PROJECT' name module+ '/end PROJECT';
module : '/begin MODULE'name '/end MODULE';
name : IDENT ;
IDENT : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*;
Given input:
/begin PROJECT HybridSailboat_2
/begin MODULE engine
/begin A2ML
/include XCP_common_v1_0.aml
"XCP" struct {
taggedstruct Common_Parameters ;
};
/end A2ML
/end MODULE
/end PROJECT
Given this input, I want the parse tree to contain only project and module, not the A2ML part.
Is it possible to make ANTLR ignore some parts of the input?
Can I specify the start and end points of the unimportant parts in the grammar?

Simply match the A2ML part as a single token in the lexer and skip() it:
grammar asap;
project
: BEGIN_PROJECT name module* END_PROJECT EOF
;
module
: BEGIN_MODULE name END_MODULE
;
name
: IDENT
;
IDENT
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'.'|':'|'-')*
;
BEGIN_PROJECT
: '/begin' S 'PROJECT'
;
END_PROJECT
: '/end' S 'PROJECT'
;
BEGIN_MODULE
: '/begin' S 'MODULE'
;
END_MODULE
: '/end' S 'MODULE'
;
A2ML
: '/begin' S 'A2ML' .* '/end' S 'A2ML' {skip();}
;
SPACES
: S {skip();}
;
fragment S
: (' ' | '\t' | '\r' | '\n')+
;
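The idea behind the A2ML rule can be tried out without ANTLR. The following is only a rough Python sketch, not ANTLR's actual algorithm: it uses ordered regex alternation instead of ANTLR's longest-match lexing, and the names TOKEN_SPEC, SKIPPED, tokenize and SAMPLE are made up for illustration. It shows that once the whole /begin A2ML ... /end A2ML block is consumed as one skipped token, only the project and module tokens remain for the parser.

```python
import re

# Token patterns loosely mirroring the lexer rules above.
# Assumption: \s+ stands in for the S fragment (space, tab, CR, LF).
TOKEN_SPEC = [
    ("A2ML", r"/begin\s+A2ML.*?/end\s+A2ML"),  # whole block, matched lazily
    ("BEGIN_PROJECT", r"/begin\s+PROJECT"),
    ("END_PROJECT", r"/end\s+PROJECT"),
    ("BEGIN_MODULE", r"/begin\s+MODULE"),
    ("END_MODULE", r"/end\s+MODULE"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9_.:\-]*"),
    ("SPACES", r"\s+"),
]
SKIPPED = {"A2ML", "SPACES"}  # tokens the parser never sees
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

SAMPLE = """/begin PROJECT HybridSailboat_2
/begin MODULE engine
/begin A2ML
/include XCP_common_v1_0.aml
"XCP" struct {
taggedstruct Common_Parameters ;
};
/end A2ML
/end MODULE
/end PROJECT
"""


def tokenize(text):
    """Return (type, text) pairs, dropping every token listed in SKIPPED."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup not in SKIPPED:
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, lexeme in tokenize(SAMPLE):
    print(kind, repr(lexeme))
```

Everything between /begin A2ML and /end A2ML, including characters that no other rule could match (quotes, braces, semicolons), disappears inside the single skipped token.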

Related

Problems defining an ANTLR parser for template file

I have started working with ANTLR4 to create a syntax parser for a self-defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. The body of each part is surrounded by braces.
I have created a sample file and also copied and modified an ANTLR g4 file to parse it. Works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser and extend the #region header by #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I expected the parser to not concat the whole line when comparing. What am I doing wrong?
Thanks.
There are a couple of problems here:
whatever NOSPACE matches, is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently from the parser, and the lexer will match as many characters as possible.
When 2 (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
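The two rules described here (longest match first, earliest definition on a tie) can be imitated in a few lines of Python. This is only an illustrative sketch, not ANTLR's implementation: the names RULES and next_token are made up, and the implicit literal tokens such as '#region' are left out.

```python
import re

# The two lexer rules from the question, in definition order (TEXT before NOSPACE).
RULES = [
    ("TEXT", re.compile(r"[^\x00-\x1f]+")),
    ("NOSPACE", re.compile(r"[^\x00-\x20]+")),
]


def next_token(text, pos, rules=RULES):
    """Return (rule_name, lexeme) for the longest match; ties go to the earlier rule."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        # Strictly greater: an equally long match never replaces an earlier rule.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("Foo", 0))                         # TEXT wins the tie
print(next_token("Foo", 0, list(reversed(RULES))))  # NOSPACE wins when defined first
print(next_token("Foo bar", 0))                     # TEXT wins on length alone
```

The parser has no say in any of this: the token type is fixed before the parser ever looks at the input.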
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token at the point where the parser is trying to match a '#region' token.
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
Bart,
thank you very much for clarifying this to me. The key phrase was that the lexer will match as many characters as possible. This is a behavior I still need to get used to. I redesigned my lexer and parser rules and it seems to work for my test case now.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;

Antlr Skip text outside tag

I'm trying to skip/ignore the text outside a custom tag:
This text is a unique token to skip < ?compo \5+5\ ?> also this < ?compo \1+1\ ?>
I tried with the following lexer:
TAG_OPEN : '<?compo ' -> pushMode(COMPOSER);
mode COMPOSER;
TAG_CLOSE : ' ?>' -> popMode;
NUMBER_DIGIT : '1'..'9';
ZERO : '0';
LOGICOP
: OR
| AND
;
COMPAREOP
: EQ
| NE
| GT
| GE
| LT
| LE
;
WS : ' ';
NEWLINE : ('\r\n'|'\n'|'\r');
TAB : ('\t');
...
and parser:
instructions
: (TAG_OPEN statement TAG_CLOSE)+?;
statement
: if_statement
| else
| else_if
| if_end
| operation_statement
| mnemonic
| comment
| transparent;
But it doesn't work (I tested it using the IntelliJ tester on the rule "instructions")...
I have also added a skip rule outside the "COMPOSER" mode:
TEXT_SKIP : TAG_CLOSE .*? (TAG_OPEN | EOF) -> skip;
But I don't get any results...
Can someone help me?
EDIT:
I changed "instructions" and now the parse tree is built correctly for every instruction of every tag:
instructions : (.*? TAG_OPEN statement TAG_CLOSE .*?)+;
But I get an unrecognized character error for the text outside the tags...
Below is a quick demo that worked for me.
Lexer grammar:
lexer grammar CompModeLexer;
TAG_OPEN
: '<?compo' -> pushMode(COMPOSER)
;
OTHER
: . -> skip
;
mode COMPOSER;
TAG_CLOSE
: '?>' -> popMode
;
OPAR
: '('
;
CPAR
: ')'
;
INT
: '0'
| [1-9] [0-9]*
;
LOGICOP
: 'AND'
| 'OR'
;
COMPAREOP
: [<>!] '='
| [<>=]
;
MULTOP
: [*/%]
;
ADDOP
: [+-]
;
SPACE
: [ \t\r\n\f] -> skip
;
Parser grammar:
parser grammar CompModeParser;
options {
tokenVocab=CompModeLexer;
}
parse
: tag* EOF
;
tag
: TAG_OPEN statement TAG_CLOSE
;
statement
: expr
;
expr
: '(' expr ')'
| expr MULTOP expr
| expr ADDOP expr
| expr COMPAREOP expr
| expr LOGICOP expr
| INT
;
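The mode switching in the lexer grammar above can be approximated in plain Python to see why the text outside the tags never reaches the parser. This is a simplified sketch under a few assumptions: a boolean flag replaces ANTLR's mode stack (enough here because tags do not nest), ordered regex alternation replaces longest-match lexing, and the names DEFAULT_MODE, COMPOSER_MODE, tokenize and SAMPLE are made up for illustration.

```python
import re

# Default mode: only TAG_OPEN is interesting, any other character is skipped.
DEFAULT_MODE = re.compile(r"(?P<TAG_OPEN><\?compo)|(?P<OTHER>.)", re.DOTALL)

# COMPOSER mode: the tokens that may appear between <?compo and ?>.
COMPOSER_MODE = re.compile(
    r"(?P<TAG_CLOSE>\?>)"
    r"|(?P<OPAR>\()"
    r"|(?P<CPAR>\))"
    r"|(?P<INT>0|[1-9][0-9]*)"
    r"|(?P<LOGICOP>AND|OR)"
    r"|(?P<COMPAREOP>[<>!]=|[<>=])"
    r"|(?P<MULTOP>[*/%])"
    r"|(?P<ADDOP>[+-])"
    r"|(?P<SPACE>[ \t\r\n\f])"
)
SKIPPED = {"OTHER", "SPACE"}

SAMPLE = "This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?>"


def tokenize(text):
    """Return (type, text) pairs; the active pattern depends on the current mode."""
    tokens, pos, in_tag = [], 0, False
    while pos < len(text):
        m = (COMPOSER_MODE if in_tag else DEFAULT_MODE).match(text, pos)
        if m is None:
            raise ValueError(f"unrecognised character {text[pos]!r} at offset {pos}")
        kind = m.lastgroup
        if kind == "TAG_OPEN":
            in_tag = True    # corresponds to pushMode(COMPOSER)
        elif kind == "TAG_CLOSE":
            in_tag = False   # corresponds to popMode
        if kind not in SKIPPED:
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens


for kind, lexeme in tokenize(SAMPLE):
    print(kind, repr(lexeme))
```

Because the catch-all OTHER rule lives only in the default mode, an unexpected character inside a tag is still reported as an error instead of being silently dropped.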
A test with the input This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?> resulted in a parse tree with one tag subtree per <?compo ... ?> block; the text outside the tags was skipped by the lexer.
I found another solution (not as elegant as the previous one):
Create a generic TEXT token in the general context (so outside the tag's mode)
TEXT : ( ~[<] | '<' ~[?])+ -> skip;
Create a parser rule to handle generic text
code
: TEXT
| (TEXT? instruction TEXT?)+;
Create a parser rule to handle an instruction
instruction
: TAG_OPEN statement TAG_CLOSE;

ANTLR No viable alternative at input query

I am designing a parser for a small language - the grammar I have is as follows:
grammar Test;
r : specification+;
specification : MODULE module_def EOF;
module_def : ID EQ classExp;
classExp : basicClassExp
| altClassExp
;
basicClassExp : CLASS (classCode)* END;
altClassExp : ALTCLASS (classCode)* END;
classCode : 'classCode';
EXCLAMATION : '!';
EQ : '=';
MODULE : 'module';
END : 'end';
CLASS : 'class';
ALTCLASS : 'altclass';
ID
: JavaLetter JavaLetterOrDigit*
;
fragment
JavaLetter
: [a-zA-Z$_]
| ~[\u0000-\u007F\uD800-\uDBFF]
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
;
fragment
JavaLetterOrDigit
: [a-zA-Z0-9$_]
| ~[\u0000-\u007F\uD800-\uDBFF]
| [\uD800-\uDBFF] [\uDC00-\uDFFF]
;
WS : [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(HIDDEN)
;
An example of correct code would be:
module T = class end
When I run it on the following code :
module ! T ! = !
class !
end !
I expect it to report errors where the exclamation marks are. Instead I get:
line 1:7 extraneous input '!' expecting ID
line 1:11 extraneous input '!' expecting '='
line 1:15 no viable alternative at input '!'
When I remove the altClassExp rule alternative, it reports all errors i.e.
line 1:7 extraneous input '!' expecting ID
line 1:11 extraneous input '!' expecting '='
line 1:15 extraneous input '!' expecting 'class'
line 2:10 extraneous input '!' expecting {'classCode', 'end'}
line 3:6 extraneous input '!' expecting <EOF>
What do I need to change so I can keep my altClassExp rule and also report all the extraneous exclamation marks?
Thanks

extraneous input error using ANTLR4

I am an ANTLR newbie and struggling with some of the errors that I am getting. Below I have included the grammar that I am using, the input file and the error that I am getting.
My Antlr Grammar file is as follows:
grammar Simple;
@header
{
package simple;
}
PARSER
program :
anylinebefore+
processline
anylineafter+
'MSEND' NEWLINE
'.' EOF
;
anylinebefore: CH* NEWLINE | commentline;
anylineafter: statement | commentline;
statement: movestatement ;
movestatement : 'MOVE' arg ('to' | 'TO') ID '.' NEWLINE ;
arg : ID|STRING;
processline: PROCESSLITERAL NEWLINE;
commentline: '!' CH* NEWLINE;
LEXER
WS : [ \t]+ -> skip ;
STRING : '\'' (~['])* '\'';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
TO : ('to' | 'TO');
CH : [\u0000-\uFFFE];
PROCESSLITERAL : 'PROCESS SOURCE FOLLOWS';
NEWLINE : '\r'? '\n' ;
My input file is as follows:
MODIFY
PROCESS SOURCE FOLLOWS
MOVE 'WSFRED' TO AGRPASSEDTWO.
MSEND
.
The error that I get is:
showtree:
[java] line 1:0 extraneous input 'MODIFY' expecting {'!', CH, NEWLINE}
I don't understand why this isn't matching anylinebefore in the grammar
Any help would be appreciated.
"MODIFY" is an ID, which doesn't match anylinebefore+

antlr3ide generates parsers and lexers without package info?

antlr3ide seems to generate parser and lexer files without the package declaration matching the folder where the Java files are located (such as package tour.trees; where the relative folder tour/trees contains the corresponding files ExprParser.java and ExprLexer.java).
The official forum seems a bit inactive and the documentation doesn't give me much help :(
Below is a sample grammar file Expr.g:
grammar Expr;
options {
language = Java;
}
prog : stat+;
stat : expr NEWLINE
| ID '=' expr NEWLINE
| NEWLINE
;
expr: multiExpr (('+'|'-') multiExpr)*
;
multiExpr : atom('*' atom)*
;
atom : INT
| ID
| '(' expr ')'
;
ID : ('a'..'z'|'A'..'Z')+ ;
INT : '0'..'9'+;
NEWLINE : '\r'?'\n';
WS : (' '|'\t'|'\n'|'\r')+{skip();};
The package declaration is not something that antlr3ide generates. This is done by ANTLR. To let ANTLR generate source files in the package tour.trees, add @header blocks containing the package declarations in your grammar file like this:
grammar Expr;
options {
language = Java;
}
// placed _after_ the `options`-block!
@parser::header { package tour.trees; }
@lexer::header { package tour.trees; }
prog : stat+;
...