antlr4 grammar errors when parsing - antlr

I have the following grammar:
grammar Token;
prog: (expr NL?)+ EOF;
expr: '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value: Text | INT;
INT : '0' | [1-9] [0-9]*;
//WS : [ \t]+;
WS : [ \t\n\r]+ -> skip ;
NL: '\r'? '\n';
Text : ~[\]\[\n\r"]+ ;
and the text I need to parse is something like this below
[TXT:look at me!]
[USR:19700]
[TXT:, can I go there?]
[ENC:124124]
[TXT:this is needed for you to go...]
I need to split this text but I getting some errors when I run grun.bat Token prog -gui -trace -diagnostics
enter prog, LT(1)=[
enter expr, LT(1)=[
consume [#0,0:0='[',<3>,1:0] rule expr
enter type, LT(1)=TXT:look at me!
enter typeid, LT(1)=TXT:look at me!
line 1:1 mismatched input 'TXT:look at me!' expecting {'TXT', 'ENC', 'USR'}
... much more ...
what is wrong with my grammar? please, help me!

You must understand that the tokens are not created based on what the parser is trying to match. The lexer tries to match as much characters as possible (independently from that parser!): your Text token should be defined differently.
You could let the Text rule become a parser rule instead, and match single char tokens like this:
grammar Token;
prog : expr+ EOF;
expr : '[' type ']';
type : typeid ':' value;
typeid : 'TXT' | 'ENC' | 'USR';
value : text | INT;
text : CHAR+;
INT : '0' | [1-9] [0-9]*;
WS : [ \t\n\r]+ -> skip ;
CHAR : ~[\[\]\r\n];

Related

How to make antlr find invalid input throw exception

I have the following grammar:
grammar Expr;
expr : '-' expr # unaryOpExpr
| expr ('*'|'/'|'%') expr # mulDivModuloExpr
| expr ('+'|'-') expr # addSubExpr
| '(' expr ')' # nestedExpr
| IDENTIFIER '(' fnArgs? ')' # functionExpr
| IDENTIFIER # identifierExpr
| DOUBLE # doubleExpr
| LONG # longExpr
| STRING # string
;
fnArgs : expr (',' expr)* # functionArgs
;
IDENTIFIER : [_$a-zA-Z][_$a-zA-Z0-9]* | '"' (ESC | ~ ["\\])* '"';
LONG : [0-9]+;
DOUBLE : [0-9]+ '.' [0-9]*;
WS : [ \t\r\n]+ -> skip ;
STRING: '"' (~["\\\r\n] | ESC)* '"';
fragment ESC : '\\' (['"\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
MINUS : '-' ;
MUL : '*' ;
DIV : '/' ;
MODULO : '%' ;
PLUS : '+' ;
// math function
MAX: 'MAX';
when I enter following text,It should be effective
-1.1
bug when i enter following text:
-1.1ffff
I think it should report an error, bug antlr didn't do it, antlr captures the previous "-1.1", discard "ffff",
but i want to change this behavior, didn't discard invalid token, but throw exception,report
detection invalid token.
So what should i do, Thanks for your advice
Are you using expr as your main rule? if so make another rule, call it something like parse or program and simply write it like this:
parse: expr EOF;
This will make antlr not ignore trailing tokens that don't make sense, and actually throw an error.

Antlr Skip text outside tag

Im trying to skip/ignore the text outside a custom tag:
This text is a unique token to skip < ?compo \5+5\ ?> also this < ?compo \1+1\ ?>
I tried with the follow lexer:
TAG_OPEN : '<?compo ' -> pushMode(COMPOSER);
mode COMPOSER;
TAG_CLOSE : ' ?>' -> popMode;
NUMBER_DIGIT : '1'..'9';
ZERO : '0';
LOGICOP
: OR
| AND
;
COMPAREOP
: EQ
| NE
| GT
| GE
| LT
| LE
;
WS : ' ';
NEWLINE : ('\r\n'|'\n'|'\r');
TAB : ('\t');
...
and parser:
instructions
: (TAG_OPEN statement TAG_CLOSE)+?;
statement
: if_statement
| else
| else_if
| if_end
| operation_statement
| mnemonic
| comment
| transparent;
But it doesn't work (I test it by using the intelliJ tester on the rule "instructions")...
I have also add some skip rules outside the "COMPOSER" mode:
TEXT_SKIP : TAG_CLOSE .*? (TAG_OPEN | EOF) -> skip;
But i don't have any results...
Someone can help me?
EDIT:
I change "instructions" and now the parser tree is correctly builded for every instruction of every tag:
instructions : (.*? TAG_OPEN statement TAG_CLOSE .*?)+;
But i have a not recognized character error outside the the tags...
Below is a quick demo that worked for me.
Lexer grammar:
lexer grammar CompModeLexer;
TAG_OPEN
: '<?compo' -> pushMode(COMPOSER)
;
OTHER
: . -> skip
;
mode COMPOSER;
TAG_CLOSE
: '?>' -> popMode
;
OPAR
: '('
;
CPAR
: ')'
;
INT
: '0'
| [1-9] [0-9]*
;
LOGICOP
: 'AND'
| 'OR'
;
COMPAREOP
: [<>!] '='
| [<>=]
;
MULTOP
: [*/%]
;
ADDOP
: [+-]
;
SPACE
: [ \t\r\n\f] -> skip
;
Parser grammar:
parser grammar CompModeParser;
options {
tokenVocab=CompModeLexer;
}
parse
: tag* EOF
;
tag
: TAG_OPEN statement TAG_CLOSE
;
statement
: expr
;
expr
: '(' expr ')'
| expr MULTOP expr
| expr ADDOP expr
| expr COMPAREOP expr
| expr LOGICOP expr
| INT
;
A test with the input This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?> resulted in the following tree:
I found another solution (not elegant as the previous):
Create a generic TEXT token in the general context (so outside the tag's mode)
TEXT : ( ~[<] | '<' ~[?])+ -> skip;
Create a parser rule for handle a generic text
code
: TEXT
| (TEXT? instruction TEXT?)+;
Create a parser rule for handle an instruction
instruction
: TAG_OPEN statement TAG_CLOSE;

Lexer to handle lines with line number prefix

I'm writing a parser for a language that looks like the following:
L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>
That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits) followed by an identifer or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and following identifer/keyword are optional (and not present in most cases).
My current lexer looks like:
// Parser rules
commands : command*;
command : LINE_NUM? keyword NEWLINE
| LINE_NUM? IDENTIFIER NEWLINE;
keyword : KEYWORD_A | KEYWORD_B | ... ;
// Lexer rules
fragment INT : [0-9]+;
LINE_NUM : 'L' INT;
KEYWORD_A : 'someKeyword';
KEYWORD_B : 'reservedWord';
...
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]*
However this results in all lines beginning with a LINE_NUM token to be tokenized as IDENTIFIERs.
Is there a way to properly tokenize this input using an ANTLR grammar?
You need to add a semantic predicate to IDENTIFIER:
IDENTIFIER
: {_input.getCharPositionInLine() != 0
|| _input.LA(1) != 'L'
|| !Character.isDigit(_input.LA(2))}?
[a-zA-Z_] [a-zA-Z0-9_]*
;
You could also avoid semantic predicates by using lexer modes.
//
// Default mode is active at the beginning of a line
//
LINE_NUM
: 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
;
KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
: ( 'L'
| 'L' [a-zA-Z_] [a-zA-Z0-9_]*
| [a-zA-KM-Z_] [a-zA-Z0-9_]*
)
-> pushMode(NotBeginningOfLine)
;
NL : ('\r' '\n'? | '\n');
mode NotBeginningOfLine;
NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
NotBeginningOfLine_IDENTIFIER
: [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
;

extraneous input error using ANTLR4

I am an ANTLR newbie and struggling with some of the errors that I am getting. Below I have included the grammar that I am using, The input file and the error that I am getting.
My Antlr Grammar file is as follows:
grammar Simple;
#header
{
package simple;
}
PARSER
program :
anylinebefore+
processline
anylineafter+
'MSEND' NEWLINE
'.' EOF
;
anylinebefore: CH* NEWLINE | commentline;
anylineafter: statement | commentline;
statement: movestatement ;
movestatement : 'MOVE' arg ('to' | 'TO') ID '.' NEWLINE ;
arg : ID|STRING;
processline: PROCESSLITERAL NEWLINE;
commentline: '!' CH* NEWLINE;
LEXER
WS : [ \t]+ -> skip ;
STRING : '\'' (~['])* '\'';
ID : ('a'..'z'|'A'..'Z')+;
INT : '0'..'9'+;
TO : ('to' | 'TO');
CH : [\u0000-\uFFFE];
PROCESSLITERAL : 'PROCESS SOURCE FOLLOWS';
NEWLINE : '\r'? '\n' ;
My input file is as follows:
MODIFY
PROCESS SOURCE FOLLOWS
MOVE 'WSFRED' TO AGRPASSEDTWO.
MSEND
.
The error that I get is:
showtree:
[java] line 1:0 extraneous input 'MODIFY' expecting {'!', CH, NEWLINE}
I don't understand why this isn't matching anylinebefore in the grammar
Any help would be appreciated.
"MODIFY" is an ID, which doesn't match anylinebefore+

Why does my antlr grammar seem to properly parse this input?

I've created a small grammar in ANTLR using python (a grammar that can accept either a list of numbers of a list of IDs), and yet when I input a string such as December 12 1965, ANTLR will run on the file and show me no errors with the following code (and all of the python code that I'm using is imbedded via the #main):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
lexer = ParserLangLexer(char_stream)
tokens = CommonTokenStream(lexer)
parser = ParserLangParser(tokens);
rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar an input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
Which may seem strange since that could be the start of a idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefor not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Because the parser cannot match an idList (or a integerList for that matter) because of the whitespace rule, an error is produced pointing at the very first token ('December').
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
#header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
#main {
def main(argv, otherArg=None):
lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
parser = ParserLangParser(CommonTokenStream(lexer))
parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.