Antlr4 (java) tries to match all input to first token - intellij-idea

My ANTLR setup (I'm using the IntelliJ plugin) matches all my input to the first expression in my parser rule, which obviously causes an error.
Simple example:
grammar test;
rule : WORD '+' WORD;
WORD : [a-z]+;
Now testing:
input = 'faefae' gets me:
line 1:6 mismatched input '' expecting '+'
(so far it makes sense)
input = 'faefae+':
line 1:0 mismatched input 'faefae+' expecting WORD.
input = 'faefae+faefae':
line 1:0 mismatched input 'faefae+faefae' expecting WORD.
Last input should work, why doesn't it?
Help is much appreciated,
wish you all a nice day!

faefae+faefae will parse just fine.
You probably haven't regenerated the lexer/parser classes.
With IntelliJ and the ANTLR4 plugin, I get this:

Can a parser fail silently?

Can ANTLR-generated parsers fail silently? That
is, can they skip reporting a diagnostic when they do not recognise the input?
Using a very small grammar for a demonstration and using defaults only for ANTLR, these are the contrasting observations:
When sending input to the usual test rig for the grammar below, I am
noticing two things:
the parsers recognize valid input (actions show that), which is fine;
however, the recognisers seem to “accept” certain invalid(?) inputs, meaning there is no
diagnosis. V3 and v4 parsers behave similarly. The issue—if there is
an issue—appears when there are characters ('1') missing
at the front of an input for stat, provided that prior to this input another input of
just a NEWLINE had been sent.
This is the v4 grammar:
grammar Simp;
prog : stat+ ;
stat : '1' '+' '1' NEWLINE
| NEWLINE
;
NEWLINE : [\r]?[\n] ;
The v3 grammar is the same, mutatis mutandis.
Some runs using v4; class TestSimp4 is the usual test rig as in the book(s),
see below:
% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%
I had expected the results of the first two invocations. I had expected the last invocation to fail visibly, though. Correct?
Looking at the generated SimpParser.java, the silent exit seems
consequential, as outlined below. But should it be that way? I am thinking that ANTLR just
stops before recognising invalid input here, but it shouldn't just stop.
Question: Is this silent failure rather to be expected? Have I
overlooked something like a greediness setting for grammar tokens with a
+ suffix?
Some code analysis.
Referring to the loop that calls stat() (in the
prog() procedure):
The v3 parser sets a counter variable to >= 1 on successfully matching
the initial NEWLINE. The effect is that EarlyExitException is then
not thrown on later inputs; the loop simply breaks.
The v4 parser similarly calls _input.LA(1) and then just terminates
the loop whenever that call’s result cannot be at the start of stat.
(So no recovery?)
The test rig:
class TestSimp4 {
public static void main(String[] args) throws Exception {
final CharStream subject = CharStreams.fromStream(System.in);
final TokenSource tknzr = new SimpLexer(subject);
final CommonTokenStream ts = new CommonTokenStream(tknzr);
final SimpParser parser = new SimpParser(ts);
parser.prog();
}
}
So another paraphrase of my question would be: “How does one
create ANTLR parsers such that they will always say YES or NO?”
Your 3rd test input, \n+1\n, does not produce an error because you're telling it to recognize the production/rule stat once or more. And prog successfully matches the input \n and then stops. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:
prog : stat+ EOF;
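The difference between the two versions of prog can be sketched with a plain regex analogy. This is not ANTLR code and does not use the ANTLR runtime; the class name AnchorDemo and the pattern standing in for stat+ are assumptions made for illustration. Matcher.lookingAt() behaves like the unanchored rule (a matching prefix is enough), while Matcher.matches() behaves like the rule ending in EOF (all input must be consumed).

```java
import java.util.regex.Pattern;

// Regex analogy only: this does not use ANTLR and is not how ANTLR parses.
class AnchorDemo {
    // Stand-in for `stat+`: one or more of "1+1\n" or a bare "\n".
    private static final Pattern STAT_PLUS = Pattern.compile("(?:1\\+1\\n|\\n)+");

    // Like `prog : stat+ ;` where a matching prefix of the input is enough.
    static boolean acceptsPrefix(String input) {
        return STAT_PLUS.matcher(input).lookingAt();
    }

    // Like `prog : stat+ EOF ;` where the whole input must be consumed.
    static boolean acceptsWhole(String input) {
        return STAT_PLUS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"1+1\n", "+1\n", "\n+1\n"};
        for (String in : inputs) {
            System.out.println(in.replace("\n", "\\n")
                + " prefix=" + acceptsPrefix(in)
                + " whole=" + acceptsWhole(in));
        }
    }
}
```

For the third input, \n+1\n, the prefix check succeeds after the leading newline while the whole-input check fails, which mirrors the silent stop described in the question.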

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
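The selection steps above can be sketched as a toy lexer in plain Java. This is only an illustration of "longest match, first rule wins ties", not ANTLR's actual implementation; the class MaximalMunchDemo and its regexes (hand-translated from the FILEPATH, NEWLINE and TITLE rules in the question) are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy lexer: longest match wins, the earlier rule wins a tie. Not ANTLR's implementation.
class MaximalMunchDemo {
    // Rules in definition order, mirroring the grammar in the question.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("FILEPATH", Pattern.compile("[A-Za-z0-9:\\\\/ \\-_.]+"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
        RULES.put("TITLE", Pattern.compile("[A-Za-z ]+"));
    }

    // Returns the token type names produced for the input, in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so the rule defined first keeps a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestName);
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input from the question: a file path, a newline, then "x".
        System.out.println(tokenize("c:\\test.txt\nx")); // prints [FILEPATH, NEWLINE, FILEPATH]
    }
}
```

The final "x" is matched with the same length by FILEPATH and TITLE, so the rule defined first (FILEPATH) wins, and the parser never sees a TITLE token.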
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that is matched by TITLE is also matched by FILEPATH, and FILEPATH is defined before TITLE. So each token that you expect to be a TITLE will be a FILEPATH.
There are a few hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I got the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. In my case the reason was that I had placed the new keyword after my VARNAME lexer rule, so it was tokenized as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it even if it is wrong, and how you can use semantic predicates or lookahead to remove ambiguity. But nothing I have read has helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
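One way to pre-process the header outside the grammar is sketched below in plain Java. The class HeaderPreprocessor and the method parseHeaderLine are hypothetical names, and the sketch assumes each header is a single "name: value" line; the rest of the file would still be handed to the ANTLR lexer and parser.

```java
import java.util.Map;

// Hypothetical helper: handles one "name: value" header line without ANTLR.
class HeaderPreprocessor {
    // Splits at the first ':' and trims whitespace (including the trailing newline).
    static Map.Entry<String, String> parseHeaderLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("not a header line: " + line);
        }
        String name = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        return Map.entry(name, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> header = parseHeaderLine("title: hello world\n");
        System.out.println(header.getKey() + " -> " + header.getValue()); // prints title -> hello world
    }
}
```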
ANTLR parsing is usually done in 2 steps.
1. construct your AST
2. define your grammar
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR SPACE? WORD (SPACE WORD)* ;
Hope that helps....

ANTLR error when not enough, or too many, newlines

ANTLR gives me the following error when my input file has either no newline at the EOF, or more than one.
line 0:-1 mismatched input '' expecting NEWLINE
How would I go about taking into account the possibility of having multiple or no newlines at the end of the input file? Preferably I'd like to account for this in the grammar.
The rule:
parse
: (Token LineBreak)+ EOF
;
only parses a stream of tokens, separated by exactly one line break, ending with exactly one line break.
While the rule:
parse
: Token (LineBreak+ Token)* LineBreak* EOF
;
parses a stream of tokens separated by one or more line breaks, ending with zero, one or more line breaks.
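Which token sequences each of the two rules accepts can be checked with a small regex analogy in plain Java. This is not ANTLR code: the class LineBreakRules is a made-up name, and each token is encoded as one character (T for Token, L for LineBreak), with the end of the string standing in for EOF.

```java
import java.util.regex.Pattern;

// Regex analogy over token kinds: 'T' = Token, 'L' = LineBreak. Not ANTLR code.
class LineBreakRules {
    // Stand-in for: (Token LineBreak)+ EOF
    private static final Pattern STRICT = Pattern.compile("(TL)+");
    // Stand-in for: Token (LineBreak+ Token)* LineBreak* EOF
    private static final Pattern LENIENT = Pattern.compile("T(L+T)*L*");

    static boolean strict(String kinds) {
        return STRICT.matcher(kinds).matches();
    }

    static boolean lenient(String kinds) {
        return LENIENT.matcher(kinds).matches();
    }

    public static void main(String[] args) {
        // No trailing line break, exactly one, and two.
        for (String kinds : new String[] {"TLT", "TLTL", "TLTLL"}) {
            System.out.println(kinds + " strict=" + strict(kinds) + " lenient=" + lenient(kinds));
        }
    }
}
```

Only the sequence ending in exactly one line break passes the first rule, while all three pass the second.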
But do you really need to make the line breaks visible in the parser? Couldn't you put them on a "hidden channel" instead?
If this doesn't answer your question, you'll have to post your grammar (you can edit your original question for that).
HTH