ANTLR error when not enough, or too many, newlines - antlr

ANTLR gives me the following error when my input file has either no newline at the EOF, or more than one.
line 0:-1 mismatched input '' expecting NEWLINE
How would I go about taking into account the possibilities of having multiple or no newlines at the end of the input file. Preferably I'd like to account for this in the grammar.

The rule:
parse
: (Token LineBreak)+ EOF
;
only parses a stream of tokens, separated by exactly one line break, ending with exactly one line break.
While the rule:
parse
: Token (LineBreak+ Token)* LineBreak* EOF
;
parses a stream of tokens separated by one or more line breaks, ending with zero, one or more line breaks.
But do you really need to make the line breaks visible in the parser? Couldn't you put them on a "hidden channel" instead?
If this doesn't answer your question, you'll have to post your grammar (you can edit your original question for that).
HTH

Related

Partially skip characters in ANTLR4

I'm trying to match the following phrase:
<svg/onload="alert(1);">
And I need the tokens to be like:
'<svg', 'onload="alert(1);", '>'
So basically I need to skip the / in the <svg/onload part. But the skip phrase is not allowed here:
Attribute
: ('/' -> skip) Identifier '=' StringLiteral?
;
The error was
error(133): HTML.g4:35:11: ->command in lexer rule Attribute must be last element of single outermost alt
Any ideas?
The error message pretty much tells you what the problem is. The skip command has to be at the end of the rule. You cannot skip intermediate tokens, but only entire rules.
However, I wonder why you want to skip the slash. Why not just let the lexer scan everything (it has to anyway) and then ignore the tokens you don't need? Also I wouldn't use a lexer rule, but a parser rule, to allow arbitrary spaces between elements.
Try lexer's setText(getText().replace("/", "")) or any other matched string manipulation

Antlr4 (java) tries to match all input to first token

my antlr (I'm using IntelliJ plugin) matches all my input to the first expression in my parser rule, which obviously causes an error.
Simple example:
grammar test;
rule : WORD '+' WORD;
WORD : [a-z]+;
Now testing:
input = 'faefae' gets me:
line 1:6 mismatched input '' expecting '+'
(so far it makes sense)
input = 'faefae+':
line 1:0 mismatched input 'faefae+' expecting WORD.
input = 'faefae+faefae':
line 1:0 mismatched input 'faefae+faefae' expecting WORD.
Last input should work, why doesn't it?
Help is much appreciated,
wish you all a nice day!
faefae+faefae will parse just fine.
You probably haven't regenerated the lexer/parser classes.
With IntelliJ and the ANTLR4 plugin, I get this:

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....

Antlr, get last line from token

I have a token definition that can contain multiple lines (something like multi line comments).
I can use the .line attribute to get the line where the token starts, but I need to
know where the token end.
Is it possible to get the last line of the token?
You can change the line number of a token by placing the (Java) code-block {$line=getLine();} at the end of the rule.
So, for multi-line comments, that would look like this:
COMMENT
: '/*' .* '*/' {$line=getLine();}
;
causing the method getLine() of the token COMMENT to return the line number the sub-string "*/" occurred on.