I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:
grammar MyGrammar;
parse : PREFIX? SEARCH;
PREFIX
: [0-9]+ ':'
;
SEARCH
: .+
;
e.g. valid input strings include:
0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper
But the SEARCH rule always matches the whole string - whether it has a prefix or not.
I understand this is because the ANTLR4 lexer gives preference to the rules that match the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.
And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule - and the SEARCH rule doesn't have any other parts to constrain it.
If it helps, I could constrain the SEARCH text to exclude ':' but I really would prefer it recognise anything else - unicode characters, symbols, numbers, space etc.
I've read Lexer to handle lines with line number prefix but in that case, the body of the string (after the prefix) is significantly more constrained.
Note: SEARCH text might have a structure to it - like €53.00 and 10kg above (which I'd also like ANTLR4 to parse) or it might just be free text - like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.
ANTLR does lexing before parsing. The lexer prefers long matches and SEARCH tokens match every PREFIX token and even any character appended to it, so your complete line is matched by SEARCH.
To prevent this: Keep the lexer rules disjunct, or at least the tokens should not subsume each other.
parse : prefix? search;
search: (WORD | NUMBER)+;
prefix: NUMBER ':';
NUMBER : [0-9]+;
WORD : (~[0-9:])+;
Related
I'm tryna do an ANTLR translator from Markdown format to HTML document and I found this problem when I try to recognize bold format. This is my ANTLR rule:
TxtNegrita : ('**' | '__') .*? ('**' | '__') {System.out.println('<span class="bold">' + getText() + '</span>');};
Unfortunately, the getText() function retrieves all the recognized String, including ** at the beginning and at the end of the String. Is it a way to delete that chars using ANTLR (obviously, in Java is perfectly possible).
Thanks!
You’ve created a Lexer rule which results in a single token. That is the expected behavior.
That rule looks more like something I’d expect in a parser rule.
(rules begin with upper case characters (conventionally all uppercase to make them stand out), and parser rules begin with lowercase letters and result in parse trees where each node has a context which gives you access to the component parts of your parser rule.
In ANTLR it is quite important to understand the difference between Lexer rules and parser rules.
Put simply... your input stream of characters is converted to an input stream of tokens using Lexer rules, and that stream of tokens is processed by parser rules.
Tokens are pretty much the “atoms” that parser rules deal with and their values are simply the string of characters that matched the Lexer Rule.
First I tried to identify a normal word and below works fine:
grammar Test;
myToken: WORD;
WORD: (LOWERCASE | UPPERCASE )+ ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9' ;
WHITESPACE : (' ' | '\t')+;
Just when I added below parser rule just beneath "myToken", even my WORD tokens weren't getting recognised with input string as "abc"
ALPHA_NUMERIC_WS: ( WORD | DIGIT | WHITESPACE)+;
Does anyone have any idea why is that?
This is because ANTLR's lexer matches "first come, first serve". That means it will tray to match the given input with the first specified (in the source code) rule and if that one can match the input, it won't try to match it with the other ones.
In your case ALPHA_NUMERIC_WS does match the same content as WORD (and more) and because it is specified before WORD, WORD will never be used to match the input as there is no input that can be matched by WORD that can't be matched by the first processed ALPHA_NUMERIC_WS. (The same applies for the WS and the DIGIT) rule.
I guess that what you want is not to create a ALPHA_NUMERIC_WS-token (as is done by specifying it as a lexer rule) but to make it a parser rule instead so it then can be referenced from another parsre rule to allow an arbitrary sequence of WORDs, DIGITs and WSs.
Therefore you'd want to write it like this:
alpha_numweric_ws: ( WORD | DIGIT | WHITESPACE)+;
If you actually want to create the respective token you can either remove the following rules or you need to think about what a lexer's job is and where to draw the line between lexer and parser (You need to redesign your grammar in order for this to work).
I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space letter but not '.{' inside it. Best I can come up with is using a semantics predicate.
WORD: WL+ {!getText().contains(".{")};
WL: ~[ \n\r\t];
I'm a bit worried to use semantics predicate though cause WORD here will be lexed millions of times I would think to put a semantics predicate will hit the performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { in its letter.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]*
How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....