Do independent rules influence one another? - antlr

When I was debugging my grammar for C# I noticed something very unusual: some inputs that are not accepted by a full grammar are being accepted by the same grammar with some independent rules deleted. I could not find a logical explanation. For example:
CS - this grammar does not accept the input a<a<a>><EOF>
CS' - and this grammar which is basically the same as CS but with some independent rules deleted (rules are not reordered) does accept a<a<a>><EOF>
As you can see both grammars start with the rule start: namespaceOrTypeName EOF; and therefore they should call the same set of rules (CS will never call those rules that are deleted in CS'). I spent a day debugging this, deleting or adding new rules, but couldn't find a flaw in the logic. Any help would be of use, thank you.
Unicode
EDIT:
After changing the start rule in CS to start: Identifier EOF; the grammar starts rejecting the input method which is normally accepted when only Identifier rules are defined. So I guess, since there is a rule attributeTarget: ...| 'method' | ..., that after compiling the grammar some phrases get reserved such as 'method' in this case but I'm not still sure if thats the case.

The first grammar includes the overloadableBinaryOperator rule which implicitly defines the >> token. Since >> is a 2-character token, the lexer will never treat the input >> as two separate 1-character tokens >, >. If you open the grammar in ANTLRWorks 2, you'll see a warning indicator for each implicitly-defined token. You should remove all of these warnings by:
Creating explicit lexer rules for every token you intend to appear in the input.
Only using the syntax 'new' in a parser rule if a corresponding lexer rule exists for the literal 'new'.

Related

Modifiying ANTLR v4 auto-generated lexer?

So i am writing a small language and i am using antlrv4 as my tool. Antlr autogenerates lexer and parser files when u compile your grammar file(.g4). I am using javac btw. I want my language to have no semicolons and the way i want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically put the semicolon(Similar to what "go" language does). How would i approach something like this? There are other things like ATN(which i think is augmented transition network) and dfa(which i think is deterministic finite automaton) in the lexer file which i don't understand or how they relate to the lexing process?. Any help is appreciated. (btw i am still working on the grammar file so i don't have that fully completed).
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: simpleAssignment | complexAssignment SEMI?;
The parser will give you the content of the assignment rule regardless whether there is a trailing semicolon or not. Is that what you want?

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

When is EOF needed in ANTLR 4?

The TestDriver in ANTLRWorks2 seems kind of finicky about when it'll accept a grammer without and explicit EOF and when it will not. The Hello grammar in the ANTLR4 Getting Started Guide doesn't use EOF anywhere, so I inferred that it's better to avoid explicit EOF if possible.
What is the best practice for using EOF? When do you actually need it?
You should include an explicit EOF at the end of your entry rule any time you are trying to parse an entire input file. If you do not include the EOF, it means you are not trying to parse the entire input, and it's acceptable to parse only a portion of the input if it means avoiding a syntax error.
For example, consider the following rule:
file : item*;
This rule means "Parse as many item elements as possible, and then stop." In other words, this rule will never attempt to recover from a syntax error because it will always assume that the syntax error is part of some syntactic construct that's beyond the scope of the file rule. Syntax errors will not even be reported, because the parser will simply stop.
If instead I had the following rule:
file : item* EOF;
In means "A file consists exactly of a sequence of zero-or-more item elements." If a syntax error is reached while parsing an item element, this rule will attempt to recover from (and report) the syntax error and continue because the EOF is required and has not yet been reached.
For rules where you are only trying to parse a portion of the input, ANTLR 4 often works, but not always. The following issue describes a technical problem where ANTLR 4 does not always make the correct decision if the EOF is omitted.
https://github.com/antlr/antlr4/issues/118
Unfortunately the performance impact of this change is substantial, so until that is resolved there will be edge cases that do not behave as you expect.

How to code grammar or lexer rule to describe JSP/EL identifier or string literal in ANTLR?

How to code grammar or lexer rule to describe JSP/EL identifier or string literal in ANTLR? Remember, that JSP/EL is Unicode and you cannot list all possible symbols in a rule. Also remember, that strings can contain EL expressions, which may be complex, so lexer is insufficient to describe them, parser is required, while ANTLR parser is unable to match character classes or any character.
Checkout the new "Lexical Modes":
Lexical Modes
Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It’s like having multiple sublexers, one for context. The lexer can only return tokens matched by entering a rule in the current mode. Lexers start out in the so-called default mode. All rules are considered to be within the default mode unless you specify a mode command. Modes are not allowed within combined grammars, just lexer grammars.
-- http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules

Mismatched double token

In ANTLR, I have a MismatchedTokenException with the following definition:
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
And the following test:
A<B,C<D>>
The exception occurs when parsing the first >. ANTLR tries parsing both '>>' at once, and fails.
With a silent whitespace channel, the following test does work:
A<B,C<D> >
In which ANTLR is clearly instructed to treat each token separately.
How can I fix that?
I could not reproduce that. The parser generated by:
grammar T;
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
IDENTIFIER : 'A'..'Z';
parses the input A<B,C<D>> (without spaces) into the following parse tree:
You'll need to provide the grammar that causes this input to produce a MismatchedTokenException.
Perhaps you're using ANTLRWorks' interpreter (or Eclipse's ANTLR-IDE, which uses the same interpreter)? In that case, that is probably the problem: it's notoriously buggy. Don't use it, but use ANTLRWorks' debugger: it's great (the image posted above comes from the debugger).
Lazlo Bonin wrote:
Got it. I had a << token defined. Quickly, is there a way to priorize token recognition over another?
No, the lexer simply tries to match as much as possible. So if it can create a token matching << (or >>), it will do so in favor of two single < (or >) tokens. Only when two (or more) lexer rules match the same amount of characters, a prioritization is made: the rule defined first will then "win" over the one(s) defined later in the grammar.