Conditionally skipping an ANTLR lexer rule based on current line number - antlr

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, I see:
So it seems like the predicate isn't being used, and regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT:
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.

I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define a possible HASH_COMMENT (which you might interpret as MAGIC later, without using such a token type) before any content. Might not be line one, but before anything else (which resembles real document better, where you can have whitespaces before your hash comment).

Related

Partially skip characters in ANTLR4

I'm trying to match the following phrase:
<svg/onload="alert(1);">
And I need the tokens to be like:
'<svg', 'onload="alert(1);", '>'
So basically I need to skip the / in the <svg/onload part. But the skip phrase is not allowed here:
Attribute
: ('/' -> skip) Identifier '=' StringLiteral?
;
The error was
error(133): HTML.g4:35:11: ->command in lexer rule Attribute must be last element of single outermost alt
Any ideas?
The error message pretty much tells you what the problem is. The skip command has to be at the end of the rule. You cannot skip intermediate tokens, but only entire rules.
However, I wonder why you want to skip the slash. Why not just let the lexer scan everything (it has to anyway) and then ignore the tokens you don't need? Also I wouldn't use a lexer rule, but a parser rule, to allow arbitrary spaces between elements.
Try lexer's setText(getText().replace("/", "")) or any other matched string manipulation

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
rule TOP { \d+ }
}
my #strings = '137', '137 ', ' 137 ';
for #strings -> $string {
my $result = Number.parse( $string );
given $result {
when Match { put "<$string> worked!" }
when Any { put "<$string> failed!" }
}
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but the docs do not say which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs1 (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, . Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 6291:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 44231:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
1) The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars, is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Do independent rules influence one another?

When I was debugging my grammar for C# I noticed something very unusual: some inputs that are not accepted by a full grammar are being accepted by the same grammar with some independent rules deleted. I could not find a logical explanation. For example:
CS - this grammar does not accept the input a<a<a>><EOF>
CS' - and this grammar which is basically the same as CS but with some independent rules deleted (rules are not reordered) does accept a<a<a>><EOF>
As you can see both grammars start with the rule start: namespaceOrTypeName EOF; and therefore they should call the same set of rules (CS will never call those rules that are deleted in CS'). I spent a day debugging this, deleting or adding new rules, but couldn't find a flaw in the logic. Any help would be of use, thank you.
Unicode
EDIT:
After changing the start rule in CS to start: Identifier EOF; the grammar starts rejecting the input method which is normally accepted when only Identifier rules are defined. So I guess, since there is a rule attributeTarget: ...| 'method' | ..., that after compiling the grammar some phrases get reserved such as 'method' in this case but I'm not still sure if thats the case.
The first grammar includes the overloadableBinaryOperator rule which implicitly defines the >> token. Since >> is a 2-character token, the lexer will never treat the input >> as two separate 1-character tokens >, >. If you open the grammar in ANTLRWorks 2, you'll see a warning indicator for each implicitly-defined token. You should remove all of these warnings by:
Creating explicit lexer rules for every token you intend to appear in the input.
Only using the syntax 'new' in a parser rule if a corresponding lexer rule exists for the literal 'new'.

Mismatched double token

In ANTLR, I have a MismatchedTokenException with the following definition:
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
And the following test:
A<B,C<D>>
The exception occurs when parsing the first >. ANTLR tries parsing both '>>' at once, and fails.
With a silent whitespace channel, the following test does work:
A<B,C<D> >
In which ANTLR is clearly instructed to treat each token separately.
How can I fix that?
I could not reproduce that. The parser generated by:
grammar T;
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
IDENTIFIER : 'A'..'Z';
parses the input A<B,C<D>> (without spaces) into the following parse tree:
You'll need to provide the grammar that causes this input to produce a MismatchedTokenException.
Perhaps you're using ANTLRWorks' interpreter (or Eclipse's ANTLR-IDE, which uses the same interpreter)? In that case, that is probably the problem: it's notoriously buggy. Don't use it, but use ANTLRWorks' debugger: it's great (the image posted above comes from the debugger).
Lazlo Bonin wrote:
Got it. I had a << token defined. Quickly, is there a way to priorize token recognition over another?
No, the lexer simply tries to match as much as possible. So if it can create a token matching << (or >>), it will do so in favor of two single < (or >) tokens. Only when two (or more) lexer rules match the same amount of characters, a prioritization is made: the rule defined first will then "win" over the one(s) defined later in the grammar.