Meaning of colon inside parenthesises - antlr

What meaning does the expression (: .) have inside the ANTLR-rule
message
: COLON (: .)*?
;
at https://github.com/antlr/grammars-v4/blob/master/stacktrace/StackTrace.g4#L79
?
I can't any doc reference to colons being allowed inside rules. Shouldn't the : inside the parenthesises be replaced by COLON in this case?
Update:
Are these cases also left-overs from removal of options?;
grammars-v4/antlr/antlr3/examples/Verilog3.g(548,13): Warning: ignoring colon with no effect, token `'else'` at offset 12136
grammars-v4/vhdl/vhdl.g4(696,18): Warning: ignoring colon with no effect, token `logical_operator` at offset 13302
grammars-v4/vhdl/vhdl.g4(700,17): Warning: ignoring colon with no effect, token `DOUBLESTAR` at offset 13359
grammars-v4/vhdl/vhdl.g4(1087,25): Warning: ignoring colon with no effect, token `identifier` at offset 20396
grammars-v4/vhdl/vhdl.g4(1232,9): Warning: ignoring colon with no effect, token `relational_operator` at offset 22963
grammars-v4/vhdl/vhdl.g4(1310,9): Warning: ignoring colon with no effect, token `shift_operator` at offset 24300
grammars-v4/vhdl/vhdl.g4(1351,32): Warning: ignoring colon with no effect, token `adding_operator` at offset 24999
grammars-v4/vhdl/vhdl.g4(1497,16): Warning: ignoring colon with no effect, token `multiplying_operator` at offset 28078
grammars-v4/pascal/pascal.g4(439,38): Warning: ignoring colon with no effect, token `ELSE` at offset 7347

It has no meaning and should be removed. My guess is that the grammar is converted from ANTLR3 where the following was in the v3 version:
message
: COLON (options{greedy=false;}: .)*
;
They removed the options{greedy=false;} part without removing the :. To match ungreedy in v4, the ? was introduced. You can also remove the parenthesis:
message
: COLON .*?
;
This should be a warning for you: the grammar is not well tested, and the fact that a rule ends with a .*? is always a bit suspicious to me (often there are better ways to match such a rule). There are also more things I don't like about this grammar (not necessarily wrong, but things I consider bad practice). Be cautious using this in a production environment!

Related

Conditionally skipping an ANTLR lexer rule based on current line number

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, I see:
So it seems like the predicate isn't being used, and regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT:
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define a possible HASH_COMMENT (which you might interpret as MAGIC later, without using such a token type) before any content. Might not be line one, but before anything else (which resembles real document better, where you can have whitespaces before your hash comment).

ANTLR not matching empty comments

I am using ANTLR to parse a language which uses the colon for both a comment indicator and as part of a 'becomes equal to' assignment. So for example in the line
Index := 2 :Set Index
I need to recognize the first part as an assignment statement and the text after the second colon as a comment. Currently I do this using the rule:
COMMENT : ':'+ ~[:='\r\n']*;
This seems to work OK apart from when the colon is immediately followed by a new line. e.g. in the line
Index := 2 :
the newline occurs immediately after the second colon. In this case the comment is not recognized and the rest of the code is not parsed in the correct context. If there is a single space after the second colon the line is parsed correctly.
I expected the '\r'\n' to cope with this but it only seems to work if there is at least one character after the comment symbol - have I missed something from the command?
The braces denote a collection of characters without any quotes. Hence your '\r\n' literal doesn't work there (you should have got a warning that the apostrophe is included more than once in the char range.
Define the comment like this instead:
COMMENT: ':'+ ~[:=\n\r]*;

Partially skip characters in ANTLR4

I'm trying to match the following phrase:
<svg/onload="alert(1);">
And I need the tokens to be like:
'<svg', 'onload="alert(1);", '>'
So basically I need to skip the / in the <svg/onload part. But the skip phrase is not allowed here:
Attribute
: ('/' -> skip) Identifier '=' StringLiteral?
;
The error was
error(133): HTML.g4:35:11: ->command in lexer rule Attribute must be last element of single outermost alt
Any ideas?
The error message pretty much tells you what the problem is. The skip command has to be at the end of the rule. You cannot skip intermediate tokens, but only entire rules.
However, I wonder why you want to skip the slash. Why not just let the lexer scan everything (it has to anyway) and then ignore the tokens you don't need? Also I wouldn't use a lexer rule, but a parser rule, to allow arbitrary spaces between elements.
Try lexer's setText(getText().replace("/", "")) or any other matched string manipulation

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[, and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that would convert all named characters to their unicode representation, and two is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:
idents : ident+ ;
ident : NChar // named character string
| ID // simple character string?
| Literal // something quoted?
| ....
;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Best practice would not be to pack both children into the same token - would just have to later manually break the token text into the two parts to check if it is contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
#members {
ArrayList<String> keyList = .... // list of named chars
public boolean inList(String id) {
return keyList.contains(id) ;
}
}
nChar : known
| unknown
;
known : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.

Mismatched double token

In ANTLR, I have a MismatchedTokenException with the following definition:
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
And the following test:
A<B,C<D>>
The exception occurs when parsing the first >. ANTLR tries parsing both '>>' at once, and fails.
With a silent whitespace channel, the following test does work:
A<B,C<D> >
In which ANTLR is clearly instructed to treat each token separately.
How can I fix that?
I could not reproduce that. The parser generated by:
grammar T;
type : IDENTIFIER ('<' (type (',' type)*) '>')?;
IDENTIFIER : 'A'..'Z';
parses the input A<B,C<D>> (without spaces) into the following parse tree:
You'll need to provide the grammar that causes this input to produce a MismatchedTokenException.
Perhaps you're using ANTLRWorks' interpreter (or Eclipse's ANTLR-IDE, which uses the same interpreter)? In that case, that is probably the problem: it's notoriously buggy. Don't use it, but use ANTLRWorks' debugger: it's great (the image posted above comes from the debugger).
Lazlo Bonin wrote:
Got it. I had a << token defined. Quickly, is there a way to priorize token recognition over another?
No, the lexer simply tries to match as much as possible. So if it can create a token matching << (or >>), it will do so in favor of two single < (or >) tokens. Only when two (or more) lexer rules match the same amount of characters, a prioritization is made: the rule defined first will then "win" over the one(s) defined later in the grammar.