ANTLR: get the last line of a token

I have a token definition that can span multiple lines (something like multi-line comments).
I can use the .line attribute to get the line where the token starts, but I need to
know where the token ends.
Is it possible to get the last line of the token?

You can change the line number of a token by placing the (Java) code-block {$line=getLine();} at the end of the rule.
So, for multi-line comments, that would look like this:
COMMENT
: '/*' .* '*/' {$line=getLine();}
;
causing the method getLine() of the token COMMENT to return the line number the sub-string "*/" occurred on.
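If you would rather leave the token's start line untouched, a different approach from the one above is to derive the end line from the start line and the matched text. The following is a minimal sketch, assuming you already have both values (for example from the token's getLine() and getText()). The class and method names are made up for illustration, and it assumes lines are delimited by '\n', so "\r\n" also counts as a single break.

```java
final class TokenLines {
    /**
     * Returns the line on which a token ends, given its 1-based start line
     * and its matched text. Only '\n' is counted as a line break.
     */
    static int endLine(int startLine, String text) {
        int line = startLine;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) {
        // A comment that starts on line 3 and spans three lines ends on line 5.
        System.out.println(endLine(3, "/* first\n   second\n*/"));
    }
}
```

With the ANTLR runtime this would be called along the lines of endLine(token.getLine(), token.getText()).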

Related

ANTLR not matching empty comments

I am using ANTLR to parse a language which uses the colon both as a comment indicator and as part of a 'becomes equal to' assignment. So for example in the line
Index := 2 :Set Index
I need to recognize the first part as an assignment statement and the text after the second colon as a comment. Currently I do this using the rule:
COMMENT : ':'+ ~[:='\r\n']*;
This seems to work OK apart from when the colon is immediately followed by a new line. e.g. in the line
Index := 2 :
the newline occurs immediately after the second colon. In this case the comment is not recognized and the rest of the code is not parsed in the correct context. If there is a single space after the second colon the line is parsed correctly.
I expected the '\r\n' to cope with this, but it only seems to work if there is at least one character after the comment symbol. Have I missed something in the rule?
The square brackets denote a set of characters, written without any quotes. Hence your '\r\n' literal doesn't work there: inside the brackets the apostrophes are just more characters in the set (you should have got a warning that the apostrophe is included more than once in the char set).
Define the comment like this instead:
COMMENT: ':'+ ~[:=\n\r]*;
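To see what the stray apostrophes do, here is a small sketch using Java regular expressions as a stand-in for the two lexer sets. This is only an analogy (Java regex character classes are not ANTLR sets), and the class, field and method names are invented for the example. It shows that the corrected set matches a lone colon before a newline, and that putting the apostrophe in the negated set cuts a comment short at the first apostrophe.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentSetDemo {
    // Regex analogue of  COMMENT: ':'+ ~[:=\n\r]*;
    static final Pattern FIXED = Pattern.compile(":+[^:=\\n\\r]*");
    // Regex analogue of the original set, where the apostrophe is a set member too.
    static final Pattern QUOTED = Pattern.compile(":+[^:='\\r\\n]*");

    /** Returns the text matched starting exactly at offset, or null if nothing matches there. */
    static String matchAt(Pattern p, String input, int offset) {
        Matcher m = p.matcher(input);
        m.region(offset, input.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "Index := 2 :Bob's index\n";
        int start = line.lastIndexOf(':');                 // offset of the comment colon
        System.out.println(matchAt(FIXED, line, start));   // whole comment up to the newline
        System.out.println(matchAt(QUOTED, line, start));  // stops before the apostrophe
    }
}
```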

Partially skip characters in ANTLR4

I'm trying to match the following phrase:
<svg/onload="alert(1);">
And I need the tokens to be like:
'<svg', 'onload="alert(1);"', '>'
So basically I need to skip the / in the <svg/onload part. But the skip phrase is not allowed here:
Attribute
: ('/' -> skip) Identifier '=' StringLiteral?
;
The error was
error(133): HTML.g4:35:11: ->command in lexer rule Attribute must be last element of single outermost alt
Any ideas?
The error message pretty much tells you what the problem is. The skip command has to be at the end of the rule. You cannot skip intermediate tokens, but only entire rules.
However, I wonder why you want to skip the slash. Why not just let the lexer scan everything (it has to anyway) and then ignore the tokens you don't need? Also I wouldn't use a lexer rule, but a parser rule, to allow arbitrary spaces between elements.
Try the lexer's setText(getText().replace("/", "")), or any other manipulation of the matched text.
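One thing to watch with replace("/", "") is that it removes every slash in the matched text, including any inside the attribute value. A minimal sketch of a narrower alternative, which drops only a leading slash, could look like this. The helper and class names are hypothetical; in a lexer action it would presumably be used as setText(stripLeadingSlash(getText())).

```java
final class AttributeText {
    /** Removes a single leading '/' from the matched text, leaving all other slashes alone. */
    static String stripLeadingSlash(String matched) {
        return matched.startsWith("/") ? matched.substring(1) : matched;
    }

    public static void main(String[] args) {
        String matched = "/href=\"a/b\"";
        System.out.println(matched.replace("/", ""));    // also loses the slash inside the value
        System.out.println(stripLeadingSlash(matched));  // keeps the value intact
    }
}
```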

How can I ignore all other characters except my own created tokens in Lexer Rules?

I have created some lexer rules to match some tokens in my input file. I want to eliminate all characters in a line (i.e. up to the following line break) other than the tokens I created earlier.
So, I have created this rule in lexer.
Newline : '\r'? '\n' | '\r';
...
My Own meaningful tokens
...
Others: (~Newline)* Newline;
But this somehow confuses the lexer about input that should match my own meaningful tokens. Is there a tweak I can make?

ANTLR zero to multiple occurrence of the same parser rule

I'm trying to parse javadoc style comments. How can I indicate that the same parser rule could potentially be triggered zero or more times?
doc_comment : '/**' (param_declaration)* '*/' ;
param_declaration : OUTERWS '#param' OUTERWS ID OUTERWS;
ID : ('a'..'z')+ ;
OUTERWS : ('\n' | '\r' | ' ' |'\t')*;
Enclosing the param_declaration rule in ()* doesn't seem to work since it's not a token.
I would expect that:
/**
#param one
#param two
*/
would work. But instead I get: extraneous input '#param' expecting {'/' which doesn't make sense to me if (param_declaration) matches zero or more instances. It seems like adding ()* to param_declaration does nothing. Either way:
/**
#param one
*/
works fine, with or without ()*.
The answer to your question is, to match rule foo zero or more times, use (foo)* or simply foo*.
If this is not producing a usable result, then the problem lies somewhere in how you have structured your lexer and/or parser, and to solve it you would need to ask a more specific question and include your grammar along with specific inputs and outputs that are not what you hoped, plus a description of the desired output.
Edit: Your error with two parameters is occurring because the param_declaration rule begins and ends with a required OUTERWS token. This means two OUTERWS tokens must appear in a row for two parameters to be parsed. This is impossible, because any two sequences of white space characters in the input file would match one long OUTERWS token, and that longer token will always be used instead of two shorter tokens.
Also note that your OUTERWS token is written in such a way that it could match 0 characters. If your input sequence contained a digit, say 0, then the longest token appearing before 0 would be a zero-length OUTERWS token. Since the input would not advance as a result of matching 0 characters, this means an input containing a digit should produce an infinitely long stream of empty OUTERWS tokens. The related warning you see when generating code for this grammar is not to be ignored.
Edit 2: Your input can match zero parameters if the comment appears in the form /***/. However, if your comment appears in the form /** */, you will have an OUTERWS token between /** and */, which is not allowed by your parser rules when there is no param_declaration.
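The two OUTERWS points above (one long token instead of two, and the zero-length match) can be illustrated with a rough sketch that uses a Java regular expression as a stand-in for the lexer rule. It is an analogy under that assumption, not ANTLR's actual matching code, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OuterWsDemo {
    // Regex analogue of  OUTERWS : ('\n' | '\r' | ' ' |'\t')*;
    static final Pattern OUTERWS = Pattern.compile("[\\n\\r \\t]*");

    /** Length of the longest OUTERWS match starting exactly at offset. */
    static int matchLength(String input, int offset) {
        Matcher m = OUTERWS.matcher(input);
        m.region(offset, input.length());
        return m.lookingAt() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // The whole whitespace run after "one" is consumed as a single match,
        // so there is never a second OUTERWS directly behind it.
        System.out.println(matchLength("one \n\t #param", 3));
        // At a digit the rule still "matches", but with zero characters.
        System.out.println(matchLength("0", 0));
    }
}
```

Requiring at least one character (a + instead of the *) would avoid the empty match, and letting only one side of param_declaration own the whitespace would avoid needing two whitespace tokens in a row.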

ANTLR error when not enough, or too many, newlines

ANTLR gives me the following error when my input file has either no newline at the EOF, or more than one.
line 0:-1 mismatched input '' expecting NEWLINE
How would I go about taking into account the possibility of having multiple or no newlines at the end of the input file? Preferably I'd like to account for this in the grammar.
The rule:
parse
: (Token LineBreak)+ EOF
;
only parses a stream of tokens, separated by exactly one line break, ending with exactly one line break.
While the rule:
parse
: Token (LineBreak+ Token)* LineBreak* EOF
;
parses a stream of tokens separated by one or more line breaks, ending with zero, one or more line breaks.
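The difference between the two rules can be checked quickly with a sketch that writes each token type as one letter and uses Java regular expressions in place of the parser rules. This is an analogy only: 'T' stands for Token, 'N' for LineBreak, the end of the string for EOF, and all names are made up for the example.

```java
import java.util.regex.Pattern;

final class LineBreakRules {
    // Analogue of  (Token LineBreak)+ EOF
    static final Pattern STRICT = Pattern.compile("(TN)+");
    // Analogue of  Token (LineBreak+ Token)* LineBreak* EOF
    static final Pattern LENIENT = Pattern.compile("T(N+T)*N*");

    static boolean strict(String types) {
        return STRICT.matcher(types).matches();
    }

    static boolean lenient(String types) {
        return LENIENT.matcher(types).matches();
    }

    public static void main(String[] args) {
        // One trailing break, no trailing break, several trailing breaks.
        for (String s : new String[] {"TNTN", "TNT", "TNTNNN"}) {
            System.out.println(s + " strict=" + strict(s) + " lenient=" + lenient(s));
        }
    }
}
```

Only the first input is accepted by the strict form, while the lenient form accepts all three.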
But do you really need to make the line breaks visible in the parser? Couldn't you put them on a "hidden channel" instead?
If this doesn't answer your question, you'll have to post your grammar (you can edit your original question for that).
HTH