antlr4 lexer token invalid

I am trying to parse the sentence below, but the lexer generates an incorrect token.
Input
column(propName="~~" abc="hi")
Lexer
DOUBLEQUOTED: '"' (E_TILDE | ~ ('"') | E_DOUBLE_QUOTE)* '"';
fragment E_TILDE : '~~' ;
fragment E_DOUBLE_QUOTE : '~"' ;
The lexer produces the token
'"~~" abc="' as a double-quoted string.
Expected output:
'"~~"' as a double-quoted string
'"hi"' as a double-quoted string
Any help is appreciated.

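The ANTLR lexer matches the longest sequence of characters it can when determining the next token. Since "~~" abc=" is a valid DOUBLEQUOTED token and is longer than just "~~", it will be matched. It is valid because ~('"') also matches a single ~, so the lexer can read the first ~ as an ordinary character and the following ~" as E_DOUBLE_QUOTE, which lets the string run on to the next quote. Excluding ~ from the negated set, for example with ~('"' | '~'), should remove that ambiguity.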
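The longest-match behaviour can be reproduced outside ANTLR with a small hand-written simulation. The sketch below is not ANTLR code: `MaximalMunchDemo` and `longestDoubleQuoted` are hypothetical names, and the method simply tracks every position the loop `(E_TILDE | ~('"') | E_DOUBLE_QUOTE)*` can reach and returns the longest prefix that ends in a closing quote. The `excludeTilde` flag models an assumed variant of the rule in which the negated set is written as `~('"' | '~')`; that variant is my suggestion, not part of the original answer.

```java
class MaximalMunchDemo {
    // Returns the longest prefix of input that the DOUBLEQUOTED rule can match,
    // or null if there is none. With excludeTilde=false the "ordinary character"
    // alternative is ~('"') as in the question; with excludeTilde=true it is the
    // assumed variant ~('"' | '~').
    static String longestDoubleQuoted(String input, boolean excludeTilde) {
        if (input.isEmpty() || input.charAt(0) != '"') return null;
        int n = input.length();
        // reachable[i]: the loop over the three alternatives can stop at index i
        boolean[] reachable = new boolean[n + 1];
        reachable[1] = true;
        int lastClose = -1;
        for (int i = 1; i < n; i++) {
            if (!reachable[i]) continue;
            char c = input.charAt(i);
            if (c == '"') lastClose = i;                              // closing quote fits here
            if (input.startsWith("~~", i)) reachable[i + 2] = true;   // E_TILDE
            if (input.startsWith("~\"", i)) reachable[i + 2] = true;  // E_DOUBLE_QUOTE
            boolean ordinary = c != '"' && !(excludeTilde && c == '~');
            if (ordinary) reachable[i + 1] = true;                    // negated set
        }
        return lastClose < 0 ? null : input.substring(0, lastClose + 1);
    }

    public static void main(String[] args) {
        String input = "\"~~\" abc=\"hi\")";
        System.out.println(longestDoubleQuoted(input, false)); // rule as in the question
        System.out.println(longestDoubleQuoted(input, true));  // assumed variant
    }
}
```

With the rule as written in the question, the simulation yields the over-long token "~~" abc=" because the first ~ can be consumed on its own. With ~ excluded from the negated set, the only way through ~~ is E_TILDE, and the token ends at the first closing quote.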

Related

Single quoted literal value fails Antlr lexer

I have a lexer rule that defines single-quoted literal string as
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\'' ;
It fails one particular case:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\''
The problem is really with the last two single quotes. If I added a space in between, it worked. Or I could use two single quotes to end and it worked too, e.g.
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z'''
I am not sure whether this has something to do with a non-greedy operator causing ('\'' '\'') to be matched first. If so, I don't see how the last version could have worked.
In any event, could someone help, please?
UPDATE - I am not able to reproduce it outside of the full grammar. This may be a red herring.
UPDATE - I missed some important context, so I posted another question: "Antlr4: single quote rule fails when there are escape chars plus carriage return, new line".
I can't reproduce that. Given the following grammar:
lexer grammar Test;
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\'';
OTHER : . ;
which can be tested as follows:
String source = "A'yyyy-MM-dd\\\\'T\\\\'HH:mm:ss\\\\'Z\\\\''B";
Test lexer = new Test(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    System.out.printf("%-15s %s\n", Test.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
will print:
OTHER           A
L_S_STRING      'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\''
OTHER           B
EOF             <EOF>

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
    '\''!
    (
        ~('\'' | '\\' | '\r' | '\n')
    )*
    '\''!
    ;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol ('!' came as a complete surprise to me (AC0050)), but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
    CommonToken ct = (CommonToken) token;
    String text = ct.getText();
    text = .... unquote it ....
    ct.setText(text);
}
The reason for using a parser rule is that attribute references are not (currently) supported in the lexer. It could still be done in the lexer rule; it would just take a bit more effort to get at the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
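The "unquote it" step in Strings.unquote above is left open by the answer. As a minimal sketch of what it could look like, the hypothetical helper below (the class name `UnquoteSketch` and the String-based signature are my assumptions) removes one leading and one trailing single quote. Because the STRING rule shown above excludes backslashes and quotes from the string body, no escape processing should be needed for text produced by that rule.

```java
class UnquoteSketch {
    // Hypothetical helper: strips one leading and one trailing single quote
    // when both are present; any other text is returned unchanged.
    static String unquote(String text) {
        if (text != null && text.length() >= 2
                && text.charAt(0) == '\''
                && text.charAt(text.length() - 1) == '\'') {
            return text.substring(1, text.length() - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(unquote("'hello world'")); // quotes removed
        System.out.println(unquote("no quotes"));     // left as-is
    }
}
```

Inside Strings.unquote, the result of such a helper would be what gets passed to ct.setText(text).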
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

ANTLR String LEXER token

I am trying to write a STRING lexer token. My problem is that, besides \n, \r and \t,
any escaped character stands for itself (for example, \c is c). That being said, I have the following examples:
"This is a valid \
string."
"This is
not valid."
"This is al\so a valid string"
After searching the internet to no avail, I concluded that I must use an @after clause. Unfortunately, I don't understand how to do this. If I am not mistaken, I can't use a syntactic predicate because this is a lexer rule, not a parser rule.
How about something like this:
STRING
: '"' ( '\\' ('\\'|'\t'|'\r\n'|'\r'|'\n'|'"') | ~('\\'|'\t'|'\r'|'\n'|'"') )* '"'
;
where '\\' ('\\'|'\t'|'\r\n'|'\r'|'\n'|'"') is an escaped backslash, tab, line break or quote, and ~('\\'|'\t'|'\r'|'\n'|'"') matches any character other than a backslash, tab, line break or quote.
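To try the three sample strings from the question without generating a lexer, the same idea (a backslash followed by something, or any character that is not a backslash, line break or quote) can be written as a Java regular expression. This is only a sketch under my reading of the question: unlike the rule above, it lets a backslash escape any character, including a line break, and it does not exclude tabs. The names `StringLiteralCheck` and `isValid` are hypothetical.

```java
import java.util.regex.Pattern;

class StringLiteralCheck {
    // '"' ( backslash followed by CRLF or any single char
    //     | any char except quote, backslash, CR, LF )* '"'
    static final Pattern STRING = Pattern.compile(
            "\"(?:\\\\(?:\\r\\n|[\\s\\S])|[^\"\\\\\\r\\n])*\"");

    // True if the whole text is one string literal under the assumed rules.
    static boolean isValid(String text) {
        return STRING.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\"This is a valid \\\nstring.\""));        // escaped line break
        System.out.println(isValid("\"This is\nnot valid.\""));                 // raw line break
        System.out.println(isValid("\"This is al\\so a valid string\""));      // \s stands for s
    }
}
```

The first and third samples are accepted and the second is rejected, which matches the behaviour the question asks for. Translating the escape alternative back to ANTLR would mean allowing any character after '\\' rather than the fixed list.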

ANTLR: mismatched input

I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.
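The tie-breaking described above can be imitated with a few lines of plain Java. This is a rough sketch, not ANTLR's implementation: `RuleOrderDemo` and `firstTokenType` are hypothetical names, each rule is approximated by a regular expression, and `lookingAt` returns the regex engine's preferred match, which coincides with the longest match for simple patterns like these.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderDemo {
    // rules[i] = { tokenName, regex }, listed in grammar order.
    // Returns the token type assigned to the start of input, or null.
    static String firstTokenType(String[][] rules, String input) {
        String bestType = null;
        int bestLength = 0;
        for (String[] rule : rules) {
            Matcher m = Pattern.compile(rule[1]).matcher(input);
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule listed earlier keeps the token.
            if (m.lookingAt() && m.end() > bestLength) {
                bestLength = m.end();
                bestType = rule[0];
            }
        }
        return bestType;
    }

    public static void main(String[] args) {
        String letters = "[a-zA-Z]+";
        String[][] asWritten = { {"WORD", letters}, {"TEXT", letters} };
        String[][] swapped = { {"TEXT", letters}, {"WORD", letters} };
        System.out.println(firstTokenType(asWritten, "aa")); // WORD
        System.out.println(firstTokenType(swapped, "aa"));   // TEXT
    }
}
```

In the grammar from the question, this is why the parser rule text: TEXT never sees a TEXT token: the input aa is always typed as WORD.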

ANTLR 4 lexer subrule order

Does the order of choices among lexer subrules matter in ANTLR4? For example, is there any difference between the following rules?
STRING: '"' ('\\"' | .)*? '"';
STRING: '"' (. | '\\"')*? '"';
Yes, the order matters. The first rule can match the whole of an input such as "abc\"def". The second matches only part of it, namely "abc\", and then reports an error for the remaining characters.
An ANTLR-generated lexer tries the alternatives of the subrule in the order in which they are defined. I have tested both rules with ANTLR 4.
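Java's regular expression engine is not ANTLR, but it also tries alternatives in order inside a reluctant (non-greedy) loop, so it can serve as a quick analogy for the two rules. In the sketch below, `AlternativeOrderDemo`, `ESCAPE_FIRST`, `DOT_FIRST` and `firstMatch` are hypothetical names, and the two patterns are direct transcriptions of the two STRING rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativeOrderDemo {
    // Regex counterpart of: '"' ('\\"' | .)*? '"'
    static final String ESCAPE_FIRST = "\"(?:\\\\\"|.)*?\"";
    // Regex counterpart of: '"' (. | '\\"')*? '"'
    static final String DOT_FIRST = "\"(?:.|\\\\\")*?\"";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"abc\\\"def\""; // the characters "abc\"def"
        System.out.println(firstMatch(ESCAPE_FIRST, input)); // "abc\"def"
        System.out.println(firstMatch(DOT_FIRST, input));    // "abc\"
    }
}
```

When the escape alternative comes first, the backslash and the quote are consumed together and the loop continues to the real closing quote. When the wildcard comes first, it consumes the backslash alone and the very next quote ends the match.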