ANTLR: Match unescaped characters?

I've got a rule like,
charGroup
: '[' .+ ']';
But I'm guessing that'll match something like [abc\]. Assuming I want it to match only unescaped ]s, how do I do that? In a regular expression I'd use a negative look-behind.
Edit: I'd also like it to be ungreedy/lazy if possible, so as to match only [a] in [a][b].

You probably wanted to do something like:
charGroup
: '[' ('\\' . | ~('\\' | ']'))+ ']'
;
where ~('\\' | ']') matches a single character other than \ and ]. Note that you can only negate single characters! There's no such thing as ~('ab'). Another common mistake is to assume that ~ inside a parser rule negates a character; it negates a token instead. An example might be in order:
foo : ~(A | D);
A : 'a';
B : 'b';
C : 'c';
D : ~A;
Now parser rule foo matches either token B or token C (so only the characters 'b' and 'c') while lexer rule D matches any character other than 'a'.
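For readers more used to regular expressions, the charGroup rule above can be approximated in Python as a rough sketch. This is only an analogue for experimenting with inputs, not ANTLR output; the names CHAR_GROUP and first_char_group are made up for this illustration.

```python
import re

# Regex analogue of: '[' ('\\' . | ~('\\' | ']'))+ ']'
#   \\.      -> a backslash followed by any one character (an escape pair)
#   [^\\\]]  -> any single character other than backslash and ]
CHAR_GROUP = re.compile(r"\[(?:\\.|[^\\\]])+\]", re.DOTALL)

def first_char_group(text):
    """Return the first bracket group in text, or None if there is none."""
    match = CHAR_GROUP.search(text)
    return match.group(0) if match else None

print(first_char_group(r"[a][b]"))    # [a]
print(first_char_group(r"[abc\]]"))   # [abc\]]
print(first_char_group(r"[abc\]"))    # None
```

Because the character class excludes an unescaped ], the match stops at the first closing bracket without needing a lazy quantifier or a look-behind.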

I'd use a negative look-behind
Isn't that unnecessarily complex? How about:
charGroup
: '[' ('\\]' | .)+ ']';

Related

Antlr4 parser not parsing reassignment statement correctly

I've been creating a grammar parser using Antlr4 and wanted to add variable reassignment (without having to declare a new variable).
I've tried changing the reassignment statement to be an expression, but that didn't change anything.
Here's a shortened version of my grammar:
grammar MyLanguage;
program: statement* EOF;
statement
: expression EOC
| variable EOC
| IDENTIFIER ASSIGNMENT expression EOC
;
variable: type IDENTIFIER (ASSIGNMENT expression)?;
expression
: STRING
| INTEGER
| IDENTIFIER
| expression MATH expression
| ('+' | '-') expression
;
MATH: '+' | '-' | '*' | '/' | '%' | '//' | '**';
ASSIGNMENT: MATH? '=';
EOC: ';';
WHITESPACE: [ \t\r\n]+ -> skip;
STRING: '"' (~[\u0000-\u0008\u0010-\u001F"] | [\t])* '"' | '\'' (~[\u0000-\u0008\u0010-\u001F'] | [\t])* '\'';
INTEGER: '0' | ('+' | '-')? [1-9][0-9]*;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
type: 'str';
If anything else might be of relevance, please ask.
So I tried to parse
str test = "empty";
test = "not empty";
which worked, but when I tried this (part of the Fibonacci function)
temp = n1;
n1 = n1 + n2;
n2 = temp;
it got an error and parsed it as
temp = n1; //statement
n1 = n1 //statement - <missing ';'>
+n2; //statement
n2 = temp; //statement
Your problem has nothing to do with assignment statements. Additions simply don't work at all - whether they're part of an assignment or not. So the simplest input to get the error would be x+y;. If you print the token stream for that input (using grun with the -tokens option for example), you'll get the following output:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
line 1:1 no viable alternative at input 'x+'
Now compare this to x*y;, which works fine:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='*',<MATH>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
The important difference here is that * is recognized as a MATH token, but + isn't. It's recognized as a '+' token instead.
This happens because you introduced a separate '+' (and '-') token type in the alternative | ('+' | '-') expression. So whenever the lexer sees a + it produces a '+' token, not a MATH token, because string literals in parser rules take precedence over named lexer rules.
If you turn MATH into a parser rule math (or maybe mathOperator) instead, all of the operators will be literals and the problem will go away. That said, you probably don't want a single rule for all math operators because that doesn't give you the precedence you want, but that's a different issue.
PS: Something like x+1 still won't work because it will see +1 as a single INTEGER token. You can fix that by removing the leading + and - from the INTEGER rule (that way x = -2 would be parsed as a unary minus applied to the integer 2 instead of just the integer -2, but that's not a problem).
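The token selection described above can be reproduced with a small toy lexer. This is a sketch in plain Python, not ANTLR itself: it assumes the usual rule of "longest match wins, ties go to the rule listed first" and places the '+' and '-' literals from the parser rule ahead of the named rules. The RULES table and tokenize function are invented for this illustration, and the 'str' keyword is left out.

```python
import re

# Toy model of token selection, not ANTLR itself. Rule order matters:
# the literals '+' and '-' from the parser rule come first, mirroring how
# implicit literal tokens rank ahead of named lexer rules.
RULES = [
    ("'+'", r"\+"),
    ("'-'", r"-"),
    ("MATH", r"\*\*|//|[-+*/%]"),
    ("EOC", r";"),
    ("INTEGER", r"0|[+-]?[1-9][0-9]*"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
]

def tokenize(text):
    """Longest match wins; on a tie the earliest rule in RULES wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # WHITESPACE -> skip
            pos += 1
            continue
        best_type, best_text = None, ""
        for token_type, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and len(m.group(0)) > len(best_text):
                best_type, best_text = token_type, m.group(0)
        if best_type is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_type, best_text))
        pos += len(best_text)
    return tokens

print(tokenize("x+y;"))   # '+' comes out as the literal token, not MATH
print(tokenize("x*y;"))   # '*' comes out as MATH
print(tokenize("x+1;"))   # '+1' comes out as a single INTEGER
```

The third call shows the problem from the PS: with the optional sign in INTEGER, the longer match +1 beats both the '+' literal and MATH.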

ANTLR4 - How do I get the token TYPE as the token text in ANTLR?

Say I have a grammar that has tokens like this:
AND : 'AND' | 'and' | '&&' | '&';
OR : 'OR' | 'or' | '||' | '|' ;
NOT : 'NOT' | 'not' | '~' | '!';
When I visualize the ParseTree using TreeViewer or print the tree using tree.toStringTree(), each node's text is the same as what was matched.
So if I parse "A and B or C", the two binary operators will be "and" / "or".
If I parse "A && B || C", they'll be "&&" / "||".
What I would LIKE is for them to always be "AND" / "OR" / "NOT", regardless of what literal symbol was matched. Is this possible?
This is what the vocabulary is for. Use yourLexer.getVocabulary() or yourParser.getVocabulary() and then vocabulary.getSymbolicName(tokenType) for the text representation of the token type. If that returns null, try vocabulary.getLiteralName(tokenType) as a second step, which returns the literal text used to define the token.
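The lookup order described above (symbolic name first, literal name as a fallback) can be sketched in plain Python. Nothing here calls the ANTLR runtime: the two tables, the token type numbers, and the function names are hypothetical stand-ins for what a generated vocabulary would provide.

```python
# Hypothetical name tables, indexed by token type, standing in for what a
# generated lexer's vocabulary would hold. None means "no such name".
SYMBOLIC_NAMES = [None, "AND", "OR", "NOT", "ID"]
LITERAL_NAMES = [None, None, None, None, None]

def type_name(token_type):
    """Symbolic name first, literal name second, numeric type as last resort."""
    for table in (SYMBOLIC_NAMES, LITERAL_NAMES):
        if 0 <= token_type < len(table) and table[token_type] is not None:
            return table[token_type]
    return str(token_type)

def canonical_text(tokens, operator_types=frozenset({1, 2, 3})):
    """Replace the matched text of operator tokens with their type name."""
    return " ".join(
        type_name(t) if t in operator_types else text for t, text in tokens
    )

# (token type, matched text) pairs as a lexer might produce for "A && B || C"
tokens = [(4, "A"), (1, "&&"), (4, "B"), (2, "||"), (4, "C")]
print(canonical_text(tokens))   # A AND B OR C
```

The same substitution would typically be done wherever the tree is printed or visited, keyed on the token type rather than on the matched text.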

ANTLR parse strings (keep whitespaces) and parse normal identifiers

I am trying to use ANTLR4 to parse source files. One requirement is that a string literal may contain all kinds of characters, including white space, while normal identifiers contain only English letters and digits (white space around them is thrown away).
I use the following antlr grammar rules (the minimal example), but it doesn't work as expected.
grammar parseString;
rules
: stringRule+
;
stringRule
: formatString
| idString
;
formatString
: STRING_DOUBLEQUOTE STRING STRING_DOUBLEQUOTE
;
idString
: (NONTERM | TERM)
;
// LEXER
STRING_DOUBLEQUOTE
: '"' ;
DIGITS
: DIGIT+
;
TERM
: UPPERCHAR CHAR+
;
NONTERM
: LOWERCHAR CHAR+
;
fragment
CHAR
: LOWERCHAR
| UPPERCHAR
| DIGIT
| '-'
| '_'
;
fragment
DIGIT
: [0-9]
;
fragment
LOWERCHAR
: [a-z]
;
fragment
UPPERCHAR
: [A-Z]
;
WS
: (' ' | '\t' | '\r' | '\n')+ -> skip
; // skip spaces, tabs, newlines
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
STRING
: ~('"')*
;
For the test cases that I use,
Test
HelloWorld
"$this is a string"
"*this is another string!"
I got the error line 1:0 extraneous input 'Test\nHelloWorld\n' expecting {'"', TERM, NONTERM}. The last two lines are correctly parsed as 'formatString'. The first two lines, however, are not matched as 'idString', apparently because the newline characters ('\n') have not been thrown away. I am wondering what I did wrong.
Your STRING rule will match anything but quotes so will scarf just about anything. That is way too loose. You will need a much tighter definition of exactly what distinguishes a STRING from the others I think. Once it's in ~'"'* it will scarf until '"'.
Yes, there is a problem in this grammar: the token STRING matches 'Test\nHelloWorld\n'. The lexer puts everything into this one token, but no parser rule accepts a bare STRING token at that point.
Think about changing the token STRING.
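To see why the loose STRING rule wins, here is a toy longest-match lexer in plain Python (not ANTLR). It compares the rules from the question with one possible tightening, where the quotes become part of the STRING token itself. That tightened rule is an assumption about what would suit this grammar, not something the answers above spell out; the rule tables and the tokenize function are invented for the sketch.

```python
import re

def tokenize(rules, text):
    """Toy longest-match lexer: ties go to the earlier rule; WS is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        best_type, best_text = None, ""
        for token_type, pattern in rules:
            m = re.match(pattern, text[pos:])
            if m and len(m.group(0)) > len(best_text):
                best_type, best_text = token_type, m.group(0)
        if best_type is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_type != "WS":
            tokens.append((best_type, best_text))
        pos += len(best_text)
    return tokens

IDENT_RULES = [
    ("TERM", r"[A-Z][A-Za-z0-9_-]+"),
    ("NONTERM", r"[a-z][A-Za-z0-9_-]+"),
    ("WS", r"[ \t\r\n]+"),
]
# As in the question: the quote is its own token, STRING is "anything but a quote".
LOOSE = [("STRING_DOUBLEQUOTE", r'"')] + IDENT_RULES + [("STRING", r'[^"]*')]
# Assumed alternative: the quotes are part of the STRING token itself.
TIGHT = IDENT_RULES + [("STRING", r'"[^"]*"')]

source = 'Test\nHelloWorld\n"$this is a string"\n'
print(tokenize(LOOSE, source)[0])   # ('STRING', 'Test\nHelloWorld\n')
print(tokenize(TIGHT, source))
```

With the loose rules, STRING matches 16 characters at offset 0 while TERM matches only 4, so the identifiers and newlines are swallowed before WS or TERM get a chance. With the quotes inside the token, STRING can only start at a quote and the identifiers are lexed as expected.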

Negating inside lexer- and parser rules

How can the negation meta-character, ~, be used in ANTLR's lexer- and parser rules?
Negating can occur inside lexer and parser rules.
Inside lexer rules you can negate characters, and inside parser rules you can negate tokens (lexer rules). But both lexer- and parser rules can only negate either single characters, or single tokens, respectively.
A couple of examples:
lexer rules
To match one or more characters except lowercase ascii letters, you can do:
NO_LOWERCASE : ~('a'..'z')+ ;
(the negation-meta-char, ~, has a higher precedence than the +, so the rule above equals (~('a'..'z'))+)
Note that 'a'..'z' matches a single character (and can therefore be negated), but the following rule is invalid:
ANY_EXCEPT_AB : ~('ab') ;
Because 'ab' (obviously) matches 2 characters, it cannot be negated. To match a token that consists of 2 characters, but not 'ab', you'd have to do the following:
ANY_EXCEPT_AB
: 'a' ~'b' // any two chars starting with 'a' followed by any other than 'b'
| ~'a' . // other than 'a' followed by any char
;
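As a quick sanity check of the two lexer rules above, they can be mirrored with Python regular expressions. This is only an analogue: negated character classes play the role of ~, and the names are reused from the grammar for readability.

```python
import re
from itertools import product

# ~('a'..'z')+  corresponds to a negated character class with a repeat.
NO_LOWERCASE = re.compile(r"[^a-z]+")

# 'a' ~'b' | ~'a' .  : two characters, but not the pair "ab".
ANY_EXCEPT_AB = re.compile(r"a[^b]|[^a].", re.DOTALL)

print(NO_LOWERCASE.fullmatch("ABC 123") is not None)   # True
print(NO_LOWERCASE.fullmatch("ABc") is not None)       # False

# Every two-character string over a small alphabet is accepted except "ab".
rejected = [
    "".join(pair)
    for pair in product("abc", repeat=2)
    if ANY_EXCEPT_AB.fullmatch("".join(pair)) is None
]
print(rejected)   # ['ab']
```

The exhaustive loop confirms that splitting the negation into the two alternatives excludes exactly the one unwanted pair.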
parser rules
Inside parser rules, ~ negates a certain token, or more than one token. For example, you have the following tokens defined:
A : 'A';
B : 'B';
C : 'C';
D : 'D';
E : 'E';
If you now want to match any token except the A, you do:
p : ~A ;
And if you want to match any token except B and D, you can do:
p : ~(B | D) ;
However, if you want to match any two tokens other than A followed by B, you cannot do:
p : ~(A B) ;
Just as with lexer rules, you cannot negate more than a single token. To accomplish the above, you need to do:
p
: A ~B
| ~A .
;
Note that the . (DOT) char in a parser rule does not match any character as it does inside lexer rules. Inside parser rules, it matches any token (A, B, C, D or E, in this case).
Note that you cannot negate parser rules. The following is illegal:
p : ~a ;
a : A ;

How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?
Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR, I hope this hint is enough. I can try to write out the expressions for you but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+;
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;
Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only option 2 should be token special_ident and the other three should be ident. All tokens can be identified by syntax alone. Here is a quick grammar I was able to test in ANTLRWorks, and it appeared to work properly for me. I think Carl's might have one bug when trying to check AA, but getting you 99% there is a huge benefit, so this is only a minor modification to his quick thought.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;
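The four cases can be checked against the sample inputs from the question with a small Python sketch. The regular expressions mirror the special_ident and ident rules above; classify and the pattern names are made up for this illustration, and the last line probes the earlier three-case version with AA.

```python
import re

SPECIAL_IDENT = re.compile(r"[A-Z][0-9]")
# LETTER | LETTER DIGIT (LETTER|DIGIT)+ | LETTER LETTER (LETTER|DIGIT)*
IDENT = re.compile(r"[A-Z]|[A-Z][0-9][A-Z0-9]+|[A-Z][A-Z][A-Z0-9]*")
# The earlier three-case version: LETTER | LETTER (LETTER|DIGIT) (LETTER|DIGIT)+
EARLIER_IDENT = re.compile(r"[A-Z]|[A-Z][A-Z0-9][A-Z0-9]+")

def classify(text):
    """Return which rule accepts the whole text, or None if neither does."""
    if SPECIAL_IDENT.fullmatch(text):
        return "special_ident"
    if IDENT.fullmatch(text):
        return "ident"
    return None

for sample in ["A", "A1", "A1A", "A12", "AA1", "AA"]:
    print(sample, ":", classify(sample))

print(EARLIER_IDENT.fullmatch("AA") is None)   # True: "AA" falls through
```

Since the alternatives never overlap, each input is accepted by exactly one rule, which is what removes the ambiguity warning.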