ANTLR4 - How do I get the token TYPE as the token text in ANTLR? - antlr

Say I have a grammar that has tokens like this:
AND : 'AND' | 'and' | '&&' | '&';
OR : 'OR' | 'or' | '||' | '|' ;
NOT : 'NOT' | 'not' | '~' | '!';
When I visualize the ParseTree using TreeViewer or print the tree using tree.toStringTree(), each node's text is the same as what was matched.
So if I parse "A and B or C", the two binary operators will be "and" / "or".
If I parse "A && B || C", they'll be "&&" / "||".
What I would LIKE is for them to always be "AND" / "OR" / "NOT", regardless of what literal symbol was matched. Is this possible?

This is what the vocabulary is for. Call yourLexer.getVocabulary() or yourParser.getVocabulary(), then vocabulary.getSymbolicName(tokenType) to get the text representation of the token type. If that returns no name (null or an empty string, depending on the runtime), fall back to vocabulary.getLiteralName(tokenType), which returns the text used to define the token.
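The fallback order described above can be sketched in a few lines. This is only an illustration: SYMBOLIC_NAMES, LITERAL_NAMES and display_name are hypothetical stand-ins for what the vocabulary returns, not part of the ANTLR runtime, and the token type numbers are assumed.

```python
# Hypothetical name tables indexed by token type; None means "no name".
# They stand in for the results of getSymbolicName / getLiteralName.
SYMBOLIC_NAMES = [None, "AND", "OR", "NOT", None]
LITERAL_NAMES = [None, None, None, None, "'('"]


def display_name(token_type: int) -> str:
    """Prefer the symbolic name, then the literal name, then the raw number."""
    symbolic = SYMBOLIC_NAMES[token_type]
    if symbolic:
        return symbolic
    literal = LITERAL_NAMES[token_type]
    if literal:
        return literal
    return str(token_type)


# (matched text, assumed token type) pairs for the operators in the question.
matched = [("and", 1), ("&&", 1), ("||", 2), ("!", 3)]
labels = [display_name(token_type) for _, token_type in matched]
```

Whatever spelling was matched, the label depends only on the token type, which is the behaviour the question asks for.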

Related

Antlr4 parser not parsing reassignment statement correctly

I've been creating a grammar parser using ANTLR 4 and wanted to add variable reassignment (without having to declare a new variable).
I've tried changing the reassignment statement to be an expression, but that didn't change anything.
Here's a shortened version of my grammar:
grammar MyLanguage;
program: statement* EOF;
statement
: expression EOC
| variable EOC
| IDENTIFIER ASSIGNMENT expression EOC
;
variable: type IDENTIFIER (ASSIGNMENT expression)?;
expression
: STRING
| INTEGER
| IDENTIFIER
| expression MATH expression
| ('+' | '-') expression
;
MATH: '+' | '-' | '*' | '/' | '%' | '//' | '**';
ASSIGNMENT: MATH? '=';
EOC: ';';
WHITESPACE: [ \t\r\n]+ -> skip;
STRING: '"' (~[\u0000-\u0008\u0010-\u001F"] | [\t])* '"' | '\'' (~[\u0000-\u0008\u0010-\u001F'] | [\t])* '\'';
INTEGER: '0' | ('+' | '-')? [1-9][0-9]*;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
type: 'str';
If anything else might be of relevance, please ask.
I tried to parse
str test = "empty";
test = "not empty";
which worked, but when I tried this (part of the Fibonacci function)
temp = n1;
n1 = n1 + n2;
n2 = temp;
it got an error and parsed it as
temp = n1; //statement
n1 = n1 //statement - <missing ';'>
+n2; //statement
n2 = temp; //statement
Your problem has nothing to do with assignment statements. Additions simply don't work at all - whether they're part of an assignment or not. So the simplest input to get the error would be x+y;. If you print the token stream for that input (using grun with the -tokens option for example), you'll get the following output:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
line 1:1 no viable alternative at input 'x+'
Now compare this to x*y;, which works fine:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='*',<MATH>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
The important difference here is that * is recognized as a MATH token, but + isn't. It's recognized as a '+' token instead.
This happens because you introduced a separate '+' (and '-') token type in the alternative | ('+' | '-') expression. So whenever the lexer sees a + it produces a '+' token, not a MATH token, because string literals in parser rules take precedence over named lexer rules.
If you turn MATH into a parser rule math (or maybe mathOperator) instead, all of the operators will be literals and the problem will go away. That said, you probably don't want a single rule for all math operators because that doesn't give you the precedence you want, but that's a different issue.
PS: Something like x+1 still won't work because it will see +1 as a single INTEGER token. You can fix that by removing the leading + and - from the INTEGER rule (that way x = -2 would be parsed as a unary minus applied to the integer 2 instead of just the integer -2, but that's not a problem).
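The lexer behaviour described in this answer can be modelled with a small tokenizer. This is a rough sketch, not ANTLR itself: it assumes the lexer picks the longest match and, on a tie, the rule that comes first, with the implicit '+' and '-' literal tokens ranked ahead of the named rules. RULES and tokenize are hypothetical names, and only a subset of the grammar's tokens is covered.

```python
import re

# Rules in priority order. The literals '+' and '-' used in the parser rule
# become implicit tokens that rank ahead of the named lexer rules.
RULES = [
    ("'+'", r"\+"),
    ("'-'", r"-"),
    ("MATH", r"\*\*|//|[+\-*/%]"),
    ("ASSIGNMENT", r"(?:\*\*|//|[+\-*/%])?="),
    ("EOC", r";"),
    ("INTEGER", r"0|[+-]?[1-9][0-9]*"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (token name, matched text) pairs using longest match, first rule wins ties."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # models WHITESPACE -> skip
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


plus_tokens = tokenize("x+y;")   # '+' becomes the implicit '+' token, not MATH
star_tokens = tokenize("x*y;")   # '*' is only matched by MATH
int_tokens = tokenize("x+1;")    # "+1" is longer than "+", so INTEGER wins
```

Removing the two literal entries from the top of RULES makes + come out as MATH, and dropping the optional sign from the INTEGER pattern makes x+1 lex as three separate tokens.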

Is it possible to distinguish escape sequences in lexer in Antlr4?

I would like to match sequences like \' and \" as lexer elements
ESCAPESEQUECE :
'\\\"' |
'\\\''
;
while also distinguishing individual quotes when they are not escaped
SINGLEQUOTE:
'\''
;
DOUBLEQUOTE:
'\"'
;
The final goal is to recognize MySQL-like strings with the parser.
Is this possible, and is it the correct way?
Answer
Yes, it is totally possible by having separate tokens.
Example
grammar escp;
SINGLE: '\'';
DOUBLE: '\"';
ESCAPED : '\\"' | '\\\'';
char: SINGLE | DOUBLE;
escaped : ESCAPED;
program: (char | escaped)+;
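For the input string '\"'"\"""'\'\"\' the lexer produces the tokens SINGLE ESCAPED SINGLE DOUBLE ESCAPED DOUBLE DOUBLE SINGLE ESCAPED ESCAPED ESCAPED, and the parse tree has one char or escaped node per token under program.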

Antlr evaluation order

I defined the following expression rule using ANTLR 4 for a scripting language. I am trying to evaluate
x = y.z.aa * 6
The correct evaluation order is y.z first, then y.z.aa, then the multiplication by 6:
((y.z).aa) * 6
However, after parsing, aa * 6 is evaluated first, then z.(aa * 6), then y.(z.(aa * 6)), so it becomes
y.(z.(aa * 6))
Square-bracket access is evaluated correctly:
x = y[z][aa] * 6
Can anyone point out what I did wrong in the dot access rule?
expression
: primary #PrimaryExpression
| expression ('.' expression ) + #DotAccessExpression
| expression ('[' expression ']')+ #ArrayAccessExpression
| expression ('*'|'/') expression #MulExpression
| expression ('+'|'-') expression #AddExpression
;
primary
: '(' expression ')'
| literal
| ident
;
literal
: NUMBER
| STRING
| NULL
| TRUE
| FALSE
;
You used the following rule:
expression ('.' expression)+
This rule does not fit the syntax pattern for a binary expression, so it's actually getting treated as a suffix expression. In particular, the expression following a . character is no longer restricted within the precedence hierarchy. You may be additionally affected by issue #679, but the real resolution is the same either way. You need to replace this alternative with the following:
expression '.' expression #DotAccessExpression
The same goes for the ArrayAccessExpression, which should be written as follows:
expression '[' expression ']' #ArrayAccessExpression
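To make the two groupings concrete, here is a small hand-written precedence-climbing parser. It is a model of the grouping only, not ANTLR-generated code: PRECEDENCE, group and the restrict_dot_operand flag are hypothetical names, and array access is left out for brevity. With the flag off, the operand after '.' is parsed with no precedence restriction, which reproduces the grouping reported in the question.

```python
import re

# Higher number binds tighter; all operators are left-associative.
PRECEDENCE = {".": 3, "*": 2, "/": 2, "+": 1, "-": 1}


def group(text, restrict_dot_operand=True):
    """Return the expression as a fully parenthesised string."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[.*/+\-()]", text)
    pos = 0

    def primary():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expression(1)
            pos += 1  # skip the closing ")"
            return inner
        return tok

    def expression(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            if op == "." and not restrict_dot_operand:
                # Unrestricted operand: swallows everything to the right.
                right = expression(1)
            else:
                # Binary-operator pattern: operand must bind tighter than op.
                right = expression(PRECEDENCE[op] + 1)
            left = f"({left}{op}{right})"
        return left

    return expression(1)


intended = group("y.z.aa * 6")
reported = group("y.z.aa * 6", restrict_dot_operand=False)
```

Here intended is (((y.z).aa)*6) and reported is (y.(z.(aa*6))), matching the two orders described in the question.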

ANTLR fuzzy parsing

I'm building a kind of pre-processor in ANTLRv3, which of course only works with fuzzy parsing. At the moment I'm trying to parse include statements and replace them with the corresponding file content. I used this example:
ANTLR: removing clutter
Based on this example, I wrote the following code:
grammar preprocessor;
options {
language='Java';
}
@lexer::header {
package antlr_try_1;
}
@parser::header {
package antlr_try_1;
}
parse
: (t=. {System.out.print($t.text);})* EOF
;
INCLUDE_STAT
: 'include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+
{
setText("Include statement found!");
}
;
Any
: . // fall through rule, matches any character
;
This grammar only prints the text and replaces the include statements with the "Include statement found!" string. The example text to be parsed looks like this:
some random input
some random input
some random input
include some_file.txt
some random input
some random input
some random input
The output looks like this:
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 1:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 2:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 3:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 7:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 8:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 9:14 mismatched character 'p' expecting 'c'
some random ut
some random ut
some random ut
Include statement found!
some random ut
some random ut
some random ut
As far as I can judge, it is confused by the "in" in the word "input", because it "thinks" this is the start of an INCLUDE_STAT token.
Is there a better way to do it? I cannot use the filter option, since I need not only the include statements but also the rest of the code. I've tried several other things, but couldn't find a proper solution.
You are observing one of ANTLR 3's limitations. You could use either of these options to correct the immediate problem:
Upgrade to ANTLR 4, which does not have this limitation.
Include the following syntactic predicate at the beginning of the INCLUDE_STAT rule:
`('include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+) =>`

ANTLR: Match unescaped characters?

I've got a rule like,
charGroup
: '[' .+ ']';
But I'm guessing that'll match something like [abc\]. Assuming I want it to match only unescaped ]s, how do I do that? In a regular expression I'd use a negative look-behind.
Edit: I'd also like it to be ungreedy/lazy if possible. So as to match only [a] in [a][b].
You probably wanted to do something like:
charGroup
: '[' ('\\' . | ~('\\' | ']'))+ ']'
;
where ~('\\' | ']') matches a single character other than \ and ]. Note that you can only negate single characters! There's no such thing as ~('ab'). Another mistake often made is that negating inside parser rules does not negate a character, but a token instead. An example might be in order:
foo : ~(A | D);
A : 'a';
B : 'b';
C : 'c';
D : ~A;
Now parser rule foo matches either token B or token C (so only the characters 'b' and 'c') while lexer rule D matches any character other than 'a'.
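Since the question compares this with regular expressions, the charGroup rule above corresponds roughly to the following pattern. This is a sketch for comparison only; CHAR_GROUP is a hypothetical name. Because the body can never consume an unescaped ], no look-behind or lazy quantifier is needed.

```python
import re

# '[' then one or more of: backslash + any character, or any character
# other than backslash and ']', then the closing ']'.
CHAR_GROUP = re.compile(r"\[(?:\\.|[^\\\]])+\]")

adjacent = CHAR_GROUP.findall("[a][b]")       # two separate groups
escaped = CHAR_GROUP.match(r"[abc\]def]")     # the escaped ] stays inside the group
unterminated = CHAR_GROUP.search(r"[abc\]")   # no unescaped ] to close the group
```

The three cases cover the points raised in the question: adjacent groups are matched separately, an escaped ] does not end the group, and a group with only an escaped ] is not matched at all.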
"I'd use a negative look-behind"
Isn't that unnecessarily complex? How about:
charGroup
: '[' ('\\]' | .)+ ']';