How to solve ANTLR error "Attribute references not allowed in lexer actions"

After reading Chapter 10 of "The Definitive ANTLR 4 Reference", I tried to write a simple analyzer to get lexical attributes, but I got an error. How can I get the lexical attributes?
lexer grammar TestLexer;
SPACE: [ \t\r\n]+ -> skip;
LINE: INT DOT [a-z]+ {System.out.println($INT.text);};
INT: [0-9]+;
DOT: '.';
[INFO]
[INFO] --- antlr4-maven-plugin:4.9.2:antlr4 (antlr) # parser ---
[INFO] ANTLR 4: Processing source directory /Users/Poison/IdeaProjects/parser/src/main/antlr4
[INFO] Processing grammar: me.tianshuang.parser/TestLexer.g4
[ERROR] error(128): me.tianshuang.parser/TestLexer.g4:5:65: attribute references not allowed in lexer actions: $INT.text
[ERROR] /Users/Poison/IdeaProjects/parser/me.tianshuang.parser/TestLexer.g4 [5:65]: attribute references not allowed in lexer actions: $INT.text
ANTLR4 version: 4.9.2.
Reference:
antlr4/actions.md at master · antlr/antlr4 · GitHub
How to get the token attributes in Antlr-4 lexer rule's action · Issue #1946 · antlr/antlr4 · GitHub

How can I get the lexical attributes?
You can't: labels are simply not supported in lexer rules. You might say, "well, but I'm not using any labels!". But the following:
INT DOT [a-z]+ {System.out.println($INT.text);}
is just a shorthand notation for:
some_var_name=INT DOT [a-z]+ {System.out.println($some_var_name.text);}
where some_var_name is called a label.
If you remove the embedded code (the stuff between { and }), add a label before INT and then generate a lexer, you'll see the following warning being printed to stderr:
labels in lexer rules are not supported in ANTLR 4; actions cannot reference elements of lexical rules but you can use getText() to get the entire text matched for the rule
The last part means that you can grab the entire text of the lexer rule like this:
LINE
: INT DOT [a-z]+ {System.out.println(getText());}
;
But grabbing text from individual parts of a lexer rule is not possible.
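If you still need the individual pieces, one workaround is to take the whole text that getText() gives you and split it in ordinary code afterwards. Here is a minimal sketch in Python, assuming the text has exactly the shape the LINE rule matches (digits, a dot, lowercase letters). The name split_line_text is made up for illustration and is not part of ANTLR:

```python
import re

# Mirrors the shape of the LINE rule: INT DOT [a-z]+
LINE_SHAPE = re.compile(r"([0-9]+)\.([a-z]+)")


def split_line_text(text):
    """Split the full text of a LINE token into its INT part and its letters."""
    match = LINE_SHAPE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a LINE token text: {text!r}")
    return match.group(1), match.group(2)


print(split_line_text("123.abc"))  # ('123', 'abc')
```

The same idea works inside a Java action or in a listener/visitor: read the complete token text once, then take it apart with plain string handling.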

Try to keep the lexer separate from concerns such as producing output: that separation is one of the main differences between ANTLR and Bison/Flex. You can use, for example, the visitor/listener patterns from the other chapters of the book.

Related

What is the meaning of the ANTLR syntax in this grammar file?

I am trying to parse a file using ANTLR4 via Python. I am following a tutorial (https://faun.pub/introduction-to-antlr-python-af8a3c603d23); I am able to execute the code and get responses like the ones shown in the tutorial, but I'm failing to understand the logic of the grammar file.
grammar MyGrammer;
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
From my understanding, the constants (what I call them since they are all capitals) HELLO, BYE, INT, and WS define rules for what that set of text can contain. I think they relate to functions somehow, but I am not sure. So the HELLO function will be executed if the parser encounters something that says either 'hello' or 'hi'. The expr is what is confusing me.
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
When I run the command
antlr4 -Dlanguage=Python3 MyGrammer.g4 -visitor -o dist
it produces many files but the main one contains InfixExpr, NumberExpr, ParenExpr, HelloExpr, and ByeExpr. I can see that somehow the author knows that he is doing something with the constants HELLO, BYE, etc. Is there any documentation on the expr piece above and what do the keywords atom, left, right mean?
Any rule that begins with a capital letter (often we capitalize the entire rule name to make it obvious) is a Lexer rule.
Rules that begin with lower case letters are parser rules.
It’s VERY important to understand the difference and the flow of your input all the way through to a parse tree.
Your input stream of characters is first processed by the Lexer (using the Lexer rules) to produce a stream of tokens for the parser to act upon. It’s important to understand that the parser has NO impact on how the Lexer interprets the input.
When multiple Lexer rules could match your input, two “tie breakers” come into play.
1 - if a rule matches more characters in your input stream than the other rules, then that will be the rule used to produce a token.
2 - if there is a tie of multiple Lexer rules matching the same sequence of input characters, then the Lexer rule that appears first in your grammar will be used to generate a token.
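The two tie breakers can be sketched with a toy tokenizer in Python. This is only a simulation built on the standard re module, not how ANTLR is implemented. HELLO, BYE, INT and WS come from the grammar in the question; the ID rule is hypothetical and added only so that there is something to tie with:

```python
import re

# Lexer rules in grammar order: (name, pattern, skip?).
# ID is a hypothetical extra rule, not part of the original grammar.
RULES = [
    ("HELLO", r"hello|hi", False),
    ("BYE", r"bye|tata", False),
    ("INT", r"[0-9]+", False),
    ("ID", r"[a-z]+", False),
    ("WS", r"[ \t]+", True),
]


def tokenize(text, rules=RULES):
    """Return (rule name, text) pairs using longest match, then first rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len, best_skip = None, 0, False
        for name, pattern, skip in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer wins; on equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len, best_skip = name, len(m.group()), skip
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if not best_skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("hi high 42"))
# [('HELLO', 'hi'), ('ID', 'high'), ('INT', '42')]
```

For "hi", HELLO and ID match the same two characters, so the rule listed first (HELLO) wins. For "high", ID matches more characters than HELLO, so ID wins even though it is listed later.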
Your parser rules are evaluated using a recursive descent approach beginning with whatever startRule you specify. ANTLR uses several techniques to do its best to recognize your input. These include trying alternatives until one is found that matches, ignoring a token (and producing an error) if that allows the parser to continue, and inserting a missing token (and producing an error) if that allows the parser to continue.
re: the expr portion:
The rule says that there are 6 possible ways to recognize an expr
left=expr op=('*'|'/') right=expr (which will create an InfixExprContext node in the parse tree)
left=expr op=('+'|'-') right=expr (InfixExprContext (also))
atom=INT (NumberExprContext)
'(' expr ')' (ParenExprContext)
atom=HELLO (HelloExprContext)
atom=BYE (ByeExprContext)
The benefit of the labels (ex: # InfixExpr) is that, by creating a Context more specific than an ExprContext, you will have visitInfixExpr, visitNumberExpr, (etc.) methods that you can override in your Visitor instead of just a visitExpr method that contains all the alternatives. A similar thing will result for the enterXX and exitXX methods for your Listener classes.
In the left=expr op=('*'|'/') right=expr rule, the left, op and right names will generate accessors that make it easier to access those parts of your parse tree in your *Context class. Without them you'd just have an array of expr, for example, where expr[0] would be the first expr and expr[1] would be the second. It's probably a good idea to look at the generated code with and without the names and labels to see the difference. Both make it MUCH easier to write the logic in your visitors/listeners.
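To give a feel for what the labels buy you, here is a rough sketch in Python. The classes below are hand-written stand-ins, not the code ANTLR generates (the real context classes carry much more state), and the tree is built by hand instead of by a parser. As far as I know, the Python target exposes the labels as attributes such as ctx.left, ctx.op and ctx.right, which is what the sketch imitates:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the generated token and context classes.
@dataclass
class Token:
    text: str


@dataclass
class NumberExprContext:
    atom: Token      # from the label atom=INT


@dataclass
class InfixExprContext:
    left: object     # from the label left=expr
    op: Token        # from the label op=('*'|'/') or op=('+'|'-')
    right: object    # from the label right=expr


class EvalVisitor:
    """Mimics a visitor with one method per '# Name' alternative label."""

    def visit(self, ctx):
        # Dispatch on the context type: InfixExprContext -> visitInfixExpr.
        name = type(ctx).__name__[:-len("Context")]
        return getattr(self, "visit" + name)(ctx)

    def visitNumberExpr(self, ctx):
        return int(ctx.atom.text)

    def visitInfixExpr(self, ctx):
        left = self.visit(ctx.left)
        right = self.visit(ctx.right)
        if ctx.op.text == "*":
            return left * right
        if ctx.op.text == "/":
            return left / right
        if ctx.op.text == "+":
            return left + right
        return left - right


# Hand-built tree for "2 * 3 + 4"
tree = InfixExprContext(
    left=InfixExprContext(
        NumberExprContext(Token("2")), Token("*"), NumberExprContext(Token("3"))
    ),
    op=Token("+"),
    right=NumberExprContext(Token("4")),
)
print(EvalVisitor().visit(tree))  # 10
```

Each visit method only deals with one alternative and reads its parts by name, which is the point of combining the # labels with the left/op/right names.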

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If I try to parse the input "Hello", everything is OK, BUT if I modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.
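The effect of the rule order can be tried out with a small Python simulation. This is a sketch using the standard re module, not ANTLR itself, and token_type is a made-up helper name. The fragments are inlined into the patterns:

```python
import re

# The two word shapes from the grammar; ANY_WORD is either of them.
SMALL_WORD = r"[a-z][a-z]+"
CAPITAL_WORD = r"[A-Z][a-z]+"
ANY_WORD = f"(?:{SMALL_WORD}|{CAPITAL_WORD})"


def token_type(text, rules):
    """Name of the rule that wins: longest match, ties go to the earliest rule."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name


original = [
    ("ANY_WORD", ANY_WORD),
    ("SMALL_WORD", SMALL_WORD),
    ("CAPITAL_WORD", CAPITAL_WORD),
]
reordered = [
    ("CAPITAL_WORD", CAPITAL_WORD),
    ("ANY_WORD", ANY_WORD),
    ("SMALL_WORD", SMALL_WORD),
]

print(token_type("Hello", original))   # ANY_WORD
print(token_type("Hello", reordered))  # CAPITAL_WORD
```

With the original order "Hello" can never come out as CAPITAL_WORD, and with the reordered rules it can never come out as ANY_WORD, which is why moving the choice into a parser rule is the cleaner fix.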

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?
text
: tag => (tag)!
| outsidetag
;
The following is invalid in ANTLR 3:
text
: tag => (tag)!
| outsidetag
;
You probably meant the following:
text
: (tag)=> (tag)!
| outsidetag
;
where ( ... )=> is a syntactic predicate, which has no ANTLR 4 equivalent: simply remove it. As 280Z28 mentioned (and also explained in the previous link): syntactic predicates are not a feature that was removed from ANTLR 4. They were a workaround for a weakness in ANTLR 3's prediction algorithm that no longer applies to ANTLR 4.
The exclamation mark in v3 denotes the removal of a rule from the generated AST. Since ANTLR 4 does not produce ASTs, just remove the exclamation mark as well.
So, the v4 equivalent would look like this:
text
: tag
| outsidetag
;

Simple Island Grammar in ANTLR 4: Token Recognition Error

Apparently, I wasn't able to deduce the answers to my problem from existing posts on token recognition errors with Island Grammars here, so I hope someone can give me some advice on how to do this correctly.
Basically, I am trying to write a language that contains preprocessor directives. I narrowed my problem down to a very simple example. In my example language, the following should be valid syntax:
##some preprocessor text
PRINT some regular text
When parsing the code, I want to be able to identify the tokens "some preprocessor text", "PRINT" and "some regular text".
This is the parser grammar:
parser grammar myp;
root: (preprocessor | command)*;
preprocessor: PREPROC PREPROCLINE;
command: PRINT STRINGLINE;
This is the lexer grammar:
lexer grammar myl;
PREPROC: '##' -> pushMode(PREPROC_MODE);
PRINT: 'PRINT' -> pushMode(STRING_MODE);
WS: [ \t\r\n] -> skip;
mode PREPROC_MODE;
PREPROCLINE: (~[\r\n])*[\r\n]+ -> popMode;
mode STRING_MODE;
STRINGLINE: (~[\r\n])*[\r\n]+ -> popMode;
When I parse the above example code, I get the following error:
line 1:2 extraneous input 'some preprocessor text\r\n' expecting PREPROCLINE
line 2:5 token recognition error at: ' some regular text'
This error occurs regardless of whether the line "WS: [ \t\r\n] -> skip;" is included in the lexer grammar or not. I guess that if I introduced quotes to the tokens PREPROCLINE and STRINGLINE instead of the line endings, it would work (at least I successfully implemented regular strings in other languages). But in this particular language, I really want to have the strings without the quotes.
Any help on why this error is occurring or how to implement a preprocessor language with unquoted strings would be much appreciated.
Thanks
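Updated: First, the recognition errors occur because your parser needs to reference the lexer's tokens. Add an options block to your parser, with tokenVocab set to the name of your lexer grammar (myl in the question; MyLexer below is a placeholder):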
options {
tokenVocab=MyLexer;
}
Second, when you generate your lexer/parser, be aware that the warnings usually need to be considered and corrected before proceeding.
Finally, these are all working alternatives, once you add the options block.
XXXX: (~[\r\n])*[\r\n]+ -> popMode;
is a bit cleaner as:
XXXX: .*? '\r'? '\n' -> popMode;
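To not include the line endings, match only the non-newline characters (a non-greedy .*? followed by ~[\r\n] would stop after a single character):
XXXX: ~[\r\n]+ -> popMode;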
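To see how pushMode and popMode route the input through the question's lexer grammar, here is a rough simulation of the mode stack in Python. It is a sketch built on the standard re module, not ANTLR's lexer, and it is simplified: within a mode it takes the first rule that matches instead of the longest match, which is enough for these rules.

```python
import re

# Per mode: (token name, regex, action).
# The action is "skip", ("push", mode name), "pop" or None.
MODES = {
    "DEFAULT": [
        ("PREPROC", r"##", ("push", "PREPROC_MODE")),
        ("PRINT", r"PRINT", ("push", "STRING_MODE")),
        ("WS", r"[ \t\r\n]", "skip"),
    ],
    "PREPROC_MODE": [("PREPROCLINE", r"[^\r\n]*[\r\n]+", "pop")],
    "STRING_MODE": [("STRINGLINE", r"[^\r\n]*[\r\n]+", "pop")],
}


def tokenize(text):
    """Return (token name, text) pairs, switching rule sets via a mode stack."""
    stack = ["DEFAULT"]
    tokens, pos = [], 0
    while pos < len(text):
        # Only the rules of the mode on top of the stack are tried.
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                break
        else:
            raise ValueError(f"token recognition error at: {text[pos:]!r}")
        if action != "skip":
            tokens.append((name, m.group()))
        if action == "pop":
            stack.pop()
        elif isinstance(action, tuple):
            stack.append(action[1])
        pos = m.end()
    return tokens


source = "##some preprocessor text\nPRINT some regular text\n"
for token in tokenize(source):
    print(token)
```

The token names it prints (PREPROC, PREPROCLINE, PRINT, STRINGLINE) are the ones the parser grammar expects, which it can only recognize once tokenVocab points at the lexer grammar.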

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no \n left in the rest of the input. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
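Pre-processing the header separately could look roughly like the following Python sketch. It assumes, MIME-style, that the header block ends at the first blank line, which the question does not actually state, and split_header is a made-up name. The returned body is what you would then hand to the ANTLR-generated lexer and parser:

```python
def split_header(text):
    """Split text into (header dict, body) at the first blank line."""
    # Assumption: a blank line separates the header block from the body.
    head, _, body = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip()] = value.strip()
    return headers, body


headers, body = split_header("title: hello world\n\nrest of the file")
print(headers)  # {'title': 'hello world'}
print(body)     # rest of the file
```

This way the grammar never needs a REST_OF_LINE-style token that would also fire in the more complex part of the file.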
ANTLR parsing is usually done in 2 steps.
1. construct your AST
2. define your grammar
pseudo code (it's been a few years since I played with ANTLR) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR SPACE? WORD (SPACE WORD)+ ;
Hope that helps....