I wasn't able to work out the answer to my problem from the existing posts here on token recognition errors with island grammars, so I hope someone can give me advice on how to do this correctly.
Basically, I am trying to write a language that contains preprocessor directives. I narrowed my problem down to a very simple example. In my example language, the following should be valid syntax:
##some preprocessor text
PRINT some regular text
When parsing the code, I want to be able to identify the tokens "some preprocessor text", "PRINT" and "some regular text".
This is the parser grammar:
parser grammar myp;
root: (preprocessor | command)*;
preprocessor: PREPROC PREPROCLINE;
command: PRINT STRINGLINE;
This is the lexer grammar:
lexer grammar myl;
PREPROC: '##' -> pushMode(PREPROC_MODE);
PRINT: 'PRINT' -> pushMode(STRING_MODE);
WS: [ \t\r\n] -> skip;
mode PREPROC_MODE;
PREPROCLINE: (~[\r\n])*[\r\n]+ -> popMode;
mode STRING_MODE;
STRINGLINE: (~[\r\n])*[\r\n]+ -> popMode;
When I parse the above example code, I get the following error:
line 1:2 extraneous input 'some preprocessor text\r\n' expecting PREPROCLINE
line 2:5 token recognition error at: ' some regular text'
This error occurs regardless of whether the line "WS: [ \t\r\n] -> skip;" is included in the lexer grammar or not. I guess that if I delimited the tokens PREPROCLINE and STRINGLINE with quotes instead of line endings, it would work (at least I have successfully implemented regular strings in other languages). But in this particular language, I really want to have the strings without the quotes.
Any help on why this error is occurring or how to implement a preprocessor language with unquoted strings would be much appreciated.
Thanks
Updated: First, the recognition errors occur because your parser needs to reference the lexer tokens. Add the options block to your parser, using the name of your own lexer grammar (myl in the question):
options {
tokenVocab=MyLexer;
}
Second, when you generate your lexer/parser, be aware that the warnings usually need to be considered and corrected before proceeding.
Finally, these are all working alternatives, once you add the options block.
XXXX: (~[\r\n])*[\r\n]+ -> popMode;
is a bit cleaner as:
XXXX: .*? '\r'? '\n' -> popMode;
To not include the line endings, try
XXXX: ~[\r\n]+ -> popMode;
After reading Chapter 10 of "The Definitive ANTLR 4 Reference", I tried to write a simple analyzer to get lexical attributes, but I got an error. How can I get the lexical attributes?
lexer grammar TestLexer;
SPACE: [ \t\r\n]+ -> skip;
LINE: INT DOT [a-z]+ {System.out.println($INT.text);};
INT: [0-9]+;
DOT: '.';
[INFO]
[INFO] --- antlr4-maven-plugin:4.9.2:antlr4 (antlr) @ parser ---
[INFO] ANTLR 4: Processing source directory /Users/Poison/IdeaProjects/parser/src/main/antlr4
[INFO] Processing grammar: me.tianshuang.parser/TestLexer.g4
[ERROR] error(128): me.tianshuang.parser/TestLexer.g4:5:65: attribute references not allowed in lexer actions: $INT.text
[ERROR] /Users/Poison/IdeaProjects/parser/me.tianshuang.parser/TestLexer.g4 [5:65]: attribute references not allowed in lexer actions: $INT.text
ANTLR4 version: 4.9.2.
Reference:
antlr4/actions.md at master · antlr/antlr4 · GitHub
How to get the token attributes in Antlr-4 lexer rule's action · Issue #1946 · antlr/antlr4 · GitHub
How can I get the lexical attributes?
You can't: labels are simply not supported in lexer rules. You might say, "well, but I'm not using any labels!". But the following:
INT DOT [a-z]+ {System.out.println($INT.text);}
is just a shorthand notation for:
some_var_name=INT DOT [a-z]+ {System.out.println($some_var_name.text);}
where some_var_name is called a label.
If you remove the embedded code (the stuff between { and }), add a label before INT and then generate a lexer, you'll see the following warning being printed to stderr:
labels in lexer rules are not supported in ANTLR 4; actions cannot reference elements of lexical rules but you can use getText() to get the entire text matched for the rule
The last part means that you can grab the entire text of the lexer rule like this:
LINE
: INT DOT [a-z]+ {System.out.println(getText());}
;
But grabbing text from individual parts of a lexer rule is not possible.
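One possible workaround is to take the whole lexeme apart again in ordinary code outside the lexer rule. The following is a minimal sketch in plain Python, not ANTLR API: the names LINE_PATTERN and split_line are hypothetical, and the regular expression simply mirrors the LINE rule (INT DOT [a-z]+) from the question.

```python
import re

# Hypothetical helper: mirrors the lexer rule  LINE: INT DOT [a-z]+
# so that the whole matched text can be split into its parts afterwards.
LINE_PATTERN = re.compile(r"([0-9]+)\.([a-z]+)")

def split_line(text):
    """Return (int_part, word_part) for a lexeme matched by the LINE rule."""
    m = LINE_PATTERN.fullmatch(text)
    if m is None:
        raise ValueError(f"not a LINE lexeme: {text!r}")
    return m.group(1), m.group(2)
```

In a listener or visitor, the input to such a helper would be the text of the LINE token.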
Try to separate the concerns of the lexer from output and other processing: that separation is a main focus of ANTLR compared with Bison/Flex. You can, for example, use the visitor/listener patterns from the other chapters of the book.
My language has commands that can be parameter-less or with parameters, and an "if" keyword:
cmd1 // parameter-less command
cmd2 a word // with parameter: "a word" - it starts with first non-WS char
if cmd3 // if, not a command, followed by parameter-less command
cmd4 if text // command with parameter: "if text"
"if" is recognized as if only if it's the first non-WS string in the line (let's ignore comments for now...)
These are my grammar rules:
grammar TestFlow;
// Parser Rules:
root: (lineComment | ifStat | cmd )* EOF;
lineComment : LC;
ifStat : IF;
cmd : CMD;
// Lexer Rules:
LC : '//' ~([\n\r\u2028\u2029])* -> channel(HIDDEN); // line comment
IF : 'if';
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need.
What's my mistake? How do I fix it?
There doesn't appear to be a parser rule in your example that defines the grammar. Meaning there is no rule indicating to look for an 'if' AND a command.
What is happening in your words:
But my lexer identifies 3rd line as a CMD: if cmd3, and not as if followed by cmd3 as I need
The first alternative in the lexer rule CMD looks for one or more characters ("if"), followed by a space ' ', followed by a LINE (cmd3).
So, with the input "if cmd3" it matches the entire line, which is exactly what you told it to do!
I can tell you from personal experience that, even for a simple language, you'll learn a lot very quickly by taking a step back and reviewing some example grammars, which is what I would do now if I were you, to avoid frustration. I highly recommend the ANTLR 4 reference book from www.pragprog.com as well as the ANTLR website.
UPDATED
I think this is what you may be interested in:
grammar myGrammar;
root : statement NEWLINE
| comment NEWLINE
;
statement : ifStat (LC)?
| cmdStat (LC)?
;
ifStat : IF cmdStat;
cmdStat : cmd (args)*;
cmd : CMD;
args : LINE;
CMD : [-_a-zA-Z0-9]+ GAP LINE
| [-_a-zA-Z0-9]+
;
fragment GAP : [ \t]+;
fragment LINE : ~([\n\r\u2028\u2029])*;
NEWLINE : ('\r')?'\n';
Again, I must say, if you read the book (which I did), this may give you the expected response from your parser (not lexer).
The ifStat is optional (it may or may not be there, based on your test cases), there will always be a cmd, and there may or may not be a line comment following it. Try this out and see if it is helpful. Good luck!
Just one tiny line made everything perfect: in my MyParser.g4, I just had to enter:
options { tokenVocab = MyLexer; }
right after the parser grammar MyParser; line.
So much time was wasted just to find this little detail... :-(
(A few of the) other posts by people who did not know what was going on before finally reaching this solution:
ANTLR: Lexer does not recognize token
mismatched Input when lexing and parsing with modes
In my ANTLR4 grammar, I would like to skip whitespace in general, in order to keep the grammar as simple as possible. For this purpose I use the lexer rule WS : [ \t\r\n]+ -> skip;.
However, there may be certain sections in the input where whitespace matters. One example is tables that are either tab-delimited or that require counting the spaces to find out which number is written in which column.
If I could switch off skipping the whitespace in between some begin and end symbols (table{ ... }), this would be perfect. Is it possible?
If not, are there other solutions to switch between different lexer rules depending on the context?
Take a look at context-sensitive tokens with lexical modes. They are covered in more depth in Chapter 12 of "The Definitive ANTLR 4 Reference". I think you should be able to pull it off with this.
Declare a rule that switches to the "skip spaces" mode, and another that switches back to the default one.
OPEN: '<' -> mode(SKIP_SPACES);
mode SKIP_SPACES;
CLOSE: '>' -> mode(DEFAULT_MODE);
WS : [ \t\r\n]+ -> skip;
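The idea of switching rule sets by mode can be sketched outside ANTLR. The following plain-Python toy lexer is an illustration only, under stated assumptions: the rule names (OPEN, WORD, CELL, TABLE_WS), the table{ ... } delimiters taken from the question, and the "first matching rule wins" shortcut are all hypothetical simplifications, not how ANTLR itself is implemented. Whitespace is skipped in the default mode and kept as a token inside the table.

```python
import re

# Each mode has its own ordered rule list: (name, pattern, action).
# action is "emit", "skip", or ("mode", new_mode), which emits and then switches.
MODES = {
    "DEFAULT": [
        ("OPEN", re.compile(r"table\{"), ("mode", "TABLE")),
        ("WORD", re.compile(r"[A-Za-z0-9]+"), "emit"),
        ("WS", re.compile(r"[ \t\r\n]+"), "skip"),
    ],
    "TABLE": [
        ("CLOSE", re.compile(r"\}"), ("mode", "DEFAULT")),
        ("CELL", re.compile(r"[^ \t\r\n}]+"), "emit"),
        ("TABLE_WS", re.compile(r"[ \t\r\n]+"), "emit"),  # whitespace matters here
    ],
}

def tokenize(text):
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(text):
        # Simplification: take the first rule of the current mode that matches.
        for name, pattern, action in MODES[mode]:
            m = pattern.match(text, pos)
            if m:
                break
        else:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if action != "skip":
            tokens.append((name, m.group()))
        if isinstance(action, tuple):
            mode = action[1]
        pos = m.end()
    return tokens
```

Calling tokenize("a  b table{1\t2} c") drops the spaces around a, b and c but keeps the tab between 1 and 2 as a TABLE_WS token.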
Surprise, I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against the real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic: how the lexer consumes input as it finds it even if it is wrong, and how you can use semantic predicates or lookahead to remove ambiguity. But nothing I have read has helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword as a regular identifier in certain sequences, you will have to include it in the rule that accepts identifiers.
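To illustrate why matching a whole field segment as a single token avoids the problem, here is a rough Python sketch. It is not ANTLR code: the IDENT pattern is an assumed equivalent of the fieldsegment rule, and the KEYWORDS lookup table is a stand-in for defining the keyword rule ahead of the identifier rule.

```python
import re

# Stand-in for keyword rules: an identifier whose full text is a keyword
# is re-labelled; anything else stays a plain IDENT.
KEYWORDS = {"pct_contains": "PCT_CONTAINS"}

# IDENT is the assumed lexer-rule form of fieldsegment: ICHAR+ (USCORE ICHAR+)*
TOKEN = re.compile(
    r"(?P<IDENT>[A-Za-z]+(?:_[A-Za-z]+)*)|(?P<DOT>\.)|(?P<WS>[ \t\r\n]+)"
)

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "IDENT":
            kind = KEYWORDS.get(lexeme, "IDENT")
        if kind != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((kind, lexeme))
        pos = m.end()
    return tokens
```

With this shape, 'attributes.pct_vac' comes out as an identifier, a dot and another identifier, because the lexer never commits to the keyword after seeing only the 'pct_' prefix.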
How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
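The two rules above can be reproduced with a small simulation. This is a sketch in plain Python, not the ANTLR runtime: the RULES table mirrors the grammar from the question, COLON is an assumed name for the implicit ':' token from the parser rule, and next_token and tokenize are hypothetical helpers.

```python
import re

# Lexer rules in definition order. COLON stands in for the implicit ':' token.
RULES = [
    ("COLON", re.compile(r":")),
    ("IDENT", re.compile(r"[a-z]+")),
    ("REST_OF_LINE", re.compile(r"[^\n]*\n")),
]

def next_token(text, pos):
    """Longest match wins; on a tie the rule defined first is kept."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer only, so an earlier rule survives a tie.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append(tok)
        pos += len(tok[1])
    return tokens
```

Feeding it the header line from the question, including the trailing newline, yields a single REST_OF_LINE token for the whole line, which matches the error message: the parser never sees an IDENT.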
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
ANTLR parsing is usually set up in 2 steps.
1. define your lexer rules (tokens)
2. define your parser rules
pseudo code (been a few years since I played with antlr) - lexer rules:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - parser rule:
header: WORD SEPARATOR (SPACE WORD)+ ;
Hope that helps....