ANTLR 3 parsing - mismatched character ... expecting - antlr

I am quiete a newbie in ANTLR and I looked around for a while to fix my problem. Unfortunately without any success...
I simplified my grammar to describe the problem (the token TAG is used in the real example):
grammar Test;
WORD : ('a'..'z')+;
DOT : '.';
TAG : '.test';
WHITE_SPACE
: (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
rule
: 'a' DOT WORD 'z';
When I try to parse the word "a .bcd z" everything is fine, but when I try the word "a .tbyfa z" it shows me the error
line 1:4 mismatched character 'b' expecting 'e'
line 1:5 missing DOT at 'yfa'
In my opinion the problem is that the string after the "." starts with a "t" which could also be the token ".test". I tried backtrack=true, but also without any success.
How can I fix that problem?
Thanks in advance.

ANTLR's lexer cannot backtrack to an alternative in this case. Once the lexer sees ".t", it tries to match the TAG token, but this doesn't succeed, so the lexer then tries to match something else that starts with ".t", but there is no such token. And the lexer will not backtrack a character again to match a DOT. So that's what's going wrong.
A possible solution to it would be to do it like this:
grammar Test;
rule : 'a' DOT WORD 'z';
WORD : ('a'..'z')+;
DOT : '.' (('test')=> 'test' {$type=TAG;})?;
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
fragment TAG : /* empty rule: only used to change the 'type' */;
The ('test')=> is a syntactic predicate which forces the lexer to look ahead to see if there really is "test" ahead. If this is true, "test" is matched and the type of the token is changed to TAG. And since 'test' is optional, the rule can always fall back on only the DOT token.

Related

Single quoted literal value fails Antlr lexer

I have a lexer rule that defines single-quoted literal string as
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\''
It fails one particular case:
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\''
The problem is really with the last two single quotes. If I added a space in between, it worked. Or I could use two single quotes to end and it worked too, e.g.
'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z'''
I am not sure if it has something to do with having a non-greedy operator which caused the first-match of ('\'' '\'')? If so, I don't see how the last version could have worked.
In any event, could someone help please?
UPDATE - I am not able to reproduce it outside of the full grammar. This may be a red herring.
UPDATE - I missed some important context so I posted another question here Antlr4: single quote rule fails when there are escape chars plus carriage return, new line
I can't reproduce that. Given the following grammar:
lexer grammar Test;
L_S_STRING : '\'' (('\'' '\'') | ('\\' '\'') | ~('\''))* '\'';
OTHER : . ;
which can be tested as follows:
String source = "A'yyyy-MM-dd\\\\'T\\\\'HH:mm:ss\\\\'Z\\\\''B";
Test lexer = new Test(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", Test.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
will print:
OTHER A
L_S_STRING 'yyyy-MM-dd\\'T\\'HH:mm:ss\\'Z\\''
OTHER B
EOF <EOF>

ANTLR: mismatched input

I couldn't understand a bug in my grammar. The file, Bug.g4, is:
grammar Bug;
text: TEXT;
WORD: ('a'..'z' | 'A'..'Z')+ ;
TEXT: ('a'..'z' | 'A'..'Z')+ ;
NEWLINE: [\n\r] -> skip ;
After running antlr4 and javac, I run
grun Bug text -tree
aa
line 1:0 mismatched input 'aa' expecting TEXT
(text aa)
But if I instead use text: WORD in the grammar, things are okay. What's wrong?
When two lexer rules each match the same string of text, and no other lexer rule matches a longer string of text, ANTLR assigns the token type according to the rule which appeared first in the grammar. In your case, a TEXT token can never be produced by the lexer rule because the WORD rule will always match the same text and the WORD rule appears before the TEXT rule in the grammar. If you were to reverse the order of these rules in the grammar, you would start to see TEXT tokens but you would never see a WORD token.

ANTLR v4: Same character has different meaning in different contexts

This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embeded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) ._ (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period.
There is mainly one things that I'm finding hard to wrap my head around: The same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet, I don't want to consider the period in (.) as "any charachter".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....