Given the grammar:
grammar Test;
words: (WORD|SPACE|DOT)+;
WORD : (
LD
|DOT {_input.LA(1)!='.'}?
) + ;
DOT: '.';
SPACE: ' ';
fragment LD: ~[.\n\r ];
with Antlr4 generated Lexer, for an input:
test. test.test test..test
The token sequence is like:
[#0,0:4='test.',<1>,1:0]
[#1,5:5=' ',<3>,1:5]
[#2,6:14='test.test',<1>,1:6]
[#3,15:15=' ',<3>,1:15]
[#4,16:19='test',<1>,1:16]
[#5,20:20='.',<2>,1:20]
[#6,21:25='.test',<1>,1:21]
[#7,26:25='<EOF>',<-1>,1:26]
What puzzles why the last piece of text test..test is tokenized into test . and .test, while I was supposed to see test. .test
What puzzled me more is for input:
test..test test. test.test
the token sequence is:
[#0,0:3='test',<1>,1:0]
[#1,4:4='.',<2>,1:4]
[#2,5:9='.test',<1>,1:5]
[#3,10:10=' ',<3>,1:10]
[#4,11:14='test',<1>,1:11]
[#5,15:15='.',<1>,1:15]
[#6,16:16=' ',<3>,1:16]
[#7,17:20='test',<1>,1:17]
[#8,21:25='.test',<1>,1:21]
[#9,26:25='<EOF>',<-1>,1:26]
Here the test.test is separated into two tokens while in above it is one.
Is the calling of _input.LA(1) has some side effect to cause this? Can some one explain?
I'm using Antlr4.
Quick fix is to check the previous LA(-1) token if it is unequal . and add a leading optional DOT.
Resulting grammar is:
grammar Test;
words: (WORD|SPACE|DOT)+;
WORD : DOT? (
LD
|{_input.LA(-1)!='.'}? DOT
) + ;
DOT: '.';
SPACE: ' ';
fragment LD: ~[.\n\r ];
Have fun and enjoy ANTLR, it is a nice tool.
Related
I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
Something like:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to screw up the whole thing - like if I put .*? -> channel(HIDDEN), it doesn't seem to work at all (of course, there is an error when using ANTLR on the grammar to the tune of ".*? can match the empty string" - but what should I put there if not that, so that it ignores the C code, but in such a way that I can access it later (i.e., not skipping it)?
Your C_BLOCK rule can be defined just like the usual multi line comment rule is done in so many languages. Make the curly braces part of the rule too:
C_BLOCK: CURLY .*? CURLY -> channel(HIDDEN);
If you need to nest blocks you write something like:
C_BLOCK: CURLY .*? C_BLOCK? .*? CURLY -> channel(HIDDEN);
or maybe:
C_BLOCK:
CURLY (
C_BLOCK
| .
)*?
CURLY
;
(untested).
Update: changed code to use the non-greedy kleene operator as suggested by a comment.
This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embeded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) ._ (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period.
There is mainly one things that I'm finding hard to wrap my head around: The same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet, I don't want to consider the period in (.) as "any charachter".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
I am quiete a newbie in ANTLR and I looked around for a while to fix my problem. Unfortunately without any success...
I simplified my grammar to describe the problem (the token TAG is used in the real example):
grammar Test;
WORD : ('a'..'z')+;
DOT : '.';
TAG : '.test';
WHITE_SPACE
: (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
rule
: 'a' DOT WORD 'z';
When I try to parse the word "a .bcd z" everything is fine, but when I try the word "a .tbyfa z" it shows me the error
line 1:4 mismatched character 'b' expecting 'e'
line 1:5 missing DOT at 'yfa'
In my opinion the problem is that the string after the "." starts with a "t" which could also be the token ".test". I tried backtrack=true, but also without any success.
How can I fix that problem?
Thanks in advance.
ANTLR's lexer cannot backtrack to an alternative in this case. Once the lexer sees ".t", it tries to match the TAG token, but this doesn't succeed, so the lexer then tries to match something else that starts with ".t", but there is no such token. And the lexer will not backtrack a character again to match a DOT. So that's what's going wrong.
A possible solution to it would be to do it like this:
grammar Test;
rule : 'a' DOT WORD 'z';
WORD : ('a'..'z')+;
DOT : '.' (('test')=> 'test' {$type=TAG;})?;
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
fragment TAG : /* empty rule: only used to change the 'type' */;
The ('test')=> is a syntactic predicate which forces the lexer to look ahead to see if there really is "test" ahead. If this is true, "test" is matched and the type of the token is changed to TAG. And since 'test' is optional, the rule can always fall back on only the DOT token.
How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....
i want identifiers that can contain whitespace.
grammar WhitespaceInSymbols;
premise : ( options {greedy=false;} : 'IF' ) id=ID{
System.out.println($id.text);
};
ID : ('a'..'z'|'A'..'Z')+ (' '('a'..'z'|'A'..'Z')+)*
;
WS : ' '+ {skip();}
;
When i test this with "IF statement analyzed" i get a MissingTokenException and the output "IF statement analyzed".
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token. But instead the IF is part of the ID.
Is there a way to achieve my goal? I already tried some variations of the greed=false-option, but without success.
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token.
No, the parser has nothing to say about the creation of tokens: the input is first tokenized and then the parser rules are applied on these tokens. So setting greedy=false has no effect.
You can do this (creating ID tokens with white spaces), but it will be a horrible solution with many predicates, and a few custom methods in the lexer doing manual look-aheads: you really, really don't want this! A much cleaner solution would be to introduce a id rule in your parser and let it match one or more ID tokens.
A demo:
grammar WhitespaceInSymbols;
premise
: IF id THEN EOF
;
id
: ID+
;
IF
: 'IF'
;
THEN
: 'THEN'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WS
: ' '+ {skip();}
;
would parse the input IF statement analyzed THEN into the following tree: