I am very new to ANTLR and am trying to understand how the Lexer and Parser rules work. I'm experiencing issues with a grammar I've written that seem to be related to lexer tokens with multiple characters being seen as "matches" even when only the first few characters actually match. To demonstrate this, I have written a simple ANTLR 3 Grammar:
grammar test;
options {
k=3;
}
@lexer::header { package test;}
@header {package test;}
sentence : (CHARACTER)*;
CHARACTER : 'a'..'z'|' ';
SPECIAL : 'special';
I'm using AntlrWorks to parse the following test input:
apple basic say sponsor speeds speckled specific wonder
The output I get is:
apple basic say nsor ds led ic wonder
It seems to me that the LEXER is using k=1 and therefore matching my SPECIAL token with anything that includes the two letters 'sp'. Once it encounters the letters 'sp', it then matches successive characters within the SPECIAL literal until the actual input fails to match the expected token - at which point it throws an error (consuming that character) and then continues with the rest of the sentence. Each error is of the form:
line 1:18 mismatched character 'o' expecting 'e'
However, this isn't the behaviour I'm trying to create. I wish to create a lexer token that matches the keyword ('special') - for use in other parser rules not included in this test example. However, I don't want other rules/input that just happens to include the same initial characters to be affected.
To summarize:
How do I actually set antlr 3 options (such as k=2 or k=3 etc)? It seems to me, at least, that the options I'm trying to use here aren't being set.
Is there a better way to create parser or lexer rules to match a particular keyword in my input, without affecting processing of other parts of the input that don't contain a full match?
The k in the options { ... } section defines the look ahead of the parser, not the lexer.
Note that the grammar
CHARACTER : 'a'..'z'|' ';
SPECIAL : 'special';
is ambiguous: your 'special' could also be lexed as seven separate 'a'..'z' characters. Normally, input like this would be tokenized with a grammar along these lines:
grammar Test;
sentence : (special | word | space)+ EOF;
special : SPECIAL;
word : WORD;
space : SPACE;
SPECIAL : 'special';
WORD : 'a'..'z'+;
SPACE : ' ';
which will parse the input:
specia special specials
as follows:
I.e. it gets (more or less) tokenized as a combination of LL(1) and "longest-matched". Sorry, I realize that's a bit vague, but the Definitive ANTLR Reference does not clarify this exactly (at least, I can't find it...). But I realize that this might not be what you're looking for.
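To make the "longest match" idea a bit more concrete, here is a small plain-Java sketch that does not use ANTLR at all. It is only a simplified model and not ANTLR's actual algorithm: at each position it tries every rule, keeps the longest match, and lets the rule declared first win a tie. The class name LongestMatchDemo and the regular expressions standing in for SPECIAL, WORD and SPACE are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of longest-match tokenizing; not ANTLR's implementation.
final class LongestMatchDemo {
    // Rules in declaration order, mirroring SPECIAL, WORD and SPACE above.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SPECIAL", Pattern.compile("special"));
        RULES.put("WORD", Pattern.compile("[a-z]+"));
        RULES.put("SPACE", Pattern.compile(" "));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // The strict '>' means an earlier rule wins a tie in length.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            out.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("specia special specials"));
    }
}
```

Under this model only the exact word special becomes SPECIAL; specia and specials both come out as WORD, because WORD either is the only rule that matches or matches more characters.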
AFAIK, the only way to produce single-char tokens and also define keywords that are made up of those same characters is to merge the two tokens into a single rule, use predicates and manual look-ahead to see whether the input conforms to a keyword, and if not, change the type of the token in a "fall through" sub rule. A demo:
grammar test;
tokens {
LETTER;
}
@lexer::members {
// manual look ahead
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i+1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
sentence
: (t=. {System.out.printf("\%-7s :: '\%s'\n", tokenNames[$t.type], $t.text);})+ EOF
;
SPECIAL
: {ahead("special")}?=> 'special'
| {ahead("keyword")}?=> 'keyword'
| 'a'..'z' {$type = LETTER;} // Last option and no keyword is found:
// change the type of this token
;
SPACE
: ' '
;
The parser generated from the above grammar can be tested with the class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("apple basic special speckled keyword keywor");
testLexer lexer = new testLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
testParser parser = new testParser(tokens);
parser.sentence();
}
}
As you can see, when parsing the input:
apple basic special speckled keyword keywor
the following output is generated:
LETTER :: 'a'
LETTER :: 'p'
LETTER :: 'p'
LETTER :: 'l'
LETTER :: 'e'
SPACE :: ' '
LETTER :: 'b'
LETTER :: 'a'
LETTER :: 's'
LETTER :: 'i'
LETTER :: 'c'
SPACE :: ' '
SPECIAL :: 'special'
SPACE :: ' '
LETTER :: 's'
LETTER :: 'p'
LETTER :: 'e'
LETTER :: 'c'
LETTER :: 'k'
LETTER :: 'l'
LETTER :: 'e'
LETTER :: 'd'
SPACE :: ' '
SPECIAL :: 'keyword'
SPACE :: ' '
LETTER :: 'k'
LETTER :: 'e'
LETTER :: 'y'
LETTER :: 'w'
LETTER :: 'o'
LETTER :: 'r'
See the Q&A What is a 'semantic predicate' in ANTLR? to learn more about predicates in ANTLR.
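The manual look-ahead used in the SPECIAL rule above can also be tried out without generating a lexer. The following is a rough plain-Java sketch of the same idea, assuming a String and an index in place of ANTLR's input.LA(...); the class name LookaheadDemo and the tokenize helper are hypothetical and only mimic the rule's "keyword first, otherwise a single letter" fall-through.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone illustration; the real check runs inside the generated lexer.
final class LookaheadDemo {
    // True if `text` occurs in `input` starting at `pos`, compared one
    // character at a time, much like input.LA(i+1) in the grammar's helper.
    static boolean ahead(String input, int pos, String text) {
        for (int i = 0; i < text.length(); i++) {
            if (pos + i >= input.length() || input.charAt(pos + i) != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    static List<String> tokenize(String input) {
        String[] keywords = {"special", "keyword"};
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String matched = null;
            for (String kw : keywords) {
                if (ahead(input, pos, kw)) {
                    matched = kw;
                    break;
                }
            }
            char c = input.charAt(pos);
            if (matched != null) {
                out.add("SPECIAL:" + matched);   // the whole keyword is ahead
                pos += matched.length();
            } else if (c == ' ') {
                out.add("SPACE");
                pos++;
            } else if (c >= 'a' && c <= 'z') {
                out.add("LETTER:" + c);          // fall through: a single letter
                pos++;
            } else {
                throw new IllegalArgumentException("unexpected character at index " + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a special spec"));
    }
}
```

Because the keyword check only succeeds when every character of the keyword is present, a prefix such as spec is never partially consumed; it simply falls through to individual LETTER tokens.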
I am attempting to use ANTLR (v4) to create a parser for an asterisk-delimited list encapsulated by START and END markers.
START**na**na**aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
Where a normal input string would be something like:
START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
I would still need to be able to allow spaces, tabs, null/empty fields (basically any character except START, END, * between the asterisks).
That includes things like ** * * *asdf fdsa* * asdf *
Here is my grammar so far:
parseIt: ENTRY ;
ENTRY : 'START*' FIELD_SET 'END' ;
fragment Delim : '*' ;
fragment Data : (ANY | WS)* ;
fragment FIELD_SET : Data (Delim Data|Delim)* ;
I can recognize simple input (like the first example I gave), but am having trouble recognizing tokens that have spaces or special characters between the asterisks.
I’m pretty sure you could handle this with a RegEx and capture groups, but if you really want to use ANTLR…
The following works:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+ {!(
(getText().endsWith("E") && _input.LA(1) == (int) 'N' && _input.LA(2) == (int) 'D') ||
(getText().endsWith("EN") && _input.LA(1) == (int) 'D') ||
(getText().endsWith("END")))}?;
and gives the following parse tree (for your first input):
Unfortunately for you, the way the lexer works, a simple lexer rule like Data : ~[*]+ will preferentially match aEND over your implied END lexer rule, because the ANTLR lexer uses the rule that matches the longest sequence of input characters, and Data : ~[*]+ matches aEND while END only matches END (ANTLR also doesn't look ahead for token matches). As a result, the rather tortured semantic predicate is the only way to disallow a token that is a stream of characters ending with END.
(Note: Semantic predicates are target-language specific, and this predicate is for Java. Other targets would require the equivalent in that target language.)
Another approach would be to check whether your input ends with “END”, and then just remove it prior to parsing using this grammar:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+;
This avoids the END token problem by just removing it from the input stream. Given that it's the very end of the stream, this might be simpler.
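As a rough illustration of the pre-processing idea, here is a plain-Java sketch that strips the markers and, going one step further than the answer, splits the fields directly instead of handing the remainder to ANTLR. The class name AsteriskFields and the method fields are assumptions of mine, and the sketch assumes the whole input is available as one String.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: removes START/END and splits the rest on '*'.
final class AsteriskFields {
    static List<String> fields(String input) {
        if (input.length() < 8 || !input.startsWith("START") || !input.endsWith("END")) {
            throw new IllegalArgumentException("input must start with START and end with END");
        }
        // Drop both markers so that a field ending in "END" is no longer a problem.
        String body = input.substring("START".length(), input.length() - "END".length());
        if (body.isEmpty()) {
            return new ArrayList<>();
        }
        if (body.charAt(0) != '*') {
            throw new IllegalArgumentException("expected '*' after START");
        }
        // Each '*' starts a (possibly empty) field; the limit -1 keeps
        // trailing empty fields that split() would otherwise discard.
        String[] parts = body.substring(1).split("\\*", -1);
        return new ArrayList<>(Arrays.asList(parts));
    }

    public static void main(String[] args) {
        System.out.println(fields("START*a**b *END"));
    }
}
```

Empty fields and fields containing spaces or tabs are kept as they are, which matches the requirement that anything except * may appear between the asterisks.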
I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is because attribute references are not (currently) supported in the lexer. Still, it could be done on the lexer rule - just would require a slight bit more effort to dig to the token.
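The elided unquoting step itself is independent of ANTLR. One possible version, shown here as a small sketch on a plain String rather than on a Token, simply drops one leading and one trailing single quote when both are present. The class name Unquote is hypothetical, and the sketch assumes the STRING rule above, which has no escape sequences to decode.

```java
// Hypothetical helper operating on the token text only.
final class Unquote {
    static String unquote(String text) {
        boolean quoted = text.length() >= 2
                && text.charAt(0) == '\''
                && text.charAt(text.length() - 1) == '\'';
        // Leave the text untouched if it is not wrapped in single quotes.
        return quoted ? text.substring(1, text.length() - 1) : text;
    }

    public static void main(String[] args) {
        System.out.println(unquote("'hello world'"));
    }
}
```

Inside Strings.unquote(Token) the result of such a call would be what gets passed to ct.setText(...).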
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string
If I have a ONELINE_STRING fragment rule in an antlr4 lexer that identifies a simple quoted string on one line, how can I create a more general STRING rule in the lexer that will concatenate adjacent ONELINE_STRING's (ie, separated only by whitespace and/or comments) as long as they each start on a different line?
ie,
"foo" "bar"
would be parsed as two STRING tokens, "foo" followed by "bar"
while:
"foo"
"bar"
would be seen as one STRING token: "foobar"
For clarification: The idea is that while I generally want the parser to be able to recognize adjacent strings as separate, and whitespace and comments to be ignored by the parser, I want to use the idea that if the last non-whitespace sub-token on a line was a string, and the first sub-token on the next line that is not all whitespace is also a string, then the separate strings should be concatenated into one long string as a means of specifying potentially very long strings without having to put the whole thing on one line. This is very straightforward if I were wanting all adjacent string sub-tokens to be concatenated, as they are in C... but for my purposes, I only want concatenation to occur when the string sub-tokens start on different lines. This concatenation should be invisible to any rule in the parser that might use a string. This is why I was thinking it might be better to situate the rule inside the lexer instead of the parser, but I'm not wholly opposed to doing this in the parser, and all the parsing rules which might have referred to a STRING token would instead refer to the parser string rule whenever they want a string.
Sample1:
"desc" "this sample will parse as two strings.
Sample3 (note, 'output' is a keyword in the language):
output "this is a very long line that I've explicitly made so that it does not "
"easily fit on just one line, so it gets split up into separate ones for "
"ease of reading, but the parser should see it all as one long string. "
"This example will parse as if the output command had been followed by "
"only a single string, even though it is composed of multiple string "
"fragments, all of which should be invisible to the parser.%n";
Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.
Addendum:
I had originally been thinking that this would need to be done in the lexer because, although newlines are supposed to be ignored by the parser like all other whitespace, a multiline string is actually sensitive to the presence of newlines, and I did not think that the parser could perceive that.
However, I have been thinking that it may be possible to have the ONELINE_STRING as a lexer rule, and have a general 'string' parser rule which detects adjacent ONELINE_STRINGS, using a predicate between strings to detect if the next ONELINE_STRING token is starting on a different line than the previous one, and if so, it should invisibly concatenate them so that its text is indistinguishable from a string that had been specified all on one line. I am unsure of the logistics of how this would be implemented, however.
Okay, I have it.
I need to have the string recognizer in the parser, as some of you have suggested. The trick is to use lexer modes in the lexer.
So in the Lexer file I have this:
BEGIN_STRING : '"' -> pushMode(StringMode);
mode StringMode;
END_STRING: '"'-> popMode;
STRING_LITERAL_TEXT : ~[\r\n%"];
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT: '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n'{ setText("\n"); };
UNTERMINATED_STRING: { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF}? -> popMode;
And in the parser file I have this:
string returns [String text] locals [int line] : a=stringLiteral { $line = $a.line; $text=$a.text;}
({_input.LT(1)!=null && _input.LT(1).getLine()>$line}?
a=stringLiteral { $line = $a.line; $text+=$a.text; })*
;
stringLiteral returns [int line, String text]: BEGIN_STRING {$text = "";}
(a=(STRING_LITERAL_TEXT
| STRING_LITERAL_ESCAPE_NEWLINE
| STRING_LITERAL_ESCAPE_QUOTE
| STRING_LITERAL_ESCAPE_PERCENT
) {$text+=$a.text;} )*
stringEnd { $line = $BEGIN_STRING.line; }
;
stringEnd: END_STRING #string_finish
| UNTERMINATED_STRING #string_hang
;
The string rule thus concatenates adjacent string literals as long as they are on different lines. The stringEnd rule needs an event handler for when a string literal is not terminated correctly so that the parser can report a syntax error, but the string is otherwise treated as if it had been closed correctly.
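The line comparison that the predicate in the string rule performs can be pictured with plain data. The sketch below is not generated ANTLR code; it assumes each string literal has already been reduced to its starting line and its text (the hypothetical Literal class), and it joins a literal onto the previous one only when it starts on a later line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the concatenation rule, independent of ANTLR.
final class MultilineStrings {
    static final class Literal {
        final int line;
        final String text;

        Literal(int line, String text) {
            this.line = line;
            this.text = text;
        }
    }

    static List<String> concatenate(List<Literal> literals) {
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        int line = 0;
        for (Literal lit : literals) {
            if (current != null && lit.line > line) {
                // Starts on a later line than the previous literal: same string.
                current.append(lit.text);
            } else {
                // Same line (or first literal): a new, separate string begins.
                if (current != null) {
                    out.add(current.toString());
                }
                current = new StringBuilder(lit.text);
            }
            line = lit.line;
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(concatenate(Arrays.asList(
                new Literal(1, "foo"), new Literal(1, "bar"),
                new Literal(2, "baz"))));
    }
}
```

Two literals on the same line stay separate, while a literal on a following line is appended to the one before it, which is the behaviour asked for in the question.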
EDIT: Sorry, have not read your requirements fully. The following approach would match both examples not only the desired one. Have to think about it...
The simplest way would be to do this in the parser. I see no reason that would require this to be done in the lexer.
multiString : singleString +;
singleString : ONELINE_STRING;
ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;
As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:
STRING
: SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
;
HIDDEN
: ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
;
fragment SINGLE_STRING
: '"' ~'"'* '"'
;
fragment LINE_CONTINUATION
: ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
;
fragment SPACE
: [ \t]
;
fragment LINE_BREAK
: [\r\n]
| '\r\n'
;
fragment COMMENT
: '//' ~[\r\n]+
;
Tokenizing the input:
"a" "b"
"c"
"d"
"e"
"f"
would create the following 5 tokens:
"a"
"b"
"c"\n"d"
"e"
"f"
However, if the token would include a comment:
"c" // comment
"d"
then you'd need to strip this "// comment" from the token yourself at a later stage. The lexer will not be able to put this substring on a different channel, or skip it.
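One way that later clean-up step might look is sketched below in plain Java. It assumes the token text has the shape produced by the STRING rule above (quoted pieces without escape sequences, separated by whitespace and // comments); the class name MergedStringToken and the method contents are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing of a merged STRING token's text.
final class MergedStringToken {
    // Either a quoted piece (captured without its quotes) or a line comment.
    // Matching comments explicitly keeps quotes inside a comment from being
    // mistaken for string pieces.
    private static final Pattern PIECE =
            Pattern.compile("\"([^\"]*)\"|//[^\\r\\n]*");

    static String contents(String tokenText) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PIECE.matcher(tokenText);
        while (m.find()) {
            if (m.group(1) != null) {
                sb.append(m.group(1)); // a quoted piece: keep its contents
            }
            // otherwise a comment was matched and is dropped
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(contents("\"c\" // comment\n\"d\""));
    }
}
```

The whitespace and line breaks between the pieces are skipped by find(), so only the contents of the quoted pieces end up in the result.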
This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) .- (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period).
There is mainly one thing that I'm finding hard to wrap my head around: the same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet I don't want to consider the period in (.) as "any character".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[@0,0:0='^',<1>,1:0]
[@1,1:1='.',<4>,1:1]
...
[@15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
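To see how these three token definitions carve up the input, here is a small regex-based tokenizer in plain Java. It is only an approximation of what the generated lexer would do, not ANTLR itself; the class name MorseTokens is hypothetical, and the regular expressions are my own translations of SPECIAL, SYMBOL, CARET and WS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of the SPECIAL / SYMBOL / CARET / WS lexer rules.
final class MorseTokens {
    private static final Pattern TOKEN = Pattern.compile(
            "(?<SPECIAL>\\(.*?\\))|(?<SYMBOL>[.-]+)|(?<CARET>\\^)|(?<WS>[ \\t\\r\\n]+)",
            Pattern.DOTALL); // like ANTLR's '.', let '.' match any character

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at index " + pos);
            }
            if (m.group("SPECIAL") != null) {
                out.add("SPECIAL:" + m.group());
            } else if (m.group("SYMBOL") != null) {
                out.add("SYMBOL:" + m.group());
            } else if (m.group("CARET") != null) {
                out.add("CARET:" + m.group());
            }
            // WS is matched but not emitted, mirroring '-> skip'.
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("^... --- ...(#) ( ) (.)"));
    }
}
```

Because the parenthesized form is matched as one unit, the space in ( ) and the period in (.) never reach the whitespace or SYMBOL alternatives, which is the context sensitivity the question was struggling with.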
Adding skip to a rule doesn't do what I expect. Here's a grammar for a pair of tokens separated by a comma and a space. I made one version where the comma is marked skip, and one where it isn't:
grammar Commas;
COMMA: ', ';
COMMASKIP: ', ' -> skip;
DATA: ~[, \n]+;
withoutSkip: data COMMA data '\n';
withSkip: data COMMASKIP data '\n';
data: DATA;
Testing the rule without skip works as expected:
$ echo 'a, b' | grun Commas withoutSkip -tree
(withoutSkip (data a) , (data b) \n)
With skip gives me an error:
$ echo 'a, b' | grun Commas withSkip -tree
line 1:1 mismatched input ', ' expecting COMMASKIP
(withSkip (data a) , b \n)
If I comment out the COMMA and withoutSkip rules I get this:
$ echo 'a, b' | grun Commas withSkip -tree
line 1:3 missing ', ' at 'b'
(withSkip (data a) <missing ', '> (data b) \n)
I am trying to get output that just has the data tokens without the comma, like this:
(withSkip (data a) (data b) \n)
What am I doing wrong?
skip causes the lexer to discard the token. Therefore, a skipped lexer rule cannot be used in parser rules.
Another thing: if two or more lexer rules match the same input, the rule defined first will "win" over the rule(s) defined later in the grammar. It does not matter that the parser is trying to match the rule defined later; the first rule will always "win". In your case, a COMMASKIP token will never be created, since COMMA matches the same input.
Try something like this:
grammar Commas;
COMMA : ',' -> skip;
SPACE : (' '|'\n') -> skip;
DATA : ~[, \n]+;
data : DATA+;
EDIT
So how do I specify where the comma goes without including it in the parse tree? Your code would match a, , b.
You don't, so if the comma is significant (i.e. a,,b is invalid), it cannot be skipped in the lexer.
I think in antlr3 you're supposed to use an exclamation point.
In ANTLR 4, you cannot create an AST from your parse. In the new version, all terminals/rules are in one parse tree. You can iterate over this tree with custom visitors and/or listeners. A demo of how to do this can be found in this Q&A: Once grammar is complete, what's the best way to walk an ANTLR v4 tree?
In your case, the grammar would look like this:
grammar X;
COMMA : ',';
SPACE : (' '|'\n') -> skip;
DATA : ~[, \n]+;
data : DATA (COMMA DATA)*;
and then create a listener like this:
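import java.util.List;
import org.antlr.v4.runtime.tree.TerminalNode;
public class MyListener extends XBaseListener {
@Override
public void enterData(XParser.DataContext ctx) {
// DATA occurs more than once in the rule, so the generated accessor returns a list
List<TerminalNode> dataList = ctx.DATA();
// do something with `dataList`
}
}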
As you can see, the COMMA is not removed, but inside enterData(...) you only use the DATA tokens.