ANTLR 4 token rule that matches any characters until it encounters XYZ

I want a token rule that gobbles up all characters until it gets to the characters XYZ.
Thus, if the input is this:
helloXYZ
then the token rule should return this token:
hello
If the input is this:
Blah Blah XYZ
then the token rule should return this token:
Blah Blah
How do I define a token rule to do this?

Using the hint that Terence gives in his answer, I think this is what Roger is looking for:
grammar UseLookahead;

parserRule : LexerRule ;

LexerRule  : .+? { (_input.LA(1) == 'X') &&
                   (_input.LA(2) == 'Y') &&
                   (_input.LA(3) == 'Z')
                 }?
           ;
This gives the answers required, hello and Blah Blah respectively. I confess that I don't understand the significance of the final ?.
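One caveat worth noting (my addition, not part of the original answer): nothing in UseLookahead matches the trailing XYZ itself, so after emitting hello the lexer should still report token recognition errors on those three characters. If that matters, a rule along these lines (the name and the -> skip are my assumption) can consume and discard the marker:
XYZ : 'XYZ' -> skip ;   // assumed: silently swallow the delimiter once LexerRule has stopped in front of it
With that in place the XYZ markers simply disappear from the token stream instead of triggering errors.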

How about this?
HELLO : 'hello' {_input.LA(1)!=' '}? ;

If you want good performance, you need to use a form which does not use predicates. I would use code modeled after PositionAdjustingLexer.g4 to reset the position if the token ends with XYZ.
Edit: Don't underestimate the performance hit of the answer using a semantic predicate. The predicate will be evaluated at least once for every character of your entire input stream, and any character where a predicate is evaluated is prevented from using the DFA. The last time I saw something like this in use, it was responsible for more than 95% of the execution time of the entire parsing process, and removing it improved performance from more than 20 seconds to less than 1 second.
tokens {
    SpecialToken
}

mode SpecialTokenMode;

// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
    : 'XYZ'
      -> popMode
    ;

SpecialTokenCharacterAtEOF
    : . EOF
      -> type(SpecialToken), popMode
    ;

SpecialTokenCharacter
    : .
      -> more
    ;
If you want even better performance, you can add a couple rules to optimize handling of sequences that do not contain any X characters:
tokens {
    SpecialToken
}

mode SpecialTokenMode;

// In your position adjusting lexer, if you see a token with the type
// SpecialTokenWithXYZ, reset the position to remove the last 3 characters and set
// the type to SpecialToken
SpecialTokenWithXYZ
    : 'XYZ'
      -> popMode
    ;

SpecialTokenCharacterSpanAtEOF
    : ~'X'+ EOF
      -> type(SpecialToken), popMode
    ;

SpecialTokenCharacterSpan
    : ~'X'+
      -> more
    ;

SpecialTokenXAtEOF
    : 'X' EOF
      -> type(SpecialToken), popMode
    ;

SpecialTokenX
    : 'X'
      -> more
    ;
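The position adjustment itself lives in target-language code; the complete, robust mechanism is in PositionAdjustingLexer.g4 (it goes through a custom LexerATNSimulator so that line and column information stay correct). As a rough illustration only, and assuming the lexer generated from the grammar above is called SpecialLexer (my name, not from the answer), the adjustment boils down to overriding emit() so that a SpecialTokenWithXYZ token is shortened by three characters and relabelled:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.Token;

// Sketch only: "SpecialLexer" is an assumed name for the lexer generated from the
// mode above; SpecialToken and SpecialTokenWithXYZ are the token-type constants
// ANTLR generates for that grammar.
public class SpecialAdjustingLexer extends SpecialLexer {

    public SpecialAdjustingLexer(CharStream input) {
        super(input);
    }

    @Override
    public Token emit() {
        if (getType() == SpecialTokenWithXYZ) {
            // Back the input up over the trailing "XYZ" so those three characters
            // are not part of the emitted token, then relabel it as SpecialToken.
            getInputStream().seek(getInputStream().index() - 3);
            setType(SpecialToken);
        }
        return super.emit();
    }
}
Keep in mind this sketch ignores line/column bookkeeping, and after the seek the XYZ text will be lexed again by whatever default-mode rules apply; the PositionAdjustingLexer approach handles that bookkeeping properly through its interpreter.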

Related

Lex matching doesn't enter recursive rule as expected

I am trying to match words between # characters. Here is my attempt:
init : (TEXT | HASH | placeholder) init? EOF ;
placeholder : HASH lexeme HASH ;
lexeme : LEXEME;
HASH : '#' ;
LEXEME : [a-zA-Z0-9-_]+ ;
TEXT : ~'#'+ ;
My input string: "The good text with a #LEXEME#followed# by hashes of death#############"
And the resulting ParseTree:
I'm expecting the "followed" word to be parsed as a TEXT in the next recursive init but it looks like it's parsed in the same init iteration, thus not recognized. This happens every time a pattern like #letters#letters# is encountered.
How do I solve this?
It looks like you want the #s to mark the start and stop of your placeholders (aka LEXEMEs). You could do that by breaking the grammar into a Lexer grammar and a Parser grammar:
lexer grammar HashLexer;

HASH      : '#' -> mode(PLACEHOLDER_MODE) ;
TEXT      : ~'#'+ ;

mode PLACEHOLDER_MODE;

LEXEME    : [a-zA-Z0-9\-_]+ ;
HASH_TERM : '#' -> mode(DEFAULT_MODE) ;

parser grammar HashParser;

options {
    tokenVocab = HashLexer;
}

init        : (TEXT | placeholder)* EOF ;
placeholder : HASH LEXEME? HASH_TERM ;
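For reference, a token dump like the one below can be produced with a small driver along these lines (a sketch; HashLexer and HashParser are the classes ANTLR generates from the two grammars, CharStreams.fromString needs ANTLR 4.7+, and the exact formatting of each printed line depends on whether you print the tokens yourself or view them in the IntelliJ ANTLR plugin):
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class HashDemo {
    public static void main(String[] args) {
        String text = "The good text with a #LEXEME#followed# by hashes of death#############\n";

        HashLexer lexer = new HashLexer(CharStreams.fromString(text));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                          // run the lexer to EOF
        for (Token t : tokens.getTokens()) {
            System.out.println(t);              // one line per token
        }

        HashParser parser = new HashParser(tokens);
        ParseTree tree = parser.init();
        System.out.println(tree.toStringTree(parser));
    }
}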
When I try to parse your input "The good text with a #LEXEME#followed# by hashes of death#############" however, I get the following token stream:
[#0,0:20='The good text with a ',<TEXT>,1:0]
[#1,21:21='#',<HASH>,1:21]
[#2,22:27='LEXEME',<LEXEME>,1:22]
[#3,28:28='#',<HASH_TERM>,1:28]
[#4,29:36='followed',<TEXT>,1:29]
[#5,37:37='#',<HASH>,1:37]
[#6,39:40='by',<LEXEME>,1:39]
[#7,42:47='hashes',<LEXEME>,1:42]
[#8,49:50='of',<LEXEME>,1:49]
[#9,52:56='death',<LEXEME>,1:52]
[#10,57:57='#',<HASH_TERM>,1:57]
[#11,58:58='#',<HASH>,1:58]
[#12,59:59='#',<HASH_TERM>,1:59]
[#13,60:60='#',<HASH>,1:60]
[#14,61:61='#',<HASH_TERM>,1:61]
[#15,62:62='#',<HASH>,1:62]
[#16,63:63='#',<HASH_TERM>,1:63]
[#17,64:64='#',<HASH>,1:64]
[#18,65:65='#',<HASH_TERM>,1:65]
[#19,66:66='#',<HASH>,1:66]
[#20,67:67='#',<HASH_TERM>,1:67]
[#21,68:68='#',<HASH>,1:68]
[#22,69:69='#',<HASH_TERM>,1:69]
[#23,70:70='\n',<TEXT>,1:70]
[#24,71:70='<EOF>',<EOF>,2:0]
The # after followed switches us into PLACEHOLDER_MODE, so " by hashes of death" is lexed in that mode and generates recognition errors wherever it does not match the LEXEME rule. And you get the following parse tree:
This seems like the correct interpretation of your input. Assuming that #s act like ( and ) to bracket some input, you're going to get situations like this when they're not matched up correctly. The only solution to that would be to relax the grammar quite a bit and handle more of the validation in a listener/visitor.

ANTLR Grammar to parse an asterisk-delimited input

I am attempting to use ANTLR (v4) to create a parser for an asterisk-delimited list encapsulated by START and END markers.
START**na**na**aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
Where a normal input string would be something like:
START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
I would still need to be able to allow spaces, tabs, and null/empty fields (basically any character except START, END, or *) between the asterisks.
That includes things like ** * * *asdf fdsa* * asdf *
Here is my grammar so far:
parseIt: ENTRY ;
ENTRY : 'START*' FIELD_SET 'END' ;
fragment Delim : '*' ;
fragment Data : (ANY | WS)* ;
fragment FIELD_SET : Data (Delim Data|Delim)* ;
I can recognize simple input (like the first example I gave), but am having trouble recognizing tokens that have spaces or special characters between the asterisks.
I’m pretty sure you could handle this with a RegEx and capture groups, but if you really want to use ANTLR…
The following works:
grammar asterisks;

parseIt  : 'START' dataItem* 'END' EOF ;
dataItem : Delim Data? ;

Delim : '*' ;
Data  : ~[*]+ { !(
            (getText().endsWith("E")  && _input.LA(1) == (int) 'N' && _input.LA(2) == (int) 'D') ||
            (getText().endsWith("EN") && _input.LA(1) == (int) 'D') ||
            (getText().endsWith("END"))
        ) }? ;
and gives the following parse tree for your first input:
Unfortunately for you, the way the lexer works, a simple lexer rule like Data : ~[*]+ will preferentially match aEND over your implied END lexer rule, because the ANTLR lexer uses the rule that matches the longest sequence of input characters, and Data : ~[*]+ matches aEND while END only matches END (ANTLR also doesn't look ahead for token matches). As a result, the rather tortured semantic predicate is the only way to disallow a token that is a stream of characters ending with END.
(Note: Semantic predicates are target-language specific, and this predicate is for Java. Other targets would require the equivalent in that target language.)
Another approach would be to check whether your input endsWith("END") and, if so, just remove it prior to parsing with this grammar:
grammar asterisks;

parseIt  : 'START' dataItem* EOF ;
dataItem : Delim Data? ;

Delim : '*' ;
Data  : ~[*]+ ;
This avoids the END token problem by just removing it from the input stream. Given that it's the very end of the stream, this might be simpler.
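A sketch of that pre-processing step (assuming the generated classes are asterisksLexer and asterisksParser, which is what ANTLR derives from grammar asterisks, and CharStreams.fromString from ANTLR 4.7+):
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class AsterisksDemo {
    public static void main(String[] args) {
        String input = "START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND";

        // Strip the trailing END marker before lexing so the greedy Data rule
        // can no longer swallow it as part of the last field.
        if (input.endsWith("END")) {
            input = input.substring(0, input.length() - "END".length());
        }

        asterisksLexer lexer = new asterisksLexer(CharStreams.fromString(input));
        asterisksParser parser = new asterisksParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.parseIt();
        System.out.println(tree.toStringTree(parser));
    }
}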

Negated lexer rules/tokens

I am trying to match (and ignore) c-style block comments. To me the sequence is (1) /* followed by (2) anything other than /* or */ until (3) */.
BLOCK_COMMENT_START
    : "/*"
    ;

BLOCK_COMMENT_END
    : "*/"
    ;

BLOCK_COMMENT
    : BLOCK_COMMENT_START ( ~( BLOCK_COMMENT_START | BLOCK_COMMENT_END ) )* BLOCK_COMMENT_END {
        // again, we want to skip the entire match from the lexer stream
        $setType( Token.SKIP );
      }
    ;
But Antlr does not think like I do ;)
sql-stmt.g:121:34: This subrule cannot be inverted. Only subrules of the form:
(T1|T2|T3...) or
('c1'|'c2'|'c3'...)
may be inverted (ranges are also allowed).
So the error message is a little cryptic, but I think it is trying to say that only ranges, single-char alts or token alts can be negated. But isn't that what I have? Both BLOCK_COMMENT_START and BLOCK_COMMENT_END are tokens. What am I missing?
Thanks a lot for any help.

antlr how can I skip a token until I need it

Is there a way to skip tokens until I need them? In order to be more clear, here is a grammar that is as close as I could get to what I want:
grammar example;

file      : statement* EOF ;

statement : ID EOL
          | '{' (EOL statement*)? '}' EOL
          ;

EOL        : ('\r'? '\n' | '\r') -> skip ;
WHITESPACE : [ \t]+ -> skip ;
Hopefully my intent is clear: all whitespace (including newlines) is skipped under normal circumstances, but I can demand the presence of a newline whenever I want, so
foo
{
bar
}
baz
would fit the grammar, but not
foo {
bar
} baz
or
foo bar
{
baz
}
Is there a way to do this, or do I just have to put a lot of EOL*'s in my grammar?
Not long ago I answered another question that needed the same kind of mechanism.
See here for further details.
Basically you achieve this by providing your own custom TokenStream that implements a mechanism for either skipping whitespace or feeding it into the parser, depending on its setting.
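I don't have the linked code handy, but the general shape is roughly the sketch below. Note that it swaps in a TokenSource wrapper (sitting between the lexer and the CommonTokenStream) rather than a full custom TokenStream, and it assumes EOL is emitted by the lexer rather than -> skip-ped, because a skipped token never reaches anything downstream:
import org.antlr.v4.runtime.*;

// Rough sketch, not the linked answer's code: hide EOL tokens from the parser
// unless told to pass them through. "exampleLexer" is the class ANTLR generates
// from the grammar above.
public class NewlineFilter implements TokenSource {
    private final Lexer delegate;
    private boolean keepNewlines = false;

    public NewlineFilter(Lexer delegate) { this.delegate = delegate; }

    public void setKeepNewlines(boolean keep) { this.keepNewlines = keep; }

    @Override
    public Token nextToken() {
        Token t = delegate.nextToken();
        while (!keepNewlines && t.getType() == exampleLexer.EOL) {
            t = delegate.nextToken();   // drop newlines while they are "uninteresting"
        }
        return t;
    }

    // The remaining TokenSource methods simply forward to the wrapped lexer.
    @Override public int getLine() { return delegate.getLine(); }
    @Override public int getCharPositionInLine() { return delegate.getCharPositionInLine(); }
    @Override public CharStream getInputStream() { return delegate.getInputStream(); }
    @Override public String getSourceName() { return delegate.getSourceName(); }
    @Override public void setTokenFactory(TokenFactory<?> factory) { delegate.setTokenFactory(factory); }
    @Override public TokenFactory<? extends Token> getTokenFactory() { return delegate.getTokenFactory(); }
}
You would then build the stream as new CommonTokenStream(new NewlineFilter(lexer)) and flip the flag from whatever drives the parse; be aware that the parser's lookahead may already have buffered tokens past the point where you toggle it.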

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;

equation
    : expression (AND expression)*
    ;

expression
    : ID
    ;

ID     : LETTER (LETTER | DIGIT)* ;
AND    : 'and' ;
LETTER : [a-zA-Z_] ;
DIGIT  : [0-9] ;
WS     : [ \r\n\t]+ -> channel(HIDDEN) ;
If I use this grammar with the sample text abc and d, I get a weird tree with unexpected structure, as shown below (using IntelliJ and the ANTLR4 plugin):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;
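With that reordering, the lexer section of the grammar reads as follows (parser rules unchanged), and abc and d tokenizes as ID AND ID:
AND    : 'and' ;                    // defined before ID, so it wins the tie for the input "and"
ID     : LETTER (LETTER | DIGIT)* ;
LETTER : [a-zA-Z_] ;
DIGIT  : [0-9] ;
WS     : [ \r\n\t]+ -> channel(HIDDEN) ;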