Getting the index of an attribute in parser in antlr4

Getting the index of an attribute in parser in antlr4 - grammar

I have a grammar which looks like
statement
: ME second_part
{
System.out.println($ME.getStartIndex());
System.out.println($second_part.getStartIndex());
}
;
ME : 'me'
;
SPACES : [ \t\n\r] -> channel(HIDDEN);
I want to get the start indices of Me and second_part.
I am getting an error when I run the above antlr4 grammar
System.out.println($second_part.getStartIndex());
missing attribute access on rule reference second_part in $second_part
How can I get starting index of second_part ?

Every parser rule is a ParserRuleContext, which has a start and stop token. So, try this instead:
System.out.println($second_part.start.getStartIndex());

Related

Lex matching doesn't enter recursive rule as expected

I am trying to match words between # characters. Here is my attempt:
init : (TEXT | HASH | placeholder) init? EOF ;
placeholder : HASH lexeme HASH ;
lexeme : LEXEME;
HASH : '#' ;
LEXEME : [a-zA-Z0-9-_]+ ;
TEXT : ~'#'+ ;
My input string: "The good text with a #LEXEME#followed# by hashes of death#############"
And the resulting ParseTree:
I'm expecting the "followed" word to be parsed as a TEXT in the next recursive init but it looks like it's parsed in the same init iteration, thus not recognized. This happens every time a pattern like #letters#letters# is encountered.
How do I solve this?

It looks like you want the #s to mark the start and stop of your placeholders (aka LEXEMEs). You could do that by breaking the grammar into a Lexer grammar and a Parser grammar:
lexer grammar HashLexer
;
HASH: '#' -> mode(PLACEHOLDER_MODE);
TEXT: ~'#'+;
mode PLACEHOLDER_MODE
;
LEXEME: [a-zA-Z0-9\-_]+;
HASH_TERM: '#' -> mode(DEFAULT_MODE);
parser grammar HashParser
;
options {
tokenVocab = HashLexer;
}
init: (TEXT | placeholder)* EOF;
placeholder: HASH LEXEME? HASH_TERM;
When I try to parse your input "The good text with a #LEXEME#followed# by hashes of death#############" however, I get the following token stream:
[#0,0:20='The good text with a ',<TEXT>,1:0]
[#1,21:21='#',<HASH>,1:21]
[#2,22:27='LEXEME',<LEXEME>,1:22]
[#3,28:28='#',<HASH_TERM>,1:28]
[#4,29:36='followed',<TEXT>,1:29]
[#5,37:37='#',<HASH>,1:37]
[#6,39:40='by',<LEXEME>,1:39]
[#7,42:47='hashes',<LEXEME>,1:42]
[#8,49:50='of',<LEXEME>,1:49]
[#9,52:56='death',<LEXEME>,1:52]
[#10,57:57='#',<HASH_TERM>,1:57]
[#11,58:58='#',<HASH>,1:58]
[#12,59:59='#',<HASH_TERM>,1:59]
[#13,60:60='#',<HASH>,1:60]
[#14,61:61='#',<HASH_TERM>,1:61]
[#15,62:62='#',<HASH>,1:62]
[#16,63:63='#',<HASH_TERM>,1:63]
[#17,64:64='#',<HASH>,1:64]
[#18,65:65='#',<HASH_TERM>,1:65]
[#19,66:66='#',<HASH>,1:66]
[#20,67:67='#',<HASH_TERM>,1:67]
[#21,68:68='#',<HASH>,1:68]
[#22,69:69='#',<HASH_TERM>,1:69]
[#23,70:70='\n',<TEXT>,1:70]
[#24,71:70='<EOF>',<EOF>,2:0]
The # after followed pushes us into the PLACEHOLDER_MODE so " by hashes of death" is Lexed in PLACEHOLDER mode and generates recognition errors as it does not match the LEXEME rule. And you get the following parse tree:
This seems the correct interpretation of your input (assuming that #s act like ( and ) to bracket some input, then you're going to get situations like this when they're not matched up correctly. The only solution to that would be to relax the grammar quite a bit and handle more of the validation in a a listener/visitor.

ANTLR4 disambiguation of terminal tokens

This is my grammar in ANTLR4:
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
When I type in something like:
hello buddy
I got the following error message:
line 1 missing WORD at 'hello'
But, if I change the grammar in
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [1-9]+ ;
WS : [ \t\r\n]+ -> skip ;
where now WORD is a number, everything is ok.
I strongly suspect that since in the first grammar we have two terminal node with the same regex, the parser doesn't know the correspondance of the real word.
So am I wrong thinking of it? If not, how would you solve this issue keeping more than one terminal with the same regex?

You cannot have two terminals that match the same pattern.
If your grammar actually needs to match twice [a-z]+, then use a production like
r : WORD WORD ;
and the discrimination will be done at the parser / tree traversal level.
If either WORD or ID can be restricted to a fixed list, you could declare all the possible words as terminals then use them to define e.g. what a WORD can be.

where now WORD is a number, everything is ok.
Not really :
$ alias
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Hello r -tokens data.txt
[#0,0:4='hello',<ID>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
line 1:0 missing WORD at 'hello'
When the lexer can match some input with two rules, there is an ambiguity, and it chooses the first rule. With a hello buddy input, the lexer produces two ID tokens
with the first grammar, because it's ambiguous and ID comes first
with the second grammar, the input can only be matched by ID WS ID
You can disambiguate with a predicate in the lexer rule like so :
grammar Question;
/* Ambiguous input */
file
: HELLO ID
;
HELLO
: [a-z]+ {getText().equals("hello")}? ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Execution :
$ grun Question file -tokens data.txt
[#0,0:4='hello',<HELLO>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
More on semantic predicates in The Definitive ANTLR Reference.

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If use this grammar, and use the sample text: abc and d I get a weird tree with unexpected structure as shown below(using IntelliJ and ANTLR4 plug in):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.

The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;

ANTLR4 Negative lookahead workaround?

I'm using antlr4 and I'm trying to make a parser for Matlab. One of the main issue there is the fact that comments and transpose both use single quotes. What I was thinking of a solution was to define the STRING lexer rule in somewhat the following manner:
(if previous token is not ')','}',']' or [a-zA-Z0-9]) than match '\'' ( ESC_SEQ | ~('\\'|'\''|'\r'|'\n') )* '\'' (but note I do not want to consume the previous token if it is true).
Does anyone knows a workaround this problem, as it does not support negative lookaheads?

You can do negative lookahead in ANTLR4 using _input.LA(-1) (in Java, see how to resolve simple ambiguity or ANTLR4 negative lookahead in lexer).
You can also use lexer mode to deal with this kind of stuff, but your lexer had to be defined in its own file. The idea is to go from a state that can match some tokens to another that can match new ones.
Here is an example from ANTLR4 lexer documentation:
// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
...
// ----------------- Everything INSIDE of a tag ------------------ ---
mode INSIDE;
CLOSE : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE : '/>' -> popMode ;

ANTLR - identifier with whitespace

i want identifiers that can contain whitespace.
grammar WhitespaceInSymbols;
premise : ( options {greedy=false;} : 'IF' ) id=ID{
System.out.println($id.text);
};
ID : ('a'..'z'|'A'..'Z')+ (' '('a'..'z'|'A'..'Z')+)*
;
WS : ' '+ {skip();}
;
When i test this with "IF statement analyzed" i get a MissingTokenException and the output "IF statement analyzed".
I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token. But instead the IF is part of the ID.
Is there a way to achieve my goal? I already tried some variations of the greed=false-option, but without success.

I thought, that by using greedy=false i could tell ANTLR to exit afer 'IF' and take it as a token.
No, the parser has nothing to say about the creation of tokens: the input is first tokenized and then the parser rules are applied on these tokens. So setting greedy=false has no effect.
You can do this (creating ID tokens with white spaces), but it will be a horrible solution with many predicates, and a few custom methods in the lexer doing manual look-aheads: you really, really don't want this! A much cleaner solution would be to introduce a id rule in your parser and let it match one or more ID tokens.
A demo:
grammar WhitespaceInSymbols;
premise
: IF id THEN EOF
;
id
: ID+
;
IF
: 'IF'
;
THEN
: 'THEN'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WS
: ' '+ {skip();}
;
would parse the input IF statement analyzed THEN into the following tree:

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Getting the index of an attribute in parser in antlr4 - grammar

Every parser rule is a ParserRuleContext, which has a start and stop token. So, try this instead: System.out.println($second_part.start.getStartIndex());

Related

Lex matching doesn't enter recursive rule as expected

ANTLR4 disambiguation of terminal tokens

Why isn't antlr 4 breaking my tokens up as expected?

ANTLR4 Negative lookahead workaround?

ANTLR - identifier with whitespace

Categories

Resources