Lex matching doesn't enter recursive rule as expected - antlr

I am trying to match words between # characters. Here is my attempt:
init : (TEXT | HASH | placeholder) init? EOF ;
placeholder : HASH lexeme HASH ;
lexeme : LEXEME;
HASH : '#' ;
LEXEME : [a-zA-Z0-9-_]+ ;
TEXT : ~'#'+ ;
My input string: "The good text with a #LEXEME#followed# by hashes of death#############"
And the resulting ParseTree:
I'm expecting the "followed" word to be parsed as a TEXT in the next recursive init but it looks like it's parsed in the same init iteration, thus not recognized. This happens every time a pattern like #letters#letters# is encountered.
How do I solve this?

It looks like you want the #s to mark the start and stop of your placeholders (aka LEXEMEs). You could do that by breaking the grammar into a Lexer grammar and a Parser grammar:
lexer grammar HashLexer
;
HASH: '#' -> mode(PLACEHOLDER_MODE);
TEXT: ~'#'+;
mode PLACEHOLDER_MODE
;
LEXEME: [a-zA-Z0-9\-_]+;
HASH_TERM: '#' -> mode(DEFAULT_MODE);
parser grammar HashParser
;
options {
tokenVocab = HashLexer;
}
init: (TEXT | placeholder)* EOF;
placeholder: HASH LEXEME? HASH_TERM;
When I try to parse your input "The good text with a #LEXEME#followed# by hashes of death#############" however, I get the following token stream:
[#0,0:20='The good text with a ',<TEXT>,1:0]
[#1,21:21='#',<HASH>,1:21]
[#2,22:27='LEXEME',<LEXEME>,1:22]
[#3,28:28='#',<HASH_TERM>,1:28]
[#4,29:36='followed',<TEXT>,1:29]
[#5,37:37='#',<HASH>,1:37]
[#6,39:40='by',<LEXEME>,1:39]
[#7,42:47='hashes',<LEXEME>,1:42]
[#8,49:50='of',<LEXEME>,1:49]
[#9,52:56='death',<LEXEME>,1:52]
[#10,57:57='#',<HASH_TERM>,1:57]
[#11,58:58='#',<HASH>,1:58]
[#12,59:59='#',<HASH_TERM>,1:59]
[#13,60:60='#',<HASH>,1:60]
[#14,61:61='#',<HASH_TERM>,1:61]
[#15,62:62='#',<HASH>,1:62]
[#16,63:63='#',<HASH_TERM>,1:63]
[#17,64:64='#',<HASH>,1:64]
[#18,65:65='#',<HASH_TERM>,1:65]
[#19,66:66='#',<HASH>,1:66]
[#20,67:67='#',<HASH_TERM>,1:67]
[#21,68:68='#',<HASH>,1:68]
[#22,69:69='#',<HASH_TERM>,1:69]
[#23,70:70='\n',<TEXT>,1:70]
[#24,71:70='<EOF>',<EOF>,2:0]
The # after followed pushes us into the PLACEHOLDER_MODE so " by hashes of death" is Lexed in PLACEHOLDER mode and generates recognition errors as it does not match the LEXEME rule. And you get the following parse tree:
This seems the correct interpretation of your input (assuming that #s act like ( and ) to bracket some input, then you're going to get situations like this when they're not matched up correctly. The only solution to that would be to relax the grammar quite a bit and handle more of the validation in a a listener/visitor.

Related

Antlr4: Mismatch input

My grammar:
qwe.g4
grammar qwe;
query
: COLUMN OPERATOR value EOF
;
COLUMN
: [a-z_]+
;
OPERATOR
: ('='|'>'|'<')
;
SCALAR
: [a-z_]+
;
value
: SCALAR
;
WS : [ \t\r\n]+ -> skip ;
Handling in Python code:
qwe.py
from antlr4 import InputStream, CommonTokenStream
from qweLexer import qweLexer
from qweParser import qweParser
conditions = 'total_sales>qwe'
lexer = qweLexer(InputStream(conditions))
stream = CommonTokenStream(lexer)
parser = qweParser(stream)
tree = parser.query()
for child in tree.getChildren():
print(child, '==>', type(child))
Running qwe.py outputs error when parsing (lexing?) value:
How to fix that?
I read some and suppose that there is something to do with COLUMN rule that also matches value...
Your COLUMN and SCALAR lexer rules are identical. When the LExer matches two rules, then the rule that recognizes the longest token will win. If the token lengths are the same (as the are here), the the first rule wins.
Your Token Stream will be COLUMN OPERATOR COLUMN
That's thy the query rule won't match.
As a general practice, it's good to use the grun alias (that the setup tutorial will tell you how to set up) to dump the token stream.
grun qwe tokens -tokens < sampleInputFile
Once that gives you the expected output, you'll probably want to use the grun tool to display parse trees of your input to verify that is correct. All of this can be done without hooking up the generated code into your target language, and helps ensure that your grammar is basically correct before you wire things up.

ANTLR Grammar to parse a Asterisk-delimited input

I am attempting to use ANTLR (v4) to create a parser generator for a asterisk-delimited list encapsulated by START and END markers.
START**na**na**aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
Where a normal input string would be something like:
START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
I would still need to be able to allow spaces, tabs, null/empty fields (basically any character except START, END, * between the asterisks.
that includes things like ** * * *asdf fdsa* * asdf *
Here is my grammar so far:
parseIt: ENTRY ;
ENTRY : 'START*' FIELD_SET 'END' ;
fragment Delim : '*' ;
fragment Data : (ANY | WS)* ;
fragment FIELD_SET : Data (Delim Data|Delim)* ;
I can recognize simple input (like the first example I gave), but am having trouble recognizing tokens that have spaces or special characters between the asterisks.
I’m pretty sure you could handle this with a RegEx and capture groups, but if you really want to use ANTLR…
The following works:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+ {!(
(getText().endsWith("E") && _input.LA(1) == (int) 'N' && _input.LA(2) == (int) 'D') ||
(getText().endsWith("EN") && _input.LA(1) == (int) 'D') ||
(getText().endsWith("END")))}?;
and gives the following parse tree (for you first input) (click on it to view it full size):
Unfortunately for you, the way the lexer works, a simple lexer rule like Data : ~[*]+ will preferentially match aEND over your END implied lexer rule, because the ANTLR lexer uses the rule that matches the longest sequence ion input characters, and Data : ~[*]+ matches aEND while END only matches END (ANTLR also, doesn't look ahead for token matches). As a result the rather tortured semantic predicate is the only way to disallow a token that is a stream of characters that ends with END.
(Note: Semantic predicates a target-language specific, and this predicate is for Java. Other targets would require the equivalent int that target language.)
Another approach would be to check if your input endswith(“END”), and then just remove it prior to parsing using this grammar:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+;
This avoids the END token problem by just removing it from the input stream. Given that it's the very end of the stream, this might be simpler.

Why isn't antlr 4 breaking my tokens up as expected?

So I am fairly new to ANTLR 4. I have stripped down the grammar as much as I can to show the problem:
grammar DumbGrammar;
equation
: expression (AND expression)*
;
expression
: ID
;
ID : LETTER(LETTER|DIGIT)* ;
AND: 'and';
LETTER: [a-zA-Z_];
DIGIT : [0-9];
WS : [ \r\n\t] + -> channel (HIDDEN);
If use this grammar, and use the sample text: abc and d I get a weird tree with unexpected structure as shown below(using IntelliJ and ANTLR4 plug in):
If I simply change the terminal rule AND: 'and'; to read AND: '&&'; and then submit abc && d as input I get the following tree, as expected:
I cannot figure out why it isn't parsing "and" correctly, but does parse '&&' correctly.
The input "and" is being tokenized as an ID token. Since both ID and AND match the input "and", ANTLR needs to make a decision which token to choose. It takes ID since it was defined before AND.
The solution: define AND before ID:
AND: 'and';
ID : LETTER(LETTER|DIGIT)* ;

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If i try to parse the input "Hello", everything is OK, BUT if if modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....