Parse emails using antlr - antlr

I've tried for an entire week to build using antlr a grammar that allows me to parse an email message.
My goal is not to parse exhaustively the entire email into tokens but into relevant sections.
Here is the document format that I have to deal with. // depict inline comments that are not part of the message:
Subject : [SUBJECT_MARKER] + lorem ipsum...
// marks a message that needs to be parsed.
// Subject marker can be something like "help needed", "action required"
Body:
// irrelevant text we can ignore, discard or skip
Hi George,
Hope you had a good weekend. Another fluff sentence ...
// end of irrelevant text
// beginning of the SECTION_TYPE_1. SECTION_TYPE_1 marker is "answers below:"
[SECTION_TYPE_1]
Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SECTION_END_MARKER] // this is "\n\n"
// SENTENCE_MARKER can be "a)", "b)" or anything that is in the form "[a-zA-Z]')'"
// one important requirement is that this SENTENCE_MARKER matches only inside a section. Either SECTION_TYPE_1 or SECTION_TYPE_2
// alternatively instead of [SECTION_TYPE_1] we can have [SECTION_TYPE_2].
// if we have SECTION_TYPE_1 then try to parse SECTION_TYPE_1 else try to parse SECTION_TYPE_2.enter code here
[SECTION_TYPE_2] // beginning of the section type 1;
Meaningful text bloc that needs capturing. Many sentences Lorem ipsum ...
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SECTION_END_MARKER] // same as above
The problems I'm facing are the following:
I didn't figure out a good way to skip text at the beginning of the
message and start applying the parsing rules only after a marker has
been found. SECTION_TYPE_1
Capture all the text inside a section between section start and the sentence markers
After a SECTION_END marker ignore all the text that comes afterwards.

Antlr is a parser for structured, ideally unambiguously structured, texts. Unless your source messages have relatively well-defined features that reliably mark the message parts of interest, Antlr is unlikely to work.
A better approach would be to use a natural language processor (NLP) package to identify the form and object of each sentence or phrase to thereby identify those of interest. The Stanford NLP package is quite well known (Github).
Update
The necessary grammar will be of the form:
message : subject ( sec1 | sec2 | fluff )* EOF ;
subject : fluff* SUBJECT_MARKER subText EOL ;
subText : ( word | HWS )+ ;
sec1 : ( SECTION_TYPE_1 content )+ SECTION_END_MARKER ;
sec2 : ( SECTION_TYPE_2 content )+ SECTION_END_MARKER ;
content : ( word | ws )+ ;
word : CHAR+ ;
ws : ( EOL | HWS )+ ;
fluff : . ;
SUBJECT_MARKER : 'marker' ;
SECTION_TYPE_1 : 'text1' ;
SECTION_TYPE_2 : 'text2' ;
SENTENCE_MARKER : [a-zA-Z0-9] ')' ;
EOL : '\r'? '\n';
HWS : [ \t] ;
CHAR : . ;
Success will depend on just how unambiguous the various markers are -- and it is a given that there will be ambiguities. Either modify the grammar to handle the ambiguities explicitly or defer to the tree-walk/analysis phase to resolve.

Related

Lex matching doesn't enter recursive rule as expected

I am trying to match words between # characters. Here is my attempt:
init : (TEXT | HASH | placeholder) init? EOF ;
placeholder : HASH lexeme HASH ;
lexeme : LEXEME;
HASH : '#' ;
LEXEME : [a-zA-Z0-9-_]+ ;
TEXT : ~'#'+ ;
My input string: "The good text with a #LEXEME#followed# by hashes of death#############"
And the resulting ParseTree:
I'm expecting the "followed" word to be parsed as a TEXT in the next recursive init but it looks like it's parsed in the same init iteration, thus not recognized. This happens every time a pattern like #letters#letters# is encountered.
How do I solve this?
It looks like you want the #s to mark the start and stop of your placeholders (aka LEXEMEs). You could do that by breaking the grammar into a Lexer grammar and a Parser grammar:
lexer grammar HashLexer
;
HASH: '#' -> mode(PLACEHOLDER_MODE);
TEXT: ~'#'+;
mode PLACEHOLDER_MODE
;
LEXEME: [a-zA-Z0-9\-_]+;
HASH_TERM: '#' -> mode(DEFAULT_MODE);
parser grammar HashParser
;
options {
tokenVocab = HashLexer;
}
init: (TEXT | placeholder)* EOF;
placeholder: HASH LEXEME? HASH_TERM;
When I try to parse your input "The good text with a #LEXEME#followed# by hashes of death#############" however, I get the following token stream:
[#0,0:20='The good text with a ',<TEXT>,1:0]
[#1,21:21='#',<HASH>,1:21]
[#2,22:27='LEXEME',<LEXEME>,1:22]
[#3,28:28='#',<HASH_TERM>,1:28]
[#4,29:36='followed',<TEXT>,1:29]
[#5,37:37='#',<HASH>,1:37]
[#6,39:40='by',<LEXEME>,1:39]
[#7,42:47='hashes',<LEXEME>,1:42]
[#8,49:50='of',<LEXEME>,1:49]
[#9,52:56='death',<LEXEME>,1:52]
[#10,57:57='#',<HASH_TERM>,1:57]
[#11,58:58='#',<HASH>,1:58]
[#12,59:59='#',<HASH_TERM>,1:59]
[#13,60:60='#',<HASH>,1:60]
[#14,61:61='#',<HASH_TERM>,1:61]
[#15,62:62='#',<HASH>,1:62]
[#16,63:63='#',<HASH_TERM>,1:63]
[#17,64:64='#',<HASH>,1:64]
[#18,65:65='#',<HASH_TERM>,1:65]
[#19,66:66='#',<HASH>,1:66]
[#20,67:67='#',<HASH_TERM>,1:67]
[#21,68:68='#',<HASH>,1:68]
[#22,69:69='#',<HASH_TERM>,1:69]
[#23,70:70='\n',<TEXT>,1:70]
[#24,71:70='<EOF>',<EOF>,2:0]
The # after followed pushes us into the PLACEHOLDER_MODE so " by hashes of death" is Lexed in PLACEHOLDER mode and generates recognition errors as it does not match the LEXEME rule. And you get the following parse tree:
This seems the correct interpretation of your input (assuming that #s act like ( and ) to bracket some input, then you're going to get situations like this when they're not matched up correctly. The only solution to that would be to relax the grammar quite a bit and handle more of the validation in a a listener/visitor.

ANTLR Grammar to parse a Asterisk-delimited input

I am attempting to use ANTLR (v4) to create a parser generator for a asterisk-delimited list encapsulated by START and END markers.
START**na**na**aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
Where a normal input string would be something like:
START*na*na*aa*aa*a*asdfaaa*aaDDFdasa*aaaffdda*aa*aassda*ataaaaaaaaa*a*a*aEND
I would still need to be able to allow spaces, tabs, null/empty fields (basically any character except START, END, * between the asterisks.
that includes things like ** * * *asdf fdsa* * asdf *
Here is my grammar so far:
parseIt: ENTRY ;
ENTRY : 'START*' FIELD_SET 'END' ;
fragment Delim : '*' ;
fragment Data : (ANY | WS)* ;
fragment FIELD_SET : Data (Delim Data|Delim)* ;
I can recognize simple input (like the first example I gave), but am having trouble recognizing tokens that have spaces or special characters between the asterisks.
I’m pretty sure you could handle this with a RegEx and capture groups, but if you really want to use ANTLR…
The following works:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+ {!(
(getText().endsWith("E") && _input.LA(1) == (int) 'N' && _input.LA(2) == (int) 'D') ||
(getText().endsWith("EN") && _input.LA(1) == (int) 'D') ||
(getText().endsWith("END")))}?;
and gives the following parse tree (for you first input) (click on it to view it full size):
Unfortunately for you, the way the lexer works, a simple lexer rule like Data : ~[*]+ will preferentially match aEND over your END implied lexer rule, because the ANTLR lexer uses the rule that matches the longest sequence ion input characters, and Data : ~[*]+ matches aEND while END only matches END (ANTLR also, doesn't look ahead for token matches). As a result the rather tortured semantic predicate is the only way to disallow a token that is a stream of characters that ends with END.
(Note: Semantic predicates a target-language specific, and this predicate is for Java. Other targets would require the equivalent int that target language.)
Another approach would be to check if your input endswith(“END”), and then just remove it prior to parsing using this grammar:
grammar asterisks;
parseIt: 'START' dataItem* 'END' EOF;
dataItem: Delim Data?;
Delim : '*' ;
Data : ~[*]+;
This avoids the END token problem by just removing it from the input stream. Given that it's the very end of the stream, this might be simpler.

Ambiguous Lexer rules in Antlr

I have an antlr grammar with multiple lexer rules that match the same word. It can't be resolved during lexing, but with the grammar, it becomes unambiguous.
Example:
conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';
Input: 1 in in meters
The word "in" matches the lexer rules UNIT and CONVERT.
How can this be solved while keeping the grammar file readable?
When an input matches two lexer rules, ANTLR chooses either the longest or the first, see disambiguate. With your grammar, in will be interpreted as UNIT, never CONVERT, and the rule
conversion: NUMBER UNIT CONVERT UNIT;
can't work because there are three UNIT tokens :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<UNIT>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<UNIT>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<UNIT>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
What you can do is to have only ID or TEXT tokens and distinguish them with a label, like this :
grammar Question;
question
#init {System.out.println("Question last update 0132");}
: conversion NL EOF
;
conversion
: NUMBER unit1=ID convert=ID unit2=ID
{System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
" to convert " + $convert.text + " " + $unit2.text);}
;
ID : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;
NUMBER : DIGIT+ ;
NL : [\r\n] ;
WS : [ \t] -> channel(HIDDEN) ; // -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
Execution :
$ grun Question question -tokens -diagnostics input.txt
[#0,0:0='1',<NUMBER>,1:0]
[#1,1:1=' ',<WS>,channel=1,1:1]
[#2,2:3='in',<ID>,1:2]
[#3,4:4=' ',<WS>,channel=1,1:4]
[#4,5:6='in',<ID>,1:5]
[#5,7:7=' ',<WS>,channel=1,1:7]
[#6,8:13='meters',<ID>,1:8]
[#7,14:14='\n',<NL>,1:14]
[#8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters
Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.
Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.
In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.
In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:
conversion: NUMBER TEXT TEXT TEXT
and validating the text values in your ANTLR listener/tree-walker/etc.
EDIT
Thanks for updating your question with lexer rules. It's clear now why it's failing - as BernardK points out, antlr will always choose the first matching lexer rule. This means it's impossible for the second of two ambiguous lexer rules to match, and makes your proposed design infeasible.
My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.
Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" from the IDENT rule. The only time an IDENT token will be created is when there's no more \n at the end. That is what's going wrong with your grammars: the error messages states that it expects a IDENT token, which isn't found (but a REST_OF_LINE token is produced).
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
antlr parsing is usually done in 2 steps.
1. construct your ast
2. define your grammer
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+
Hope that helps....