Xtext rule that negates other valid matches - grammar

I am fairly new to Xtext, so it is possible I am asking the wrong thing, or using incorrect terminology. Please keep this in mind in your responses.
I am attempting to implement the JBehave EBNF spec from scratch in Xtext as a learning exercise. JBehave is a very "wordy" grammar, similar to the one I will need to maintain, so I need to understand how to handle various types of "words" in different contexts.
I have been able to get this test case to pass as a starting point.
@Test
def void loadModel() {
// Multi-line
var story = parseHelper.parse('''
The quick brown fox
Jumps over the lazy dog
''')
assertThat(story, notNullValue())
assertThat(
story.description,
equalTo('''
The quick brown fox
Jumps over the lazy dog
''')
)
// Single-line description
story = parseHelper.parse('''
The quick brown fox
''')
assertThat(
story.description,
equalTo("The quick brown fox\n")
)
}
Using this grammar definition...
grammar org.example.jbehave.JBehave hidden (WS)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate jbehave "http://www.example.org/jbehave"
// The story describes a feature via description, narrative and a set of scenarios
// Story := Description? Meta? Narrative? GivenStories? Lifecycle? Scenario+ ;
Story:
description=Description?
;
// The Description is expressed by any sequence of words that must not contain any keywords at start of lines.
// Description := (Word Space?)* ;
Description:
((WORD) (WORD)* EOL+)+
// ((NON_KEYWORD) (WORD)* EOL+)+
;
// Key Words
////
// TODO: parser fails when uncommented
//terminal NON_KEYWORD:
// !(IN_ORDER_TO
// | AS_A
// | I_WANT_TO
// | SO_THAT
// | SCENARIO
// | GIVEN_STORIES
// | GIVEN
// | THEN
// | WHEN
// | AND
// )
//;
terminal fragment IN_ORDER_TO: "In order to";
terminal fragment AS_A: "As a";
terminal fragment I_WANT_TO: "I want to";
terminal fragment SO_THAT: "So that";
terminal fragment SCENARIO: "Scenario:";
terminal fragment GIVEN_STORIES: "GivenStories:";
terminal fragment GIVEN: "Given";
terminal fragment WHEN: "When";
terminal fragment THEN: "Then";
terminal fragment AND: "And";
// Common Terminals
////
terminal WORD: ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal WS: (' '|'\t')+;
terminal EOL: NL;
terminal fragment NL: ('\r'? '\n');
The problems I am running into are outlined in the comments.
When I uncomment terminal NON_KEYWORD, the test fails with
Expected: "The quick brown fox\nJumps over the lazy dog\n"
but: was "The"
If I then replace the line commented out in Description, the test fails to parse at all with
Expected: not null
but: was null
I sort of understand what is happening here in a vague sense. Tokens I define before WORD are also valid words, and so it is throwing off the parser. Therefore my questions are as follows.
1. Where in the Xtext documentation (or other sources) can I find a description of the underlying principles that are in effect here? I've read the Xtext docs many times by now, but all I could find was a brief note on the order-dependence of terminal rules.
2. What is a good way to debug how the parser is interpreting my grammar? Is there something similar to dumping IFormattableDocument to the console, but for the lexer/parser/whatever?
3. And finally, what is the best way to tackle this problem from an Xtext perspective? Should I be looking into custom Data Types, or is this expressible in pure Xtext?
I am seeking to understand the underlying principles.
Update
Well this is certainly odd. I attempted to move past this for now and implement the next part of the spec.
; The narrative is identified by keyword "Narrative:" (or equivalent in I18n-ed locale),
; It is followed by the narrative elements
Narrative:= "Narrative:" ( InOrderTo AsA IWantTo | AsA IWantTo SoThat ) ;
I actually couldn't get this working on its own. However, when I uncommented the original code and tried them together, it works!
@Test
def void narrativeOnly() {
var story = _th.parse('''
Narrative:
In order check reports
As a Developer
I want to workin with todos using examples
''')
assertThat(story, notNullValue())
}
@Test
def void descriptionOnly() {
// Multi-line
var story = _th.parse('''
The quick brown fox
Jumps over the lazy dog
''')
assertThat(story, notNullValue())
assertThat(
story.description,
equalTo('''
The quick brown fox
Jumps over the lazy dog
''')
)
// Single-line description
story = _th.parse('''
The quick brown fox
''')
assertThat(
story.description,
equalTo("The quick brown fox\n")
)
}
grammar org.agileware.natural.jbehave.JBehave hidden (WS)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate jbehave "http://www.agileware.org/natural/jbehave"
// Story
////
// The story describes a feature via description, narrative and a set of scenarios
// Story := Description? Meta? Narrative? GivenStories? Lifecycle? Scenario+ ;
Story:
description=Description?
narrative=Narrative?
;
// Narrative
////
// The narrative is identified by keyword "Narrative:" (or equivalent in I18n-ed locale),
// It is followed by the narrative elements
// Narrative:= "Narrative:" ( InOrderTo AsA IWantTo | AsA IWantTo SoThat ) ;
// The narrative element content is any sequence of characters that do not match a narrative starting word
// NarrativeElementContent := ? Any sequence of NarrativeCharacter that does not match NarrativeStartingWord ? ;
Narrative:
'Narrative:'
inOrderTo=InOrderTo
asA=AsA
wantTo=IWantTo
;
// InOrderTo:= "In order to" NarrativeElementContent ;
InOrderTo:
IN_ORDER_TO (WORD) (WORD)* EOL+;
// AsA:= "As a" NarrativeElementContent ;
AsA:
AS_A (WORD) (WORD)* EOL+;
// IWantTo:= "I want to" NarrativeElementContent ;
IWantTo:
I_WANT_TO (WORD) (WORD)* EOL+;
// SoThat:= "So that" NarrativeElementContent ;
SoThat:
SO_THAT (WORD) (WORD)* EOL+;
// The Description is expressed by any sequence of words that must not contain any keywords at start of lines.
// Description := (Word Space?)* ;
Description:
((WORD) (WORD)* EOL+)+
;
// Key Words
////
//terminal NON_KEYWORD:
// !(IN_ORDER_TO
// | AS_A
// | I_WANT_TO
// | SO_THAT
// | SCENARIO
// | GIVEN_STORIES
// | GIVEN
// | THEN
// | WHEN
// | AND
// )
//;
terminal IN_ORDER_TO: "In order to";
terminal AS_A: "As a";
terminal I_WANT_TO: "I want to";
terminal SO_THAT: "So that";
//terminal SCENARIO: "Scenario:";
//terminal GIVEN_STORIES: "GivenStories:";
//terminal GIVEN: "Given";
//terminal WHEN: "When";
//terminal THEN: "Then";
//terminal AND: "And";
// Common Terminals
////
terminal WORD: (LETTER)(LETTER|DIGIT)*;
terminal fragment LETTER: ('a'..'z'|'A'..'Z');
terminal fragment DIGIT: ('0'..'9');
terminal WS: (' '|'\t')+;
terminal EOL: NL;
terminal fragment NL: ('\r'? '\n');
This takes care of #3 I guess, but arriving there by accident sort of defeats the purpose. I will now accept any answer that can point me to, or describe to me, the underlying principles that cause the behavior I've described.
Why can't I just match a random group of words? How does defining the narrative assignment along with the description assignment in Story change how the parser interprets the grammar?

I have been able to answer all 3 of my questions using ANTLRWorks, a GUI tool distributed as an executable jar with the express purpose of debugging, visualizing, and helping one understand the behavior of the parser.
To get this working with Xtext, one needs to add the following to their MWE2 generator:
language = StandardLanguage {
// ...
parserGenerator = {
debugGrammar = true
}
}
Then open the generated debug file in the ANTLRWorks tool and hit the "Bug" (debug) icon. This file should be located at src-gen/*/parser/antlr/internal/DebugInternal*.g
Source: https://blogs.itemis.com/en/debugging-xtext-grammars-what-to-do-when-your-language-is-ambiguous
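One principle behind the behavior described in the question is that the generated lexer splits the input into tokens before the parser runs, and it does so without knowing which parser rule is currently active. The Python sketch below is a deliberately simplified model of that idea, not ANTLR's actual lexer algorithm: rules are tried in declaration order and the first one that matches wins. The rule table and the `tokenize` helper are hypothetical names made up for this illustration.

```python
import re

# Hypothetical, simplified terminal rules in declaration order (keywords first).
# This models "first matching rule wins"; it is NOT how ANTLR decides in detail.
RULES = [
    ("IN_ORDER_TO", r"In order to"),
    ("AS_A", r"As a"),
    ("WORD", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("WS", r"[ \t]+"),
    ("EOL", r"\r?\n"),
]


def tokenize(text, rules=RULES):
    """Split text into (rule name, lexeme) pairs with no knowledge of the parser."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in compiled:
            match = regex.match(text, pos)
            if match:
                if name != "WS":  # WS is hidden, as in `hidden (WS)`
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens


# Same rules, but with WORD declared before the keyword terminals.
WORD_FIRST = [RULES[2]] + RULES[:2] + RULES[3:]

print(tokenize("As a Developer\n"))
print(tokenize("As always\n"))
print(tokenize("As a Developer\n", WORD_FIRST))
```

In this model the second call splits "As always" into the keyword "As a" followed by the word "lways", and the third call never produces the keyword at all. The parser has no say in either outcome, which is why multi-word keyword terminals and a general WORD terminal interfere with each other. The debug grammar opened in ANTLRWorks remains the authoritative view of what the real lexer does.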

Related

Erratic parser. Same grammar, same input, cycles through different results. What am I missing?

I'm writing a basic parser that reads from stdin and prints results to stdout. The problem is that I'm having trouble with this grammar:
%token WORD NUM TERM
%%
stmt: /* empty */
| word word term { printf("[stmt]\n"); }
| word number term { printf("[stmt]\n"); }
| word term
| number term
;
word: WORD { printf("[word]\n"); }
;
number: NUM { printf("[number]\n"); }
;
term: TERM { printf("[term]\n"); /* \n */}
;
%%
When I run the program and type hello world\n, the output is (as I expected) [word] [word] [term] [stmt]. So far, so good, but if I then type hello world\n again, I get syntax error [word][term].
When I type hello world\n for the third time it works, then it fails again, then it works, and so on and so forth.
Am I missing something obvious in here?
(I have some experience with hand-rolled compilers, but I've not used lex/yacc et al.)
This is the main func:
int main() {
do {
yyparse();
} while(!feof(yyin));
return 0;
}
Any help would be appreciated. Thanks!
Your grammar recognises a single stmt. Yacc/bison expect the grammar to describe the entire input, so after the statement is recognised, the parser waits for an end-of-input indication. But it doesn't get one, since you typed a second statement. That causes the parser to report a syntax error. But note that it has now read the first token in the second line.
You are calling yyparse() in a loop and not stopping when you get a syntax error return value. So when you call yyparse() again, it will continue where the last one left off, which is just before the second token in the second line. What remains is just a single word, which it then correctly parses.
What you probably should do is write your parser so that it accepts any number of statements, and perhaps so that it does not die when it hits an error. That would look something like this:
%%
prog: %empty
| prog line
line: stmt '\n' { puts("Got a statement"); }
| error '\n' { yyerrok; /* Simple error recovery */ }
...
Note that I print a message for a statement only after I know that the line was correctly parsed. That usually turns out to be less confusing. But the best solution is not to use printf's at all, but rather to use Bison's trace facility, which is as simple as putting -t on the bison command line and setting the global variable yydebug = 1;. See "Tracing Your Parser" in the Bison manual.
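The shape of that suggestion (a program made of many lines, where a bad line is reported and skipped up to the next newline) can be modelled outside of yacc. The Python sketch below is only an illustration under assumptions: `classify` and `parse_program` are hypothetical helpers, tokens are split on whitespace, and the set of valid shapes is copied from the stmt rule in the question.

```python
def classify(token):
    """Hypothetical lexer: digits are NUM, everything else is WORD."""
    return "NUM" if token.isdigit() else "WORD"


# Token shapes accepted by `stmt` in the question (the trailing TERM is the newline).
VALID_STATEMENTS = {
    (),                # empty statement
    ("WORD", "WORD"),
    ("WORD", "NUM"),
    ("WORD",),
    ("NUM",),
}


def parse_program(text):
    """Model of `prog: %empty | prog line`: one result per input line.

    A line that is not a valid stmt is reported as "error" and parsing
    resumes after its newline, mirroring `error '\\n' { yyerrok; }`.
    """
    results = []
    for line in text.splitlines():
        shape = tuple(classify(token) for token in line.split())
        results.append("stmt" if shape in VALID_STATEMENTS else "error")
    return results


print(parse_program("hello world\nhello world\n1 2 3\n42\n"))
# ['stmt', 'stmt', 'error', 'stmt']
```

Because every line is handled by the same top-level loop, a second "hello world" line is just another statement, and a malformed line does not leave a half-consumed token behind for the next call.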

Split a BibTeX author field into parts

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
token TOP {
<all-text>
}
token all-text {
[<author-part> [[\s* ',' \s*] || [\s* $]]]+
}
token author-part {
[<-[\s,]> || [\s* <!before ','>]]+
}
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C) to abort.
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier which makes matching comma-separated (or anything-separated, really) things easier.
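For comparison only, here is the same "parts separated by a comma with optional whitespace" idea written in Python rather than as a Raku grammar. It is a sketch of the simplified approach mentioned above, not a full BibTeX name parser, and `split_author` is a hypothetical helper name.

```python
import re


def split_author(field):
    """Split an author field on commas, ignoring whitespace around each comma."""
    parts = re.split(r"\s*,\s*", field.strip())
    return [part for part in parts if part]  # drop empty parts such as a trailing comma


print(split_author("Rockhold, Mark L"))
# ['Rockhold', 'Mark L']
```

Treating the comma (plus surrounding whitespace) as the separator means every step consumes at least one character, so there is no zero-width alternative that could repeat without making progress.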

Parse emails using antlr

I've tried for an entire week to build a grammar with ANTLR that allows me to parse an email message.
My goal is not to parse the entire email exhaustively into tokens, but to split it into relevant sections.
Here is the document format that I have to deal with. Lines starting with // are inline comments that are not part of the message:
Subject : [SUBJECT_MARKER] + lorem ipsum...
// marks a message that needs to be parsed.
// Subject marker can be something like "help needed", "action required"
Body:
// irrelevant text we can ignore, discard or skip
Hi George,
Hope you had a good weekend. Another fluff sentence ...
// end of irrelevant text
// beginning of the SECTION_TYPE_1. SECTION_TYPE_1 marker is "answers below:"
[SECTION_TYPE_1]
Meaningful text block that needs capturing, made of many sentences: Lorem ipsum ...
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SECTION_END_MARKER] // this is "\n\n"
// SENTENCE_MARKER can be "a)", "b)" or anything that is in the form "[a-zA-Z]')'"
// one important requirement is that this SENTENCE_MARKER matches only inside a section. Either SECTION_TYPE_1 or SECTION_TYPE_2
// alternatively instead of [SECTION_TYPE_1] we can have [SECTION_TYPE_2].
// if we have SECTION_TYPE_1 then try to parse SECTION_TYPE_1 else try to parse SECTION_TYPE_2.
[SECTION_TYPE_2] // beginning of the section type 2;
Meaningful text block that needs capturing. Many sentences Lorem ipsum ...
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SENTENCE_MARKER] - Sentences that needs to be captured.
[SECTION_END_MARKER] // same as above
The problems I'm facing are the following:
1. I haven't figured out a good way to skip the text at the beginning of the message and start applying the parsing rules only after a marker (SECTION_TYPE_1) has been found.
2. Capture all the text inside a section between the section start and the sentence markers.
3. After a SECTION_END marker, ignore all the text that comes afterwards.
Antlr is a parser for structured, ideally unambiguously structured, texts. Unless your source messages have relatively well-defined features that reliably mark the message parts of interest, Antlr is unlikely to work.
A better approach would be to use a natural language processor (NLP) package to identify the form and object of each sentence or phrase to thereby identify those of interest. The Stanford NLP package is quite well known (Github).
Update
The necessary grammar will be of the form:
message : subject ( sec1 | sec2 | fluff )* EOF ;
subject : fluff* SUBJECT_MARKER subText EOL ;
subText : ( word | HWS )+ ;
sec1 : ( SECTION_TYPE_1 content )+ SECTION_END_MARKER ;
sec2 : ( SECTION_TYPE_2 content )+ SECTION_END_MARKER ;
content : ( word | ws )+ ;
word : CHAR+ ;
ws : ( EOL | HWS )+ ;
fluff : . ;
SUBJECT_MARKER : 'marker' ;
SECTION_TYPE_1 : 'text1' ;
SECTION_TYPE_2 : 'text2' ;
SENTENCE_MARKER : [a-zA-Z0-9] ')' ;
EOL : '\r'? '\n';
HWS : [ \t] ;
CHAR : . ;
Success will depend on just how unambiguous the various markers are -- and it is a given that there will be ambiguities. Either modify the grammar to handle the ambiguities explicitly or defer to the tree-walk/analysis phase to resolve.
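When the markers really are as fixed as the question describes, a plain regular-expression pass may be enough to check that they are unambiguous before investing in a grammar. The Python sketch below deliberately swaps ANTLR for regular expressions. It assumes the section marker is the literal text "answers below:", that a section ends at the first blank line ("\n\n"), and that sentence markers look like "a)" at the start of a line. The function name and the returned dictionary layout are made up for illustration.

```python
import re

# Assumed markers, taken from the comments in the question.
SECTION = re.compile(r"answers below:[ \t]*\n(.*?)(?:\n\n|\Z)", re.DOTALL)
SENTENCE = re.compile(r"^([A-Za-z]\))[ \t]*-?[ \t]*(.*)$", re.MULTILINE)


def extract_section(body):
    """Skip leading fluff, then capture the first section and its marked sentences."""
    section = SECTION.search(body)
    if section is None:
        return None  # no section marker: nothing to parse
    block = section.group(1)
    first = SENTENCE.search(block)
    intro = block[: first.start()].strip() if first else block.strip()
    sentences = [(marker, text.strip()) for marker, text in SENTENCE.findall(block)]
    return {"intro": intro, "sentences": sentences}


sample = (
    "Hi George,\nHope you had a good weekend.\n\n"
    "answers below:\nSome intro text.\n"
    "a) - First answer.\nb) - Second answer.\n\n"
    "Regards,\nc) not captured\n"
)
print(extract_section(sample))
```

Text before the marker is never examined, and everything after the blank line that ends the section is ignored, so the "c)" line in the sample is not reported as a sentence.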

How to have unstructured sections in a file parsed using Antlr

I am creating a translator from my language into many (all?) other object oriented languages. As part of the language I want to support being able to insert target language code sections into the file. This is actually rather similar to how Antlr supports actions in rules.
So I would like to be able to have the sections begin and end with curlies like this:
{ ...target lang code... }
The issue is that it is quite possible that { ... } can show up in the target language code, so I need to be able to match pairs of curlies.
What I want to be able to do is something like this fragment that I've pulled into its own grammar:
grammar target_lang_block;
options
{
output = AST;
}
entry
: target_lang_block;
target_lang_block
: '{' target_lang_code* '}'
;
target_lang_code
: target_lang_block
| NO_CURLIES
;
WS
: (' ' | '\r' | '\t' | '\n')+ {$channel = HIDDEN;}
;
NO_CURLIES
: ~('{'|'}')+
;
This grammar works by itself (at least to the extent I have tested it).
However, when I put these rules into the larger language, NO_CURLIES seems to eat everything and cause MismatchedTokenExceptions.
I'm not sure how to deal with this situation, but it seems that what I want is to be able to turn NO_CURLIES on and off based on whether I'm in target_lang_block, but it does not seem that is possible.
Is it possible? Is there another way?
Thanks
Handle the target_lang_block inside the lexer instead:
Target_lang_block
: '{' (~('{' | '}') | Target_lang_block)* '}'
;
And remove NO_CURLIES, of course.
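What such a recursive lexer rule does can be pictured as a single scan that tracks brace depth and stops when the opening brace is closed again. The following Python sketch shows that idea only; `scan_block` is a hypothetical helper, and, like the rule above, it does not treat braces inside string literals or comments of the target language specially.

```python
def scan_block(text, start):
    """Return (block, end) for the balanced {...} block starting at text[start]."""
    if text[start] != "{":
        raise ValueError("block must start with '{'")
    depth = 0
    for index in range(start, len(text)):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:  # the opening brace has been matched
                return text[start : index + 1], index + 1
    raise ValueError("unbalanced braces")


source = "x = { if (a) { b(); } } rest"
block, end = scan_block(source, source.index("{"))
print(block)        # { if (a) { b(); } }
print(source[end:])  # " rest"
```

The whole block comes back as one unit, which corresponds to the lexer emitting a single Target_lang_block token instead of many NO_CURLIES tokens competing with the rest of the language.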

ANTLR - Implicit AND Tokens In Tree

I’m trying to build a grammar that interprets user-entered text, search-engine style. It will support the AND, OR, NOT and ANDNOT Boolean operators. I have pretty much everything working, but I want to add a rule that two adjacent keywords outside of a quoted string implicitly are treated as in an AND clause. For example:
cheese and crackers = cheese AND crackers
(up and down) or (left and right) = (up AND down) OR (left AND right)
cat dog “potbelly pig” = cat AND dog AND “potbelly pig”
I’m having trouble with the last one, and I’m hoping somebody can point me in the right direction. Here’s my *.g file thus far, and please be nice, my ANTLR experience spans less than a work day:
grammar SearchEngine;
options { language = CSharp2; output = AST; }
@lexer::namespace { Demo.SearchEngine }
@parser::namespace { Demo.SearchEngine }
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9');
fragment QUOTE : ('"');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C');
WS : (SPACE) { $channel=HIDDEN; };
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD : (CHARACTER)+;
startExpression : andExpression;
andExpression : andnotExpression (AND^ andnotExpression)*;
andnotExpression : orExpression (ANDNOT^ orExpression)*;
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
Since your AND-rule has the optional AND-keyword, you should create an imaginary AND-token and use a rewrite-rule to "inject" that token in your tree. In this case, you can't make use of ANTLR's short-hand ^ root-operator. You'll have to use the -> rewrite operator.
Your andExpression should look like:
andExpression
: (andnotExpression -> andnotExpression)
(AND? a=andnotExpression -> ^(AndNode $andExpression $a))*
;
A detailed description of this (perhaps cryptic) notation is given in Chapter 7, section Rewrite Rules in Subrules, page 173-174 of The Definitive ANTLR Reference by Terence Parr.
I ran a quick test to see if the grammar produces the proper AST with the new andExpression rule. After parsing the string cat dog "potbelly and pig" and FOO, the generated parser produced the following AST:
[Image of the AST: http://img580.imageshack.us/img580/7370/andtree.png]
Note that the AndNode and Root are imaginary tokens.
If you want to know how to create the AST picture above, see this thread: Visualizing an AST created with ANTLR (in a .Net environment)
EDIT
When parsing both one two three and (one two) three, the following AST is created:
[Image of the AST: http://img203.imageshack.us/img203/2558/69551879.png]
And when parsing (one two) OR three, the following AST is created:
[Image of the AST: http://img340.imageshack.us/img340/8779/73390353.png]
which seems to be the proper way in all cases.
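The effect of the optional AND keyword can also be shown outside ANTLR with a small hand-written recursive-descent parser. The Python sketch below is an illustration under assumptions, not the generated parser: it follows the rule layering from the question (andExpression, andnotExpression, orExpression, notExpression, atomicExpression), builds nested tuples instead of ANTLR tree nodes, uses `\w+` for words, and only recognizes straight double quotes for phrases. All names in it are made up.

```python
import re

TOKEN = re.compile(r'\s*(?:(\()|(\))|"([^"]*)"|(\w+))')
KEYWORDS = {"AND", "OR", "NOT", "ANDNOT"}


def tokenize(text):
    """Turn a query into (kind, value) pairs; keywords are case-insensitive."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        lparen, rparen, phrase, word = match.groups()
        if lparen:
            tokens.append(("LP", "("))
        elif rparen:
            tokens.append(("RP", ")"))
        elif phrase is not None:
            tokens.append(("PHRASE", phrase))
        else:
            kind = word.upper() if word.upper() in KEYWORDS else "WORD"
            tokens.append((kind, word))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.index = tokens, 0

    def peek(self):
        return self.tokens[self.index][0] if self.index < len(self.tokens) else None

    def take(self):
        token = self.tokens[self.index]
        self.index += 1
        return token

    def and_expr(self):
        node = self.andnot_expr()
        # An explicit AND, or anything that can start an operand, continues the chain.
        while self.peek() in ("AND", "NOT", "WORD", "PHRASE", "LP"):
            if self.peek() == "AND":
                self.take()  # the keyword is optional, the AND node is not
            node = ("AND", node, self.andnot_expr())
        return node

    def andnot_expr(self):
        node = self.or_expr()
        while self.peek() == "ANDNOT":
            self.take()
            node = ("ANDNOT", node, self.or_expr())
        return node

    def or_expr(self):
        node = self.not_expr()
        while self.peek() == "OR":
            self.take()
            node = ("OR", node, self.not_expr())
        return node

    def not_expr(self):
        if self.peek() == "NOT":
            self.take()
            return ("NOT", self.atom())
        return self.atom()

    def atom(self):
        kind = self.peek()
        if kind in ("WORD", "PHRASE"):
            return self.take()[1]
        if kind == "LP":
            self.take()
            node = self.and_expr()
            if self.peek() != "RP":
                raise ValueError("expected ')'")
            self.take()
            return node
        raise ValueError(f"unexpected token {kind}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.and_expr()
    if parser.peek() is not None:
        raise ValueError("trailing input")
    return tree


print(parse('cat dog "potbelly pig"'))
print(parse("(one two) OR three"))
```

The loop in `and_expr` plays the role of `(AND? a=andnotExpression -> ^(AndNode $andExpression $a))*`: each further operand wraps the tree built so far in a new AND node, whether or not the keyword was written.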