ANTLR parsing fails inside a generated syntactic predicate

I have a grammar with backtrack=true similar to the following:
start:
subrules+;
subrules:
...
| rule1
...
rule1:
....
| { Type1 }?
(sentence INT amount NEW_LINE (sentence NEW_LINE)? )
.....
An input like "a valid sentence 1 12.09 \n" on two lines passes, but "A valid sentence \n" without a second line fails.
Stepping through the code while debugging, I can see it fails to match on the code section generated for (desc2=sentence NEW_LINE)?
If I comment out the corresponding lines in the generated code so it doesn't attempt to match (desc2=sentence NEW_LINE)? before going inside the rule, both sentences pass correctly. Is this expected behavior, or is it likely that something else in my grammar (which is currently quite complex) is causing it to behave incorrectly?
Thanks, Richard

Related

ANTLR - Checking for a String's "construction"

Currently working with ANTLR and found out something interesting that's not working as I intended.
When I run something along the lines of "test 10 cm" through my grammar it fails; however, "test 10 c m" works as the former should. The "cm" portion of the code is what I call "wholeunit" in my grammar, and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is: how would I make the grammar still follow the rule "unitmod siunit" without needing a space in the middle? I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped.)
The probable cause is "cm" being considered a single token (possibly the same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in an ANTLR lexer, the rule matching the longest input wins.
One solution would be to make wholeunit a lexer rule rather than a parser rule, and make sure it's above the rule that matches any word (like "test") -- if the same input can be matched by multiple rules, ANTLR selects the first rule in the order they're defined.
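Outside ANTLR, the longest-match behavior described above can be demonstrated with a tiny regex-based lexer. This is a sketch, not ANTLR's machinery; the rule names and patterns are illustrative stand-ins for the grammar in the question:

```python
import re

# Like an ANTLR lexer: the longest match wins; on a tie, the rule
# listed first wins.
RULES = [
    ("SIUNIT", r"m|s|g"),     # base SI units (illustrative subset)
    ("UNITMOD", r"c|k"),      # unit modifiers: centi-, kilo-
    ("WORD", r"[a-z]+"),      # catch-all word rule, matching e.g. "test"
    ("NUM", r"[0-9]+"),
    ("WS", r"[ \t]+"),
]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        # Try every rule at the current position and keep the longest
        # match; ties go to the earlier rule in RULES.
        best = None
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        name, lexeme = best
        if name != "WS":
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

Running it shows exactly the collision from the question: with "cm" the catch-all WORD rule matches two characters and beats UNITMOD's one-character match, so "cm" comes out as a single WORD token, while "c m" produces UNITMOD and SIUNIT tokens.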

Why don't parser combinators backtrack in case of failure?

I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error reporting meaningful for the user:
def value: Parser[Any] =
obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism used in these parsers. On one hand, the Artima guide makes sense: if all productions fail, the parser will arrive at failure("illegal start of value"), which is returned to the user. However, it stops making sense once we understand that the grammar is not a flat list of value alternatives but a tree. That is, the value parser is a node that is called when a value is sensed at the input. This means that the calling parser, which is also a parent, detects the failure of value and proceeds with value's sibling alternative. Suppose that all alternatives to value also fail. The grandparent parser will then try its alternatives. Failing in turn, the process unwinds upward until the starting-symbol parser fails. So, what will be the error message? It seems that the last alternative of the topmost parser would be reported as erroneous.
To figure out who is right, I created a demo where program is the topmost (starting-symbol) parser:
import scala.util.parsing.combinator._
object ExprParserTest extends App with JavaTokenParsers {
// Grammar
val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
val term = wholeNumber | ident
lazy val expr: Parser[_] = term ~ rep("+" ~ expr)
lazy val statement: Parser[_] = ident ~ " = " ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
val program = rep(declaration) ~ rep(statement)
// Test
println(parseAll(program, "1 to 2")) // OK
println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
println(parseAll(program, "abc")) // OK
}
It fails on 1 to '2 due to the extra ' tick. It seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators are actually doing. I wonder: what is the real logic behind rule sensing, backtracking, and error reporting in parser combinators? Why does the error message suggest that no backtracking to declaration -> ident | failure(), nor to statements, occurred? What is the point of the Artima guide's suggestion to place failure() at the end if it is either not reached, as we see, or ignored, as the backtracking logic would imply?
Isn't a parser combinator just a plain dumb PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the starting-symbol parser should return all failed branches; I wonder why and how the actual parser manages to select the most appropriate failure.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the first successful 'or' item and not backtrack. So: 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.
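The committed-choice behavior described here can be illustrated with a minimal PEG-style sketch in Python. None of this is scala-parser-combinators' actual API; the function names are made up for illustration:

```python
# Minimal PEG-style combinators. Each parser takes (text, pos) and
# returns (value, new_pos) on success or (None, pos) on failure.

def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None, pos
    return parse

def seq(*parsers):
    def parse(text, pos):
        values, cur = [], pos
        for p in parsers:
            value, nxt = p(text, cur)
            if value is None:
                return None, pos      # whole sequence fails, rewind
            values.append(value)
            cur = nxt
        return values, cur
    return parse

def choice(*parsers):
    def parse(text, pos):
        for p in parsers:
            value, cur = p(text, pos)
            if value is not None:
                return value, cur     # committed: later failure in an
                                      # enclosing parser won't retry here
        return None, pos
    return parse
```

With these, seq(choice(literal("a"), literal("ab")), literal("c")) fails on the input "abc": choice commits to "a", then "c" doesn't match at position 1, and the alternative "ab" is never revisited, even though "ab" followed by "c" would have succeeded. That is exactly the commit-to-the-first-successful-'or'-item behavior the answer describes.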

ANTLR and Literate Programming

I'm trying to write a parser for a simple, literate language -- with structure similar to PHP. The source might look something like:
blurb blurb blurb
[[ if mode == 5 ]]
then blurb blurb blurb
[[ else ]]
else blurb blurb blurb
[[ end ]]
The non-code sections -- those not nested in [[ ]] -- don't follow any syntax rules; they're just natural language.
However, I'm not sure how to write grammar rules to match the non-code text. I'd welcome any help on how I might do this!
You can treat the non-code text like comments.
To indicate whether a block is code or comment, you can introduce some special symbols, e.g. /* blurb blurb */ or something like that.
So your parser grammar could look like this:
program
: ( if_statement | non_code )*
;
if_statement : '[[' 'if' expression ']]'
...
expression : var OPERATOR var;
var : LITERAL;
non_code : '/*' any_text* '*/';
any_text : LITERAL | DIGIT | SPECIAL_CHAR;
where the remaining tokens are lexer rules:
SPECIAL_CHAR : '-' | '+' ...;
OPERATOR : '<' | '>' ...;
LITERAL : (CHAR | DIGIT)+;
fragment CHAR : 'A'..'Z' | 'a'..'z';
fragment DIGIT : '0'..'9';
EDIT due to comment:
OK, then maybe you can try some kind of preprocessing, or chain parsers. I did something similar some time ago. In your case I would scan the input string with a simple regex to find the code parts, and then internally add some kind of tag around the non-code parts.
Input:
blub blub blah
[[ if express ]]
blah blah blub
--> Preprocess
<non-code>blub blub blah</non-code>
[[ if express ]]
<non-code>blah blah blub</non-code>
--> Parsing using the ANTLR parser and lexer
You might also have a look at tree parsers, where you can reduce your input grammar to just the parts you wish to evaluate by leaving out unnecessary tokens.
It looks like the ANTLR folks identified this task long ago. I guess what I'm trying to build is an island grammar, where islands of syntax appear within a sea of text that has no rules applied.
Chapter 12 of Parr's Definitive ANTLR 4 Reference led me to a solution, which involves switching between sublexers when I hit a delimiter.
In looking at the way GHC manages literate Haskell files, I think the best approach may be a preprocessing step that "deliterates" the source by turning the non-code sections into something that is more formally specified.
Maybe I have an emit function that takes the non-code text as a parameter. I can preprocess the source with something like:
src.gsub(/(\A|\]\])(.*?)(\Z|\[\[)/m, 'emit(\2)')
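A Python sketch of the same preprocessing idea, using the <non-code> tag from the earlier example as an illustrative marker (not a required format):

```python
import re

def deliterate(src):
    """Wrap every non-code span (text outside [[ ... ]] blocks) in
    <non-code> tags so a conventional grammar can match it."""
    out, pos = [], 0
    for m in re.finditer(r"\[\[.*?\]\]", src, re.DOTALL):
        text = src[pos:m.start()]
        if text.strip():
            out.append("<non-code>%s</non-code>" % text.strip())
        out.append(m.group())          # keep the code block verbatim
        pos = m.end()
    tail = src[pos:].strip()
    if tail:
        out.append("<non-code>%s</non-code>" % tail)
    return "\n".join(out)
```

After this pass, the grammar only has to deal with [[ ... ]] blocks and <non-code>...</non-code> blocks, both of which have a regular shape.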

Antlr 3 keywords and identifiers colliding

Surprise: I am building an SQL-like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having the text 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly, I don't see how it even CAN be an issue. I must be missing something super obvious, because other grammars I see have lexer rules like EXISTS, but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now, it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule, since it is defined separately. If you want to accept the keyword as a regular identifier in certain sequences, you will have to include it in the accepted identifier rule.
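The underlying principle, in any lexer, is to match the longest possible identifier first and only then check whether the whole lexeme happens to be a keyword. A Python sketch of that idea (token names and the keyword set are illustrative, not part of the grammar above):

```python
import re

KEYWORDS = {"pct_contains"}

# Match a whole identifier-shaped word, a dot, or whitespace.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z_]*|\.|\s+")

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError("bad character at position %d" % pos)
        lexeme = m.group()
        if not lexeme.isspace():
            # Classify AFTER matching the full word, so 'pct_vac' can
            # never be chopped up by a keyword rule sharing its prefix.
            if lexeme in KEYWORDS:
                kind = "KEYWORD"
            elif lexeme == ".":
                kind = "DOT"
            else:
                kind = "IDENT"
            tokens.append((kind, lexeme))
        pos = m.end()
    return tokens
```

Because the identifier pattern always consumes the whole word before classification, 'attributes.pct_vac' lexes cleanly, and 'pct_contains' still comes out as a keyword.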

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
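Both tokenizer-side tricks -- suppressing NL inside unclosed groups and guaranteeing a trailing NL -- can be sketched as a pass over the token stream. The token shapes below are illustrative, not ANTLR's API:

```python
def filter_newlines(tokens):
    """Drop NL tokens inside unclosed (), [], {} groups and guarantee
    exactly one trailing NL, so every statement ends with an NL."""
    OPEN, CLOSE = {"(", "[", "{"}, {")", "]", "}"}
    depth, out = 0, []
    for kind, text in tokens:
        if kind == "PUNCT" and text in OPEN:
            depth += 1
        elif kind == "PUNCT" and text in CLOSE:
            depth = max(0, depth - 1)
        if kind == "NL" and depth > 0:
            continue                    # line continuation inside a group
        out.append((kind, text))
    if not out or out[-1] != ("NL", "\n"):
        out.append(("NL", "\n"))        # file may not end with a newline
    return out
```

With this filter in front of the parser, a bracketed expression can span lines without the grammar ever seeing an NL inside it, and the grammar can require an NL after every assignment unconditionally.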
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve this, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.