Following along with http://blog.ptsecurity.com/2016/06/theory-and-practice-of-source-code.html#java--and-java8-grammars, I am trying to reduce left recursion in my fairly complex grammar. From what I understand, the non-primitive form of recursion can lead to performance problems both in terms of memory and processing time.
So I am trying to refactor these rules in my grammar to use only "primitive" recursion. Of course, that blog post is the only time I have seen the phrase "primitive" recursion in regard to ANTLR, so I am just guessing at its meaning/intent. It seems to me to mean a rule that refers to itself on the left-hand side of at most a single alternative. Correct?
At the moment I have an expression rule like:
expression
: expression DOUBLE_PIPE expression # ConcatenationExpression
| expression PLUS expression # AdditionExpression
| expression MINUS expression # SubtractionExpression
| expression ASTERISK expression # MultiplicationExpression
| expression SLASH expression # DivisionExpression
| expression PERCENT expression # ModuloExpression
...
;
The ... includes quite a few sub-rules that also refer back to expression. But these are the only ones with direct recursion.
If I understand correctly, refactoring these to be "primitive" recursion would look something like:
expression
: binaryOpExpression # BinaryOpExpression
...
;
binaryOpExpression
: expression DOUBLE_PIPE expression # ConcatenationExpression
| expression PLUS expression # AdditionExpression
| expression MINUS expression # SubtractionExpression
| expression ASTERISK expression # MultiplicationExpression
| expression SLASH expression # DivisionExpression
| expression PERCENT expression # ModuloExpression
;
First, is that the correct refactoring?
Secondly, will that really help performance? At the end of the day it is still the same set of decisions, so I'm not really understanding how this helps (aside from maybe producing fewer ATNConfig objects).
Thanks
I have not heard "primitive recursion" before in this context, and the author probably just uses it to name a specific form of recursion in ANTLR4.
Fact is, there are 3 relevant forms of recursion in ANTLR4:
Direct left recursion: recursion from the first rule reference in a rule (to the same rule). For example: a: a b | c;
Indirect left recursion: left recursion not directly from the same rule. For example: a: b | c; b: c | d; c: a | e; (not allowed in ANTLR4)
Right recursion: any other recursion in a rule. For example: a: b a | c;. The name "right recursion" is, however, only correct for binary expressions, but it is often used to differentiate from left recursion in general.
Having said that, it becomes clear that your rewrite is wrong, as it would create indirect left recursion, which ANTLR4 does not support. Direct left recursion is usually not a problem (from a memory or performance standpoint) because ANTLR4 converts it to non-recursive ATN rule graphs.
What can become a problem are right recursions, because they are implemented by code recursion (recursive function calls in the runtime), which may quickly exhaust the CPU stack. I have seen cases with big expressions which could not be parsed in a separate thread, because I couldn't set the thread stack size to a larger value (the main thread stack size usually can be adjusted via linker settings).
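For what it's worth, on the Java target one thing worth experimenting with (this is my own sketch, not something from the answer above) is to run the parse on a dedicated thread created with a larger requested stack. The 16 MB figure and the runParse() helper below are made-up placeholders, and the JVM treats the stackSize argument only as a platform-dependent hint, so this may simply not help in your environment:
public class DeepParseRunner {
    public static void main(String[] args) throws InterruptedException {
        // runParse() is a hypothetical helper that builds the ANTLR lexer/parser
        // and invokes the entry rule on the deeply nested input.
        Runnable parseTask = DeepParseRunner::runParse;
        // Request a 16 MB stack for the parsing thread; the JVM may ignore this hint.
        Thread parserThread = new Thread(null, parseTask, "parser", 16L * 1024 * 1024);
        parserThread.start();
        parserThread.join();
    }

    private static void runParse() {
        // Build the lexer, token stream and parser here, then call the start rule.
    }
}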
The only solution for the latter case which I have found useful is to lower the number of parser rules in the grammar that call each other. Of course it's a matter of structure, readability etc. to put certain expression elements in different rules (for example andExpression, orExpression, bitExpression etc.), but that may lead to pretty deep invocation stacks, which may exhaust the CPU stack and/or require a lot of time to process.
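To illustrate that last point with a sketch (this is not the exact grammar from the question; primaryExpression is a hypothetical placeholder for the non-recursive alternatives hidden behind the "..."): keeping all binary operators as labeled alternatives of one directly left-recursive rule avoids the deep chains of rule invocations that separate andExpression/orExpression-style rules would produce. ANTLR4 derives operator precedence from the order of the alternatives, so arrange them to match the precedence you actually want:
expression
    : expression (ASTERISK | SLASH | PERCENT) expression  # MulDivModExpression
    | expression (PLUS | MINUS) expression                # AddSubExpression
    | expression DOUBLE_PIPE expression                   # ConcatenationExpression
    | primaryExpression                                    # PrimaryExpr  // placeholder for the remaining alternatives
    ;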
Related
What is the order of operations for boolean operators? Left to right? Right to left? Do specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?
Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
And, since we are discussing precedence: the NOT operator takes precedence over the AND operator.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer Answer
One key thing to understand is that Lucene boolean operators are not really "boolean" in the sense that you may think, based on Boolean algebra, where you use parentheses to help avoid ambiguity (or where you need to know what rules a programming language may be applying) - and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator. It:
requires that the term after the "+" symbol exist somewhere in the field
And the lack of a + operator means the default of "may" as in "may contain" - meaning the term is optional: it does not need to be present, if there is some other clause in the query which does match a document.
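If it helps, here is a minimal, self-contained sketch of that parse-and-print step (my own example rather than official documentation; "content" is a made-up field name, and the no-argument StandardAnalyzer constructor assumes a reasonably recent Lucene version):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryPrecedenceDemo {
    public static void main(String[] args) throws Exception {
        // "content" is a hypothetical field name used only for this sketch.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("jakarta OR apache AND website");
        // Prints the query with AND/OR rewritten into required (+) and optional clauses,
        // i.e. something along the lines of: jakarta +apache +website
        System.out.println(query.toString("content"));
    }
}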
The use of AND forces the terms on either side of the AND to be required.
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom Line:
Use parentheses to explicitly make your intentions clear, when using AND and OR and also NOT.
Regarding NOT, since we mentioned it - that takes precedence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.
Here's an excerpt from an ANTLR grammar I'm working with:
expression: // ... some other stuff ...
(
{ switch_expression_enabled() }?=> switch_expression
| { complex_expression_enabled() }? complex_expression
| simple_expression
)
The functions switch_expression_enabled() and complex_expression_enabled() check compiler flags to figure out whether the corresponding language features should be enabled. As you can see, the first alternative uses a gated predicate (which seems to be the correct one to use according to the documentation), while the second one uses a disambiguating predicate.
Judging from the descriptions in the official documentation as well as here and here, I'd expect the definition of the second alternative to be incorrect. However, it turns out that it works in exactly the same way: If complex_expression_enabled() returns false, then I get a syntax error if I use a complex_expression, even if the input is not ambiguous, so the term "disambiguating predicate" seems to be a bit misleading. The only difference I can see in the generated code is that in case of gated predicates, the condition is checked twice (before and after choosing alternative 1), while the "disambiguating" predicate is only checked after choosing alternative 2.
So my question is: Is there any practical difference between using gated and disambiguating predicates for disabling grammar based on compiler flags?
If I have the following language { x ∈ {a,b}* | 2·#a(x) = 3·#b(x) }, then the CFG of that language is:
S => SaSaSaSbSbS | SaSaSbSaSbS | SaSaSbSbSaS | SaSbSaSaSbS | SaSbSaSbSaS | SaSbSbSaSaS | SbSaSaSaSbS | SbSaSaSbSaS | SbSaSbSaSaS | SbSbSaSaSaS | epsilon/lambda
Is this correct? If it isn't correct, or there is a simpler form, can you tell me? I have no clue about any form other than this one.
At a glance it looks like this probably works:
your base case is good; the empty string is in the language
you cover all your inductive cases: each production adds exactly three a's and two b's, and you cover all arrangements
I am not seeing a fundamentally simpler solution than this, although you might be able to remove either the leading or the trailing S from the right-hand side of all productions; by choosing a production you'd then be committing to that first or last terminal symbol, but I think that still works out. Possibly you could even remove both the leading and the trailing S, committing to both the first and the last symbol. Any other simplification seems like it would increase the number of productions or the number of nonterminals, or both; while that could reduce the total number of symbols needed to encode the grammar, it arguably doesn't make the grammar any simpler (indeed, more nonterminals and productions is typically seen as more complicated, not less). If you want to experiment with adding productions or nonterminals, consider e.g. T => Sa and R => Sb, just to cut down on repetition.
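As a quick sanity check (my own example, not from the question), the string aabab has three a's and two b's, so 2·#a(x) = 6 = 3·#b(x), and it falls straight out of the second production by erasing every S:
S => SaSaSbSaSbS => aabab (taking S => epsilon for every remaining S)
Longer strings work the same way; you just expand some of the embedded S's with further productions instead of epsilon.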
I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to be done about it, since the lexer doesn't know anything about context. What I'm wondering is whether ANTLR4 already has some built-in way to check a token's length inside a rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there is some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PS: I'm not parsing a language as simple as this one; this is just a small example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc. So I think I will have to do that second analysis anyway after parsing the input, to check that it is syntactically legal. I'm just curious whether something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check the size of the recognized IDs in the constructed parse tree (and act accordingly). Don't try to force this kind of decision into your grammar.
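As a rough sketch of what that semantic check could look like in the Java target (the generated names ExprBaseListener, ExprParser and the sum rule are assumptions for illustration; adapt them to your grammar and error handling):
import org.antlr.v4.runtime.tree.TerminalNode;

public class SingleCharIdListener extends ExprBaseListener {
    @Override
    public void exitSum(ExprParser.SumContext ctx) {
        // ctx.ID() returns every ID token matched by the 'sum' rule.
        for (TerminalNode id : ctx.ID()) {
            if (id.getText().length() != 1) {
                System.err.println("line " + id.getSymbol().getLine()
                        + ": expected a single letter but got '" + id.getText() + "'");
            }
        }
    }
}
After parsing, walk the tree with ParseTreeWalker.DEFAULT.walk(new SingleCharIdListener(), tree) and report or collect the violations however suits your tooling.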
Hi,
There is this question in the book that says:
Given this grammar:
A --> AA | (A) | epsilon
a- what does it generate?
b- show that it is ambiguous
Now, the answers that I think of are:
a- adjacent parentheses
b- it generates different parse trees, so it's ambiguous, and I made a drawing showing two scenarios.
Is this right, or is there a better answer?
a is almost correct.
The grammar really generates (), ()(), ()()(), … sequences.
But due to the second rule it can also generate (()), ()((())), etc.
b is not correct.
This grammar is ambiguous due to immediate left recursion: A → AA.
How to avoid left recursion: one, two.
a) Nearly right...
This grammar generates exactly the set of strings composed of balanced parentheses. To see why that is so, let's try to make a quick demonstration.
First: everything that comes out of your grammar is a balanced parenthesis string. Why? Simple induction:
Epsilon is a balanced (empty) parenthesis string.
if A is a balanced parenthesis string, then (A) is also balanced.
if A1 and A2 are balanced, so is A1A2 (I'm using two different identifiers just to make explicit that A -> AA doesn't necessarily produce the same string for each A).
Second: every balanced string is produced by your grammar. Let's do it by induction on the size of the string.
If the string is zero-sized, it must be Epsilon.
If not, then let N be the size of the string and M the length of the shortest non-empty balanced prefix (note that the rest of the string is also balanced):
If M = N then you can produce that string with (A).
If M < N then you can produce it with A -> AA: the first M characters with the first A and the last N - M with the second A.
In either case, you only have to produce strings shorter than N characters, so by induction you can do that. QED.
For example: (()())(())
We can generate this string using exactly the idea of the demonstration.
A -> AA -> (A)A -> (AA)A -> ((A)(A))A -> (()())A -> (()())(A) -> (()())((A)) -> (()())(())
b) Of course the combined left and right recursion in A -> AA is already enough to suspect ambiguity, but to see why this particular grammar is ambiguous, follow the same idea as in the demonstration:
It is ambiguous because you don't need to take the shortest balanced prefix. You could take the longest balanced prefix (or, in general, any balanced prefix) that is not the size of the whole string, and the demonstration (and generation) would follow the same process.
Ex: (())()()
You can choose A -> AA and generate with the first A either the (()) substring or the (())() substring.
Yes you are right.
That is what an ambiguous grammar means.
The problem with ambiguous grammars is that if you are writing a compiler, and you want to identify each token in a certain line of code (or something like that), the ambiguity will get in the way, because you will have "two explanations" for that line of code.
It sounds like your approach for part B is correct: showing two independent derivations for the same string in the language defined by the grammar.
However, I think your answer to part A needs a little work. Clearly you can use the second clause recursively to obtain strings like (((((epsilon))))), but there are other types of derivations possible using the first clause and second clause together.
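For a concrete example of that (my own illustration), the string (())() already needs both clauses:
A => AA => (A)A => ((A))A => (())A => (())(A) => (())()
Here A -> AA splits the string, A -> (A) builds the nesting, and epsilon closes everything off.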