Why is this EBNF grammar ambiguous? - grammar

Currently studying for an exam and looking through past papers when I came across this question.
Below is a grammar in EBNF that describes simple arithmetic
expressions, like 1 + 2 * 3 - 4:
Expression = Operand, {Operator, Operand};
Operand = "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9";
Operator = "+"|"-"|"*"|"/";
(iv) Using this grammar, there are multiple ways of evaluating an expression like 1 + 2 * 3 - 4. Describe two of them, and explain what
this means about the grammar provided. [2 marks]
To my understanding, an ambiguous grammar means there is more than one left-most or right-most derivation for some sentence, which usually implies there is some ambiguity in the grammar's order of precedence. But there is no precedence here, and the recursion is linear.
Advice?

You have part of the answer in your question.
Yes; you have almost the correct definition of an ambiguous grammar. For an unambiguous grammar, a left-most and a right-most derivation of the same sentence should produce an identical parse tree.
Yes; you are almost correct when you think this implies a problem with the grammar's order of precedence, and yes, this grammar does not have any. Therein lies the problem: the operators are all given the same precedence, and thus the different derivations will result in different answers from evaluating the example.
We could reduce 1 + 2 * 3 - 4 into any of:
(1+2) * (3-4)
1 + (2 * 3) - 4
1 + (2 * (3 - 4))
depending on how the precedence of the operators is treated.
If you draw out explicitly the left-most and right-most reductions and hence derive the parse trees it will be clearer. This is often what students are expected to do for full marks in an exam question like this. I will therefore leave this as a revision exercise.
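As a starting point for that exercise, here is a minimal Python sketch of two of the possible evaluation orders. It assumes the expression is written with spaces between tokens, and the function names are made up for illustration; the grammar itself only yields the flat sequence Operand, {Operator, Operand} and says nothing about which grouping to use.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def tokenize(text):
    # Assumes tokens are separated by spaces, e.g. "1 + 2 * 3 - 4".
    return text.split()

def eval_left_to_right(tokens):
    # Groups as ((1 + 2) * 3) - 4
    result = int(tokens[0])
    for i in range(1, len(tokens), 2):
        result = OPS[tokens[i]](result, int(tokens[i + 1]))
    return result

def eval_right_to_left(tokens):
    # Groups as 1 + (2 * (3 - 4))
    result = int(tokens[-1])
    for i in range(len(tokens) - 2, 0, -2):
        result = OPS[tokens[i]](int(tokens[i - 1]), result)
    return result

tokens = tokenize("1 + 2 * 3 - 4")
print(eval_left_to_right(tokens))  # 5
print(eval_right_to_left(tokens))  # -1
```

Both functions accept exactly the same token sequence, yet they return different values, which is the point the exam question is after.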


XPath: multiple predicates vs logical And operator

When we have multiple terms to locate an element, we can either use a single predicate with the logical and operator inside it, or use multiple predicates with a single term inside each predicate.
For example on this page we can locate links to questions containing selenium in their links with this XPath:
"//a[@class='s-link'][contains(@href,'selenium')]"
and with this
"//a[@class='s-link' and contains(@href,'selenium')]"
I'm wondering if there are any differences between these 2 approaches?
The expressions are equivalent provided that neither of the predicates is positional. A predicate is positional if (a) its value is numeric, or (b) its value depends on the current position in the dynamic context (for practical purposes, that means if it's an expression that calls the position() function).
Assuming this condition is met, there is no good reason for preferring one expression over the other. It's possible of course that some XPath processor might evaluate one of them faster than the other, but it's very unlikely to be a significant difference, and it could equally well favour either of the two.
There are minor differences in the degree of freedom offered to processors in how far they can go in optimising the constructs in ways that affect error handling, and these rules vary slightly between XPath versions, but again this is very unlikely to be significant in practice.
The equivalence doesn't apply to positional predicates because (a) when you write A[X][Y], the value of position() within Y is different from its value within X, and (b) if X or Y is numeric then it is interpreted as position()=X (or position()=Y), and this doesn't apply to the operands of and: you can't rewrite A[@code][1] as A[@code and 1].
There may be differences in the algorithm that evaluates these expressions, but the result is the same. Things would be different if the second condition contained position() or last():
//a[@class='s-link'][position() > 1] gives all s-link anchors except the first (because position() is the position in the nodeset //a[@class='s-link']) whereas
//a[@class='s-link' and position() > 1] gives all s-link anchors that come after the first anchor overall (because position() is the position in the nodeset //a).
Also, you can select the first s-link anchor with //a[@class='s-link'][1], but not with //a[@class='s-link' and 1].
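The position difference between the chained-predicate form and the single-predicate form can be modelled in a few lines of Python. This is only a sketch of the semantics, not a real XPath engine: the anchors are plain dictionaries, they are assumed to be siblings in one flat list, and apply_predicate is a made-up helper.

```python
def apply_predicate(nodes, predicate):
    """Keep nodes for which predicate(node, position) holds.

    Positions are 1-based and counted within the list being filtered,
    which mirrors how each XPath predicate sees its own context.
    """
    return [node for position, node in enumerate(nodes, start=1)
            if predicate(node, position)]

# Hypothetical anchors, in document order.
anchors = [
    {"id": "first", "class": "other"},
    {"id": "second", "class": "s-link"},
    {"id": "third", "class": "s-link"},
]

def is_s_link(node, position):
    return node["class"] == "s-link"

# Chained predicates: the second one counts positions in the filtered list.
chained = apply_predicate(apply_predicate(anchors, is_s_link),
                          lambda node, position: position > 1)

# One predicate with "and": positions are counted in the full list.
combined = apply_predicate(
    anchors,
    lambda node, position: is_s_link(node, position) and position > 1)

print([n["id"] for n in chained])   # ['third']
print([n["id"] for n in combined])  # ['second', 'third']
```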

Arity of BETWEEN expression

What is the arity of the sql BETWEEN expression? I thought it was three (ternary) since the expression usually looks like:
WHERE...
1 BETWEEN 2 AND 3
But it's listed as binary on BigQuery's documentation, and I assume other places as well.
Source: Operators.
What is the arity of the BETWEEN expression and why? I think the answer is 3 from the following example:
select
~ (SELECT -1 AS expr_1) AS 'bitwise_arity_1',
(SELECT 1 AS expr_1) * (SELECT 2 AS expr_2) AS 'times_arity_2',
(SELECT 1 AS expr_1) BETWEEN
(SELECT 2 AS expr_2) AND (SELECT 3 AS expr_3) AS 'bitwise_arity_3?'
I suppose one way to interpret it might just be that the grammar is:
expr 'BETWEEN' logicalAndExpr
And so the two expressions in the logicalAnd are just grouped into one. Is that a correct understanding?
SQLFiddle: http://www.sqlfiddle.com/#!9/b28da2/2156
It's binary, in syntactic terms. See below for a discussion of syntax vs. semantics, where I note that a better syntactic term is "infix".
Similarly, function calls and array subscripting are postfix unary operators and the C family's conditional operator (often misnamed "the ternary operator" as though it were the only such thing) is also infix. The reason is that the interior operands (the operands between BETWEEN...AND, (...), [...], and ?...:, respectively) are fenced off from the rest of the syntax by the pair of surrounding terminal tokens which function as a syntactic barrier, like parentheses. Precedence does not penetrate to the enclosed operands; only the outer operand(s) remain floating in the syntax.
The semantic view is quite different, of course. BETWEEN...AND and ?...: are certainly three-argument functions, although since the latter is short-circuiting, only two of the three arguments are ever evaluated, which makes it hard to discuss in strict mathematical terms [Note 1]. Moreover, the semantic view is complicated by the fact that there is not just a single way to look at what an argument is. As noted in a comment, you can always curry functions into a series of unary applications of higher-order functions. Although you might be tempted to try to redefine "arity" as the length of that sequence, you will soon find higher-order functions which have different sequence lengths depending on the values of their arguments. Also, in most programming languages (unlike SQL) the function being called is a full expression which does not need to be evaluated at compile-time, and since different functions have different argument counts, there is no good way to describe the arity of a function call unless you respecify the call to be the application of a list-of-arguments object to a callable object. That's often done, but it's a bit unsatisfying because (in most languages), the list object does not really exist and cannot be observed as an object.
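To make the semantic view concrete, here is a small Python sketch of BETWEEN...AND as a three-argument function, next to a curried version built from one-argument functions. The names between and between_curried are invented for the example, and SQL's NULL handling is deliberately ignored.

```python
def between(x, low, high):
    """Semantic view of `x BETWEEN low AND high` (bounds inclusive)."""
    return low <= x <= high

def between_curried(x):
    """The same function as a chain of three unary applications."""
    return lambda low: lambda high: low <= x <= high

print(between(2, 1, 3))          # True
print(between(1, 2, 3))          # False
print(between_curried(2)(1)(3))  # True
```

Whether you count this as arity 3 or as three applications of arity 1 depends on which of the two definitions you look at, which is why the term is slippery.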
I'd suggest taking the Wikipedia article on arity with a good-sized saline dosage, because it completely misses the distinction between semantics and syntactic structure, giving rise to the confusing ambiguity between the semantic and syntactic view of SQL's range operator or C's conditional operator. Personally, I prefer to reserve "arity" for the semantic meaning, using "fixity" or "valence" for the syntactic feature. (The advantage of "fixity" is that it encourages the distinction between prefix and postfix, which is a real distinction hidden by calling both cases "unary operators".)
Notes
[Note 1] BETWEEN...AND could short-circuit, too, but standard SQL doesn't guarantee short-circuiting, as far as I know (although some SQL implementations do).

How to design a context free grammar that avoids repetition?

I am learning about context-free grammar and I would like to know how (if at all) it is possible to design a language that avoids repetition.
Let's take the select statement from SQL as an example:
possible:
SELECT * FROM table
SELECT * FROM table WHERE x > 5
SELECT * FROM table WHERE x > 5 ORDER desc
SELECT * FROM table WHERE x > 5 ORDER desc LIMIT 5
impossible (multiple conflicting statements):
SELECT * FROM table WHERE X > 5 WHERE X > 5
Grammar could look something like this:
S -> SW | SO | SL | "SELECT statement"
W -> "WHERE statement"
O -> "ORDER statement"
L -> "Limit statement"
This grammar would allow for an impossible statement like the one mentioned above. How could I design a context-free grammar that avoids an impossible statement, while still being flexible?
Flexible:
The order of W, O, L does not matter. It also does not matter how many of these sub-statements are present. I would like to avoid a grammar that just lists all possible combinations since this would get quite messy if there are many possibilities.
In a context-free grammar, the set of sentences generated by a non-terminal is the same for every use of the non-terminal. That's what context-free means. A particular non-terminal, S, cannot sometimes allow a match and other times disallow it. So every set of possible matches must have its own non-terminal, and in the case of restricting a list of k cases to sentences without repeated cases, a minimum of 2^k different non-terminals would be required, one for every subset of the k cases.
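To see where the exponential count comes from, here is a rough Python sketch that generates such a grammar mechanically. The Rest_... naming scheme and the function subset_grammar are assumptions made up for this illustration; each non-terminal stands for "the clauses that may still appear".

```python
from itertools import combinations

def nonterminal(remaining):
    # Hypothetical naming scheme: Rest_W_O_L, Rest_W_O, ..., Rest_none
    return "Rest_" + ("_".join(remaining) if remaining else "none")

def subset_grammar(clauses):
    """Build one non-terminal per subset of clauses that may still appear."""
    rules = {}
    for size in range(len(clauses) + 1):
        for remaining in combinations(clauses, size):
            alternatives = ["epsilon"]
            for clause in remaining:
                # After using `clause`, it is no longer allowed.
                left = tuple(c for c in remaining if c != clause)
                alternatives.append(f"{clause} {nonterminal(left)}")
            rules[nonterminal(remaining)] = alternatives
    return rules

rules = subset_grammar(("W", "O", "L"))
for name, alternatives in rules.items():
    print(name, "->", " | ".join(alternatives))
print(len(rules))  # 8, i.e. 2**3
```

The start rule would then be something like S -> "SELECT statement" Rest_W_O_L. With three clauses this is manageable, but the number of non-terminals doubles with every clause you add.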
Worse, if the repetition you're trying to restrict has an unlimited number of possibilities (for example, you want to allow more than one W clause but not allow two identical Ws), then it cannot be done with a context-free grammar at all. The same is true if you want to insist on such repetition, which is basically what you would need to do to make a context-free grammar insist that variables be declared before use.
However, it is easy to do the check in a semantic action, for example by keeping a bit vector of clauses you have encountered (or a hash-set if it is not easy to enumerate the possible clauses). Then the semantic action for adding a clause to the statement only needs to check whether that particular clause has already been added, and flag an error if it has. That will also allow for better error messages since you can easily describe the problem when you detect it, as opposed to just reporting a "syntax" error and leaving the user to guess what the problem was.
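A minimal sketch of that check in Python, using a set in place of the bit vector. The function names and the idea that the parser hands over clause keywords as strings are assumptions for the example; in a real parser generator the body of add_clause would live in the semantic action for the clause rule.

```python
def add_clause(seen, clause):
    """Semantic action: record a clause, rejecting a repeat."""
    if clause in seen:
        raise ValueError(f"duplicate {clause} clause")
    seen.add(clause)

def check_statement(clauses):
    """Run the action over the clauses of one statement, in parse order."""
    seen = set()
    for clause in clauses:
        add_clause(seen, clause)
    return seen

check_statement(["WHERE", "ORDER", "LIMIT"])  # accepted
try:
    check_statement(["WHERE", "WHERE"])
except ValueError as error:
    print(error)  # duplicate WHERE clause
```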
I am not sure I am understanding your problem based on the grammar. Perhaps you mean for statement and S to be the same symbol. If that's the case, I would argue that your grammar is simply not right for the language you intend to describe. If we ignore ORDER and LIMIT then your grammar is
S -> SW | "SELECT S" | foo
W -> "WHERE S"
Then yes, you can derive nonsense like
S -> SW -> SWW -> SWWW -> "SELECT foo WHERE foo WHERE foo WHERE foo"
But this is just your first attempt at a grammar; it does not prove there is no grammar that works. Consider this:
<S> -> <A><B>
<A> -> SELECT <C>
<B> -> epsilon | WHERE <D>
<C> -> (rules for select lists)
<D> -> (rules for WHERE condition)
The rules for <C> and <D> can refer back to S and A and B, properly, perhaps using parentheses, as required to produce strings that work for you. No longer can you get the bad strings.
This is not really a problem that CFGs cannot overcome by themselves. To do things like enforce that only declared variables can be used, yes, context-sensitive or better machinery is needed, but we are just talking about repeating keywords and phrases. This is well within the bounds of what CFGs can do. Now, if you want to support aliases and enforce correct alias referencing in the query, that is impossible in context-free languages. But that's not what we're discussing here. The reason it's impossible is that the language L = {ww | w in E*} is not a context-free language, and that's essentially what is involved in enforcing variable names or table aliases.

Antlr prediction analysis

I have a SQL-like grammar written in Antlr, and I'm having problems with the IN operator. A sample query/input (without an IN clause) would look like
select sum(sum(sum.........N)) from table where sum(sum(sum....N)) = 10
And here's the IN grammar rule.
inOperator:
expression IN (valueList | query)
| expression;
Here, expression is another rule, and a very complex one with a relatively deep hierarchy. The problem is that when my parser reaches this rule, it goes deep into the stack matching every token against the expression part of the first alternative; only at the end of the query does it find that there is no IN operator, so the second alternative should match.
With the usual parsing rules this happens for every token of the query, and all of that work is wasted.
How can I skip these prediction loops and get to the second alternative directly? In general, queries that match the first alternative are very rare, and because of that our entire parser is slowing down. For reference, I'm using Antlr 1.4.3.
Left factoring should help. Try:
inOperator:
expression inSuffix? ;
inSuffix:
IN (valueList | query);
This will parse expression only once and append the inSuffix if it occurs.
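The effect of left factoring can be illustrated with a toy backtracking parser in Python. This is not how ANTLR's prediction actually works; it is only a sketch in which expression is a stand-in that consumes a single token and counts how often it is invoked, and all class and method names are hypothetical.

```python
class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.expression_calls = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expression(self):
        # Stand-in for a deep, expensive rule.
        self.expression_calls += 1
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def in_suffix(self):
        self.pos += 1                  # consume IN
        value = self.tokens[self.pos]  # stand-in for valueList | query
        self.pos += 1
        return value

    def in_operator_unfactored(self):
        # Alternative 1: expression IN (...); on failure rewind and retry.
        start = self.pos
        left = self.expression()
        if self.peek() == "IN":
            return ("in", left, self.in_suffix())
        self.pos = start
        return self.expression()       # Alternative 2: expression

    def in_operator_factored(self):
        # expression inSuffix?
        left = self.expression()
        if self.peek() == "IN":
            return ("in", left, self.in_suffix())
        return left

slow = ToyParser(["x"])
slow.in_operator_unfactored()
fast = ToyParser(["x"])
fast.in_operator_factored()
print(slow.expression_calls, fast.expression_calls)  # 2 1
```

In the common case without IN, the unfactored rule walks through expression twice while the factored rule does so once, and the gap grows with the depth of expression.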

move subtree from one part of AST to another

I am working on a tool to convert Oracle SQL to ANSI SQL. I have a grammar that will parse both Oracle SQL and ANSI SQL.
I want to extract the Oracle outer join expressions from the where clause part of the AST and insert new join clauses at the end of the from clause part of the AST for the matching select or subquery.
Can a tree parser with rewrite rules do this type of tree transformation?
i.e. take an AST generated from Oracle SQL
SELECT
a.columna, b.columnb
FROM
tablea a,
tableb b
WHERE
a.columna2 (+) = b.columnb2 (+)
AND
a.columna3 = 'foo'
AND
b.columnb3 = 'bar'
and transform it to an AST for ANSI SQL
SELECT
a.columna, b.columnb
FROM
tablea a FULL OUTER JOIN tableb b ON (a.columna2 = b.columnb2 )
WHERE
a.columna3 = 'foo'
and
b.columnb3 = 'bar'
NOTE1: the table references for tablea and tableb are deleted from the FROM clause and replaced with a JOIN clause referring to the same tables and table aliases.
NOTE2: the Oracle join condition is identified as a FULL OUTER JOIN by the presence of the OuterJoinIndicator (+) on both sides of the sql_condition comparison.
NOTE3: the join condition comparison is deleted from the WHERE clause and used to construct the join clause ON condition [with the OuterJoinIndicator(s) removed].
Yes, this is quite possible, especially since you have a grammar that recognizes both Oracle and ANSI SQL. I once wrote a translator from AREV BASIC to Visual BASIC and did many similar transformations.
In my project I used ANTLR 2 and wrote a master tree grammar which did nothing but completely walk the tree according to all rules in my grammars. I then used ANTLR 2's subclassing to override specific rules to do the transformations. I liked this as it let me build up the translation in passes and keep all my expression handling in one pass, control structures in another pass, etc.
ANTLR 3 does not provide grammar subclassing, so you won't be able to use that approach. You will need a complete tree grammar to print out your resulting tree. Personally, I would write that tree grammar first and get it working properly. Then I would copy that grammar and strip all the actions out but put in the option to rewrite the AST. Then modify the rules you need for your transformation. If you do many transformations you may want to use multiple passes, one tree grammar for each pass. You may have a pass or two that does analysis to help drive the later passes. On my BASIC translation project I did control flow analysis, data flow analysis and dead code removal as analysis passes.
If you want help writing the specific transformation you'll need to share your tree grammar. There are quite a few tree grammar idioms to wrap your head around. Terence's ANTLR 3 book would be a valuable purchase if you need help there. If you haven't written the tree grammar yet then post questions when you get stuck. Choosing the correct root nodes is important. If you want to get an idea of how to build trees and tree parsers, you can look at my C grammar. It is ANTLR 2, but the tree building concepts are the same. http://www.antlr3.org/grammar/cgram/grammars/
Do you need to retain comments and formatting? That adds another layer of complexity, for which I would recommend creating another question.
If you have two different grammars, you are likely to discover that the "small differences" in the grammars lead to quite considerably different ASTs for these clauses, and so your real problem is to convert the tree structure for one into the tree structure for another. And you'll have to do this piecewise for the whole tree because such differences are spread all across the grammars. YMMV.
ANTLR's tree parser will very likely let you recognize arbitrary fragments; these are certainly cues for generating equivalents in the other grammar's AST. But you'll have to write lots of such fragments, and code the corresponding routines to assemble the equivalent tree node-by-node. As a general rule for a large grammar (such as Oracle SQL), this can be quite a lot of work. You can do it this way.
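For a feel of what the hand-coded route involves, here is a rough Python sketch on a toy AST. Everything about the tree shape is an assumption made for illustration (dictionaries with "from" and "where" lists, conditions flagged with outer_left/outer_right); a real ANTLR AST would look different, and only the FULL OUTER JOIN case with (+) on both sides is handled.

```python
def move_outer_joins(select):
    """Move (+)=(+) conditions from WHERE into a FULL OUTER JOIN in FROM."""
    kept, joins = [], []
    for cond in select["where"]:
        if cond.get("outer_left") and cond.get("outer_right"):
            joins.append(cond)
        else:
            kept.append(cond)

    from_items = list(select["from"])
    for cond in joins:
        # Find the two table references by the alias prefix of each column.
        left_alias = cond["left"].split(".")[0]
        right_alias = cond["right"].split(".")[0]
        left = next(t for t in from_items if t.get("alias") == left_alias)
        right = next(t for t in from_items if t.get("alias") == right_alias)
        from_items.remove(left)
        from_items.remove(right)
        from_items.append({
            "join": "FULL OUTER",
            "left": left,
            "right": right,
            "on": {"left": cond["left"], "right": cond["right"]},
        })
    return {"columns": select["columns"], "from": from_items, "where": kept}

oracle_ast = {
    "columns": ["a.columna", "b.columnb"],
    "from": [{"table": "tablea", "alias": "a"},
             {"table": "tableb", "alias": "b"}],
    "where": [
        {"left": "a.columna2", "right": "b.columnb2",
         "outer_left": True, "outer_right": True},
        {"left": "a.columna3", "right": "'foo'"},
        {"left": "b.columnb3", "right": "'bar'"},
    ],
}
ansi_ast = move_outer_joins(oracle_ast)
```

Even this toy version has to find, remove and rebuild nodes by hand, and it would fail on anything it was not written for, such as an alias that is missing from the FROM list.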
An alternative is program transformation systems. These are tools that allow you to write surface syntax patterns (e.g., phrases in Oracle SQL and ANSI SQL) to code and apply your transformations directly. Writing transformations this way is considerably easier IMHO. You'd end up writing something like this:
source domain Oracle.
target domain ANSISQL.
rule xlate_Oracle_SELECT(c: columns, t1: table, t2: table,
c1: column, c2: column,
more_conditions: conditions):SQL_phrase
"SELECT \c FROM \t1, \t2 WHERE \c1 (+) = \c2 (+) and \more_conditions";
=>
"SELECT \c FROM \t1 FULL OUTER JOIN \t2 on ( \c1 = \c2 ) WHERE \more_conditions";
(The backslash-IDs are pattern variables which can match an arbitrary subtree of the declared syntax type legal in that location.)
The reason this works is that the transformation tool parses the first pattern with the first grammar, and so gets a tree it can match on trees of the first grammar, and similarly parses the second pattern using the second grammar, getting a replacement tree that follows the rules of the second grammar. The transformation engine matches the tree for first pattern, and substitutes the tree for the second. So such a rule transforms a small set of blue tree nodes from the blue tree, to a small set of green nodes of the desired tree type. The color analogy should make it clear that you have to translate all the blue nodes into green ones if you want an accurate translation.
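The match-then-substitute step can be sketched in Python on trees written as nested tuples. This is a toy model, not the actual engine of any transformation tool: the node labels are invented, pattern variables are simply strings that start with a backslash, and there is no check that a variable used twice binds to the same subtree.

```python
def match(pattern, tree, bindings):
    """Match tree against pattern, recording pattern-variable bindings."""
    if isinstance(pattern, str) and pattern.startswith("\\"):
        bindings[pattern] = tree       # a variable matches any subtree
        return True
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        return (len(pattern) == len(tree) and
                all(match(p, t, bindings) for p, t in zip(pattern, tree)))
    return pattern == tree

def substitute(template, bindings):
    """Instantiate the replacement tree from the recorded bindings."""
    if isinstance(template, str) and template.startswith("\\"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return template

def rewrite(tree, pattern, template):
    bindings = {}
    return substitute(template, bindings) if match(pattern, tree, bindings) else tree

pattern = ("select", "\\c", ("from", "\\t1", "\\t2"),
           ("where", ("and", ("=", ("outer", "\\c1"), ("outer", "\\c2")), "\\more")))
template = ("select", "\\c",
            ("from", ("full_outer_join", "\\t1", "\\t2", ("=", "\\c1", "\\c2"))),
            ("where", "\\more"))

tree = ("select", ("cols", "a.columna", "b.columnb"),
        ("from", "tablea a", "tableb b"),
        ("where", ("and", ("=", ("outer", "a.columna2"), ("outer", "b.columnb2")),
                   ("=", "a.columna3", "'foo'"))))
result = rewrite(tree, pattern, template)
```

In a real tool the pattern and template tuples would come from parsing the quoted source and target phrases with the two grammars, rather than being written out by hand.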
You'd need additional rules to translate the various subclauses just to paper over the differences in the grammar, e.g.,:
rule translate column(t: IDENTIFIER, c: IDENTIFIER, ):table->table
"\t.\c" -> " \toSQLidentifier\(\t\).\toSQLidentifier\(\c\)";
This would handle differences in how the two languages spell identifiers, by calling a custom function toSQLidentifier that does string hacking.
I don't think ANTLR supports this kind of transformation rule. You can simulate it all by lots of code.
You might avoid some of this if you have one "union" grammar for both languages (which is what you imply), but this usually gets you a highly ambiguous grammar and that's a huge amount of trouble. If you have succeeded in this, then you only have to apply translation rules where the languages differ (e.g., everything is a blue node).
You can also hack it: scan the tree left to right; prettyprint the parts that are equivalent (figuring this out is harder than it looks); where they differ, prettyprint a substitution. This is a very fragile way to do this.