How to design a context-free grammar that avoids repetition?

I am learning about context-free grammar and I would like to know how (if at all) it is possible to design a language that avoids repetition.
Let's take the select statement from SQL as an example:
possible:
SELECT * FROM table
SELECT * FROM table WHERE x > 5
SELECT * FROM table WHERE x > 5 ORDER desc
SELECT * FROM table WHERE x > 5 ORDER desc LIMIT 5
impossible (multiple conflicting statements):
SELECT * FROM table WHERE X > 5 WHERE X > 5
Grammar could look something like this:
S -> SW | SO | SL | "SELECT statement"
W -> "WHERE statement"
O -> "ORDER statement"
L -> "Limit statement"
This grammar would allow for an impossible statement like the one mentioned above. How could I design a context-free grammar that avoids an impossible statement, while still being flexible?
Flexible:
The order of W, O, L does not matter. It also does not matter how many of these sub-statements are present. I would like to avoid a grammar that just lists all possible combinations since this would get quite messy if there are many possibilities.

In a context-free grammar, the set of sentences generated by a non-terminal is the same for every use of the non-terminal. That's what context-free means. A particular non-terminal, S, cannot sometimes allow a match and other times disallow it. So every set of possible matches must have its own non-terminal, and in the case of restricting a list of k cases to sentences without repeated cases, a minimum of 2^k different non-terminals would be required, one for every subset of the k cases.
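To make the subset construction concrete, here is a minimal Python sketch that generates such a grammar mechanically. The non-terminal names (S_LOW, S_none and so on) and the right-linear shape of the productions are my own choices for illustration, and the SELECT head is left out; it only shows how the number of non-terminals grows with the number of clause kinds.

```python
from itertools import combinations

def subset_grammar(clauses):
    """Build productions with one non-terminal per set of still-allowed clauses."""
    def name(remaining):
        # Hypothetical naming scheme: S_ followed by the clauses still allowed.
        return "S_" + ("".join(sorted(remaining)) or "none")

    productions = {}
    for size in range(len(clauses) + 1):
        for subset in combinations(sorted(clauses), size):
            remaining = frozenset(subset)
            # Each alternative consumes one allowed clause and moves on to the
            # non-terminal for the smaller set; epsilon ends the clause list.
            alternatives = [f"{c} {name(remaining - {c})}" for c in sorted(remaining)]
            alternatives.append("epsilon")
            productions[name(remaining)] = alternatives
    return productions

grammar = subset_grammar({"W", "O", "L"})
for head, alternatives in sorted(grammar.items()):
    print(head, "->", " | ".join(alternatives))
print(len(grammar), "non-terminals for 3 clause kinds")
```

With three clause kinds this produces 8 non-terminals, and every additional clause kind doubles that number, which is why writing the grammar by hand gets messy quickly.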
Worse, if the repetition you're trying to restrict has an unlimited number of possibilities (for example, you want to allow more than one W clause but not allow two identical Ws), then it cannot be done with a context-free grammar at all. The same is true if you want to insist on such repetition, which is basically what you would need to do to make a context-free grammar insist that variables be declared before use.
However, it is easy to do the check in a semantic action, for example by keeping a bit vector of clauses you have encountered (or a hash-set if it is not easy to enumerate the possible clauses). Then the semantic action for adding a clause to the statement only needs to check whether that particular clause has already been added, and flag an error if it has. That will also allow for better error messages since you can easily describe the problem when you detect it, as opposed to just reporting a "syntax" error and leaving the user to guess what the problem was.
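A minimal sketch of that check in Python, assuming the parser calls some action each time it reduces a clause. The class and function names (SelectBuilder, add_clause, check) are hypothetical and not tied to any parser generator; a hash-set is used here rather than a bit vector.

```python
class DuplicateClauseError(Exception):
    """Raised when the same kind of clause appears twice in one statement."""

class SelectBuilder:
    """Hypothetical stand-in for the semantic actions of a parser."""

    def __init__(self):
        self.seen = set()    # clause kinds already added to this statement
        self.clauses = []

    def add_clause(self, kind, text):
        # Would be called from the action of a rule such as S -> S W.
        if kind in self.seen:
            raise DuplicateClauseError(f"duplicate {kind} clause: {text!r}")
        self.seen.add(kind)
        self.clauses.append((kind, text))

def check(clauses):
    """Feed (kind, text) pairs through the builder, as a parser would."""
    builder = SelectBuilder()
    for kind, text in clauses:
        builder.add_clause(kind, text)
    return builder.clauses

print(check([("LIMIT", "5"), ("WHERE", "x > 5")]))  # any order is accepted
try:
    check([("WHERE", "x > 5"), ("WHERE", "x > 5")])
except DuplicateClauseError as err:
    print("error:", err)
```

The grammar itself stays as simple as the one in the question; the restriction lives entirely in add_clause, and the error message can name the offending clause.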

I am not sure I am understanding your problem based on the grammar. Perhaps you mean for statement and S to be the same symbol. If that's the case, I would argue that your grammar is simply not right for the language you intend to describe. If we ignore ORDER and LIMIT then your grammar is
S -> SW | "SELECT S" | foo
W -> "WHERE S"
Then yes, you can derive nonsense like
S -> SW -> SWW -> SWWW -> "SELECT foo WHERE foo WHERE foo WHERE foo"
But this is just your first attempt at a grammar, this does not prove there is no grammar that works. Consider this:
<S> -> <A><B>
<A> -> SELECT <C>
<B> -> epsilon | WHERE <D>
<C> -> (rules for select lists)
<D> -> (rules for WHERE condition)
The rules for <C> and <D> can refer back to S and A and B, properly, perhaps using parentheses, as required to produce strings that work for you. No longer can you get the bad strings.
This is not really a problem that CFGs cannot overcome by themselves. To do things like enforce that only declared variables can be used, yes, context-sensitive or better machinery is needed, but we are just talking about repeating keywords and phrases. This is well within the bounds of what CFGs can do. Now, if you want to support aliases and enforce correct alias referencing in the query, that is impossible in context-free languages. But that's not what we're discussing here. The reason it's impossible is that the language L = {ww | w in E*} is not a context-free language, and that's essentially what is involved in enforcing variable names or table aliases.

Related

Many instances of a terminal symbol in a BNF grammar

given a grammar like
<term>::= x[i]+exp(x[i]) | x[i]
<i>::= 1|2|3
Is there a way to force the use of the same "i" within one expansion of the non-terminal symbol? That is, I want to avoid solutions like x[1]+exp(2) or x[3]+exp(1).
Conversely, is there a way to prevent the same "i" from being used twice within one expansion of the non-terminal symbol? That is, I want to avoid solutions like x[1]+exp(1).
No, that's not possible with a context-free grammar.
This is essentially what "context-free" means. Every non-terminal in a production can be expanded independently without regard to the context in which it appears.
Of course, if i really only has three possible values, you can enumerate the finite number of legal productions, according to any definition of "legal" which you find convenient. But that gets really messy when the number of possibilities increases.
The most convenient solution is generally to accept the base syntax and check for concordance (or difference) in the associated semantic rule. That also allows for better error messages.
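As a rough illustration of that approach, the Python sketch below accepts the base syntax with a regular expression and compares the two indices afterwards. The function name check_term and the require_same flag are made up for this example, and the pattern assumes the second operand is written out as exp(x[i]), as in the grammar.

```python
import re

# Base syntax only: x[i] optionally followed by +exp(x[j]), with i, j in 1..3.
TERM = re.compile(r"x\[([123])\](?:\+exp\(x\[([123])\]\))?")

def check_term(text, require_same=True):
    """Accept the base syntax, then enforce the index rule as a semantic check."""
    match = TERM.fullmatch(text)
    if match is None:
        raise SyntaxError(f"not a <term>: {text!r}")
    first, second = match.group(1), match.group(2)
    if second is None:
        return True    # plain x[i]: nothing to compare
    if require_same and first != second:
        raise ValueError(f"indices differ: {first} and {second}")
    if not require_same and first == second:
        raise ValueError(f"index {first} is used twice")
    return True

print(check_term("x[2]+exp(x[2])"))                      # same index required
print(check_term("x[1]+exp(x[3])", require_same=False))  # different index required
```

Either variant of the rule (same index or different index) is a one-line comparison once the syntax has been accepted.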

Is it, and if so why, wrong that these two regular grammars are different?

I'm tasked with writing a regular grammar based on a regular expression.
Given that the regular expression a*b can be written as the regular grammar S -> b | aS,
is it incorrect to write ba* as the regular grammar S -> b | Sa?
I'm told the correct answer is in fact S -> bA, A -> ^| aA (with ^ standing for the empty string) but I don't see the difference myself.
An explanation would be greatly appreciated!
IIRC, both your answer and the one being called "correct" are correct. What you have constructed is a "left regular grammar", while the proponent(s) of the "correct" answer obviously prefer a "right regular grammar". There are other arbitrary rules that may be held more or less pedantically, like the "no empty productions" rule, but they don't really affect the class of regular languages, just the compactness of the grammar you use for a particular language. Your example highlights this: a single production with two alternatives vs. two productions, one with a single clause and one with two alternatives, one of which is empty.
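If you want to convince yourself, a quick sanity check is to enumerate every string each grammar derives up to some length and compare the two sets. The Python sketch below does this by brute force; the dictionary encoding of the grammars and the function name language are my own, and termination relies on every sentential form holding at most one non-terminal, which is true for regular grammars. It is evidence for short strings, not a proof of equivalence.

```python
import re

def language(grammar, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    results, seen, todo = set(), set(), [(start,)]
    while todo:
        form = todo.pop()
        if form in seen:
            continue
        seen.add(form)
        # Prune forms that already contain too many terminals.
        if sum(1 for symbol in form if symbol not in grammar) > max_len:
            continue
        # Find the first non-terminal, if any.
        idx = next((i for i, symbol in enumerate(form) if symbol in grammar), None)
        if idx is None:
            results.add("".join(form))
            continue
        for alternative in grammar[form[idx]]:
            todo.append(form[:idx] + alternative + form[idx + 1:])
    return results

left = {"S": [("b",), ("S", "a")]}                   # S -> b | Sa
right = {"S": [("b", "A")], "A": [(), ("a", "A")]}   # S -> bA, A -> empty | aA

left_strings = language(left, "S", 5)
right_strings = language(right, "S", 5)
print(sorted(left_strings, key=len))
print(left_strings == right_strings)
print(all(re.fullmatch(r"ba*", s) for s in left_strings))
```

Both grammars yield b, ba, baa, baaa and baaaa for lengths up to 5, and all of these match the regular expression ba*.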

Why are sql generators using double parenthesis in where clause?

I have worked with different kinds of auto-generated SQL statements, such as MS Access and Firebird SQL. When I use query builders (Access or IBExpert) to generate these SQL snippets, they often generate more parentheses than needed.
I don't mean extra parentheses around boolean operations, but take for example the following:
select id, name from table as t
where ((t.id = #id))
When I remove them the query works perfectly fine. But why do they get generated that often?
In this case, there is no difference to the query having or not having brackets.
I've seen this kind of thing before: The parser just throws them in because it does no harm but makes the parsing code a lot simpler. When rendering a node in an AST, wrap it in brackets - simple.
Otherwise you may have to backtrack to correctly parenthesise OR conditions for example:
WHERE ((A OR B) AND (C OR D)) // correct
vs
WHERE A OR B AND C OR D // incorrect
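To show why the "wrap everything" approach is simpler, here is a small Python sketch of both strategies for rendering a condition tree. The node classes and the precedence table are hypothetical and only cover AND and OR; a real generator would have to handle many more operators.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Cond:
    text: str            # a leaf condition such as "t.id = 1"

@dataclass
class BinOp:
    op: str              # "AND" or "OR"
    left: "Node"
    right: "Node"

Node = Union[Cond, BinOp]

def render_simple(node):
    # Wrap every node in parentheses: always correct, never needs context.
    if isinstance(node, Cond):
        return f"({node.text})"
    return f"({render_simple(node.left)} {node.op} {render_simple(node.right)})"

PRECEDENCE = {"OR": 1, "AND": 2}    # AND binds tighter than OR

def render_minimal(node, parent_prec=0):
    # Only parenthesize when the parent operator binds tighter than this one.
    if isinstance(node, Cond):
        return node.text
    prec = PRECEDENCE[node.op]
    text = f"{render_minimal(node.left, prec)} {node.op} {render_minimal(node.right, prec)}"
    return f"({text})" if prec < parent_prec else text

tree = BinOp("AND", BinOp("OR", Cond("A"), Cond("B")), BinOp("OR", Cond("C"), Cond("D")))
print("WHERE", render_simple(tree))   # WHERE (((A) OR (B)) AND ((C) OR (D)))
print("WHERE", render_minimal(tree))  # WHERE (A OR B) AND (C OR D)
```

The simple renderer needs no knowledge of precedence at all, which is presumably why generated SQL tends to look like its output.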

Antlr prediction analysis

I have a SQL-like grammar written in Antlr, and I'm having problems with the IN operator. A sample query/input (without an IN clause) would look like
select sum(sum(sum.........N)) from table where sum(sum(sum....N)) = 10
And here's the IN grammar rule.
inOperator:
expression IN (valueList | query)
| expression;
Here, expression is another rule, and a very complex one with a fairly deep hierarchy. The problem is that when my parser reaches this rule, it goes deep into the stack matching every token against the expression part of the first alternative; only at the end of the query does it find that there is no IN operator, so the second alternative should match.
Following the usual parsing rules, this happens with every query token, and it is wasted work.
How can I skip these prediction loops and get to the second alternative directly? Queries that match the first alternative are very rare, and because of that our entire structure is slowing down. For reference, I'm using Antlr 1.4.3.
Left factoring should help. Try:
inOperator:
expression inSuffix? ;
inSuffix:
IN (valueList | query);
This will parse expression only once and append the inSuffix if it occurs.
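The effect can be illustrated with a hand-written recursive-descent sketch in Python. This is not how ANTLR works internally; the Parser class, its token format and the one-token stand-in for expression are all assumptions made for the example. It only counts how often expression gets parsed under each formulation of the rule.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.expression_calls = 0    # how often the expensive rule ran

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expression(self):
        # Stand-in for the complex rule: here it is a single operand token.
        self.expression_calls += 1
        token = self.peek()
        if token is None or token in ("IN", "(", ")"):
            raise SyntaxError(f"expected operand at position {self.pos}")
        self.pos += 1
        return token

    def in_suffix(self):
        # IN '(' value* ')'
        self.pos += 1                # consume IN
        if self.peek() != "(":
            raise SyntaxError("expected '('")
        self.pos += 1
        values = []
        while self.peek() not in (")", None):
            values.append(self.tokens[self.pos])
            self.pos += 1
        if self.peek() != ")":
            raise SyntaxError("expected ')'")
        self.pos += 1
        return values

    def in_operator_backtracking(self):
        # Original rule: try "expression IN ...", then fall back to "expression".
        start = self.pos
        left = self.expression()
        if self.peek() == "IN":
            return ("in", left, self.in_suffix())
        self.pos = start             # first alternative failed: rewind
        return self.expression()     # second alternative parses it again

    def in_operator_factored(self):
        # Left-factored rule: expression inSuffix?
        left = self.expression()
        if self.peek() == "IN":
            return ("in", left, self.in_suffix())
        return left

slow = Parser(["x"])
slow.in_operator_backtracking()
fast = Parser(["x"])
fast.in_operator_factored()
print(slow.expression_calls, fast.expression_calls)  # 2 1
```

For input without an IN clause, the backtracking version runs expression twice and the factored version once; with a deeply nested expression rule that difference compounds at every level.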

move subtree from one part of AST to another

I am working on a tool to convert Oracle SQL to ANSI SQL. I have a grammar that will parse both Oracle SQL and ANSI SQL.
I want to extract the Oracle outer join expressions from the where clause part of the AST and insert new join clauses at the end of the from clause part of the AST for the matching select or subquery.
Can a tree parser with rewrite rules do this type of tree transformation?
i.e. take an AST generated from Oracle SQL
SELECT
a.columna, b.columnb
FROM
tablea a,
tableb b
WHERE
a.columna2 (+) = b.columnb2 (+)
AND
a.columna3 = 'foo'
AND
b.columnb3 = 'bar'
and transform it to an AST for ANSI SQL
SELECT
a.columna, b.columnb
FROM
tablea a FULL OUTER JOIN tableb b ON (a.columna2 = b.columnb2 )
WHERE
a.columna3 = 'foo'
and
b.columnb3 = 'bar'
NOTE1: the table references for tablea and tableb are deleted from the FROM clause and replaced with a JOIN clause referring to the same tables and table aliases.
NOTE2: the Oracle join condition is identified as a FULL OUTER JOIN by the presence of the OuterJoinIndicator (+) on both sides of the sql_condition comparison.
NOTE3: the join condition comparison is deleted from the WHERE clause and used to construct the join clause ON condition [with the OuterJoinIndicator(s) removed].
Yes, this is quite possible, especially since you have a grammar that recognizes both Oracle and ANSI SQL. I once wrote a translator from AREV BASIC to Visual BASIC and did many similar transformations.
In my project I used ANTLR 2 and wrote a master tree grammar which did nothing but completely walk the tree according to all rules in my grammars. I then used ANTLR 2's subclassing to override specific rules to do the transformations. I liked this as it let me build up the translation in passes and keep all my expression handling in one pass, control structures in another pass, etc.
ANTLR 3 does not provide grammar subclassing, so you won't be able to use that approach. You will need a complete tree grammar to print out your resulting tree. Personally, I would write that tree grammar first and get it working properly. Then I would copy that grammar and strip all the actions out but put in the option to rewrite the AST. Then modify the rules you need for your transformation. If you do many transformations you may want to use multiple passes, one tree grammar for each pass. You may have a pass or two that does analysis to help drive the later passes. On my BASIC translation project I did control flow analysis, data flow analysis and dead code removal as analysis passes.
If you want help writing the specific transformation you'll need to share your tree grammar. There are quite a few tree grammar idioms to wrap your head around. Terence's ANTLR 3 book would be a valuable purchase if you need help there. If you haven't written the tree grammar yet then post questions when you get stuck. Choosing the correct root nodes is important. If you want to get an idea of how to build trees and tree parsers, you can look at my C grammar. It is ANTLR 2, but the tree building concepts are the same. http://www.antlr3.org/grammar/cgram/grammars/
Do you need to retain comments and formatting? That adds another layer of complexity, for which I would recommend creating another question.
If you have two different grammars, you are likely to discover that the "small differences" in the grammars lead to quite considerably different ASTs for these clauses, and so your real problem is to convert the tree structure for one into the tree structure for another. And you'll have to do this piecewise for the whole tree because such differences are spread all across the grammars. YMMV.
ANTLR's tree parser will pretty likely let you recognize arbitrary fragments; these are certainly cues for generating equivalents in the other grammar's AST. But you'll have to write lots of such fragments, and code the corresponding routines to assemble the equivalent tree node-by-node. As a general rule for a large grammar (such as Oracle SQL), this can be quite a lot of work. You can do it this way.
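To give a feel for what assembling the tree node-by-node involves, here is a minimal Python sketch of the one transformation from the question. The tuple and dictionary AST shapes ("col", "lit", "and", "full_outer_join") are invented for this example and are not what ANTLR produces; it handles only the two-table case with a single condition that carries (+) on both sides.

```python
def split_and(cond):
    """Flatten nested ("and", a, b) nodes into a list of conjuncts."""
    if cond[0] == "and":
        return split_and(cond[1]) + split_and(cond[2])
    return [cond]

def join_and(conds):
    """Rebuild a left-nested ("and", a, b) tree from a list of conjuncts."""
    result = conds[0]
    for cond in conds[1:]:
        result = ("and", result, cond)
    return result

def is_full_outer(cond):
    # ("=", ("col", name, has_plus), ("col", name, has_plus)) with (+) on both sides
    return (cond[0] == "=" and cond[1][0] == "col" and cond[2][0] == "col"
            and cond[1][2] and cond[2][2])

def strip_outer(col):
    return ("col", col[1], False)

def rewrite(select):
    conds = split_and(select["where"]) if select["where"] else []
    joins = [c for c in conds if is_full_outer(c)]
    rest = [c for c in conds if not is_full_outer(c)]
    if len(joins) != 1 or len(select["from"]) != 2:
        return select                # outside the simple case this sketch covers
    on = ("=", strip_outer(joins[0][1]), strip_outer(joins[0][2]))
    left, right = select["from"]
    return {
        "columns": select["columns"],
        # Both table references are replaced by one join node.
        "from": [("full_outer_join", left, right, on)],
        # The join condition is removed from the WHERE clause.
        "where": join_and(rest) if rest else None,
    }

oracle = {
    "columns": ["a.columna", "b.columnb"],
    "from": [("tablea", "a"), ("tableb", "b")],
    "where": ("and",
              ("and",
               ("=", ("col", "a.columna2", True), ("col", "b.columnb2", True)),
               ("=", ("col", "a.columna3", False), ("lit", "'foo'"))),
              ("=", ("col", "b.columnb3", False), ("lit", "'bar'"))),
}
ansi = rewrite(oracle)
print(ansi["from"])
print(ansi["where"])
```

Even this toy version needs helpers for flattening, rebuilding and classifying conditions, which hints at how much code the full set of Oracle-to-ANSI rewrites would take.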
An alternative is program transformation systems. These are tools that allow you to write surface syntax patterns (e.g., phrases in Oracle SQL and ANSI SQL) to code and apply your transformations directly. Writing transformations this way is considerably easier IMHO. You'd end up writing something like this:
source domain Oracle.
target domain ANSISQL.
rule xlate_Oracle_SELECT(c: columns, t1: table, t2: table,
c1: column, c2: column,
more_conditions: conditions):SQL_phrase
"SELECT \c FROM \t1, \t2 WHERE c1 (+) = c2 (+) and \more_conditions";
=>
"SELECT \c FROM \t1 FULL OUTER JOIN \t2 on ( c1 = c2 ) WHERE \more_conditions";
(The backslash-IDs are pattern variables which can match an arbitrary subtree of the declared syntax type legal in that location.)
The reason this works is that the transformation tool parses the first pattern with the first grammar, and so gets a tree it can match on trees of the first grammar, and similarly parses the second pattern using the second grammar, getting a replacement tree that follows the rules of the second grammar. The transformation engine matches the tree for the first pattern, and substitutes the tree for the second. So such a rule transforms a small set of blue tree nodes from the blue tree to a small set of green nodes of the desired tree type. The color analogy should make it clear that you have to translate all the blue nodes into green ones if you want an accurate translation.
You'd need additional rules to translate the various subclauses just to paper over the differences in the grammar, e.g.,:
rule translate column(t: IDENTIFIER, c: IDENTIFIER, ):table->table
"\t.\c" -> " \toSQLidentifier\(\t\).\toSQLidentifier\(\c\)";
This would handle differences in how the two languages spell identifiers, by calling a custom function toSQLidentifier that does string hacking.
I don't think ANTLR supports these kind of transformation rules. You can simulate it all by lots of code.
You might avoid some of this if you have one "union" grammar for both languages (which is what you imply), but this usually gets you a highly ambiguous grammar and that's a huge amount of trouble. If you have succeeded in this, then you only have to apply translation rules where the languages differ (e.g., everything is a blue node).
You can also hack it: scan the tree left to right; prettyprint the parts that are equivalent (figuring this out is harder than it looks), where they differ, prettyprint a substitution. This is a very fragile way to do this.